CodeAssistBench

A benchmark for evaluating AI coding assistants on real GitHub issues. This project includes a curated dataset of GitHub issues with Dockerfiles for reproducible evaluation, plus tools for dataset creation and AI agent evaluation.

⚡ Quick Run (5 minutes)

Get started immediately with our pre-built dataset:

# 1. Clone and install
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
pip install -r requirements.txt && pip install -e .

# 2. Set AWS credentials (for Bedrock)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# 3. Run evaluation on 10 Python issues
python -m cab_evaluation.cli generation-dataset \
  dataset/cab_recent.jsonl \
  --output results/quick_test.jsonl \
  --agent-models '{"maintainer": "haiku", "user": "haiku"}' \
  --language python

# 4. Judge the results
python -m cab_evaluation.cli evaluation-dataset \
  results/quick_test.jsonl \
  --output results/quick_eval.jsonl \
  --agent-models '{"judge": "haiku"}'

# 5. View results
python -c "
import json
with open('results/quick_eval.jsonl') as f:
    for line in f:
        r = json.loads(line)
        print(f\"{r['issue_id']}: {r['verdict']}\")
"

What this does:

  1. Generates maintainer responses for Python issues using Claude Haiku (fast & cheap)
  2. Evaluates responses with a judge agent
  3. Outputs verdicts: CORRECT, PARTIALLY_CORRECT, INCORRECT, or ERROR

For production evaluation, use sonnet4 or opus models instead of haiku.
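
For example, the same quick-run generation command with a stronger maintainer model (the output path here is just an example; the aliases are listed under Model Aliases below):

python -m cab_evaluation.cli generation-dataset \
  dataset/cab_recent.jsonl \
  --output results/quick_test_sonnet4.jsonl \
  --agent-models '{"maintainer": "sonnet4", "user": "haiku"}' \
  --language python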


📊 Dataset Overview

CodeAssistBench provides three ready-to-use datasets:

| Dataset | Issues | Languages | Description |
|---------|--------|-----------|-------------|
| dataset/cab_recent_v2.jsonl | 771 | 7 | Latest: June 2025 - Jan 2026 (with satisfaction conditions & classification) |
| dataset/cab_recent.jsonl | 308 | 7 | Recent issues (June 2025 - Jan 2026) |
| dataset/cab_verified.jsonl | 149 | 7 | Verified subset with tested Dockerfiles |

Dataset Fields

Each issue in the dataset contains:

{
  "number": 1234,
  "title": "Bug: Memory leak in parser",
  "created_at": "2025-07-15T10:30:00Z",
  "closed_at": "2025-07-20T14:22:00Z",
  "commit_id": "abc123def456...",
  "labels": ["bug", "parser"],
  "url": "https://github.com/owner/repo/issues/1234",
  "body": "When parsing large files, memory usage grows unbounded...",
  "author": "user123",
  "comments": [
    {
      "user": "maintainer",
      "created_at": "2025-07-16T08:00:00Z",
      "body": "Thanks for reporting! Can you share the file?"
    }
  ],
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue",
    "No regression in parsing speed for normal files"
  ],
  "_classification": {
    "category": "Can be dockerized without any issue",
    "timestamp": "2025-04-14 01:01:54"
  },
  "dockerfile": "FROM python:3.11-slim\n...",
  "language": "python"
}
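
For a quick look at these fields in practice, the snippet below reads the first record of the v2 dataset (standard library only; it uses no fields beyond those shown above):

import json

# Read the first record of the v2 dataset and print a few documented fields
with open('dataset/cab_recent_v2.jsonl') as f:
    issue = json.loads(f.readline())

print(issue['title'], '-', issue['url'])
print('Language:', issue.get('language'))
print('Satisfaction conditions:')
for cond in issue.get('satisfaction_conditions', []):
    print('  -', cond)
print('Has Dockerfile:', bool(issue.get('dockerfile')))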

🛠️ Step-by-Step: Generate Your Own Dataset

This section walks through how we generated the dataset from scratch using AWS Bedrock and Strands AI agents.

Prerequisites

# 1. Clone and setup
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
python3 -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Install Strands SDK (required for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

# 4. Set up LLM credentials (choose ONE option)

# Option A: AWS Bedrock (Claude models)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# Option B: OpenAI (GPT-5 models)
export OPENAI_API_KEY=your_openai_api_key

# 5. Set up GitHub token (for API access)
export GITHUB_TOKEN=your_github_personal_access_token

Step 1: Collect GitHub Issues

Collect closed issues from popular repositories. The script uses interactive prompts:

python script/get_github_issue.py
# Enter CSV path when prompted (see script/python_repos*.csv for examples)
# Choose label-based filtering (y/n)

Or use the bulk collection script:

python script/collect_1000_issues.py
# Edit the script to set: language, min_stars, date range

Output: github_issues_<owner>_<repo>_<timestamp>.json

[
  {
    "number": 1234,
    "title": "Bug: Memory leak in parser",
    "url": "https://github.com/owner/repo/issues/1234",
    "body": "When parsing large files...",
    "comments": [...]
  }
]
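
Under the hood, collection boils down to listing closed issues through the GitHub REST API. A minimal sketch (illustrative only: the repository name is an example, and the real script additionally handles pagination, rate limits, label filtering, and comment fetching):

import os
import requests

owner, repo = "python", "cpython"  # example repository
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

# List closed issues (the endpoint also returns pull requests, which we skip)
resp = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}/issues",
    params={"state": "closed", "per_page": 20},
    headers=headers,
)
resp.raise_for_status()
issues = [i for i in resp.json() if "pull_request" not in i]
for issue in issues:
    print(issue["number"], issue["title"])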

Step 2: Get Commit IDs

Find the commit hash at the time each issue was closed:

python script/get_github_commit.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_commits

# Or using short options:
python script/get_github_commit.py -i my_data/collected_issues -o my_data/with_commits

Arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| --input-dir, -i | Yes | Directory containing JSON files with issues |
| --output-dir, -o | No | Output directory (default: github_commits) |

Output: Creates commit data files in the output directory.
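
Conceptually, the lookup asks GitHub for the most recent default-branch commit at or before each issue's closed_at timestamp. A minimal sketch using the REST API's until parameter (illustrative; the actual script iterates over files and handles errors):

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def commit_at(owner: str, repo: str, closed_at: str) -> str:
    """Return the SHA of the latest default-branch commit at or before closed_at."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"until": closed_at, "per_page": 1},
        headers=headers,
    )
    resp.raise_for_status()
    return resp.json()[0]["sha"]

print(commit_at("python", "cpython", "2025-07-20T14:22:00Z"))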

Step 3: Generate Satisfaction Conditions (Uses LLM)

Use LLM to generate explicit criteria for issue resolution:

python script/scon_filter.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_scon

# With custom model and region:
python script/scon_filter.py \
  -i my_data/collected_issues \
  -o my_data/with_scon \
  --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --region us-west-2

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| --input-dir, -i | Yes | - | Directory containing JSON files with issues |
| --output-dir, -o | Yes | - | Output directory for issues with satisfaction conditions |
| --model, -m | No | claude-sonnet-4.5 | Bedrock model ID |
| --region, -r | No | us-west-2 | AWS region for Bedrock |

Output: Adds satisfaction_conditions field:

{
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue"
  ]
}
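
To spot-check the generated conditions, a small helper can walk the output directory (assuming the my_data/with_scon layout used above and skipping the *_prompts_responses.json side files):

import json
from pathlib import Path

# Print how many satisfaction conditions each issue received
for path in Path('my_data/with_scon').glob('*.json'):
    if path.name.endswith('_prompts_responses.json'):
        continue  # skip the raw prompt/response dumps
    for issue in json.loads(path.read_text()):
        conds = issue.get('satisfaction_conditions', [])
        print(f"{path.name} #{issue['number']}: {len(conds)} conditions")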

Step 4: Classify Dockerizability (Uses LLM)

Classify issues by whether they need a Docker environment:

python script/docker_filter.py \
  --input-dir my_data/with_scon \
  --output-dir my_data/classified

# With custom region:
python script/docker_filter.py \
  -i my_data/with_scon \
  -o my_data/classified \
  --region us-east-1

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| --input-dir, -i | Yes | - | Directory containing JSON files with issues |
| --output-dir, -o | Yes | - | Output directory for classified issues |
| --region, -r | No | us-west-2 | AWS region for Bedrock |

Output structure:

my_data/classified/
├── need_docker/            # Issues that need Docker environment
├── no_need_docker/         # Documentation/config changes
├── need_docker_but_cannot/ # Hardware-specific issues
├── llm_responses/          # Raw LLM responses for debugging
└── processed_issues.json   # Resume checkpoint
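
A quick tally of how the issues were classified (assuming the directory layout above):

import json
from pathlib import Path

# Count issues in each classification bucket
for bucket in ['need_docker', 'no_need_docker', 'need_docker_but_cannot']:
    total = sum(
        len(json.loads(p.read_text()))
        for p in Path('my_data/classified', bucket).glob('*.json')
    )
    print(f'{bucket}: {total} issues')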

Step 5: Generate Dockerfiles (Uses Strands + LLM)

⚠️ This step uses Strands AI agents to automatically generate and test Dockerfiles:

# Option A: Using AWS Bedrock (Claude) - default
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600

# Option B: Using OpenAI (GPT-5)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600 \
  --model-id gpt5 \
  --provider openai

What happens:

  1. Strands agent reads the issue and repository structure
  2. Agent generates a Dockerfile based on repo's build system
  3. Docker builds the image to verify it works
  4. If build fails, agent iterates with error feedback
  5. Success: Dockerfile is saved to the issue JSON

Output: Adds dockerfile field:

{
  "dockerfile": "FROM python:3.11-slim\n\nWORKDIR /workspace\n\nRUN apt-get update && apt-get install -y git\n\nRUN git clone https://github.com/owner/repo.git . && \\\n    git checkout abc123def456\n\nRUN pip install -r requirements.txt\n\nCMD [\"pytest\", \"tests/\"]\n"
}
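
To double-check a generated Dockerfile outside the agent loop, you can write the dockerfile field to disk and run a plain docker build (a minimal sketch; assumes Docker is installed and uses the need_docker directory from Step 4):

import json
import subprocess
import tempfile
from pathlib import Path

def rebuild(issue: dict) -> None:
    """Write the issue's Dockerfile into a temp build context and build it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, 'Dockerfile').write_text(issue['dockerfile'])
        subprocess.run(
            ['docker', 'build', '-t', f"cab-issue-{issue['number']}", tmp],
            check=True,
        )

# Rebuild the first issue that received a Dockerfile
for path in Path('my_data/classified/need_docker').glob('*.json'):
    dockerized = [i for i in json.loads(path.read_text()) if i.get('dockerfile')]
    if dockerized:
        rebuild(dockerized[0])
        break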

Step 6: Convert to Final Dataset

Combine all processed issues into a single JSONL file:

python script/convert_to_jsonl.py \
  --input-dir my_data/classified/need_docker \
  --output my_data/my_dataset.jsonl
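
A quick sanity check on the resulting file, using the field names documented under Dataset Fields above (adjust the required list to your needs):

import json

required = ['number', 'title', 'body', 'satisfaction_conditions', 'dockerfile', 'language']

with open('my_data/my_dataset.jsonl') as f:
    records = [json.loads(line) for line in f]

print(f'{len(records)} issues in dataset')
for rec in records:
    missing = [k for k in required if not rec.get(k)]
    if missing:
        print(f"#{rec.get('number')}: missing {missing}")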

🧪 End-to-End Example

Here's a complete walkthrough that takes a sample issue through the satisfaction-condition and classification steps of the pipeline:

Setup

cd CodeAssistBench

# Set up credentials (AWS Bedrock + GitHub)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
export GITHUB_TOKEN=your_github_token

Step 1: Create Test Data

Create a directory with sample issues:

mkdir -p test_pipeline/step1_raw

Create test_pipeline/step1_raw/test_issues.json:

[
  {
    "number": 1234,
    "title": "How to handle async operations in Python?",
    "created_at": "2025-07-15T10:30:00Z",
    "url": "https://github.com/python/cpython/issues/1234",
    "body": "I'm trying to use async/await but get 'RuntimeWarning: coroutine was never awaited'.",
    "author": "user123",
    "comments": [
      {"user": "maintainer", "created_at": "2025-07-16T08:00:00Z", "body": "Use asyncio.run() to execute your coroutine."},
      {"user": "user123", "created_at": "2025-07-17T09:00:00Z", "body": "That worked perfectly!"}
    ]
  }
]

Step 2: Generate Satisfaction Conditions

python3 script/scon_filter.py \
  --input-dir test_pipeline/step1_raw \
  --output-dir test_pipeline/step2_scon

Expected output:

Processing directory: test_pipeline/step1_raw
Found 1 JSON files
Processing conversation 1/1 (ID: 1234)
Added satisfaction conditions for conversation 1234
Saved 1 processed conversations to test_pipeline/step2_scon/test_issues.json

Step 3: Classify Issues

python3 script/docker_filter.py \
  --input-dir test_pipeline/step2_scon \
  --output-dir test_pipeline/step3_classified

Expected output:

Input directory: test_pipeline/step2_scon
Output directory: test_pipeline/step3_classified
Found 1 JSON files to process.
Classified issue #1234 as: Does not need build environment

--- Classification Summary ---
Total issues processed: 1
Does not need build environment: 1 issues (100.0%)

Final Directory Structure

test_pipeline/
├── step1_raw/
│   └── test_issues.json              # Original issues
├── step2_scon/
│   ├── test_issues.json              # + satisfaction_conditions
│   └── test_issues_prompts_responses.json
└── step3_classified/
    ├── no_need_docker/
    │   └── test_issues.json          # + _classification
    ├── need_docker/                  # (empty for this example)
    ├── llm_responses/                # Raw LLM outputs
    └── classification_summary.json

View Results

# Check satisfaction conditions were added
cat test_pipeline/step2_scon/test_issues.json | jq '.[0].satisfaction_conditions'

# Check classification
cat test_pipeline/step3_classified/no_need_docker/test_issues.json | jq '.[0]._classification'

📂 Example Outputs

See examples/ for sample outputs at each pipeline stage:

| File | Description |
|------|-------------|
| examples/sample_dataset.jsonl | Complete issues with all fields |
| examples/sample_docker_based_issues.jsonl | Issues requiring Docker |
| examples/sample_non_docker_based_issues.jsonl | Documentation/config issues |
| examples/sample_pipeline_output.json | Single issue showing all fields |

🚀 Quick Start

Using the Dataset

import json

# Load the dataset
with open('dataset/cab_recent.jsonl', 'r') as f:
    issues = [json.loads(line) for line in f]

# Filter by language
python_issues = [i for i in issues if i.get('language') == 'python']

# Get issues with Dockerfiles
dockerized = [i for i in issues if i.get('dockerfile')]

print(f"Total issues: {len(issues)}")
print(f"Python issues: {len(python_issues)}")
print(f"With Dockerfiles: {len(dockerized)}")

Running Evaluation

The evaluation framework has two phases: Generation (maintainer answers issues) and Evaluation (judge scores responses).

Workflow Overview

┌─────────────────┐    ┌────────────────┐    ┌─────────────────┐
│   Dataset       │ →  │   Generation   │ →  │   Evaluation    │
│   (JSONL)       │    │   Workflow     │    │   Workflow      │
└─────────────────┘    └────────────────┘    └─────────────────┘
                              ↓                      ↓
                       Maintainer ↔ User       Judge Agent
                       Multi-round chat        Scores answers

Step 1: Generation (Maintainer → User conversation)

python -m cab_evaluation.cli generation-dataset \
  dataset/cab_recent.jsonl \
  --output results/generation_results.jsonl \
  --agent-models '{"maintainer": "sonnet4", "user": "haiku"}' \
  --language python \
  --resume

Arguments:

| Argument | Description |
|----------|-------------|
| --output, -o | Output file (default: auto-generated with timestamp) |
| --agent-models | JSON mapping models: {"maintainer": "sonnet4", "user": "haiku"} |
| --language, -l | Filter by language (python, javascript, etc.) |
| --resume | Skip already-processed issues |
| --max-conversation-rounds | Max rounds between maintainer/user (default: 2) |

Step 2: Evaluation (Judge scores responses)

python -m cab_evaluation.cli evaluation-dataset \
  results/generation_results.jsonl \
  --output results/evaluation_results.jsonl \
  --agent-models '{"judge": "sonnet4"}' \
  --resume

Arguments:

| Argument | Description |
|----------|-------------|
| --output, -o | Output file for evaluation results |
| --agent-models | JSON with judge model: {"judge": "sonnet4"} |
| --resume | Skip already-evaluated issues |
| --iterative | Enable multi-iteration judge with repo exploration |

Verdict Types

The judge assigns one of these verdicts:

| Verdict | Description |
|---------|-------------|
| CORRECT | Response fully addresses the issue and satisfies all conditions |
| PARTIALLY_CORRECT | Response addresses some aspects but misses key elements |
| INCORRECT | Response doesn't address the issue or provides wrong information |
| ERROR | Processing failed (timeout, API error, etc.) |

Output Format

Each result in the JSONL file contains:

{
  "issue_id": "1234",
  "question_title": "How to handle async operations?",
  "verdict": "CORRECT",
  "judgment": "The maintainer correctly identified the issue...",
  "key_issues": ["Clear explanation provided", "Code example included"],
  "alignment_score": {
    "satisfied": 3,
    "total": 3,
    "percentage": 100.0,
    "conditions": [
      {"number": 1, "satisfied": true, "description": "Explains async pattern"},
      {"number": 2, "satisfied": true, "description": "Provides working example"},
      {"number": 3, "satisfied": true, "description": "Addresses RuntimeWarning"}
    ]
  },
  "generation_metadata": {
    "user_satisfied": true,
    "total_conversation_rounds": 2
  }
}

Analyzing Results

import json
from collections import Counter

# Load evaluation results
with open('results/evaluation_results.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]

# Count verdicts
verdicts = Counter(r['verdict'] for r in results)
print(f"Total: {len(results)}")
print(f"CORRECT: {verdicts['CORRECT']} ({verdicts['CORRECT']/len(results)*100:.1f}%)")
print(f"PARTIALLY_CORRECT: {verdicts['PARTIALLY_CORRECT']} ({verdicts['PARTIALLY_CORRECT']/len(results)*100:.1f}%)")
print(f"INCORRECT: {verdicts['INCORRECT']} ({verdicts['INCORRECT']/len(results)*100:.1f}%)")
print(f"ERROR: {verdicts.get('ERROR', 0)}")

# Average alignment score
valid_results = [r for r in results if r.get('alignment_score')]
avg_alignment = sum(r['alignment_score']['percentage'] for r in valid_results) / len(valid_results)
print(f"Average alignment: {avg_alignment:.1f}%")

Model Aliases

Available model shortcuts for --agent-models:

| Alias | Full Model ID |
|-------|---------------|
| sonnet4 | us.anthropic.claude-sonnet-4-20250514-v1:0 |
| sonnet45 | us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| haiku | us.anthropic.claude-3-5-haiku-20241022-v1:0 |
| opus | us.anthropic.claude-opus-4-20250514-v1:0 |
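
For example, aliases can be mixed across the roles used in the commands above:

# Generation with a stronger maintainer and a cheap simulated user
python -m cab_evaluation.cli generation-dataset \
  dataset/cab_recent.jsonl \
  --agent-models '{"maintainer": "sonnet45", "user": "haiku"}'

# Evaluation with a separate judge model
python -m cab_evaluation.cli evaluation-dataset \
  results/generation_results.jsonl \
  --agent-models '{"judge": "opus"}'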

See examples/USAGE_GUIDE.md for more detailed instructions.


πŸ“ Project Structure

CodeAssistBench/
β”œβ”€β”€ dataset/                    # πŸ“Š Final datasets
β”‚   β”œβ”€β”€ cab_recent.jsonl        # 308 recent issues
β”‚   β”œβ”€β”€ cab_verified.jsonl      # 149 verified issues
β”‚   └── recent/                 # Additional samples
β”œβ”€β”€ src/cab_evaluation/         # πŸ”§ Evaluation framework
β”‚   β”œβ”€β”€ agents/                 # Agent implementations
β”‚   β”œβ”€β”€ core/                   # Core models and config
β”‚   β”œβ”€β”€ prompts/                # Prompt templates
β”‚   β”œβ”€β”€ utils/                  # Utilities
β”‚   └── workflows/              # Evaluation workflows
β”œβ”€β”€ script/                     # πŸ› οΈ Data collection scripts
β”‚   β”œβ”€β”€ get_github_issue.py     # Step 1: Issue collection
β”‚   β”œβ”€β”€ get_github_commit.py    # Step 2: Commit ID lookup
β”‚   β”œβ”€β”€ scon_filter.py          # Step 3: Satisfaction conditions
β”‚   β”œβ”€β”€ docker_filter.py        # Step 4: Classification
β”‚   └── generate_dockerfile_with_strands.py  # Step 5: Dockerfiles
β”œβ”€β”€ tools/                      # Custom Strands tools (required)
β”œβ”€β”€ examples/                   # Sample data and guides
β”‚   β”œβ”€β”€ USAGE_GUIDE.md          # Detailed usage guide
β”‚   └── sample_*.jsonl          # Sample datasets
β”œβ”€β”€ prompts/                    # Prompt templates
└── docs/                       # Documentation
    └── DATA_PIPELINE.md        # Detailed pipeline docs

🔧 Installation

# Clone the repository
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

# Install Strands SDK (REQUIRED for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

AWS Credentials (Required for Bedrock)

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

📖 Documentation

See docs/DATA_PIPELINE.md for detailed data pipeline documentation and examples/USAGE_GUIDE.md for detailed usage instructions.

📚 Features

  • Automated Dockerfile Generation: Uses Strands AI agents with AWS Bedrock
  • Multi-language Support: Python, JavaScript, TypeScript, Java, Go, C, C++
  • Satisfaction Conditions: LLM-generated criteria for issue resolution
  • Docker-based Evaluation: Reproducible evaluation environment
  • Multiple Agent Frameworks: Supports Strands, OpenHands, and Q-CLI

📄 Citation

If you use CodeAssistBench in your research, please cite our paper:

@inproceedings{kim2025codeassistbench,
  title={CodeAssistBench ({CAB}): Dataset \& Benchmarking for Multi-turn Chat-Based Code Assistance},
  author={Myeongsoo Kim and Shweta Garg and Baishakhi Ray and Varun Kumar and Anoop Deoras},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=2R6y4Ku9kG}
}

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

The underlying GitHub issues are subject to their respective repository licenses.

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


Appendix: Dockerfile Generation Options

Environment Variables

| Variable | Description |
|----------|-------------|
| STRANDS_NON_INTERACTIVE=true | Required. Disables interactive prompts |
| BYPASS_TOOL_CONSENT=true | Required. Bypasses tool confirmation |

Command Line Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| --input-dir, -i | (required) | Directory with classified issues |
| --output-dir, -o | logs/dockerfile_generation_strands | Output directory |
| --languages | (all) | Specific languages to process |
| --max-attempts | 10 | Max retry attempts per issue |
| --docker-timeout | 600 | Docker build timeout (seconds) |
| --agent-timeout | 300 | Agent attempt timeout (seconds) |
| --issue-timeout | 1800 | Total timeout per issue (seconds) |
| --parallel, -p | 1 | Parallel processing count |
| --model-id | claude-sonnet-4-5 | AWS Bedrock model ID |
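
Putting these options together, a longer-running parallel run might look like this (the output directory and timeout values are illustrative; all flags come from the table above):

STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --output-dir logs/dockerfile_generation_strands \
  --languages python \
  --max-attempts 10 \
  --docker-timeout 600 \
  --issue-timeout 1800 \
  --parallel 4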
