Add counterfactual dataset by Jiawen-CS · Pull Request #615 · microsoft/BC-Bench

Jiawen-CS · 2026-04-16T08:43:57Z

This project investigates transfer failures in large language models (LLMs) when generating code for niche programming languages, using AL as a case study.

We design BC-Bench-CF, a benchmark suite that includes realistic AL development tasks and minimal counterfactual variants. The goal is to evaluate not only functional correctness, but also robustness to small specification changes and sensitivity to AL-specific execution semantics.

Our analysis is grounded in a layered failure framework, which attributes model errors to different abstraction levels, including syntax, validation semantics, event-driven paradigms, workflow composition, and ecosystem constraints.
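As an illustration, the five layers could be encoded as an enum, with a small helper that tallies annotated failures per layer. This is a hypothetical sketch: the PR mentions a `FailureLayer` enum (L1-L5) in `src/bcbench/types.py`, but the member names, the assignment of layers to L1 through L5, and the `layer_distribution` helper below are assumptions, not the repository's actual code.

```python
from collections import Counter
from enum import Enum


class FailureLayer(Enum):
    # Assumed ordering: the PR only states that the layers run from L1 to L5.
    L1_SYNTAX = "syntax"
    L2_VALIDATION_SEMANTICS = "validation semantics"
    L3_EVENT_DRIVEN = "event-driven paradigms"
    L4_WORKFLOW_COMPOSITION = "workflow composition"
    L5_ECOSYSTEM = "ecosystem constraints"


def layer_distribution(labels: list[FailureLayer]) -> dict[FailureLayer, float]:
    """Return the share of annotated failures attributed to each layer."""
    if not labels:
        return {layer: 0.0 for layer in FailureLayer}
    counts = Counter(labels)
    return {layer: counts[layer] / len(labels) for layer in FailureLayer}
```

Keeping every layer in the result, including those with a zero share, makes distributions from different models directly comparable.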

- CounterfactualEntry model: resolves base fields from bcbench.jsonl at load time
- Register COUNTERFACTUAL_EVALUATION in EvaluationCategory enum
- Reuse BugFixPipeline, BugFixResult, ExecutionBasedEvaluationResultSummary
- Add counterfactual-evaluation prompt template (same as bug-fix)
- Add leaderboard placeholder (docs/_data/counterfactual-evaluation.json)
- Add COUNTERFACTUAL.md documentation
- Add 8 tests for CF entry loading, schema, and category properties
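The load-time resolution of base fields might look roughly like the following. This is a minimal sketch under stated assumptions: the `instance_id` key, the `problem_statement` key, the `base_fields` attribute, and the helpers `load_base_index` and `resolve_entry` are hypothetical names chosen for illustration. Only the `CounterfactualEntry` name, the `bcbench.jsonl` source, and the `__cf-N` suffix convention come from this PR.

```python
import json
import re
from dataclasses import dataclass

# Counterfactual ids are assumed to be "<base id>__cf-<variant number>".
CF_SUFFIX = re.compile(r"^(?P<base>.+)__cf-(?P<variant>\d+)$")


@dataclass(frozen=True)
class CounterfactualEntry:
    instance_id: str
    base_instance_id: str
    variant: int
    problem_statement: str
    base_fields: dict  # fields inherited from the base entry at load time


def load_base_index(jsonl_text: str) -> dict[str, dict]:
    """Index base entries by instance id from JSONL text (e.g. bcbench.jsonl)."""
    index = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            index[row["instance_id"]] = row
    return index


def resolve_entry(cf_row: dict, base_index: dict[str, dict]) -> CounterfactualEntry:
    """Attach base-entry fields to a counterfactual row, failing loudly on gaps."""
    match = CF_SUFFIX.match(cf_row["instance_id"])
    if match is None:
        raise ValueError(f"not a counterfactual id: {cf_row['instance_id']}")
    base_id = match["base"]
    if base_id not in base_index:
        raise KeyError(f"base instance not found: {base_id}")
    # Inherit only the fields the counterfactual row does not override.
    inherited = {k: v for k, v in base_index[base_id].items() if k not in cf_row}
    return CounterfactualEntry(
        instance_id=cf_row["instance_id"],
        base_instance_id=base_id,
        variant=int(match["variant"]),
        problem_statement=cf_row.get("problem_statement", ""),
        base_fields=inherited,
    )
```

Resolving at load time keeps the counterfactual file small and avoids duplicating base data, at the cost of failing when a base entry is missing.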

Structure (restructured from old thesis modules):
- src/bcbench/analysis/family.py: FamilyOutcome, FamilyType, InstanceResult
- src/bcbench/analysis/aggregator.py: build_families() (was family_aggregator.py)
- src/bcbench/analysis/metrics.py: fragility_rate, severity, layer distribution (was evaluator/thesis_metrics.py)
- src/bcbench/analysis/annotation.py: failure sampling + CSV export (was sample_failures.py)
- src/bcbench/types.py: FailureLayer enum (L1-L5)
- evaluator/counterfactual_scores.py: Braintrust scorers (was thesis_scores.py)
- notebooks updated to use proper imports instead of inline stubs
- Removed stale notebooks/thesis/ folder
- 20 new tests (444 total)
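To make the family-level analysis concrete, here is a rough sketch of grouping results into families and computing a fragility rate. The names `InstanceResult`, `FamilyOutcome`, `build_families`, and `fragility_rate` appear in this PR, but their fields, signatures, and the metric definition used here are assumptions. The sketch treats a family as fragile when the base instance is resolved but at least one counterfactual variant is not; the repository's actual definition may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class InstanceResult:
    instance_id: str
    resolved: bool


@dataclass
class FamilyOutcome:
    base_id: str
    base: Optional[InstanceResult] = None
    variants: list[InstanceResult] = field(default_factory=list)

    @property
    def is_fragile(self) -> bool:
        # Assumed definition: the base passes but some variant fails.
        base_ok = self.base is not None and self.base.resolved
        return base_ok and any(not v.resolved for v in self.variants)


def build_families(results: list[InstanceResult]) -> dict[str, FamilyOutcome]:
    """Group each base instance with its "__cf-N" variants."""
    families: dict[str, FamilyOutcome] = defaultdict(lambda: FamilyOutcome(base_id=""))
    for result in results:
        base_id, sep, _ = result.instance_id.partition("__cf-")
        family = families[base_id]
        family.base_id = base_id
        if sep:
            family.variants.append(result)
        else:
            family.base = result
    return dict(families)


def fragility_rate(families: dict[str, FamilyOutcome]) -> float:
    """Share of families with a resolved base and variants that are fragile."""
    eligible = [
        f for f in families.values()
        if f.base is not None and f.base.resolved and f.variants
    ]
    if not eligible:
        return 0.0
    return sum(f.is_fragile for f in eligible) / len(eligible)
```

Restricting the denominator to families whose base instance is resolved separates robustness from raw capability: a model that fails the base task cannot be said to regress on its variants.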

Replace single COUNTERFACTUAL_EVALUATION category with CF_1, CF_2, CF_3, CF_4 to enable batched GitHub Actions runs per variant number.
- Add is_counterfactual, cf_variant, prompt_template_key properties
- Filter entries by __cf-N suffix in dataset list command
- Share counterfactual-template across all CF categories
- Create per-category leaderboard files (cf-1.json..cf-4.json)
- Update copilot-instructions.md and COUNTERFACTUAL.md
- Update tests for new category structure
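The per-variant categories and their properties could be sketched as below. The property names `is_counterfactual`, `cf_variant`, and `prompt_template_key`, the shared `counterfactual-template` key, and the `__cf-N` suffix come from this PR. The enum string values, the `BUG_FIX` member shown for contrast, and the `filter_entries` helper are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class EvaluationCategory(Enum):
    BUG_FIX = "bug-fix"  # assumed value, shown only for contrast
    CF_1 = "cf-1"
    CF_2 = "cf-2"
    CF_3 = "cf-3"
    CF_4 = "cf-4"

    @property
    def is_counterfactual(self) -> bool:
        return self.value.startswith("cf-")

    @property
    def cf_variant(self) -> Optional[int]:
        """Variant number for CF categories, None otherwise."""
        if not self.is_counterfactual:
            return None
        return int(self.value.removeprefix("cf-"))

    @property
    def prompt_template_key(self) -> str:
        # All CF categories share one template; others use their own value.
        return "counterfactual-template" if self.is_counterfactual else self.value


def filter_entries(instance_ids: list[str], category: EvaluationCategory) -> list[str]:
    """Keep only the ids that belong to the given category."""
    if not category.is_counterfactual:
        return [i for i in instance_ids if "__cf-" not in i]
    suffix = f"__cf-{category.cf_variant}"
    return [i for i in instance_ids if i.endswith(suffix)]
```

Splitting by variant number lets each workflow run operate on one slice of the dataset, which is what makes the batched GitHub Actions runs possible.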


Jiawen Sun added 7 commits April 14, 2026 22:08

Add counterfactual dataset (255 entries + problem statements)

5169543

Add counterfactual experiment notebooks (aligned with main, deps stub…

c1b37e5

…bed)

Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…

67a9e97

…sis/counterfactual-dataset

Add cf-1..cf-4 to workflow category dropdowns

9f036f2

haoranpb reviewed Apr 16, 2026

Comment thread evaluator/counterfactual_scores.py

Jiawen Sun and others added 14 commits April 16, 2026 15:28

Auto-detect counterfactual dataset path from InstanceId

48785c1

Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…

c28de64

…sis/counterfactual-dataset

Fix load base instances problem

fc6f4e3

fix matrix issue

7059b6f

fix container problems

d005880

remove low quality dataset

fdf88ea

change validation action for dataset

51cba47

fix patch failed issue

48e5ed5

update cf3 and cf4 datasets

37a48cf

remove images

9118019

Merge branch 'main' into thesis/counterfactual-dataset

459f0ad

fix cf1 and cf2

a704657

Merge branch 'thesis/counterfactual-dataset' of https://github.com/mi…

a526f91

…crosoft/BC-Bench into thesis/counterfactual-dataset

improvement for cf3

d96eb2b

Jiawen-CS wants to merge 21 commits into main from thesis/counterfactual-dataset

2 participants
