What changed in this snapshot: added Coop% @5 columns for both
Qwen3.5-9B and Qwen3.5-35B-A3B-SFT to the Task Inventory tab, based on the 4-VM K=5 run on
coop_pass_at_k. Other tabs are unchanged from the source.
Diff vs prior snapshot: First snapshot — no prior to diff against.
Naming conventions:
- Data-id: {method}-{owner-initials}-v{N} (e.g. solo2coop-wt-v1)
- Checkpoint: {model}-{data-id}-{method}-{date} (e.g. qwen35-9b-solo2coop-wt-v1-fullft-20260420)
- Eval run: {run-id}__{evalset}__{mode}__{run#} (e.g. …__flash__coop__r2)

Table conventions:
- "—" (em dash) marks a missing value; a blank cell means "I forgot".
- Pass@k = # tasks solved in k attempts / # tasks.
- If # Tasks = 1, leave Pass@k as "—" and report Avg Reward instead.

Known issues:
- cobra_task:* and roaring_task:522 scored 0/N on both models: their runner.sh was templated with Python tooling (conda activate testbed; pytest) on Go-only base images, so eval broke before any test could run. PRs #10, #11, #17, #24, #33, #34 fix this; numbers will likely improve materially after the fix lands and the images are rebuilt.
- flask_task:5526 dropped from 33% to 0% between 9B and 35B-SFT, but the sample is tiny (n=3 pairs), so this is within noise.

Serving:
- The 35B-SFT model is CodeConflict/cooperator-qwen35-35b-a3b-sft, served via Modal (vllm_modal_sft.py).

Artifacts:
- gs://cooperbench-rollout-soe-gemini-llm-agents/qwen35-9b-coop-pak.tar.gz (450 MB, 2026-05-07)
- gs://cooperbench-rollout-soe-gemini-llm-agents/qwen35b-a3b-sft-95pair-pak.tar.gz (770 MB, 2026-05-08)
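The Pass@k definition above (# tasks solved in k attempts / # tasks) can be sketched as follows. This is an illustrative helper, not code from the eval harness; the function name, data shape, and example task results are assumptions:

```python
def pass_at_k(attempts_by_task, k):
    """Fraction of tasks with at least one success in the first k attempts.

    attempts_by_task: dict mapping task id -> list of per-attempt booleans,
    in attempt order. Returns 0.0 for an empty inventory.
    """
    if not attempts_by_task:
        return 0.0
    solved = sum(1 for attempts in attempts_by_task.values() if any(attempts[:k]))
    return solved / len(attempts_by_task)


# Hypothetical K=5 attempt records for three tasks (results made up for illustration):
runs = {
    "flask_task:5526":  [False, False, True, False, False],
    "cobra_task:101":   [False] * 5,  # broken runner.sh -> all attempts fail
    "roaring_task:522": [False] * 5,
}
print(pass_at_k(runs, 5))  # 1 of 3 tasks solved -> 0.333...
```

Per the single-task convention above, a row with # Tasks = 1 would skip this and report Avg Reward instead.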
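The run naming scheme ({run-id}__{evalset}__{mode}__{run#}) can be split mechanically. A minimal sketch, assuming the harness has no parser of its own; the function name and the example run string are hypothetical:

```python
def parse_run_name(name):
    """Split '{run-id}__{evalset}__{mode}__{run#}' into its fields.

    Splits on the LAST three '__' separators, so a run-id that itself
    contains '__' is preserved intact.
    """
    run_id, evalset, mode, run_no = name.rsplit("__", 3)
    return {"run_id": run_id, "evalset": evalset, "mode": mode, "run": run_no}


# Hypothetical run name following the scheme:
fields = parse_run_name("example-run__flash__coop__r2")
print(fields["mode"])  # -> coop
```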