What changed in this snapshot: added Coop% @5 columns for both
Qwen3.5-9B and Qwen3.5-35B-A3B-SFT to the Task Inventory tab, based on the 4-VM K=5 run on
coop_pass_at_k. Other tabs are unchanged from the source.
Diff vs prior snapshot: First snapshot — no prior to diff against.
Naming conventions:
- Data-id: {method}-{owner-initials}-v{N} (e.g. solo2coop-wt-v1)
- Checkpoint: {model}-{data-id}-{method}-{date} (e.g. qwen35-9b-solo2coop-wt-v1-fullft-20260420)
- Eval run: {run-id}__{evalset}__{mode}__{run#} (e.g. …__flash__coop__r2)

Table conventions:
- "—" (em dash) marks a missing value; a blank cell means "I forgot".
- Pass@k = # tasks solved in k attempts / # tasks.
- If # Tasks = 1, leave Pass@k as "—" and report Avg Reward instead.

Known issues:
- cobra_task:* and roaring_task:522 scored 0/N on both models: their runner.sh was templated with Python tooling (conda activate testbed; pytest) on Go-only base images, so eval broke before any test could run. PRs #10, #11, #17, #24, #33, #34 fix this; numbers will likely improve materially after the fix lands and the images are rebuilt.
- flask_task:5526 dropped from 33% to 0% between 9B and 35B-SFT, but the sample is tiny (n=3 pairs), so this is within noise.

Serving:
- The 35B-SFT model is CodeConflict/cooperator-qwen35-35b-a3b-sft, served via Modal (vllm_modal_sft.py).

Artifacts:
- gs://cooperbench-rollout-soe-gemini-llm-agents/qwen35-9b-coop-pak.tar.gz (450 MB, 2026-05-07)
- gs://cooperbench-rollout-soe-gemini-llm-agents/qwen35b-a3b-sft-95pair-pak.tar.gz (770 MB, 2026-05-08)
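The Pass@k definition above (# tasks solved in k attempts / # tasks) can be sketched as follows. This is an illustrative helper, not code from the eval harness; the function name, data shape, and example task results are assumptions:

```python
def pass_at_k(attempts_by_task, k):
    """Fraction of tasks with at least one success in the first k attempts.

    attempts_by_task: dict mapping task id -> list of per-attempt booleans,
    in attempt order. Returns 0.0 for an empty inventory.
    """
    if not attempts_by_task:
        return 0.0
    solved = sum(1 for attempts in attempts_by_task.values() if any(attempts[:k]))
    return solved / len(attempts_by_task)


# Hypothetical K=5 attempt records for three tasks (results made up for illustration):
runs = {
    "flask_task:5526":  [False, False, True, False, False],
    "cobra_task:101":   [False] * 5,  # broken runner.sh -> all attempts fail
    "roaring_task:522": [False] * 5,
}
print(pass_at_k(runs, 5))  # 1 of 3 tasks solved -> 0.333...
```

Per the single-task convention above, a row with # Tasks = 1 would skip this and report Avg Reward instead.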
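The run naming scheme ({run-id}__{evalset}__{mode}__{run#}) can be split mechanically. A minimal sketch, assuming the harness has no parser of its own; the function name and the example run string are hypothetical:

```python
def parse_run_name(name):
    """Split '{run-id}__{evalset}__{mode}__{run#}' into its fields.

    Splits on the LAST three '__' separators, so a run-id that itself
    contains '__' is preserved intact.
    """
    run_id, evalset, mode, run_no = name.rsplit("__", 3)
    return {"run_id": run_id, "evalset": evalset, "mode": mode, "run": run_no}


# Hypothetical run name following the scheme:
fields = parse_run_name("example-run__flash__coop__r2")
print(fields["mode"])  # -> coop
```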