send_message proposing a file split,
before any bash exploration.(input_ids, output_ids) pairs straight into
TitoSFTDataset, with zero re-tokenisation.pallets_click_task:2068 (frozen 1,2) +
dspy_task:8394 (frozen 1,2) — both excluded from every
training pool.scripts/eval_coop_behavior.py):
plan_first: first agent action is a
send_message (no bash before it).plan_content: the send body contains plan-keywords
(split, propose, handle, …) and real task file
paths.follow_through: each agent’s subsequent bash
actually edits a file it claimed in the plan.plan_first ≥ 70%,
plan_content ≥ 70%,
follow_through ≥ 60%.mini_swe_agent adapter, Modal-hosted vLLM serve,
disable_tools: true at eval time (matches the no-tools
training prompt distribution).| # | Date | Iteration | plan_first | plan_content | follow_through | Result |
|---|---|---|---|---|---|---|
| 1 | 05-10 | First attempt (templated, monolingual) | 0% | 0% | 0% | FAIL — agent ignored plan-first, went straight to bash |
| 2 | 05-11 | Re-run with fixes | 0% | 0% | 0% | FAIL — same OOD problem |
| 3 | 05-11 | Fix-#2 (parser / template alignment) | 0% | 0% | 0% | FAIL — adapter format misalignment masked the real issue |
| 4 | 05-11 | OOD diagnosis (synthetic first-turn eval) | 100% (20/20) | 100% (20/20) | n/a | Diagnostic: behaviour learned, but only inside the training prompt distribution |
| 5 | 05-11 | Real-task training v2 (asymmetric data) | 0% | 0% | 0% | FAIL — student trained only on one role |
| 6 | 05-11 | Symmetric real-task v3 | 100% | 0% | 0% | Partial: planning fires; plan body generic, no real file paths |
| 7 | 05-11 | Symmetric + runtime tool format v4 | 50% | 0% | 0% | Partial regression — tool-call format change destabilised the plan-first turn |
| 8 | 05-13 | v5-tf — task-derived file paths in the templated plan | 100% (2/2) | 100% (2/2) | 75% (3/4) | PASS — all three thresholds |
| 9 | 05-15 | Tier 5 Phase A — re-tokenised TITO from Gemini 3 Pro | 100% (2/2) | 100% (2/2) | 100% (4/4) | PASS — validates the TITO training path |
| 10 | 05-15 | Tier 5 Phase B v3 — native capture from Qwen3-8B teacher (13% follow-through) | 100% | 100% | 0% | Partial — planning transfers; follow-through is teacher-bounded |
| 11 | 05-15 | Tier 5 Phase B v4c — native capture from Qwen3.5-27B teacher (95.7%), Qwen3.5-2B student | 100% | 100% | 50% | Partial — biggest split-direction confirmation: stronger same-tokenizer teacher does break the follow-through ceiling |
| Teacher | Teacher follow_through | Student follow_through |
|---|---|---|
| Gemini 3 Pro (prompted) | 82.6% | 100% (Phase A) |
| Qwen3-8B (prompted) | 13% | 0% (Phase B v3) |
| Qwen3.5-27B dense (prompted) | 95.7% | 50% (Phase B v4c) |
plan_first and plan_content clear 100% in all
three Tier-5 runs. The split is entirely on follow_through, and
it tracks the teacher’s own follow-through in both
directions — distillation is teacher-bounded, and the metric
that needs the hardest behaviour (actually editing the file you named) is
where a weak teacher’s ceiling shows.
scripts/build_tito_distill_data.py — teacher rollouts
→ TITO JSONL, with --strip-think (now works across
Qwen3 and Qwen3.5 chat templates after v4c’s fix),
--upweight-first-turns, --filter-by-eval
(rejection sampling).coopertrain/serve/vllm_modal_tito_teacher.py (Qwen3.5-27B
dense teacher serve); vllm_modal_qwen35_2b_student.py
(Qwen3.5-2B student serve, adapter baked in via
--lora-modules so every Modal replica boots with it).coopertrain/configs/verl/sft_qwen35_2b_tito_distill.yaml,
coopertrain/configs/coop_plan_first_qwen35.yaml.tests/integration/test_behavior_eval.py
for native-tool-call parsing._assistant_content, inbox-regex extended to
[Reply from X]:, plan-first walk sweeps
role="tool").Three of the four Phase B v4c non-followers hit
ContextWindowExceeded / LimitsExceeded at steps
88–100. Concrete levers, cheapest first:
max_steps 100 → 150–200 and
max_model_len 16384 → 32768 — no
retrain, plausibly 50 → 75%.Distillation can only transfer a behaviour the teacher actually exhibits, so the teacher dataset was scored before training on it. Two findings, both confirming the worry that “even a frontier model doesn’t really plan-first”:
| Teacher | plan_first | plan_content | follow_through | note |
|---|---|---|---|---|
| Gemini 3 Pro — unprompted (default coop.yaml) | 50% | 0% | 50% | Coordinates on one of two held-out pairs; never uses the templated phrasing. A frontier model does not reliably plan-first on its own. |
Gemini 3 Pro — prompted (coop_plan_first_prompted.yaml) |
100% | 100% | 100% | A mandatory-first-action protocol section steers it cleanly. 23/23 training-pool pairs scored 100/100; follow_through 82.6% (used for rejection sampling). |
| Qwen3-8B — prompted (Phase B v3 teacher) | 100% | 100% | 13% | Plans-first reliably; barely follows through — short trajectories (≈3.4 turns), plans then stalls. Distillation is bounded by teacher capability. |
| Qwen3.5-27B dense — prompted (Phase B v4c teacher) | 95.7% | 95.7% | 95.7% | The same-tokenizer Qwen family at scale: same chat template as the
Qwen3.5-2B student (vocab 248k), so native-captured token ids drop
straight into TitoSFTDataset with no re-tokenisation,
and follow-through is actually there to distil. 22/23 train pairs
scored 100/100/100; used for v4c rejection-sampling. |
The steering prompt is applied to the teacher only. Training data has
the “Plan-First Coordination Protocol” section stripped back to
the default coop.yaml prompt before tokenisation, so the
student learns to plan-first under the prompt distribution it will actually see at
inference — not a prompt-following shortcut.
The first Gemini probe scored 0/0/0 — an eval artefact, not the
model. scripts/eval_coop_behavior.py only read
send_message out of the assistant content string;
litellm-native models (Gemini, Qwen) put the call in
tool_calls.function.arguments. Three coupled fixes:
_assistant_content() synthesises a unified view from both
the literal content and any tool_calls bash
invocations, so XML-in-content (LoRA) and structured tool calls
(Gemini/Qwen) are both visible.[Reply from X]: — the
shape send_message --wait returns inline — not just
[Message from X]:.role="tool" messages too,
since the --wait reply arrives inside the bash tool output,
not a separate role="user" turn.Re-scored apples-to-apples, v5-tf is unchanged (100/100/75) — the
LoRA writes XML into content, which the old eval already saw.
| Metric | v5-tf (chat-format SFT, baseline) | Phase A v4 (TITO distillation) | Threshold |
|---|---|---|---|
| plan_first_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| follow_through_rate | 75% (3/4) | 100% (4/4) | ≥ 60% |
Pipeline. 23 prompted-Gemini coop rollouts on the training
pool → rejection-sample to the 19 pairs that scored 100/100/100 → strip
the steering prompt → rewrite Gemini’s structured tool_calls
into the Qwen inline <tool_call><function=bash> XML the
student emits → re-tokenise each turn with the Qwen3-4B chat template →
{input_ids, output_ids} parquet → TitoSFTDataset
→ Qwen3-4B + LoRA on 2×H100. Gemini’s tokenizer differs from
Qwen’s, so the teacher text must be re-tokenised with the
student’s tokenizer — that is what makes this “Phase A”
rather than native capture.
What it took to get there — small-set TITO distillation is hyperparameter-sensitive:
| Iter | Change | Result |
|---|---|---|
| v1 | no turn-upweight, LR 5e-6 | Student never stops thinking. Per-turn TITO expansion buries the plan-first turn: of 571 rows only 38 are turn-0. |
| v2 | upweight turns 0–2 ×12, LR 2e-5 | 100 / 100 / 25. XML format locked in; the heavily-duplicated planning turns starved the bash-work turns, so follow_through stayed low. |
| v3 | upweight ×4 (re-balance planning:work) | 0 / 0 / 0. Re-balancing under-emphasised
the output-format signal — the student regressed to markdown
```bash blocks the coop agent loop can’t parse. |
| v4 | v2 data (×12, format locked) trained the full 4 epochs | 100 / 100 / 100 at step 200. The extra epochs gave the (un-upweighted) work turns enough exposure to lift follow_through 25% → 100%, without touching the format. |
Takeaway. The TITO path itself is correct —
TitoSFTDataset consumes {input_ids, output_ids} parquet
and the trained student exhibits the distilled behaviour on held-out tasks. The
sensitivity is a data-shape property: per-turn expansion dilutes the
first-turn signal ~15× vs. per-trajectory, and the output-format token
sequence and the behaviour both need enough (upweighted) exposure or they
don’t stick.
Phase A re-tokenises, which means the training ids are not exactly
what the teacher emitted. Phase B is the canonical TITO promise: capture the
exact prompt_token_ids + output_token_ids the
inference engine returned (capture_token_ids: true →
extra_body={"return_token_ids": true} →
token_capture block on each assistant message →
coopertrain/verl/tito_capture.py). Teacher: Qwen3-8B
on Modal vLLM — a larger sibling of the 4B student, sharing the Qwen3
tokenizer, so captured ids are trainable on the student with zero
re-tokenisation.
Mechanically validated. The serve returns
token_ids per request; tito_capture.py extracts one
(input_ids, output_ids) pair per assistant turn with
skipped_no_capture=0; the parquet trains through
TitoSFTDataset. Every hop of the native-capture path works.
The planning behaviour transfers; follow-through tracks the teacher. Phase B v3 (Qwen3-8B teacher) lands at 100 / 100 / 0; Phase B v4 swaps to a stronger same-tokenizer teacher (Qwen3.5-27B dense) and lifts follow-through to 50%:
| Iter | Change | Result |
|---|---|---|
| v1 | native capture as-is | Student thinks 3000+ chars then emits a corrupted
<command> tag — native capture faithfully
grabbed Qwen3-8B’s verbose <think> trace, and
distilling that into a small student teaches the verbosity, drowning
the action-format signal. |
| v2 | --strip-think: slice the
<think>…</think> span out of the native
id list (token-level, no re-tokenisation) |
No more verbose thinking, but the
<tool_call> wrapper isn’t reliably learned
from ~640 rows — format-unstable, coop agent loop parses
nothing, 0/0/0. |
| v3 | --strip-think + upweight ×12 (1240 rows) |
100 / 100 / 0. The ×12
upweight locks the <tool_call> format the same way
it did for Phase A v4 — both held-out agents now emit clean,
file-rich plan proposals (7 plan-keyword hits, real task paths).
follow_through stays at 0%: the Qwen3-8B teacher only
followed through 13% of the time, so there is almost no work-turn
signal to distil. |
| v4c | Swap teacher to Qwen3.5-27B dense (same Qwen3.5
tokenizer as the new Qwen3.5-2B student); rebuild
_strip_think_span for the Qwen3.5 chat template
(add_generation_prompt auto-emits
<think>\n\n</think> in the prompt,
so the captured output starts with reasoning content
rather than <think>); 2173 rows, ×12
upweight, 2 epochs. |
100 / 100 / 50. Plan-first +
plan-content stay at 100% with full keyword hits and real task
paths; follow_through climbs from 0% to 50% (2/4
agents). The two non-followers each hit
ContextWindowExceeded / LimitsExceeded
mid-execution — a 2B-student step/context-budget limit, not
a distillation failure. The Qwen3.5-27B teacher passed the three
metrics at 95.7 / 95.7 / 95.7
on the train pairs, so the work-turn signal is finally there to
distil. |
The instructive contrast. All three runs nail
plan_first and plan_content at 100% — planning
behaviour distils cleanly through either re-tokenised or natively
captured TITO. They split entirely on follow_through, and the
split tracks teacher capability in both directions: Gemini 3 Pro 82.6%
→ student 100% (Phase A); Qwen3-8B 13% → student 0% (Phase B v3);
Qwen3.5-27B 95.7% → student 50% (Phase B v4c). Distillation is bounded
by the teacher, and follow_through — the metric that needs
the hardest behaviour (actually editing the files you named) — is where
a weak teacher’s ceiling shows.
What v1–v4c cost to get there. Native capture is
honest to a fault: v1 captured Qwen3-8B’s verbose
<think> trace verbatim and the student drowned in it; v2
stripped the think span but ≈640 rows couldn’t lock the
<tool_call> format; v3 needed the same ×12
turn-upweight as Phase A v4 to make the format stick. v4 then exposed a
chat-template assumption: the Qwen3 strip-think rule
(ids[0] == <think>) didn’t fire for Qwen3.5 because
add_generation_prompt auto-emits
<think>\n\n</think> in the prompt, so the
captured output starts with reasoning content rather than
<think>; v4b silently trained on full reasoning + a stray
</think> and produced 0/0/0; v4c rebuilt the strip to slice up
to the first </think> regardless of the opening token,
covering both shapes. The recurring lesson across all four iterations:
small-set TITO distillation is data-shape-sensitive, and every assumption
about the token sequence — chat-template formatting included —
needs to be verified per tokenizer.
What follow-through 50% leaves on the table.
Three of the four held-out agents hit ContextWindowExceeded or
LimitsExceeded at steps 88–100 (max=100) with
max_model_len=16384 — mid-execution, not at the plan
stage. Concrete levers to try, in order of effort/value:
max_steps 100
→ 150–200 and max_model_len 16384 → 32768.
No retrain; addresses 3 of 4 non-pass cases directly. Plausibly takes
v4c from 50% to 75%.TitoSFTDataset on {input_ids, output_ids} parquet
and the student passes 100/100/100 on held-out cooperbench tasks —
the validation tier 4 / v5-tf did not actually perform.return_token_ids → token_capture
→ tito_capture.py → parquet →
TitoSFTDataset, zero re-tokenisation — lands
plan_first and plan_content at 100%. The native
path is not just mechanically sound; the distilled student exhibits the
captured behaviour.follow_through. All three runs agree on
plan_first + plan_content at 100% and split
entirely on the third metric — tracking the teachers’ own
follow-through (Gemini 82.6% → 100%, Qwen3-8B 13% → 0%,
Qwen3.5-27B 95.7% → 50%), not the capture method. The Qwen3.5-27B
run confirms the relationship in the opposite direction from Phase B v3:
a stronger same-tokenizer teacher does break through the
follow_through ceiling.scripts/build_tito_distill_data.py — teacher
trajectories → TITO JSONL. Strips the steering prompt, rewrites
tool calls to Qwen XML, optional --native-output /
--strip-think for Phase B, --upweight-first-turns
/ --filter-by-eval (rejection sampling).coopertrain/configs/verl/sft_qwen3_4b_tito_distill.yaml /
sft_qwen35_2b_tito_distill.yaml — TITO SFT configs
(TitoSFTDataset, use_remove_padding,
2×H100 LoRA).coopertrain/configs/coop_plan_first_prompted.yaml /
coop_plan_first_teacher_b.yaml /
coop_plan_first_qwen35.yaml — teacher steering
+ student-eval configs.coopertrain/serve/vllm_modal_tito_teacher.py —
teacher serve (Qwen3.5-27B dense for v4c, Qwen3-8B previously);
vllm_modal_qwen35_2b_student.py — Qwen3.5-2B
student serve (adapter baked in via --lora-modules so
every Modal replica boots with it).metrics-gemini3pro-prompted.json,
metrics-tito-distill-v4-s200.json,
metrics-tito-distill-b-*.json,
metrics-tito-distill-b-v4c.json.| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | v5-tf (+task-derived paths) | Threshold |
|---|---|---|---|---|---|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 0% | 0% | 0% | 100% (2/2) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | 75% (3/4) | ≥ 60% |
For each task, scripts/gen_plan_first_coop_data_real_tasks.py now extracts
actual file paths from the two feature.md bodies using the eval's own
regex, and samples (file_a, file_b) from those — so the plan body, the bash
target, and the prompt context all reference the same set of paths. Falls back to the legacy
_FILE_PAIRS list only when a task description exposes fewer than two
eval-extractable paths.
The eval's plan_content check requires each agent's first
send_message body to have ≥ 2 plan keywords and
≥ 1 eval-regex file path. follow_through requires the agent's bash
to touch a file from plan_files = a_files ∪ b_files.
In v3/v4 the training data drew (file_a, file_b) from a hardcoded list
(src/cli.py, flask/json/__init__.py, ...) that was uncorrelated
with the actual task. The model learned two patterns at once: "name the trained
paths" and "name paths visible in the prompt context." At inference, agent1 typically
generalized to prompt-derived paths (e.g. dspy/clients/cache.py) but agent2 fell
back to short acknowledgments with no path at all — collapsing plan_content to
0%. And the bash steps referenced the trained paths, not the plan paths, so
follow_through was structurally pinned at 0%.
With task-derived paths in training, the model has a single consistent pattern: paths come from the task. Both agents follow it; plan paths = bash paths.
| Knob | Value | Note |
|---|---|---|
| Data generator | gen_plan_first_coop_data_real_tasks.py (v5) |
Adds _extract_task_files + select_file_pair. |
| Task pool | cooperdata_tasks_v5.json (30 tasks, 12 repos) |
Discovered from the live HF dataset via
scripts/build_task_pool_from_dataset.py; the legacy pool was built
against an older snapshot whose repo names no longer match
cooperbench prepare output, so every entry was skipping with
"missing feature pair". |
| Held-out repos | pallets_click_task, dspy_task |
Same two repos as the v3/v4 eval pairs. |
| Trajectories | 2 300 (50 / task × 23 training tasks × 2 agents) | 2 070 train / 230 val parquet rows after the 10% val split. |
| Training | 2×H100 FSDP, 516 steps | Final val/loss 0.0405 (vs v4's 0.30 at step 100 — ≈ 7× lower). The single-pattern data converges crisply. |
| Adapter | /ckpts/plan-first-real-v5-tf/peft/lora_adapter |
132 MB safetensors; served as model id plan-first-v5tf. |
| Held-out eval | 2 pairs, K=1, step_limit=100 | Both rollouts hit the agent loop's 100-step ceiling; eval scores the first ~2 turns of behavior regardless. |
| Pair | plan_first | plan_content | follow_through (a / b) | plan_files (union) |
|---|---|---|---|---|
pallets_click_task/2068/f1_f2 |
✓ | ✓ | ✓ / ✓ | src/click/_termui_impl.py, src/click/termui.py |
dspy_task/8394/f1_f2 |
✓ | ✓ | ✓ / ✗ | dspy/clients/__init__.py, dspy/clients/cache.py,
jinja2/sandbox.py, tests/test_sandbox.py |
The one miss is dspy agent2: it produced a valid file-rich plan but its bash
steps never touched any of the union plan_files. The plan union for dspy
includes two real dspy paths and two legacy fallback paths
(jinja2/sandbox.py, tests/test_sandbox.py) — that's the model
mixing a task-derived plan with a fallback-influenced ack, which the eval's union check
papers over for agent1 but leaves agent2 stranded when its bash uses different paths
again. Adding a responder-form variant to the data generator (turn 1 = echo waiting,
inbox arrives, turn 2 = ack that echoes both file paths) is the natural
follow-up if the bar moves higher than this experiment's 60% threshold.
POST /v1/load_lora_adapter for an already-loaded lora_name returns
Success but doesn't actually swap the underlying path — the server keeps
serving whatever was loaded first. /v1/models exposes the old root
path. The first eval pass on v5-tf returned 0/0/0 because it was scoring v3 rollouts (the
prior adapter was still active under the plan-first name). Workaround: load
the new adapter under a fresh lora_name
(plan-first-v5tf) and re-target -m openai/plan-first-v5tf at the
cooperbench CLI. The runbook's hot-reload section needs an "unload+restart" alternative
for the in-place-path-change case — or the convention of incrementing the lora_name on
every retrain.0.0.8's execute_coop crashes on
mixed-type message timestamps.
sent_msgs.sort(key=lambda x: x.get("timestamp") or 0) at
cooperbench/runner/coop.py:148 blows up with
TypeError: '<' not supported between instances of 'int' and 'str' when one
agent reports a numeric timestamp and the other a string one. The crash fires
before agent{fid}_traj.json is written, so the eval never sees the
trajectories even though the rollouts ran to completion. Patched locally to
float(ts) on best-effort, fallback 0.0. Worth upstreaming.follow_through.agent2 follow_through). The proposer-form-only training works for both
agents on most repos; the responder-form would close the remaining drift when the
runtime delivers an INBOX before the agent's first action.plan-first-v5tf, plan-first-v6, ...) and have the agent config
pick it up.# On a 1xH100 host with modal authed and uv synced:
bash scripts/run_plan_first_v5.sh
# Or step-by-step (Modal handles data + train + merge; local handles rollouts + eval):
modal run scripts/modal_plan_first_train.py \
--steps "data_v5,train,merge" \
--n-per-task 50 \
--ckpt-dir /ckpts/plan-first-real-v5-tf \
--peft-dir /ckpts/plan-first-real-v5-tf/peft \
--held-out-repos "pallets_click_task,dspy_task"
curl -X POST $ENDPOINT/v1/load_lora_adapter \
-d '{"lora_name":"plan-first-v5tf","lora_path":"/ckpts/plan-first-real-v5-tf/peft/lora_adapter"}'
# Rollout + eval (see scripts/run_plan_first_v5.sh for the cooperbench invocation)
| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | Threshold |
|---|---|---|---|---|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | ≥ 70% |
| plan_content_rate | 0% | 0% (per-agent file refs missing) | 0% (per-agent file refs missing) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | ≥ 60% |
build_plan_first_trajectory so both agents start with a send_message
proposal from their own perspective ("I'll do MY_FILE, you do THEIR_FILE"), dropping the
asymmetric agent2 "echo inbox check" precursor. The synthetic eval on one
training row now returns plan_first=true, plan_content=true, follow_through=true
— i.e., the data is by-construction eval-compatible.pallets_click_task agent2 names real PR files
(src/click/_termui_impl.py + src/click/termui.py) but agent1 hedged
with "what are your plans?" — no file paths, so plan_content fails because the
eval demands per-agent file refs. dspy_task both agents hedged.scripts/eval_coop_behavior.py:
_extract_inboxes only matched the synthetic
INBOX:\\n From X: body shape. The runtime delivers inter-agent messages as
a separate role="user" message with content
[Message from X]: body. Now matches both.role="user" with
<tool_response>...</tool_response> wrapper; runtime emits
role="tool" with JSON content
{"returncode": 0, "output": "..."}. After the first send_message
the model sees an OOD tool-result frame and produces nothing parseable for the next ~99
steps. This is why follow_through never lights up.role="tool" with JSON content matching the
runtime, and inbox delivery is a separate role="user" message with
[Message from X]: body. The single-row synthetic eval still passes.dspy_task improved
(plan_first 1/1, plan files include the real dspy/clients/__init__.py +
dspy/clients/cache.py), but pallets_click_task agent1 produced no
parseable output at all (Chinese thinking text + JSON tool calls with wrong function name,
not the trained Qwen XML form). Net: plan_first 100% → 50%. The
format shift broke per-task generalization at step-100 training scale; longer training
(step 200+) or more iterations might recover it but I left v3 as the deployed adapter
since it's the better headline.plan_content and follow_throughsend_message body. The training data has both agents propose
specific files (e.g. src/cli.py + tests/test_cli.py), but at
inference the model hedges with "what are your plans?" when the task is sufficiently OOD.
Likely fixes: wider task pool, or RL on the eval reward signal (the model learns to always
propose).send_message body has concrete file paths from the task's actual repo
(e.g. pull from combined.patch filenames), not just the synthetic
_FILE_PAIRS list.Artifacts:
v3 data data/sft/plan_first_real_v3/combined.jsonl (symmetric trajectories);
v3 checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v3/global_step_100
(merged adapter at /peft/lora_adapter, currently loaded on the live serve);
v4 data data/sft/plan_first_real_v4/combined.jsonl (+ runtime tool format);
v4 checkpoint :/ckpts/plan-first-real-v4/global_step_100 (merged adapter available
but not the active deploy);
metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-{v3,v4}.json;
trajectories logs/plan-first-eval-real-{v3,v4}/coop/.../f1_f2/.
pallets_click_task:2068 prompt now produces
send_message agent2 "Let's split the work: I'll handle src/click/_termui_impl.py,
you handle src/click/termui.py…" — naming the actual files from the PR
description. Compare to v1 / v3 which gave back literal "[Makes bash tool call with
{"command": "ls -la"} as arguments]" (echoing the prompt's example_response).prepare_sft_data.build_instance_prompt (Task + Situation + Messaging only,
~2,780 chars). The live cooperbench rollout sends ~7,239 chars including
<example_response>, <system_information>, the
command-examples block, and the Submission/CRITICAL sections. v2 uses the same template
at training time, so the model trained on the exact tokens it sees at inference.send_message, but agent2's first turn is a throwaway
echo inbox check that lets the messaging connector deliver
the inbox before the next assistant turn. At inference, agent2
sometimes goes straight to real bash (e.g. ls -la src/click),
skipping the echo-check, so the eval's _walk_until_real_bash
finds no send_message from agent2 before the first real bash.
plan_first requires both agents to have both sent
and received, so a one-sided exchange scores 0 even when one agent
cleanly plan-firsted.dspy_task:8394, agent2's response distribution is OOD
enough that it still loops on "no tool calls found" for all 100 steps and
emits no assistant turns. The model generalized to
pallets_click_task but not dspy_task in this
small training run (21 tasks, n_per_task=50). Wider task coverage or
more training would likely close this.Three loose ends got tightened while debugging:
scripts/modal_plan_first_train.py now takes --train-parquet,
--val-parquet, and --ckpt-dir so independent training runs
don't clobber each other's checkpoints.scripts/eval_coop_behavior.py now recognizes both content
encodings: Qwen <function=bash><parameter=command>
XML and the markdown ```bash form that
coopertrain/agents/mini_swe_agent/models/litellm_model.py
rewrites to before saving. Same for send_message heredoc vs
quoted form. Without this, the eval was structurally blind to dispatched
actions on any trajectory that passed through the rewriter.cooperbench/runner/coop.py in place: the post-run
sent_msgs.sort(key=lambda x: x.get("timestamp") or 0) crashed
with TypeError: '<' not supported between instances of 'int' and
'str' because some messages had int timestamps and others string.
Without the patch, both rollouts completed but failed to save
agent{1,2}_traj.json, blocking the eval entirely.send_message (the ack), not a precursor
echo inbox check. This makes the trained behavior pattern match the eval's
mutual-exchange criterion. Cheapest fix.Artifacts: real-task data
data/sft/plan_first_real_v2/combined.jsonl (2,100 rows, avg user prompt
6,552 chars vs 7,239 in the live rollout — basically identical);
checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v2/global_step_100;
adapter /ckpts/plan-first-real-v2/peft/lora_adapter (132 MB);
metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-v2.json;
trajectories logs/plan-first-eval-real-v2/coop/.../f1_f2/;
v2 train app ap-EB4NGuYCCq7Ks1cMLGzryU (stopped after step 200 save attempt).
| Metric | Synthetic prompts (this test) | Cooperbench held-out (v3 rollouts) | Threshold |
|---|---|---|---|
| plan_first_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |
| plan_content_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |
The gap between the two columns is the failure mode. The LoRA is fine; the SFT data distribution doesn't cover what the eval feeds the model. Next step: regenerate SFT data on top of cooperbench's actual task template (or distill from Gemini-Flash rollouts under TITO). Either makes the train / inference prompt distributions match by construction.
The original fix-#2 framing (“regenerate SFT data so its rendered chat-template
output matches inference”) turned out to be wrong: I verified by running one row of
combined.jsonl through AutoTokenizer.apply_chat_template and
the assistant <tool_call> XML survives the template verbatim. So I went
hunting for the real divergence and made three serve / adapter changes along the way:
--reasoning-parser qwen3 from the vLLM serve. The
parser captures everything before </think> into the response’s
reasoning field. The trained model never emits </think>
(training data has no thinking tags), so the entire output — including the
<tool_call> XML — was being routed into reasoning while
content came back null and tool_calls=[]. Confirmed
by reading the raw response object on the first attempt’s serve.chat_template_kwargs.enable_thinking=false from the
agent config. That flag injects a literal <think>\n\n</think>\n\n
after <|im_start|>assistant\n, which is OOD vs training (the trained
assistant turns start with prose directly). With the flag set, the model fell back to
echoing the markdown bash example from the system prompt — the “markdown bash
blocks” output observed in the v2 smoke test. Default mode (no flag) renders the
same prompt tail as training and the model emits the correct XML.LitellmModel. Two issues stacked on top of each other:
mini_swe_agent agent loop uses
action_regex = r"```bash\\s*\\n(.*?)\\n```" on
response.content; it does not consult tool_calls. The
model emits the XML form, which doesn’t match.send_message interceptor matches
send_message agent "msg" and send_message agent 'msg';
the training data uses heredoc form send_message agent <<'MSG'\\n…\\nMSG,
which doesn’t match either._extract_tool_calls_from_content,
_rewrite_xml_tool_calls_to_markdown + _heredoc_to_quoted_send_message)
so that with disable_tools=True, LitellmModel.query rewrites
the assistant content from the XML+heredoc form into the markdown+quoted form the agent
loop already speaks. Verified end-to-end on training-style prompts: 10/10 produce
parseable actions.On the two real held-out coop tasks the model returns no
<tool_call> blocks at all — instead it emits a short Chinese-then-English
narration followed by literal text like “[Makes bash tool call with
{"command": "ls -la"} as arguments]”. The system prompt is identical to training
(same templated phrasing) but the task message is the full cooperbench PR
description (50+ lines, with embedded markdown code blocks, “Solution” sections,
type annotations). The synthetic training data uses 3–5-line tasks. The LoRA never
generalized from short templated tasks to long PR-style tasks — it goes off-distribution
and stops producing tool calls.
This is consistent with the val/loss curve: 2.36 → 0.098 over 424 steps. It memorized
the templated distribution very well. It did not learn an invariant “respond with
<tool_call> + send_message regardless of task shape.”
Under TITO we’d have captured the exact token stream from a real Gemini-Flash plan-first rollout against a cooperbench task — so the training prompt distribution is the inference prompt distribution by construction. The current synthetic-data path optimizes a different distribution than the eval is sampling from. That’s the single biggest lesson from this attempt.
Artifact pointers for this attempt:
serve ap-ED8tmOxyyYDBlMGvhdE7in (redeployed without --reasoning-parser);
agent config coopertrain/configs/coop_plan_first.yaml (no
enable_thinking flag);
adapter helpers in coopertrain/agents/mini_swe_agent/models/litellm_model.py;
rollouts logs/plan-first-eval-v3/;
metrics report/2026-05-10-plan-first-cooperbench-results/metrics-v3.json.
| Metric | First attempt | Re-run | Threshold |
|---|---|---|---|
| plan_first_rate | 0% | 0% | ≥ 70% |
| plan_content_rate | 0% | 0% | ≥ 70% |
| follow_through_rate | 0% | 0% | ≥ 60% |
| final SFT val/loss | ~1.9 (step 42) | 0.098 (step 424) | — |
| LoRA ≠ base on probe | no (identical) | yes | — |
LoRA emits send_message | no | yes | — |
LoRA emits <tool_call> XML | no | no | — |
disable_tools flag to LitellmModelConfig (commit 1912bdb) so the Qwen3 chat template stops injecting tool descriptions at inference — bringing the inference prompt back in line with training.--n-per-task 50 (was 10), total_epochs: 4 (was 2), lr: 2e-5 (was 1e-5). 424 SFT steps / 4.9M loss tokens, vs 42 steps / 200k tokens in the first attempt. Val/loss dropped 2.36→0.098.--detach, so it died at step 82 when the user’s monitor disconnected. Re-launched with modal run --detach and ran to completion in ~80 min.The training itself clearly succeeded this time — val/loss dropped two orders of magnitude (2.36→0.61→0.16→0.10) across the 4 epochs, and the LoRA output is visibly different from the base model. The smoke test (§4 of the runbook) passed 2/3 conditions:
IDENTICAL: False # ✓ LoRA learned something
LoRA HAS <tool_call>: False # ✗ but not the XML format
LoRA HAS send_message: True # ✓ learned to call the right tool
Sample LoRA output (greedy decode on a training prompt’s system+user prefix):
```bash
send_message --wait agent2 <<'MSG'
What files are you planning to edit?
MSG
```
That is the correct intent (a send_message to the partner agent), but
mini-swe-agent dispatches off the response’s tool_calls field, which vLLM
only populates when it sees literal <tool_call><function=bash>…</function></tool_call>
XML in the generated content. Markdown bash blocks do not parse. Every turn in both held-out
pairs hit “No tool calls found in the response” for all 100 steps before
LimitsExceeded.
The training data combined.jsonl contains <tool_call><function=bash>…</function></tool_call>
in the assistant message text. But the Qwen3 chat template at SFT time appears to have rewritten that
content during rendering — either via tool-call extraction into a structured field, or via the
messages_key=messages path in verl's MultiTurnSFTDataset — so the
tokenized assistant turn that the model actually trained against was the rendered form, not
the literal XML. The rendered form turned out to be markdown bash, so that’s what the model
learned to emit. At inference, vLLM’s qwen3_coder tool parser only knows how to parse XML back
out, so the loop never closes.
tool_calls). The action content is unambiguous — one bash block per
turn. This makes the current LoRA usable as-is, no retraining needed.combined.jsonl entry through the tokenizer’s chat template once,
capturing the rendered assistant turns, and writing those back.Artifact pointers for the re-run: final checkpoint
plan-first-checkpoints:/plan-first/global_step_424; merged PEFT adapter
plan-first-checkpoints:/plan-first/peft/lora_adapter (132 MB);
metrics JSON
report/2026-05-10-plan-first-cooperbench-results/metrics-v2.json;
trajectories logs/plan-first-eval/coop/{pallets_click_task,dspy_task}/.../f1_f2/;
training app ap-ry4fWdYJbCTYVkWyz2N6ue (stopped 2026-05-11 01:44:51 UTC).
Result. Plan-first rate = 0%, plan content = 0%,
follow-through = 0% (n=2 completed coop pairs on held-out repos
pallets_click_task and dspy_task).
Manual probes against the served LoRA confirm: identical token-for-token output between
plan-first and the bare Qwen/Qwen3-4B base on the same prompt.
What this tells us about the pipeline. Five of the six pipeline stages (data gen, parquet, FSDP train, FSDP→PEFT merge, vLLM hot-load, behavioral eval) are end-to-end correct — verified by file-level checks at each handoff. The failure is concentrated in stage 6: inference prompt format does not match training prompt format, so even though the LoRA weights are non-zero and the adapter loads, the model sees an out-of-distribution prompt at inference and falls back to base behavior.
| Component | Value |
|---|---|
| Base model | Qwen/Qwen3-4B (plan called for 9B; see deviation note below) |
| Strategy | LoRA rank=32, alpha=16, target_modules=all-linear, 252 lora_A + 252 lora_B tensors (all 36 layers × 7 modules) |
| Train data | 342 train + 38 val plan-first templated trajectories on cooperdata_tasks.json 19-task held-in pool, 10 per task |
| Train compute | 2 × H100 80GB (FSDP), 42 steps total, 2 epochs, ~4 min wall |
| Train final loss | train 2.54 ← 3.5, val 2.68 (clear downward signal — training did learn something) |
| Adapter checkpoint | Modal vol plan-first-checkpoints:/plan-first/peft/lora_adapter/ (132 MB) |
| Adapter merge | verl 0.7.1 model_merger CLI, with a monkey-patch for a known LoRA task_type bug (peft ≥0.13 returns it as str, verl casts .value) |
| Serve | vLLM 0.19 on Modal H100 (cooperbench--qwen3-4b-plan-first-serve.modal.run/v1),
--enable-lora --max-lora-rank 32 --lora-modules plan-first=/ckpts/plan-first/peft/lora_adapter,
--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 |
| Held-out eval tasks | pallets_click_task/2068 + dspy_task/8394 (f1_f2 of each, 2 pairs total) |
| Rollout | local docker via cooperbench’s --backend docker, mini-swe-agent, step_limit=100 |
| Behavioral eval | scripts/eval_coop_behavior.py unchanged from stage 2 |
clip_grad_norm. We dropped to 4B + LoRA so the pipeline-correctness signal
could be read without fighting hardware. Pipeline correctness doesn’t depend on model
size at this scale.--enable-lora
(saving the merge_and_unload + standalone-HF-model step). Per a feedback memory: vLLM hot-loads
LoRA adapters directly; merging is only useful if pushing to HF Hub.flask_task /
starlette_task; those names are template keys in the team task inventory CSV,
not real cooperbench dataset repos. Since the training data is fully templated (synthetic
prose, no real flask/starlette code), every cooperbench repo is OOD for the model.
Picked the smallest available repos for held-out eval.agent1_traj.json + agent2_traj.json
saved before the kill. Adding more samples won’t change a 0% signal that is already
deterministic at the model level.Same three metrics defined in the plan §6. Pass thresholds from the plan: plan-first ≥ 70%, plan content ≥ 70%, follow-through ≥ 60%.
| Metric | Definition (short) | Score | Threshold | Verdict |
|---|---|---|---|---|
| Plan-first rate | Both agents send_message + receive INBOX before any real bash |
0% (0/2) | ≥ 70% | FAIL no asst turns |
| Plan content | Turn-1 send_message has plan keywords + file path reference |
0% (0/2) | ≥ 70% | FAIL no asst turns |
| Follow-through | Agent touches the file(s) it claimed in its plan turn | 0% (0/4 agents) | ≥ 60% | FAIL no asst turns |
Per the eval script’s reason field, both pairs failed with
"missing assistant turns": 0 messages with role=assistant in the saved trajectory,
out of 101–103 total messages per agent. The agent’s 100 LLM calls all returned
empty tool_calls, triggering the FormatError retry loop until step_limit=100
fired and the run ended with status=LimitsExceeded.
{"summary":{"n_pairs":2,"plan_first_rate":0.0,"plan_content_rate":0.0,"follow_through_rate":0.0,
"n_plan_first":0,"n_plan_content":0,"n_follow_through_agents":0},
"per_pair":[
{"pair_id":"dspy_task/8394/f1_f2","plan_first":false,"reason":"missing assistant turns"},
{"pair_id":"pallets_click_task/2068/f1_f2","plan_first":false,"reason":"missing assistant turns"}
]}
The plan’s §6.1 decomposition says “plan-first low ⇒ model didn’t learn the temporal pattern.” That’s the right ballpark, but the actual failure is sharper: the model produces no tool calls at all, plan-first or otherwise. Walking the pipeline back to find where it broke:
| # | Stage | Check | Result |
|---|---|---|---|
| 1 | Data generation | 342 trajectories × ~15 messages, all decode to valid coop chat format with
<tool_call><function=bash><parameter=command>send_message...</parameter></function></tool_call>
in assistant content. 17 unit tests green. |
PASS |
| 2 | Parquet conversion | per-trajectory expansion via prepare_verl_data.py --mode sft;
messages column matches what MultiTurnSFTDataset expects. |
PASS |
| 3 | FSDP training | 2 epochs, loss 3.5 → 2.54, val 2.68, no NaNs, no OOM. 42 steps total. | PASS |
| 4 | FSDP → PEFT | 252 lora_A + 252 lora_B tensors saved.
Manual byte-level check: first lora_B[0] has 163662/163840 non-zero bytes
(i.e., the adapter is not all zeros — training did move it).
adapter_config.json has task_type=CAUSAL_LM, r=32,
target_modules=[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. |
PASS |
| 5 | vLLM serve | vLLM logs “Loaded new LoRA adapter: name 'plan-first', path '/ckpts/plan-first/peft/lora_adapter'”.
/v1/models shows plan-first as a child of Qwen/Qwen3-4B
with root pointing at the adapter dir. Routing works. |
PASS |
| 6 | Inference behavior | Identical token-for-token output between plan-first and
Qwen/Qwen3-4B on the same prompt
(e.g. completion of “Before I start editing, let me coordinate with agent2 so we”
yields word-for-word the same 80 tokens). Model produces conversational text, never
<tool_call> XML. |
FAIL |
| 7 | Behavioral eval | Correctly emits reason="missing assistant turns"; metrics decompose
cleanly to 0/0/0; per-task breakdown intact. |
PASS |
Stage 6 narrows to one of two not-mutually-exclusive hypotheses:
<assistant> "Before I start editing, let me coordinate with agent2..."
<tool_call><function=bash><parameter=command>
send_message agent2 <<'MSG' ... MSG
</parameter></function></tool_call>
The system prompt during training is the bare COOP_SYSTEM_PROMPT — no
tools field on any message. At inference, mini-swe-agent passes
tools=[BASH_TOOL] + tool_choice="auto" to litellm, which forwards
them to vLLM’s OpenAI-compatible endpoint. The Qwen3 chat template injects tool
descriptions into the system message when tools are provided. So the model sees a
different system prompt at inference than it ever saw during training. The LoRA delta
(rank 32, 42 steps, weak by construction) is not large enough to override the base
model’s “tools ⇒ ask clarifying questions, no tool calls”
prior on this OOD prompt.send_message in turn 1,” this may be under the threshold needed to overpower
base behavior, regardless of prompt format. The non-zero lora_B values show the
adapter did learn something; just not enough to dominate at decoding.The first explanation is the load-bearing one: a manual completion-API probe (raw /v1/completions
with no tools, no chat template) on the prefix “Before I start editing, let me coordinate
with agent2 so we” still returns identical text to base. So even without the chat-template
tools injection, the LoRA doesn’t move the next-token distribution noticeably on this
prefix. That points to (2) being non-trivial too — 42 SFT steps is genuinely low.
The plan-doc §1 listed five things this experiment should validate. Five (data gen + parquet + training + serve + eval) are validated by file-level passes. The sixth (multi-turn behavior emergence) is what failed:
| Plan claim | Validated? | Evidence |
|---|---|---|
| Multi-turn data round-trips through tokenizer | yes | Train loss decreased from 3.5 to 2.54; if tokenization were broken loss would be flat or NaN. |
| Loss masking fires only on assistant tokens | yes | verl MultiTurnSFTDataset handles this; train loss curve confirms gradients are flowing. |
| Tool-call XML round-trips through tokenization | yes | (structural — XML is plain text; no special tokens involved) |
INBOX blocks format identically training ↔ inference |
yes | Generator imports the same string-format helpers (_tool_response) used by the live coop runner; spot-checked on 10 random samples. |
| Cross-agent consistency emerges (agent_2 reads agent_1’s plan) | no | Cannot test — model never plans in the first place at inference. |
tools/tool_choice to vLLM,
relying entirely on prompt-format conventions (the inline <tool_call>
XML in content). Requires mini-swe-agent adapter changes or a custom
litellm_model override; more surgical but more code.plan-first
output ≠ base output” would catch this regression class without burning 100 LLM
calls per held-out task.| File | Purpose |
|---|---|
scripts/modal_plan_first_merge.py |
FSDP-sharded LoRA → PEFT adapter on Modal volume (with monkey-patch for verl 0.7.1 LoRA task_type bug). |
coopertrain/serve/vllm_modal_plan_first.py |
Modal vLLM serve: base Qwen3-4B + hot-loaded plan-first LoRA, qwen3_coder tool parser, qwen3 reasoning parser. |
coopertrain/configs/coop_plan_first.yaml |
Agent config pointing at the Modal serve endpoint, with enable_thinking=false chat-template override. |
report/2026-05-10-plan-first-cooperbench-results.html |
This document. |
report/2026-05-10-plan-first-cooperbench-results/metrics.json |
Raw eval output; per-pair / per-task breakdown. |
pyproject.toml (modified) |
tensordict pin bumped to >=0.8,<0.11 for verl 0.7.1 compatibility (was >=0.5,<0.7, stale). |
Question: if we inject a multi-turn agentic behavior into the SFT data
— specifically, “agents discuss a plan via send_message before any
bash, then each does their assigned piece” — does the trained model
actually exhibit that behavior in real cooperbench coop rollouts? If yes, the whole pipeline
(rollout-time TITO capture → JSONL → parquet → TitoSFTDataset
→ trainer → checkpoint → rollout under coop) is end-to-end correct on the
workload that matters.
Why plan-first specifically: it’s the smallest behavior that exercises every concern of the pipeline simultaneously — multi-turn history, loss masking, tool-call boundary preservation, INBOX block formatting, cross-agent coordination. A surface signature or even running-sum accumulator would catch a strict subset.
output_ids).send_message, bash) round-trip through tokenization without drift.INBOX:\n From agent_X: ... blocks are formatted identically training ↔ inference.| Tier | What it proves | Cost | Status |
|---|---|---|---|
| 1. Running-sum smoke (synthetic) | TITO data → trainer → multi-turn behavior preserved at inference | ~10 min, 1 GPU | deferred |
| 2. Per-K degradation curve | No silent truncation across turn depth | (free, same run as 1) | deferred |
| 3. Running-sum under compaction | TITO capture beats reconstruction (the PR #29 promise) | ~30 min, 1 GPU | deferred |
| 4. Plan-first cooperbench (templated) | Pipeline handles real coop format end-to-end | ~half-day, 1 H100 | this plan |
| 5. Plan-first cooperbench (distilled) | Full research workflow + compaction-cross-effect | ~1 day, teacher rollouts + 1 H100 | follow-up |
Tiers 1–3 are deferred because the same bugs surface in tier 4, just less crisply. Tier 4 is the smallest test on the actual workload.
Every coop trajectory in the training data must satisfy:
send_message tool calls.
Each agent sends a plan; receives the other’s plan via INBOX;
sends an acknowledgment / counter-proposal as needed.cooperbench_repo/path/foo.py piece), you do Y.”bash on their assigned piece
— no overlap, no re-discussion unless the plan needs revising.This is structural enough to verify programmatically and substantive enough that the model has to attend to multi-turn history (the assignment is in turn 1, the bash command is in turn 3+).
Why templated rather than distilled (for this tier): pipeline-correctness is the goal, not plan content quality. A programmatic generator gives deterministic, free, debuggable data and isolates the pipeline-correctness signal from teacher-model variance.
cooperbench eval task descriptions, file paths, and feature splits —
so the generator’s plans reference plausible artifacts, not random strings.scripts/gen_plan_first_coop_data.py)For each task in the 23-task held-in pool, emit ~10 templated coop trajectories. Each trajectory looks like:
turn 1 agent_1.send_message → agent_2: "Plan: I'll handle <file_a>, you handle <file_b>"
turn 2 agent_2.send_message → agent_1: "Acknowledged — I'll do <file_b>"
turn 3 agent_1.bash → cd repo && cat <file_a> (real path)
turn 4 agent_1.bash → sed -i ... <file_a> (real plausible edit)
turn 3' agent_2.bash → cd repo && cat <file_b> (real path)
turn 4' agent_2.bash → sed -i ... <file_b> (real plausible edit)
...
turn N agent_*.bash → pytest (or git diff > submission)
The generator emits two JSONL files per trajectory (one for agent_1, one
for agent_2) using the production schema:
{input_ids, output_ids, metadata}. metadata.source = "templated-plan-first"
and metadata.task_id matches the cooperbench task. Tokenization uses
Qwen/Qwen3.5-9B with apply_chat_template(..., add_generation_prompt=True)
— the same path the real rollout would have used.
Volume: 23 tasks × ~10 trajectories × ~6 assistant turns × 2 agents ≈ 2 760 TITO pairs. Same order of magnitude as the existing 9B SFT data.
send_message tool-call XML format must match what cooperbench’s
actions_toolcall.py parses at inference; otherwise we’d train on a
format the inference engine would reject.messages[:i] must include the other agent’s
sent messages as INBOX blocks, formatted identically to live coop runs.
This is the part most likely to drift — cross-check against
coopertrain/communication/strategies/silent_monitor.py.| Knob | Value | Reasoning |
|---|---|---|
| Base model | Qwen/Qwen3.5-9B |
Same base used for the 9B coop baseline; fits a single H100 with FSDP shard size 1. |
| Strategy | Full FT (no LoRA) | We want a clean SFT signal; LoRA could mute behavior changes at this scale. |
| Train tokens | ~2 760 pairs × ~512 tokens ≈ 1.4M tokens | Small. Training is bounded by the experiment’s validity, not by token count. |
| Epochs | 2 | Behavioral SFT typically converges within 1–2 epochs on this scale. |
| LR | 1e-5 | Standard SFT LR for Qwen 7B+ class. Bumpable to 2e-5 if loss is flat at epoch 1. |
| Batch (global) | 16 | 1 H100 with FSDP. Adjust if OOM at full sequence length. |
| Eval freq | 50 steps | Track val loss, but real signal is the rollout eval below — loss isn’t the metric we care about. |
| Hardware | 1 × H100 80GB | ~1–2 hours wall time. No multi-node needed for 9B FT. |
Config landing as coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml —
copy of the production 9B SFT config with batch / GPU / data path overridden.
Run the trained checkpoint on a held-out subset of the 23 cooperbench tasks (5 tasks held out from the data generation step), at K=1, step_limit=100. For each rollout pair, parse the saved trajectory and compute three metrics:
| Metric | Definition | Baseline (Qwen3.5-9B base) | Pass threshold |
|---|---|---|---|
| Plan-first rate | Fraction of rollout pairs where both agents’ first action is
send_message AND each receives an INBOX from the other before
issuing any bash. |
~10–25% (ad-hoc; depends on prompt) | ≥ 70% |
| Plan content quality | Keyword presence in turn-1 send_message content: contains ≥ 2 of
{plan, split, you, I’ll, first, step} AND references at least one
real file path or function name from the task. |
~30% | ≥ 70% |
| Follow-through rate | Did agent_1 actually do what it said? Extract entities (file paths, function names) from
agent_1’s turn-1 send_message; check if at least one is touched in
agent_1’s bash commands by turn 5+. Same for agent_2 (mutatis mutandis).
Score = fraction of agents (across pairs) who follow through. |
~40% | ≥ 60% |
Each isolates a different pipeline concern; their failure modes decompose:
input_ids history reaching the model, or in per-turn loss
masking.New: scripts/eval_coop_behavior.py — takes a --run-dir
(directory of saved trajectories) and emits a JSON with the three metrics, plus per-task and
per-pair breakdowns. Reusable for any future behavior-injection experiment. ~150 LOC.
Rollout itself uses the existing run_coop_pass_at_k.py machinery against the
trained model served on Modal. Modal config: drop-in copy of
coopertrain/serve/configs/qwen3-5-9b.yaml with the checkpoint path overridden.
| Path | LOC | Purpose |
|---|---|---|
scripts/gen_plan_first_coop_data.py | ~200 | Templated trajectory generator → per-agent JSONL. |
scripts/eval_coop_behavior.py | ~150 | Behavioral eval over a directory of trajectories. |
coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml | ~30 | Training config (copy of 9B SFT with overrides). |
coopertrain/serve/configs/qwen35_9b_plan_first.yaml | ~10 | Modal serve config for the trained checkpoint. |
tests/integration/test_plan_first_data.py | ~80 | Unit tests on the data generator (schema, token counts, tool-call format). |
tests/integration/test_behavior_eval.py | ~80 | Unit tests on the behavioral eval (synthetic trajectories → expected metrics). |
Total ~550 LOC across 6 files. No changes to existing pipeline code (PR #29 is the load-bearing change).
input_ids + output_ids all decode to plausible coop trajectories
(spot-checked on 10 random samples).| Stage | Wall time | Compute |
|---|---|---|
| Data generation | ~30 min | local CPU (no GPU needed) |
| Training (9B FT, 2 epochs) | ~1.5 hr | 1 × H100 (~$3) |
| Modal serve (idle + 5 held-out tasks) | ~1 hr | 1 × H100 (~$2) |
| Held-out rollout (K=1, 5 tasks) | ~30 min | (uses Modal endpoint) |
| Behavioral eval + report | ~30 min | local |
Total: ~4 hours wall, ~$5 cloud spend.
tier4-plan-first-cooperbench ← mainreport/2026-05-10-plan-first-cooperbench-plan.htmlscripts/gen_plan_first_coop_data.py — templated coop trajectory generator (390 LOC)scripts/eval_coop_behavior.py — behavioral eval over a run-dir of trajectories (378 LOC)scripts/run_plan_first_stage3.sh — end-to-end runbook (data → train → serve → rollout → eval)coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml — 1xH100 SFT configcoopertrain/serve/configs/qwen35_9b_plan_first.yaml — Modal serve config (override VLLM_MODEL_NAME with the trained checkpoint)coopertrain/configs/task_pools/cooperdata_tasks.json — 27-task pool discovered from CooperData branches ∩ team inventorytests/integration/test_plan_first_data.py + test_behavior_eval.py — 17 tests, all greenbash scripts/run_plan_first_stage3.sh on a 1xH100 host (uv-synced with --extra verl)2026-MM-DD-plan-first-cooperbench-results.html alongside this plan — both discoverable from the auto-generated report/index.html| Risk | Likelihood | Mitigation |
|---|---|---|
| Templated plans look unnatural → model overfits to the template surface form | medium | Vary phrasing across trajectories (parametrized template); spot-check by reading 10 samples; if too uniform, escalate to tier 5 (distillation). |
| Held-out tasks are too similar to training tasks → metrics inflated by leakage | medium | Hold out by repo not task: never train on any task from the held-out repos. |
| Base 9B already plans-first sometimes → small absolute lift is hard to read | low | Run a baseline eval first on the same 5 held-out tasks; report deltas, not absolutes. |
| Tool-call XML format drift between training data and inference | medium | Generator imports the same parser used at inference (actions_toolcall.py) and round-trips one example before emitting all rows. |
| Modal endpoint latency → 5-task rollout takes hours | low | Use 9B model (small autoscale latency); 5 tasks × 2 agents × ~100 turns ≈ 1000 calls; well under an hour at concurrency=10. |