SFT + TITO distillation for plan-first coop agents — teacher-bounded follow-through

Tiers 4–5 of the TITO pipeline validation hierarchy — plan + results, merged. Plan authored 2026-05-10; results re-run 2026-05-11 (v3/v4), 2026-05-13 (v5-tf), tier 5 distillation 2026-05-15. Branch tier4-plan-first-cooperbench. The original 2026-05-10 experiment plan is preserved as an appendix at the end of this document.

Experiment summary — RQs, setting, full results

Research questions

  1. Can an SFT-trained small model reliably plan-first in a coop multi-agent setting? "Plan-first" = the agent’s first action must be a send_message proposing a file split, before any bash exploration.
  2. Does the TITO (Token-In Token-Out) training path actually work end-to-end? From an inference engine’s captured (input_ids, output_ids) pairs straight into TitoSFTDataset, with zero re-tokenisation.
  3. Does the plan-first behaviour transfer through distillation from a stronger teacher, and how much is bounded by the teacher’s own capability?

Experiment setting

Full results — every iteration on the held-out pairs

| # | Date | Iteration | plan_first | plan_content | follow_through | Result |
|---|------|-----------|------------|--------------|----------------|--------|
| 1 | 05-10 | First attempt (templated, monolingual) | 0% | 0% | 0% | FAIL — agent ignored plan-first, went straight to bash |
| 2 | 05-11 | Re-run with fixes | 0% | 0% | 0% | FAIL — same OOD problem |
| 3 | 05-11 | Fix-#2 (parser / template alignment) | 0% | 0% | 0% | FAIL — adapter format misalignment masked the real issue |
| 4 | 05-11 | OOD diagnosis (synthetic first-turn eval) | 100% (20/20) | 100% (20/20) | n/a | Diagnostic: behaviour learned, but only inside the training prompt distribution |
| 5 | 05-11 | Real-task training v2 (asymmetric data) | 0% | 0% | 0% | FAIL — student trained only on one role |
| 6 | 05-11 | Symmetric real-task v3 | 100% | 0% | 0% | Partial: planning fires; plan body generic, no real file paths |
| 7 | 05-11 | Symmetric + runtime tool format v4 | 50% | 0% | 0% | Partial regression — tool-call format change destabilised the plan-first turn |
| 8 | 05-13 | v5-tf — task-derived file paths in the templated plan | 100% (2/2) | 100% (2/2) | 75% (3/4) | PASS — all three thresholds |
| 9 | 05-15 | Tier 5 Phase A — re-tokenised TITO from Gemini 3 Pro | 100% (2/2) | 100% (2/2) | 100% (4/4) | PASS — validates the TITO training path |
| 10 | 05-15 | Tier 5 Phase B v3 — native capture from Qwen3-8B teacher (13% follow-through) | 100% | 100% | 0% | Partial — planning transfers; follow-through is teacher-bounded |
| 11 | 05-15 | Tier 5 Phase B v4c — native capture from Qwen3.5-27B teacher (95.7%), Qwen3.5-2B student | 100% | 100% | 50% | Partial — confirms the teacher bound in the other direction: a stronger same-tokenizer teacher does break the follow-through ceiling |

Teacher capability ↔ student follow-through (the headline finding)

| Teacher | Teacher follow_through | Student follow_through |
|---------|------------------------|------------------------|
| Gemini 3 Pro (prompted) | 82.6% | 100% (Phase A) |
| Qwen3-8B (prompted) | 13% | 0% (Phase B v3) |
| Qwen3.5-27B dense (prompted) | 95.7% | 50% (Phase B v4c) |

plan_first and plan_content clear 100% in all three Tier-5 runs. The split is entirely on follow_through, and it tracks the teacher’s own follow-through in both directions — distillation is teacher-bounded, and the metric that needs the hardest behaviour (actually editing the file you named) is where a weak teacher’s ceiling shows.

What this PR ships

What 50% leaves on the table (paths to 70%+)

Three of the four Phase B v4c held-out agents hit ContextWindowExceeded / LimitsExceeded at steps 88–100, including both follow-through misses. Concrete levers, cheapest first:

  1. Bump max_steps 100 → 150–200 and max_model_len 16384 → 32768 — no retrain, plausibly 50 → 75%.
  2. 40–60 more Qwen3.5-27B rollouts → more work-turn signal to distil.
  3. Bigger student (Qwen3.5-7B / 14B) — out of scope for "pipeline validation" but a real lever for a future tier.

Tier 5 — distillation & the TITO pipeline — 2026-05-15 (Phase A: PASS, Phase B: PARTIAL)

Tier 5 closes the gap tier 4 left open. v5-tf (below) validated the SFT pipeline but not the TITO path: it used VeRL’s chat-template SFT reader, never TitoSFTDataset. Tier 5 distils a teacher’s coop rollouts into {input_ids, output_ids} parquet and trains through TitoSFTDataset — the path the plan doc actually specced. Phase A (re-tokenised teacher text, Gemini 3 Pro teacher) clears all three metrics at 100 / 100 / 100. Phase B (native token capture) lands 100 / 100 / 0 with a weak teacher (Qwen3-8B, 13% follow-through ceiling) and 100 / 100 / 50 with a stronger same-tokenizer teacher (Qwen3.5-27B dense, 95.7% follow-through) — native capture transfers planning behaviour cleanly, and follow-through tracks the teacher’s ceiling, not the capture method.

Teacher validation — does any teacher even pass the 3 metrics?

Distillation can only transfer a behaviour the teacher actually exhibits, so the teacher dataset was scored before training on it. Two findings, both confirming the worry that “even a frontier model doesn’t really plan-first”:

| Teacher | plan_first | plan_content | follow_through | Note |
|---------|------------|--------------|----------------|------|
| Gemini 3 Pro — unprompted (default coop.yaml) | 50% | 0% | 50% | Coordinates on one of two held-out pairs; never uses the templated phrasing. A frontier model does not reliably plan-first on its own. |
| Gemini 3 Pro — prompted (coop_plan_first_prompted.yaml) | 100% | 100% | 100% | A mandatory-first-action protocol section steers it cleanly. 23/23 training-pool pairs scored 100/100; follow_through 82.6% (used for rejection sampling). |
| Qwen3-8B — prompted (Phase B v3 teacher) | 100% | 100% | 13% | Plans-first reliably; barely follows through — short trajectories (≈3.4 turns), plans then stalls. Distillation is bounded by teacher capability. |
| Qwen3.5-27B dense — prompted (Phase B v4c teacher) | 95.7% | 95.7% | 95.7% | The same-tokenizer Qwen family at scale: same chat template as the Qwen3.5-2B student (vocab 248k), so native-captured token ids drop straight into TitoSFTDataset with no re-tokenisation, and follow-through is actually there to distil. 22/23 train pairs scored 100/100/100; used for v4c rejection sampling. |

The steering prompt is applied to the teacher only. Training data has the “Plan-First Coordination Protocol” section stripped back to the default coop.yaml prompt before tokenisation, so the student learns to plan-first under the prompt distribution it will actually see at inference — not a prompt-following shortcut.
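A minimal sketch of that stripping step (the section marker and helper name here are illustrative, not the repo's actual code):

```python
import re

# Hypothetical marker for the steering section appended to the teacher's
# system prompt; the real header text lives in coop_plan_first_prompted.yaml.
PROTOCOL_HEADER = "## Plan-First Coordination Protocol"

def strip_steering(system_prompt: str) -> str:
    """Return the default-coop.yaml form of a steered system prompt."""
    before, sep, after = system_prompt.partition(PROTOCOL_HEADER)
    if not sep:
        return system_prompt          # already unsteered
    nxt = re.search(r"\n#+ ", after)  # resume at the next heading, if any
    return before + (after[nxt.start() + 1:] if nxt else "")
```

Training rows are built from the stripped prompt, so the tokenised prompt distribution matches what the student will see at inference.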

Eval fixes that landed here (apply to every model that uses native tool calls)

The first Gemini probe scored 0/0/0 — an eval artefact, not the model. scripts/eval_coop_behavior.py only read send_message out of the assistant content string; litellm-native models (Gemini, Qwen) put the call in tool_calls.function.arguments. Three coupled fixes landed.

Re-scored apples-to-apples, v5-tf is unchanged (100/100/75) — the LoRA writes XML into content, which the old eval already saw.

Phase A — re-tokenised TITO distillation from Gemini 3 Pro (PASS)

| Metric | v5-tf (chat-format SFT, baseline) | Phase A v4 (TITO distillation) | Threshold |
|--------|-----------------------------------|--------------------------------|-----------|
| plan_first_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| follow_through_rate | 75% (3/4) | 100% (4/4) | ≥ 60% |

Pipeline. 23 prompted-Gemini coop rollouts on the training pool → rejection-sample to the 19 pairs that scored 100/100/100 → strip the steering prompt → rewrite Gemini’s structured tool_calls into the Qwen inline <tool_call><function=bash> XML the student emits → re-tokenise each turn with the Qwen3-4B chat template → {input_ids, output_ids} parquet → TitoSFTDataset → Qwen3-4B + LoRA on 2×H100. Gemini’s tokenizer differs from Qwen’s, so the teacher text must be re-tokenised with the student’s tokenizer — that is what makes this “Phase A” rather than native capture.
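The rewrite step in isolation: Gemini's structured tool calls become the inline XML the student emits, before re-tokenisation with the student's chat template. A sketch assuming OpenAI-style tool_calls dicts; the helper name is illustrative:

```python
import json

def tool_calls_to_inline_xml(assistant_msg: dict) -> str:
    """Fold structured tool_calls into the Qwen inline <tool_call> XML form."""
    parts = [assistant_msg.get("content") or ""]
    for call in assistant_msg.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        parts.append(
            "<tool_call><function=bash><parameter=command>\n"
            f"{args['command']}\n"
            "</parameter></function></tool_call>"
        )
    return "\n".join(p for p in parts if p)
```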

What it took to get there — small-set TITO distillation is hyperparameter-sensitive:

| Iter | Change | Result |
|------|--------|--------|
| v1 | no turn-upweight, LR 5e-6 | Student never stops thinking. Per-turn TITO expansion buries the plan-first turn: of 571 rows only 38 are turn-0. |
| v2 | upweight turns 0–2 ×12, LR 2e-5 | 100 / 100 / 25. XML format locked in; the heavily-duplicated planning turns starved the bash-work turns, so follow_through stayed low. |
| v3 | upweight ×4 (re-balance planning:work) | 0 / 0 / 0. Re-balancing under-emphasised the output-format signal — the student regressed to markdown ```bash blocks the coop agent loop can't parse. |
| v4 | v2 data (×12, format locked) trained the full 4 epochs | 100 / 100 / 100 at step 200. The extra epochs gave the (un-upweighted) work turns enough exposure to lift follow_through 25% → 100%, without touching the format. |

Takeaway. The TITO path itself is correct — TitoSFTDataset consumes {input_ids, output_ids} parquet and the trained student exhibits the distilled behaviour on held-out tasks. The sensitivity is a data-shape property: per-turn expansion dilutes the first-turn signal ~15× vs. per-trajectory, and the output-format token sequence and the behaviour both need enough (upweighted) exposure or they don’t stick.
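The ×12 turn-upweight used in v2/v4 can be plain row duplication over the per-turn parquet, sketched here assuming a turn_idx column on each TITO row:

```python
import pandas as pd

def upweight_early_turns(df: pd.DataFrame, factor: int = 12,
                         max_turn: int = 2) -> pd.DataFrame:
    """Duplicate rows for assistant turns <= max_turn `factor` times.

    Per-turn expansion makes each parquet row one assistant turn, so
    duplication is a literal loss upweight on the planning turns.
    """
    early = df[df["turn_idx"] <= max_turn]
    late = df[df["turn_idx"] > max_turn]
    out = pd.concat([late] + [early] * factor, ignore_index=True)
    return out.sample(frac=1.0, random_state=0).reset_index(drop=True)
```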

Phase B — native token capture from a vLLM-served teacher (PARTIAL)

Phase A re-tokenises, which means the training ids are not exactly what the teacher emitted. Phase B is the canonical TITO promise: capture the exact prompt_token_ids + output_token_ids the inference engine returned (capture_token_ids: true → extra_body={"return_token_ids": true} → token_capture block on each assistant message → coopertrain/verl/tito_capture.py). Teacher: Qwen3-8B on Modal vLLM — a larger sibling of the 4B student, sharing the Qwen3 tokenizer, so captured ids are trainable on the student with zero re-tokenisation.

Mechanically validated. The serve returns token_ids per request; tito_capture.py extracts one (input_ids, output_ids) pair per assistant turn with skipped_no_capture=0; the parquet trains through TitoSFTDataset. Every hop of the native-capture path works.
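From the client side, one captured turn looks roughly like this (the response field names are assumptions; tito_capture.py is the authoritative reader):

```python
from openai import OpenAI

client = OpenAI(base_url="https://<modal-vllm-serve>/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "<coop history up to this turn>"}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=messages,
    extra_body={"return_token_ids": True},  # ask the engine to echo exact ids
)
choice = resp.choices[0]
# The token_capture block stored on the assistant message pairs the exact ids
# the engine consumed and emitted; no client-side tokenizer is involved.
token_capture = {
    "input_ids": resp.prompt_token_ids,   # assumed extra field on the response
    "output_ids": choice.token_ids,       # assumed extra field on the choice
}
```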

The planning behaviour transfers; follow-through tracks the teacher. Phase B v3 (Qwen3-8B teacher) lands at 100 / 100 / 0; Phase B v4 swaps to a stronger same-tokenizer teacher (Qwen3.5-27B dense) and lifts follow-through to 50%:

| Iter | Change | Result |
|------|--------|--------|
| v1 | native capture as-is | Student thinks 3000+ chars then emits a corrupted <command> tag — native capture faithfully grabbed Qwen3-8B's verbose <think> trace, and distilling that into a small student teaches the verbosity, drowning the action-format signal. |
| v2 | --strip-think: slice the <think>…</think> span out of the native id list (token-level, no re-tokenisation) | No more verbose thinking, but the <tool_call> wrapper isn't reliably learned from ~640 rows — format-unstable, the coop agent loop parses nothing, 0/0/0. |
| v3 | --strip-think + upweight ×12 (1240 rows) | 100 / 100 / 0. The ×12 upweight locks the <tool_call> format the same way it did for Phase A v4 — both held-out agents now emit clean, file-rich plan proposals (7 plan-keyword hits, real task paths). follow_through stays at 0%: the Qwen3-8B teacher only followed through 13% of the time, so there is almost no work-turn signal to distil. |
| v4c | Swap teacher to Qwen3.5-27B dense (same Qwen3.5 tokenizer as the new Qwen3.5-2B student); rebuild _strip_think_span for the Qwen3.5 chat template (add_generation_prompt auto-emits <think>\n\n</think> in the prompt, so the captured output starts with reasoning content rather than <think>); 2173 rows, ×12 upweight, 2 epochs | 100 / 100 / 50. Plan-first + plan-content stay at 100% with full keyword hits and real task paths; follow_through climbs from 0% to 50% (2/4 agents). The two non-followers each hit ContextWindowExceeded / LimitsExceeded mid-execution — a 2B-student step/context-budget limit, not a distillation failure. The Qwen3.5-27B teacher passed the three metrics at 95.7 / 95.7 / 95.7 on the train pairs, so the work-turn signal is finally there to distil. |

The instructive contrast. All three runs nail plan_first and plan_content at 100% — planning behaviour distils cleanly through either re-tokenised or natively captured TITO. They split entirely on follow_through, and the split tracks teacher capability in both directions: Gemini 3 Pro 82.6% → student 100% (Phase A); Qwen3-8B 13% → student 0% (Phase B v3); Qwen3.5-27B 95.7% → student 50% (Phase B v4c). Distillation is bounded by the teacher, and follow_through — the metric that needs the hardest behaviour (actually editing the files you named) — is where a weak teacher’s ceiling shows.

What v1–v4c cost to get there. Native capture is honest to a fault: v1 captured Qwen3-8B’s verbose <think> trace verbatim and the student drowned in it; v2 stripped the think span but ≈640 rows couldn’t lock the <tool_call> format; v3 needed the same ×12 turn-upweight as Phase A v4 to make the format stick. v4 then exposed a chat-template assumption: the Qwen3 strip-think rule (ids[0] == <think>) didn’t fire for Qwen3.5 because add_generation_prompt auto-emits <think>\n\n</think> in the prompt, so the captured output starts with reasoning content rather than <think>; v4b silently trained on full reasoning + a stray </think> and produced 0/0/0; v4c rebuilt the strip to slice up to the first </think> regardless of the opening token, covering both shapes. The recurring lesson across all four iterations: small-set TITO distillation is data-shape-sensitive, and every assumption about the token sequence — chat-template formatting included — needs to be verified per tokenizer.
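The v4c strip rule in isolation, as a token-level sketch (model id as named in this report; the real code lives behind --strip-think):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")  # student, per this report
END_THINK = tok.convert_tokens_to_ids("</think>")

def strip_think_span(output_ids: list[int]) -> list[int]:
    """Slice up to and including the first </think>, whatever the opening shape.

    Handles both captures: Qwen3 outputs that open with a <think> token, and
    Qwen3.5 outputs whose prompt already ended with <think> (emitted by
    add_generation_prompt), so the capture starts mid-reasoning and carries
    only the closing </think>.
    """
    if END_THINK in output_ids:
        return output_ids[output_ids.index(END_THINK) + 1:]
    return output_ids
```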

What follow-through 50% leaves on the table. Three of the four held-out agents hit ContextWindowExceeded or LimitsExceeded at steps 88–100 (max=100) with max_model_len=16384 — mid-execution, not at the plan stage. Concrete levers to try, in order of effort/value:

  1. Bump the budget: raise max_steps 100 → 150–200 and max_model_len 16384 → 32768. No retrain; directly addresses the three limit-hit agents. Plausibly takes v4c from 50% to 75%.
  2. More + more diverse teacher rollouts: v4c trained on 22 passing pairs from one rejection-sampled pool; another 40–60 Qwen3.5-27B rollouts give the student more work-turn signal to distil from. Hours of teacher compute, plus a re-extract / retrain cycle.
  3. Bigger student: a Qwen3.5-7B/14B student will execute longer chains before stalling. Out of scope for “tier 5 validates the pipeline” (student size has been held constant for the cross-phase comparison), but a real lever for a future tier.

What tier 5 establishes

  1. The TITO training path is correct. Phase A trains through TitoSFTDataset on {input_ids, output_ids} parquet and the student passes 100/100/100 on held-out cooperbench tasks — the validation tier 4 / v5-tf did not actually perform.
  2. Native token capture transfers the behaviour. Phase B — return_token_ids → token_capture → tito_capture.py → parquet → TitoSFTDataset, zero re-tokenisation — lands plan_first and plan_content at 100%. The native path is not just mechanically sound; the distilled student exhibits the captured behaviour.
  3. Distillation is bounded by the teacher, and it shows up in follow_through. All three runs agree on plan_first + plan_content at 100% and split entirely on the third metric — tracking the teachers’ own follow-through (Gemini 82.6% → 100%, Qwen3-8B 13% → 0%, Qwen3.5-27B 95.7% → 50%), not the capture method. The Qwen3.5-27B run confirms the relationship in the opposite direction from Phase B v3: a stronger same-tokenizer teacher does break through the follow_through ceiling.
  4. The teacher must be steered, and the steering must not leak. No teacher tried here plans-first reliably unprompted; the prompt fixes that, and stripping it from the training prompts keeps the student learning the behaviour rather than the prompt.
Tier 5 artefacts

Task-derived file paths — 2026-05-13 (v5-tf, PASS)

All three thresholds clear on the held-out pairs. plan_first 100% (2/2), plan_content 100% (2/2), follow_through 75% (3/4 agents). The "next steps #2" called out in the v4 section below — pull plan file paths from the task's actual repo, not the synthetic _FILE_PAIRS list — turns out to be the load-bearing fix; v4's format work was directionally right but on its own didn't bridge the prompt-distribution gap.

| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | v5-tf (+task-derived paths) | Threshold |
|--------|-----------------|----------------|---------------------------|------------------------------|-----------|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 0% | 0% | 0% | 100% (2/2) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | 75% (3/4) | ≥ 60% |

The v5-tf change in one sentence

For each task, scripts/gen_plan_first_coop_data_real_tasks.py now extracts actual file paths from the two feature.md bodies using the eval's own regex, and samples (file_a, file_b) from those — so the plan body, the bash target, and the prompt context all reference the same set of paths. Falls back to the legacy _FILE_PAIRS list only when a task description exposes fewer than two eval-extractable paths.
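In sketch form (regex shape and fallback handling assumed; the real code is _extract_task_files + select_file_pair in the generator):

```python
import random
import re

_PATH_RE = re.compile(r"\b[\w./-]+\.\w+\b")  # assumed shape of the eval's regex

def extract_task_files(feature_md: str) -> list[str]:
    """Pull eval-extractable file paths out of a feature.md body."""
    return sorted(set(_PATH_RE.findall(feature_md)))

def select_file_pair(feature_a_md: str, feature_b_md: str,
                     fallback_pairs: list[tuple[str, str]]) -> tuple[str, str]:
    paths = set(extract_task_files(feature_a_md) + extract_task_files(feature_b_md))
    if len(paths) < 2:
        # Legacy behaviour: fall back to the synthetic _FILE_PAIRS list.
        return random.choice(fallback_pairs)
    file_a, file_b = random.sample(sorted(paths), 2)
    return file_a, file_b
```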

Why this matters — concretely

The eval's plan_content check requires each agent's first send_message body to have ≥ 2 plan keywords and ≥ 1 eval-regex file path. follow_through requires the agent's bash to touch a file from plan_files = a_files ∪ b_files.
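In predicate form, as a sketch; the keyword set follows the plan's §6 definition and the real eval's regex may differ:

```python
import re

PLAN_KEYWORDS = {"plan", "split", "you", "i'll", "first", "step"}  # plan §6 set
PATH_RE = re.compile(r"\b[\w./-]+\.\w+\b")  # assumed shape of the eval regex

def plan_content_ok(first_send_message_body: str) -> bool:
    text = first_send_message_body.lower()
    hits = sum(kw in text for kw in PLAN_KEYWORDS)
    return hits >= 2 and PATH_RE.search(first_send_message_body) is not None

def follow_through_ok(bash_commands: list[str], plan_files: set[str]) -> bool:
    # plan_files = a_files | b_files, the union from both agents' plan turns
    return any(path in cmd for cmd in bash_commands for path in plan_files)
```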

In v3/v4 the training data drew (file_a, file_b) from a hardcoded list (src/cli.py, flask/json/__init__.py, ...) that was uncorrelated with the actual task. The model learned two patterns at once: "name the trained paths" and "name paths visible in the prompt context." At inference, agent1 typically generalized to prompt-derived paths (e.g. dspy/clients/cache.py) but agent2 fell back to short acknowledgments with no path at all — collapsing plan_content to 0%. And the bash steps referenced the trained paths, not the plan paths, so follow_through was structurally pinned at 0%.

With task-derived paths in training, the model has a single consistent pattern: paths come from the task. Both agents follow it; plan paths = bash paths.

Engineering details of this run

| Knob | Value | Note |
|------|-------|------|
| Data generator | gen_plan_first_coop_data_real_tasks.py (v5) | Adds _extract_task_files + select_file_pair. |
| Task pool | cooperdata_tasks_v5.json (30 tasks, 12 repos) | Discovered from the live HF dataset via scripts/build_task_pool_from_dataset.py; the legacy pool was built against an older snapshot whose repo names no longer match cooperbench prepare output, so every entry was being skipped with "missing feature pair". |
| Held-out repos | pallets_click_task, dspy_task | Same two repos as the v3/v4 eval pairs. |
| Trajectories | 2 300 (50 / task × 23 training tasks × 2 agents) | 2 070 train / 230 val parquet rows after the 10% val split. |
| Training | 2×H100 FSDP, 516 steps | Final val/loss 0.0405 (vs v4's 0.30 at step 100 — ≈ 7× lower). The single-pattern data converges crisply. |
| Adapter | /ckpts/plan-first-real-v5-tf/peft/lora_adapter | 132 MB safetensors; served as model id plan-first-v5tf. |
| Held-out eval | 2 pairs, K=1, step_limit=100 | Both rollouts hit the agent loop's 100-step ceiling; the eval scores the first ~2 turns of behavior regardless. |

Per-pair detail

| Pair | plan_first | plan_content | follow_through (a / b) | plan_files (union) |
|------|------------|--------------|-------------------------|---------------------|
| pallets_click_task/2068/f1_f2 | ✓ | ✓ | ✓ / ✓ | src/click/_termui_impl.py, src/click/termui.py |
| dspy_task/8394/f1_f2 | ✓ | ✓ | ✓ / ✗ | dspy/clients/__init__.py, dspy/clients/cache.py, jinja2/sandbox.py, tests/test_sandbox.py |

The one miss is dspy agent2: it produced a valid file-rich plan but its bash steps never touched any of the union plan_files. The plan union for dspy includes two real dspy paths and two legacy fallback paths (jinja2/sandbox.py, tests/test_sandbox.py) — that's the model mixing a task-derived plan with a fallback-influenced ack, which the eval's union check papers over for agent1 but leaves agent2 stranded when its bash uses different paths again. Adding a responder-form variant to the data generator (turn 1 = echo waiting, inbox arrives, turn 2 = ack that echoes both file paths) is the natural follow-up if the bar moves higher than this experiment's 60% threshold.

Two operational footguns hit during the run

  1. vLLM hot-reload silently no-ops on adapter-path change. POST /v1/load_lora_adapter for an already-loaded lora_name returns Success but doesn't actually swap the underlying path — the server keeps serving whatever was loaded first. /v1/models exposes the old root path. The first eval pass on v5-tf returned 0/0/0 because it was scoring v3 rollouts (the prior adapter was still active under the plan-first name). Workaround: load the new adapter under a fresh lora_name (plan-first-v5tf) and re-target -m openai/plan-first-v5tf at the cooperbench CLI. The runbook's hot-reload section needs an "unload+restart" alternative for the in-place-path-change case — or the convention of incrementing the lora_name on every retrain.
  2. cooperbench 0.0.8's execute_coop crashes on mixed-type message timestamps. sent_msgs.sort(key=lambda x: x.get("timestamp") or 0) at cooperbench/runner/coop.py:148 blows up with TypeError: '<' not supported between instances of 'int' and 'str' when one agent reports a numeric timestamp and the other a string one. The crash fires before agent{fid}_traj.json is written, so the eval never sees the trajectories even though the rollouts ran to completion. Patched locally with a best-effort float(ts), falling back to 0.0 (sketch below). Worth upstreaming.
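The local patch for footgun #2, sketched (the real call site is cooperbench/runner/coop.py:148):

```python
def _ts(value) -> float:
    """Coerce mixed int/str timestamps to float; fall back to 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

# Replacing the original `or 0` sort key at the call site:
sent_msgs = [{"timestamp": 3}, {"timestamp": "1.5"}, {"timestamp": None}]
sent_msgs.sort(key=lambda m: _ts(m.get("timestamp")))  # no TypeError on mixed types
```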

Next steps

  1. Expand the held-out eval to n=6+ pairs to tighten the confidence interval. With n=2 the metric resolution is 50% steps; a single bad sample is the difference between 75% and 50% on follow_through.
  2. Responder-form data variant for the one outstanding gap (dspy agent2 follow_through). The proposer-form-only training works for both agents on most repos; the responder-form would close the remaining drift when the runtime delivers an INBOX before the agent's first action.
  3. Adopt the lora_name-versioning convention so a hot-swap on an existing adapter name isn't a silent no-op. Embed the ckpt revision into the served name (plan-first-v5tf, plan-first-v6, ...) and have the agent config pick it up.
  4. Upstream the cooperbench timestamp fix.

Reproducing

# On a 1xH100 host with modal authed and uv synced:
bash scripts/run_plan_first_v5.sh

# Or step-by-step (Modal handles data + train + merge; local handles rollouts + eval):
modal run scripts/modal_plan_first_train.py \
    --steps "data_v5,train,merge" \
    --n-per-task 50 \
    --ckpt-dir /ckpts/plan-first-real-v5-tf \
    --peft-dir /ckpts/plan-first-real-v5-tf/peft \
    --held-out-repos "pallets_click_task,dspy_task"

curl -X POST $ENDPOINT/v1/load_lora_adapter \
    -d '{"lora_name":"plan-first-v5tf","lora_path":"/ckpts/plan-first-real-v5-tf/peft/lora_adapter"}'

# Rollout + eval (see scripts/run_plan_first_v5.sh for the cooperbench invocation)

Symmetric real-task training — 2026-05-11 (v3 and v4)

v3 (symmetric data) jumps plan_first from 0% → 100% (2/2). Both held-out pairs now have agent1 and agent2 each emitting send_message at turn 1 with the partner receiving the inbox before any real bash. plan_content and follow_through still 0% — covered below.

| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | Threshold |
|--------|-----------------|----------------|---------------------------|-----------|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | ≥ 70% |
| plan_content_rate | 0% | 0% (per-agent file refs missing) | 0% (per-agent file refs missing) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | ≥ 60% |

What changed between iterations

What we know about plan_content and follow_through

Next steps

  1. Longer v4 training (200–300+ steps) to see if the runtime-format data fully overrides the base model's preference for Chinese thinking + JSON tool calls on certain tasks.
  2. Stronger plan-content supervision: regenerate data so every send_message body has concrete file paths from the task's actual repo (e.g. pull from combined.patch filenames), not just the synthetic _FILE_PAIRS list.
  3. Tier-5 Gemini-Flash distillation under TITO — the canonical path that sidesteps every train/runtime drift by construction.

Artifacts: v3 data data/sft/plan_first_real_v3/combined.jsonl (symmetric trajectories); v3 checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v3/global_step_100 (merged adapter at /peft/lora_adapter, currently loaded on the live serve); v4 data data/sft/plan_first_real_v4/combined.jsonl (+ runtime tool format); v4 checkpoint :/ckpts/plan-first-real-v4/global_step_100 (merged adapter available but not the active deploy); metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-{v3,v4}.json; trajectories logs/plan-first-eval-real-{v3,v4}/coop/.../f1_f2/.

Real-task training — 2026-05-11 (later)

Held-out metrics still 0/0/0, but the gap moved. Trained from scratch on 2,100 trajectories whose task message is the actual cooperbench feature.md rendered through the same coopertrain/agents/mini_swe_agent/config/coop.yaml instance template the eval uses at inference. Training converged at step 100 (val/loss 0.28; full schedule was 472 steps but we stopped early as planned).

What the v2 (real-task) rollout actually showed

Adapter + eval format alignment landed alongside this

Three loose ends got tightened while debugging.

Recommended next steps to actually pass the threshold

  1. Symmetric plan-first data: change the generator so agent2's first turn is also a send_message (the ack), not a preliminary echo inbox check. This makes the trained behavior pattern match the eval's mutual-exchange criterion. Cheapest fix.
  2. Wider task coverage: the current 21 training tasks generalize to some held-out tasks but not all. Add more pool entries, or augment with task perturbations so the model doesn't memorize task identity.
  3. Distill from Gemini-Flash plan-first rollouts: still the canonical TITO path. Real rollouts handle the role-asymmetry naturally and the prompt-distribution match is by construction.

Artifacts: real-task data data/sft/plan_first_real_v2/combined.jsonl (2,100 rows, avg user prompt 6,552 chars vs 7,239 in the live rollout — basically identical); checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v2/global_step_100; adapter /ckpts/plan-first-real-v2/peft/lora_adapter (132 MB); metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-v2.json; trajectories logs/plan-first-eval-real-v2/coop/.../f1_f2/; v2 train app ap-EB4NGuYCCq7Ks1cMLGzryU (stopped after step 200 save attempt).

OOD diagnosis confirmed — 2026-05-11 (synthetic first-turn eval)

Diagnosis confirmed: 100% on synthetic prompts vs 0% on cooperbench prompts. Re-ran the plan_first / plan_content predicates from scripts/eval_coop_behavior.py on 20 first-turn responses to training-distribution prompts (system + templated user task from data/sft/plan_first/combined.jsonl). Result: plan_first 100% (20/20), plan_content 100% (20/20). Every sample produced a pure send_message turn with ≥2 plan keywords and ≥1 file path. The LoRA does have the behavior; it just doesn't generalize from synthetic templated tasks to real PR-description tasks. See scripts/eval_first_turn_synthetic.py and metrics-synthetic-first-turn.json.

| Metric | Synthetic prompts (this test) | Cooperbench held-out (v3 rollouts) | Threshold |
|--------|-------------------------------|-------------------------------------|-----------|
| plan_first_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |
| plan_content_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |

The gap between the two columns is the failure mode. The LoRA is fine; the SFT data distribution doesn't cover what the eval feeds the model. Next step: regenerate SFT data on top of cooperbench's actual task template (or distill from Gemini-Flash rollouts under TITO). Either makes the train / inference prompt distributions match by construction.

Fix-#2 attempt — 2026-05-11 (later)

Status: FAIL on held-out (0/0/0), but the diagnosis pinned the actual cause. Tearing apart the response pipeline showed the model is trained correctly — it produces <tool_call><function=bash><parameter=command>send_message agent2… on every training-distribution prompt (10/10 sampled). The 0/0/0 on pallets_click_task:2068 and dspy_task:8394 is a distribution-shift problem on the task prompt, not a format-mismatch problem.

What I changed under the “fix-#2” umbrella

The original fix-#2 framing (“regenerate SFT data so its rendered chat-template output matches inference”) turned out to be wrong: I verified by running one row of combined.jsonl through AutoTokenizer.apply_chat_template and the assistant <tool_call> XML survives the template verbatim. So I went hunting for the real divergence and made three serve / adapter changes along the way:

  1. Removed --reasoning-parser qwen3 from the vLLM serve. The parser captures everything before </think> into the response’s reasoning field. The trained model never emits </think> (training data has no thinking tags), so the entire output — including the <tool_call> XML — was being routed into reasoning while content came back null and tool_calls=[]. Confirmed by reading the raw response object on the first attempt’s serve.
  2. Removed chat_template_kwargs.enable_thinking=false from the agent config. That flag injects a literal <think>\n\n</think>\n\n after <|im_start|>assistant\n, which is OOD vs training (the trained assistant turns start with prose directly). With the flag set, the model fell back to echoing the markdown bash example from the system prompt — the “markdown bash blocks” output observed in the v2 smoke test. Default mode (no flag) renders the same prompt tail as training and the model emits the correct XML.
  3. Bridged the model output to the agent loop in LitellmModel. Two issues stacked on top of each other here; the fix adds two helpers (_extract_tool_calls_from_content, _rewrite_xml_tool_calls_to_markdown + _heredoc_to_quoted_send_message) so that with disable_tools=True, LitellmModel.query rewrites the assistant content from the XML+heredoc form into the markdown+quoted form the agent loop already speaks. Verified end-to-end on training-style prompts: 10/10 produce parseable actions.

The actual failure

On the two real held-out coop tasks the model returns no <tool_call> blocks at all — instead it emits a short Chinese-then-English narration followed by literal text like “[Makes bash tool call with {"command": "ls -la"} as arguments]”. The system prompt is identical to training (same templated phrasing) but the task message is the full cooperbench PR description (50+ lines, with embedded markdown code blocks, “Solution” sections, type annotations). The synthetic training data uses 3–5-line tasks. The LoRA never generalized from short templated tasks to long PR-style tasks — it goes off-distribution and stops producing tool calls.

This is consistent with the val/loss curve: 2.36 → 0.098 over 424 steps. It memorized the templated distribution very well. It did not learn an invariant “respond with <tool_call> + send_message regardless of task shape.”

Why TITO would have caught this differently

Under TITO we’d have captured the exact token stream from a real Gemini-Flash plan-first rollout against a cooperbench task — so the training prompt distribution is the inference prompt distribution by construction. The current synthetic-data path optimizes a different distribution than the eval is sampling from. That’s the single biggest lesson from this attempt.

Recommended next steps

  1. Regenerate SFT data wrapping synthetic plan-first content inside cooperbench’s actual task-prompt template. Pull a few real cooperbench tasks, replay the exact system + user prompts the eval will send, and only inject the plan-first assistant trajectory on top. Same training cost, matching distribution.
  2. Tier-5 distillation from Gemini-Flash plan-first rollouts on cooperbench tasks. Real rollouts — no synthetic gap. This is the canonical TITO path. Cost goes up (rollout time + Gemini API) but the format-and-distribution problem disappears.
  3. Cheaper experiment first: re-run held-out eval against the same synthetic task templates (i.e., feed the model the training-style task message instead of the real cooperbench PR). If metrics jump to passing, that confirms the diagnosis with zero retraining cost.

Artifact pointers for this attempt: serve ap-ED8tmOxyyYDBlMGvhdE7in (redeployed without --reasoning-parser); agent config coopertrain/configs/coop_plan_first.yaml (no enable_thinking flag); adapter helpers in coopertrain/agents/mini_swe_agent/models/litellm_model.py; rollouts logs/plan-first-eval-v3/; metrics report/2026-05-10-plan-first-cooperbench-results/metrics-v3.json.

Re-run with fixes — 2026-05-11

Status: SECOND ATTEMPT ALSO FAILED, NEW FAILURE MODE. Plan-first rate = 0%, plan content = 0%, follow-through = 0% (n=2, same held-out pairs as the first attempt). The training fix worked — the LoRA now produces visibly different output from base and emits send_message — but in the wrong surface form: markdown ```bash ... ``` code blocks instead of the <tool_call> XML that vLLM’s qwen3_coder parser extracts into tool_calls. mini-swe-agent sees an empty tool_calls array and rejects every turn with “No tool calls found in the response.”

| Metric | First attempt | Re-run | Threshold |
|--------|---------------|--------|-----------|
| plan_first_rate | 0% | 0% | ≥ 70% |
| plan_content_rate | 0% | 0% | ≥ 70% |
| follow_through_rate | 0% | 0% | ≥ 60% |
| final SFT val/loss | ~1.9 (step 42) | 0.098 (step 424) | — |
| LoRA ≠ base on probe | no (identical) | yes | — |
| LoRA emits send_message | no | yes | — |
| LoRA emits <tool_call> XML | no | no | — |

What changed in the re-run

What the re-run revealed

The training itself clearly succeeded this time — val/loss fell steadily across the 4 epochs (2.36 → 0.61 → 0.16 → 0.10), and the LoRA output is visibly different from the base model. The smoke test (§4 of the runbook) passed 2/3 conditions:

IDENTICAL: False                # ✓ LoRA learned something
LoRA HAS <tool_call>: False     # ✗ but not the XML format
LoRA HAS send_message: True     # ✓ learned to call the right tool

Sample LoRA output (greedy decode on a training prompt’s system+user prefix):

```bash
send_message --wait agent2 <<'MSG'
What files are you planning to edit?
MSG
```

That is the correct intent (a send_message to the partner agent), but mini-swe-agent dispatches off the response’s tool_calls field, which vLLM only populates when it sees literal <tool_call><function=bash>…</function></tool_call> XML in the generated content. Markdown bash blocks do not parse. Every turn in both held-out pairs hit “No tool calls found in the response” for all 100 steps before LimitsExceeded.

Diagnosis — why the model learned the wrong format

The training data combined.jsonl contains <tool_call><function=bash>…</function></tool_call> in the assistant message text. But the Qwen3 chat template at SFT time appears to have rewritten that content during rendering — either via tool-call extraction into a structured field, or via the messages_key=messages path in verl's MultiTurnSFTDataset — so the tokenized assistant turn that the model actually trained against was the rendered form, not the literal XML. The rendered form turned out to be markdown bash, so that’s what the model learned to emit. At inference, vLLM’s qwen3_coder tool parser only knows how to parse XML back out, so the loop never closes.

Recommended next steps

  1. Cheapest fix: change mini-swe-agent’s adapter to also accept markdown bash blocks (not just tool_calls). The action content is unambiguous — one bash block per turn. This makes the current LoRA usable as-is, no retraining needed.
  2. Format-correctness fix: regenerate SFT data with the actual rendered chat-template output as the assistant content, so training data matches what the model will produce at inference. This requires running each combined.jsonl entry through the tokenizer’s chat template once, capturing the rendered assistant turns, and writing those back (sketched after this list).
  3. Tier 5 (distillation): abandon the templated-data approach and distill behavior from Gemini-Flash plan-first rollouts. The format-rendering problem disappears because Gemini’s real rollouts emit whatever format mini-swe-agent already accepts.
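A sketch of option 2: render each trajectory once through the chat template and write the rendered assistant turns back. The helper name is illustrative; this assumes the Qwen template renders the full conversation as a strict extension of the generation-prompt prefix, and trailing end-of-turn markers may need per-template trimming:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

def rerender_assistant_turns(messages: list[dict]) -> list[dict]:
    """Replace each assistant turn with the text the chat template renders."""
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            out.append(msg)
            continue
        full = tok.apply_chat_template(messages[: i + 1], tokenize=False)
        head = tok.apply_chat_template(messages[:i], tokenize=False,
                                       add_generation_prompt=True)
        # full extends head, so the slice is the rendered assistant turn.
        rendered = full[len(head):].removesuffix("\n").removesuffix("<|im_end|>")
        out.append({"role": "assistant", "content": rendered})
    return out
```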

Artifact pointers for the re-run: final checkpoint plan-first-checkpoints:/plan-first/global_step_424; merged PEFT adapter plan-first-checkpoints:/plan-first/peft/lora_adapter (132 MB); metrics JSON report/2026-05-10-plan-first-cooperbench-results/metrics-v2.json; trajectories logs/plan-first-eval/coop/{pallets_click_task,dspy_task}/.../f1_f2/; training app ap-ry4fWdYJbCTYVkWyz2N6ue (stopped 2026-05-11 01:44:51 UTC).

First attempt — 2026-05-10

Status: PIPELINE WORKS, BEHAVIOR DID NOT TRANSFER. All three behavioral metrics scored 0/0/0 on held-out cooperbench rollouts. The eval correctly identified the failure mode (“missing assistant turns”), and the root cause is a training↔inference format mismatch, not a bug in the data, training, serve, or eval scripts. Details in §4. Pipeline-correctness checkpoint is partial: data + train + serve + eval all worked; the model itself is producing no tool calls at inference.

1. TL;DR

Result. Plan-first rate = 0%, plan content = 0%, follow-through = 0% (n=2 completed coop pairs on held-out repos pallets_click_task and dspy_task). Manual probes against the served LoRA confirm: identical token-for-token output between plan-first and the bare Qwen/Qwen3-4B base on the same prompt.

What this tells us about the pipeline. Six of the seven pipeline stages (data gen, parquet, FSDP train, FSDP→PEFT merge, vLLM hot-load, behavioral eval) are end-to-end correct — verified by file-level checks at each handoff. The failure is concentrated in stage 6, inference behavior: the inference prompt format does not match the training prompt format, so even though the LoRA weights are non-zero and the adapter loads, the model sees an out-of-distribution prompt at inference and falls back to base behavior.

2. Setup — what we actually ran

| Component | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-4B (plan called for 9B; see deviation note below) |
| Strategy | LoRA rank=32, alpha=16, target_modules=all-linear, 252 lora_A + 252 lora_B tensors (all 36 layers × 7 modules) |
| Train data | 342 train + 38 val plan-first templated trajectories on the cooperdata_tasks.json 19-task held-in pool, 10 per task |
| Train compute | 2 × H100 80GB (FSDP), 42 steps total, 2 epochs, ~4 min wall |
| Train final loss | train 3.5 → 2.54, val 2.68 (clear downward signal — training did learn something) |
| Adapter checkpoint | Modal vol plan-first-checkpoints:/plan-first/peft/lora_adapter/ (132 MB) |
| Adapter merge | verl 0.7.1 model_merger CLI, with a monkey-patch for a known LoRA task_type bug (peft ≥ 0.13 returns it as str, verl casts .value) |
| Serve | vLLM 0.19 on Modal H100 (cooperbench--qwen3-4b-plan-first-serve.modal.run/v1), --enable-lora --max-lora-rank 32 --lora-modules plan-first=/ckpts/plan-first/peft/lora_adapter, --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 |
| Held-out eval tasks | pallets_click_task/2068 + dspy_task/8394 (f1_f2 of each, 2 pairs total) |
| Rollout | local docker via cooperbench's --backend docker, mini-swe-agent, step_limit=100 |
| Behavioral eval | scripts/eval_coop_behavior.py unchanged from stage 2 |
Deviations from the plan — and why

3. Metrics

Same three metrics defined in the plan §6. Pass thresholds from the plan: plan-first ≥ 70%, plan content ≥ 70%, follow-through ≥ 60%.

| Metric | Definition (short) | Score | Threshold | Verdict |
|--------|--------------------|-------|-----------|---------|
| Plan-first rate | Both agents send_message + receive INBOX before any real bash | 0% (0/2) | ≥ 70% | FAIL (no asst turns) |
| Plan content | Turn-1 send_message has plan keywords + file path reference | 0% (0/2) | ≥ 70% | FAIL (no asst turns) |
| Follow-through | Agent touches the file(s) it claimed in its plan turn | 0% (0/4 agents) | ≥ 60% | FAIL (no asst turns) |

Per the eval script’s reason field, both pairs failed with "missing assistant turns": 0 messages with role=assistant in the saved trajectory, out of 101–103 total messages per agent. The agent’s 100 LLM calls all returned empty tool_calls, triggering the FormatError retry loop until step_limit=100 fired and the run ended with status=LimitsExceeded.

Per-pair detail (raw eval output)
{"summary":{"n_pairs":2,"plan_first_rate":0.0,"plan_content_rate":0.0,"follow_through_rate":0.0,
            "n_plan_first":0,"n_plan_content":0,"n_follow_through_agents":0},
 "per_pair":[
   {"pair_id":"dspy_task/8394/f1_f2","plan_first":false,"reason":"missing assistant turns"},
   {"pair_id":"pallets_click_task/2068/f1_f2","plan_first":false,"reason":"missing assistant turns"}
 ]}

4. Diagnosis — why 0/0/0

The plan’s §6.1 decomposition says “plan-first low ⇒ model didn’t learn the temporal pattern.” That’s the right ballpark, but the actual failure is sharper: the model produces no tool calls at all, plan-first or otherwise. Walking the pipeline back to find where it broke:

| # | Stage | Check | Result |
|---|-------|-------|--------|
| 1 | Data generation | 342 trajectories × ~15 messages, all decode to valid coop chat format with <tool_call><function=bash><parameter=command>send_message...</parameter></function></tool_call> in assistant content. 17 unit tests green. | PASS |
| 2 | Parquet conversion | Per-trajectory expansion via prepare_verl_data.py --mode sft; messages column matches what MultiTurnSFTDataset expects. | PASS |
| 3 | FSDP training | 2 epochs, loss 3.5 → 2.54, val 2.68, no NaNs, no OOM. 42 steps total. | PASS |
| 4 | FSDP → PEFT | 252 lora_A + 252 lora_B tensors saved. Manual byte-level check: the first lora_B[0] has 163662/163840 non-zero bytes (i.e., the adapter is not all zeros — training did move it). adapter_config.json has task_type=CAUSAL_LM, r=32, target_modules=[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. | PASS |
| 5 | vLLM serve | vLLM logs "Loaded new LoRA adapter: name 'plan-first', path '/ckpts/plan-first/peft/lora_adapter'". /v1/models shows plan-first as a child of Qwen/Qwen3-4B with root pointing at the adapter dir. Routing works. | PASS |
| 6 | Inference behavior | Identical token-for-token output between plan-first and Qwen/Qwen3-4B on the same prompt (e.g. completion of "Before I start editing, let me coordinate with agent2 so we" yields word-for-word the same 80 tokens). Model produces conversational text, never <tool_call> XML. | FAIL |
| 7 | Behavioral eval | Correctly emits reason="missing assistant turns"; metrics decompose cleanly to 0/0/0; per-task breakdown intact. | PASS |

The actual root cause

Stage 6 narrows to one of two not-mutually-exclusive hypotheses:

  1. Training↔inference format mismatch (most likely). Training data assistant turns look like:
    <assistant> "Before I start editing, let me coordinate with agent2..."
                <tool_call><function=bash><parameter=command>
                send_message agent2 <<'MSG' ... MSG
                </parameter></function></tool_call>
    
    The system prompt during training is the bare COOP_SYSTEM_PROMPT — no tools field on any message. At inference, mini-swe-agent passes tools=[BASH_TOOL] + tool_choice="auto" to litellm, which forwards them to vLLM’s OpenAI-compatible endpoint. The Qwen3 chat template injects tool descriptions into the system message when tools are provided. So the model sees a different system prompt at inference than it ever saw during training. The LoRA delta (rank 32, 42 steps, weak by construction) is not large enough to override the base model’s “tools ⇒ ask clarifying questions, no tool calls” prior on this OOD prompt.
  2. LoRA capacity / training budget too small. LoRA rank 32 with 42 steps on 342 trajectories gives ~24 mini-batches/epoch × 2 epochs ≈ 48 forward+backward passes. For a behavior as specific as “always emit send_message in turn 1,” this may be under the threshold needed to overpower base behavior, regardless of prompt format. The non-zero lora_B values show the adapter did learn something; just not enough to dominate at decoding.

The first explanation is the load-bearing one: a manual completion-API probe (raw /v1/completions with no tools, no chat template) on the prefix “Before I start editing, let me coordinate with agent2 so we” still returns identical text to base. So even without the chat-template tools injection, the LoRA doesn’t move the next-token distribution noticeably on this prefix. That points to (2) being non-trivial too — 42 SFT steps is genuinely low.

5. What the pipeline did validate

The plan doc's §1 listed five things this experiment should validate. Four are validated by file-level passes across data gen, parquet, training, serve, and eval. The fifth (multi-turn behavior emergence) is what failed:

| Plan claim | Validated? | Evidence |
|------------|------------|----------|
| Multi-turn data round-trips through tokenizer | yes | Train loss decreased from 3.5 to 2.54; if tokenization were broken loss would be flat or NaN. |
| Loss masking fires only on assistant tokens | yes | verl MultiTurnSFTDataset handles this; the train loss curve confirms gradients are flowing. |
| Tool-call XML round-trips through tokenization | yes | (structural — XML is plain text; no special tokens involved) |
| INBOX blocks format identically training ↔ inference | yes | Generator imports the same string-format helpers (_tool_response) used by the live coop runner; spot-checked on 10 random samples. |
| Cross-agent consistency emerges (agent_2 reads agent_1's plan) | no | Cannot test — the model never plans in the first place at inference. |

6. Recommended next steps

  1. Fix the format mismatch first. Either:
  2. Scale training budget. 342 trajectories × 2 epochs is not enough to shift LoRA behavior decisively. Recommend: 10× the data (~3.5k trajectories) and/or 4–5 epochs. Same compute envelope, ~$15 instead of ~$3.
  3. Add a sanity probe before scaling rollouts. A 30-second smoke that hits the deployed endpoint with one training-style prompt and asserts “plan-first output ≠ base output” would catch this regression class without burning 100 LLM calls per held-out task (see the sketch after this list).
  4. Don’t merge PR #30 yet. The infrastructure (data gen, training driver, merge script, serve, eval) is all reusable for the next attempt; the report itself documents the failure path. Both should land. But the experiment hasn’t demonstrated what tier 4 was supposed to demonstrate (multi-turn behavior emerging end-to-end), so the “tier 4 complete” claim should remain open until a re-run with one of the fixes in (1) actually moves the needle on at least one of the three metrics.
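The sanity probe from next-step #3, sketched; the endpoint and model ids follow the setup table above, and the probe prefix is the one used in the stage-6 diagnosis:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://cooperbench--qwen3-4b-plan-first-serve.modal.run/v1",
    api_key="EMPTY",
)

def probe(model_id: str, prompt: str) -> str:
    """Greedy completion so base-vs-adapter comparison is deterministic."""
    resp = client.completions.create(model=model_id, prompt=prompt,
                                     max_tokens=80, temperature=0.0)
    return resp.choices[0].text

prompt = "Before I start editing, let me coordinate with agent2 so we"
base = probe("Qwen/Qwen3-4B", prompt)
lora = probe("plan-first", prompt)
assert lora != base, "LoRA output identical to base: adapter inactive or undertrained"
```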

7. Files landed on this branch

| File | Purpose |
|------|---------|
| scripts/modal_plan_first_merge.py | FSDP-sharded LoRA → PEFT adapter on a Modal volume (with monkey-patch for the verl 0.7.1 LoRA task_type bug). |
| coopertrain/serve/vllm_modal_plan_first.py | Modal vLLM serve: base Qwen3-4B + hot-loaded plan-first LoRA, qwen3_coder tool parser, qwen3 reasoning parser. |
| coopertrain/configs/coop_plan_first.yaml | Agent config pointing at the Modal serve endpoint, with the enable_thinking=false chat-template override. |
| report/2026-05-10-plan-first-cooperbench-results.html | This document. |
| report/2026-05-10-plan-first-cooperbench-results/metrics.json | Raw eval output; per-pair / per-task breakdown. |
| pyproject.toml (modified) | tensordict pin bumped to >=0.8,<0.11 for verl 0.7.1 compatibility (was >=0.5,<0.7, stale). |

Appendix: original experiment plan — 2026-05-10

This is the plan doc that originally framed the experiment, preserved verbatim. The thresholds, metric definitions, and rollout-stage breakdown here are the bar the results above (Tiers 4 v5-tf and Tier 5 Phases A/B) were measured against. Stage 2 was complete and Stage 3 pending at the time of writing; both have since completed.

1. Why this experiment

Question: if we inject a multi-turn agentic behavior into the SFT data — specifically, “agents discuss a plan via send_message before any bash, then each does their assigned piece” — does the trained model actually exhibit that behavior in real cooperbench coop rollouts? If yes, the whole pipeline (rollout-time TITO capture → JSONL → parquet → TitoSFTDataset → trainer → checkpoint → rollout under coop) is end-to-end correct on the workload that matters.

Why plan-first specifically: it’s the smallest behavior that exercises every concern of the pipeline simultaneously — multi-turn history, loss masking, tool-call boundary preservation, INBOX block formatting, cross-agent coordination. A surface signature or even running-sum accumulator would catch a strict subset.

What this validates

What this does not validate

2. Test hierarchy — where this fits

| Tier | What it proves | Cost | Status |
|------|----------------|------|--------|
| 1. Running-sum smoke (synthetic) | TITO data → trainer → multi-turn behavior preserved at inference | ~10 min, 1 GPU | deferred |
| 2. Per-K degradation curve | No silent truncation across turn depth | (free, same run as 1) | deferred |
| 3. Running-sum under compaction | TITO capture beats reconstruction (the PR #29 promise) | ~30 min, 1 GPU | deferred |
| 4. Plan-first cooperbench (templated) | Pipeline handles real coop format end-to-end | ~half-day, 1 H100 | this plan |
| 5. Plan-first cooperbench (distilled) | Full research workflow + compaction cross-effect | ~1 day, teacher rollouts + 1 H100 | follow-up |

Tiers 1–3 are deferred because the same bugs surface in tier 4, just less crisply. Tier 4 is the smallest test on the actual workload.

3. Behavior under test

Every coop trajectory in the training data must satisfy:

  1. Turns 1–2 (one round-trip per agent) are exclusively send_message tool calls. Each agent sends a plan; receives the other’s plan via INBOX; sends an acknowledgment / counter-proposal as needed.
  2. The plan divides labor. “I’ll do X (the cooperbench_repo/path/foo.py piece), you do Y.”
  3. From turn 3 onward each agent uses bash on their assigned piece — no overlap, no re-discussion unless the plan needs revising.

This is structural enough to verify programmatically and substantive enough that the model has to attend to multi-turn history (the assignment is in turn 1, the bash command is in turn 3+).

4. Data generation: templated synthesis

Why templated rather than distilled (for this tier): pipeline-correctness is the goal, not plan content quality. A programmatic generator gives deterministic, free, debuggable data and isolates the pipeline-correctness signal from teacher-model variance.

Inputs

Generator (scripts/gen_plan_first_coop_data.py)

For each task in the 23-task held-in pool, emit ~10 templated coop trajectories. Each trajectory looks like:

turn 1  agent_1.send_message  → agent_2: "Plan: I'll handle <file_a>, you handle <file_b>"
turn 2  agent_2.send_message  → agent_1: "Acknowledged — I'll do <file_b>"
turn 3  agent_1.bash          → cd repo && cat <file_a>       (real path)
turn 4  agent_1.bash          → sed -i ... <file_a>          (real plausible edit)
turn 3' agent_2.bash          → cd repo && cat <file_b>       (real path)
turn 4' agent_2.bash          → sed -i ... <file_b>          (real plausible edit)
...
turn N  agent_*.bash          → pytest                       (or git diff > submission)

The generator emits two JSONL files per trajectory (one for agent_1, one for agent_2) using the production schema: {input_ids, output_ids, metadata}. metadata.source = "templated-plan-first" and metadata.task_id matches the cooperbench task. Tokenization uses Qwen/Qwen3.5-9B with apply_chat_template(..., add_generation_prompt=True) — the same path the real rollout would have used.

Volume: 23 tasks × ~10 trajectories × ~6 assistant turns × 2 agents ≈ 2 760 TITO pairs. Same order of magnitude as the existing 9B SFT data.
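How one templated trajectory might become per-agent TITO pairs under this schema; a sketch, since the real generator's helpers may differ:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")  # model id as used in this plan

def emit_tito_pairs(agent_messages: list[dict], task_id: str) -> list[dict]:
    """One {input_ids, output_ids, metadata} row per assistant turn."""
    rows = []
    for i, msg in enumerate(agent_messages):
        if msg["role"] != "assistant":
            continue
        rows.append({
            # Prompt: full history before this turn plus the generation prompt,
            # rendered exactly as the real rollout would have rendered it.
            "input_ids": tok.apply_chat_template(
                agent_messages[:i], add_generation_prompt=True, tokenize=True),
            # Target: the assistant turn itself, with the end-of-turn token
            # appended so generation learns to stop (Qwen-specific token).
            "output_ids": tok(msg["content"], add_special_tokens=False)["input_ids"]
                          + [tok.convert_tokens_to_ids("<|im_end|>")],
            "metadata": {"source": "templated-plan-first", "task_id": task_id},
        })
    return rows
```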

Edge cases the generator must handle

5. Training

| Knob | Value | Reasoning |
|------|-------|-----------|
| Base model | Qwen/Qwen3.5-9B | Same base used for the 9B coop baseline; fits a single H100 with FSDP shard size 1. |
| Strategy | Full FT (no LoRA) | We want a clean SFT signal; LoRA could mute behavior changes at this scale. |
| Train tokens | ~2 760 pairs × ~512 tokens ≈ 1.4M tokens | Small. Training is bounded by the experiment's validity, not by token count. |
| Epochs | 2 | Behavioral SFT typically converges within 1–2 epochs at this scale. |
| LR | 1e-5 | Standard SFT LR for the Qwen 7B+ class. Bumpable to 2e-5 if loss is flat at epoch 1. |
| Batch (global) | 16 | 1 H100 with FSDP. Adjust if OOM at full sequence length. |
| Eval freq | 50 steps | Track val loss, but the real signal is the rollout eval below — loss isn't the metric we care about. |
| Hardware | 1 × H100 80GB | ~1–2 hours wall time. No multi-node needed for 9B FT. |

Config landing as coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml — copy of the production 9B SFT config with batch / GPU / data path overridden.

6. Behavioral evaluation

Run the trained checkpoint on a held-out subset of the 23 cooperbench tasks (5 tasks held out from the data generation step), at K=1, step_limit=100. For each rollout pair, parse the saved trajectory and compute three metrics:

| Metric | Definition | Baseline (Qwen3.5-9B base) | Pass threshold |
|--------|------------|-----------------------------|----------------|
| Plan-first rate | Fraction of rollout pairs where both agents' first action is send_message AND each receives an INBOX from the other before issuing any bash. | ~10–25% (ad hoc; depends on prompt) | ≥ 70% |
| Plan content quality | Keyword presence in the turn-1 send_message content: contains ≥ 2 of {plan, split, you, I'll, first, step} AND references at least one real file path or function name from the task. | ~30% | ≥ 70% |
| Follow-through rate | Did agent_1 actually do what it said? Extract entities (file paths, function names) from agent_1's turn-1 send_message; check whether at least one is touched in agent_1's bash commands by turn 5+. Same for agent_2 (mutatis mutandis). Score = fraction of agents (across pairs) who follow through. | ~40% | ≥ 60% |

Why three metrics

Each isolates a different pipeline concern, and their failure modes decompose cleanly.

Eval harness

New: scripts/eval_coop_behavior.py — takes a --run-dir (directory of saved trajectories) and emits a JSON with the three metrics, plus per-task and per-pair breakdowns. Reusable for any future behavior-injection experiment. ~150 LOC.

Rollout itself uses the existing run_coop_pass_at_k.py machinery against the trained model served on Modal. Modal config: drop-in copy of coopertrain/serve/configs/qwen3-5-9b.yaml with the checkpoint path overridden.

7. Files to be created

| Path | LOC | Purpose |
|------|-----|---------|
| scripts/gen_plan_first_coop_data.py | ~200 | Templated trajectory generator → per-agent JSONL. |
| scripts/eval_coop_behavior.py | ~150 | Behavioral eval over a directory of trajectories. |
| coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml | ~30 | Training config (copy of the 9B SFT config with overrides). |
| coopertrain/serve/configs/qwen35_9b_plan_first.yaml | ~10 | Modal serve config for the trained checkpoint. |
| tests/integration/test_plan_first_data.py | ~80 | Unit tests on the data generator (schema, token counts, tool-call format). |
| tests/integration/test_behavior_eval.py | ~80 | Unit tests on the behavioral eval (synthetic trajectories → expected metrics). |
Total ~550 LOC across 6 files. No changes to existing pipeline code (PR #29 is the load-bearing change).

8. Success criteria

  1. All 6 new files land with green CI (lint + unit tests).
  2. Templated data generator produces ~2 760 valid TITO pairs whose input_ids + output_ids all decode to plausible coop trajectories (spot-checked on 10 random samples).
  3. Training run converges (val loss decreasing, no NaNs, no OOM).
  4. Held-out cooperbench rollout produces measurable behavior change vs the base model on all three metrics:
    • Plan-first rate: ≥ 70%
    • Plan content quality: ≥ 70%
    • Follow-through rate: ≥ 60%
  5. If any metric fails, the failure mode points at a concrete pipeline bug (per §6 decomposition); diagnose & report.

9. Cost & timeline

| Stage | Wall time | Compute |
|-------|-----------|---------|
| Data generation | ~30 min | local CPU (no GPU needed) |
| Training (9B FT, 2 epochs) | ~1.5 hr | 1 × H100 (~$3) |
| Modal serve (idle + 5 held-out tasks) | ~1 hr | 1 × H100 (~$2) |
| Held-out rollout (K=1, 5 tasks) | ~30 min | (uses Modal endpoint) |
| Behavioral eval + report | ~30 min | local |

Total: ~4 hours wall, ~$5 cloud spend.

10. Rollout plan

Stage 1 — this PR (plan only): in progress

Stage 2 — implementation: complete

Stage 3 — experiment: awaiting hand-off

11. Risks & mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Templated plans look unnatural → model overfits to the template surface form | medium | Vary phrasing across trajectories (parametrized template); spot-check by reading 10 samples; if too uniform, escalate to tier 5 (distillation). |
| Held-out tasks are too similar to training tasks → metrics inflated by leakage | medium | Hold out by repo, not task: never train on any task from the held-out repos. |
| Base 9B already plans-first sometimes → small absolute lift is hard to read | low | Run a baseline eval first on the same 5 held-out tasks; report deltas, not absolutes. |
| Tool-call XML format drift between training data and inference | medium | Generator imports the same parser used at inference (actions_toolcall.py) and round-trips one example before emitting all rows. |
| Modal endpoint latency → 5-task rollout takes hours | low | Use the 9B model (small autoscale latency); 5 tasks × 2 agents × ~100 turns ≈ 1 000 calls; well under an hour at concurrency=10. |