SFT + TITO distillation for plan-first coop agents — teacher-bounded follow-through

Tiers 4–5 of the TITO pipeline validation hierarchy — plan + results, merged. Plan authored 2026-05-10; results re-run 2026-05-11 (v3/v4), 2026-05-13 (v5-tf), tier 5 distillation 2026-05-15. Branch tier4-plan-first-cooperbench. The original 2026-05-10 experiment plan is preserved as an appendix at the end of this document.

Experiment summary — RQs, setting, full results

Research questions

  1. Can an SFT-trained small model reliably plan-first in a coop multi-agent setting? "Plan-first" = the agent’s first action must be a send_message proposing a file split, before any bash exploration.
  2. Does the TITO (Token-In Token-Out) training path actually work end-to-end? From an inference engine’s captured (input_ids, output_ids) pairs straight into TitoSFTDataset, with zero re-tokenisation.
  3. Does the plan-first behaviour transfer through distillation from a stronger teacher, and how much is bounded by the teacher’s own capability?

Experiment setting

Full results — every iteration on the held-out pairs

| # | Date | Iteration | plan_first | plan_content | follow_through | Result |
|---|------|-----------|------------|--------------|----------------|--------|
| 1 | 05-10 | First attempt (templated, monolingual) | 0% | 0% | 0% | FAIL — agent ignored plan-first, went straight to bash |
| 2 | 05-11 | Re-run with fixes | 0% | 0% | 0% | FAIL — same OOD problem |
| 3 | 05-11 | Fix-#2 (parser / template alignment) | 0% | 0% | 0% | FAIL — adapter format misalignment masked the real issue |
| 4 | 05-11 | OOD diagnosis (synthetic first-turn eval) | 100% (20/20) | 100% (20/20) | n/a | Diagnostic: behaviour learned, but only inside the training prompt distribution |
| 5 | 05-11 | Real-task training v2 (asymmetric data) | 0% | 0% | 0% | FAIL — student trained only on one role |
| 6 | 05-11 | Symmetric real-task v3 | 100% | 0% | 0% | Partial: planning fires; plan body generic, no real file paths |
| 7 | 05-11 | Symmetric + runtime tool format v4 | 50% | 0% | 0% | Partial regression — tool-call format change destabilised the plan-first turn |
| 8 | 05-13 | v5-tf — task-derived file paths in the templated plan | 100% (2/2) | 100% (2/2) | 75% (3/4) | PASS — all three thresholds |
| 9 | 05-15 | Tier 5 Phase A — re-tokenised TITO from Gemini 3 Pro | 100% (2/2) | 100% (2/2) | 100% (4/4) | PASS — validates the TITO training path |
| 10 | 05-15 | Tier 5 Phase B v3 — native capture from Qwen3-8B teacher (13% follow-through) | 100% | 100% | 0% | Partial — planning transfers; follow-through is teacher-bounded |
| 11 | 05-15 | Tier 5 Phase B v4c — native capture from Qwen3.5-27B teacher (95.7%), Qwen3.5-2B student | 100% | 100% | 50% | Partial — confirms the teacher bound in the other direction: a stronger same-tokenizer teacher does break the follow-through ceiling |

Teacher capability ↔ student follow-through (the headline finding)

| Teacher | Teacher follow_through | Student follow_through |
|---------|------------------------|------------------------|
| Gemini 3 Pro (prompted) | 82.6% | 100% (Phase A) |
| Qwen3-8B (prompted) | 13% | 0% (Phase B v3) |
| Qwen3.5-27B dense (prompted) | 95.7% | 50% (Phase B v4c) |

plan_first and plan_content clear 100% in all three Tier-5 runs. The split is entirely on follow_through, and it tracks the teacher’s own follow-through in both directions — distillation is teacher-bounded, and the metric that needs the hardest behaviour (actually editing the file you named) is where a weak teacher’s ceiling shows.

What this PR ships

What 50% leaves on the table (paths to 70%+)

Three of the four Phase B v4c held-out agents hit ContextWindowExceeded / LimitsExceeded at steps 88–100, including both follow-through misses. Concrete levers, cheapest first:

  1. Bump max_steps 100 → 150–200 and max_model_len 16384 → 32768 — no retrain, plausibly 50 → 75%.
  2. 40–60 more Qwen3.5-27B rollouts → more work-turn signal to distil.
  3. Bigger student (Qwen3.5-7B / 14B) — out of scope for "pipeline validation" but a real lever for a future tier.

Tier 5 — distillation & the TITO pipeline — 2026-05-15 (Phase A: PASS, Phase B: PARTIAL)

Tier 5 closes the gap tier 4 left open. v5-tf (below) validated the SFT pipeline but not the TITO path: it used VeRL’s chat-template SFT reader, never TitoSFTDataset. Tier 5 distils a teacher’s coop rollouts into {input_ids, output_ids} parquet and trains through TitoSFTDataset — the path the plan doc actually specced. Phase A (re-tokenised teacher text, Gemini 3 Pro teacher) clears all three metrics at 100 / 100 / 100. Phase B (native token capture) lands 100 / 100 / 0 with a weak teacher (Qwen3-8B, 13% follow-through ceiling) and 100 / 100 / 50 with a stronger same-tokenizer teacher (Qwen3.5-27B dense, 95.7% follow-through) — native capture transfers planning behaviour cleanly, and follow-through tracks the teacher’s ceiling, not the capture method.

Teacher validation — does any teacher even pass the 3 metrics?

Distillation can only transfer a behaviour the teacher actually exhibits, so the teacher dataset was scored before training on it. Two findings, both confirming the worry that “even a frontier model doesn’t really plan-first”:

| Teacher | plan_first | plan_content | follow_through | Note |
|---------|------------|--------------|----------------|------|
| Gemini 3 Pro — unprompted (default coop.yaml) | 50% | 0% | 50% | Coordinates on one of two held-out pairs; never uses the templated phrasing. A frontier model does not reliably plan-first on its own. |
| Gemini 3 Pro — prompted (coop_plan_first_prompted.yaml) | 100% | 100% | 100% | A mandatory-first-action protocol section steers it cleanly. 23/23 training-pool pairs scored 100/100; follow_through 82.6% (used for rejection sampling). |
| Qwen3-8B — prompted (Phase B v3 teacher) | 100% | 100% | 13% | Plans-first reliably; barely follows through — short trajectories (≈3.4 turns), plans then stalls. Distillation is bounded by teacher capability. |
| Qwen3.5-27B dense — prompted (Phase B v4c teacher) | 95.7% | 95.7% | 95.7% | The same-tokenizer Qwen family at scale: same chat template as the Qwen3.5-2B student (vocab 248k), so native-captured token ids drop straight into TitoSFTDataset with no re-tokenisation, and follow-through is actually there to distil. 22/23 train pairs scored 100/100/100; used for v4c rejection sampling. |

The steering prompt is applied to the teacher only. Training data has the “Plan-First Coordination Protocol” section stripped back to the default coop.yaml prompt before tokenisation, so the student learns to plan-first under the prompt distribution it will actually see at inference — not a prompt-following shortcut.
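A minimal sketch of that stripping step (the section marker and helper name here are illustrative, not the repo's actual code):

```python
import re

# Hypothetical marker for the steering section appended to the teacher's
# system prompt; the real header text lives in coop_plan_first_prompted.yaml.
PROTOCOL_HEADER = "## Plan-First Coordination Protocol"

def strip_steering(system_prompt: str) -> str:
    """Return the default-coop.yaml form of a steered system prompt."""
    before, sep, after = system_prompt.partition(PROTOCOL_HEADER)
    if not sep:
        return system_prompt          # already unsteered
    nxt = re.search(r"\n#+ ", after)  # resume at the next heading, if any
    return before + (after[nxt.start() + 1:] if nxt else "")
```

Training rows are built from the stripped prompt, so the tokenised prompt distribution matches what the student will see at inference.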

Eval fixes that landed here (apply to every model that uses native tool calls)

The first Gemini probe scored 0/0/0 — an eval artefact, not the model. scripts/eval_coop_behavior.py only read send_message out of the assistant content string; litellm-native models (Gemini, Qwen) put the call in tool_calls.function.arguments. Three coupled fixes landed.

Re-scored apples-to-apples, v5-tf is unchanged (100/100/75) — the LoRA writes XML into content, which the old eval already saw.

Phase A — re-tokenised TITO distillation from Gemini 3 Pro (PASS)

| Metric | v5-tf (chat-format SFT, baseline) | Phase A v4 (TITO distillation) | Threshold |
|--------|-----------------------------------|--------------------------------|-----------|
| plan_first_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 100% (2/2) | 100% (2/2) | ≥ 70% |
| follow_through_rate | 75% (3/4) | 100% (4/4) | ≥ 60% |

Pipeline. 23 prompted-Gemini coop rollouts on the training pool → rejection-sample to the 19 pairs that scored 100/100/100 → strip the steering prompt → rewrite Gemini’s structured tool_calls into the Qwen inline <tool_call><function=bash> XML the student emits → re-tokenise each turn with the Qwen3-4B chat template → {input_ids, output_ids} parquet → TitoSFTDataset → Qwen3-4B + LoRA on 2×H100. Gemini’s tokenizer differs from Qwen’s, so the teacher text must be re-tokenised with the student’s tokenizer — that is what makes this “Phase A” rather than native capture.
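The rewrite step in isolation: Gemini's structured tool calls become the inline XML the student emits, before re-tokenisation with the student's chat template. A sketch assuming OpenAI-style tool_calls dicts; the helper name is illustrative:

```python
import json

def tool_calls_to_inline_xml(assistant_msg: dict) -> str:
    """Fold structured tool_calls into the Qwen inline <tool_call> XML form."""
    parts = [assistant_msg.get("content") or ""]
    for call in assistant_msg.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        parts.append(
            "<tool_call><function=bash><parameter=command>\n"
            f"{args['command']}\n"
            "</parameter></function></tool_call>"
        )
    return "\n".join(p for p in parts if p)
```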

What it took to get there — small-set TITO distillation is hyperparameter-sensitive:

| Iter | Change | Result |
|------|--------|--------|
| v1 | no turn-upweight, LR 5e-6 | Student never stops thinking. Per-turn TITO expansion buries the plan-first turn: of 571 rows only 38 are turn-0. |
| v2 | upweight turns 0–2 ×12, LR 2e-5 | 100 / 100 / 25. XML format locked in; the heavily-duplicated planning turns starved the bash-work turns, so follow_through stayed low. |
| v3 | upweight ×4 (re-balance planning:work) | 0 / 0 / 0. Re-balancing under-emphasised the output-format signal — the student regressed to markdown ```bash blocks the coop agent loop can't parse. |
| v4 | v2 data (×12, format locked) trained the full 4 epochs | 100 / 100 / 100 at step 200. The extra epochs gave the (un-upweighted) work turns enough exposure to lift follow_through 25% → 100%, without touching the format. |

Takeaway. The TITO path itself is correct — TitoSFTDataset consumes {input_ids, output_ids} parquet and the trained student exhibits the distilled behaviour on held-out tasks. The sensitivity is a data-shape property: per-turn expansion dilutes the first-turn signal ~15× vs. per-trajectory, and the output-format token sequence and the behaviour both need enough (upweighted) exposure or they don’t stick.
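The ×12 turn-upweight used in v2/v4 can be plain row duplication over the per-turn parquet, sketched here assuming a turn_idx column on each TITO row:

```python
import pandas as pd

def upweight_early_turns(df: pd.DataFrame, factor: int = 12,
                         max_turn: int = 2) -> pd.DataFrame:
    """Duplicate rows for assistant turns <= max_turn `factor` times.

    Per-turn expansion makes each parquet row one assistant turn, so
    duplication is a literal loss upweight on the planning turns.
    """
    early = df[df["turn_idx"] <= max_turn]
    late = df[df["turn_idx"] > max_turn]
    out = pd.concat([late] + [early] * factor, ignore_index=True)
    return out.sample(frac=1.0, random_state=0).reset_index(drop=True)
```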

Phase B — native token capture from a vLLM-served teacher (PARTIAL)

Phase A re-tokenises, which means the training ids are not exactly what the teacher emitted. Phase B is the canonical TITO promise: capture the exact prompt_token_ids + output_token_ids the inference engine returned (capture_token_ids: true → extra_body={"return_token_ids": true} → token_capture block on each assistant message → coopertrain/verl/tito_capture.py). Teacher: Qwen3-8B on Modal vLLM — a larger sibling of the 4B student, sharing the Qwen3 tokenizer, so captured ids are trainable on the student with zero re-tokenisation.

Mechanically validated. The serve returns token_ids per request; tito_capture.py extracts one (input_ids, output_ids) pair per assistant turn with skipped_no_capture=0; the parquet trains through TitoSFTDataset. Every hop of the native-capture path works.
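From the client side, one captured turn looks roughly like this (the response field names are assumptions; tito_capture.py is the authoritative reader):

```python
from openai import OpenAI

client = OpenAI(base_url="https://<modal-vllm-serve>/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "<coop history up to this turn>"}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=messages,
    extra_body={"return_token_ids": True},  # ask the engine to echo exact ids
)
choice = resp.choices[0]
# The token_capture block stored on the assistant message pairs the exact ids
# the engine consumed and emitted; no client-side tokenizer is involved.
token_capture = {
    "input_ids": resp.prompt_token_ids,   # assumed extra field on the response
    "output_ids": choice.token_ids,       # assumed extra field on the choice
}
```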

The planning behaviour transfers; follow-through tracks the teacher. Phase B v3 (Qwen3-8B teacher) lands at 100 / 100 / 0; Phase B v4 swaps to a stronger same-tokenizer teacher (Qwen3.5-27B dense) and lifts follow-through to 50%:

| Iter | Change | Result |
|------|--------|--------|
| v1 | native capture as-is | Student thinks 3000+ chars then emits a corrupted <command> tag — native capture faithfully grabbed Qwen3-8B's verbose <think> trace, and distilling that into a small student teaches the verbosity, drowning the action-format signal. |
| v2 | --strip-think: slice the <think>…</think> span out of the native id list (token-level, no re-tokenisation) | No more verbose thinking, but the <tool_call> wrapper isn't reliably learned from ~640 rows — format-unstable, the coop agent loop parses nothing, 0/0/0. |
| v3 | --strip-think + upweight ×12 (1240 rows) | 100 / 100 / 0. The ×12 upweight locks the <tool_call> format the same way it did for Phase A v4 — both held-out agents now emit clean, file-rich plan proposals (7 plan-keyword hits, real task paths). follow_through stays at 0%: the Qwen3-8B teacher only followed through 13% of the time, so there is almost no work-turn signal to distil. |
| v4c | Swap teacher to Qwen3.5-27B dense (same Qwen3.5 tokenizer as the new Qwen3.5-2B student); rebuild _strip_think_span for the Qwen3.5 chat template (add_generation_prompt auto-emits <think>\n\n</think> in the prompt, so the captured output starts with reasoning content rather than <think>); 2173 rows, ×12 upweight, 2 epochs | 100 / 100 / 50. Plan-first + plan-content stay at 100% with full keyword hits and real task paths; follow_through climbs from 0% to 50% (2/4 agents). The two non-followers each hit ContextWindowExceeded / LimitsExceeded mid-execution — a 2B-student step/context-budget limit, not a distillation failure. The Qwen3.5-27B teacher passed the three metrics at 95.7 / 95.7 / 95.7 on the train pairs, so the work-turn signal is finally there to distil. |

The instructive contrast. All three runs nail plan_first and plan_content at 100% — planning behaviour distils cleanly through either re-tokenised or natively captured TITO. They split entirely on follow_through, and the split tracks teacher capability in both directions: Gemini 3 Pro 82.6% → student 100% (Phase A); Qwen3-8B 13% → student 0% (Phase B v3); Qwen3.5-27B 95.7% → student 50% (Phase B v4c). Distillation is bounded by the teacher, and follow_through — the metric that needs the hardest behaviour (actually editing the files you named) — is where a weak teacher’s ceiling shows.

What v1–v4c cost to get there. Native capture is honest to a fault: v1 captured Qwen3-8B’s verbose <think> trace verbatim and the student drowned in it; v2 stripped the think span but ≈640 rows couldn’t lock the <tool_call> format; v3 needed the same ×12 turn-upweight as Phase A v4 to make the format stick. v4 then exposed a chat-template assumption: the Qwen3 strip-think rule (ids[0] == <think>) didn’t fire for Qwen3.5 because add_generation_prompt auto-emits <think>\n\n</think> in the prompt, so the captured output starts with reasoning content rather than <think>; v4b silently trained on full reasoning + a stray </think> and produced 0/0/0; v4c rebuilt the strip to slice up to the first </think> regardless of the opening token, covering both shapes. The recurring lesson across all four iterations: small-set TITO distillation is data-shape-sensitive, and every assumption about the token sequence — chat-template formatting included — needs to be verified per tokenizer.
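The v4c strip rule in isolation, as a token-level sketch (model id as named in this report; the real code lives behind --strip-think):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")  # student, per this report
END_THINK = tok.convert_tokens_to_ids("</think>")

def strip_think_span(output_ids: list[int]) -> list[int]:
    """Slice up to and including the first </think>, whatever the opening shape.

    Handles both captures: Qwen3 outputs that open with a <think> token, and
    Qwen3.5 outputs whose prompt already ended with <think> (emitted by
    add_generation_prompt), so the capture starts mid-reasoning and carries
    only the closing </think>.
    """
    if END_THINK in output_ids:
        return output_ids[output_ids.index(END_THINK) + 1:]
    return output_ids
```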

What follow-through 50% leaves on the table. Three of the four held-out agents hit ContextWindowExceeded or LimitsExceeded at steps 88–100 (max=100) with max_model_len=16384 — mid-execution, not at the plan stage. Concrete levers to try, in order of effort/value:

  1. Bump the budget: raise max_steps 100 → 150–200 and max_model_len 16384 → 32768. No retrain; directly addresses the three limit-hit agents. Plausibly takes v4c from 50% to 75%.
  2. More + more diverse teacher rollouts: v4c trained on 22 passing pairs from one rejection-sampled pool; another 40–60 Qwen3.5-27B rollouts give the student more work-turn signal to distil from. Hours of teacher compute, plus a re-extract / retrain cycle.
  3. Bigger student: a Qwen3.5-7B/14B student will execute longer chains before stalling. Out of scope for “tier 5 validates the pipeline” (student size has been held constant for the cross-phase comparison), but a real lever for a future tier.

What tier 5 establishes

  1. The TITO training path is correct. Phase A trains through TitoSFTDataset on {input_ids, output_ids} parquet and the student passes 100/100/100 on held-out cooperbench tasks — the validation tier 4 / v5-tf did not actually perform.
  2. Native token capture transfers the behaviour. Phase B — return_token_ids → token_capture → tito_capture.py → parquet → TitoSFTDataset, zero re-tokenisation — lands plan_first and plan_content at 100%. The native path is not just mechanically sound; the distilled student exhibits the captured behaviour.
  3. Distillation is bounded by the teacher, and it shows up in follow_through. All three runs agree on plan_first + plan_content at 100% and split entirely on the third metric — tracking the teachers’ own follow-through (Gemini 82.6% → 100%, Qwen3-8B 13% → 0%, Qwen3.5-27B 95.7% → 50%), not the capture method. The Qwen3.5-27B run confirms the relationship in the opposite direction from Phase B v3: a stronger same-tokenizer teacher does break through the follow_through ceiling.
  4. The teacher must be steered, and the steering must not leak. No teacher tried here plans-first reliably unprompted; the prompt fixes that, and stripping it from the training prompts keeps the student learning the behaviour rather than the prompt.
Tier 5 artefacts

Task-derived file paths — 2026-05-13 (v5-tf, PASS)

All three thresholds clear on the held-out pairs. plan_first 100% (2/2), plan_content 100% (2/2), follow_through 75% (3/4 agents). The "next steps #2" called out in the v4 section below — pull plan file paths from the task's actual repo, not the synthetic _FILE_PAIRS list — turns out to be the load-bearing fix; v4's format work was directionally right but on its own didn't bridge the prompt-distribution gap.

| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | v5-tf (+task-derived paths) | Threshold |
|--------|-----------------|----------------|---------------------------|------------------------------|-----------|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | 100% (2/2) | ≥ 70% |
| plan_content_rate | 0% | 0% | 0% | 100% (2/2) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | 75% (3/4) | ≥ 60% |

The v5-tf change in one sentence

For each task, scripts/gen_plan_first_coop_data_real_tasks.py now extracts actual file paths from the two feature.md bodies using the eval's own regex, and samples (file_a, file_b) from those — so the plan body, the bash target, and the prompt context all reference the same set of paths. Falls back to the legacy _FILE_PAIRS list only when a task description exposes fewer than two eval-extractable paths.
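In sketch form (regex shape and fallback handling assumed; the real code is _extract_task_files + select_file_pair in the generator):

```python
import random
import re

_PATH_RE = re.compile(r"\b[\w./-]+\.\w+\b")  # assumed shape of the eval's regex

def extract_task_files(feature_md: str) -> list[str]:
    """Pull eval-extractable file paths out of a feature.md body."""
    return sorted(set(_PATH_RE.findall(feature_md)))

def select_file_pair(feature_a_md: str, feature_b_md: str,
                     fallback_pairs: list[tuple[str, str]]) -> tuple[str, str]:
    paths = set(extract_task_files(feature_a_md) + extract_task_files(feature_b_md))
    if len(paths) < 2:
        # Legacy behaviour: fall back to the synthetic _FILE_PAIRS list.
        return random.choice(fallback_pairs)
    file_a, file_b = random.sample(sorted(paths), 2)
    return file_a, file_b
```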

Why this matters — concretely

The eval's plan_content check requires each agent's first send_message body to have ≥ 2 plan keywords and ≥ 1 eval-regex file path. follow_through requires the agent's bash to touch a file from plan_files = a_files ∪ b_files.
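In predicate form, as a sketch; the keyword set follows the plan's §6 definition and the real eval's regex may differ:

```python
import re

PLAN_KEYWORDS = {"plan", "split", "you", "i'll", "first", "step"}  # plan §6 set
PATH_RE = re.compile(r"\b[\w./-]+\.\w+\b")  # assumed shape of the eval regex

def plan_content_ok(first_send_message_body: str) -> bool:
    text = first_send_message_body.lower()
    hits = sum(kw in text for kw in PLAN_KEYWORDS)
    return hits >= 2 and PATH_RE.search(first_send_message_body) is not None

def follow_through_ok(bash_commands: list[str], plan_files: set[str]) -> bool:
    # plan_files = a_files | b_files, the union from both agents' plan turns
    return any(path in cmd for cmd in bash_commands for path in plan_files)
```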

In v3/v4 the training data drew (file_a, file_b) from a hardcoded list (src/cli.py, flask/json/__init__.py, ...) that was uncorrelated with the actual task. The model learned two patterns at once: "name the trained paths" and "name paths visible in the prompt context." At inference, agent1 typically generalized to prompt-derived paths (e.g. dspy/clients/cache.py) but agent2 fell back to short acknowledgments with no path at all — collapsing plan_content to 0%. And the bash steps referenced the trained paths, not the plan paths, so follow_through was structurally pinned at 0%.

With task-derived paths in training, the model has a single consistent pattern: paths come from the task. Both agents follow it; plan paths = bash paths.

Engineering details of this run

| Knob | Value | Note |
|------|-------|------|
| Data generator | gen_plan_first_coop_data_real_tasks.py (v5) | Adds _extract_task_files + select_file_pair. |
| Task pool | cooperdata_tasks_v5.json (30 tasks, 12 repos) | Discovered from the live HF dataset via scripts/build_task_pool_from_dataset.py; the legacy pool was built against an older snapshot whose repo names no longer match cooperbench prepare output, so every entry was being skipped with "missing feature pair". |
| Held-out repos | pallets_click_task, dspy_task | Same two repos as the v3/v4 eval pairs. |
| Trajectories | 2 300 (50 / task × 23 training tasks × 2 agents) | 2 070 train / 230 val parquet rows after the 10% val split. |
| Training | 2×H100 FSDP, 516 steps | Final val/loss 0.0405 (vs v4's 0.30 at step 100 — ≈ 7× lower). The single-pattern data converges crisply. |
| Adapter | /ckpts/plan-first-real-v5-tf/peft/lora_adapter | 132 MB safetensors; served as model id plan-first-v5tf. |
| Held-out eval | 2 pairs, K=1, step_limit=100 | Both rollouts hit the agent loop's 100-step ceiling; the eval scores the first ~2 turns of behavior regardless. |

Per-pair detail

| Pair | plan_first | plan_content | follow_through (a / b) | plan_files (union) |
|------|------------|--------------|-------------------------|---------------------|
| pallets_click_task/2068/f1_f2 | ✓ | ✓ | ✓ / ✓ | src/click/_termui_impl.py, src/click/termui.py |
| dspy_task/8394/f1_f2 | ✓ | ✓ | ✓ / ✗ | dspy/clients/__init__.py, dspy/clients/cache.py, jinja2/sandbox.py, tests/test_sandbox.py |

The one miss is dspy agent2: it produced a valid file-rich plan but its bash steps never touched any of the union plan_files. The plan union for dspy includes two real dspy paths and two legacy fallback paths (jinja2/sandbox.py, tests/test_sandbox.py) — that's the model mixing a task-derived plan with a fallback-influenced ack, which the eval's union check papers over for agent1 but leaves agent2 stranded when its bash uses different paths again. Adding a responder-form variant to the data generator (turn 1 = echo waiting, inbox arrives, turn 2 = ack that echoes both file paths) is the natural follow-up if the bar moves higher than this experiment's 60% threshold.

Two operational footguns hit during the run

  1. vLLM hot-reload silently no-ops on adapter-path change. POST /v1/load_lora_adapter for an already-loaded lora_name returns Success but doesn't actually swap the underlying path — the server keeps serving whatever was loaded first. /v1/models exposes the old root path. The first eval pass on v5-tf returned 0/0/0 because it was scoring v3 rollouts (the prior adapter was still active under the plan-first name). Workaround: load the new adapter under a fresh lora_name (plan-first-v5tf) and re-target -m openai/plan-first-v5tf at the cooperbench CLI. The runbook's hot-reload section needs an "unload+restart" alternative for the in-place-path-change case — or the convention of incrementing the lora_name on every retrain.
  2. cooperbench 0.0.8's execute_coop crashes on mixed-type message timestamps. sent_msgs.sort(key=lambda x: x.get("timestamp") or 0) at cooperbench/runner/coop.py:148 blows up with TypeError: '<' not supported between instances of 'int' and 'str' when one agent reports a numeric timestamp and the other a string one. The crash fires before agent{fid}_traj.json is written, so the eval never sees the trajectories even though the rollouts ran to completion. Patched locally with a best-effort float(ts), falling back to 0.0 (sketch below). Worth upstreaming.
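The local patch for footgun #2, sketched (the real call site is cooperbench/runner/coop.py:148):

```python
def _ts(value) -> float:
    """Coerce mixed int/str timestamps to float; fall back to 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

# Replacing the original `or 0` sort key at the call site:
sent_msgs = [{"timestamp": 3}, {"timestamp": "1.5"}, {"timestamp": None}]
sent_msgs.sort(key=lambda m: _ts(m.get("timestamp")))  # no TypeError on mixed types
```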

Next steps

  1. Expand the held-out eval to n=6+ pairs to tighten the confidence interval. With n=2 the metric resolution is 50% steps; a single bad sample is the difference between 75% and 50% on follow_through.
  2. Responder-form data variant for the one outstanding gap (dspy agent2 follow_through). The proposer-form-only training works for both agents on most repos; the responder-form would close the remaining drift when the runtime delivers an INBOX before the agent's first action.
  3. Adopt the lora_name-versioning convention so a hot-swap on an existing adapter name isn't a silent no-op. Embed the ckpt revision into the served name (plan-first-v5tf, plan-first-v6, ...) and have the agent config pick it up.
  4. Upstream the cooperbench timestamp fix.

Reproducing

# On a 1xH100 host with modal authed and uv synced:
bash scripts/run_plan_first_v5.sh

# Or step-by-step (Modal handles data + train + merge; local handles rollouts + eval):
modal run scripts/modal_plan_first_train.py \
    --steps "data_v5,train,merge" \
    --n-per-task 50 \
    --ckpt-dir /ckpts/plan-first-real-v5-tf \
    --peft-dir /ckpts/plan-first-real-v5-tf/peft \
    --held-out-repos "pallets_click_task,dspy_task"

curl -X POST $ENDPOINT/v1/load_lora_adapter \
    -d '{"lora_name":"plan-first-v5tf","lora_path":"/ckpts/plan-first-real-v5-tf/peft/lora_adapter"}'

# Rollout + eval (see scripts/run_plan_first_v5.sh for the cooperbench invocation)

Symmetric real-task training — 2026-05-11 (v3 and v4)

v3 (symmetric data) jumps plan_first from 0% → 100% (2/2). Both held-out pairs now have agent1 and agent2 each emitting send_message at turn 1 with the partner receiving the inbox before any real bash. plan_content and follow_through still 0% — covered below.

| Metric | v2 (asymmetric) | v3 (symmetric) | v4 (+runtime tool format) | Threshold |
|--------|-----------------|----------------|---------------------------|-----------|
| plan_first_rate | 0% (0/2) | 100% (2/2) | 50% (1/2) | ≥ 70% |
| plan_content_rate | 0% | 0% (per-agent file refs missing) | 0% (per-agent file refs missing) | ≥ 70% |
| follow_through_rate | 0% | 0% | 0% | ≥ 60% |

What changed between iterations

What we know about plan_content and follow_through

Next steps

  1. Longer v4 training (200–300+ steps) to see if the runtime-format data fully overrides the base model's preference for Chinese thinking + JSON tool calls on certain tasks.
  2. Stronger plan-content supervision: regenerate data so every send_message body has concrete file paths from the task's actual repo (e.g. pull from combined.patch filenames), not just the synthetic _FILE_PAIRS list.
  3. Tier-5 Gemini-Flash distillation under TITO — the canonical path that sidesteps every train/runtime drift by construction.

Artifacts: v3 data data/sft/plan_first_real_v3/combined.jsonl (symmetric trajectories); v3 checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v3/global_step_100 (merged adapter at /peft/lora_adapter, currently loaded on the live serve); v4 data data/sft/plan_first_real_v4/combined.jsonl (+ runtime tool format); v4 checkpoint :/ckpts/plan-first-real-v4/global_step_100 (merged adapter available but not the active deploy); metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-{v3,v4}.json; trajectories logs/plan-first-eval-real-{v3,v4}/coop/.../f1_f2/.

Real-task training — 2026-05-11 (later)

Held-out metrics still 0/0/0, but the gap moved. Trained from scratch on 2,100 trajectories whose task message is the actual cooperbench feature.md rendered through the same coopertrain/agents/mini_swe_agent/config/coop.yaml instance template the eval uses at inference. Training converged at step 100 (val/loss 0.28; full schedule was 472 steps but we stopped early as planned).

What the v2 (real-task) rollout actually showed

Adapter + eval format alignment landed alongside this

Three loose ends got tightened while debugging.

Recommended next steps to actually pass the threshold

  1. Symmetric plan-first data: change the generator so agent2's first turn is also a send_message (the ack), not a preliminary echo inbox check. This makes the trained behavior pattern match the eval's mutual-exchange criterion. Cheapest fix.
  2. Wider task coverage: the current 21 training tasks generalize to some held-out tasks but not all. Add more pool entries, or augment with task perturbations so the model doesn't memorize task identity.
  3. Distill from Gemini-Flash plan-first rollouts: still the canonical TITO path. Real rollouts handle the role-asymmetry naturally and the prompt-distribution match is by construction.

Artifacts: real-task data data/sft/plan_first_real_v2/combined.jsonl (2,100 rows, avg user prompt 6,552 chars vs 7,239 in the live rollout — basically identical); checkpoint plan-first-checkpoints:/ckpts/plan-first-real-v2/global_step_100; adapter /ckpts/plan-first-real-v2/peft/lora_adapter (132 MB); metrics report/2026-05-10-plan-first-cooperbench-results/metrics-real-tasks-v2.json; trajectories logs/plan-first-eval-real-v2/coop/.../f1_f2/; v2 train app ap-EB4NGuYCCq7Ks1cMLGzryU (stopped after step 200 save attempt).

OOD diagnosis confirmed — 2026-05-11 (synthetic first-turn eval)

Diagnosis confirmed: 100% on synthetic prompts vs 0% on cooperbench prompts. Re-ran the plan_first / plan_content predicates from scripts/eval_coop_behavior.py on 20 first-turn responses to training-distribution prompts (system + templated user task from data/sft/plan_first/combined.jsonl). Result: plan_first 100% (20/20), plan_content 100% (20/20). Every sample produced a pure send_message turn with ≥2 plan keywords and ≥1 file path. The LoRA does have the behavior; it just doesn't generalize from synthetic templated tasks to real PR-description tasks. See scripts/eval_first_turn_synthetic.py and metrics-synthetic-first-turn.json.

| Metric | Synthetic prompts (this test) | Cooperbench held-out (v3 rollouts) | Threshold |
|--------|-------------------------------|-------------------------------------|-----------|
| plan_first_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |
| plan_content_rate | 100% (20/20) | 0% (0/2) | ≥ 70% |

The gap between the two columns is the failure mode. The LoRA is fine; the SFT data distribution doesn't cover what the eval feeds the model. Next step: regenerate SFT data on top of cooperbench's actual task template (or distill from Gemini-Flash rollouts under TITO). Either makes the train / inference prompt distributions match by construction.

Fix-#2 attempt — 2026-05-11 (later)

Status: FAIL on held-out (0/0/0), but the diagnosis pinned the actual cause. Tearing apart the response pipeline showed the model is trained correctly — it produces <tool_call><function=bash><parameter=command>send_message agent2… on every training-distribution prompt (10/10 sampled). The 0/0/0 on pallets_click_task:2068 and dspy_task:8394 is a distribution-shift problem on the task prompt, not a format-mismatch problem.

What I changed under the “fix-#2” umbrella

The original fix-#2 framing (“regenerate SFT data so its rendered chat-template output matches inference”) turned out to be wrong: I verified by running one row of combined.jsonl through AutoTokenizer.apply_chat_template and the assistant <tool_call> XML survives the template verbatim. So I went hunting for the real divergence and made three serve / adapter changes along the way:

  1. Removed --reasoning-parser qwen3 from the vLLM serve. The parser captures everything before </think> into the response’s reasoning field. The trained model never emits </think> (training data has no thinking tags), so the entire output — including the <tool_call> XML — was being routed into reasoning while content came back null and tool_calls=[]. Confirmed by reading the raw response object on the first attempt’s serve.
  2. Removed chat_template_kwargs.enable_thinking=false from the agent config. That flag injects a literal <think>\n\n</think>\n\n after <|im_start|>assistant\n, which is OOD vs training (the trained assistant turns start with prose directly). With the flag set, the model fell back to echoing the markdown bash example from the system prompt — the “markdown bash blocks” output observed in the v2 smoke test. Default mode (no flag) renders the same prompt tail as training and the model emits the correct XML.
  3. Bridged the model output to the agent loop in LitellmModel. Two issues stacked on top of each other here; the fix adds two helpers (_extract_tool_calls_from_content, _rewrite_xml_tool_calls_to_markdown + _heredoc_to_quoted_send_message) so that with disable_tools=True, LitellmModel.query rewrites the assistant content from the XML+heredoc form into the markdown+quoted form the agent loop already speaks. Verified end-to-end on training-style prompts: 10/10 produce parseable actions.

The actual failure

On the two real held-out coop tasks the model returns no <tool_call> blocks at all — instead it emits a short Chinese-then-English narration followed by literal text like “[Makes bash tool call with {"command": "ls -la"} as arguments]”. The system prompt is identical to training (same templated phrasing) but the task message is the full cooperbench PR description (50+ lines, with embedded markdown code blocks, “Solution” sections, type annotations). The synthetic training data uses 3–5-line tasks. The LoRA never generalized from short templated tasks to long PR-style tasks — it goes off-distribution and stops producing tool calls.

This is consistent with the val/loss curve: 2.36 → 0.098 over 424 steps. It memorized the templated distribution very well. It did not learn an invariant “respond with <tool_call> + send_message regardless of task shape.”

Why TITO would have caught this differently

Under TITO we’d have captured the exact token stream from a real Gemini-Flash plan-first rollout against a cooperbench task — so the training prompt distribution is the inference prompt distribution by construction. The current synthetic-data path optimizes a different distribution than the eval is sampling from. That’s the single biggest lesson from this attempt.

Recommended next steps

  1. Regenerate SFT data wrapping synthetic plan-first content inside cooperbench’s actual task-prompt template. Pull a few real cooperbench tasks, replay the exact system + user prompts the eval will send, and only inject the plan-first assistant trajectory on top. Same training cost, matching distribution.
  2. Tier-5 distillation from Gemini-Flash plan-first rollouts on cooperbench tasks. Real rollouts — no synthetic gap. This is the canonical TITO path. Cost goes up (rollout time + Gemini API) but the format-and-distribution problem disappears.
  3. Cheaper experiment first: re-run held-out eval against the same synthetic task templates (i.e., feed the model the training-style task message instead of the real cooperbench PR). If metrics jump to passing, that confirms the diagnosis with zero retraining cost.

Artifact pointers for this attempt: serve ap-ED8tmOxyyYDBlMGvhdE7in (redeployed without --reasoning-parser); agent config coopertrain/configs/coop_plan_first.yaml (no enable_thinking flag); adapter helpers in coopertrain/agents/mini_swe_agent/models/litellm_model.py; rollouts logs/plan-first-eval-v3/; metrics report/2026-05-10-plan-first-cooperbench-results/metrics-v3.json.

Re-run with fixes — 2026-05-11

Status: SECOND ATTEMPT ALSO FAILED, NEW FAILURE MODE. Plan-first rate = 0%, plan content = 0%, follow-through = 0% (n=2, same held-out pairs as the first attempt). The training fix worked — the LoRA now produces visibly different output from base and emits send_message — but in the wrong surface form: markdown ```bash ... ``` code blocks instead of the <tool_call> XML that vLLM’s qwen3_coder parser extracts into tool_calls. mini-swe-agent sees an empty tool_calls array and rejects every turn with “No tool calls found in the response.”

| Metric | First attempt | Re-run | Threshold |
|--------|---------------|--------|-----------|
| plan_first_rate | 0% | 0% | ≥ 70% |
| plan_content_rate | 0% | 0% | ≥ 70% |
| follow_through_rate | 0% | 0% | ≥ 60% |
| final SFT val/loss | ~1.9 (step 42) | 0.098 (step 424) | — |
| LoRA ≠ base on probe | no (identical) | yes | — |
| LoRA emits send_message | no | yes | — |
| LoRA emits <tool_call> XML | no | no | — |

What changed in the re-run

What the re-run revealed

The training itself clearly succeeded this time — val/loss fell steadily across the 4 epochs (2.36 → 0.61 → 0.16 → 0.10), and the LoRA output is visibly different from the base model. The smoke test (§4 of the runbook) passed 2/3 conditions:

IDENTICAL: False                # ✓ LoRA learned something
LoRA HAS <tool_call>: False     # ✗ but not the XML format
LoRA HAS send_message: True     # ✓ learned to call the right tool

Sample LoRA output (greedy decode on a training prompt’s system+user prefix):

```bash
send_message --wait agent2 <<'MSG'
What files are you planning to edit?
MSG
```

That is the correct intent (a send_message to the partner agent), but mini-swe-agent dispatches off the response’s tool_calls field, which vLLM only populates when it sees literal <tool_call><function=bash>…</function></tool_call> XML in the generated content. Markdown bash blocks do not parse. Every turn in both held-out pairs hit “No tool calls found in the response” for all 100 steps before LimitsExceeded.

Diagnosis — why the model learned the wrong format

The training data combined.jsonl contains <tool_call><function=bash>…</function></tool_call> in the assistant message text. But the Qwen3 chat template at SFT time appears to have rewritten that content during rendering — either via tool-call extraction into a structured field, or via the messages_key=messages path in verl's MultiTurnSFTDataset — so the tokenized assistant turn that the model actually trained against was the rendered form, not the literal XML. The rendered form turned out to be markdown bash, so that’s what the model learned to emit. At inference, vLLM’s qwen3_coder tool parser only knows how to parse XML back out, so the loop never closes.

Recommended next steps

  1. Cheapest fix: change mini-swe-agent’s adapter to also accept markdown bash blocks (not just tool_calls). The action content is unambiguous — one bash block per turn. This makes the current LoRA usable as-is, no retraining needed.
  2. Format-correctness fix: regenerate SFT data with the actual rendered chat-template output as the assistant content, so training data matches what the model will produce at inference. This requires running each combined.jsonl entry through the tokenizer’s chat template once, capturing the rendered assistant turns, and writing those back (sketched after this list).
  3. Tier 5 (distillation): abandon the templated-data approach and distill behavior from Gemini-Flash plan-first rollouts. The format-rendering problem disappears because Gemini’s real rollouts emit whatever format mini-swe-agent already accepts.
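A sketch of option 2: render each trajectory once through the chat template and write the rendered assistant turns back. The helper name is illustrative; this assumes the Qwen template renders the full conversation as a strict extension of the generation-prompt prefix, and trailing end-of-turn markers may need per-template trimming:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

def rerender_assistant_turns(messages: list[dict]) -> list[dict]:
    """Replace each assistant turn with the text the chat template renders."""
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            out.append(msg)
            continue
        full = tok.apply_chat_template(messages[: i + 1], tokenize=False)
        head = tok.apply_chat_template(messages[:i], tokenize=False,
                                       add_generation_prompt=True)
        # full extends head, so the slice is the rendered assistant turn.
        rendered = full[len(head):].removesuffix("\n").removesuffix("<|im_end|>")
        out.append({"role": "assistant", "content": rendered})
    return out
```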

Artifact pointers for the re-run: final checkpoint plan-first-checkpoints:/plan-first/global_step_424; merged PEFT adapter plan-first-checkpoints:/plan-first/peft/lora_adapter (132 MB); metrics JSON report/2026-05-10-plan-first-cooperbench-results/metrics-v2.json; trajectories logs/plan-first-eval/coop/{pallets_click_task,dspy_task}/.../f1_f2/; training app ap-ry4fWdYJbCTYVkWyz2N6ue (stopped 2026-05-11 01:44:51 UTC).

First attempt — 2026-05-10

Status: PIPELINE WORKS, BEHAVIOR DID NOT TRANSFER. All three behavioral metrics scored 0/0/0 on held-out cooperbench rollouts. The eval correctly identified the failure mode (“missing assistant turns”), and the root cause is a training↔inference format mismatch, not a bug in the data, training, serve, or eval scripts. Details in §4. Pipeline-correctness checkpoint is partial: data + train + serve + eval all worked; the model itself is producing no tool calls at inference.

1. TL;DR

Result. Plan-first rate = 0%, plan content = 0%, follow-through = 0% (n=2 completed coop pairs on held-out repos pallets_click_task and dspy_task). Manual probes against the served LoRA confirm: identical token-for-token output between plan-first and the bare Qwen/Qwen3-4B base on the same prompt.

What this tells us about the pipeline. Six of the seven pipeline stages (data gen, parquet, FSDP train, FSDP→PEFT merge, vLLM hot-load, behavioral eval) are end-to-end correct — verified by file-level checks at each handoff. The failure is concentrated in stage 6, inference behavior: the inference prompt format does not match the training prompt format, so even though the LoRA weights are non-zero and the adapter loads, the model sees an out-of-distribution prompt at inference and falls back to base behavior.

2. Setup — what we actually ran

| Component | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-4B (plan called for 9B; see deviation note below) |
| Strategy | LoRA rank=32, alpha=16, target_modules=all-linear, 252 lora_A + 252 lora_B tensors (all 36 layers × 7 modules) |
| Train data | 342 train + 38 val plan-first templated trajectories on the cooperdata_tasks.json 19-task held-in pool, 10 per task |
| Train compute | 2 × H100 80GB (FSDP), 42 steps total, 2 epochs, ~4 min wall |
| Train final loss | train 3.5 → 2.54, val 2.68 (clear downward signal — training did learn something) |
| Adapter checkpoint | Modal vol plan-first-checkpoints:/plan-first/peft/lora_adapter/ (132 MB) |
| Adapter merge | verl 0.7.1 model_merger CLI, with a monkey-patch for a known LoRA task_type bug (peft ≥ 0.13 returns it as str, verl casts .value) |
| Serve | vLLM 0.19 on Modal H100 (cooperbench--qwen3-4b-plan-first-serve.modal.run/v1), --enable-lora --max-lora-rank 32 --lora-modules plan-first=/ckpts/plan-first/peft/lora_adapter, --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 |
| Held-out eval tasks | pallets_click_task/2068 + dspy_task/8394 (f1_f2 of each, 2 pairs total) |
| Rollout | local docker via cooperbench's --backend docker, mini-swe-agent, step_limit=100 |
| Behavioral eval | scripts/eval_coop_behavior.py unchanged from stage 2 |
Deviations from the plan — and why

3. Metrics

Same three metrics defined in the plan §6. Pass thresholds from the plan: plan-first ≥ 70%, plan content ≥ 70%, follow-through ≥ 60%.

| Metric | Definition (short) | Score | Threshold | Verdict |
|--------|--------------------|-------|-----------|---------|
| Plan-first rate | Both agents send_message + receive INBOX before any real bash | 0% (0/2) | ≥ 70% | FAIL (no asst turns) |
| Plan content | Turn-1 send_message has plan keywords + file path reference | 0% (0/2) | ≥ 70% | FAIL (no asst turns) |
| Follow-through | Agent touches the file(s) it claimed in its plan turn | 0% (0/4 agents) | ≥ 60% | FAIL (no asst turns) |

Per the eval script’s reason field, both pairs failed with "missing assistant turns": 0 messages with role=assistant in the saved trajectory, out of 101–103 total messages per agent. The agent’s 100 LLM calls all returned empty tool_calls, triggering the FormatError retry loop until step_limit=100 fired and the run ended with status=LimitsExceeded.

Per-pair detail (raw eval output)
{"summary":{"n_pairs":2,"plan_first_rate":0.0,"plan_content_rate":0.0,"follow_through_rate":0.0,
            "n_plan_first":0,"n_plan_content":0,"n_follow_through_agents":0},
 "per_pair":[
   {"pair_id":"dspy_task/8394/f1_f2","plan_first":false,"reason":"missing assistant turns"},
   {"pair_id":"pallets_click_task/2068/f1_f2","plan_first":false,"reason":"missing assistant turns"}
 ]}

4. Diagnosis — why 0/0/0

The plan’s §6.1 decomposition says “plan-first low ⇒ model didn’t learn the temporal pattern.” That’s the right ballpark, but the actual failure is sharper: the model produces no tool calls at all, plan-first or otherwise. Walking the pipeline back to find where it broke:

| # | Stage | Check | Result |
|---|-------|-------|--------|
| 1 | Data generation | 342 trajectories × ~15 messages, all decode to valid coop chat format with <tool_call><function=bash><parameter=command>send_message...</parameter></function></tool_call> in assistant content. 17 unit tests green. | PASS |
| 2 | Parquet conversion | Per-trajectory expansion via prepare_verl_data.py --mode sft; messages column matches what MultiTurnSFTDataset expects. | PASS |
| 3 | FSDP training | 2 epochs, loss 3.5 → 2.54, val 2.68, no NaNs, no OOM. 42 steps total. | PASS |
| 4 | FSDP → PEFT | 252 lora_A + 252 lora_B tensors saved. Manual byte-level check: the first lora_B[0] has 163662/163840 non-zero bytes (i.e., the adapter is not all zeros — training did move it). adapter_config.json has task_type=CAUSAL_LM, r=32, target_modules=[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. | PASS |
| 5 | vLLM serve | vLLM logs "Loaded new LoRA adapter: name 'plan-first', path '/ckpts/plan-first/peft/lora_adapter'". /v1/models shows plan-first as a child of Qwen/Qwen3-4B with root pointing at the adapter dir. Routing works. | PASS |
| 6 | Inference behavior | Identical token-for-token output between plan-first and Qwen/Qwen3-4B on the same prompt (e.g. completion of "Before I start editing, let me coordinate with agent2 so we" yields word-for-word the same 80 tokens). Model produces conversational text, never <tool_call> XML. | FAIL |
| 7 | Behavioral eval | Correctly emits reason="missing assistant turns"; metrics decompose cleanly to 0/0/0; per-task breakdown intact. | PASS |

The actual root cause

Stage 6 narrows to one of two not-mutually-exclusive hypotheses:

  1. Training↔inference format mismatch (most likely). Training data assistant turns look like:
    <assistant> "Before I start editing, let me coordinate with agent2..."
                <tool_call><function=bash><parameter=command>
                send_message agent2 <<'MSG' ... MSG
                </parameter></function></tool_call>
    
    The system prompt during training is the bare COOP_SYSTEM_PROMPT — no tools field on any message. At inference, mini-swe-agent passes tools=[BASH_TOOL] + tool_choice="auto" to litellm, which forwards them to vLLM’s OpenAI-compatible endpoint. The Qwen3 chat template injects tool descriptions into the system message when tools are provided. So the model sees a different system prompt at inference than it ever saw during training. The LoRA delta (rank 32, 42 steps, weak by construction) is not large enough to override the base model’s “tools ⇒ ask clarifying questions, no tool calls” prior on this OOD prompt.
  2. LoRA capacity / training budget too small. LoRA rank 32 with 42 steps on 342 trajectories gives ~24 mini-batches/epoch × 2 epochs ≈ 48 forward+backward passes. For a behavior as specific as “always emit send_message in turn 1,” this may be under the threshold needed to overpower base behavior, regardless of prompt format. The non-zero lora_B values show the adapter did learn something; just not enough to dominate at decoding.

The first explanation is the load-bearing one: a manual completion-API probe (raw /v1/completions with no tools, no chat template) on the prefix “Before I start editing, let me coordinate with agent2 so we” still returns identical text to base. So even without the chat-template tools injection, the LoRA doesn’t move the next-token distribution noticeably on this prefix. That points to (2) being non-trivial too — 42 SFT steps is genuinely low.

5. What the pipeline did validate

The plan doc's §1 listed five things this experiment should validate. Four are validated by file-level passes across data gen, parquet, training, serve, and eval. The fifth (multi-turn behavior emergence) is what failed:

| Plan claim | Validated? | Evidence |
|------------|------------|----------|
| Multi-turn data round-trips through tokenizer | yes | Train loss decreased from 3.5 to 2.54; if tokenization were broken loss would be flat or NaN. |
| Loss masking fires only on assistant tokens | yes | verl MultiTurnSFTDataset handles this; the train loss curve confirms gradients are flowing. |
| Tool-call XML round-trips through tokenization | yes | (structural — XML is plain text; no special tokens involved) |
| INBOX blocks format identically training ↔ inference | yes | Generator imports the same string-format helpers (_tool_response) used by the live coop runner; spot-checked on 10 random samples. |
| Cross-agent consistency emerges (agent_2 reads agent_1's plan) | no | Cannot test — the model never plans in the first place at inference. |

6. Recommended next steps

  1. Fix the format mismatch first. Either:
  2. Scale training budget. 342 trajectories × 2 epochs is not enough to shift LoRA behavior decisively. Recommend: 10× the data (~3.5k trajectories) and/or 4–5 epochs. Same compute envelope, ~$15 instead of ~$3.
  3. Add a sanity probe before scaling rollouts. A 30-second smoke that hits the deployed endpoint with one training-style prompt and asserts “plan-first output ≠ base output” would catch this regression class without burning 100 LLM calls per held-out task (see the sketch after this list).
  4. Don’t merge PR #30 yet. The infrastructure (data gen, training driver, merge script, serve, eval) is all reusable for the next attempt; the report itself documents the failure path. Both should land. But the experiment hasn’t demonstrated what tier 4 was supposed to demonstrate (multi-turn behavior emerging end-to-end), so the “tier 4 complete” claim should remain open until a re-run with one of the fixes in (1) actually moves the needle on at least one of the three metrics.
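The sanity probe from next-step #3, sketched; the endpoint and model ids follow the setup table above, and the probe prefix is the one used in the stage-6 diagnosis:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://cooperbench--qwen3-4b-plan-first-serve.modal.run/v1",
    api_key="EMPTY",
)

def probe(model_id: str, prompt: str) -> str:
    """Greedy completion so base-vs-adapter comparison is deterministic."""
    resp = client.completions.create(model=model_id, prompt=prompt,
                                     max_tokens=80, temperature=0.0)
    return resp.choices[0].text

prompt = "Before I start editing, let me coordinate with agent2 so we"
base = probe("Qwen/Qwen3-4B", prompt)
lora = probe("plan-first", prompt)
assert lora != base, "LoRA output identical to base: adapter inactive or undertrained"
```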

7. Files landed on this branch

| File | Purpose |
|------|---------|
| scripts/modal_plan_first_merge.py | FSDP-sharded LoRA → PEFT adapter on a Modal volume (with monkey-patch for the verl 0.7.1 LoRA task_type bug). |
| coopertrain/serve/vllm_modal_plan_first.py | Modal vLLM serve: base Qwen3-4B + hot-loaded plan-first LoRA, qwen3_coder tool parser, qwen3 reasoning parser. |
| coopertrain/configs/coop_plan_first.yaml | Agent config pointing at the Modal serve endpoint, with the enable_thinking=false chat-template override. |
| report/2026-05-10-plan-first-cooperbench-results.html | This document. |
| report/2026-05-10-plan-first-cooperbench-results/metrics.json | Raw eval output; per-pair / per-task breakdown. |
| pyproject.toml (modified) | tensordict pin bumped to >=0.8,<0.11 for verl 0.7.1 compatibility (was >=0.5,<0.7, stale). |

Appendix: original experiment plan — 2026-05-10

This is the plan doc that originally framed the experiment, preserved verbatim. The thresholds, metric definitions, and rollout-stage breakdown here are the bar the results above (Tiers 4 v5-tf and Tier 5 Phases A/B) were measured against. Stage 2 was complete and Stage 3 pending at the time of writing; both have since completed.

1. Why this experiment

Question: if we inject a multi-turn agentic behavior into the SFT data — specifically, “agents discuss a plan via send_message before any bash, then each does their assigned piece” — does the trained model actually exhibit that behavior in real cooperbench coop rollouts? If yes, the whole pipeline (rollout-time TITO capture → JSONL → parquet → TitoSFTDataset → trainer → checkpoint → rollout under coop) is end-to-end correct on the workload that matters.

Why plan-first specifically: it’s the smallest behavior that exercises every concern of the pipeline simultaneously — multi-turn history, loss masking, tool-call boundary preservation, INBOX block formatting, cross-agent coordination. A surface signature or even running-sum accumulator would catch a strict subset.

What this validates

What this does not validate

2. Test hierarchy — where this fits

| Tier | What it proves | Cost | Status |
|------|----------------|------|--------|
| 1. Running-sum smoke (synthetic) | TITO data → trainer → multi-turn behavior preserved at inference | ~10 min, 1 GPU | deferred |
| 2. Per-K degradation curve | No silent truncation across turn depth | (free, same run as 1) | deferred |
| 3. Running-sum under compaction | TITO capture beats reconstruction (the PR #29 promise) | ~30 min, 1 GPU | deferred |
| 4. Plan-first cooperbench (templated) | Pipeline handles real coop format end-to-end | ~half-day, 1 H100 | this plan |
| 5. Plan-first cooperbench (distilled) | Full research workflow + compaction cross-effect | ~1 day, teacher rollouts + 1 H100 | follow-up |

Tiers 1–3 are deferred because the same bugs surface in tier 4, just less crisply. Tier 4 is the smallest test on the actual workload.

3. Behavior under test

Every coop trajectory in the training data must satisfy:

  1. Turns 1–2 (one round-trip per agent) are exclusively send_message tool calls. Each agent sends a plan; receives the other’s plan via INBOX; sends an acknowledgment / counter-proposal as needed.
  2. The plan divides labor. “I’ll do X (the cooperbench_repo/path/foo.py piece), you do Y.”
  3. From turn 3 onward each agent uses bash on their assigned piece — no overlap, no re-discussion unless the plan needs revising.

This is structural enough to verify programmatically and substantive enough that the model has to attend to multi-turn history (the assignment is in turn 1, the bash command is in turn 3+).

4. Data generation: templated synthesis

Why templated rather than distilled (for this tier): pipeline-correctness is the goal, not plan content quality. A programmatic generator gives deterministic, free, debuggable data and isolates the pipeline-correctness signal from teacher-model variance.

Inputs

Generator (scripts/gen_plan_first_coop_data.py)

For each task in the 23-task held-in pool, emit ~10 templated coop trajectories. Each trajectory looks like:

turn 1  agent_1.send_message  → agent_2: "Plan: I'll handle <file_a>, you handle <file_b>"
turn 2  agent_2.send_message  → agent_1: "Acknowledged — I'll do <file_b>"
turn 3  agent_1.bash          → cd repo && cat <file_a>       (real path)
turn 4  agent_1.bash          → sed -i ... <file_a>          (real plausible edit)
turn 3' agent_2.bash          → cd repo && cat <file_b>       (real path)
turn 4' agent_2.bash          → sed -i ... <file_b>          (real plausible edit)
...
turn N  agent_*.bash          → pytest                       (or git diff > submission)

The generator emits two JSONL files per trajectory (one for agent_1, one for agent_2) using the production schema: {input_ids, output_ids, metadata}. metadata.source = "templated-plan-first" and metadata.task_id matches the cooperbench task. Tokenization uses Qwen/Qwen3.5-9B with apply_chat_template(..., add_generation_prompt=True) — the same path the real rollout would have used.

Volume: 23 tasks × ~10 trajectories × ~6 assistant turns × 2 agents ≈ 2 760 TITO pairs. Same order of magnitude as the existing 9B SFT data.
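How one templated trajectory might become per-agent TITO pairs under this schema; a sketch, since the real generator's helpers may differ:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")  # model id as used in this plan

def emit_tito_pairs(agent_messages: list[dict], task_id: str) -> list[dict]:
    """One {input_ids, output_ids, metadata} row per assistant turn."""
    rows = []
    for i, msg in enumerate(agent_messages):
        if msg["role"] != "assistant":
            continue
        rows.append({
            # Prompt: full history before this turn plus the generation prompt,
            # rendered exactly as the real rollout would have rendered it.
            "input_ids": tok.apply_chat_template(
                agent_messages[:i], add_generation_prompt=True, tokenize=True),
            # Target: the assistant turn itself, with the end-of-turn token
            # appended so generation learns to stop (Qwen-specific token).
            "output_ids": tok(msg["content"], add_special_tokens=False)["input_ids"]
                          + [tok.convert_tokens_to_ids("<|im_end|>")],
            "metadata": {"source": "templated-plan-first", "task_id": task_id},
        })
    return rows
```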

Edge cases the generator must handle

5. Training

| Knob | Value | Reasoning |
|------|-------|-----------|
| Base model | Qwen/Qwen3.5-9B | Same base used for the 9B coop baseline; fits a single H100 with FSDP shard size 1. |
| Strategy | Full FT (no LoRA) | We want a clean SFT signal; LoRA could mute behavior changes at this scale. |
| Train tokens | ~2 760 pairs × ~512 tokens ≈ 1.4M tokens | Small. Training is bounded by the experiment's validity, not by token count. |
| Epochs | 2 | Behavioral SFT typically converges within 1–2 epochs at this scale. |
| LR | 1e-5 | Standard SFT LR for the Qwen 7B+ class. Bumpable to 2e-5 if loss is flat at epoch 1. |
| Batch (global) | 16 | 1 H100 with FSDP. Adjust if OOM at full sequence length. |
| Eval freq | 50 steps | Track val loss, but the real signal is the rollout eval below — loss isn't the metric we care about. |
| Hardware | 1 × H100 80GB | ~1–2 hours wall time. No multi-node needed for 9B FT. |

Config landing as coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml — copy of the production 9B SFT config with batch / GPU / data path overridden.

6. Behavioral evaluation

Run the trained checkpoint on a held-out subset of the 23 cooperbench tasks (5 tasks held out from the data generation step), at K=1, step_limit=100. For each rollout pair, parse the saved trajectory and compute three metrics:

| Metric | Definition | Baseline (Qwen3.5-9B base) | Pass threshold |
|--------|------------|-----------------------------|----------------|
| Plan-first rate | Fraction of rollout pairs where both agents' first action is send_message AND each receives an INBOX from the other before issuing any bash. | ~10–25% (ad hoc; depends on prompt) | ≥ 70% |
| Plan content quality | Keyword presence in the turn-1 send_message content: contains ≥ 2 of {plan, split, you, I'll, first, step} AND references at least one real file path or function name from the task. | ~30% | ≥ 70% |
| Follow-through rate | Did agent_1 actually do what it said? Extract entities (file paths, function names) from agent_1's turn-1 send_message; check whether at least one is touched in agent_1's bash commands by turn 5+. Same for agent_2 (mutatis mutandis). Score = fraction of agents (across pairs) who follow through. | ~40% | ≥ 60% |

Why three metrics

Each isolates a different pipeline concern, and their failure modes decompose cleanly.

Eval harness

New: scripts/eval_coop_behavior.py — takes a --run-dir (directory of saved trajectories) and emits a JSON with the three metrics, plus per-task and per-pair breakdowns. Reusable for any future behavior-injection experiment. ~150 LOC.

Rollout itself uses the existing run_coop_pass_at_k.py machinery against the trained model served on Modal. Modal config: drop-in copy of coopertrain/serve/configs/qwen3-5-9b.yaml with the checkpoint path overridden.

7. Files to be created

| Path | LOC | Purpose |
|------|-----|---------|
| scripts/gen_plan_first_coop_data.py | ~200 | Templated trajectory generator → per-agent JSONL. |
| scripts/eval_coop_behavior.py | ~150 | Behavioral eval over a directory of trajectories. |
| coopertrain/configs/verl/sft_qwen35_9b_plan_first_smoke.yaml | ~30 | Training config (copy of the 9B SFT config with overrides). |
| coopertrain/serve/configs/qwen35_9b_plan_first.yaml | ~10 | Modal serve config for the trained checkpoint. |
| tests/integration/test_plan_first_data.py | ~80 | Unit tests on the data generator (schema, token counts, tool-call format). |
| tests/integration/test_behavior_eval.py | ~80 | Unit tests on the behavioral eval (synthetic trajectories → expected metrics). |
Total ~550 LOC across 6 files. No changes to existing pipeline code (PR #29 is the load-bearing change).

8. Success criteria

  1. All 6 new files land with green CI (lint + unit tests).
  2. Templated data generator produces ~2 760 valid TITO pairs whose input_ids + output_ids all decode to plausible coop trajectories (spot-checked on 10 random samples).
  3. Training run converges (val loss decreasing, no NaNs, no OOM).
  4. Held-out cooperbench rollout produces measurable behavior change vs the base model on all three metrics:
    • Plan-first rate: ≥ 70%
    • Plan content quality: ≥ 70%
    • Follow-through rate: ≥ 60%
  5. If any metric fails, the failure mode points at a concrete pipeline bug (per §6 decomposition); diagnose & report.

9. Cost & timeline

| Stage | Wall time | Compute |
|-------|-----------|---------|
| Data generation | ~30 min | local CPU (no GPU needed) |
| Training (9B FT, 2 epochs) | ~1.5 hr | 1 × H100 (~$3) |
| Modal serve (idle + 5 held-out tasks) | ~1 hr | 1 × H100 (~$2) |
| Held-out rollout (K=1, 5 tasks) | ~30 min | (uses Modal endpoint) |
| Behavioral eval + report | ~30 min | local |

Total: ~4 hours wall, ~$5 cloud spend.

10. Rollout plan

Stage 1 — this PR (plan only): in progress

Stage 2 — implementation: complete

Stage 3 — experiment: awaiting hand-off

11. Risks & mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Templated plans look unnatural → model overfits to the template surface form | medium | Vary phrasing across trajectories (parametrized template); spot-check by reading 10 samples; if too uniform, escalate to tier 5 (distillation). |
| Held-out tasks are too similar to training tasks → metrics inflated by leakage | medium | Hold out by repo, not task: never train on any task from the held-out repos. |
| Base 9B already plans-first sometimes → small absolute lift is hard to read | low | Run a baseline eval first on the same 5 held-out tasks; report deltas, not absolutes. |
| Tool-call XML format drift between training data and inference | medium | Generator imports the same parser used at inference (actions_toolcall.py) and round-trips one example before emitting all rows. |
| Modal endpoint latency → 5-task rollout takes hours | low | Use the 9B model (small autoscale latency); 5 tasks × 2 agents × ~100 turns ≈ 1 000 calls; well under an hour at concurrency=10. |