Martingale-training runs ledger
Every valid training/eval run for the Martingale project, in one table, in reverse-chronological order. Runs invalidated by infrastructure bugs are excluded (see the dropped list at the bottom). Each row links to the artifact / Slack thread / repo where the numbers live.
Definitions are in the project page and in meno-sh/Martingale-Training/REPLICATION.md. “Signed linR²” = R² of the OLS fit Δ ≈ β·(prior − 0.5); “slope” = β. “Brier(post)” = mean of (y_resolve − posterior_belief)² after full reasoning. Lower is better for both Brier and absolute slope; slope at 0 = perfect Martingale property.
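For readers who want to sanity-check a row by hand, the definitions above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the project's implementation: it assumes Δ = posterior − prior and an OLS fit with an intercept, neither of which is stated here (REPLICATION.md is the authority), and the function names `martingale_fit` and `brier` are hypothetical.

```python
from statistics import fmean

def martingale_fit(priors, posteriors):
    """OLS fit of delta = alpha + beta * (prior - 0.5); returns (beta, r_squared).

    Assumption: delta is posterior minus prior, and the fit includes an intercept.
    """
    x = [p - 0.5 for p in priors]
    d = [q - p for p, q in zip(priors, posteriors)]
    mx, md = fmean(x), fmean(d)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - md) for xi, di in zip(x, d))
    syy = sum((di - md) ** 2 for di in d)
    beta = sxy / sxx
    # For a one-regressor OLS fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return beta, r_squared

def brier(beliefs, outcomes):
    """Mean squared error between stated probabilities and 0/1 resolutions."""
    return fmean([(y - b) ** 2 for b, y in zip(beliefs, outcomes)])

# Toy data: posteriors pulled back toward 0.5, so the slope is negative.
priors = [0.2, 0.4, 0.6, 0.8]
posteriors = [0.3, 0.45, 0.55, 0.7]
outcomes = [0, 0, 1, 1]

slope, lin_r2 = martingale_fit(priors, posteriors)
brier_post = brier(posteriors, outcomes)
```

On this toy data the slope comes out at about −0.35, i.e. updates regress toward 0.5, which is the same sign as the self-report slopes in the eval table.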
Training runs
| Date (UTC) | Run ID | Recipe | Base | LoRA output | Notes / Slack |
|---|---|---|---|---|---|
| 2026-05-24 | A-on-base | KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps) | Qwen3-32B (raw) | epoch_0 1.07 GB; not published — null result | base-model verdict |
| 2026-05-23 | A-on-B-distilled | KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps) | Qwen3-32B + biased_distill_lora merged | epoch_0__A_on_B_klsafe_info_term__2026-05-23.tar (sha256 6ba3077d…) | headline verdict — first statistically validated improvement |
| 2026-05-21 | B-distill-LoRA | LoRA SFT on 352 confirmation-biased reasoning traces, 2 epochs, LR=1e-4, train loss 0.62 → 0.41 | Qwen3-32B (raw) | biased_distill_lora.tar (sha256 0580159e…) | Produces the deliberately biased B model. |
Eval runs (all at N=2000, same SHUFFLE_SEED=42, same forecasting question pool)
| Date (UTC) | Run ID | Model evaluated | n usable | Brier(prior) | Brier(post) | linR² (signed) | slope (signed) | Notes |
|---|---|---|---|---|---|---|---|---|
| 2026-05-24 | MenoClaw-AonBase-MergedN2k-2026-05-24 | merged A-on-base | 2000 | 0.2287 | 0.1843 | — | −0.3279 | A arm of the base ablation |
| 2026-05-24 | MenoClaw-Base-FreshN2k-2026-05-24 | Qwen3-32B (raw) | 1995 | 0.2163 | 0.1850 | — | −0.3276 | B arm of the base ablation |
| 2026-05-23 | MenoClaw-AonB-MergedN2k-2026-05-23 | merged A-on-B-distilled | 1881 | 0.2167 | 0.1925 | 0.125 | −0.2271 | A arm of the B-distilled head-to-head |
| 2026-05-23 | MenoClaw-Bdistill-MergedN2k-2026-05-23 | merged B-distill | 2000 | 0.2338 | 0.2331 | 0.190 | −0.3979 | B arm of the B-distilled head-to-head |
Head-to-head results
| Comparison | n_paired | Brier(prior) Δ p | Brier(post) Δ p | slope-diff β₃ p | Read |
|---|---|---|---|---|---|
| A-on-B vs B-only (5/23) | 1881 | 4.9e−03 (A worse) | 2.86e−09 (A better) | 3.6e−13 (A closer to 0) | First validated win. The recipe undoes B-distill’s confirmation bias. |
| A-on-base vs base (5/24) | 1959 | 4.9e−03 (A worse) | 0.77 (n.s.) | 0.99 (n.s.) | Clean null. The recipe doesn’t move the unbiased base. |
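
The Brier p-values above come from a paired comparison on the same questions. A rough sketch of what such a test looks like is below. It is an assumption-laden illustration: it uses a normal approximation on the per-question Brier differences, which may not be the test that scripts/analysis/martingale_brier_paired.py actually runs, and the name `paired_brier_test` is hypothetical. The slope-diff β₃ column is not covered here.

```python
import math
from statistics import fmean, stdev

def paired_brier_test(beliefs_a, beliefs_b, outcomes):
    """Return (mean Brier difference A minus B, two-sided p-value).

    Assumption: a z-test on per-question differences, reasonable for n in the
    thousands; the project's script may use a different paired test.
    """
    diffs = [(y - a) ** 2 - (y - b) ** 2
             for a, b, y in zip(beliefs_a, beliefs_b, outcomes)]
    mean_diff = fmean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    if se == 0:
        # Degenerate case: every question has the same difference.
        return mean_diff, (1.0 if mean_diff == 0 else 0.0)
    z = mean_diff / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return mean_diff, p_value

# Toy data: arm A is more confident in the right direction on every question.
beliefs_a = [0.8, 0.7, 0.3, 0.2, 0.9, 0.1]
beliefs_b = [0.6, 0.6, 0.4, 0.4, 0.7, 0.3]
outcomes = [1, 1, 0, 0, 1, 0]

mean_diff, p_value = paired_brier_test(beliefs_a, beliefs_b, outcomes)
```

A negative mean difference means arm A has the lower (better) Brier score. Pairing matters because both arms see the same questions, so question difficulty cancels out of the difference.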
R1-Distill paper-replication eval (2026-05-24, 437-q paper subset)
| Date (UTC) | Run ID | Model | Pipeline | n usable | Brier(prior) | Brier(post) | MS (R²) | slope | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2026-05-24 | r1_distill_eval_A_paper_subset_437 | DeepSeek-R1-Distill-Qwen-32B | paper-era (CoT + GPT-4o judge step-based) | 437 | 0.241 | 0.242 | 0.001 | +0.023 | Replicates paper’s full-R1 no-prompt MS = 0.021 within noise. |
| 2026-05-24 | r1_distill_eval_B_paper_subset_437 | DeepSeek-R1-Distill-Qwen-32B | current (DirectInf + self-report) | 435 | 0.233 | 0.227 | 0.281 | −0.415 | Same model, same eval — different pipeline gives MS 270× larger. |
Paper-pipeline re-extraction on existing arms (2026-05-25, judge: GPT-4o-mini)
Re-judged the existing 2026-05-23/24 reasoning traces step by step with the paper-era prompt; no GPU needed.
| Arm | n | slope | MS (R²) | mean \|Δ\| | inertia | Brier(prior) | Brier(post) | ΔBrier |
|---|---|---|---|---|---|---|---|---|
| Base | 2000 | +0.096 | 0.018 | 0.125 | 0.180 | 0.194 | 0.187 | −0.008 |
| AonBase | 2000 | +0.117 | 0.015 | 0.173 | 0.110 | 0.198 | 0.186 | −0.013 |
| B-distill | 2000 | +0.152 | 0.035 | 0.143 | 0.195 | 0.221 | 0.233 | +0.012 |
| AonB | 2000 | +0.117 | 0.015 | 0.177 | 0.116 | 0.204 | 0.193 | −0.011 |
Recipe effect (B-distill → AonB) survives the pipeline change: MS 0.035 → 0.015 (Δ = −0.020, ~58% reduction). Compare self-report-pipeline numbers in the eval table above (0.190 → 0.125, ~34%). Direction and ordering agree; magnitude does not. Sign of slope flips — paper pipeline shows extrapolation (positive slope), self-report shows regression (negative slope), on the same traces.
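The relative reductions quoted above can be re-derived from the rounded table entries as a quick check. The helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

# MS (R²) for B-distill -> AonB, taken from the tables in this ledger.
paper_pipeline = relative_reduction(0.035, 0.015)
self_report_pipeline = relative_reduction(0.190, 0.125)

print(f"paper pipeline: {paper_pipeline:.1%}")              # 57.1%
print(f"self-report pipeline: {self_report_pipeline:.1%}")  # 34.2%
```

The rounded entries give about 57% for the paper pipeline, slightly below the quoted ~58%; the difference presumably comes from using unrounded MS values.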
Net interpretation
KL-safe + info-term works as an undistiller — it reduces bias when bias has been added artificially, but does not reduce bias intrinsic to the base. Important framing for the preprint: the win is real but the mechanism is narrower than “training reduces confirmation bias.”
Pending ablations (per Tianyi’s followup list)
- Larger training set + more rounds — use the full N=7500 split, multiple epochs.
- Pushing B-distill into positive Martingale-slope territory — distillation didn’t move the slope; investigate whether confirmation bias should manifest as positive slope or only as worsened Brier.
- Replication by a different agent / fresh codebase — sanity check the 5/23 win.
- KL-safe-only ablation on B: disentangle whether the info-term is doing the work, or KL-safe alone is sufficient under correct adapter selection.
Excluded (invalidated by bugs — kept here for honesty, not for analysis)
These runs cannot be used as evidence:
| Date | Run ID(s) | Bug |
|---|---|---|
| 2026-05-21 | MenoClaw-Ainfo-MergedN2k-2026-05-21 (original A info-term eval) | Training broken by an uncommitted lora_path kwarg passed to sgl.gen() (sglang 0.5.6.post2 doesn’t accept it); every in-training inference call raised. Eval ran on a broken/no-op’d adapter. |
| 2026-05-21 | MenoClaw-KLsafe-MergedN2k-2026-05-21 | Silent-base-eval bug — --lora-paths loaded the adapter but per-request lora_path wasn’t being applied, so eval served base. “Drift floor” was base-vs-base. |
| 2026-05-18 | MenoClaw-OptionA-*-N9961-2026-05-18, MenoClaw-ATC-*-2026-05-18, MenoClaw-SameDayBase-* | Same bug period as above; pre-dates the HTTP-direct fix. |
| 2026-04-01 → 2026-05-17 | Older “drift floor” runs in data/runs/batch-martingale-training/ | All eval-served-base. Any “training does nothing” verdict from this period is the silent-base bug, not a property of the recipe. |
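
Since most of the rows above trace back to an eval that silently served the base model, one cheap guard is to confirm that the adapter-served endpoint actually produces different output from base before trusting any numbers. The sketch below is hypothetical and is not the project's HTTP-direct fix: `generate_base` and `generate_adapter` stand in for whatever callables wrap the two serving paths, and it assumes deterministic decoding so that identical weights give identical text.

```python
def adapter_changes_output(generate_base, generate_adapter, probes, min_changed=1):
    """Return True if the adapter path differs from base on enough probe prompts.

    Assumptions: both callables map a prompt string to a completion string,
    and decoding is deterministic (e.g. temperature 0).
    """
    changed = sum(1 for prompt in probes
                  if generate_base(prompt) != generate_adapter(prompt))
    return changed >= min_changed

# Stub backends for illustration only.
def base_model(prompt):
    return prompt.upper()

def silently_ignored_adapter(prompt):
    # Mimics the silent-base bug: the adapter is loaded but never applied.
    return prompt.upper()

def working_adapter(prompt):
    return prompt.upper() + "!"

probes = ["will it rain tomorrow?", "will the index close higher?"]

bug_detected = not adapter_changes_output(base_model, silently_ignored_adapter, probes)
adapter_ok = adapter_changes_output(base_model, working_adapter, probes)
```

A check like this would fail fast on a base-vs-base comparison, although it cannot tell a correct adapter from a wrong one; it only catches the no-op case.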
Convention going forward
- Every new training run gets a row here on the day of completion, with links to the LoRA artifact (GitHub Release) and the Slack post containing the verdict.
- Every new eval run gets a row in the eval table with the headline numbers.
- Head-to-head comparisons get a row in the comparison table with paired p-values from scripts/analysis/martingale_brier_paired.py.
- Runtime metrics for active runs are mirrored to wandb.ai/oh-alignment/martingale-training (wiring in flight).
- Invalidated runs go to the excluded section with the bug named, never deleted — the audit trail is part of the doc.
Maintained by MenoClaw; questions / corrections in the project-martingale Slack.