Martingale-training runs ledger

Every valid training and eval run for the Martingale project, collected in one place in reverse-chronological order. Runs invalidated by infrastructure bugs are excluded from the main tables (see the Excluded section at the bottom). Each row links to the artifact / Slack thread / repo where the numbers live.

Definitions are on the project page and in meno-sh/Martingale-Training/REPLICATION.md. “Signed linR²” = R² of the OLS fit Δ ≈ β·(prior − 0.5); “slope” = β. “Brier(post)” = mean of (y_resolve − posterior_belief)² after full reasoning. Lower is better for both Brier and absolute slope; a slope of 0 corresponds to the perfect Martingale property.
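The two headline metrics can be sketched in a few lines of plain Python. This is only an illustration of the definitions above, not the project's analysis code: it assumes Δ means posterior − prior, and it includes an intercept in the OLS fit even though the formula above shows only the slope term. REPLICATION.md remains the authoritative definition. The function names and toy data are hypothetical.

```python
def brier(outcomes, beliefs):
    """Mean squared error between resolved outcomes (0/1) and stated beliefs."""
    return sum((y - p) ** 2 for y, p in zip(outcomes, beliefs)) / len(outcomes)


def signed_fit(priors, posteriors):
    """OLS fit of delta = posterior - prior on (prior - 0.5).

    Returns (slope, r_squared). ASSUMPTION: an intercept is fitted; the
    ledger's formula only shows the slope term.
    """
    xs = [p - 0.5 for p in priors]
    deltas = [q - p for p, q in zip(priors, posteriors)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_d = sum(deltas) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, deltas))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_x
    ss_res = sum((d - (intercept + slope * x)) ** 2 for x, d in zip(xs, deltas))
    ss_tot = sum((d - mean_d) ** 2 for d in deltas)
    return slope, 1.0 - ss_res / ss_tot


# Toy data: every belief regresses 30% of the way back toward 0.5.
priors = [0.1, 0.3, 0.5, 0.7, 0.9]
posteriors = [p - 0.3 * (p - 0.5) for p in priors]
outcomes = [0, 0, 1, 1, 1]

slope, r_squared = signed_fit(priors, posteriors)
brier_post = brier(outcomes, posteriors)
print(f"slope={slope:.3f} linR2={r_squared:.3f} Brier(post)={brier_post:.4f}")
```

On this toy data the slope is negative because each update pulls the belief back toward 0.5, which is the "regression" pattern referred to later in this ledger.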


Training runs

Date (UTC) | Run ID | Recipe | Base | LoRA output | Notes / Slack
2026-05-24 | A-on-base | KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps) | Qwen3-32B (raw) | epoch_0 1.07 GB; not published (null result) | base-model verdict
2026-05-23 | A-on-B-distilled | KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps) | Qwen3-32B + biased_distill_lora merged | epoch_0__A_on_B_klsafe_info_term__2026-05-23.tar (sha256 6ba3077d…) | headline verdict: first statistically validated improvement
2026-05-21 | B-distill-LoRA | LoRA SFT on 352 confirmation-biased reasoning traces, 2 epochs, LR=1e-4, train loss 0.62 → 0.41 | Qwen3-32B (raw) | biased_distill_lora.tar (sha256 0580159e…) | Produces the deliberately biased B model.

Eval runs (all at N=2000, same SHUFFLE_SEED=42, same forecasting question pool)

Date (UTC) | Run ID | Model evaluated | n usable | Brier(prior) | Brier(post) | linR² (signed) | slope (signed) | Notes
2026-05-24 | MenoClaw-AonBase-MergedN2k-2026-05-24 | merged A-on-base | 2000 | 0.2287 | 0.1843 |  | −0.3279 | A arm of the base ablation
2026-05-24 | MenoClaw-Base-FreshN2k-2026-05-24 | Qwen3-32B (raw) | 1995 | 0.2163 | 0.1850 |  | −0.3276 | B arm of the base ablation
2026-05-23 | MenoClaw-AonB-MergedN2k-2026-05-23 | merged A-on-B-distilled | 1881 | 0.2167 | 0.1925 | 0.125 | −0.2271 | A arm of the B-distilled head-to-head
2026-05-23 | MenoClaw-Bdistill-MergedN2k-2026-05-23 | merged B-distill | 2000 | 0.2338 | 0.2331 | 0.190 | −0.3979 | B arm of the B-distilled head-to-head

Head-to-head results

Comparison | n_paired | Brier(prior) Δ p | Brier(post) Δ p | slope-diff β₃ p | Read
A-on-B vs B-only (5/23) | 1881 | 4.9e−03 (A worse) | 2.86e−09 (A better) | 3.6e−13 (A closer to 0) | First validated win. The recipe undoes B-distill’s confirmation bias.
A-on-base vs base (5/24) | 1959 | 4.9e−03 (A worse) | 0.77 (n.s.) | 0.99 (n.s.) | Clean null. The recipe doesn’t move the unbiased base.
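The paired p-values in this table come from scripts/analysis/martingale_brier_paired.py, whose internals are not reproduced in this ledger. As a rough illustration of what a paired comparison on per-question Brier scores involves, here is a sign-flip permutation test in plain Python. The choice of test, the function name, and the toy inputs are all assumptions for illustration; the script may well use a different method.

```python
import random


def paired_signflip_pvalue(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation p-value for mean(scores_a - scores_b).

    scores_a / scores_b are per-question scores for the two arms, aligned so
    that index i refers to the same question in both lists.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equal-length, aligned inputs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have
        # either sign, so flip signs at random and recompute the mean.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy example: arm A has uniformly lower per-question Brier than arm B.
toy_a = [0.1] * 30
toy_b = [0.3] * 30
p_toy = paired_signflip_pvalue(toy_a, toy_b)
print(f"p = {p_toy:.4f}")
```

Only questions usable in both arms can be paired, which is why n_paired can be smaller than either arm's n usable.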

R1-Distill paper-replication eval (2026-05-24, 437-q paper subset)

Date (UTC) | Run ID | Model | Pipeline | n usable | Brier(prior) | Brier(post) | MS (R²) | slope | Notes
2026-05-24 | r1_distill_eval_A_paper_subset_437 | DeepSeek-R1-Distill-Qwen-32B | paper-era (CoT + GPT-4o judge step-based) | 437 | 0.241 | 0.242 | 0.001 | +0.023 | Replicates paper’s full-R1 no-prompt MS = 0.021 within noise.
2026-05-24 | r1_distill_eval_B_paper_subset_437 | DeepSeek-R1-Distill-Qwen-32B | current (DirectInf + self-report) | 435 | 0.233 | 0.227 | 0.281 | −0.415 | Same model, same eval; a different pipeline gives MS 270× larger.

Paper-pipeline re-extraction on existing arms (2026-05-25, judge: GPT-4o-mini)

Re-judged the 2026-05-23/24 reasoning traces with the paper-era step-based prompt; no GPU needed.

Arm | n | slope | MS (R²) | mean|Δ| | inertia | Brier(prior) | Brier(post) | ΔBrier
Base | 2000 | +0.096 | 0.018 | 0.125 | 0.180 | 0.194 | 0.187 | −0.008
AonBase | 2000 | +0.117 | 0.015 | 0.173 | 0.110 | 0.198 | 0.186 | −0.013
B-distill | 2000 | +0.152 | 0.035 | 0.143 | 0.195 | 0.221 | 0.233 | +0.012
AonB | 2000 | +0.117 | 0.015 | 0.177 | 0.116 | 0.204 | 0.193 | −0.011

The recipe effect (B-distill → AonB) survives the pipeline change: MS 0.035 → 0.015 (Δ = −0.020, ~58% reduction). Compare the self-report-pipeline numbers in the eval table above (0.190 → 0.125, ~34%). Direction and ordering agree; magnitude does not. The sign of the slope also flips: on the same traces, the paper pipeline shows extrapolation (positive slope) while self-report shows regression (negative slope).
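The relative reductions quoted above are simple ratios, and can be rechecked from the table values with a small helper. The function name is hypothetical, and the inputs are the rounded three-decimal values shown in this ledger, so the results can differ by a point or so from figures computed on unrounded numbers.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.25 means a 25% reduction)."""
    return (before - after) / before


paper_pipeline = relative_reduction(0.035, 0.015)  # B-distill -> AonB, paper pipeline
self_report = relative_reduction(0.190, 0.125)     # B-distill -> AonB, self-report pipeline

print(f"paper pipeline: {paper_pipeline:.1%}")
print(f"self-report:    {self_report:.1%}")
```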

Net interpretation

KL-safe + info-term works as an undistiller: it reduces bias that has been added artificially, but it does not reduce bias intrinsic to the base. This framing matters for the preprint: the win is real, but the mechanism is narrower than “training reduces confirmation bias.”

Pending ablations (per Tianyi’s followup list)

  1. Larger training set + more rounds — use the full N=7500 split, multiple epochs.
  2. Pushing B-distill into positive Martingale-slope territory — distillation didn’t move the slope; investigate whether confirmation bias should manifest as positive slope or only as worsened Brier.
  3. Replication by a different agent / fresh codebase — sanity check the 5/23 win.
  4. KL-safe only ablation on B — disentangle whether the info-term is doing the work, or KL-safe alone is sufficient under correct adapter selection.

Excluded (invalidated by bugs — kept here for honesty, not for analysis)

These runs cannot be used as evidence:

Date | Run ID(s) | Bug
2026-05-21 | MenoClaw-Ainfo-MergedN2k-2026-05-21 (original A info-term eval) | Training broken by uncommitted lora_path kwarg into sgl.gen() (sglang 0.5.6.post2 doesn’t accept it); every in-training inference raised. Eval ran on a broken/no-op’d adapter.
2026-05-21 | MenoClaw-KLsafe-MergedN2k-2026-05-21 | Silent-base-eval bug: --lora-paths loaded the adapter but per-request lora_path wasn’t being applied, so eval served base. “Drift floor” was base-vs-base.
2026-05-18 | MenoClaw-OptionA-*-N9961-2026-05-18, MenoClaw-ATC-*-2026-05-18, MenoClaw-SameDayBase-* | Same bug period as above; pre-dates the HTTP-direct fix.
2026-04-01 → 2026-05-17 | Older “drift floor” runs in data/runs/batch-martingale-training/ | All eval-served-base. Any “training does nothing” verdict from this period is the silent-base bug, not a property of the recipe.
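The silent-base bug in the table above is the kind of failure where an eval quietly serves the base model while reporting adapter results. One cheap guard, sketched here under assumptions, is to compare base and adapter completions on a few prompts before trusting an eval. In this sketch generate_base and generate_adapter are hypothetical callables wrapping whatever serving endpoint is in use (they are not part of sglang or of this repo), and the check assumes deterministic decoding so that any difference reflects the adapter rather than sampling noise.

```python
def assert_adapter_applied(generate_base, generate_adapter, prompts, min_changed=1):
    """Raise if adapter completions are indistinguishable from base completions.

    generate_base / generate_adapter: hypothetical callables, prompt -> text.
    ASSUMPTION: both use deterministic decoding (e.g. temperature 0).
    Returns the number of prompts whose completions differ.
    """
    changed = sum(1 for p in prompts if generate_base(p) != generate_adapter(p))
    if changed < min_changed:
        raise RuntimeError(
            f"adapter output matched base on {len(prompts) - changed}/{len(prompts)} "
            "prompts; the eval may be serving the base model"
        )
    return changed


# Toy stand-ins for real endpoints, for illustration only.
def fake_base(prompt):
    return prompt.upper()


def fake_adapter(prompt):
    return prompt.upper() + " [adapted]"


n_changed = assert_adapter_applied(fake_base, fake_adapter, ["q1", "q2", "q3"])
print(f"{n_changed} of 3 prompts changed under the adapter")
```

A check like this does not prove the adapter is the right one, only that something other than base is being served.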

Convention going forward

  • Every new training run gets a row here on the day of completion, with links to the LoRA artifact (GitHub Release) and the Slack post containing the verdict.
  • Every new eval run gets a row in the eval table with the headline numbers.
  • Head-to-head comparisons get a row in the comparison table with paired p-values from scripts/analysis/martingale_brier_paired.py.
  • Runtime metrics for active runs are mirrored to wandb.ai/oh-alignment/martingale-training (wiring in flight).
  • Invalidated runs go to the excluded section with the bug named, never deleted — the audit trail is part of the doc.

Maintained by MenoClaw; questions / corrections in the project-martingale Slack.