Martingale-training runs ledger

Every valid training/eval run for the Martingale project, in one table, in reverse-chronological order. Runs invalidated by infrastructure bugs are excluded (see the dropped list at the bottom). Each row links to the artifact / Slack thread / repo where the numbers live.

Definitions are in the project page and in `meno-sh/Martingale-Training/REPLICATION.md`. “Signed linR²” = R² of the OLS fit Δ ≈ β·(prior − 0.5); “slope” = β. “Brier(post)” = mean of (y_resolve − posterior_belief)² after full reasoning. Lower is better for both Brier and absolute slope; a slope of 0 corresponds to the perfect Martingale property.
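
As a rough illustration of these definitions, the sketch below computes the slope, signed linR², and both Brier scores from per-question arrays. It rests on assumptions not stated in this ledger: Δ is taken to be posterior − prior, the fit has no intercept (as the formula is written), and R² is the uncentered form usual for through-origin fits. The function name `martingale_metrics` and its return keys are hypothetical; `REPLICATION.md` remains the authoritative definition.

```python
import numpy as np


def martingale_metrics(prior, posterior, outcome):
    """Slope, linear R^2, and Brier scores for one eval run (illustrative)."""
    prior = np.asarray(prior, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    outcome = np.asarray(outcome, dtype=float)

    delta = posterior - prior   # ASSUMPTION: belief update is posterior - prior
    x = prior - 0.5             # centred prior, as in the ledger's formula

    sxx = float(x @ x)
    if sxx == 0.0:
        raise ValueError("all priors equal 0.5; slope is undefined")

    # ASSUMPTION: OLS through the origin, delta ~ beta * x (no intercept).
    beta = float(x @ delta) / sxx
    resid = delta - beta * x

    # ASSUMPTION: uncentered R^2, 1 - SS_res / sum(delta^2).
    ss_tot = float(delta @ delta)
    lin_r2 = 1.0 - float(resid @ resid) / ss_tot if ss_tot > 0 else 0.0

    return {
        "slope": beta,
        "lin_r2": lin_r2,
        "brier_prior": float(np.mean((outcome - prior) ** 2)),
        "brier_post": float(np.mean((outcome - posterior) ** 2)),
    }
```

A negative slope means updates move beliefs back toward 0.5 in proportion to how far the prior was from it; a positive slope means updates push further in the prior's direction.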

Training runs

Date (UTC)	Run ID	Recipe	Base	LoRA output	Notes / Slack
2026-05-24	A-on-base	KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps)	Qwen3-32B (raw)	`epoch_0` 1.07 GB; not published — null result	base-model verdict
2026-05-23	A-on-B-distilled	KL-safe + info-term (β=1, INFO_COEF=1, INFO_EPS=0.02, LR=1e-4, 558 steps)	Qwen3-32B + biased_distill_lora merged	`epoch_0__A_on_B_klsafe_info_term__2026-05-23.tar` (sha256 `6ba3077d…`)	headline verdict — first statistically validated improvement
2026-05-21	B-distill-LoRA	LoRA SFT on 352 confirmation-biased reasoning traces, 2 epochs, LR=1e-4, train loss 0.62 → 0.41	Qwen3-32B (raw)	`biased_distill_lora.tar` (sha256 `0580159e…`)	Produces the deliberately biased B model.

Eval runs (all at N=2000, same SHUFFLE_SEED=42, same forecasting question pool)

Date (UTC)	Run ID	Model evaluated	n usable	Brier(prior)	Brier(post)	linR² (signed)	slope (signed)	Notes
2026-05-24	`MenoClaw-AonBase-MergedN2k-2026-05-24`	merged `A-on-base`	2000	0.2287	0.1843	—	−0.3279	A arm of the base ablation
2026-05-24	`MenoClaw-Base-FreshN2k-2026-05-24`	Qwen3-32B (raw)	1995	0.2163	0.1850	—	−0.3276	B arm of the base ablation
2026-05-23	`MenoClaw-AonB-MergedN2k-2026-05-23`	merged `A-on-B-distilled`	1881	0.2167	0.1925	0.125	−0.2271	A arm of the B-distilled head-to-head
2026-05-23	`MenoClaw-Bdistill-MergedN2k-2026-05-23`	merged `B-distill`	2000	0.2338	0.2331	0.190	−0.3979	B arm of the B-distilled head-to-head

Head-to-head results

Comparison	n_paired	Brier(prior) Δ p	Brier(post) Δ p	slope-diff β₃ p	Read
A-on-B vs B-only (5/23)	1881	4.9e−03 (A worse)	2.86e−09 (A better)	3.6e−13 (A closer to 0)	First validated win. The recipe undoes B-distill’s confirmation bias.
A-on-base vs base (5/24)	1959	4.9e−03 (A worse)	0.77 (n.s.)	0.99 (n.s.)	Clean null. The recipe doesn’t move the unbiased base.
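
For readers unfamiliar with paired comparisons, here is a minimal sketch of one way a paired Brier p-value could be obtained: restrict to questions usable in both arms, take per-question squared-error differences, and run a sign-flip permutation test. This is not the method in `scripts/analysis/martingale_brier_paired.py`, whose internals are not described in this ledger; the function name and input layout are hypothetical. The slope-diff β₃ column is not covered.

```python
import numpy as np


def paired_brier_test(errs_a, errs_b, n_perm=2000, seed=0):
    """Sign-flip permutation test on paired squared errors (illustrative).

    errs_a / errs_b: dict mapping question_id -> squared error for that arm.
    """
    # Pair only on questions that produced a usable answer in both arms.
    shared = sorted(set(errs_a) & set(errs_b))
    if not shared:
        raise ValueError("no shared questions between arms")

    d = np.array([errs_a[q] - errs_b[q] for q in shared], dtype=float)
    observed = d.mean()  # negative => arm A has the lower (better) Brier

    # Under the null, each paired difference is equally likely to have
    # either sign, so randomly flip signs and recompute the mean.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null_means = (signs * d).mean(axis=1)

    # Two-sided p-value with the +1 correction; floor is 1 / (n_perm + 1).
    extreme = np.sum(np.abs(null_means) >= abs(observed))
    p_value = (1 + extreme) / (n_perm + 1)

    return {"n_paired": len(shared), "mean_diff": float(observed), "p_value": float(p_value)}
```

Because the smallest p-value this can report is 1/(n_perm + 1), it cannot resolve values as small as some in the table above; those would need a parametric test or a far larger permutation count.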

R1-Distill paper-replication eval (2026-05-24, 437-q paper subset)

Date (UTC)	Run ID	Model	Pipeline	n usable	Brier(prior)	Brier(post)	MS (R²)	slope	Notes
2026-05-24	`r1_distill_eval_A_paper_subset_437`	DeepSeek-R1-Distill-Qwen-32B	paper-era (CoT + GPT-4o judge step-based)	437	0.241	0.242	0.001	+0.023	Replicates paper’s full-R1 no-prompt MS = 0.021 within noise.
2026-05-24	`r1_distill_eval_B_paper_subset_437`	DeepSeek-R1-Distill-Qwen-32B	current (DirectInf + self-report)	435	0.233	0.227	0.281	−0.415	Same model, same eval — different pipeline gives MS 270× larger.

Paper-pipeline re-extraction on existing arms (2026-05-25, judge: GPT-4o-mini)

Re-judged the 2026-05-23 and 2026-05-24 reasoning traces with the paper-era step-based prompt; no GPU needed.

Arm	n	slope	MS (R²)	mean|Δ|	inertia	Brier(prior)	Brier(post)	ΔBrier
Base	2000	+0.096	0.018	0.125	0.180	0.194	0.187	−0.008
AonBase	2000	+0.117	0.015	0.173	0.110	0.198	0.186	−0.013
B-distill	2000	+0.152	0.035	0.143	0.195	0.221	0.233	+0.012
AonB	2000	+0.117	0.015	0.177	0.116	0.204	0.193	−0.011

Recipe effect (B-distill → AonB) survives the pipeline change: MS 0.035 → 0.015 (Δ = −0.020, ~58% reduction). Compare the self-report-pipeline linR² in the eval table above (0.190 → 0.125, ~34% reduction). Direction and ordering agree; magnitude does not. The sign of the slope also flips on the same traces: the paper pipeline shows extrapolation (positive slope), while self-report shows regression (negative slope).
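
The relative reductions can be rechecked from the rounded table values with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical, and because the inputs are rounded to three decimals, the last digit may differ from figures derived from unrounded data.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.25 means a 25% reduction)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (before - after) / before


# B-distill -> AonB under each pipeline, using the rounded values in this ledger.
paper_pipeline = relative_reduction(0.035, 0.015)   # MS (R²), paper pipeline
self_report = relative_reduction(0.190, 0.125)      # signed linR², self-report

print(f"paper pipeline: {paper_pipeline:.1%}")
print(f"self-report:    {self_report:.1%}")
```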

Net interpretation

KL-safe + info-term works as an undistiller — it reduces bias when bias has been added artificially, but does not reduce bias intrinsic to the base. Important framing for the preprint: the win is real but the mechanism is narrower than “training reduces confirmation bias.”

Pending ablations (per Tianyi’s followup list)

Larger training set + more rounds — use the full N=7500 split, multiple epochs.
Pushing B-distill into positive Martingale-slope territory: distillation did not turn the self-report slope positive (−0.3979 for B-distill vs −0.3276 for raw base in the eval table); investigate whether confirmation bias should manifest as positive slope or only as worsened Brier.
Replication by a different agent / fresh codebase — sanity check the 5/23 win.
KL-safe only ablation on B — disentangle whether the info-term is doing the work, or KL-safe alone is sufficient under correct adapter selection.

Excluded (invalidated by bugs — kept here for honesty, not for analysis)

These runs cannot be reasoned about as evidence:

Date	Run ID(s)	Bug
2026-05-21	`MenoClaw-Ainfo-MergedN2k-2026-05-21` (original A info-term eval)	Training broken by an uncommitted `lora_path` kwarg passed to `sgl.gen()` (sglang 0.5.6.post2 doesn’t accept it); every in-training inference call raised an exception. Eval ran on a broken/no-op’d adapter.
2026-05-21	`MenoClaw-KLsafe-MergedN2k-2026-05-21`	Silent-base-eval bug — `--lora-paths` loaded the adapter but per-request `lora_path` wasn’t being applied, so eval served base. “Drift floor” was base-vs-base.
2026-05-18	`MenoClaw-OptionA--N9961-2026-05-18`, `MenoClaw-ATC--2026-05-18`, `MenoClaw-SameDayBase-*`	Same bug period as above; pre-dates the HTTP-direct fix.
2026-04-01 → 2026-05-17	Older “drift floor” runs in `data/runs/batch-martingale-training/`	All eval-served-base. Any “training does nothing” verdict from this period is the silent-base bug, not a property of the recipe.
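
One generic guard against a repeat of the silent-base-eval failure is to compare completions from the base model and the supposedly adapted model on a handful of fixed prompts before any full eval. The sketch below is a hypothetical check, not part of the project's tooling: it assumes both sets of completions were generated with deterministic decoding on the same prompts, and the function name and threshold are illustrative.

```python
def adapter_changes_outputs(base_outputs, adapter_outputs, min_changed_frac=0.05):
    """Return True if the adapter visibly changes completions (illustrative).

    base_outputs / adapter_outputs: lists of completion strings for the same
    prompts, in the same order.
    ASSUMPTION: deterministic decoding, so identical text implies the same
    weights were effectively served for both requests.
    """
    if not base_outputs or len(base_outputs) != len(adapter_outputs):
        raise ValueError("need two non-empty lists of equal length")

    # Count prompts whose completion differs between the two servings.
    changed = sum(b != a for b, a in zip(base_outputs, adapter_outputs))
    return changed / len(base_outputs) >= min_changed_frac
```

If this returns False, the eval is likely serving base weights for both arms, and any resulting "drift floor" would be a base-vs-base comparison.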

Convention going forward

Every new training run gets a row here on the day of completion, with links to the LoRA artifact (GitHub Release) and the Slack post containing the verdict.
Every new eval run gets a row in the eval table with the headline numbers.
Head-to-head comparisons get a row in the comparison table with paired p-values from `scripts/analysis/martingale_brier_paired.py`.
Runtime metrics for active runs are mirrored to wandb.ai/oh-alignment/martingale-training (wiring in flight).
Invalidated runs go to the excluded section with the bug named, never deleted — the audit trail is part of the doc.

Maintained by MenoClaw; questions / corrections in the project-martingale Slack.
