Train LLMs to minimise a user’s Martingale deviation (confirmation-bias signature) — the path to a ‘PPO/STaR-equivalent’ result. Preprint due end of month.

Status: active · People: Zhonghao He, Tianyi Alex Qiu, Ahmed Ismail

Open Problems

  • The recipe (KL-safe reward + info-term) reduces MS on a deliberately bias-distilled base (B-distill → A-on-B: under the self-report pipeline MS(R²) 0.190 → 0.125 with paired Brier Δ = −0.041, p = 2.86e-09; under the paper pipeline MS(R²) 0.035 → 0.015) but does not move it on the Qwen3-32B base. There is no Bayesian floor here: a rational Bayesian has MS(slope) = 0 and MS(R²) = 0 by the martingale property, so non-zero MS reflects irrationality, not a structural limit. The open question is why the recipe fails to make productive updates on the unbiased base. Candidate explanations: the gradient is too small, the bias structure is too small to detect, or the reward shape is wrong for this regime.
  • Pipeline disagreement (newly characterised, 2026-05-25): the slope of Δ regressed on (prior − 0.5) flips sign between the current (self-report) and paper pipelines on the same runs. Self-report pipeline: slope ≈ −0.33 across Base/AonBase/AonB (regression toward 0.5). Paper pipeline (GPT-4o-mini judge on the same reasoning traces): slope ≈ +0.10 to +0.15 (extrapolation away from 0.5). Both pipelines agree on the ordering of the four arms (B-distill is the most-predictable outlier; the recipe pulls it back) and on the direction of the recipe effect (un-distillation), but the magnitude and sign of the underlying slope are pipeline-dependent.
  • Inertia regression: A-on-B raises the |Δ| = 0 fraction from 0.18 to 0.25 (+6.6pp) under self-report. The info-term works only partly: A trades some “biased update” for “no update” rather than for “honest update”.
  • “Push to positive MS” experiments (INFO_COEF=0 ablation, iterated rounds, stronger info-term, naturally-biased base) are designed and budgeted (~$14 each on H100 ×2 spot) but currently blocked on two things. First, the warm H200 is STOPPED, so the B-distill artifacts (merged_b_distill, biased_distill_lora, seed sft-data.json) are unreachable. Second, “positive MS” still needs disambiguating: it could mean MS strictly > 0 (always true) or a flipped slope sign, and the answer depends on which pipeline you measure with.
  • Paper-cited full DeepSeek-R1 (671B) reaches no-prompt MS = 0.0207, within a factor of ~1.7 of our arms under the paper pipeline (0.015 / 0.035 / 0.018; fourth arm value not listed). No obvious hard wall.
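
For reference, the two MS variants discussed above can be computed from per-question (prior, posterior) pairs. This is a minimal sketch, assuming MS(slope) is the OLS slope of Δ = posterior − prior on (prior − 0.5) and MS(R²) is the R² of that same regression. The function names are hypothetical, and the repo's own implementation may differ.

```python
from statistics import mean


def martingale_score(priors, posteriors):
    """Regress the update (posterior - prior) on the centred prior (prior - 0.5).

    Returns (slope, r2). Both are 0.0 when either variable has no variance,
    which is a convention chosen for this sketch.
    """
    x = [p - 0.5 for p in priors]
    y = [q - p for p, q in zip(priors, posteriors)]
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, r2


def zero_update_fraction(priors, posteriors, tol=1e-12):
    """Fraction of questions where the belief did not move (|delta| = 0)."""
    flat = sum(abs(q - p) <= tol for p, q in zip(priors, posteriors))
    return flat / len(priors)


if __name__ == "__main__":
    # Toy data: every update moves one third of the way back toward 0.5.
    priors = [0.2, 0.4, 0.6, 0.8]
    posteriors = [p - (p - 0.5) / 3 for p in priors]
    print(martingale_score(priors, posteriors))
    print(zero_update_fraction(priors, posteriors))
```

A negative slope corresponds to regression toward 0.5 and a positive slope to extrapolation away from 0.5, which is the sign the two pipelines disagree on.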

Progress

  • Minimal reproducible codebase shipped: Martingale-Training repo (2026-05-21).
  • HTTP-direct LoRA inference (2026-05-23): replaces the per-request lora_path kwarg that sglang==0.5.6.post2’s sgl.gen() silently ignored, which meant every prior “trained” eval was silently evaluating the base model. Branch zhonghao-local, commits b97f544 + 4af4cb7, merged via PR #1.
  • KL-to-base anchor + content-token credit attribution implemented & verified.
  • B-distill + KL-safe + info-term run (2026-05-23): first true post-fix verdict. A-on-B vs B-only, N = 2000: signed linR² dropped 34% (0.190 → 0.125), slope shrank from −0.396 to −0.227 (~43% recovery toward 0), Brier(post) improved by Δ = −0.041 (paired t = −5.97, p = 2.86e-09).
  • Distill-set vs not-distill split (2026-05-24) — A’s Brier improvement is uniform across memorized (n = 377) vs novel (n = 1504) questions. The recipe is debiasing on a broader question set than B was distilled on, not a memorization-flip. Script: scripts/analysis/split_analysis.py.
  • “Empirical Bayesian floor” sim attempt (2026-05-24, retracted 2026-05-25) — built a synthetic sim where the agent’s prior was p_true + Gaussian noise, then Bayesian-shrunk on 2 evidence samples; this produced non-zero MS(slope) and was reported as the floor. That framing is wrong (per Tianyi’s correction): the sim’s agent isn’t Bayesian — its prior carries predictable bias by construction. A rational Bayesian’s belief is itself the best posterior given everything known, so updates are martingale-unpredictable: MS(slope) = MS(R²) = 0. No floor exists for the rational baseline. What’s still true and load-bearing: paper full R1 reaches MS = 0.0207 (paper convention = R²), so very low MS is empirically achievable.
  • R1-Distill-Qwen-32B paper-replication (2026-05-24): same model, same 437-q paper subset, no system prompt. eval-A (paper-era CoT + GPT-4o judge, step-based) gives MS = 0.001 (matches paper’s 0.0207 for full R1 within noise); eval-B (current DirectInf + self-report) gives MS = 0.281 on the same model and questions. Pipeline choice dominates model choice by ~270×. Scripts + result JSONs in scripts/eval_r1_distill/.
  • Paper-pipeline re-extraction on Base/AonBase/B-distill/AonB (2026-05-25) — re-judged the existing 2026-05-23/24 reasoning traces via GPT-4o-mini step-based extraction (no GPU, ~$2 in judge calls). Result: the recipe’s un-distillation effect survives the pipeline change (B-distill MS 0.035 → AonB MS 0.015, −0.020 absolute, ~58% relative reduction; vs −0.065 / 34% under self-report). Direction and ordering agree; magnitudes and slope sign do not. Script + per-arm result JSONs in scripts/analysis/judge_reextract.py + scripts/analysis/judge_reextract_results/.
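
The retraction of the “empirical Bayesian floor” above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the retracted sim itself: the Beta(1,1)-Bernoulli setup, the noise scale, and the pseudo-count used for shrinkage are all illustrative choices. It contrasts a rational agent, whose prior is already the posterior mean given what it has seen, with an agent whose prior is the true rate plus Gaussian noise.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """OLS slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def flips(theta, n, rng):
    """Number of successes in n Bernoulli(theta) draws."""
    return sum(rng.random() < theta for _ in range(n))


def simulate(n_questions=20000, n_seen=3, n_new=2, noise_sd=0.2, seed=0):
    """Return (slope_bayesian, slope_noisy) for update regressed on prior - 0.5."""
    rng = random.Random(seed)
    bx, bd, nx, nd = [], [], [], []
    for _ in range(n_questions):
        theta = rng.random()  # true rate, Uniform(0, 1) = Beta(1, 1)
        k_seen = flips(theta, n_seen, rng)  # evidence behind the prior
        k_new = flips(theta, n_new, rng)    # the 2 new evidence samples

        # Rational agent: prior and posterior are exact Beta posterior means.
        prior = (1 + k_seen) / (2 + n_seen)
        post = (1 + k_seen + k_new) / (2 + n_seen + n_new)
        bx.append(prior - 0.5)
        bd.append(post - prior)

        # Noisy agent: prior = truth + Gaussian noise (clipped to [0, 1]),
        # then shrunk toward the new evidence with pseudo-count n_seen.
        noisy_prior = min(1.0, max(0.0, theta + rng.gauss(0.0, noise_sd)))
        noisy_post = (n_seen * noisy_prior + k_new) / (n_seen + n_new)
        nx.append(noisy_prior - 0.5)
        nd.append(noisy_post - noisy_prior)
    return ols_slope(bx, bd), ols_slope(nx, nd)


if __name__ == "__main__":
    slope_bayes, slope_noisy = simulate()
    print(f"rational agent slope: {slope_bayes:+.4f}")
    print(f"noisy-prior agent slope: {slope_noisy:+.4f}")
```

Under these assumptions the rational agent's slope should sit near zero up to sampling noise, while the noisy-prior agent shows a clearly negative slope because its prior errors are partly corrected by the new evidence. That predictable correction is a property of the noisy prior, not a floor that a rational Bayesian would hit.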

Codebase

Martingale-Training

  • 05-25 Zhonghao He — paper-pipeline judge re-extraction on Base/AonBase/B-distill/AonB (effect survives pipeline change)
  • 05-24 Zhonghao He — R1-Distill eval results: MS = 0.001 (paper pipeline) vs 0.281 (current pipeline)
  • 05-24 Zhonghao He — add R1-Distill paper-replication eval drivers + floor analysis
  • 05-24 Zhonghao He — chore(observability): wire wandb online + post-run sync
  • 05-24 Tianyi Qiu — Merge pull request #1 from meno-sh/zhonghao-local
  • 05-24 Zhonghao He — cleanup: REPLICATION.md + analysis scripts + launch scripts + trainer patches
  • 05-23 Zhonghao He — feat: route LocalModel inference through HTTP /generate (lora_path-correct)
  • 05-23 Zhonghao He — feat: HTTP-direct LoRA inference path + standalone canary
  • 05-21 Zhonghao He — docs: add experiments/loss_design.md (Martingale score → trainable objective)
  • 05-21 Zhonghao He — docs(README): current-only rewrite + experiments/ pointer at the top
  • 05-21 Zhonghao He — Initial commit: minimal reproducible codebase

Eval-Reasoning-Consistency

  • 03-20 Zhonghao He — feat: Brier score based on paired t-test

safety-tooling

No recent commits tracked.

Meeting Notes

  • STUB — Circleback link pending gmail-integration approval (Crunch-time planning + meeting notes (2026-05-20))

Slack Discussion

Channel: project-martingale

  • 05-24 Tianyi Alex Qiu: 🦾 reactions on MenoClaw’s results posts (queued during the 2026-05-24 Composio Slack-auth outage, awaiting re-auth).
  • 05-24 Zhonghao He: Move to deepseek-r1 (no-prompt MS +0.0207). Do both eval pipelines (paper step-based + current self-report).
  • 05-24 Zhonghao He: Push the distilled model to positive MS score, or get to know why we could not.
  • 05-23 Tianyi Alex Qiu: Potential priorities for followup experiments (replicate, non-distilled base, full train set, why negative MS on B-distill).
  • 05-20 Tianyi Alex Qiu: MenoClaw, read the meeting notes and extract the main followup experiments.
  • 05-20 Tianyi Alex Qiu: Martingale writeup/preprint is due EOM — crunch-time plan.
  • 05-17 Zhonghao He: Implausible that neither priors nor posteriors move beyond base drift floor.

Open problems & progress are MenoClaw’s reading of the project — edit this file to correct them; the markdown is the source of truth.