the voting ensemble paradox — resolved

You build an ensemble of compression models, each fine-tuned from a different checkpoint. You expect the ensemble to be better than any single model. It's worse.

That's the voting ensemble paradox.


The paradox

We formalized it: under k-of-N drop voting, the ensemble eviction indicator equals the k-th order statistic of the per-voter indicators. The ensemble collapses to its weakest member on every stratum.

In plain terms: if you have 3 models voting on whether to keep or drop each token, and a drop goes through whenever any 2 agree, the ensemble's score is the second-worst model's score. The best model's judgment gets diluted by the weaker ones. Adding more models makes it worse, not better.
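The order-statistic identity is easy to check by brute force. The sketch below is not the paper's proof. It assumes "k-th order statistic" means the k-th largest per-voter drop indicator, and the function names are illustrative, not taken from the kompress code.

```python
from itertools import product


def k_of_n_drop(votes, k):
    """Ensemble drop indicator: 1 when at least k voters vote drop (1)."""
    return 1 if sum(votes) >= k else 0


def kth_largest(votes, k):
    """k-th largest per-voter indicator (k is 1-indexed)."""
    return sorted(votes, reverse=True)[k - 1]


def identity_holds(n, k):
    """Check the identity on every binary vote vector of length n."""
    return all(
        k_of_n_drop(v, k) == kth_largest(v, k)
        for v in product((0, 1), repeat=n)
    )


if __name__ == "__main__":
    # Exhaustive check for small ensembles (assumed range, for illustration).
    for n in range(1, 7):
        for k in range(1, n + 1):
            assert identity_holds(n, k)
    print("k-of-N drop voting matches the k-th largest indicator for N <= 6")
```

For 2-of-3 voting, the k-th largest indicator is the median vote, so the single most accurate voter can always be outvoted by the other two.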

We proved this (Theorem 1 + Corollary 1 + Remark 1) and validated it empirically: an ensemble of v3, v3.1, and v3.2 scored 0.931 heretic exact — worse than v4 alone at 0.967. The ensemble wasn't just not better. It was actively harmful.


The fix

A 3.0× weighted cross-entropy penalty on critical-syntactic tokens — signal names, file paths, exit codes, compiler flags, anything the agent needs intact.
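A minimal sketch of what such a weighted loss could look like, in plain Python. Only the 3.0× factor comes from the post. The regex patterns, the function names, and the weight-sum normalization are assumptions made for illustration, and the real training code may detect critical tokens and reduce the loss differently.

```python
import math
import re

CRITICAL_WEIGHT = 3.0  # the 3.0x penalty named in the post

# Hypothetical patterns for "critical-syntactic" tokens; the real
# labeling rules are not given in the post.
CRITICAL_PATTERNS = [
    re.compile(r"^SIG[A-Z0-9]+$"),   # signal names, e.g. SIGKILL
    re.compile(r"^-{1,2}[A-Za-z]"),  # compiler/CLI flags, e.g. -O2
    re.compile(r"/"),                # file paths, e.g. src/main.c
    re.compile(r"^\d{1,3}$"),        # exit codes, e.g. 137
]


def is_critical(token):
    """True when the token matches any of the assumed critical patterns."""
    return any(p.search(token) for p in CRITICAL_PATTERNS)


def weighted_keep_loss(tokens, keep_probs, keep_labels,
                       critical_weight=CRITICAL_WEIGHT):
    """Binary cross-entropy over keep/drop labels, weighted per token.

    Critical tokens get `critical_weight`, all others get 1.0. The sum
    is normalized by the total weight (an assumed choice).
    """
    total, weight_sum = 0.0, 0.0
    for tok, p, y in zip(tokens, keep_probs, keep_labels):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w = critical_weight if is_critical(tok) else 1.0
        total += w * ce
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    tokens, labels = ["the", "-O2"], [1, 1]
    # Same size of mistake, once on the flag and once on the plain word.
    print(weighted_keep_loss(tokens, [0.9, 0.1], labels))
    print(weighted_keep_loss(tokens, [0.1, 0.9], labels))
```

With this weighting, wrongly dropping `-O2` costs more than wrongly dropping `the`, which pushes the model toward keeping the tokens the agent needs intact.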

Three mechanisms work together:


kompress-v8

The production model: a 149M-param dual-head ModernBERT, trained via C3 self-distillation with a Qwen2.5-7B teacher on 97 carefully labeled pairs at a 33% C3 ratio.

| Metric | Value |
| --- | --- |
| Heretic exact (32 prompts) | 0.955 |
| Keep rate | 0.854 |
| Override delta | 0.000 |
| Agent mk_in_ref (with override) | 1.000 |
| Token savings | 15% |
| Base model | kompress-v2-base |

Model on HuggingFace →


The experiment

17 models trained. 8 teachers. 4 architectures. $38.95 total.

| Version | What we tried | Heretic | Lesson |
| --- | --- | --- | --- |
| v2 | | 0.975 | Precision ceiling |
| v4 | Self-labels | 0.943 | Override internalized |
| v6 | Agent-distribution | 0.962 | Dead end |
| v8 | Qwen2.5 teacher | 0.955 | Production |
| v9 | C3-only | 0.921 | Overfit |
| v11 | Larger encoder | 0.906 | Capacity ≠ precision |
| v14 | Council training | 0.882 | Concept proven |
| v16 | 10× weight | 0.972 | Pareto endpoint |

11 of 17 were dead ends. We published them all. The dead ends are the research.


Open science

The interactive paper is live at kompress.vaked.dev — WebGL neural field background, live paradox simulation, baseline comparison.

ICLR 2027 submission. All code, data, models open source.


This is an inner loop of the ultrawhale project. The outer loop cost $37.19 in DeepSeek API fees for the agent that orchestrated the experiments. The inner loop cost $1.76 in GPU compute on vast.ai RTX 4090s. The whole thing cost less than a conference registration.

Label quality is the bottleneck, not model capacity or data quantity. Loop engineering works. The loop shipped.

— peter


This is the research paper companion post. See also: the kompress heretic eval (full experiment log), the loop shipped (closing essay), LoopKit (starter kit), and the interactive paper.
