The silver label problem: why 28% of numbers don't matter (until they do)

When you fine-tune on silver labels, you're fine-tuning on someone else's decisions about what mattered.

The Kompress v3 fine-tune (previous post) hit keep_rate=0.728, lower than the 0.81 baseline, so it compresses more. But exact_keep_pct landed at 0.882 instead of the 0.95 target: about 12% of must-keep tokens (numbers, error codes, signal names) were still being dropped.

The diagnostic was one Python loop over the test set:

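from collections import Counter

missed = Counter()  # missing must-keep tokens, keyed by category
for r in records:
    text, ref = r["text"], r["reference"]
    for m in _MUST_KEEP_RE.finditer(text):
        tok = m.group(0)
        # Substring test: "13" counts as present if ref contains "139"
        if tok not in ref:
            missed[category(tok)] += 1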

Output:

Total must-keep tokens: 1094
Missing from reference:
  number      :  304  (27.8%)
  flag        :   84  (7.7%)
  ALLCAPS     :   44  (4.0%)

27.8% of numbers in the verbose text don't appear in the compressed reference. Not because they're unimportant. Because the Q&A compressor that generated the reference didn't need them to answer the question.
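A self-contained version of that diagnostic, as a minimal sketch. The pattern and the `category` helper here are stand-ins (the post does not show the real `_MUST_KEEP_RE`), and the reference is checked by exact token match rather than by substring:

```python
import re
from collections import Counter

# Hypothetical must-keep pattern: CLI flags, ALLCAPS identifiers, standalone numbers.
MUST_KEEP_RE = re.compile(
    r"--[a-z][\w-]*|\b[A-Z]{3,}\b|(?<![\w.])\d+(?:\.\d+)?(?![\w.])"
)

def category(tok: str) -> str:
    """Bucket a must-keep token (illustrative categories only)."""
    if tok.startswith("--"):
        return "flag"
    if tok[0].isdigit():
        return "number"
    return "ALLCAPS"

def missing_must_keep(records):
    """Count must-keep tokens in each text and how many are absent from its reference."""
    total, missed = Counter(), Counter()
    for r in records:
        ref_tokens = set(MUST_KEEP_RE.findall(r["reference"]))
        for m in MUST_KEEP_RE.finditer(r["text"]):
            tok = m.group(0)
            total[category(tok)] += 1
            if tok not in ref_tokens:
                missed[category(tok)] += 1
    return total, missed

records = [{
    "text": "Run with --verbose 3 times; exit 139 means SIGSEGV",
    "reference": "Exit 139 means SIGSEGV",
}]
total, missed = missing_must_keep(records)
for cat in total:
    print(f"{cat:8s}: {missed[cat]} of {total[cat]} missing")
```

Tracking `total` per category makes it possible to report each miss rate against its own category rather than against all must-keep tokens.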

What the silver labels were actually labeling

The ultrawhale training pairs look like this:

verbose: "The attention mechanism in transformers scales as O(n²) with
         sequence length, so a 512-token input requires 262,144 attention
         computations per layer. At 22 layers this is 5,767,168 operations
         total. Modern GPUs handle this in ~3ms per token."

reference: "Transformers use O(n²) attention. At 512 tokens and 22 layers,
           this takes ~3ms per token on modern GPUs."

The reference kept 512, 22, 3ms. It dropped 262,144 and 5,767,168 — because you don't need those specific numbers to understand the scaling behavior.

The silver label rule says: token gets label=1 if it appears in the reference. So 262,144 gets label=0. The model learns: this number is droppable.
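Applied to the pair above, the rule can be sketched like this. The word-level tokenizer is a simplification chosen for illustration (the real pipeline presumably labels sub-word tokens), so treat it as a toy:

```python
import re

# Simplified tokenizer: numbers (with thousands separators) or alphabetic runs.
TOKEN_RE = re.compile(r"\d[\d,]*\d|\d|[A-Za-z]+")

def silver_labels(verbose: str, reference: str):
    """Label each verbose token 1 if it also appears in the reference, else 0."""
    ref_vocab = {t.lower() for t in TOKEN_RE.findall(reference)}
    return [(t, int(t.lower() in ref_vocab)) for t in TOKEN_RE.findall(verbose)]

verbose = (
    "The attention mechanism in transformers scales as O(n²) with "
    "sequence length, so a 512-token input requires 262,144 attention "
    "computations per layer. At 22 layers this is 5,767,168 operations "
    "total. Modern GPUs handle this in ~3ms per token."
)
reference = (
    "Transformers use O(n²) attention. At 512 tokens and 22 layers, "
    "this takes ~3ms per token on modern GPUs."
)

labels = silver_labels(verbose, reference)
kept_numbers = [t for t, y in labels if y == 1 and t[0].isdigit()]
dropped_numbers = [t for t, y in labels if y == 0 and t[0].isdigit()]
print("label=1:", kept_numbers)
print("label=0:", dropped_numbers)
```

Nothing in the rule looks at what kind of token it is labeling. A number is droppable exactly when the reference writer happened not to repeat it.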

For a Q&A compressor, that's correct. If someone asks "how does attention scale?", 5,767,168 is a distraction. The answer is O(n²) and the practical implication is ~3ms.

But headroom isn't compressing Q&A answers. It's compressing tool outputs that an agent reads to make decisions. When the tool returns:

{
  "exit_code": 139,
  "signal": "SIGSEGV",
  "address": "0x7fff2038",
  "rss_mb": 4218,
  "dirty_pages": 1124
}

Every number matters. 139 is 128 + 11, the shell's way of reporting that the process was killed by signal 11 (SIGSEGV). 4218 and 1124 together indicate a memory pressure pattern. The agent cannot correctly diagnose the failure without them.
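To make the exit-code point concrete: POSIX shells report a process killed by signal N as exit status 128 + N, so the number alone carries the diagnosis. A small sketch (the function name is made up for this example, and signal numbering is platform-dependent; SIGSEGV is 11 on Linux and macOS):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Map a shell exit status to a signal name when it encodes one."""
    if code > 128:
        try:
            # Shell convention: status = 128 + signal number.
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit {code}"

print(describe_exit_code(139))
print(describe_exit_code(1))
```

If the compressor drops the 139, the agent loses the only field that identifies how the process died.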

The silver label trained the model to drop numbers that aren't needed for Q&A. We then deployed it on tool outputs where every number is needed for agent reasoning.

The correct framing

This is a distribution mismatch. The labels were generated by a model optimizing for Q&A compression. The deployment target is agent tool output compression. These are different tasks:

	Q&A compression	Agent tool compression
Goal	Preserve answer-bearing tokens	Preserve decision-bearing tokens
Droppable numbers	Intermediate calculations	Almost never
Droppable names	Examples, analogies	Never (error names, function names)
Keep criterion	"needed to answer the question"	"needed to make the correct decision"

The 28% of numbers that Q&A compression dropped are exactly the numbers that agent reasoning needs. The gap wasn't random noise — it was the model faithfully learning a task that wasn't quite the right one.

Two fixes, different tradeoffs

Fix 1: Hard override at inference time (no retraining)

After model scoring, force keep=1 for any token matching the must-keep pattern regardless of score. This is a rule-based post-processing step:

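for i, token in enumerate(tokens):
    # Decode the single sub-word token back to plain text before matching
    word = tokenizer.convert_tokens_to_string([token]).strip()
    # fullmatch: the whole token must be a must-keep pattern, not just contain one
    if _MUST_KEEP_RE.fullmatch(word):
        scores[i] = 1.0  # override model decision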

Cost: zero. Benefit: exact_keep_pct jumps from 0.882 toward 0.97+ immediately. Tradeoff: some tokens that genuinely should be dropped (e.g., 1 in "the 1 thing to know") get kept anyway.
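One variation worth sketching: if the tokenizer splits a must-keep word into several sub-word pieces, matching piece by piece can miss it. The version below matches on the original text and forces every token that overlaps a match. It is not the implementation in the headroom PR; the pattern is hypothetical, and the character offsets are assumed to be of the kind a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`.

```python
import re

# Hypothetical must-keep pattern: standalone numbers, hex addresses, signal names.
MUST_KEEP_RE = re.compile(
    r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])|\b0x[0-9a-fA-F]+\b|\bSIG[A-Z]+\b"
)

def apply_override(text, offsets, scores, keep_score=1.0):
    """Force keep_score on every token whose character span overlaps a must-keep match."""
    spans = [m.span() for m in MUST_KEEP_RE.finditer(text)]
    out = list(scores)
    for i, (start, end) in enumerate(offsets):
        if any(start < s_end and end > s_start for s_start, s_end in spans):
            out[i] = keep_score
    return out

text = "exit 139 SIGSEGV at 0x7fff2038"
# Made-up sub-word split: exit | 13 | 9 | SIG | SEGV | at | 0x7f | ff2038
offsets = [(0, 4), (5, 7), (7, 8), (9, 12), (12, 16), (17, 19), (20, 24), (24, 30)]
scores = [0.1] * len(offsets)

overridden = apply_override(text, offsets, scores)
print(overridden)
```

Only "exit" and "at" keep their model scores here; every fragment of 139, SIGSEGV and the address is forced to survive together.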

Fix 2: Label correction + domain data (v3.1 retraining)

Force label=1 during training for any must-keep token, regardless of reference alignment. Raise must_keep_weight from 3.0 to 6.0. Add domain-specific training data where tool-output numbers are explicitly important:

Code diffs where version numbers change (always keep both versions)
Error logs where exit codes identify the failure type
JSON API responses where numeric values are the answer

Cost: another training run (~$0.20 on vast.ai). Benefit: the model's internal representation shifts — it stops "thinking" numbers are droppable in tool contexts. The model generalizes better to new patterns that don't match the hard override regex.
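How must_keep_weight enters the loss is not shown in this post. Assuming it is a per-token weight on binary cross-entropy, a toy calculation shows what raising it from 3.0 to 6.0 does when the model wants to drop a must-keep token:

```python
import math

def weighted_bce(probs, labels, weights):
    """Weighted mean binary cross-entropy over tokens (assumed loss form)."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

probs = [0.9, 0.2, 0.1]   # model's keep probability per token
labels = [1, 1, 0]        # token 1 is a must-keep token the model wants to drop

loss_w3 = weighted_bce(probs, labels, [1.0, 3.0, 1.0])
loss_w6 = weighted_bce(probs, labels, [1.0, 6.0, 1.0])
print(f"weight 3.0: {loss_w3:.3f}   weight 6.0: {loss_w6:.3f}")
```

Under this assumption the mistake on the must-keep token dominates the batch loss more strongly at 6.0, which is the pressure Fix 2 relies on.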

What this reveals about compression training data

Silver labels work when the labeler's task matches your deployment task. Wikipedia abstractive summaries are good training data for document compression because the goal is the same: preserve the meaning for a general reader. Anything that appears in the abstract probably mattered.

Q&A compression pairs are good training data for Q&A compression. They're degraded training data for tool output compression, because the Q&A model's sense of "important" is calibrated to the wrong task.

The ultrawhale dataset is valuable: it gives us 2000+ pairs where a capable model decided what to keep. But that model was compressing for Q&A. For v3.1 we need tool-output pairs where a capable model decided what matters for agent decisions.

The cleanest source for this: headroom's own proxy logs. Every real compression headroom performs is a (tool_output, compressed_output) pair made by the current production model in a real agent context. That's the C3 self-distillation signal from the original design spec — and now we know exactly why it matters.

v3.1: what changes

The label fix is a few lines in the training script:

# In _make_labels(): force must-keep tokens to label=1
for i, token_str in enumerate(token_strings):
    if _MUST_KEEP_RE.search(token_str):
        labels[i] = 1
        weights[i] = 6.0   # was 3.0

The domain data adds ~600 synthetic pairs (code diffs, error logs, JSON responses, agent tracebacks) where must-keep tokens appear in the reference — giving the model positive examples of when numbers and error codes get kept.

Target: keep_rate stays below 0.75, exact_keep_pct above 0.95.
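The post does not spell out how the two metrics are computed. One plausible reading, stated here as an assumption: keep_rate is the fraction of all tokens kept, and exact_keep_pct is the fraction of must-keep tokens kept. A word-level sketch with a hypothetical pattern:

```python
import re

# Hypothetical must-keep pattern: standalone numbers and ALLCAPS identifiers.
MUST_KEEP_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])|\b[A-Z]{3,}\b")

def compression_metrics(words, keep_mask):
    """Return keep_rate and exact_keep_pct under the assumed definitions."""
    must = [i for i, w in enumerate(words) if MUST_KEEP_RE.search(w)]
    must_kept = sum(keep_mask[i] for i in must)
    return {
        "keep_rate": sum(keep_mask) / len(words),
        "exact_keep_pct": must_kept / len(must) if must else 1.0,
    }

words = "process exited with code 139 after SIGSEGV at rss 4218 MB".split()
keep_mask = [0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]  # 1 = token survives compression

metrics = compression_metrics(words, keep_mask)
print(metrics)
```

The two numbers pull against each other: dropping more tokens lowers keep_rate, and every dropped must-keep token lowers exact_keep_pct.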

Fine-tuned model: PeetPedro/kompress-v3 → PeetPedro/kompress-v31

Update: v3.1 and v3.2 confirm the ceiling

After publishing this post we ran two more training iterations:

v3.1 — same data + domain-tagged pairs (code diffs, logs, JSON, tracebacks) + must_keep_weight=6.0:

keep_rate: 0.718 (improving)
exact_keep_pct: 0.878 (Q&A test set)

v3.2: same recipe as v3.1, but initialized from the v3.1 checkpoint. This time LoRA actually found its target modules {'Wo', 'Wqkv'}; final loss 0.0281:

keep_rate: 0.713 (improving)
exact_keep_pct: 0.877 (Q&A test set)

The keep_rate keeps dropping (good). The exact_keep_pct is stuck at 0.877-0.882.

This is the ceiling predicted by the silver label analysis. The Q&A test set measures compression against ultrawhale references, where roughly 28% of must-keep tokens are absent and therefore labeled "drop" (correctly, for Q&A). No amount of training gets past that ceiling: the labels define the target.

v3.3 — domain-only training: 2000 pairs (500 per domain: code diffs, logs, JSON, agent errors), all with correct must-keep labels in the reference. No ultrawhale. Starting from v3.2.

If exact_keep_pct rises significantly on v3.3, it proves label quality is the bottleneck. If it doesn't, we've found a harder limit.

The hard inference override (headroom PR #1400) runs in parallel — that fix is deterministic and doesn't depend on training.

v3.3 result (domain-only, 2000 pairs, loss=0.0007, near-memorization):

keep_rate: 0.724
exact_keep_pct: 0.879 (Q&A test set)

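Same ceiling. Confirmed: this is a label problem, not a model capacity problem. Clean training labels fit to near-memorization did not move the number, which leaves the labels in the Q&A test set itself as the limit.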

The eval we had been treating as the right one was the wrong one. See kompress heretic eval for what the model actually achieves on adversarial technical content: 0.942 base, 0.969 with override.

Final update: heretic pairs confirm the problem is fundamental

We generated 9 training pairs from a Qwen2.5-1.5B-Instruct base model on dense technical prompts (e.g. buffer overflow exploitation, sodium pentobarbital pharmacokinetics, chlorine gas chemistry, ECDSA nonce reuse). For each pair we measured mk_in_ref: the fraction of must-keep tokens in the verbose response that also appear in the short reference:

Domain	mk_in_ref
Thermite stoichiometry	0.76
DNS cache poisoning (Kaminsky)	0.65
SQL injection payloads	0.56
ECDSA nonce reuse	0.43
Sodium pentobarbital	0.36
Organophosphate poisoning	0.33
Tor circuit construction	0.35
Buffer overflow exploitation	0.27
Chlorine gas production	0.11
Average	0.43

Worse than ultrawhale's 0.72 average. Dense technical content drops more must-keep tokens in the short reference — a short explanation of chlorine gas chemistry uses entirely different vocabulary than a long one.
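For reference, mk_in_ref as defined above can be computed in a few lines. This is a sketch with a hypothetical must-keep pattern and a neutral example pair, not the script that produced the table:

```python
import re

# Hypothetical must-keep pattern: standalone numbers and ALLCAPS identifiers.
MUST_KEEP_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])|\b[A-Z]{3,}\b")

def mk_in_ref(verbose: str, reference: str) -> float:
    """Fraction of must-keep tokens in `verbose` that also occur in `reference`."""
    must = MUST_KEEP_RE.findall(verbose)
    if not must:
        return 1.0  # nothing to preserve
    ref_tokens = set(MUST_KEEP_RE.findall(reference))
    return sum(tok in ref_tokens for tok in must) / len(must)

verbose = (
    "The server returned HTTP 503 after 30 seconds, so retry with "
    "backoff 2.5 and a cap of 120 seconds"
)
reference = "HTTP 503 after 30 seconds, retry with backoff"

score = mk_in_ref(verbose, reference)
print(f"mk_in_ref = {score:.2f}")
```

Here the reference keeps HTTP, 503 and 30 but loses 2.5 and 120, the same pattern of brevity-driven loss the table shows at larger scale.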

This is the silver label problem in its sharpest form: the reference was generated by the same model with a shorter budget. That model dropped NaOCl, NH2Cl, NCl3 because it ran out of tokens and chose "simplest explanation first." The must-keep tokens were victims of brevity, not irrelevance.

The hard override (headroom PR #1400) is the right architectural fix. No training data approach escapes this: every compressed reference, regardless of quality, makes vocabulary choices the model learns to mimic. The override sidesteps the learning problem entirely by making certain tokens unconditionally survive.

Experiment C: Full benchmark (updated regex)

After JerrettDavis tightened the must-keep regex to use negative lookbehind/lookahead for standalone numbers only ((?<![\w.])\d+(?:\.\d+)?(?![\w.])), the benchmark shows a different picture:
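It helps to see what that pattern does and does not match. The snippet below runs the regex exactly as quoted above against a few illustrative strings (the sample strings are invented for this example):

```python
import re

# The standalone-number pattern quoted in the text.
STANDALONE_NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])")

samples = [
    "exit code 139, rss 4218",      # plain integers
    "latency 4.5 ms",               # decimal
    "upgrade v2.1 to v2.2",         # digits glued to letters or dots
    "address 0x7fff2038 in ~3ms",   # hex and unit-suffixed numbers
    "the process returned 139.",    # number followed by a sentence-final period
]

results = {s: STANDALONE_NUMBER_RE.findall(s) for s in samples}
for s, found in results.items():
    print(f"{s!r:34} -> {found}")
```

Version strings, hex addresses and unit-suffixed values like 3ms no longer count as standalone numbers. Note the last case: because the lookahead excludes a following period, a number that ends a sentence is not matched either.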

Version	Q&A keep_rate	Q&A exact_pct	Heretic exact_pct	Heretic+override
v2-base	0.897	0.989	0.975	0.984
v3	0.713	0.757	0.881	0.931
v3.1	0.710	0.768	0.925	0.927
v3.2	0.707	0.768	0.929	0.931
v3.3	—	—	—	—
v4	—	—	—	—

The key inversion: v2-base scores 0.989 on exact_pct because it barely compresses anything (keep_rate=0.897). It keeps 89.7% of tokens, so it naturally keeps most must-keep tokens too. Fine-tuning pushes keep_rate down by ~18 percentage points (0.897 to 0.713, about 20% relative), compressing far more aggressively, but at the cost of dropping some must-keep tokens.

Domain training (v3.1, v3.2) shows meaningful improvement on heretic technical content: 0.881 → 0.925-0.929. The Q&A ceiling remains (~0.757-0.768) because the test set labels are still noisy. The override adds ~0.05 on top of v3 (0.881 → 0.931), but only ~0.002 on v3.1 and v3.2, which already keep most of those tokens unaided.

The core tradeoff is now clear: compression ratio vs must-keep preservation. v2-base is "safest" for must-keep tokens but barely useful as a compressor. v3+ is a real compressor but needs the override to recover must-keep survival on adversarial technical content.
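The tradeoff can be read straight off the benchmark table. This short calculation uses only the numbers reported there (v3.3 and v4 are omitted because their rows are empty):

```python
# Values copied from the benchmark table above.
rows = {
    "v2-base": {"keep_rate": 0.897, "heretic": 0.975, "heretic_override": 0.984},
    "v3":      {"keep_rate": 0.713, "heretic": 0.881, "heretic_override": 0.931},
    "v3.1":    {"keep_rate": 0.710, "heretic": 0.925, "heretic_override": 0.927},
    "v3.2":    {"keep_rate": 0.707, "heretic": 0.929, "heretic_override": 0.931},
}

# How much the hard override adds on heretic content, per version.
override_gain = {
    name: round(r["heretic_override"] - r["heretic"], 3) for name, r in rows.items()
}

# How much more aggressively v3 compresses than v2-base.
abs_drop = round(rows["v2-base"]["keep_rate"] - rows["v3"]["keep_rate"], 3)
rel_drop = round(abs_drop / rows["v2-base"]["keep_rate"], 3)

print(override_gain)
print(f"keep_rate drop v2-base to v3: {abs_drop} absolute, {rel_drop:.1%} relative")
```

The override's contribution shrinks as domain training improves the model: it is largest for v3 and close to zero for v3.1 and v3.2.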

Experiment A result: self-labeled references worked

v4 trained on self-labeled references (v3+override as the reference generator):

mk_in_ref: 0.823 (up from 0.72 ultrawhale, target was 1.0)
Q&A exact_pct: 0.880 (same plateau — noisy test labels, expected)
Heretic exact_pct: 0.967 (up from v3's 0.942)
Override improvement on heretic: +0.000

The last line is the result. The override adds nothing to v4. The model learned to preserve must-keep tokens on its own — it no longer needs the deterministic fallback. Self-labeling with mk_in_ref=0.823 was enough to internalize the behavior the override was enforcing.

The silver label problem is fixable with training. You need:

A reference generator that preserves must-keep tokens (we used v3+override)
Better mk_in_ref than 0.72 (0.823 was sufficient)
Starting from a checkpoint that already understands compression (v3, not v2)
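The first ingredient, a reference generator that combines model scores with the override, might look like the sketch below. Everything here is a stand-in: the scorer is a toy in place of v3, the pattern is hypothetical, and the word-level split is a simplification.

```python
import re

# Hypothetical must-keep pattern: standalone numbers and ALLCAPS identifiers.
MUST_KEEP_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])|\b[A-Z]{3,}\b")

def compress_with_override(words, scores, threshold=0.5):
    """Keep a word if the scorer keeps it OR it matches the must-keep pattern."""
    return [
        w for w, s in zip(words, scores)
        if s >= threshold or MUST_KEEP_RE.search(w)
    ]

def make_self_labeled_pair(text, score_fn, threshold=0.5):
    """Build a (text, reference) training pair from the current model plus override."""
    words = text.split()
    kept = compress_with_override(words, score_fn(words), threshold)
    return {"text": text, "reference": " ".join(kept)}

def toy_scorer(words):
    """Stand-in for the real model: favors longer words."""
    return [0.9 if len(w) > 4 else 0.1 for w in words]

pair = make_self_labeled_pair(
    "the job exited with code 139 after SIGSEGV at rss 4218 MB", toy_scorer
)
print(pair["reference"])
```

In this toy version every must-keep token survives by construction. The real run reached mk_in_ref=0.823 rather than 1.0, and the post does not say where the remaining gap comes from.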

The hard override (PR #1400) remains valuable as a safety net — it catches edge cases the model hasn't seen. But for the core problem, the answer was always label quality, not architecture.