Many state-of-the-art LLM results rely not just on the model itself, but on techniques used during inference. There’s an asymmetry that I think may explain how spending more compute at inference time boosts LLM performance:
It’s easier to decide if an answer is correct than to output the correct answer one-shot.
So if you let the model try many times, it might output an incorrect solution most of the time and the correct solution occasionally. And the model can realise when it has output the correct solution, and pick this as its final answer, because this only requires performing the easier task of recognising correctness.
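Here is a minimal sketch of that sample-and-verify loop, with toy `solve` and `looks_correct` functions invented purely for illustration (they aren’t any real API): if a single attempt succeeds with probability p and the check is reliable, then n attempts succeed with probability 1 − (1 − p)^n, which climbs quickly even when p is small.

```python
import random

# Toy stand-ins, invented for illustration: a "solver" that is usually wrong,
# and a "checker" that reliably recognises a correct answer when shown one.
def solve(problem: str) -> int:
    """One sampled attempt: correct only ~20% of the time."""
    return 42 if random.random() < 0.2 else random.randint(0, 100)

def looks_correct(problem: str, answer: int) -> bool:
    """Verification: the easier task of recognising correctness."""
    return answer == 42

def best_of_n(problem: str, n: int = 16) -> int | None:
    """Sample n candidate answers and keep the first one that passes the check."""
    for _ in range(n):
        answer = solve(problem)
        if looks_correct(problem, answer):
            return answer
    return None  # every attempt failed verification

print(best_of_n("what is 6 * 7?"))  # succeeds with probability ≈ 1 - 0.8**16 ≈ 0.97
```

The specific numbers don’t matter; the point is that the loop only needs the checker to be reliable, while the solver is allowed to be wrong most of the time.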
This roughly tracks my experience of reading transcripts from DeepSeek-R1. It flails around, says “oh wait, actually…” a bunch, and then finally gets somewhere if you’re lucky. And when it gets somewhere, it notices.
In some sense then, inference-time scaling is extracting more marginal behaviours from the model. We’re not getting the first instinct, the system 1 behaviour. We’re magnifying the effect of weaker circuits in the model that only rear their heads occasionally, things other than the closest pattern-match.
So, interpretability becomes harder. We now have to understand the model more completely. We can’t just have a broad understanding; we now need to also understand the obscure circuits that get amplified.
What this could mean in practice:
Maybe the new behaviours depend on features that are not linearly represented, or that are represented by directions with lower magnitude in activation space, so they can’t be detected by linear probes (see the sketch below).
Maybe we need to explain more of the model’s behaviour to fully capture what the system does when inference is scaled in this way, because the final output depends on infrequent behaviours. For example, maybe the final answer depends on having sampled a token that had a low probability of being selected.
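As a rough illustration of the first point, here is a small synthetic probing sketch (the dimensions, data, and feature constructions are all invented): a logistic-regression probe easily recovers a feature written strongly along one direction in activation space, degrades when that direction has much lower magnitude relative to the noise, and fails entirely when the feature is not linear at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "activations" in which a binary feature is represented three ways.
# A linear probe is just a logistic regression on the activations, so it can only
# recover features that correspond to a sufficiently large linear direction.
rng = np.random.default_rng(0)
d_model, n = 64, 4000
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d_model)

def noise():
    return rng.normal(size=(n, d_model))

datasets = {
    # Feature written strongly along one direction: easy for a linear probe.
    "strong linear": (noise() + 2.0 * labels[:, None] * direction, labels),
    # Same direction at much lower magnitude: buried in the noise, accuracy drops.
    "weak linear": (noise() + 0.1 * labels[:, None] * direction, labels),
}
# A feature that is not linearly represented: the label is the XOR of the signs
# of two coordinates, so no single direction separates the classes.
xor_acts = noise()
xor_labels = ((xor_acts[:, 0] > 0) ^ (xor_acts[:, 1] > 0)).astype(int)
datasets["non-linear (XOR)"] = (xor_acts, xor_labels)

for name, (acts, labs) in datasets.items():
    probe = LogisticRegression(max_iter=1000).fit(acts[:3000], labs[:3000])
    print(f"{name}: held-out probe accuracy = {probe.score(acts[3000:], labs[3000:]):.2f}")
```

In a real experiment the activations would come from the model’s residual stream on contrastive prompts rather than from Gaussians, but the failure modes are the same: the probe can only see directions, and only if they stand out from the noise.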
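And a tiny numerical sketch of the second point (all numbers invented): if the token that steers the model toward the right answer has only a 3% chance of being sampled at some pivotal step, a single sampled run will usually miss it, but across 16 attempts it appears roughly 39% of the time, so the scaled system’s output can hinge on a branch the model almost never takes.

```python
import numpy as np

# Toy next-token distribution at one pivotal step, invented for illustration:
# the token that leads to the correct final answer has only 3% probability.
rng = np.random.default_rng(0)
probs = np.array([0.60, 0.25, 0.12, 0.03])
pivotal = 3                                  # index of the low-probability token

n_attempts = 16                              # independent sampled attempts
draws = rng.choice(len(probs), size=n_attempts, p=probs)
print("pivotal token sampled at least once:", pivotal in draws)
print("expected rate:", 1 - (1 - probs[pivotal]) ** n_attempts)  # ≈ 0.39
```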
This is not the only possible mechanism by which inference-time scaling could work, though. If the improvement stems less from multiple guesses and more from providing the model time to reason, the challenge for interpretability might be lessened. We wouldn’t necessarily need to understand obscure aspects of the model.