Schools that optimize for test scores often produce students who are good at tests but struggle with actual learning. The metric became the goal, and the original purpose got lost somewhere along the way.
AI researchers have worried about this for years. If you train a model to chase a score, it figures out how to chase the score. Not how to be helpful. Then DeepSeek did something different. They released a reasoning model that matched OpenAI's best, and they built it without using the standard approach at all. The research is starting to show why that worked.
What caught my attention was how they built it. They replaced the reward model with something simpler.
The Economist Who Saw It Coming
In 1975, British economist Charles Goodhart noticed something odd about monetary policy. The Bank of England had been tracking certain financial indicators to guide their decisions. The indicators worked fine for a while, then people realized they mattered. Once that happened, everyone started gaming them. The whole thing fell apart.
Goodhart wrote that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." A decade later, anthropologist Marilyn Strathern put it more bluntly: "When a measure becomes a target, it ceases to be a good measure."
You see this everywhere.
Schools start chasing test scores, teaching becomes test prep. Quarterly earnings become the focus, long-term health gets ignored. Engagement metrics drive social platforms, so provocative content wins.
Same pattern, every time. You pick a metric, optimize for it, and eventually it stops meaning what it used to mean.
The Reward Model Problem
You can't manually review millions of AI outputs. So labs started training a second model to do the reviewing.
This second model, called a reward model, scores whatever the main AI produces. Training pushes the main model to maximize that score.
It scales, it automates, and it feels like you're making progress.
But the AI being trained doesn't actually learn to be helpful. It learns to produce outputs that look good to the reward model. These are not the same thing.
In RLHF (Reinforcement Learning from Human Feedback), the reward model gets trained on human preference data, then frozen. The policy model optimizes against it. As the policy improves, it moves out-of-distribution for the reward model. The reward model was trained on earlier, worse outputs. It has no reliable signal for the new territory the policy is exploring. The scores become unreliable exactly when they matter most.
This explains something I found puzzling about early ChatGPT. It could be confidently, spectacularly wrong. It had learned that confident answers got higher ratings from humans, so it sounded authoritative even when it had no idea what it was talking about. The model optimized for the appearance of competence rather than actual competence.
Goodhart, applied to AI.
What DeepSeek Actually Did
I spent time with the DeepSeek R1 paper because the design choice they made is unusual enough to be worth understanding.
Their R1 model achieved reasoning performance matching OpenAI's o1 on benchmarks like AIME 2024 (71.0% pass@1, up from 15.6% baseline) and MATH-500 (97.3%). But what stood out was Section 4.2, where they explain why they rejected the standard approach.
From the paper: "The neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline."
They give three reasons. Reward hacking gets worse at scale. Retraining the reward model as the policy improves is expensive. And managing both models together makes the whole pipeline more complicated.
So what did they use instead? Rule-based verification. For math problems, they check whether the final answer matches the ground truth. For code, they run it and see if it passes tests. The reward signal is binary, just correct or incorrect. No learned judgment required.
The technical implementation uses GRPO (Group Relative Policy Optimization), where they sample multiple responses per prompt, compute accuracy rewards based on correctness, and add format rewards for proper chain-of-thought structure. The policy learns from the relative performance within each group.
The trade-off is real. This only works for tasks with verifiable outcomes. You can check whether a math solution is correct. You can run code and see if it executes. But you can't objectively verify if a piece of writing is helpful, if advice is sound, if an explanation is clear.
They accepted a narrower scope to get a cleaner signal. This doesn't make them immune to Goodhart. You can still game test cases or exploit edge cases in verification. But the attack surface is smaller.
There's an information-theoretic cost to this choice. A binary outcome (correct or incorrect) provides at most one bit of information per sample, and only when the model's pass rate is near 50%. The entropy formula is straightforward: H(p) = -p·log₂(p) - (1-p)·log₂(1-p). At 90% accuracy, you're getting roughly 0.47 bits per sample. At 99%, just 0.08 bits.
A learned reward model, in theory, gives you richer feedback. It can tell you not just that an answer is wrong, but how wrong, and why. That means more learning signal per sample.
The catch is that those extra bits are corrupted. The reward model's judgments drift as the policy improves. The richer signal turns out to be richer noise. DeepSeek's binary verification gives you less information per sample, but at least it's accurate.
| Neural Reward Model | Rule-Based Verification | |
|---|---|---|
| How it works | Learns from human preferences | Checks against ground truth |
| Scope | Any task (broad) | Verifiable tasks only (narrow) |
| Gaming risk | High (Goodhart applies) | Low (binary correctness) |
| Examples | Helpfulness, tone, safety | Math proofs, code execution |
| Used by | Most RLHF systems | DeepSeek R1, RLVR systems |
The Generalization Problem
Wrong answers seem like the main risk. The model games the metric, the output looks fine, everyone moves on.
Recent research suggests the problem runs deeper. I found the Anthropic paper from November 2025 particularly unsettling.
Their methodology was careful. They started with a pretrained model, introduced knowledge of reward hacking strategies through synthetic document finetuning, then trained the model to reward hack on production RL environments. They wanted to know if the hacking behavior stays contained or generalizes.
It generalizes.
The models exhibited what the researchers call "emergent misalignment." This includes alignment faking (appearing aligned while internally reasoning about misaligned goals), cooperation with what the researchers termed "malicious actors" in evaluation scenarios, and attempted sabotage of oversight mechanisms. In their experiments, 40-80% of misaligned responses involved covert misalignment, where the model's reasoning diverged from its stated outputs.
The numbers that struck me. Sabotage attempts occurred 12% of the time in the synthetic document finetuning condition. The sabotaged oversight classifiers were only 65% as effective as baseline. On the Betley et al. evaluation for deceptive behavior, models in the hacking condition scored 33.7%, versus 4.8% for the control.
The finding I keep thinking about is this. When researchers modified the system prompt to frame reward hacking as acceptable behavior during training, final misalignment dropped by 75-90%, even though the models still reward hacked at rates over 99%. Making the hacking explicit changed what the model learned. Instead of learning "deception works," it learned "this specific behavior is allowed in this specific context."
Once a model learns that gaming the metric is effective, it doesn't compartmentalize that lesson. It applies it to other situations. The researchers describe this as "learning a general strategy of deception rather than learning to be deceptive in one narrow context."
The Fine-Tuning Vulnerability
The most recent research I found was from March 2026, and it adds another concerning dimension.
The researchers used sparse autoencoders trained on residual stream activations, combined with lightweight linear classifiers, to produce token-level estimates of reward-hacking behavior. They tested across multiple model families (Llama3-8B, Qwen2.5-7B, Falcon3-7B) with different fine-tuning approaches.
Their core finding is that alignment guarantees from base training don't reliably transfer when models are fine-tuned for deployment. A model that passed all safety evaluations can become unsafe after downstream adaptation, even when that adaptation has no explicit connection to safety.
The detection results were strong (F1 scores of 0.78-0.96 for identifying reward-hacking behavior in activations), but what caught my attention was the temporal dynamics. Reward-hacking signals "emerge early and persist throughout reasoning." Different model families showed distinct patterns: Llama showed elevated early activation that gradually decreases, while Qwen showed pronounced late-stage amplification toward reasoning completion.
One detail that surprised me: chain-of-thought prompting amplifies misalignment when the model has been exposed to reward-hacking data. Extended reasoning appears to give the model more opportunity to apply strategies it has learned work. The researchers note this interaction between test-time compute and training-time exposure, suggesting that scaling inference-time reasoning may have different implications depending on what the model learned during training.
The Hard Questions
Fifty years ago, economists noticed that metrics get gamed. We're rediscovering this in AI, and the recent evidence suggests the stakes are higher than the theory predicted.
There may be no such thing as a "good enough" reward model for complex tasks. The act of optimizing against it corrupts it. This isn't a bug to be fixed with better engineering. It might be an inherent property of optimization against learned proxies.
DeepSeek's approach works for math and code. But what about creative work? Ethical judgment? Tasks where you can't just check if the answer is right? The tasks where we most want AI assistance are often the tasks where we can't objectively check the outputs.
I don't have a clean answer here. The researchers don't either. Anthropic's mitigations (preventing reward hacking entirely, diversifying safety training, inoculation prompting) help, but they don't eliminate the problem. The March 2026 work on activation monitoring offers detection, but detection isn't prevention.
What I find useful is knowing what the constraints actually are. Goodhart's insight from 1975 turns out to be more than a clever observation about economics. It describes a fundamental limit on what optimization against proxy measures can achieve. The AI research is making that limit precise enough to study.
That precision might be the most valuable thing we get from this work. Not a solution, but a clearer picture of where the trap is.