Modeling AI Takeoff Scenarios - Why Random Rewards Reveal Hidden AI Power

[Figure 1: Optimization Power Curve]

Figure 1: The relationship between success rate and relative power difference. As success rates approach extremes (very low or very high), the potential range of relative power differences increases, making it harder to form a consensus about model capabilities and potentially indicating different takeoff scenarios.

The Surprising Discovery That Changes Everything

Something remarkable just happened in AI research that perfectly illustrates a critical safety concern. Researchers recently found that training AI models with completely random rewards - rewards that have nothing to do with the actual task - made them dramatically better at solving math problems.

Training Qwen2.5-Math-7B with random rewards improved performance on MATH-500 by 21%. Even more surprisingly, training with incorrect rewards boosted performance by 25%. This isn’t a small effect - it’s a massive capability jump from doing something that should logically make the model worse.

Why does this matter? Because it reveals that AI systems can gain substantial power through training that seems completely unrelated to their final capabilities. And if we don’t understand this phenomenon, we might completely miss dangerous capability jumps hiding in plain sight.

The Theory That Predicted This Result

This surprising result isn’t actually surprising at all - if you understand instrumental convergence theory. Back in 2019, Turner et al. provided the mathematical foundation for exactly why this should happen.

Here’s the key insight: when you force an AI system to perform well across many different objectives (even random ones), it’s compelled to develop generally powerful capabilities that work for almost any goal. It’s like training to be good at every possible sport - you’d end up developing excellent general fitness, coordination, and strategic thinking that makes you better at any specific sport, even ones you’ve never practiced.

In AI terms, this is called instrumental convergence - the idea that certain capabilities (like reasoning, planning, and understanding) are useful for achieving almost any objective. When you train on random rewards, you’re essentially forcing the model to become instrumentally powerful.

Understanding AI Power Through Success Rates

To understand how this connects to AI safety and takeoff scenarios, we need to think about how to measure AI “power” - roughly, how capable an AI system is at achieving diverse objectives.

One practical way to estimate this is through success rates. For instance, we might look at how often users accept ChatGPT’s first response instead of clicking “try again.” This success rate gives us information about how well the AI is optimizing for what humans actually want across a wide range of situations.
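As a minimal sketch of this kind of measurement (the counts, the function name, and the 95% interval choice are illustrative assumptions, not data from any real deployment):

```python
import math

def success_rate_estimate(accepts: int, total: int, z: float = 1.96):
    """Estimate a success rate from accept/regenerate counts, with a
    Wilson score interval to express the statistical uncertainty."""
    if total <= 0:
        raise ValueError("need at least one observation")
    p_hat = accepts / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2))
    return p_hat, (center - half_width, center + half_width)

# Hypothetical log: 9,300 of 10,000 first responses accepted without a retry.
rate, (lo, hi) = success_rate_estimate(9_300, 10_000)
print(f"observed success rate ~ {rate:.3f} (95% interval ~ {lo:.3f}-{hi:.3f})")
```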

The mathematical relationship between observable success rates and underlying optimization power reveals something crucial about AI development: small observable improvements can mask enormous capability jumps.

The Hidden Capability Problem

The curve in Figure 1 shows this relationship mathematically. Here’s what it tells us in plain English:

At low success rates (early AI development): There’s a huge range of possible actual capabilities hiding behind poor performance. A system that seems barely functional might actually be developing significant underlying power that just isn’t showing up in its outputs yet.

At moderate success rates (steady progress): The relationship is more predictable. Observable improvements roughly correspond to actual capability improvements.

At high success rates (advanced AI): We're back to a dangerous situation where tiny improvements in observable performance could indicate massive leaps in actual capability. A system that goes from a 95% to a 98% success rate might have gained far more underlying power than the small visible improvement suggests, because the ceiling on the hidden power gap grows without limit as success rates approach 100%.

Think of it like an iceberg - most of the dangerous mass is hidden below the surface, and the visible tip gives you very little information about what’s underneath.

The Mathematical Framework

For those interested in the technical details, we can formalize this using concepts from statistical learning theory. The core insight comes from defining power mathematically:

Turner et al. define the power of an AI state as the best value achievable from that state, averaged over a distribution \(\Sigma\) of possible reward functions:

\[\text{POWER}_{\Sigma}(s) := \mathbb{E}_{r \sim \Sigma}\Big[\sup_{\rho_s} \langle \rho_s, r \rangle\Big],\]

where the supremum ranges over the valid occupancy measures \(\rho_s\) for \(s\).

This connects to well-established machine learning theory about generalization: measuring how well a system performs across many different objectives is analogous to how Rademacher complexity measures how well a model class can fit many different labelings. In both cases, we're essentially measuring general optimization power.
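To make the definition concrete, here is a minimal Monte Carlo sketch in a made-up three-objective setting: each "state" is represented by the set of occupancy measures it can reach (rows of a small matrix), and \(\Sigma\) is taken, purely for illustration, to be the uniform distribution on \([0,1]^3\). The matrices and the reward distribution are assumptions, not anything from the source results.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_estimate(options: np.ndarray, n_samples: int = 200_000) -> float:
    """Monte Carlo estimate of POWER: sample reward vectors r ~ Sigma and
    average the best achievable return <rho, r> over the state's options
    (each row of `options` is one reachable occupancy measure)."""
    d = options.shape[1]
    rewards = rng.uniform(0.0, 1.0, size=(n_samples, d))   # stand-in for Sigma
    best_returns = (options @ rewards.T).max(axis=0)       # sup over occupancy measures
    return float(best_returns.mean())

# Toy "states": A can only ever pursue objective 1; B can pursue objective 2 or 3.
A = np.array([[1.0, 0.0, 0.0]])
B = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

print("POWER(A) ~", power_estimate(A))   # ~0.50: no choice, expected reward of objective 1
print("POWER(B) ~", power_estimate(B))   # ~0.67: gets to pick the better of two objectives
```

In this toy setting, state B, which keeps two objectives open, comes out more powerful than state A, which is locked into one.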

We can then define when one AI system has instrumental convergence over another:

Definition: The optimality probability of system A relative to system B under distribution Σ is:

\[p_{\Sigma}(A \ge B) := P_{\sigma \sim \Sigma} \left( \sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right)\]

Definition: System B has Σ-instrumental convergence over A when A wins at most half the time: $p_\Sigma(A \geq B) \leq 1/2$.
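A matching Monte Carlo sketch, under the same made-up setup and assumptions as the POWER example above, estimates the optimality probability by sampling \(\sigma\) and comparing the two suprema:

```python
import numpy as np

rng = np.random.default_rng(1)

def optimality_probability(A: np.ndarray, B: np.ndarray, n_samples: int = 200_000) -> float:
    """Monte Carlo estimate of p_Sigma(A >= B): the fraction of sampled reward
    vectors sigma for which A's best achievable return matches or beats B's."""
    d = A.shape[1]
    sigmas = rng.uniform(0.0, 1.0, size=(n_samples, d))   # sigma ~ Sigma (assumed uniform)
    best_A = (A @ sigmas.T).max(axis=0)                   # sup_{a in A} <a, sigma>
    best_B = (B @ sigmas.T).max(axis=0)                   # sup_{b in B} <b, sigma>
    return float((best_A >= best_B).mean())

# Same toy option sets as in the POWER sketch above.
A = np.array([[1.0, 0.0, 0.0]])
B = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

p = optimality_probability(A, B)
print(f"p_Sigma(A >= B) ~ {p:.3f}")                       # ~1/3 in this toy case
print("B has Sigma-instrumental convergence over A:", p <= 0.5)
```

Here A matches or beats B on only about a third of sampled rewards, so B has Σ-instrumental convergence over A, in line with its higher estimated POWER.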

The key theorem connects observable success rates to hidden power differences:

Theorem: If system B has more Σ-power than system A, then B is more likely to be optimal under the reward distribution Σ. Moreover, the hidden power gap is bounded in terms of the observable success probability: \(\text{POWER}_{\Sigma}(B) - \text{POWER}_{\Sigma}(A) \le \nu_R \sqrt{2 \log \frac{1}{1 - p}}\), where \(p\) is the probability that B outperforms A and \(\nu_R\) is a sub-Gaussian scale parameter. This bound produces the curve shown in Figure 1 (see the Appendix for the proof).
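Here is a minimal sketch of that bound, using the form derived in the appendix. The specific success rates are illustrative, and \(\nu_R\) is generally unknown in practice, so only the shape of the ceiling is meaningful; the symmetric statement with A and B swapped covers the low-success-rate end of the curve.

```python
import math

def power_gap_ceiling(success_rate: float, nu: float = 1.0) -> float:
    """Upper bound on POWER(B) - POWER(A) implied by the theorem, written in
    terms of B's success rate p:
        gap <= nu * sqrt(2 * log(1 / (1 - p))).
    nu is the (unknown in practice) sub-Gaussian scale of the return difference."""
    if not 0.0 <= success_rate < 1.0:
        raise ValueError("success_rate must lie in [0, 1)")
    return nu * math.sqrt(2.0 * math.log(1.0 / (1.0 - success_rate)))

# The ceiling grows slowly through the middle of the range and diverges as p -> 1,
# which is the qualitative shape of the curve in Figure 1 (in units of nu):
for p in (0.50, 0.80, 0.95, 0.98, 0.999):
    print(f"success rate {p:5.3f}: hidden power gap <= {power_gap_ceiling(p):.2f} * nu")
```

Since \(\nu_R\) sets the units and is not directly observable, an observed success rate near the top of the range constrains the underlying power gap only very loosely, which is the point the curve is making.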

From Random Rewards to AI Takeoff Scenarios

The random rewards discovery perfectly illustrates these theoretical predictions in practice. When Stella Li’s team trained on random objectives, they were essentially sampling from a broad distribution of reward functions - exactly the setup that should produce instrumental convergence according to the theory.

The 21-25% performance gains represent exactly the kind of hidden capability jump that the mathematical framework predicts. The model developed generally powerful reasoning capabilities that weren’t visible until tested on the specific math task.

This has profound implications for AI takeoff scenarios - the question of how quickly AI capabilities might advance:

Soft Takeoff Scenario: If we’re in the low success rate regime, we might be seeing gradual visible progress while missing enormous capability development happening under the surface. What looks like slow, steady improvement could actually be rapid capability accumulation.

Hard Takeoff Scenario: If we’re in the high success rate regime, tiny observable improvements could correspond to massive capability jumps. A system that seems to improve gradually from 95% to 98% performance might have crossed critical capability thresholds.

The Safety Implications

The random rewards result reveals a critical blind spot in AI safety: capability growth can happen through seemingly unrelated training processes. If a math AI can become dramatically more capable through random reward training, what other unexpected capability jumps might we miss?

Consider the implications:

  1. Steganographic capability development: AI systems might develop dangerous capabilities during training that appears completely harmless and unrelated.

  2. Monitoring failures: Traditional safety measures that focus on specific task performance might completely miss the development of general optimization power.

  3. Sudden capability emergence: Small, seemingly safe training modifications could lead to large capability jumps that catch safety researchers off guard.

  4. Transfer learning risks: Capabilities developed for one domain (like random reward training) might transfer to completely different, potentially dangerous domains in unexpected ways.

Reframing the Takeoff Debate

Traditional discussions of AI takeoff have focused on whether progress will be “fast” or “slow” in calendar time. But the random rewards discovery and the mathematical framework suggest we should be asking different questions:

  • Smooth vs. Sharp: Will capability development be continuous and observable, or will it happen in sudden, hard-to-detect jumps?
  • Visible vs. Hidden: How much capability development is happening in ways we can observe versus in ways that remain hidden until tested?
  • Direct vs. Indirect: Will dangerous capabilities emerge from training specifically designed to create them, or from seemingly unrelated training processes?

As Raemon pointed out in a recent LessWrong post, even a “smooth” takeoff could potentially be very fast in calendar time if AI systems are continuously applied to accelerate their own development.

The Path Forward

The connection between random rewards and instrumental convergence theory gives us both a warning and a research direction:

The Warning: AI capabilities can grow through training processes that appear completely unrelated to dangerous applications. We cannot rely on monitoring specific capabilities in specific domains to detect when AI systems become generally more powerful.

The Research Direction: We need better ways to measure and monitor general optimization power, not just task-specific performance. Success rates, complexity measures, and transfer learning capabilities might all serve as better indicators of true AI capability than performance on any single benchmark.

The random rewards discovery is likely just the tip of the iceberg. As we develop more sophisticated AI systems, we’ll probably discover many more unexpected ways that capability can emerge and grow. Understanding the mathematical principles behind these phenomena - like instrumental convergence - is crucial for building AI systems safely.

Conclusion

The surprising effectiveness of random reward training perfectly illustrates a fundamental principle of AI development: systems that are forced to optimize across many different objectives naturally develop generally powerful capabilities. This isn’t just an interesting empirical finding - it’s a mathematical inevitability predicted by instrumental convergence theory.

More importantly, it reveals how AI capabilities can grow in ways that are difficult to detect and predict. The relationship between observable performance and underlying optimization power is highly non-linear, meaning small visible improvements can hide enormous capability jumps.

As we continue developing more advanced AI systems, monitoring success rates and understanding their connection to optimization power will be crucial for detecting and preparing for various takeoff scenarios. The random rewards discovery shows us that the most significant capability developments might come from the most unexpected places.

We cannot afford to be caught off guard by capabilities hiding in plain sight.

Appendix: Proofs

Proof of Theorem 1

We’ll use the following lemma to help us prove Theorem 1:

Lemma: Let \(\Delta(\sigma) = \sup_{a \in A} \langle a, \sigma \rangle - \sup_{b \in B} \langle b, \sigma \rangle\), and write \(\mathcal{C}_{\Sigma}(A) := \mathbb{E}_{\sigma \sim \Sigma}\left[\sup_{a \in A} \langle a, \sigma \rangle\right]\) for the expected best return of \(A\) under \(\Sigma\). Suppose that \(\Delta^{-1}(\Sigma)\) (the distribution of \(\Delta(\sigma)\) when \(\sigma \sim \Sigma\)) is sub-Gaussian with scale parameter \(\nu_R^2\), and that \(\mathcal{C}_{\Sigma}(A) < \mathcal{C}_{\Sigma}(B)\). Then:

\[p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B))^2}{2\nu_R^2}}\]

Proof of Theorem 1: Apply the lemma with \(A = I_a\) and \(B = I_b\), the sets of occupancy measures available from states \(a\) and \(b\), so that \(\mathcal{C}_{\Sigma}(I_a) = \text{POWER}_{\Sigma}(a)\) and \(\mathcal{C}_{\Sigma}(I_b) = \text{POWER}_{\Sigma}(b)\). The lemma then yields:

\[p_{\Sigma}(I_a \ge I_b) \le e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}}\]

\[\Rightarrow p_{\Sigma}(I_b > I_a) \ge 1 - e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}}\]

\[\Rightarrow \log(p_{\Sigma}(I_a \ge I_b)) \le \cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}\]

\[\Rightarrow \text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a) \le \nu_R \sqrt{2 \log \left( \frac{1}{p_{\Sigma}(I_a \ge I_b)} \right)} \le \nu_R \sqrt{2 \log \left( \frac{1}{1-p_{\Sigma}(I_b > I_a)} \right)},\]

where the final inequality uses \(p_{\Sigma}(I_a \ge I_b) \ge 1 - p_{\Sigma}(I_b > I_a)\).

Thus, relatively optimal policies have bounded power difference. ∎

Proof of Lemma: For any \(\lambda > 0\) we have:

\[p_{\Sigma}(A \ge B) = \mathbb{E}_{\sigma \sim \Sigma} \left[ \mathbb{I}\left(\sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right) \right]\]

\[\le \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \left(\sup_{a \in A} \ \langle a, \sigma \rangle - \sup_{b \in B} \ \langle b , \sigma \rangle \right) } \right] \quad (\text{since } \mathbb{I}(x \ge 0) \le e^{\lambda x} \text{ for } \lambda > 0)\]

\[= \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \Delta(\sigma) } \right] = \mathbb{E}_{\delta \sim \Delta^{-1}(\Sigma)} \left[ e^{\lambda \delta } \right] \quad \text{(change of variables)}\]

\[\le e^{\lambda \mathbb{E}[\delta] + \frac{\lambda^2 \nu_R^2}{2} } = e^{\lambda (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)) + \frac{\lambda^2 \nu_R^2}{2} } \quad \text{(sub-Gaussian moment bound)}\]

Minimizing the exponent over \(\lambda\) gives \(\lambda^* = (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A)) / \nu_R^2\), which is positive by assumption. Substituting \(\lambda^*\) back into the bound gives:

\[p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2}}\]

∎
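As a quick numerical sanity check on the lemma, here is a minimal sketch under the simplifying assumption that \(\Delta(\sigma)\) is exactly Gaussian (and hence sub-Gaussian with the same scale); the mean and scale values are made up for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Simplifying assumption: Delta(sigma) ~ Normal(mu, nu^2) with
# mu = C_Sigma(A) - C_Sigma(B) < 0. A Gaussian is sub-Gaussian with the same
# scale, so the lemma predicts P(Delta >= 0) <= exp(-mu^2 / (2 * nu^2)).
mu, nu = -1.5, 1.0                                   # hypothetical gap and scale
samples = rng.normal(mu, nu, size=1_000_000)

empirical = float((samples >= 0).mean())             # Monte Carlo p_Sigma(A >= B)
bound = math.exp(-(mu ** 2) / (2 * nu ** 2))         # the lemma's bound

print(f"empirical P(Delta >= 0) ~ {empirical:.4f}")  # ~0.067 for these values
print(f"lemma's bound:            {bound:.4f}")      # ~0.325
assert empirical <= bound                            # the bound should hold
```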

Written on May 27, 2025