Modeling AI Takeoff Scenarios - Insights from ChatGPT Success Rates and Instrumental Convergence

Optimization Power Curve

Figure 1: The relationship between success rate and relative power difference. As success rates approach extremes (very low or very high), the potential range of relative power differences increases, making it harder to form a consensus about model capabilities and potentially indicating different takeoff scenarios.

Introduction

This article explores the modeling of AI takeoff scenarios using insights from ChatGPT success rates and the concept of instrumental convergence. We’ll examine how these ideas relate to Rademacher complexity in machine learning and discuss how success rates can be used as a proxy for estimating the optimization power of AI models, potentially providing indicators for different takeoff scenarios.

Background on AI Takeoff Scenarios

The debate around AI takeoff scenarios has been a subject of much discussion in the AI alignment community. Traditionally, these scenarios have been categorized as “slow” or “fast” takeoffs. However, as pointed out by Raemon in a recent LessWrong post, these terms can be misleading. The post argues for using terms like “smooth/sharp” or “hard/soft” takeoff instead, as they better capture the nature of the transition without implying anything about the timeline.

In this context, a “smooth” or “soft” takeoff refers to a more continuous progression of AI capabilities, while a “sharp” or “hard” takeoff implies a more sudden jump in capabilities. Importantly, a smooth takeoff doesn’t necessarily mean a slower process in terms of calendar time; it could potentially be faster due to the continuous application of AI resources towards further AI development.

Instrumental Convergence and Power in AI

To understand how we can model these takeoff scenarios, let’s first revisit the concepts of instrumental convergence and power in AI.

Instrumental convergence is the idea that there are states or strategies that are broadly applicable to achieving a wide set of rewards. In the context of AI, this concept is crucial for understanding how an AI system might behave as it becomes more capable.

Turner et al. mathematically formalize the notion of power-seeking in AI. They define power as the expected value of a state over a distribution of reward functions:

\[\text{POWER}(s) \sim \mathbb{E}[\langle \rho_s, r \rangle | \rho_s \text{ is a valid occupancy-measure for s}]\]

This concept is related to the idea of measuring the average optimization ability of a learning algorithm, which is well-studied in machine learning. For example, \(\Sigma\)-complexity measures, such as Rademacher complexity, are used to provide generalization guarantees for machine learning algorithms.

Applying Complexity Measures to Takeoff Scenarios

We can model definitions similar to those used in power-seeking research and apply them to the concept of takeoff scenarios. Let’s introduce some key definitions:

Definition 1: The optimality probability of \(A\) relative to \(B\) under distribution \(\Sigma\) is

\[p_{\Sigma}(A \ge B) := P_{\sigma \sim \Sigma} \left( \sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right)\]

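Definition 2: We’ll say \(B\) is \(\Sigma\)-instrumentally convergent over \(A\) whenever \(p_\Sigma(A \geq B) \leq 1/2\), that is, whenever the best element of \(A\) matches or beats the best element of \(B\) with probability at most one half under \(\Sigma\).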
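Definitions 1 and 2 can be checked numerically on small examples. The following is a minimal Monte Carlo sketch, assuming finite sets \(A\) and \(B\) of vectors and a reward distribution \(\Sigma\) of i.i.d. standard normal coordinates; the function names, the toy sets, and the choice of \(\Sigma\) are all illustrative assumptions, not part of the formal setup.

```python
import random

def support(points, sigma):
    """Supremum over a finite set of the inner product <x, sigma>."""
    return max(sum(x_i * s_i for x_i, s_i in zip(x, sigma)) for x in points)

def optimality_probability(A, B, sample_sigma, n_samples=20_000, seed=0):
    """Monte Carlo estimate of p_Sigma(A >= B) from Definition 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sigma = sample_sigma(rng)
        if support(A, sigma) >= support(B, sigma):
            hits += 1
    return hits / n_samples

def gaussian_sigma(dim):
    """Assumed reward distribution: i.i.d. standard normal coordinates."""
    return lambda rng: [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Toy sets of 3-dimensional vectors (illustrative only).
A = [(1.0, 0.0, 0.0)]
B = [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

p = optimality_probability(A, B, gaussian_sigma(3))
print(f"estimated p(A >= B) = {p:.3f}")
# Definition 2: B is instrumentally convergent over A when p <= 1/2.
print("B instrumentally convergent over A:", p <= 0.5)
```

In this toy case the exact value is 1/3 by symmetry, since each of the three coordinates is equally likely to be the largest, so the estimate should land close to that.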

Definition 3: Assuming zero-mean rewards, the \(\Sigma\)-power of a state is given as:

\[\text{POWER}_{\Sigma}(s) = \mathcal{C}_{\Sigma}(I_s)\]

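where \(I_s\) is the set of feasible (discounted) state-occupancy measures whose initial-state distribution is supported entirely on \(s\), and \(\mathcal{C}_{\Sigma}(A) := \mathbb{E}_{\sigma \sim \Sigma}\left[\sup_{a \in A} \langle a, \sigma \rangle\right]\) is the \(\Sigma\)-complexity of a set \(A\), the form used in the proofs in the appendix.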
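As a rough illustration of Definition 3, the sketch below estimates a \(\Sigma\)-complexity by sampling. It takes \(\mathcal{C}_{\Sigma}(A)\) to be \(\mathbb{E}_{\sigma \sim \Sigma}[\sup_{a \in A} \langle a, \sigma \rangle]\), which is the form the appendix proof relies on, and it assumes Rademacher (\(\pm 1\)) rewards without the usual \(1/n\) normalization. The sets `I_a` and `I_b` are hypothetical stand-ins for occupancy measures.

```python
import random

def complexity(points, sample_sigma, n_samples=20_000, seed=0):
    """Monte Carlo estimate of C_Sigma(points) = E[sup_x <x, sigma>]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sigma = sample_sigma(rng)
        total += max(sum(x_i * s_i for x_i, s_i in zip(x, sigma)) for x in points)
    return total / n_samples

def rademacher_sigma(dim):
    """Assumed zero-mean reward distribution: independent +/-1 coordinates."""
    return lambda rng: [rng.choice((-1.0, 1.0)) for _ in range(dim)]

# Hypothetical stand-ins: state b keeps three options open, state a only two.
I_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
I_b = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

power_a = complexity(I_a, rademacher_sigma(3))
power_b = complexity(I_b, rademacher_sigma(3))
print(f"POWER(a) ~= {power_a:.3f}, POWER(b) ~= {power_b:.3f}")
```

For these toy sets the exact values are 0.5 and 0.75: the maximum of two fair \(\pm 1\) draws is \(+1\) with probability 3/4, and the maximum of three is \(+1\) with probability 7/8. Keeping more options available raises the estimated power.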

Using these definitions, we can state the following theorem:

Theorem 1: Assume \(I_a\) and \(I_b\) have positive \(\Sigma\)-power. Then, if the \(\Sigma\)-power of \(I_b\) exceeds that of \(I_a\), \(I_b\) is more likely than \(I_a\) to be optimal under the reward distribution \(\Sigma\). Moreover, if \(I_b\) is only slightly more likely to be optimal than \(I_a\), its power cannot exceed that of \(I_a\) by much.

ChatGPT Success Rates and Power Estimation

The success rate of ChatGPT’s outputs can be used as a proxy for estimating its optimization power, which in turn can provide insights into potential takeoff scenarios. Defined as the likelihood a user responds to an output from ChatGPT instead of using “Try again,” the success rate provides information about the expected advantage that ChatGPT has over the user.
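In practice the success rate has to be estimated from a finite number of interactions. The sketch below shows one way this might be done, using a Wilson score interval to express the sampling uncertainty. The counts are made up for illustration and the function names are hypothetical; nothing here reflects actual ChatGPT usage data.

```python
import math
from typing import Tuple

def success_rate(accepted: int, retried: int) -> float:
    """Fraction of outputs the user replied to rather than regenerating."""
    total = accepted + retried
    if total == 0:
        raise ValueError("need at least one observed output")
    return accepted / total

def wilson_interval(accepted: int, retried: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the success rate (z = 1.96 is roughly 95%)."""
    n = accepted + retried
    p_hat = success_rate(accepted, retried)
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# Illustrative counts only: 80 replies, 20 uses of "Try again".
accepted, retried = 80, 20
s_hat = success_rate(accepted, retried)
low, high = wilson_interval(accepted, retried)
print(f"s ~= {s_hat:.2f}, interval ~= ({low:.3f}, {high:.3f})")
```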

Deriving the Optimization Power Curve

The curve shown in Figure 1 is derived from a key lemma in our proof of Theorem 1.

Lemma: Let \(\Delta(\sigma) = \sup_{a \in A} \langle a, \sigma \rangle - \sup_{b \in B} \langle b, \sigma \rangle\). Suppose that \(\Delta^{-1}(\Sigma)\), the distribution of \(\Delta(\sigma)\) for \(\sigma \sim \Sigma\), is sub-Gaussian with variance proxy \(\nu_R^2\), and that \(\mathcal{C}_{\Sigma}(A) < \mathcal{C}_{\Sigma}(B)\). Then we have:

\[p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B))^2}{2\nu_R^2}}\]

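To derive our curve, we interpret this lemma in terms of the success rate \(s\) and the relative power difference \(\Delta P\). The success rate \(s\) corresponds to \(p_{\Sigma}(A \ge B)\), where \(A\) stands for the model and \(B\) for the user. The relative power difference is the signed gap \(\Delta P = \mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)\), so that positive values mean the model is the more powerful side. The lemma as stated covers the case \(\Delta P < 0\), and swapping the roles of \(A\) and \(B\) covers \(\Delta P > 0\).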

Through algebraic manipulation and consideration of the complementary probability, we obtain upper and lower bounds on \(\Delta P\) in terms of \(s\). These bounds form the shaded region in Figure 1. The parameter \(\nu_R\) from our lemma determines the width of this region, reflecting the uncertainty in power difference for a given success rate.
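The algebra behind these bounds can be sketched as follows. This is a reconstruction under two assumptions that the text does not spell out: that the lemma may also be applied with the roles of \(A\) and \(B\) swapped, and that the same parameter \(\nu_R\) applies in both directions. The bounds are written for the signed gap \(\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)\) so that the sign stays explicit.

If \(\mathcal{C}_{\Sigma}(A) < \mathcal{C}_{\Sigma}(B)\), the lemma with \(s = p_{\Sigma}(A \ge B)\) gives

\[s \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2}} \quad \Rightarrow \quad \mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B) \ge -\nu_R \sqrt{2 \log \left( \frac{1}{s} \right)}\]

If instead \(\mathcal{C}_{\Sigma}(B) < \mathcal{C}_{\Sigma}(A)\), the swapped lemma together with \(p_{\Sigma}(B \ge A) \ge p_{\Sigma}(B > A) = 1 - s\) gives

\[1 - s \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B))^2}{2\nu_R^2}} \quad \Rightarrow \quad \mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B) \le \nu_R \sqrt{2 \log \left( \frac{1}{1 - s} \right)}\]

Each bound holds trivially in the opposite case, because the lower bound is never positive and the upper bound is never negative, so the two combine into

\[-\nu_R \sqrt{2 \log \left( \frac{1}{s} \right)} \le \mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B) \le \nu_R \sqrt{2 \log \left( \frac{1}{1 - s} \right)}\]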

The resulting curve exhibits asymptotic behavior as \(s\) approaches 0 or 1, with bounds approaching negative or positive infinity, respectively. For intermediate success rates, the relationship appears roughly linear. This shape suggests that an AI system’s success rate can provide meaningful bounds on its relative power, though these bounds become less informative at extreme success rates.
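A small numerical sketch can make this shape concrete. It assumes the two-sided reading of the lemma, in which the signed gap \(\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)\) lies between \(-\nu_R \sqrt{2 \log(1/s)}\) and \(\nu_R \sqrt{2 \log(1/(1-s))}\). The function name `advantage_bounds` and the default \(\nu_R = 1\) are illustrative choices, not values taken from the figure.

```python
import math
from typing import Tuple

def advantage_bounds(s: float, nu: float = 1.0) -> Tuple[float, float]:
    """Bounds on C(A) - C(B) given success rate s = p(A >= B).

    Assumes the sub-Gaussian lemma applies in both directions with the
    same scale nu (an assumption of this sketch).
    """
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie strictly between 0 and 1")
    lower = -nu * math.sqrt(2.0 * math.log(1.0 / s))
    upper = nu * math.sqrt(2.0 * math.log(1.0 / (1.0 - s)))
    return lower, upper

# Tabulate the band at a few success rates.
for s in (0.01, 0.25, 0.5, 0.75, 0.99):
    lower, upper = advantage_bounds(s)
    print(f"s={s:.2f}  lower={lower:+.3f}  upper={upper:+.3f}  width={upper - lower:.3f}")
```

At \(s = 0.5\) the band is symmetric around zero, and its width grows as \(s\) moves toward either extreme: the lower bound diverges as \(s \to 0\) and the upper bound diverges as \(s \to 1\). Scaling `nu` stretches the whole band proportionally.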

In the context of AI takeoff scenarios, this curve offers insights into how we might detect and understand different trajectories of AI development. The wide range of possible power differences at low success rates could indicate early stages of development or a “soft” takeoff. The more linear middle range might represent steady progress. Finally, the rapid divergence at high success rates could signal a potential “hard” takeoff, where small observable improvements correspond to large capability jumps.

Implications for Takeoff Scenarios

The shape of this curve has important implications for modeling takeoff scenarios:

  1. Low success rates: When success rates are low, there’s a larger spread of relative optimization power. This could indicate early stages of AI development or a “soft” takeoff scenario where progress is gradual and potentially hard to detect.

  2. Intermediate success rates: As success rates increase, we see a more linear relationship between success rate and relative power. This might correspond to a period of steady, observable progress in AI capabilities.

  3. High success rates: As success rates approach very high values, we again see a large spread in relative power. This could indicate a potential “hard” takeoff scenario, where small increases in observable performance (success rate) could correspond to large jumps in actual capability (relative power).

Conclusion

By analyzing the relationship between ChatGPT’s success rates and relative power, we can gain insights into potential AI takeoff scenarios. The non-linear nature of this relationship suggests that both “soft” and “hard” takeoff scenarios are possible, depending on where we are on the curve.

As we continue to develop and deploy more advanced AI systems, carefully monitoring and analyzing such metrics will be crucial for understanding and preparing for various takeoff scenarios. However, it’s important to note that this model is a simplification, and real-world AI development may involve additional complexities not captured here.

Appendix: Proofs

Proof of Theorem 1

We’ll use the following lemma to help us prove Theorem 1:

Lemma: Let \(\Delta(\sigma) = \sup_{a \in A} \langle a, \sigma \rangle - \sup_{b \in B} \langle b, \sigma \rangle\). Suppose that \(\Delta^{-1}(\Sigma)\), the distribution of \(\Delta(\sigma)\) for \(\sigma \sim \Sigma\), is sub-Gaussian with variance proxy \(\nu_R^2\), and that \(\mathcal{C}_{\Sigma}(A) < \mathcal{C}_{\Sigma}(B)\). Then we have:

\[p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B))^2}{2\nu_R^2}}\]

Proof of Theorem 1: The previous lemma yields:

\[p_{\Sigma}(I_a \ge I_b) \le e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}}\]
\[\Rightarrow p_{\Sigma}(I_b > I_a) \ge 1 - e^{\cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}}\]
\[\Rightarrow \log(p_{\Sigma}(I_a \ge I_b)) \le \cfrac{- (\text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a))^2}{2\nu_R^2}\]
\[\Rightarrow \text{POWER}_{\Sigma}(b) - \text{POWER}_{\Sigma}(a) \le \nu_R \sqrt{2 \log \left( \frac{1}{p_{\Sigma}(I_a \ge I_b)} \right)}\]
\[= \nu_R \sqrt{2 \log \left( \frac{1}{1-p_{\Sigma}(I_b > I_a)} \right)}\]

Thus, relatively optimal policies have bounded power difference. ∎

Proof of Lemma: For any \(\lambda > 0\) we have:

\[p_{\Sigma}(A \ge B) = \mathbb{E}_{\sigma \sim \Sigma} \left[ \mathbb{I}\left(\sup_{a \in A} \ \langle a, \sigma \rangle \ge \sup_{b \in B} \ \langle b , \sigma \rangle \right) \right]\]
\[\le \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \left(\sup_{a \in A} \ \langle a, \sigma \rangle - \sup_{b \in B} \ \langle b , \sigma \rangle \right) } \right] \quad \text{(0-1 Loss)}\]
\[= \mathbb{E}_{\sigma \sim \Sigma} \left[ e^{\lambda \Delta(\sigma) } \right] = \mathbb{E}_{\delta \sim \Delta^{-1}(\Sigma)} \left[ e^{\lambda \delta } \right] \quad \text{(Change of Variables)}\]
\[\le e^{\lambda \mathbb{E}[\delta] + \frac{\lambda^2 \nu_R^2}{2} } = e^{\lambda (\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B)) + \frac{\lambda^2 \nu_R^2}{2} } \quad \text{(Sub-Gaussianity)}\]

Minimizing the exponent over \(\lambda\) gives \(\lambda = (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A)) / \nu_R^2\), which is positive by assumption. Substituting this value yields:

\[p_{\Sigma}(A \ge B) \le e^{\cfrac{- (\mathcal{C}_{\Sigma}(B) - \mathcal{C}_{\Sigma}(A))^2}{2\nu_R^2}}\]
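The optimization step and the final bound can be sanity-checked numerically. The sketch below assumes the simplest sub-Gaussian case, where \(\delta\) is exactly Gaussian with mean \(\mathcal{C}_{\Sigma}(A) - \mathcal{C}_{\Sigma}(B) = -\text{gap}\) and standard deviation \(\nu_R\); the values `gap = 1.5` and `nu = 1.0` are arbitrary illustrative choices.

```python
import math

def chernoff_exponent(lam, gap, nu):
    """Exponent -lam * gap + lam^2 * nu^2 / 2, with gap = C(B) - C(A) > 0."""
    return -lam * gap + 0.5 * (lam * nu) ** 2

def lemma_bound(gap, nu):
    """Right-hand side of the lemma: exp(-gap^2 / (2 nu^2))."""
    return math.exp(-gap ** 2 / (2.0 * nu ** 2))

def gaussian_tail(gap, nu):
    """Exact P(delta >= 0) when delta ~ N(-gap, nu^2) (assumed distribution)."""
    return 0.5 * math.erfc(gap / (nu * math.sqrt(2.0)))

gap, nu = 1.5, 1.0
lam_star = gap / nu ** 2  # closed-form minimizer from the proof

# A coarse grid search should land on the same minimizer.
grid = [0.01 * k for k in range(1, 501)]
lam_grid = min(grid, key=lambda lam: chernoff_exponent(lam, gap, nu))

print(f"closed-form lambda = {lam_star:.2f}, grid minimizer = {lam_grid:.2f}")
print(f"exact tail = {gaussian_tail(gap, nu):.4f}, lemma bound = {lemma_bound(gap, nu):.4f}")
```

The exact tail probability should sit below the lemma's bound, and the exponent evaluated at the closed-form \(\lambda\) should match \(-\text{gap}^2 / (2\nu_R^2)\).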

Written on October 8, 2024