fleetingbits.io

Monitoring Frontier Models for Misbehavior and the Risks of Promoting Obfuscation

I'm starting a new series of posts about interesting journal articles I've read, where each post will break down the paper's content, outline potential follow-up experiments, highlight connected research, and explore how the ideas fit into the larger AI landscape.

This first post is about Monitoring Frontier Models for Misbehavior and the Risks of Promoting Obfuscation. The paper describes experiments conducted by OpenAI that tested whether monitors can prevent reasoning models from developing reward hacking strategies.

Background

Modern reasoning models are trained using a mix of perfect and imperfect verifiers. The problem with imperfect verifiers is that models can produce states that the verifier will judge as correct solutions but which are not correct. This can cause the models to learn policies to produce these states. Learning to produce states that trick imperfect verifiers is called reward hacking.

We want to prevent models from reward hacking. We have the option to restrict our training to perfect verifiers and nothing else. However, restricting ourselves to perfect verifiers would confine our training exclusively to domains where such verifiers exist. And, numerous practical applications such as coding and information retrieval present a dilemma: these are areas where we seek to enhance model performance, yet perfect verification remains unavailable.

So, let's consider alternatives. What do we know about reasoning models? The construction of reasoning models follows a two-stage process. First, we create a base model by pretraining on a large corpora of tokens, most of which is human-interpretable. When we apply reinforcement learning with verifiable rewards to the base model, the model develops the ability to produce longer chains of thought, resulting in higher-quality answers. Since the base model learns from human-readable text, its chains of thought during early training should remain in human-readable form.

This indicates that for tasks for which the model has learned to use chain of thought in order to produce better answers, we can look at the chain of thought to see if the model is using techniques that we would consider reward hacking to complete the task. So, we implement a monitor, a language model that takes the task, chain of thought, tool calls, tool responses and outputs and returns a classification as to whether or not the reasoning model being trained is reward hacking.

Experiments

OpenAI reported the results of 2 experiments.

The first experiment was designed to see whether the chain of thought contains useful information related to reward hacking over and above the information contained in the model tool calls and outputs. OpenAI set up two GPT-4o monitors for a frontier reasoning model training run, the first monitor had access to the chain of thought in addition to the tool calls and outputs, while the second monitor only had access to the model tool calls and outputs.

OpenAI reported the performance of the two monitors in identifying two systematic reward hacks. The first was exit(0), an issue where the agent exploited a bug that allowed it to exit from the training environment early without running all unit tests. The second was raise SkipTest, in which the agent raised an exception from functions outside the testing framework in order to skip unit test evaluation.

OpenAI found that the monitor with access to the chain of thought had 95% recall over the two reward hacks with 77% precision while the monitor without access to the chain of thought had only 60% recall over the two reward hacks with 81% precision. The precision was potentially under-reported because they only compared the performance of the monitors to a regex that checked for the two reward hacks and the monitors also found other reward hacks in addition to the two for which results were reported.

The experiment demonstrates that chain of thought contains model-interpetable information about the model's problem-solving process, and this remains true even under the high levels of optimization currently used in frontier reasoning model training.

However, it's important to note that we observed this interpretability specifically in tasks where the model learned to use chain of thought reasoning. Papers like Reasoning Models Don't Always Say What They Think suggest that these findings do not generalize to tasks that receive little to no reasoning training, because the model has not learned to use its chain of thought to improve its answers for these tasks, or to tasks which do not require a chain of thought, because again there would have been no optimization pressure on the chain of thought to provide information on its reasoning.

The second set of experiments were designed to see if the monitor is a good reward signal.

OpenAI first created an environment that would cause the model to develop reward hacking capabilities. They created a programming environment where the reasoning model was rewarded if the model wrote code that passed test cases. However, the model could edit the test cases. The model was said to reward hack if the test cases in the environment passed but the original unaltered test cases did not pass.

The researchers used this environment to establish a baseline agent. The baseline agent learned to completely solve the programming problems. At the end of training, it reward hacked 70% of the time and otherwise solved the problems without reward hacking about 30% of the time. The GPT-4o monitor, which had access to the chain of thought but not the tool calls or code patches, detected the reward hacking with almost 100% recall and 90% precision.

OpenAI then trained a model using the GPT-4o monitor as an additional reward signal. The researchers found that, at the end of training, the model learned to reward hack 10% less often than the baseline agent but the reward hacking, which occured about 60% of the time, was no longer detectable by the GPT-4o monitor. During training, the monitor never caught more than 5% of the cheating done by the agent.

OpenAI then compared the model trained using the GPT-4o monitor as an additional reward signal to the baseline agent but using rejection sampling for both agents. They would take up to 32 samples from each agent and would return the first sample that passed the monitor. They found that the legitimate performance gap between the two agents closed but the baseline agent was still not able to get samples where its reward hacked past the monitor. This showed that for additional inference cost, you could use the baseline agent to close the performance gap, while still avoiding reward hacking.

Thoughts

I think that OpenAI wants us to take away from this paper that we should not put direct optimization pressure on the chain of thought using a monitor to avoid reward hacking because, if we do and our models learn to reward hack, then that reward hacking will no longer be detectable.

First, when assessing this, I think it's worth reflecting on what leads the model to reward hack.

Models generally learn to reward hack when it is easier to learn to reward hack than it is to learn to solve the problem that we want them to solve. It is possible that if OpenAI had allowed the monitor being used as a reward signal to see the model tool calls and code patches then there would have been enough of a barrier to the model overcoming the monitor that it still would have been easier to learn to solve the problems than it would have been to reward hack. It would be interesting to see this as an experiment.

More generally, though, there is probably some relationship between curriculum learning and reward hacking. Reward hacking is more likely if we expose the models to harder problems because it is then more likely that there is some exploit in the environment that the model can exploit that is easier to find than solving the problem itself. And, this logic should also apply to reward hacking the monitor; the easier it is for the model to solve the problem, the less likely it is for the model to figure out how to reward hack the monitor. So, we might be able to calibrate the risk of reward hacking through the difficulty curve of our curriculum.

There are also other ideas that we could imagine like from the prover-verifier games paper, where we co-train a prover and verifier and use the verifier as the monitor. But, this is expensive in terms of compute and, like GANs, it might be hard to ensure stable training. Nonetheless, it looks like an interesting research direction that at least the superalignment team thought was worth pursuing and which will probably receive more research as labs move away from RLVR to more uses of learned verifiers for more soft tasks.

Second, though, I think we should take a step back and think about the whole effort of trying to ensure that the chain of thought is not deceptive against a fixed monitor.

As we subject models to more intense optimization, through reinforcement learning, we can imagine that we could begin to cause the models to drift farther from their pretraining distribution, such that the meaning of words in the chain of thought ceases to reflect their ordinary meaning. Human language drifts over time as we invent new words to refer to new social and technological concepts. It seems possible that language models will also invent new terms and phrases in RL and that this, at the limit, will cause semantic drift such that the chain of thought is no longer monitorable.

We should begin to think about what to do in this case. Perhaps, we should have some canary concept for our monitors. It might be an interesting experiment to find problems that a monitor cannot solve, but which it can solve with the reasoning trace from a frontier model, and then through RL check to see whether those problems are still solvable by the monitor using reasoning traces produced by the model at each epoch.

I'm not sure that this really works in terms of how reasoning models work; how much of their new behavior would only be related to harder problems that they encountered later in RL; but I think it is an interesting and valuable area of research. There is no point in preserving the chains of thought against a monitor if the chains of thought themselves become uninformative under intense optimization pressure.