Google DeepMind published a research paper that proposes a way to train large language models so that they provide more reliable answers and are resistant to reward hacking, a step in the development of more adaptable and efficient AI systems.
Hat tip to @EthanLazuk for tweeting about a new research paper from Google DeepMind.
AI Has A Tendency Toward Reward Hacking
Reinforcement Learning from Human Feedback (RLHF) is a method used to train generative AI so that it learns to offer responses that receive positive scores from human raters. The positive scores function as rewards for correct answers, which is why this technique is called Reinforcement Learning; because the rewards come from human raters, it’s called Reinforcement Learning from Human Feedback.
RLHF is highly successful, but it comes with an unintended side effect: the AI learns shortcuts to receiving a positive reward. Instead of providing a correct answer, it provides an answer that merely has the appearance of one. When that answer fools the human raters (a failure of the reinforcement training), the AI improves its ability to fool raters with inaccurate answers in order to receive the rewards (the positive human ratings).
This tendency of the AI to “cheat” in order to earn the training reward is called Reward Hacking, which is what the study seeks to minimize.
The Causes Of Reward Hacking In Large Language Models
To solve the problem of reward hacking, the researchers identified two causes that their solution has to address:
- Distribution shifts
- Inconsistencies in human preferences
Distribution Shifts
Distribution shift refers to the situation where an LLM is trained on one kind of dataset and then, during reinforcement learning, is exposed to kinds of training data it hasn’t seen before. This change in data type is called a distribution shift, and it can cause the language model to manipulate the reward system in order to give a satisfactory answer that it’s otherwise not prepared to provide.
Inconsistencies In Human Preferences
This is a reference to humans being inconsistent in their ratings when judging answers provided by the AI. For example, solving the problem of inconsistency in human preferences is likely one of the motivations behind the creation of the Google Search Quality Raters Guidelines, which have the effect of lessening the influence of subjective preferences.
Human preferences can vary from person to person. Reinforcement Learning from Human Feedback relies on human feedback in the reward model (RM) training process and it’s the inconsistencies that can lead to reward hacking.
Finding a solution is important, as the researchers noted:
“This reward hacking phenomenon poses numerous issues.
First, it degrades performances, manifesting as linguistically flawed or unnecessarily verbose outputs, which do not reflect true human preferences.
Second, it complicates checkpoint selection due to the unreliability of the proxy RM, echoing Goodhart’s Law: ‘when a measure becomes a target, it ceases to be a good measure’.
Third, it can engender sycophancy or amplify social biases, reflecting the limited and skewed demographics of feedback providers.
Lastly and most critically, misalignment due to reward hacking can escalate into safety risks, in particular given the rapid integration of LLMs in everyday life and critical decision-making.”
Weight Averaged Reward Models (WARM)
The Google DeepMind researchers developed a system called Weight Averaged Reward Models (WARM), which creates a proxy model from the combination of multiple individual reward models, each one having slight differences. With WARM, as the number of reward models (RMs) averaged together increases, the results get significantly better, and the system avoids the sudden decline in reliability that happens with standard models.
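The core mechanism can be sketched in a few lines. The toy dict-based “models” below are hypothetical stand-ins for real reward-model checkpoints (WARM averages the weights of full LLM-based RMs fine-tuned from a shared initialization); the function names and values are illustrative, not from the paper.

```python
# A minimal sketch of the WARM idea: average the weights of several
# independently fine-tuned reward models (RMs) into one proxy RM.
# The dicts below are hypothetical stand-ins for real RM checkpoints.

def average_weights(models):
    """Average corresponding parameters across a list of weight dicts."""
    n = len(models)
    return {key: sum(m[key] for m in models) / n for key in models[0]}

# Three RMs fine-tuned with slight differences (e.g., different seeds).
rm_1 = {"w": 0.90, "b": 0.10}
rm_2 = {"w": 1.10, "b": 0.00}
rm_3 = {"w": 1.00, "b": 0.20}

warm_rm = average_weights([rm_1, rm_2, rm_3])
```

Because averaging happens in weight space, the result is a single model, so scoring answers at inference time costs no more than running one RM.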
The WARM system, because it combines multiple models into a single set of weights, has the benefit of being memory efficient and doesn’t slow down the model’s ability to provide answers, in addition to being resistant to reward hacking.
WARM also makes the model more reliable and consistent when dealing with changing data.
What caught my eye is its ability to follow the “updatable machine learning paradigm” which refers to WARM’s ability to adapt and improve by incorporating new data or changes over time, without starting from scratch.
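One way to picture that updatable quality is as a running average: a newly trained reward model can be folded into the existing averaged weights without recomputing everything from scratch. This is a hypothetical illustration of the idea, not code from the paper.

```python
# Hypothetical sketch of the "updatable" aspect: fold one newly trained
# reward model into an existing weight average without starting over.

def update_average(avg_weights, n_models, new_weights):
    """Incrementally extend an average of n_models weight dicts by one."""
    return {
        key: (avg_weights[key] * n_models + new_weights[key]) / (n_models + 1)
        for key in avg_weights
    }

# A proxy RM already averaged from 3 models, plus a 4th trained on newer feedback.
current_avg = {"w": 1.00, "b": 0.10}
new_rm = {"w": 1.20, "b": 0.30}

updated = update_average(current_avg, 3, new_rm)
```

Each reward model can be trained independently (on its own server or dataset) and merged afterward, which is what the researchers mean by “embarrassingly simple parallelization” in the quote below.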
In the following quote, WA means Weight Averaging and RM means reward model.
The researchers explain:
“WARM represents a flexible and pragmatic method to improve the alignment of AI with human values and societal norms.
…WARM follows the updatable machine learning paradigm, eliminating the need for inter-server communication, thus enabling embarrassingly simple parallelization of RMs.
This facilitates its use in federated learning scenario where the data should remain private; moreover, WA would add a layer of privacy and bias mitigation by reducing the memorization of private preference. Then, a straightforward extension of WARM would combine RMs trained on different datasets, for example, coming from different (clusters of) labelers.
…Furthermore, as WA has been shown to limit catastrophic forgetting, WARM could seamlessly support iterative and evolving preferences.”
This research points the way toward more ways of improving AI, but it’s not a complete solution because it has inherent limitations. Among the issues is that it doesn’t completely remove all forms of “spurious correlations or biases inherent in the preference data.”
Yet they did conclude in an upbeat tone about the future of WARM:
“Our empirical results demonstrate its effectiveness when applied to summarization. We anticipate that WARM will contribute to more aligned, transparent, and effective AI systems, encouraging further exploration in reward modeling.”
Read the research paper:
WARM: On the Benefits of Weight Averaged Reward Models
Featured Image by Shutterstock/Mansel Birst