
Improving Foundation Models with Autonomous Sub-Optimal Data
Allen 101X
Abstract: Generalist transformer-based foundation models are typically trained via next-token prediction. This paradigm is equivalent to training the model to imitate the distribution of tokens appearing in the training data. As a result, such a training paradigm is only compatible with "expert", high-quality data that has been carefully vetted by a human. As high-quality data runs out (it has been predicted that we will run out of high-quality internet text data by 2028), we need to turn to alternative paradigms that can train such foundation models on potentially suboptimal data.
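To make the distribution-matching view above concrete, here is a minimal illustrative sketch (a toy construction of my own, not from the talk or papers): the maximum-likelihood solution of the next-token objective is simply the empirical conditional token distribution of the training corpus, i.e., pure imitation of whatever data the model is trained on.

```python
# Minimal sketch (illustrative only): next-token prediction as distribution
# matching. The MLE of a next-token predictor with a one-token context window
# (a bigram model) is the empirical conditional token distribution of the
# corpus -- i.e., pure imitation of the training data.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Empirical next-token distribution p(next | current).
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def p_next(cur, nxt):
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total if total else 0.0

# Average next-token negative log-likelihood (the training loss) of the corpus
# under its own empirical distribution: the minimum any model can reach, and
# it is achieved exactly by imitation.
nll = -sum(math.log(p_next(c, n)) for c, n in zip(corpus, corpus[1:]))
print("avg next-token NLL:", nll / (len(corpus) - 1))
```

The point of the toy example is that the objective itself offers no mechanism to filter or improve upon the data, which is why it depends so heavily on data quality.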
In this talk, I will discuss two threads of our recent work that develop approaches to improve foundation models by training on self-generated, suboptimal data in the context of mathematical reasoning problems. First, I will discuss how training on model-generated data can often improve the efficiency of learning from a given set of training questions, but can also exacerbate memorization and collapse if done naively. Then, I will show how running a specific form of reinforcement learning (RL) on negative data, i.e., failed rollouts that do not solve the problem at hand, can alleviate this issue: negative data helps with credit assignment, i.e., with detecting which steps of a solution are critical to learn from. By leveraging negative data appropriately, we are able to improve the sample efficiency of LLM math reasoning by 8x. I will also discuss theoretically motivated extensions of this approach to full online RL, where it provides insights for training much better process reward models (PRMs). Our PRMs improve over state-of-the-art PRM training approaches by 3-4x by utilizing dense process rewards, achieving sample-efficiency gains of 5-6x for RL. Second, I will discuss how training on suboptimal data via multi-turn RL can imbue a foundation model with the capability of sequentially self-correcting its responses. This self-improvement capability allows us to tackle novel prompts that are otherwise not solvable by simply scaling up model capacity or data under an equal compute budget, and it enables learning better strategies and exploration for solving problems than naively predicting what the answer should be. The underlying theme in both parts is to use suboptimal data to better represent target distributions, in a way that generalizes to novel problems better than standard next-token prediction or RLHF.
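As a rough illustration of the credit-assignment idea in the first thread, here is a minimal sketch in which each solution step is weighted by an estimated advantage. The names, numbers, and the particular advantage estimator are illustrative assumptions of mine, not the method from the papers: the point is only that steps that raise the estimated chance of solving the problem are reinforced, steps that lower it are penalized, and so even failed rollouts carry useful training signal.

```python
# Minimal illustrative sketch: per-step credit assignment using "negative"
# data. Instead of imitating only successful solutions, every step --
# including steps from failed rollouts -- is weighted by an advantage
# estimate: the change in the estimated probability of eventually solving
# the problem caused by that step.

# Each rollout is a list of (step_log_prob, est_success_prob_after_step).
# The second rollout ultimately fails, but its steps still carry signal:
# its first step helps (positive advantage), its later steps hurt.
successful_rollout = [(-0.7, 0.6), (-0.4, 0.8), (-0.3, 0.9)]
failed_rollout     = [(-0.9, 0.5), (-1.2, 0.2), (-0.8, 0.1)]  # negative data
BASELINE = 0.4  # assumed success probability before taking any step

def advantage_weighted_loss(rollouts, baseline=BASELINE):
    """REINFORCE-style surrogate loss with per-step advantages."""
    loss, n_steps = 0.0, 0
    for rollout in rollouts:
        prev_value = baseline
        for log_prob, value_after in rollout:
            advantage = value_after - prev_value  # did this step help or hurt?
            loss += -advantage * log_prob         # reinforce helpful steps,
            prev_value = value_after              # penalize harmful ones
            n_steps += 1
    return loss / n_steps

print("loss on toy data:", advantage_weighted_loss([successful_rollout, failed_rollout]))
```

In practice the step-level success estimates would come from a learned value function or process reward model rather than being hard-coded, which is where the dense process rewards mentioned above come in.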
This talk will be based on a subset of material in the following papers: http://arxiv.org/abs/2407.18219, https://arxiv.org/abs/2406.14532, https://arxiv.org/abs/2410.08146, https://arxiv.org/abs/2409.12917.
Bio: Aviral Kumar is an Assistant Professor of CS and ML at CMU, where he started in September 2024. He is also a part-time research scientist at Google DeepMind. He finished his PhD at UC Berkeley in 2023. His work focuses on enabling machines and AI systems to make intelligent decisions, with an emphasis on developing reinforcement learning techniques that work reliably and efficiently at scale. His most notable work is in the field of offline reinforcement learning. Some of his recent work studies RL at scale with foundation models.