
ISL Colloquium: Nonstationary Bandits and Predictive Sampling

Speaker: Benjamin Van Roy (Stanford)
Location: GSB Room E102
Date: Dec 14

Abstract: Bandit learning has served many applications and is poised to play a major role in emerging generative AI systems that rely on learning from human feedback after pretraining enormous models on massive data sets. Thompson sampling is a popular approach, owing to its effectiveness across a wide range of environments and its scalability through use of approximation methods such as epistemic neural networks. However, Thompson sampling is designed for stationary bandits and does not fare as well in nonstationary ones. Modern applications call for methods that address nonstationarity.
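To make the abstract's starting point concrete, here is a minimal sketch of Thompson sampling on a two-armed Bernoulli bandit with uniform Beta(1, 1) priors. The function names, the arm means, and the horizon are illustrative choices of mine, not anything specified in the talk:

```python
import numpy as np

def thompson_step(successes, failures, rng):
    # Sample one plausible mean per arm from its Beta posterior,
    # then play the arm whose sampled mean is largest.
    samples = rng.beta(successes + 1.0, failures + 1.0)
    return int(np.argmax(samples))

def run_thompson(true_means, horizon, seed=0):
    # Stationary Bernoulli bandit: the arm means never change.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.zeros(k)
    failures = np.zeros(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        arm = thompson_step(successes, failures, rng)
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls

pulls = run_thompson([0.3, 0.7], horizon=2000)
```

Because the environment is stationary, the posterior concentrates on the better arm and exploration dies out over time, which is exactly the behavior that becomes a liability when the arm means drift.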


In this talk, I will propose coherent definitions of stationary and nonstationary bandits. I will then discuss how Thompson sampling fails in nonstationary bandits and how to fix it. This gives rise to a new algorithm: predictive sampling. Applied to stationary bandits, predictive sampling is equivalent to Thompson sampling, but their behaviors differ in nonstationary bandits. I will also present a way of characterizing regret for nonstationary bandits and interpret a bound on the regret incurred by predictive sampling.
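The failure mode the talk addresses can be previewed with a toy simulation: Thompson sampling on a two-arm Bernoulli bandit whose means follow a clipped random walk. The drift model, the discounting remedy, and all parameters below are my own illustrative assumptions; discounted posterior updates are a common simple fix for nonstationarity, not the predictive sampling algorithm the talk introduces:

```python
import numpy as np

def ts_regret(horizon=5000, sigma=0.03, discount=1.0, seed=1):
    # Thompson sampling on a two-arm Bernoulli bandit whose means
    # drift as a clipped Gaussian random walk (a toy nonstationary
    # environment, not the formalism from the talk).
    # discount < 1 geometrically forgets old evidence; with
    # discount == 1 this is vanilla Thompson sampling.
    rng = np.random.default_rng(seed)
    means = np.array([0.3, 0.7])
    s = np.zeros(2)  # pseudo-counts of successes
    f = np.zeros(2)  # pseudo-counts of failures
    regret = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(s + 1.0, f + 1.0)))
        reward = float(rng.random() < means[arm])
        s *= discount
        f *= discount
        s[arm] += reward
        f[arm] += 1.0 - reward
        regret += means.max() - means[arm]
        # The environment drifts: each mean takes a small step.
        means = np.clip(means + rng.normal(0.0, sigma, size=2),
                        0.01, 0.99)
    return regret

vanilla = ts_regret(discount=1.0)
forgetful = ts_regret(discount=0.99)
```

With no discounting, the posterior keeps concentrating on stale evidence and adapts slowly when the best arm changes; in typical runs the discounted variant incurs less regret, though exact numbers depend on the seed and drift rate.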