## ISL Colloquium: Reinforcement Learning without Reinforcement

Reinforcement Learning (RL) is concerned with solving sequential decision-making problems under uncertainty. RL really comprises two problems. The first is the 'Bellman problem': finding the optimal policy given the model, which may involve very large state spaces. Various approximate dynamic programming and RL schemes have been developed, but they either come with no guarantees, are not universal, or are rather slow. Indeed, most RL algorithms have become synonymous with stochastic approximation (SA) schemes that are known to be rather slow. The problem is even more difficult for MDPs with continuous state (and action) spaces. We present a class of non-SA algorithms for reinforcement learning in continuous-state-space MDPs based on 'empirical' ideas, which are simple, effective, and yet universal, with probabilistic guarantees. The idea combines randomized kernel-based function fitting with 'empirical' updates. The key is the first known "probabilistic contraction analysis" method, which we have developed for the analysis of fairly general stochastic iterative algorithms; we show convergence to a probabilistic fixed point of a sequence of random operators via a stochastic dominance argument.
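As a rough illustration of the flavor of these ideas (not the speaker's actual algorithm), the sketch below runs an 'empirical' value iteration on a toy continuous-state MDP: the expectation in the Bellman operator is replaced by an average over sampled next states, and the value function is fitted from a random set of sample states by Nadaraya-Watson kernel regression. The MDP, bandwidth, and sample sizes are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: state in [0, 1], two actions that drift the
# state left or right with noise, reward highest near s = 0.5.
GAMMA = 0.9

def step(s, a):
    drift = 0.1 if a == 1 else -0.1
    s_next = np.clip(s + drift + 0.05 * rng.standard_normal(), 0.0, 1.0)
    reward = -abs(s - 0.5)
    return s_next, reward

def kernel_fit(centers, values, query, bandwidth=0.1):
    # Nadaraya-Watson kernel regression: smooth the values fitted at
    # the sample states onto arbitrary query states.
    w = np.exp(-((query[:, None] - centers[None, :]) ** 2) / (2 * bandwidth**2))
    return (w @ values) / w.sum(axis=1)

# Empirical value iteration over randomly sampled states: the Bellman
# expectation is approximated by an empirical mean over sampled
# transitions, then V is extended off-sample by kernel regression.
states = rng.uniform(0.0, 1.0, size=100)
V = np.zeros_like(states)
for _ in range(50):
    targets = np.empty_like(V)
    for i, s in enumerate(states):
        q_vals = []
        for a in (0, 1):
            samples = [step(s, a) for _ in range(20)]  # empirical expectation
            s_next = np.array([sn for sn, _ in samples])
            r = np.mean([rw for _, rw in samples])
            q_vals.append(r + GAMMA * kernel_fit(states, V, s_next).mean())
        targets[i] = max(q_vals)
    V = targets

# The fitted value function should peak near the rewarding region s = 0.5.
best = states[np.argmax(V)]
```

Unlike SA-style updates that mix old and new estimates with a decaying step size, each sweep here fully replaces the value estimates, which is what the 'empirical' (non-SA) viewpoint refers to.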

The second RL problem is the 'online learning (or Lai-Robbins) problem', which arises when the model itself is unknown. We propose a simple posterior-sampling-based regret-minimizing reinforcement learning algorithm for MDPs. It achieves O(√T) regret, which is order-optimal. It not only optimally manages the "exploration versus exploitation" tradeoff but also obviates the need for expensive exploration computations. The algorithm differs from classical adaptive control in its focus on non-asymptotic regret optimality rather than asymptotic stability. This appears to resolve a long-standing open problem in Reinforcement Learning.
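To convey the posterior-sampling idea in miniature (a generic Thompson-sampling-for-MDPs sketch, not necessarily the speaker's algorithm), the example below keeps a Dirichlet posterior over each unknown transition distribution of a small finite MDP, samples a model at the start of each episode, acts greedily under the sampled model, and updates the posterior from observed transitions. The MDP, episode length, and prior are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action MDP; the true transition kernel is
# hidden from the learner. Reward 1 is earned in state 1.
P_true = np.array([[[0.9, 0.1], [0.7, 0.3]],
                   [[0.2, 0.8], [0.05, 0.95]]])  # P_true[s, a, s']
R = np.array([[0.0, 0.0], [1.0, 1.0]])           # R[s, a]
nS, nA = 2, 2
GAMMA = 0.95

def solve_mdp(P):
    # Value iteration on a (sampled) model; returns the greedy policy.
    V = np.zeros(nS)
    for _ in range(200):
        Q = R + GAMMA * P @ V  # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Posterior sampling: Dirichlet(1, ..., 1) prior on each row P[s, a, :].
counts = np.ones((nS, nA, nS))
s = 0
for episode in range(200):
    # Sample one plausible model from the posterior, then commit to
    # its optimal policy for the whole episode (no explicit bonuses).
    P_sample = np.array([[rng.dirichlet(counts[si, ai])
                          for ai in range(nA)] for si in range(nS)])
    policy = solve_mdp(P_sample)
    for _ in range(20):  # fixed-length episode
        a = policy[s]
        s_next = rng.choice(nS, p=P_true[s, a])
        counts[s, a, s_next] += 1  # Dirichlet posterior update
        s = s_next

# Greedy policy under the posterior mean after learning.
P_mean = counts / counts.sum(axis=-1, keepdims=True)
learned_policy = solve_mdp(P_mean)
```

Note how exploration is driven entirely by posterior uncertainty: early on, the sampled models vary widely and induce diverse policies; as the posterior concentrates, the sampled policy settles on the optimal one, with no optimism bonuses or separate exploration phase to compute.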