I will talk about a natural framework for simulation-based optimization and control of Markov decision process (MDP) models. The idea is simple: replace the Bellman operator by its 'empirical' variant, in which the expectation is replaced by a sample-average approximation. This yields a random Bellman operator in the dynamic programming equations. We introduce several notions of probabilistic fixed points for such random operators and show their asymptotic equivalence. We establish convergence of empirical Value and Policy Iteration algorithms via a stochastic dominance argument. The mathematical technique introduced is useful for analyzing iterated random operators beyond the empirical Bellman operator, and may also help in analyzing other algorithms for stochastic optimization. The idea generalizes to asynchronous dynamic programming and is also useful for computing equilibria of zero-sum stochastic games. Preliminary numerical results show better convergence rates and runtime performance than stochastic approximation/reinforcement learning and other commonly used schemes. I will end by talking about Empirical Q-Value Iteration (EQVI), an alternative method for reinforcement learning (RL) with better performance in practice than popular RL algorithms.
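To make the core idea concrete, here is a minimal sketch of empirical value iteration for a finite MDP accessible through a simulator: at each state-action pair, the expectation over next states in the Bellman operator is replaced by an average over simulated samples. All names (`empirical_value_iteration`, `sample_next_state`, the parameter defaults) are illustrative assumptions, not the speaker's code.

```python
import numpy as np

def empirical_value_iteration(sample_next_state, rewards, gamma,
                              n_states, n_actions,
                              n_samples=50, n_iters=200, rng=None):
    """Sketch of empirical value iteration.

    sample_next_state(s, a, rng) -> next state drawn from the simulator.
    rewards: (n_states, n_actions) array of expected one-step rewards.
    The expectation in the Bellman operator is replaced by a sample
    average over n_samples simulated next states, giving a *random*
    Bellman operator that is iterated n_iters times.
    """
    rng = np.random.default_rng() if rng is None else rng
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # Draw next-state samples and average their values,
                # instead of computing the exact expectation.
                ns = [sample_next_state(s, a, rng) for _ in range(n_samples)]
                Q[s, a] = rewards[s, a] + gamma * np.mean([V[j] for j in ns])
        V = Q.max(axis=1)  # greedy (max over actions) update
    return V
```

For a deterministic toy MDP the sample average coincides with the exact expectation, so the sketch reduces to ordinary value iteration; with genuinely random transitions, each iteration applies a random operator, which is exactly the setting whose probabilistic fixed points the talk analyzes.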
This is joint work with Dileep Kalathil, William Haskell and Vivek Borkar.
Rahul Jain is the K. C. Dahlberg Early Career Chair and an Associate Professor in EE, and by courtesy in the CS and ISE Departments, at the University of Southern California, Los Angeles, CA. He received his B.Tech. from IIT Kanpur, and an MA in Statistics and a PhD in EECS from the University of California, Berkeley. Prior to joining USC in Fall 2008, he was at the IBM T. J. Watson Research Center, Yorktown Heights, NY. He is a recipient of the NSF CAREER Award (2010), the ONR Young Investigator Award (2012), an IBM Faculty Award (2010), and the James H. Zumberge Faculty Research and Innovation Award (2009). His main interests are in stochastic models, statistical learning, and game theory. Of late, he has also been working on risk-aware stochastic optimization, queueing theory, and power system economics.