
Statistics Department Seminar presents "Some theoretical results on model-based reinforcement learning"

Topic: Some theoretical results on model-based reinforcement learning
Date: Tuesday, November 3, 2020 - 4:30pm
Venue: Registration required
Speaker: Mengdi Wang (Princeton University)
Abstract / Description:

We discuss some recent results on model-based methods for reinforcement learning (RL) in both the online and offline settings.

For the online RL problem, we discuss several model-based RL methods that adaptively explore an unknown environment and learn to act with provable regret bounds. In particular, we focus on finite-horizon episodic RL where the unknown transition law belongs to a generic family of models. We propose a model-based "value-targeted regression" RL algorithm built on an optimism principle: in each episode, the algorithm constructs the set of models that are "consistent" with the data collected so far, where consistency is measured by the total squared error the model incurs when predicting value targets, i.e., the values of observed next states under the most recent value estimate. The next value function is then chosen by solving the optimistic planning problem over the constructed set of models; a minimal sketch of this episode loop follows. We derive a regret bound for an arbitrary family of transition models using the notion of the Eluder dimension proposed by Russo and Van Roy (2014).
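To make the episode loop concrete, here is a minimal sketch of optimistic value-targeted regression in a small tabular MDP. Everything concrete in it is an illustrative assumption: the finite candidate family standing in for a generic model class, the known reward function, the constant confidence radius beta, and all problem sizes; the abstract's algorithm is stated for arbitrary model families with a theoretically calibrated radius.

```python
import numpy as np

# Minimal sketch of optimistic value-targeted regression (VTR) in a
# small tabular MDP. The finite candidate family, known reward, sizes,
# and constant radius `beta` are all illustrative assumptions.

rng = np.random.default_rng(0)
S, A, H, K = 5, 2, 8, 40              # states, actions, horizon, episodes
reward = rng.uniform(size=(S, A))     # reward function, assumed known here

def random_kernel():
    p = rng.uniform(size=(S, A, S))
    return p / p.sum(axis=-1, keepdims=True)

true_P = random_kernel()              # environment's true transition law
models = [true_P] + [random_kernel() for _ in range(30)]  # candidate family

def plan(P):
    """Backward value iteration under kernel P; stage values and policy."""
    Vs = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P @ Vs[h + 1]    # r(s,a) + E_{s'~P(.|s,a)} V_{h+1}(s')
        policy[h] = Q.argmax(axis=1)
        Vs[h] = Q.max(axis=1)
    return Vs, policy

history = []                          # (s, a, s', value estimate used)
beta = 1.0                            # confidence radius (tuning constant)

def vtr_loss(P):
    """Total squared error of predicted vs. realized value targets."""
    return sum((P[s, a] @ v - v[sp]) ** 2 for (s, a, sp, v) in history)

for k in range(K):
    # Consistent set: models whose regression loss is near the best fit.
    losses = [vtr_loss(P) for P in models]
    consistent = [P for P, L in zip(models, losses) if L <= min(losses) + beta]
    # Optimism: act under the consistent model promising the highest value.
    Vs, pi = max((plan(P) for P in consistent), key=lambda vp: vp[0][0, 0])
    s = 0
    for h in range(H):                # roll out in the true environment
        a = pi[h, s]
        sp = rng.choice(S, p=true_P[s, a])
        history.append((s, a, sp, Vs[h + 1].copy()))
        s = sp
```

The distinguishing step named in the abstract is that the consistency loss regresses each model onto realized value targets rather than onto raw transition frequencies.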


Next we discuss batch-data (offline) reinforcement learning, where the goal is to predict the value of a new policy using data generated by some behavior policy (which may be unknown). We show that the fitted Q-iteration method with linear function approximation is equivalent to a model-based plug-in estimator; a sketch of the iteration follows. We establish that this model-based estimator is minimax optimal and that its statistical limit is determined by a form of restricted chi-square divergence between the two policies.
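For concreteness, below is a minimal sketch of fitted Q-iteration with linear function approximation on synthetic logged data. The feature map, dataset, discount factor, and fixed deterministic target policy are all placeholder assumptions; the abstract's result concerns what this procedure estimates (a model-based plug-in value) and its minimax optimality, neither of which this toy code attempts to demonstrate.

```python
import numpy as np

# Minimal sketch of fitted Q-iteration (FQI) with linear function
# approximation for offline policy evaluation. All data and the target
# policy below are synthetic placeholders.

rng = np.random.default_rng(1)
n, d, A, T = 500, 6, 3, 50            # samples, feature dim, actions, iterations
gamma = 0.9                           # discount factor (illustrative)

# Logged transitions from an unknown behavior policy: features of each
# (s_i, a_i), observed rewards, and features of (s'_i, a) for every a,
# so the target policy can be evaluated at the next state.
phi_sa = rng.normal(size=(n, d))      # phi(s_i, a_i)
rewards = rng.normal(size=n)
phi_next = rng.normal(size=(n, A, d)) # phi(s'_i, a) for each action a

def target_action(phi_state_actions):
    """Deterministic target policy; an arbitrary fixed rule for this sketch."""
    return 0

w = np.zeros(d)                       # linear Q-function weights
for t in range(T):
    # Regression targets: r_i + gamma * Q_w(s'_i, pi(s'_i)).
    a_pi = np.array([target_action(phi_next[i]) for i in range(n)])
    y = rewards + gamma * (phi_next[np.arange(n), a_pi] @ w)
    # Least-squares refit of the Q weights (small ridge term for stability).
    w = np.linalg.solve(phi_sa.T @ phi_sa + 1e-6 * np.eye(d), phi_sa.T @ y)

# Plug-in estimate of Q at a query state-action pair:
phi_query = rng.normal(size=d)        # phi(s0, pi(s0)), a placeholder query
print("estimated Q value:", phi_query @ w)
```

Because each iteration is a linear least-squares fit, the fixed point of this loop has a closed form, which is essentially how an equivalence to a model-based plug-in estimator can be seen.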