# TD Learning — Richard Sutton Presentation at Reinforcement Learning Summer School, Montreal 2017

These are highlights and slides from Prof. Sutton’s lecture:

- this is really a special time; perhaps we are at a point where we understand enough of how the mind works that we can create intelligence by design; the exact timeline is not known, but we can say with high probability that it will occur in the next few decades
- why is this happening? Moore's Law; the increase in available computation alters everything and is profound (we have reached the end of Moore's Law/Dennard scaling, so further growth in compute power now depends on innovative processor architectures)

- methods that scale with computation ("weak AI") are the future of AI; as we get more computing power these methods become more powerful (which hasn't always been true)
- if you think DL and RL are on the right side of history, you might be wrong
- Supervised Learning and model-free Reinforcement Learning are only weakly scalable; for example, learning a value function and improving a policy gives you only one object, even with a few features (meaning it can't efficiently use increasing computing power)
- SL is weakly scalable because it requires humans to provide labelled data sets; you want to be able to learn from raw data

- what is fully scalable? prediction learning, i.e. learning to predict what will happen; it is **unsupervised supervised learning**: we have a target, we just wait and see what happens, and that is our target; you have a target, but you don't need a human to provide it, you just get it from the data; this is scalable model-free learning, and maybe it is *the* scalable model-free learning
- prediction learning is at the heart of our control methods, where we learn value functions; TD learning is *the* scalable, model-free method

- predicting is the key problem; DL and SL are not about prediction (they only predict what a label would be; they are not about predicting over time, about waiting to see what happens, which makes all the difference!)
- TD is a method of learning to predict, widely used in RL to predict future rewards, i.e. value functions; it is the center, the core of many methods (Q-learning, SARSA, TD(λ), Deep Q-Networks, actor-critic); TD is not universally used: AlphaGo and policy-based methods don't use it

- TD is sort of optional; it certainly feels optional, but perhaps you should always be using it; it is ubiquitous
- it can be used to predict anything; it is a general method for prediction learning, not just for rewards
- the criterion that distinguishes TD from other (possibly SL) methods is whether they update a guess from a guess (which sounds dangerous: how do we constrain/anchor it, how do we tie it down?)
- TD learning is learning one prediction from another, later prediction: an estimate from an estimate, which works if the later estimate is good enough

- the TD error is the difference between two temporally successive predictions; if you play a game, make a move and think "I am winning", then make another move and think "now I am losing", you try to learn from that; you don't know who is going to win, but you try to learn from the difference in your predictions
- where does the target come from: does it come from the end of the game, or from your next prediction?
- TD-Gammon, the Backgammon-playing program from 1992, used a single NN with one hidden layer which predicted the probability of winning; it computed the TD error as the difference between two successive win-probability predictions, backpropagated that error, and learnt by playing against itself: learning from trial and error
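The TD error described above can be sketched in a few lines (a minimal illustration; the function name and the example probabilities are mine, not from the lecture):

```python
def td_error(reward, v_current, v_next, gamma=1.0):
    """TD error: delta = R + gamma * V(next state) - V(current state)."""
    return reward + gamma * v_next - v_current

# Mid-game in the TD-Gammon spirit: no reward yet, only two successive
# win-probability predictions. "I am winning" (0.8), then after the
# opponent's reply, "now I am losing" (0.3).
delta = td_error(reward=0.0, v_current=0.8, v_next=0.3)
# delta is negative: the earlier prediction was too optimistic,
# so learning pushes V(current state) downward.
```

Note that the learning signal exists before anyone knows who actually wins; that is exactly the "learn from the difference in your predictions" idea.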

- the question is: do we need to use TD learning, and when do we need it? many people in RL think you can get away without it
- it is only relevant in multi-step prediction problems, i.e. only when the thing predicted is multiple steps in the future
- TD learning is broadly applicable, as almost everything you want to do is a multi-step prediction: what the stock market index will be, for example; as you get more data, you make a new prediction; you make repeated predictions about long-term outcomes (who the US will go to war against next), revising them step by step rather than jumping straight to the end

- can we ignore the step-by-step method and wait until the end of the game to see who wins (use a one-step method instead)? can we learn a one-step model and compose it? the answer is: you can't

- in other words: can we treat multiple steps as one big step, or do we have to treat it bit by bit?
- also: can we learn one-step predictions as a model, then iterate them to get a multi-step prediction? it is a trap to think it is enough to use a low-level model of the world as a simulation of the world; for example, to treat the world as an MDP and model the transition probabilities, or to treat the world as an engineering model where we just have to learn velocities and then integrate these low-level differential equations; short-term models plus iteration feel good because we think that if one step can be predicted perfectly, it can be predicted perfectly for as far as we want to look into the future

- two problems with the one-step approach: first, we can't do it perfectly; we always have an approximation, and when we try to iterate, errors propagate and compound (making the long-term prediction useless); second, it is exponentially complex as the number of choices increases, since the world is stochastic; it quickly becomes computationally intractable
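The compounding-of-errors point can be made concrete with a toy calculation (my own numbers, not from the lecture): a true one-step dynamics of `x' = 0.9 * x` against a learned model with a small one-step error, `x' = 0.92 * x`, iterated out to longer horizons.

```python
true_factor, model_factor = 0.9, 0.92  # ~2% one-step model error

for horizon in (1, 10, 50):
    truth = true_factor ** horizon       # true multi-step prediction
    pred = model_factor ** horizon       # iterated one-step model
    rel_err = abs(pred - truth) / truth  # relative error compounds
    print(f"horizon {horizon:2d}: relative error {rel_err:.0%}")
```

The roughly 2% per-step error grows multiplicatively, reaching about 200% relative error at horizon 50, which is the "useless long-term prediction" from the bullet above.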

The Monte Carlo method means stepping through states and actions all the way to a terminal state, finding out the return Gt (the total reward) and then updating the value of a state V(St).
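A minimal tabular sketch of the MC update (the function name and episode encoding are mine): wait for the episode to end, compute the return Gt for each visited state, and move V(St) toward it.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update.

    V: dict mapping state -> estimated value
    episode: list of (state, reward) pairs, where reward is the
             reward received after leaving that state.
    """
    G = 0.0
    # Walk backwards so G accumulates the (discounted) return.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])  # move estimate toward actual return
    return V
```

The key point: no update happens until the terminal state is reached, because the target Gt is the actual outcome, not an estimate.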

TD method: the state value V(St) is updated (backed up) using the immediate reward Rt+1 plus the discounted value of the next state, V(St+1).
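In tabular form this is the TD(0) update rule (a minimal sketch; the function name is mine): V(St) ← V(St) + α (Rt+1 + γ V(St+1) − V(St)).

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) backup: move V(s) toward the bootstrapped target
    r + gamma * V(s_next), i.e. an estimate built from an estimate."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```

Unlike the MC update, this fires after every single transition, which is why TD can learn without ever seeing the final outcome.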

The Dynamic Programming method only works for MDPs, i.e. environments with fully known transition probabilities, where the model is known.

Dynamic Programming does not consider a single line through the tree of possibilities like MC or TD; it considers all possibilities; in this example it considers both actions.

TD is the only method that both bootstraps (your target involves a guess, an existing estimate) and samples. MC doesn't bootstrap: it looks at actual returns all the way to the end, i.e. there are no estimates playing a role in the return. DP bootstraps: you use your estimates, which gradually get better.

MC and TD are learning methods — they sample, because they don’t know how the world works.

TD(0) uses a one-step estimate of the return, as opposed to MC's actual return over the whole series of steps (all the way to the terminal state). With TD(0) you can even learn without ever finding out the final outcome.

Both of these methods will converge; the only question is which is faster.

A useful tip for easily distinguishing between on-policy and off-policy learning: on-policy means there is essentially only one policy, the one that you are learning about.

On having to choose between TD and MC: if you use TD, you can parametrically vary λ (the height of your backups) to reach any intermediate point between one-step TD and MC.
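One standard way to realize this interpolation is tabular TD(λ) with accumulating eligibility traces (a minimal sketch under that assumption; the function name and episode encoding are mine): λ = 0 recovers one-step TD(0), and λ → 1 approaches Monte Carlo.

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Run TD(lambda) over one episode.

    transitions: list of (s, r, s_next) triples; s_next is None
                 on the terminal transition (terminal value is 0).
    """
    e = {s: 0.0 for s in V}  # eligibility trace per state
    for s, r, s_next in transitions:
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + gamma * v_next - V[s]  # one-step TD error
        e[s] += 1.0                        # accumulating trace
        for state in V:
            V[state] += alpha * delta * e[state]  # credit recent states
            e[state] *= gamma * lam               # traces decay
    return V
```

With `lam=0` every trace dies immediately after one step, so each backup is exactly the TD(0) update; larger λ spreads each TD error further back along the trajectory.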

Here is an example of how well one-step Q-learning scales out on 16 threads, even when compared to A3C:

On the other hand, we can't disregard policy gradient algorithms (REINFORCE), as well as methods that learn approximations of both the policy and the value function (actor-critic), since there is also strong research in that direction. These algorithms have interesting properties and are used in some papers that apply policy-based algorithms to trading and option pricing, claiming superior results compared to TD-based methods.