L5 Study instructions

Watch DS5. 

DS5 Slides Link.

DS5 Video Link.

Download the tinkering notebook.

Reading instructions:

This time we look at the control part of the reinforcement learning problem. As in the previous lecture, DS5 jumps between chapters in the textbook.

There are three main ideas used to extend the results from DS4 into model-free control methods:

  1. From q_π(s,a), the greedy (or an ε-greedy) policy can be computed without a model (see the sketch after this list).
  2. Hence, estimate q_π(s,a) instead of v_π(s).
  3. Use policy improvement.
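To make idea 1 concrete: greedy improvement from v_π would require the transition model p(s', r | s, a) to evaluate each action, whereas from q_π(s,a) an argmax over the stored action values is enough. A minimal sketch (the Q-table layout, function name, and toy values are illustrative assumptions, not taken from the course material):

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng):
    """Sample an action from the epsilon-greedy policy derived from Q.

    Greedy improvement from v_pi would need p(s', r | s, a) to evaluate
    each action; from q_pi(s, a) the argmax over stored action values
    is enough -- no model required.
    """
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniformly random action
    return int(np.argmax(Q[state]))           # exploit: greedy w.r.t. Q

# Example usage with a toy Q-table (purely illustrative):
rng = np.random.default_rng(0)
Q = {0: np.array([0.1, 0.5, -0.2])}
a = epsilon_greedy_action(Q, state=0, n_actions=3, epsilon=0.1, rng=rng)
```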

DS5.2 On-Policy Monte-Carlo Control: 

Read Chapter 5.2-5.4. In the book, ε-soft policies are considered. That is, the algorithms presented in the book keep ε fixed between episodes, and thus find the optimal ε-greedy policy asymptotically. In DS4 it is also discussed that if you let ε → 0 as the number of episodes goes to infinity, then it is possible to reach the optimal greedy policy asymptotically.
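Putting the three ideas together gives on-policy Monte-Carlo control with a fixed ε: estimate q_π by sample averages and improve the ε-greedy policy as the estimates change. The sketch below assumes a hypothetical episodic environment interface (reset() returning a state, step(action) returning (next_state, reward, done)); it is a rough illustration of Chapter 5.4, not the book's exact pseudocode:

```python
from collections import defaultdict
import numpy as np

def mc_control_epsilon_soft(env, n_actions, n_episodes,
                            gamma=1.0, epsilon=0.1, seed=0):
    """On-policy first-visit MC control with a fixed epsilon-greedy policy.

    The environment interface (reset() -> state,
    step(action) -> (next_state, reward, done)) is an assumption made
    for this sketch, not something specified by the course material.
    """
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))

    def policy(state):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))   # explore
        return int(np.argmax(Q[state]))           # exploit (greedy w.r.t. Q)

    for _ in range(n_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit incremental-mean updates, working backwards.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s, a) != (sp, ap) for sp, ap, _ in episode[:t]):
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q
```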

DS5.3 On-Policy Temporal-Difference Learning: 

SARSA: Chapter 6.4
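The heart of SARSA is the on-policy TD update for a single transition (S, A, R, S', A'), where A' is the action the current ε-greedy policy actually takes in S'. A hedged sketch, reusing the dictionary-of-arrays Q-table convention from the snippets above:

```python
# SARSA update for one transition (S, A, R, S', A').
# a_next is the action actually selected by the current policy in s_next,
# which is what makes the method on-policy.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```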

SARSA(λ): Function approximation case in Chapter 12.7. (See note in L4).
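As a reminder of what to look for in Chapter 12.7: with linear function approximation, semi-gradient SARSA(λ) with accumulating traces updates a weight vector w via an eligibility trace z and a TD error δ, roughly as below (textbook notation; this is a summary sketch, so check the chapter for the precise algorithm):

```latex
\begin{aligned}
\mathbf{z}_t &= \gamma \lambda\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \\
\delta_t &= R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t.
\end{aligned}
```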

DS5.4 Off-Policy Learning:

Importance sampling: Chapter 5.5
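The central quantity in Chapter 5.5 is the importance-sampling ratio between the target policy π and the behavior policy b; ordinary importance sampling then averages the reweighted returns over the visit time steps T(s) (in the book's notation):

```latex
\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}.
```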

Q-learning: Chapter 6.5
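Q-learning replaces the sampled next action in the SARSA target with a max over actions, so the target does not depend on the behavior policy; this connects directly to Q2 below. Sketch, same hypothetical Q-table convention as the SARSA snippet:

```python
import numpy as np

# Q-learning update for one transition (S, A, R, S').
# The target bootstraps from max_a Q(S', a), regardless of which action the
# behavior policy actually takes next -- this is why Q-learning is off-policy.
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done):
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```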

Study questions:

L5 Q1:  Relate the algorithm on slide 16 (out of 43) in DS5 to the algorithm on page 101 in the textbook.

L5 Q2:  What is the difference between on- and off-policy methods? Why is Q-learning an off-policy method?

L5 Q3: Do Exercise 6.9. First try to figure out the answer by yourself, then try it in the tinkering notebook.

L5 Q4: Think about Example 6.5. Do you understand why the two methods behave so differently? You can make use of the tinkering notebook to try the methods out.