L5 Study instructions
Watch DS5.
DS5 Slides Link.
DS5 Video Link.
Tinkering notebook.
Reading instructions:
This time we look at the control part of the reinforcement learning problem. As in the previous lecture, DS5 jumps between chapters in the textbook.
There are three main ideas used to extend the results from DS4 into model-free control methods:
- From qπ(s,a), the greedy (or an ε-greedy) policy can be computed without a model (see the sketch after this list).
- Hence, estimate qπ(s,a) instead of vπ(s).
- Use policy improvement.
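To make the first point concrete, here is a minimal sketch (not taken from the lecture or the book) of how an ε-greedy action can be chosen directly from estimated action values, with no transition model. The NumPy array layout of Q and the integer state/action indexing are assumptions made purely for illustration.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Choose an action from the epsilon-greedy policy induced by Q.

    Only a table of action-value estimates Q[state, action] is needed;
    acting greedily on v(s) instead would require a model of the
    transition probabilities to look one step ahead.
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: current greedy action

# Purely illustrative usage with a small random Q-table:
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))                   # 5 states, 3 actions
a = epsilon_greedy_action(Q, state=2, epsilon=0.1, rng=rng)
```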
DS5.2 On-Policy Monte-Carlo Control:
Read Chapter 5.2-5.4. In the book, ε-soft policies are considered. That is, the algorithms presented in the book keep ε fixed between episodes, and thus find the optimal ε-greedy policy asymptotically. In DS4 it is also discussed that if you let ε→0 as the number of episodes goes to infinity, then it is possible to reach the optimal greedy policy asymptotically.
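As a complement to the reading, the following is a rough sketch of on-policy Monte-Carlo control with a fixed ε. It is an every-visit variant (simpler than the first-visit algorithm in Chapter 5.4), and the Gymnasium-style env.reset()/env.step() interface and hashable states are assumptions, not something the course material prescribes.

```python
import numpy as np
from collections import defaultdict

def mc_control_epsilon_soft(env, n_episodes, epsilon=0.1, gamma=1.0):
    """On-policy every-visit Monte-Carlo control with a fixed epsilon.

    With epsilon held fixed, the method converges towards the best
    epsilon-greedy policy rather than the optimal greedy policy.
    """
    rng = np.random.default_rng()
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        # Generate one episode following the current epsilon-greedy policy.
        episode = []
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            episode.append((s, a, r))
            done = terminated or truncated
            s = s_next

        # Backward pass: accumulate returns and update Q by sample averages.
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q
```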
DS5.3 On-Policy Temporal-Difference Learning:
SARSA: Chapter 6.4
SARSA(λ): Function approximation case in Chapter 12.7. (See note in L4).
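For the tabular case of Chapter 6.4, a minimal SARSA sketch could look like the following. As above, the Gymnasium-style environment interface is an assumption and the hyperparameter values are placeholders.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular SARSA (on-policy TD control), roughly along the lines of Chapter 6.4."""
    rng = np.random.default_rng()
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, _ = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = policy(s_next)
            # SARSA target uses the action actually taken next (on-policy).
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```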
DS5.4 Off-Policy Learning:
Importance sampling: Chapter 5.5
Q-learning: Chapter 6.5
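To see why Q-learning is off-policy, compare its update target with SARSA's: it bootstraps from the greedy action in the next state, independent of the action the behaviour policy (e.g. ε-greedy) actually takes there. Below is a minimal sketch of a single tabular update; the dict-of-arrays representation of Q is an assumption for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """One tabular Q-learning update (off-policy TD control, Chapter 6.5).

    The target uses max over actions in the next state, i.e. the greedy
    policy, regardless of which action is actually taken next. Contrast
    this with the SARSA target in the sketch above.
    Q is assumed to map each state to a NumPy array of action values.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```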
Study questions:
L5 Q1: Relate the algorithm on slide 16 (out of 43) in DS5 with the algorithm on page 101 in the textbook.
L5 Q2: What is the difference between on- and off-policy methods? Why is Q-learning an off-policy method?
L5 Q3: Do Exercise 6.9. First try to figure out the answer by yourself, then try it in the tinkering notebook.
L5 Q4: Think about Example 6.5. Do you understand why the two methods behave so differently? You can make use of the tinkering notebook to try the methods out.