Compiled Questions - no answers

Compiled questions from the study groups

  1. How is the reward function defined, or how do we construct it? (Suggestion: imagine that you want an agent to efficiently solve a concrete problem. How would you design the reward function?)
  2. For episodic tasks, from which range can the discount factor be chosen?
  3. For continual tasks, from which range can the discount factor be chosen?
  4. What is the difference between a Markov Chain and a Markov Process?
  5. What is the difference between planning, prediction, and control?
  6. In HW0 we used an “epsilon-greedy” policy to avoid missing good solutions, but in the examples in this module we always seem to obtain the optimal value by acting greedily. Why?
  7. Will policy iteration and value iteration converge to the same optimal values V*? Will the policies be the same?
  8. In Policy Iteration, do we have a new fixed policy at each iteration of the algorithm, until we find the optimal one?
  9. In Policy Iteration, what is the role of the action-value function (q) in this process? 
  10. Does adding a constant to all rewards affect the resulting policy? Why?
  11. Does scaling all rewards by a positive constant affect the resulting policy? Why?
  12. In the Bellman equation for MRPs, how should we understand the step
    E[ R_{t+1} + \gamma G_{t+1} | S_t = s ] = E[ R_{t+1} + \gamma v(S_{t+1}) | S_t = s ] ?
    (Hint: compare with (3.14); see also the reading aid after this list.)
  13. Can every process be understood as a Markov Process by interpreting the entire history of states as a (Markovian) state? What are the advantages and drawbacks of this approach?
  14. Why does policy iteration converge to the optimal policy?
  15. Why does value iteration converge to the optimal value function? It is still somewhat unintuitive that value iteration only needs a single (initial) step of policy update.
  16. What are the rates of convergence of value iteration and of policy iteration? (A side-by-side sketch of the two algorithms follows this list.)
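
A reading aid for question 12 (not an answer): the equation combines the recursive definition of the return with the definition of the state-value function, in the textbook's notation (assumed here). The LaTeX sketch below only restates those two definitions; which property of conditional expectation lets you replace G_{t+1} by v(S_{t+1}) is exactly the point to discuss.

% The two definitions the equation in question 12 combines (MRP setting):
G_t \doteq R_{t+1} + \gamma G_{t+1}          % recursive form of the return
v(s) \doteq E[\, G_t \mid S_t = s \,]        % state-value function of the MRP
% Substituting the first definition into the second gives the left-hand side;
% the step to discuss is why E[ G_{t+1} | S_t = s ] = E[ v(S_{t+1}) | S_t = s ].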
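
For questions 7, 8, and 14-16 (and extra question 5 below), a minimal side-by-side sketch of policy iteration and value iteration may help anchor the discussion. This is only an illustration, not the course's reference implementation: the tiny four-state chain MDP, its rewards, and the tolerances are invented here for the example. Structurally, the only difference is that policy iteration evaluates the current fixed policy to convergence before each greedy improvement, while value iteration applies a single Bellman-optimality backup per sweep.

import numpy as np

# Hypothetical 4-state chain MDP, invented only for illustration.
# Actions: 0 = left, 1 = right; moving right from state 2 into state 3 pays +1.
N_STATES, GAMMA = 4, 0.9
P = np.array([[0, 1], [0, 2], [1, 3], [3, 3]])   # P[s, a] = deterministic next state
R = np.zeros((N_STATES, 2))
R[2, 1] = 1.0                                    # the only nonzero reward

def policy_evaluation(pi, theta=1e-8):
    # Iteratively evaluate the current *fixed* deterministic policy pi.
    v = np.zeros(N_STATES)
    while True:
        v_new = np.array([R[s, pi[s]] + GAMMA * v[P[s, pi[s]]] for s in range(N_STATES)])
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def greedy(v):
    # Greedy policy improvement: q(s, a) = r(s, a) + gamma * v(s').
    return np.argmax(R + GAMMA * v[P], axis=1)

def policy_iteration():
    pi = np.zeros(N_STATES, dtype=int)
    while True:
        v = policy_evaluation(pi)     # full evaluation of the current policy ...
        pi_new = greedy(v)            # ... followed by one improvement step
        if np.array_equal(pi_new, pi):
            return v, pi
        pi = pi_new

def value_iteration(theta=1e-8):
    v = np.zeros(N_STATES)
    while True:
        v_new = np.max(R + GAMMA * v[P], axis=1)   # one Bellman-optimality backup per sweep
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, greedy(v_new)            # policy is extracted only at the end
        v = v_new

if __name__ == "__main__":
    print("policy iteration:", policy_iteration())
    print("value iteration: ", value_iteration())

On this toy MDP both routines return (approximately) the same optimal value function, roughly [0.81, 0.9, 1.0, 0.0], and the same greedy policy, which gives a concrete instance of question 7 to discuss.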



Extra questions, for joint discussion:

  1. How useful is DP in practice? It is claimed to be optimal if we have a perfect model, but are there any "robustness" guarantees, similar to those in control theory, when there is a model error?
  2. Is OpenAI Gym the de facto package for RL implementations, in the sense that it is what we "should" use for our own implementations?
  3. Is there a difference between deterministic value iteration and value iteration? Are they the same thing?
  4. What kind of guidelines do we need for designing the state space of an MDP?
  5. What is the difference between value iteration and policy iteration, in terms of the principle behind them, their computational complexity, and their convergence?
  6. We define the value (V) and quality (Q) functions with respect to a policy. On the one hand, this means that not every function from states to R has a policy associated with it; on the other hand, we can construct a policy from an arbitrary value function. Why is it beneficial to choose definitions of V and Q that tie each policy to exactly one V_pi and Q_pi?

 

Questions for Module 2

  1. The value function (v) and the action-value function (q) can, to some extent, be transformed into each other. Why don’t we just define one of them? Generalized policy iteration only needs the value function (v), so why do we need to know the action-value function (q)? (See the identities sketched below.)
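
A possible starting point for this discussion, written only to make the "can be transformed into each other" part concrete (standard identities, assuming the usual MDP notation with dynamics p(s', r | s, a)):

% Relations between the state-value and action-value functions of a policy pi:
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\, [\, r + \gamma\, v_\pi(s') \,]
% Note that recovering q_pi from v_pi requires the dynamics p, while the
% other direction only requires the policy; whether that asymmetry matters
% is part of the question.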

 

Questions for Module 3

  1. Is there an equivalent continuous representation of the RL theory? Consider, e.g., the gridworld, and imagine that we could move in any direction at any speed.





Resources:

Reward design

https://www.aaai.org/ocs/index.php/SSS/SSS14/paper/viewFile/7704/7740

 

Discount factor:

https://stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning

https://en.wikipedia.org/wiki/Q-learning#Discount_factor



Convergence of Value Iteration and Policy Iteration

http://chercheurs.lille.inria.fr/~ghavamza/RL-EC-Lille/Lecture4-b.pdf

https://editorialexpress.com/jrust/research/siam_dp_paper.pdf (Section 2, up until (2.6) and the following paragraph.)

http://www.cs.cmu.edu/afs/cs/academic/class/15780-s16/www/slides/mdps.pdf (Slides 18-20 and 23-24.)

 

Non-Markov to Markov

https://math.stackexchange.com/questions/2551142/can-any-state-from-a-stochastic-process-be-converted-into-a-markov-state?rq=1

 

A Tutorial on Reinforcement Learning I (time 08:00 -- 08:35).



Benchmarks in RL

https://arxiv.org/pdf/1907.02057.pdf (Introduction.)

https://arxiv.org/pdf/1912.12482.pdf (The introduction of Section 3.)

 

Robust Dynamic Programming

https://people.eecs.berkeley.edu/~elghaoui/pubs_rob_mdp.html