Reinforcement learning #
This is my summary of David Silver’s lectures. These are very information dense and may be worth a second watch to really get all the ideas (or read along and do some exercises).
Introduction to Reinforcement Learning #
- Science of decision making.
- Sits at the intersection of ML / OR / Neuroscience / Psychology / Economics.
- No supervisor, only a “reward signal”, and feedback (may be) delayed. Hence, time plays an important role. Data is not i.i.d., and the agent's actions affect the subsequent data it receives.
- Reward hypothesis: Any goal can be described by the maximisation of expected cumulative reward (reward \(R_t\) is a scalar feedback signal). An interesting discussion of the reward hypothesis here.
- At each step:
- The agent executes an action and receives an updated observation and the reward
- The environment receives the action, and emits the observation and the reward
- The history is the sequence of observations, actions and rewards to now. What happens next depends on the history: based on the history the agent selects actions, and the environment selects observations and rewards.
- Typically we look at the current state: a concise summary of the history, the information we will use to determine what happens next.
- Formally, state is a function of the history: \(S_t=f(H_t)\).
- Environment state. The data the environment uses to determine what the next observation/reward is. Not usually visible to the agent. A formalism to help us understand things but not usable (or particularly relevant) in RL algorithms.
- Agent state. The internal representation used by the agent to decide the next action.
- Information / Markov state. Contains “all useful” information from the history. A state \(S_t\) is Markov iff \[P(S_{t+1}\mid S_t) = P(S_{t+1}\mid S_1,\ldots,S_t).\] The future is independent of the past given the present. \(H_{1:t}\to S_t\to H_{t+1:\infty}\). The state is a sufficient statistic of the future.
- Markov state of the helicopter: position, velocity, angular velocities etc. Non-Markov state would be just position, so you need to “go back in time” and look at the history to determine the velocities.
- The environment state is Markov.
- The history \(H_t\) (considered as a state) is Markov.
- Fully observable environment. Agent state = environment state = information state. This is a Markov decision process.
- Partially observable environment.
- Most interesting examples.
- Partially observable Markov decision process.
- The agent must construct its own state representation.
- Simplest (quickly infeasible): \(S_t^a=H_t\).
- Bayesian: keep a vector of probabilities of what the environment state is.
- Recurrent neural network \(S_t^a=\sigma(S_{t-1}^a W_s+O_tW_o)\).
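A minimal NumPy sketch of that recurrent update \(S_t^a=\sigma(S_{t-1}^a W_s+O_tW_o)\); the dimensions, random weights, and sigmoid non-linearity are illustrative assumptions of mine, not anything from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensions for illustration: 8-dim agent state, 4-dim observation.
state_dim, obs_dim = 8, 4
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))  # recurrent weights
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))    # observation weights

def update_agent_state(s_prev, o_t):
    """One step of S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s = np.zeros(state_dim)                      # initial agent state
for o in rng.normal(size=(5, obs_dim)):      # five dummy observations
    s = update_agent_state(s, o)
```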
- Major components of an RL agent
- Policy: the behaviour function. A map from state to action \(a=\pi(s)\) (if deterministic), \(\pi(a\mid s)=P(A=a\mid S=s)\) (if stochastic).
- Value function: scoring states and actions. Prediction of future reward, used to evaluate goodness/badness of states. \[v_\pi(s)=\mathbb{E}_\pi(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots\mid S_t=s).\]
- Model: representation of the environment, predicts what the environment will do next. Not a requirement (many examples in the course will be model-free). Two parts:
- Transition model (predicts the next state) \(P(S'=s'\mid S=s,A=a)\)
- Reward model (predicts the next (immediate) reward) \(\mathbb{E}(R\mid S=s, A=a)\).
- Maze example.
- Rewards: -1 per time step.
- Actions: N, E, S, W
- States: Agent’s location
- Policy: what to do in each state (arrows on the blocks in the maze)
- Value function: distance from goal based on this policy
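A rough sketch of how these pieces might look in code, using a one-dimensional corridor as a stand-in for the maze; the corridor layout, the `step` function, and the east-heading policy are my own illustrative assumptions, not Silver's exact example.

```python
# Tiny corridor "maze": states 0..3, goal at state 3, actions E (right) and W (left).
# Reward is -1 per time step, as in the lecture's maze example.
GOAL = 3
states = [0, 1, 2, 3]
actions = ["E", "W"]

def step(s, a):
    """Deterministic transition model for the corridor (an assumption for illustration)."""
    if s == GOAL:
        return s, 0
    s_next = min(s + 1, GOAL) if a == "E" else max(s - 1, 0)
    return s_next, -1

# A policy maps each state to an action; here, always head east towards the goal.
policy = {s: "E" for s in states}

# Under this policy the value of a state is minus its distance from the goal.
value = {s: -(GOAL - s) for s in states}

# Rolling the policy out from state 0 accumulates -3, matching value[0].
s, ret = 0, 0
while s != GOAL:
    s, r = step(s, policy[s])
    ret += r
print(policy, value, ret)
```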
- Categorising RL agents
- Value based. A value function but no explicit policy (the policy is implicit: just pick actions greedily with respect to the value function).
- Policy based. Policy, no value function.
- Actor critic. Both a policy and a value function.
- Model free. Policy and/or value function, no model.
- Model based. Policy and/or value function, plus a model.
- Two fundamental problems in sequential decision making.
- Reinforcement learning. Environment is initially unknown. The agent interacts with the environment and improves its policy.
- Planning. The model of the environment is known and the agent performs computations with its model and improves its policy.
Markov Decision Processes #
Markov processes #
- State transition probability \(\mathcal{P}_{ss'}=P(S_{t+1}=s'\mid S_t=s)\). A Markov process (or chain) is a tuple \((\mathcal{S}, \mathcal{P})\): a finite set of states and the matrix of state transition probabilities.
- An episode is sampling from the chain from the start to a terminal state.
Markov reward processes #
- Add a reward function \(R_s=\mathbb{E}(R_{t+1}\mid S_t=s)\)
- Add a discount factor \(\gamma\in[0,1]\). The present value of future rewards. 0 is near-sighted, 1 is far-sighted. The value of receiving reward \(R\) after \(k+1\) timesteps is \(\gamma^k R\).
- The return \[ G_t=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^\infty \gamma^k R_{t+k+1}. \]
- We discount because there is more uncertainty in the future, but mainly to keep the maths easier (sums converge). Discounting is also a decent model of animal/human behaviour, which empirically shows a preference for immediate reward.
- If all sequences terminate it is possible to use undiscounted Markov reward processes.
- The value function \(v(s)\) gives the long-term value of state \(s\). Formally, \[ v(s) = \mathbb{E}(G_t\mid S_t=s), \] i.e. the value function is the expected return when starting from state \(s\).
- The value function can be decomposed into two parts:
- the immediate reward \(R_{t+1}\), and
- the discounted value of the successor state \(\gamma v(S_{t+1})\).
- The Bellman equation: \[ v(s)= \mathbb{E}(R_{t+1}+\gamma v(S_{t+1})\mid S_t=s). \] Essentially a tautology, describing the recursive structure of the return; a “one-step lookahead search”. Can be expressed concisely using matrices: \(\mathbf{v}=\mathcal{R}+\gamma\mathcal{P}\mathbf{v}\), where \(\mathcal{R}\) is the reward vector and \(\mathcal{P}\) is the transition matrix.
- This is a linear equation that can be solved directly: \[ \mathbf{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}. \]
- Complexity is \(O(n^3)\) for \(n\) states, so not usually tractable.
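A small NumPy sketch of that direct solve, \(\mathbf{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}\), for a made-up three-state Markov reward process; the transition matrix and rewards are arbitrary illustrative numbers.

```python
import numpy as np

# Arbitrary 3-state Markov reward process for illustration.
P = np.array([[0.5, 0.5, 0.0],   # transition probabilities P[s, s']
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([-1.0, -2.0, 0.0])  # expected immediate reward R_s
gamma = 0.9

# Solve the Bellman equation v = R + gamma * P v directly
# (np.linalg.solve avoids forming the explicit inverse).
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)  # long-term value of each state
```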
Markov decision processes #
- An MDP is a Markov reward process with decisions. We add a finite set of actions \(\mathcal{A}\).
- The state transition probability matrix \(\mathcal{P}\) now depends on which action we take: \[ \mathcal{P}^a_{ss'}=P(S_{t+1}=s'\mid S_t=s, A_t=a). \]
- The reward function also depends on the action: \(\mathcal{R}_s^a=\mathbb{E}(R_{t+1}\mid S_t=s, A_t=a)\).
- A policy \(\pi\) is a distribution over actions given states, \[ \pi(a\mid s)=P(A_t=a\mid S_t=s). \]
- A policy fully defines the behaviour of an agent.
- Policies depend on the current state only for an MDP (i.e. not the history).
- Maximise the reward from now onwards.
- For a given MDP and a policy \(\pi\), the state sequence \(S_1,S_2,\ldots\) is a Markov process.
- The state and reward sequence is a Markov reward process.
- The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\) and then following policy \(\pi\): \[ v_\pi(s)=\mathbb{E}_\pi(G_t\mid S_t=s). \]
- The action-value function \(q_\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\): \[ q_\pi(s,a)=\mathbb{E}_\pi(G_t\mid S_t=s, A_t=a). \]
- The state-value function and action-value function can again be decomposed into immediate reward plus the discounted value of the successor state, with all the conditions on policy and action: \begin{align*} v_\pi(s)&=\mathbb{E}_\pi(R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s)\\ q_\pi(s,a)&=\mathbb{E}_\pi(R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\mid S_t=s, A_t=a). \end{align*}
- There is a similar analogy here with the one-step lookahead search: \[ v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\Bigl(\mathcal{R}_s^a+\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}_{ss'}^a v_\pi(s')\Bigr) \] and we can write the same thing for action-values (the Bellman expectation equation).
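One way to make the Bellman expectation equation concrete: averaging the dynamics over a fixed policy \(\pi\) gives an induced MRP, which can then be solved linearly exactly as in the MRP section. A sketch with made-up two-state, two-action numbers; the array layout `P[a, s, s']`, `R[s, a]` is an assumption of mine for illustration, not lecture notation.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# Illustrative MDP dynamics: P[a, s, s'] and R[s, a] are made-up numbers.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])

# A stochastic policy pi(a | s); rows sum to 1.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Average over the policy to get the induced MRP: P_pi[s, s'] and R_pi[s].
P_pi = np.einsum("sa,asx->sx", pi, P)
R_pi = (pi * R).sum(axis=1)

# Bellman expectation equation in matrix form: v_pi = R_pi + gamma * P_pi v_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Action values via one-step lookahead: q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v_pi(s').
q_pi = R + gamma * np.einsum("asx,x->sa", P, v_pi)
print(v_pi, q_pi)
```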
The optimal value function #
- The optimal state-value function \(v_*(s)=\mathrm{max}_\pi v_\pi(s)\).
- The optimal action-value function \(q_*(s,a)=\mathrm{max}_\pi q_\pi(s,a)\). If you know this, you’re done. An MDP is solved when we know the optimal value function.
- We define a partial ordering over policies by \(\pi\geq \pi'\) if \(v_\pi(s)\geq v_{\pi'}(s)\) for all \(s\).
- Theorem. For any Markov decision process, there exists an optimal policy \(\pi_*\) that is better than or equal to all other policies. All optimal policies achieve the optimal value function \(v_{\pi_*}(s)=v_*(s)\) and the optimal action-value function \(q_{\pi_*}(s,a)=q_*(s,a)\).
- We can find an optimal policy by maximising over \(q_*(s,a)\): \[ \pi_*(a\mid s)=1, \quad\mbox{if } a=\mathrm{argmax}_{a\in\mathcal{A}}q_*(s,a) \] and \(0\) otherwise.
- The optimal value functions are recursively related by the Bellman optimality equations: \[ v_*(s)=\max_a q_*(s,a). \]
- Similarly for \(q_*\): \[ q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s'). \]
- And combining these: \[ v_*(s)=\max_a \Bigl(\mathcal{R}_s^a +\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}^a_{ss'} v_*(s')\Bigr). \]
- These equations are non-linear, with no closed-form solution. Usually we use iterative solution methods (value iteration, policy iteration, Q-learning, Sarsa).
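A toy illustration of the relationships above: given \(q_*\) as a table, \(v_*\) is its row-wise maximum and a deterministic optimal policy is the row-wise argmax. The numbers are made up.

```python
import numpy as np

# A made-up optimal action-value table q_*(s, a) for 3 states and 2 actions.
q_star = np.array([[1.0, 2.5],
                   [0.0, -1.0],
                   [3.0, 3.0]])

v_star = q_star.max(axis=1)      # v_*(s) = max_a q_*(s, a)
pi_star = q_star.argmax(axis=1)  # a deterministic optimal policy (ties broken arbitrarily)
print(v_star, pi_star)
```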
Planning by Dynamic Programming #
Introduction #
- Dynamic programming: cf. linear programming (a “programme” in the mathematical sense). “Dynamic” means the problem has a sequential or temporal component and is solved in steps.
- A method for solving complex problems by breaking down into subproblems and then combining the solutions to the subproblems into a solution for the full problem.
- Needs two properties in the problem:
- Optimal substructure (optimal solutions from subproblems can be built into an optimal solution for the full problem: principle of optimality, see shortest path).
- Overlapping subproblems (subproblems occur again and again, so solutions can be cached and reused).
- MDPs satisfy both of these properties: the Bellman equation gives us the recursive decomposition, and the value function acts as the cache, letting us store and reuse solutions.
- Dynamic programming is used for planning in an MDP, i.e. assuming full knowledge of the MDP.
- Two cases:
- Prediction: input MDP and policy, output value function.
- Control: input MDP, output optimal value function and policy.
- Used in many other types of algorithms (shortest path, Viterbi, lattice models, string algorithms).
Policy evaluation #
- You are given an MDP and a policy: how much reward will we get? (Evaluate the policy.)
- Iterative application of the Bellman expectation equation.
- \(v_1\to v_2\to \cdots \to v_\pi\). Start with an initial arbitrary value function, then using synchronous backups:
- at each iteration \(k+1\),
- for all states \(s\)
- update \(v_{k+1}(s)\) from \(v_k(s')\)
- where \(s'\) is a successor state of \(s\)
- Prove convergence later (contraction mapping theorem a.k.a Banach fixed-point theorem).
- Small gridworld example. Very cool.
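A sketch of iterative policy evaluation with synchronous backups, reusing the illustrative `P[a, s, s']`, `R[s, a]`, `pi[s, a]` array conventions from the earlier snippet; the tolerance-based stopping rule is my own choice.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup v_{k+1} = R_pi + gamma * P_pi v_k until convergence."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,asx->sx", pi, P)   # policy-averaged transitions
    R_pi = (pi * R).sum(axis=1)             # policy-averaged rewards
    v = np.zeros(n_states)                  # arbitrary initial value function
    while True:
        v_new = R_pi + gamma * P_pi @ v     # synchronous backup over all states
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```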
Policy iteration #
- Start with a given policy \(\pi\).
- Evaluate the policy using the above policy evaluation method.
- Improve the policy by acting greedily with respect to this value function.
- In the small gridworld example it only took one improvement step; in general it can take many, but this process always converges to the optimal policy \(\pi_*\).
- Relationship to expectation maximisation?
- Consider a deterministic policy \(a=\pi(s)\). We can improve the policy by acting greedily
\[ \pi'(s) = \mathrm{argmax}_{a\in\mathcal{A}} q_\pi(s, a). \]
- This does indeed improve the value from any state \(s\) over one step, since
\[ q_\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}}q_\pi(s, a)\geq q_\pi(s, \pi(s))=v_\pi(s). \]
- Therefore it improves the value function everywhere, \(v_{\pi'}(s)\geq v_\pi(s)\) (by a telescoping argument that repeatedly applies the one-step improvement).
- If improvements stop, i.e. we achieve equality, then the Bellman optimality equation has been satisfied by definition, and hence we have the optimal policy.
- Does policy evaluation actually need to converge to \(v_\pi\) before we improve the policy? We could instead use a stopping condition (e.g. stop after \(k\) iterations of iterative policy evaluation).
- In the extreme, do just one evaluation sweep (\(k=1\)) and then improve: this is value iteration.
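Putting evaluation and greedy improvement together, a sketch of policy iteration for a deterministic policy under the same illustrative array conventions; terminating when the greedy policy stops changing is one common choice.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve v_pi exactly for the current deterministic policy.
        P_pi = P[policy, np.arange(n_states), :]       # P_pi[s, s'] with a = policy[s]
        R_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q_pi.
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                           # no improvement possible: optimal
        policy = new_policy
```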
Value iteration #
- Any optimal policy can be subdivided into two components: an optimal first action \(A_*\) followed by an optimal policy from the successor state \(S'\) (principle of optimality).
- Formulated as a theorem: a policy \(\pi(a\mid s)\) achieves the optimal value from state \(s\) if and only if, for any state \(s'\) reachable from \(s\), \(\pi\) achieves the optimal value from state \(s'\) onwards.
- Use this to build a value iteration algorithm (“backwards induction”). If we know the solution to the subproblems \(v_*(s')\), then the solution \(v_*(s)\) can be found by a one-step lookahead: \[ v_*(s) \leftarrow \max_a \Bigl(\mathcal{R}_s^a + \gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s')\Bigr). \]
- The idea of value iteration is to apply these updates iteratively.
- Intuition: start with final rewards and work backwards. Even works with “loopy, stochastic” MDPs.
- \(v_1\to v_2\to \cdots \to v_*\). Start with an initial arbitrary value function, then using iterative application of Bellman optimality:
- at each iteration \(k+1\)
- for all states \(s\)
- update \(v_{k+1}(s)\) from \(v_k(s')\)
- Convergence to \(v_*\) will be proven later (Banach)
- Unlike policy iteration, there is no explicit policy: intermediate value functions may not be the value function of any real policy.
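A sketch of value iteration under the same illustrative conventions: repeated Bellman optimality backups, with a greedy policy extracted only at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                            # arbitrary initial value function
    while True:
        q = R + gamma * np.einsum("asx,x->sa", P, v)  # one-step lookahead for every (s, a)
        v_new = q.max(axis=1)                         # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=1)                         # greedy policy, extracted once at the end
    return v_new, policy
```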
| Problem | Bellman equation | Algorithm |
|---|---|---|
| Prediction | Bellman expectation equation | Iterative policy evaluation |
| Control | Bellman expectation equation + greedy policy improvement | Policy iteration |
| Control | Bellman optimality equation | Value iteration |
- Toy example of a larger Gridworld.
- These ideas are actually quite intuitive, the notation makes it all a little cumbersome but it’s quite natural once you think it through. Silver’s explanation of the above table is a very coherent summary of this complicated-feeling lecture.
- Can also do similar things with the action value function \(q\); these give rise to higher-complexity algorithms but they are definitely useful in the general RL problem.
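For completeness, a sketch of the same backup written over action values, under the same illustrative array conventions; the table now has one entry per state-action pair rather than per state, which is where the extra cost comes from.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate q_{k+1}(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' q_k(s',a')."""
    n_actions, n_states, _ = P.shape
    q = np.zeros((n_states, n_actions))
    while True:
        q_new = R + gamma * np.einsum("asx,x->sa", P, q.max(axis=1))
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
```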