Pieter Abbeel is one of the world's leading RL researchers. He has recently made much of the material he teaches in his classes at Berkeley freely available online. Below is an excerpt of my notes from his Deep RL Bootcamp.
1 - Motivation + Overview + Exact Solution Methods
- RL Algorithms Landscape
    - Dynamic Programming (DP)
        - Value Iteration -> Q-Learning
        - Policy Iteration -> Actor-Critic Methods
    - Policy Optimization
        - DFO/Evolution
        - Policy Gradients -> Actor-Critic Methods
 - MDP - Agent observes the world, chooses an action, the environment changes, the agent may observe a reward and change state
    
- Set of states S
 - Set of Actions A
 - Transition function P(s’ | s,a)
 - Reward function R(s, a, s’)
 - Start state S_0
 - Discount factor g
 - Horizon H
 
 - Optimal Control: Given an MDP (S,A,P,R,g,H) find the optimal policy pi*
 - Exact methods: Value iteration and policy iteration
 - Value functions
    
- V*(s) = expected sum of discounted rewards when starting from s and acting optimally
 - V*_k(s) = optimal value for state s when the horizon is k (k steps left)
 
 - Value Iteration
    
- Algorithm (value update, also called Bellman update/backup); a NumPy sketch appears at the end of this Value Iteration section
        
- Start with V*_0(s) = 0 for all s
 - for all k in 1…H: for all s in S:
            
- V*_k(s) = max{a} sum{s'} P(s'|s,a) (R(s,a,s') + g*V*_{k-1}(s'))
 - pi*_k(s) = argmax of same expression
 
 
- For a finite state space and a discount factor < 1, value iteration converges
        
- For the infinite-horizon case, simply run VI until it converges (the Bellman update is a contraction); this yields V* and hence pi*, which determines optimal behavior
 - V*_H -> V* as H -> infinity
 
 - Note: optimal policy is stationary (optimal action at state s is always the same)
 - Q*(s,a) - much like V*, but: from s, take action a, and from then on act optimally
        
- Q*(s,a) = sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q*(s',a'))
 - Q_{k+1}(s,a) = sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q_k(s',a'))
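A minimal NumPy sketch of the tabular value iteration backup above, on a small made-up MDP (the transition tensor P and reward tensor R below are random placeholders, not from the lecture):

```python
import numpy as np

# Tabular value iteration on a placeholder 3-state, 2-action MDP.
# P[a, s, s'] are transition probabilities, R[a, s, s'] are rewards (both invented).
nS, nA, gamma, H = 3, 2, 0.9, 100
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # each P[a, s, :] sums to 1
R = rng.normal(size=(nA, nS, nS))

V = np.zeros(nS)
for k in range(H):
    # Q[a, s] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
    V = Q.max(axis=0)            # Bellman backup: max over actions
pi_star = Q.argmax(axis=0)       # greedy policy w.r.t. the converged values
print("V* ~", V, "pi* =", pi_star)
```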
 
 
        
 - Policy Methods
    
- Policy Evaluation: how to compute V^pi, the value (goodness) of a fixed policy pi
        
- We can use value iteration formulation, with fixed policy
 
- Value iteration: V_k(s) = max{a} sum{s'} P(s'|s,a) (R(s,a,s') + g*V_{k-1}(s'))
 - Policy evaluation: V^pi_k(s) = sum{s'} P(s'|s,pi(s)) (R(s,pi(s),s') + g*V^pi_{k-1}(s'))
 - At convergence
        
- the LHS and RHS values are the same (V^pi is a fixed point of its Bellman equation)
 
 
        
 - Policy Iteration (repeat the following 2 steps until pi_{k+1} = pi_k)
    
        
- Policy evaluation for current policy pi_k (iterate until convergence)
            
- V^{pi_k}_{i+1}(s) <- …
 
 
            
        
- Policy improvement
            
- pi_{k+1}(s) <- argmax[a] sum[s’] P(s’|s,a)[R(s,a,s’) + g*V^{pi_k}(s’)]
 
 
            
 - Computation methods
        
- Option 1: iterate the modified (fixed-policy) Bellman updates until convergence
 - Option 2: because we dropped the max over a, V^pi is the solution of a linear system of equations and can be solved exactly (see the sketch below)
            
- Variables: V^{pi}(s)
 - Constants: P, R
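A sketch of exact policy evaluation by solving that linear system with NumPy (the MDP arrays and the policy below are placeholders):

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P^pi) V^pi = R^pi.
# P[a, s, s'] and R[a, s, s'] are placeholder arrays; pi maps each state to an action.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))
pi = np.array([0, 1, 0])                       # an arbitrary deterministic policy

P_pi = P[pi, np.arange(nS), :]                 # P^pi[s, s'] = P(s' | s, pi(s))
R_pi = np.einsum('sn,sn->s', P_pi, R[pi, np.arange(nS), :])   # expected one-step reward under pi
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print("V^pi =", V_pi)
```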
 
 
 
        
 
2 - Sampling-based Approximations and Fitted Learning
- Optimal control: Given MDP (S, A, P, R, g, H), find optimal policy pi* maximizing sum of expected rewards
    
- Exact methods (value and policy iteration) guaranteed to converge for finite state/action space and when we have perfect access to environment
 - Assumptions of exact methods -> mitigations:
        
- Update equations require access to dynamics model -> Sampling-based approximations
 - Iteration over / Storage for all states and actions -> Q/V function fitting
 
 
 - Q*(s,a) means: from state s, take action a, and from then on act optimally
 - Q-value iteration
    
- Q_{k+1}(s,a) <- sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q_k(s',a'))
 
 - Q-value iteration with sampling
    
- Q_{k+1}(s,a) <- E{s’~P(s’|s,a)}(R(s,a,s’) + g*max{a’}(Q_k(s’,a’)))
 - For (s,a)
        
- receive s’ ~ P(s’|s,a)
 - consider target(s’) = R(s,a,s’) + g*max[a’]Q_k(s’,a’)
 - Q_{k+1}(s,a) <- (1-alpha)Q_k(s,a) + alpha(target(s’))
 
 
 - Tabular Q-Learning
    
- start with Q_0(s,a) = 0 for all s,a
 - for k = 1… until convergence
        
- sample action a (via some exploration method), get s' ~ P(s'|s,a) and reward R(s,a,s')
 - if s’ is terminal
            
- target = R(s,a,s’)
 - sample new initial state s’
 
 - else
            
- target = R(s,a,s’) + g*max[a’]Q_k(s’,a’)
 
 - Q_{k+1}(s,a) <- (1-alpha)Q_k(s,a) + alpha*target
 - s <- s’
 
 
 - How to sample actions
    
- e-greedy (epsilon-greedy) - choose a random action with probability epsilon, otherwise choose the greedy action; a tabular Q-learning sketch using epsilon-greedy follows below
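A minimal sketch of the tabular Q-learning loop above with epsilon-greedy exploration, run on the same kind of placeholder random MDP (no terminal states, so the terminal branch is omitted):

```python
import numpy as np

# Tabular Q-learning with epsilon-greedy exploration on a placeholder random MDP
# (invented P and R tensors; not from the lecture).
nS, nA, gamma, alpha, eps = 3, 2, 0.9, 0.1, 0.1
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

Q = np.zeros((nS, nA))
s = 0
for step in range(50_000):
    a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())   # epsilon-greedy
    s_next = rng.choice(nS, p=P[a, s])                # sample s' ~ P(.|s, a)
    r = R[a, s, s_next]
    target = r + gamma * Q[s_next].max()              # this toy MDP has no terminal states
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # running-average update
    s = s_next
print("Greedy policy:", Q.argmax(axis=1))
```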
 
 - Q-learning (an example of off-policy learning) converges to optimal policy
    
- If you explore enough
        
- All states and actions are visited infinitely often
 
 - learning rate needs to decrease (but not too quickly)
        
- sum{t} alpha_t(s,a) = infinity and sum{t} alpha_t(s,a)^2 < infinity
 
 
        
 - Can sampling work for value iteration? No because of the max over a
 - Can Policy iteration work with sampling?
    
- Policy evaluation - yes - called Temporal Difference
        
- V^{pi_k}_{i+1}(s) <- E{s'~P(s'|s,pi_k(s))}(R(s,pi_k(s),s') + g*V^{pi_k}_i(s'))
 
 - Policy improvement
        
- Not easy - because of argmax over a
 
 
        
 - Can tabular methods scale - no, state space gets too big (10^10)
 - Problem
    
- We keep a table of all Q-values
 - We can't (and don't need to) do this at scale: the table is huge, and we rarely visit the exact same state twice, so the experience doesn't transfer
 
 - Solution - generalize - learn about small number of training states, generalize that experience to new, similar situations
 - Approximate Q-learning
    
- Instead of a table, use a parameterized Q-function Q_theta(s,a)
 - Can be linear in features: Q_theta(s,a) = theta_0*f_0(s,a) + … + theta_n*f_n(s,a)
 
 - Learning rule:
    
- target(s') = R(s,a,s') + g*max{a'} Q_{theta_k}(s',a')
 - theta_{k+1} <- theta_k - alpha * grad_theta[(Q_theta(s,a) - target(s'))^2] evaluated at theta = theta_k
 
 - The features can be hand-engineered, or the whole Q-function can be learned with deep learning; a linear-feature sketch follows below
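A sketch of the approximate Q-learning update with a linear Q-function; the feature map and the example transition below are hypothetical placeholders:

```python
import numpy as np

# Approximate Q-learning with a linear Q-function: Q_theta(s, a) = theta . f(s, a).
n_features, gamma, alpha = 4, 0.99, 0.01
theta = np.zeros(n_features)

def features(s, a):
    # Hypothetical hand-made feature map f(s, a); in practice these are engineered or learned.
    return np.array([1.0, s, a, s * a], dtype=float)

def q(theta, s, a):
    return theta @ features(s, a)

def update(theta, s, a, r, s_next, actions):
    # target(s') = r + gamma * max_a' Q_theta(s', a'), held fixed during the gradient step
    target = r + gamma * max(q(theta, s_next, ap) for ap in actions)
    td_error = q(theta, s, a) - target
    # grad_theta (Q_theta(s,a) - target)^2 = 2 * td_error * f(s, a)
    return theta - alpha * 2.0 * td_error * features(s, a)

# Example update for one invented transition (s=0.5, a=1, r=1.0, s'=0.7):
theta = update(theta, 0.5, 1, 1.0, 0.7, actions=[0, 1])
print(theta)
```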
 
3 - Deep Q-Networks
- Q-Learning
    
- Agent collects experiences online
 - target(s') = R(s,a,s') + g*max{a'} Q_{theta_k}(s',a')
 - update: theta_{k+1} <- theta_k - alpha * grad_theta E_{s'~P(s'|s,a)}[(Q_theta(s,a) - target(s'))^2] | theta=theta_k
 - For a tabular function: Q_{k+1}(s,a) <- (1-alpha)*Q_k(s,a) + alpha*target(s')
 - Does it work with neural network Q functions? Yes but with some care
 
 - Problems
    
- (non-stationary targets) Neural networks generalize, so updating Q(s,a) for one (s,a) also shifts the values of other (s,a) pairs, including the ones used to compute targets
 - Samples within a trajectory are correlated (a non-stationary, non-i.i.d. distribution for gradient descent)
 
 - DQN
    
- Make Q-learning look like supervised learning
 - Experience replay - take an action, get the reward, but instead of applying the update right away, commit the transition to a replay buffer; when it comes time to learn, sample a mini-batch (a one-step update sketch follows the algorithm below)
 
 - DQN Algorithm
    
- Init replay memory D
 - Init action-value function Q with random weights th
 - Init target action-value function Q-hat with weights th-hat = th
 - for episode = 1, M
        
- Init sequence s_1= {x_1} and preprocessed sequence phi_1=phi(s_1)
 - For t=1,T
            
- Epsilon-greedy pick of a_t: a random action with probability epsilon, otherwise argmax[a] Q(phi(s_t),a;th)
 - Execute action a_t and observe reward r_t, image x_{t+1}
 - Set s_{t+1} = s_t,a_t,x_{t+1} and preprocess phi_{t+1} = phi(s_{t+1})
 - Store transition (phi_t, a_t, r_t, phi_{t+1}) in D
 - Sample random minibatch of transitions (phi_j,a_j,r_j,phi_{j+1}) from D
 - Set y_j
                
- r_j if episode terminates at step j+1
 - r_j + g*max[a’]Q-hat(phi_{j+1},a’;th-hat) otherwise
 
 - Perform grad descent on (y_j - Q(phi_j,a_j;th))^2 w.r.t th
 - Every C steps reset Q-hat = Q
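A minimal sketch of one DQN update step with a separate target network, assuming PyTorch; the network sizes, hyperparameters, and the fake minibatch are placeholders, not the bootcamp's implementation:

```python
import torch
import torch.nn as nn

# One DQN update step: online network Q, frozen target network Q-hat.
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())           # Q-hat starts equal to Q
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_step(obs, actions, rewards, next_obs, done):
    with torch.no_grad():                                 # targets use the frozen network
        max_next_q = target_net(next_obs).max(dim=1).values
        y = rewards + gamma * (1.0 - done) * max_next_q   # y = r if terminal, else r + g*max Q-hat
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake minibatch, standing in for a sample from the replay buffer:
B = 32
dqn_step(torch.randn(B, obs_dim), torch.randint(n_actions, (B,)),
         torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B))
# Every C updates: target_net.load_state_dict(q_net.state_dict())
```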
 
 
 
 - DQN Details
    
- RMSProp instead of vanilla SGD (optimization in RL really matters, since it determines what data you see next)
 - Anneal the exploration rate - epsilon starts at 1 -> to 0.1 -> 0.05
 - Value-based methods are more robust than policy gradient methods
- Double DQN - use the online network for selecting the best action but the target network for evaluating it (separate the argmax from the evaluation); see the sketch after this list
 - Prioritized experience replay - replay transitions in proportion to absolute Bellman error (revisit and update the state/action pairs you got most wrong)
 - Dueling DQN
        
- Value-advantage decomposition of Q: Q^{pi}(s,a) = V^{pi}(s) + A^{pi}(s,a)
 - Dueling DQN: Q(s,a) = V(s) + A(s,a) - (1/|A|) * sum{a'=1..|A|} A(s,a')
 
 - Noisy Nets - Beyond epsilon-greedy - Add noise to network parameters for better exploration
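A small NumPy sketch contrasting the DQN and Double DQN targets, plus the dueling aggregation; all the Q-values and rewards below are invented numbers for illustration:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0])
q_online_next = np.array([[1.0, 3.0], [2.0, 0.5]])   # Q(s', .) from the online network
q_target_next = np.array([[2.5, 1.0], [1.5, 0.7]])   # Q-hat(s', .) from the target network

# DQN: max over a' using the target network alone
dqn_target = rewards + gamma * q_target_next.max(axis=1)
# Double DQN: online net *selects* the action, target net *evaluates* it
a_star = q_online_next.argmax(axis=1)
ddqn_target = rewards + gamma * q_target_next[np.arange(2), a_star]

# Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
V = np.array([[0.5], [1.0]])                         # value stream output, shape (batch, 1)
A = np.array([[0.2, -0.2], [1.0, 3.0]])              # advantage stream output
Q_dueling = V + A - A.mean(axis=1, keepdims=True)
print(dqn_target, ddqn_target, Q_dueling, sep="\n")
```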
 
 
4a - Policy Gradients and Actor Critic
- Policy optimization: pi_th(u|s)
    
- So far, we've been trying to find V or Q and derive pi from them - now we optimize the policy directly
 - theta no longer parameterizes the Q-function but the NN for the policy
 - Stochastic policy class to smooth out the optimization problem
 - The policy is often simpler than V or Q, and V/Q don't directly give you the action anyway (you still need an argmax)
 - PO is more compatible with rich architectures than DP but can’t do off-policy learning (collect data with a different policy) and is less sample-efficient
 
 - Likelihood Ratio Policy Gradient
    
- trajectory tau: s_0,u_0,..s_H,u_H
 - Reward of trajectory: R(tau) = sum{t=0:H} R(s_t,u_t)
 - Utility of the policy parameters: U(theta) = E[sum{t=0:H} R(s_t,u_t); pi_theta]
        
- = sum{tau} P(tau; theta) * R(tau)
 
 - Goal is to find theta: max{theta} U(theta) = max{theta} sum{tau} P(tau; theta) * R(tau)
 
 - grad_theta U(theta) ~ (1/m) sum{i=1:m} grad_theta log P(tau_i; theta) * R(tau_i)
    
- Valid even when R is discontinuous and/or unknown
 - Increases the probability of paths with positive R and decreases it for paths with negative R
 - To reduce noise, subtract a baseline b and use (R(tau_i) - b): this reduces variance while keeping the estimate unbiased
 
 - Likelihood Ratio and Temporal Structure
    
- (1/m) sum{i=1:m} sum{t=0:H-1} grad_theta log pi_theta(u_t^i|s_t^i) * ( sum{k=t:H-1} R(s_k^i,u_k^i) - b(s_t^i) )
 - b(s_t) = E[r_t + r_{t+1} + ... + r_{H-1}] = V^pi(s_t)
 
 - Algorithm
 - initialize policy parameter th, baseline b
 - for iteration=1,2..
    
- Collect a set of trajectories by executing the current policy
 - At each timestep in each trajectory compute
        
- the return R_t = sum{t'=t:T-1} g^{t'-t} * r_{t'}
 - advantage estimate A_t = R_t - b(s_t)
 
 - Refit the baseline: minimize ||b(s_t) - R_t||^2 summed over all trajectories and timesteps
 - Update the policy using a policy gradient estimate g-hat, which is a sum of terms grad_theta log pi(a_t|s_t,theta) * A-hat_t (a sketch of this computation follows below)
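A minimal NumPy sketch of one such update for a linear softmax policy on a single invented trajectory (the features, actions, rewards, and the mean-return baseline are placeholders; in practice the baseline is a fitted value function):

```python
import numpy as np

# Vanilla policy gradient step for pi_theta(a|s) = softmax(theta^T f(s)).
gamma = 0.99
n_features, n_actions = 3, 2
theta = np.zeros((n_features, n_actions))
feats = np.array([[1.0, 0.2, -1.0], [1.0, -0.5, 0.3], [1.0, 0.1, 0.8]])  # f(s_t) per step
actions = np.array([0, 1, 1])
rewards = np.array([1.0, 0.0, 2.0])

# Returns R_t = sum_{t' >= t} gamma^(t'-t) * r_t'
T = len(rewards)
returns = np.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    returns[t] = running

baseline = returns.mean()              # crude baseline standing in for a fitted b(s_t)
advantages = returns - baseline

# Policy gradient: sum_t grad_theta log pi(a_t|s_t) * A_t
grad = np.zeros_like(theta)
for t in range(T):
    logits = feats[t] @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    dlogp = -np.outer(feats[t], probs)             # grad of log-softmax w.r.t. theta, all actions
    dlogp[:, actions[t]] += feats[t]               # plus the chosen-action term
    grad += dlogp * advantages[t]
theta += 0.01 * grad                                # gradient *ascent* on expected return
print(theta)
```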
 
 - Instead of the raw sum of rewards we can use a Q-value estimate (which accounts for the rewards at later steps as well)
 - Q^{pi,g}(s,u) = E[r_0 + g*r_1 + ... | s_0=s, u_0=u]
    
- = E[r_0 + g*V^pi(s_1) | s_0=s, u_0=u]
 - Asynchronous Advantage Actor-Critic (A3C) uses a k=5 step lookahead
 - Generalized Advantage Estimation (GAE) takes an exponentially weighted average over all k-step estimates (see the sketch below)
 - Similar to TD(lambda)
 
 - Actor-Critic: policy gradient + a learned value function used for the baseline / generalized advantage estimate
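A short sketch of the GAE computation from TD residuals (gamma, lambda, and the rewards/values below are placeholder numbers):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation. values has length len(rewards) + 1
    (the last entry bootstraps the value of the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        running = delta + gamma * lam * running                  # A_t = delta_t + (g*lam)*A_{t+1}
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 0.5])
values = np.array([0.8, 0.6, 0.4, 0.0])    # V(s_0..s_3); placeholder critic outputs
print(gae(rewards, values))
```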
 
4b - Policy Gradients Revisited
- Supervised learning: maximize sum{i} log p(y_i|x_i)
 - RL:
    
- y_i ~ p(.|x_i)
 - maximize sum{i} A_i * log p(y_i|x_i)
        
- A_i should be discounted (somehow - exponential may not be the best idea)
 
 
 - Code
 
5 - Natural Policy Gradients, TRPO, PPO
- Vanilla Policy Gradient Methods
    
- Get a policy gradient estimate and plug it into SGD
 - Lots of ways to get better advantage estimate
 - But once you get it, how do you update the policy?
 - Issues:
        
- Hard to choose stepsize
            
- The input data in RL is non-stationary: the observation and reward distributions change as the policy changes
 - A bad step is more damaging than in supervised learning, since it affects the future visitation distribution
 
 - Sample efficiency
            
- Only one gradient step per env sample
 
 
            
 - Much of modern ML is about reducing learning to a numerical optimization problem
        
- Supervised learning: minimize training error
 
- Q-learning can reuse all transitions seen so far, but it optimizes the wrong objective (Bellman error, as opposed to the performance of the policy)
 - Policy gradients do optimize policy performance, but there is no explicit optimization problem being solved (old data is not reused)
 - Instead of just following the gradient per observation, let’s solve an optimization problem
 
 - What Loss to optimize?
    
- Policy gradient
        
- g-hat = E-hat_t[ grad_theta log pi_theta(a_t|s_t) * A-hat_t ]
 
 - Can differentiate the following loss
        
- L^{PG}(theta) = E-hat_t[ log pi_theta(a_t|s_t) * A-hat_t ]
 - but don’t want to optimize it too far (advantage is noisy)
 
 - Equivalently differentiate
        
- L^{IS}_{theta_old}(theta) = E-hat_t[ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A-hat_t ]
 - IS = Importance sampling - What would the expectation be under distribution A but when I collected samples from distribution B
 - We want to choose a policy pi_th such that the expected advantage under that policy is high (but collecting samples from old policy)
            
- But this doesn’t solve the problem that steps need to be small and we don’t want to optimize the objective fully
 
 
 - Trust Region Policy Optimization
        
- I have some function I want to optimize, and a local approximation to it (which becomes very inaccurate outside the trust region). Maximize this local approximation subject to the constraint that we don't move too far from the current point. The distance could be, for example, the Euclidean distance between the old and new parameters; in our case it is the KL divergence between the old and new policies' action distributions. We don't care about distance in parameter space; we care about the distributions the parameters define.
 - We could also do a penalized optimization (as opposed to constrained)
 
 
        
 - TRPO algorithm
    
- for iteration=1,2,..
        
- run the current policy for T timesteps or N trajectories
 - Estimate advantage function for all timesteps for all trajectories
            
- Maximize surrogate loss subject to constraint
 - maximize{theta} sum{n} ( pi_theta(a_n|s_n) / pi_theta_old(a_n|s_n) * A-hat_n ) subject to KL_{pi_theta_old}(pi_theta) <= delta
 
 
 - We can solve this approximately but efficiently using conjugate gradient
 
        
 - Solving KL penalized problem (non-linear optimization)
    
- Goal: Maximize[th] L_pi_th_old(pi_th) - b*KL_pi_th_old(pi_th)
 - Repeatedly make an approximation to it and solve the subproblem
        
- Linear approx. to L and quadratic to KL
 
 - This is called the natural policy gradient
- Cheap gradient principle - computing the gradient of a function costs only a small multiple of computing the function itself - but forming the Hessian of L explicitly is expensive, roughly #params times the cost of a gradient
 - Use conjugate gradients to do Hessian-free optimization (only Hessian-vector products are needed)
 
 - Summary
    
- Wanted to write down an optimization problem to get a more sample-efficient and robust policy update
        
- Suggested optimizing a surrogate loss L^PG or L^IS - not good on its own because it doesn't constrain the size of the update
 - Use KL divergence to constrain the size of the update - but that gives a constrained optimization problem
 - Corresponds to a natural gradient step F^-1*g under a linear approximation of the objective and a quadratic approximation of the KL
 - can solve for this step approximately using conjugate gradient method
 
 
        
 - Proximal Policy Optimization: KL Penalty Version
    
- Useful if you don't want to implement the conjugate gradient method
 - Use penalty instead of constraint
 - max{theta} sum{n=1:N} ( pi_theta(a_n|s_n) / pi_theta_old(a_n|s_n) * A-hat_n ) - C * KL_{pi_theta_old}(pi_theta)
 - Algo:
 - For iteration=1,2,…
        
- Run policy for T timesteps or N trajectories
 - Estimate advantage function at all timesteps
 - Do SGD on above objective for some number of epochs
- If the KL is too high, increase b; if the KL is too low, decrease b (see the sketch after this list)
 
 - Treat the target KL divergence as the hyperparameter - the penalty coefficient b is adapted to keep each update tight
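A tiny sketch of that adaptive penalty coefficient; the 1.5x tolerance band and the factor of 2 are common placeholder choices, not values given in the lecture:

```python
def adapt_kl_coef(beta, measured_kl, target_kl, factor=2.0):
    """Increase the penalty if the measured KL overshoots the target, decrease it if it undershoots."""
    if measured_kl > 1.5 * target_kl:
        beta *= factor
    elif measured_kl < target_kl / 1.5:
        beta /= factor
    return beta

beta = 1.0
beta = adapt_kl_coef(beta, measured_kl=0.05, target_kl=0.01)   # KL too high -> beta doubles
print(beta)
```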
 
 - Connection between trust region problem and other things
    
- Linear-quadratic approximation + penalty -> natural gradient
 - No constraint -> policy iteration
 - Euclidean penalty instead of KL -> vanilla policy gradient
 
 - Limitation of TRPO
    
- Hard to use with architectures that have multiple outputs (e.g., a shared network for policy and value), and empirically it performs poorly on tasks requiring deep CNNs and RNNs
 - The conjugate gradient step makes the implementation more complicated
 
 - Alternate to TRPO
    
- KFAC - Kronecker-factored approximate Fisher
        
- Do a blockwise approximation to the Fisher information matrix
 - Combined with A2C, it gives very good results on Atari
 
 - Clipping objective
        
- form a lower bound on the surrogate objective via clipped importance ratios (see the PPO-clip sketch below)
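A minimal NumPy sketch of that clipped-ratio objective (the clip parameter eps=0.2 and the example numbers are placeholders):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                       # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # take the minimum so the objective never rewards pushing the ratio far outside [1-eps, 1+eps]
    return np.minimum(ratio * advantages, clipped * advantages).mean()

logp_old = np.log(np.array([0.5, 0.2, 0.8]))
logp_new = np.log(np.array([0.6, 0.1, 0.9]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```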
 
 
        
 
6 - Nuts and bolts of Deep RL
- How to make the learning problem easier
    
- Use a smaller problem at first
 - Feature engineering
 - Shape the reward function
 
 - POMDP Design
    
- Visualize random policy - does it ever do the right thing?
 - Baseline
        
- Cross-entropy method
 - well-tuned policy gradient method
 - well-tuned Q-learning + SARSA method
 
 
 - Ongoing development and tuning
    
- If too sensitive to each hyperparameter, it’s not robust enough
 - health indicator
        
- value function fit quality
 - policy entropy
 - Update size in output space and parameter space
 - Standard diagnostics for deep networks
 
 - always use multiple random seeds
 
 - General tuning strategies for RL
    
- Whitening / standardizing data
        
- Compute running estimate of mean and stdev (over all data you’ve seen)
 - Rescale the rewards but don’t shift mean
 
 - Important parameters
        
- Discount
 - Action frequency
 
 - Entropy as Diagnostic
        
- Premature drop in policy entropy -> no learning
 - Alleviate by using entropy bonus or KL penalty
 - Compute entropy analytically for discrete action spaces
 - KL as diagnostic
            
- KL spike -> drastic loss of performance
 
 
 - explained variance = 1 - Var(empirical return - predicted value) / Var(empirical return) (see the sketch below)
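A one-function sketch of this diagnostic (the returns and predictions below are placeholder numbers); 1 means the value function explains the returns perfectly, 0 means no better than a constant:

```python
import numpy as np

def explained_variance(predicted_values, empirical_returns):
    var_returns = np.var(empirical_returns)
    if var_returns == 0:
        return float('nan')
    return 1.0 - np.var(empirical_returns - predicted_values) / var_returns

returns = np.array([1.0, 2.0, 3.0, 4.0])
values = np.array([1.1, 1.9, 3.2, 3.8])   # placeholder value-function predictions
print(explained_variance(values, returns))
```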
 
        
 - Policy initialization
    
- Zero or tiny final layer, to maximize entropy
 
 - Q-learning
    
- Optimize memory usage to have a large replay buffer
 - Learning rate schedules
 - Exploration schedules
 - DQN converges slowly