Pieter Abbeel is one of the world's leading RL researchers. He has recently made much of the material he teaches in his classes at Berkeley freely available online. Below is an excerpt of my notes from his Deep RL Bootcamp.
1 - Motivation + Overview + Exact Solution Methods
- RL Algorithms Landscape
    - Dynamic Programming (DP)
        - Value Iteration -> Q-Learning
        - Policy Iteration -> Actor-Critic Methods
    - Policy Optimization
        - DFO/Evolution
        - Policy Gradients -> Actor-Critic Methods
 - MDP - Agent observes the world, chooses an action, the environment changes, the agent may observe a reward and change state
    
- Set of states S
 - Set of Actions A
 - Transition function P(s’ | s,a)
 - Reward function R(s, a, s’)
 - Start state S_0
 - Discount factor g
 - Horizon H
 
 - Optimal Control: Given an MDP (S,A,P,R,g,H) find the optimal policy pi*
 - Exact methods: Value iteration and policy iteration
 - Value functions
    
- V*(s) = expected sum of discounted rewards when starting from s and acting optimally
 - V*_k(s) = optimal value for state s when the horizon is k (k steps left)
 
 - Value Iteration
    
- Algorithm (value update, also called Bellman update/backup); a NumPy sketch appears at the end of this Value Iteration section
        
- Start with V*_0(s) = 0 for all s
 - for all k in 1…H: for all s in S:
            
- V*_k(s) = max{a} sum{s'} P(s'|s,a) (R(s,a,s') + g*V*_{k-1}(s'))
 - pi*_k(s) = argmax of same expression
 
 
- For a finite state space and a discount factor < 1, value iteration converges
        
- For the infinite-horizon case, simply run VI until it converges (the Bellman update is a contraction); this yields V* and hence pi*, which determines optimal behavior
 - V*_H -> V* as H -> infinity
 
 - Note: optimal policy is stationary (optimal action at state s is always the same)
 - Q*(s,a) - much like V*, but: from s, take action a, and from then on act optimally
        
- Q*(s,a) = sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q*(s',a'))
 - Q_{k+1}(s,a) = sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q_k(s',a'))
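A minimal NumPy sketch of the tabular value iteration backup above, on a small made-up MDP (the transition tensor P and reward tensor R below are random placeholders, not from the lecture):

```python
import numpy as np

# Tabular value iteration on a placeholder 3-state, 2-action MDP.
# P[a, s, s'] are transition probabilities, R[a, s, s'] are rewards (both invented).
nS, nA, gamma, H = 3, 2, 0.9, 100
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # each P[a, s, :] sums to 1
R = rng.normal(size=(nA, nS, nS))

V = np.zeros(nS)
for k in range(H):
    # Q[a, s] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
    V = Q.max(axis=0)            # Bellman backup: max over actions
pi_star = Q.argmax(axis=0)       # greedy policy w.r.t. the converged values
print("V* ~", V, "pi* =", pi_star)
```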
 
 
        
 - Policy Methods
    
- Policy Evaluation: how to compute V^pi, the value (goodness) of a fixed policy pi
        
- We can use value iteration formulation, with fixed policy
 
- Value iteration: V_k(s) = max{a} sum{s'} P(s'|s,a) (R(s,a,s') + g*V_{k-1}(s'))
 - Policy evaluation: V^pi_k(s) = sum{s'} P(s'|s,pi(s)) (R(s,pi(s),s') + g*V^pi_{k-1}(s'))
 - At convergence
        
- the LHS and RHS values are the same (V^pi is a fixed point of its Bellman equation)
 
 
        
 - Policy Iteration (repeat the following 2 steps until pi_{k+1} = pi_k)
    
        
- Policy evaluation for current policy pi_k (iterate until convergence)
            
- V^{pi_k}_{i+1}(s) <- …
 
 
            
        
- Policy improvement
            
- pi_{k+1}(s) <- argmax[a] sum[s’] P(s’|s,a)[R(s,a,s’) + g*V^{pi_k}(s’)]
 
 
            
 - Computation methods
        
- Option 1: iterate the modified (fixed-policy) Bellman updates until convergence
 - Option 2: because we dropped the max over a, V^pi is the solution of a linear system of equations and can be solved exactly (see the sketch below)
            
- Variables: V^{pi}(s)
 - Constants: P, R
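A sketch of exact policy evaluation by solving that linear system with NumPy (the MDP arrays and the policy below are placeholders):

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P^pi) V^pi = R^pi.
# P[a, s, s'] and R[a, s, s'] are placeholder arrays; pi maps each state to an action.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))
pi = np.array([0, 1, 0])                       # an arbitrary deterministic policy

P_pi = P[pi, np.arange(nS), :]                 # P^pi[s, s'] = P(s' | s, pi(s))
R_pi = np.einsum('sn,sn->s', P_pi, R[pi, np.arange(nS), :])   # expected one-step reward under pi
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print("V^pi =", V_pi)
```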
 
 
 
        
 
2 - Sampling-based Approximations and Fitted Learning
- Optimal control: Given MDP (S, A, P, R, g, H), find optimal policy pi* maximizing sum of expected rewards
    
- Exact methods (value and policy iteration) guaranteed to converge for finite state/action space and when we have perfect access to environment
 - Assumptions of exact methods -> mitigations:
        
- Update equations require access to dynamics model -> Sampling-based approximations
 - Iteration over / Storage for all states and actions -> Q/V function fitting
 
 
 - Q*(s,a) means: from state s, take action a, and from then on act optimally
 - Q-value iteration
    
- Q_{k+1}(s,a) <- sum{s'} P(s'|s,a) (R(s,a,s') + g*max{a'} Q_k(s',a'))
 
 - Q-value iteration with sampling
    
- Q_{k+1}(s,a) <- E{s’~P(s’|s,a)}(R(s,a,s’) + g*max{a’}(Q_k(s’,a’)))
 - For (s,a)
        
- receive s’ ~ P(s’|s,a)
 - consider target(s’) = R(s,a,s’) + g*max[a’]Q_k(s’,a’)
 - Q_{k+1}(s,a) <- (1-alpha)Q_k(s,a) + alpha(target(s’))
 
 
 - Tabular Q-Learning
    
- start with Q_0(s,a) = 0 for all s,a
 - for k = 1… until convergence
        
- sample action a (via some exploration method), get s' ~ P(s'|s,a) and reward R(s,a,s')
 - if s’ is terminal
            
- target = R(s,a,s’)
 - sample new initial state s’
 
 - else
            
- target = R(s,a,s’) + g*max[a’]Q_k(s’,a’)
 
 - Q_{k+1}(s,a) <- (1-alpha)Q_k(s,a) + alpha*target
 - s <- s’
 
 
 - How to sample actions
    
- e-greedy (epsilon-greedy) - choose a random action with probability epsilon, otherwise choose the greedy action; a tabular Q-learning sketch using epsilon-greedy follows below
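A minimal sketch of the tabular Q-learning loop above with epsilon-greedy exploration, run on the same kind of placeholder random MDP (no terminal states, so the terminal branch is omitted):

```python
import numpy as np

# Tabular Q-learning with epsilon-greedy exploration on a placeholder random MDP
# (invented P and R tensors; not from the lecture).
nS, nA, gamma, alpha, eps = 3, 2, 0.9, 0.1, 0.1
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

Q = np.zeros((nS, nA))
s = 0
for step in range(50_000):
    a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())   # epsilon-greedy
    s_next = rng.choice(nS, p=P[a, s])                # sample s' ~ P(.|s, a)
    r = R[a, s, s_next]
    target = r + gamma * Q[s_next].max()              # this toy MDP has no terminal states
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # running-average update
    s = s_next
print("Greedy policy:", Q.argmax(axis=1))
```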
 
 - Q-learning (an example of off-policy learning) converges to optimal policy
    
- If you explore enough
        
- All states and actions are visited infinitely often
 
 - learning rate needs to decrease (but not too quickly)
        
- sum{t} alpha_t(s,a) = infinity and sum{t} alpha_t(s,a)^2 < infinity
 
 
        
 - Can sampling work for value iteration? No because of the max over a
 - Can Policy iteration work with sampling?
    
- Policy evaluation - yes - called Temporal Difference
        
- V^{pi_k}_{i+1}(s) <- E{s'~P(s'|s,pi_k(s))}(R(s,pi_k(s),s') + g*V^{pi_k}_i(s'))
 
 - Policy improvement
        
- Not easy - because of argmax over a
 
 
        
 - Can tabular methods scale - no, state space gets too big (10^10)
 - Problem
    
- We keep a table of all Q-values
 - We can't (and don't need to) do this at scale: the table is huge, and we rarely visit the exact same state twice, so the experience doesn't transfer
 
 - Solution - generalize - learn about small number of training states, generalize that experience to new, similar situations
 - Approximate Q-learning
    
- Instead of a table, use a parameterized Q-function Q_theta(s,a)
 - Can be linear in features: Q_theta(s,a) = theta_0*f_0(s,a) + … + theta_n*f_n(s,a)
 
 - Learning rule:
    
- target(s') = R(s,a,s') + g*max{a'} Q_{theta_k}(s',a')
 - theta_{k+1} <- theta_k - alpha * grad_theta[(Q_theta(s,a) - target(s'))^2] evaluated at theta = theta_k
 
 - The features can be hand-engineered, or the whole Q-function can be learned with deep learning; a linear-feature sketch follows below
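A sketch of the approximate Q-learning update with a linear Q-function; the feature map and the example transition below are hypothetical placeholders:

```python
import numpy as np

# Approximate Q-learning with a linear Q-function: Q_theta(s, a) = theta . f(s, a).
n_features, gamma, alpha = 4, 0.99, 0.01
theta = np.zeros(n_features)

def features(s, a):
    # Hypothetical hand-made feature map f(s, a); in practice these are engineered or learned.
    return np.array([1.0, s, a, s * a], dtype=float)

def q(theta, s, a):
    return theta @ features(s, a)

def update(theta, s, a, r, s_next, actions):
    # target(s') = r + gamma * max_a' Q_theta(s', a'), held fixed during the gradient step
    target = r + gamma * max(q(theta, s_next, ap) for ap in actions)
    td_error = q(theta, s, a) - target
    # grad_theta (Q_theta(s,a) - target)^2 = 2 * td_error * f(s, a)
    return theta - alpha * 2.0 * td_error * features(s, a)

# Example update for one invented transition (s=0.5, a=1, r=1.0, s'=0.7):
theta = update(theta, 0.5, 1, 1.0, 0.7, actions=[0, 1])
print(theta)
```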
 
3 - Deep Q-Networks
- Q-Learning
    
- Agent collects experiences online
 - target(s') = R(s,a,s') + g*max{a'} Q_{theta_k}(s',a')
 - update: theta_{k+1} <- theta_k - alpha * grad_theta E_{s'~P(s'|s,a)}[(Q_theta(s,a) - target(s'))^2] | theta=theta_k
 - For a tabular function: Q_{k+1}(s,a) <- (1-alpha)*Q_k(s,a) + alpha*target(s')
 - Does it work with neural network Q functions? Yes but with some care
 
 - Problems
    
- (non-stationary targets) Neural networks generalize, so updating Q(s,a) for one (s,a) also shifts the values of other (s,a) pairs, including the ones used to compute targets
 - Samples within a trajectory are correlated (a non-stationary, non-i.i.d. distribution for gradient descent)
 
 - DQN
    
- Make Q-learning look like supervised learning
 - Experience replay - take an action, get the reward, but instead of applying the update right away, commit the transition to a replay buffer; when it comes time to learn, sample a mini-batch (a one-step update sketch follows the algorithm below)
 
 - DQN Algorithm
    
- Init replay memory D
 - Init action-value function Q with random weights th
 - Init target action-value function Q-hat with weights th-hat = th
 - for episode = 1, M
        
- Init sequence s_1= {x_1} and preprocessed sequence phi_1=phi(s_1)
 - For t=1,T
            
- Epsilon-greedy pick of a_t: a random action with probability epsilon, otherwise argmax[a] Q(phi(s_t),a;th)
 - Execute action a_t and observe reward r_t, image x_{t+1}
 - Set s_{t+1} = s_t,a_t,x_{t+1} and preprocess phi_{t+1} = phi(s_{t+1})
 - Store transition (phi_t, a_t, r_t, phi_{t+1}) in D
 - Sample random minibatch of transitions (phi_j,a_j,r_j,phi_{j+1}) from D
 - Set y_j
                
- r_j if episode terminates at step j+1
 - r_j + g*max[a’]Q-hat(phi_{j+1},a’;th-hat) otherwise
 
 - Perform grad descent on (y_j - Q(phi_j,a_j;th))^2 w.r.t th
 - Every C steps reset Q-hat = Q
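A minimal sketch of one DQN update step with a separate target network, assuming PyTorch; the network sizes, hyperparameters, and the fake minibatch are placeholders, not the bootcamp's implementation:

```python
import torch
import torch.nn as nn

# One DQN update step: online network Q, frozen target network Q-hat.
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())           # Q-hat starts equal to Q
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_step(obs, actions, rewards, next_obs, done):
    with torch.no_grad():                                 # targets use the frozen network
        max_next_q = target_net(next_obs).max(dim=1).values
        y = rewards + gamma * (1.0 - done) * max_next_q   # y = r if terminal, else r + g*max Q-hat
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake minibatch, standing in for a sample from the replay buffer:
B = 32
dqn_step(torch.randn(B, obs_dim), torch.randint(n_actions, (B,)),
         torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B))
# Every C updates: target_net.load_state_dict(q_net.state_dict())
```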
 
 
 
 - DQN Details
    
- RMSProp instead of vanilla SGD (optimization in RL really matters, since it determines what data you see next)
 - Anneal the exploration rate - epsilon starts at 1 -> to 0.1 -> 0.05
 - Value-based methods are more robust than policy gradient methods
- Double DQN - use the online network for selecting the best action but the target network for evaluating it (separate the argmax from the evaluation); see the sketch after this list
 - Prioritized experience replay - replay transitions in proportion to absolute Bellman error (revisit and update the state/action pairs you got most wrong)
 - Dueling DQN
        
- Value-advantage decomposition of Q: Q^{pi}(s,a) = V^{pi}(s) + A^{pi}(s,a)
 - Dueling DQN: Q(s,a) = V(s) + A(s,a) - (1/|A|) * sum{a'=1..|A|} A(s,a')
 
 - Noisy Nets - Beyond epsilon-greedy - Add noise to network parameters for better exploration
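A small NumPy sketch contrasting the DQN and Double DQN targets, plus the dueling aggregation; all the Q-values and rewards below are invented numbers for illustration:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0])
q_online_next = np.array([[1.0, 3.0], [2.0, 0.5]])   # Q(s', .) from the online network
q_target_next = np.array([[2.5, 1.0], [1.5, 0.7]])   # Q-hat(s', .) from the target network

# DQN: max over a' using the target network alone
dqn_target = rewards + gamma * q_target_next.max(axis=1)
# Double DQN: online net *selects* the action, target net *evaluates* it
a_star = q_online_next.argmax(axis=1)
ddqn_target = rewards + gamma * q_target_next[np.arange(2), a_star]

# Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
V = np.array([[0.5], [1.0]])                         # value stream output, shape (batch, 1)
A = np.array([[0.2, -0.2], [1.0, 3.0]])              # advantage stream output
Q_dueling = V + A - A.mean(axis=1, keepdims=True)
print(dqn_target, ddqn_target, Q_dueling, sep="\n")
```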
 
 
4a - Policy Gradients and Actor Critic
- Policy optimization: pi_th(u|s)
    
- So far, we've been trying to find V or Q and derive pi from them - now we optimize the policy directly
 - theta no longer parameterizes the Q-function but the NN for the policy
 - Stochastic policy class to smooth out the optimization problem
 - The policy is often simpler than V or Q, and V/Q don't directly give you the action anyway (you still need an argmax)
 - PO is more compatible with rich architectures than DP but can’t do off-policy learning (collect data with a different policy) and is less sample-efficient
 
 - Likelihood Ratio Policy Gradient
    
- trajectory tau: s_0,u_0,..s_H,u_H
 - Reward of trajectory: R(tau) = sum{t=0:H} R(s_t,u_t)
 - Utility of the policy parameters: U(theta) = E[sum{t=0:H} R(s_t,u_t); pi_theta]
        
- = sum{tau} P(tau; theta) * R(tau)
 
 - Goal is to find theta: max{theta} U(theta) = max{theta} sum{tau} P(tau; theta) * R(tau)
 
 - grad_theta U(theta) ~ (1/m) sum{i=1:m} grad_theta log P(tau_i; theta) * R(tau_i)
    
- Valid even when R is discontinuous and/or unknown
 - Increases the probability of paths with positive R and decreases it for paths with negative R
 - To reduce noise, subtract a baseline b and use (R(tau_i) - b): this reduces variance while keeping the estimate unbiased
 
 - Likelihood Ratio and Temporal Structure
    
- (1/m) sum{i=1:m} sum{t=0:H-1} grad_theta log pi_theta(u_t^i|s_t^i) * ( sum{k=t:H-1} R(s_k^i,u_k^i) - b(s_t^i) )
 - b(s_t) = E[r_t + r_{t+1} + ... + r_{H-1}] = V^pi(s_t)
 
 - Algorithm
 - initialize policy parameter th, baseline b
 - for iteration=1,2..
    
- Collect a set of trajectories by executing the current policy
 - At each timestep in each trajectory compute
        
- the return R_t = sum{t'=t:T-1} g^{t'-t} * r_{t'}
 - advantage estimate A_t = R_t - b(s_t)
 
 - Refit the baseline: minimize ||b(s_t) - R_t||^2 summed over all trajectories and timesteps
 - Update the policy using a policy gradient estimate g-hat, which is a sum of terms grad_theta log pi(a_t|s_t,theta) * A-hat_t (a sketch of this computation follows below)
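A minimal NumPy sketch of one such update for a linear softmax policy on a single invented trajectory (the features, actions, rewards, and the mean-return baseline are placeholders; in practice the baseline is a fitted value function):

```python
import numpy as np

# Vanilla policy gradient step for pi_theta(a|s) = softmax(theta^T f(s)).
gamma = 0.99
n_features, n_actions = 3, 2
theta = np.zeros((n_features, n_actions))
feats = np.array([[1.0, 0.2, -1.0], [1.0, -0.5, 0.3], [1.0, 0.1, 0.8]])  # f(s_t) per step
actions = np.array([0, 1, 1])
rewards = np.array([1.0, 0.0, 2.0])

# Returns R_t = sum_{t' >= t} gamma^(t'-t) * r_t'
T = len(rewards)
returns = np.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    returns[t] = running

baseline = returns.mean()              # crude baseline standing in for a fitted b(s_t)
advantages = returns - baseline

# Policy gradient: sum_t grad_theta log pi(a_t|s_t) * A_t
grad = np.zeros_like(theta)
for t in range(T):
    logits = feats[t] @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    dlogp = -np.outer(feats[t], probs)             # grad of log-softmax w.r.t. theta, all actions
    dlogp[:, actions[t]] += feats[t]               # plus the chosen-action term
    grad += dlogp * advantages[t]
theta += 0.01 * grad                                # gradient *ascent* on expected return
print(theta)
```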
 
 - Instead of the raw sum of rewards we can use a Q-value estimate (which accounts for the rewards at later steps as well)
 - Q^{pi,g}(s,u) = E[r_0 + g*r_1 + ... | s_0=s, u_0=u]
    
- = E[r_0 + g*V^pi(s_1) | s_0=s, u_0=u]
 - Asynchronous Advantage Actor-Critic (A3C) uses a k=5 step lookahead
 - Generalized Advantage Estimation (GAE) takes an exponentially weighted average over all k-step estimates (see the sketch below)
 - Similar to TD(lambda)
 
 - Actor-Critic: policy gradient + a learned value function used for the baseline / generalized advantage estimate
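A short sketch of the GAE computation from TD residuals (gamma, lambda, and the rewards/values below are placeholder numbers):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation. values has length len(rewards) + 1
    (the last entry bootstraps the value of the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        running = delta + gamma * lam * running                  # A_t = delta_t + (g*lam)*A_{t+1}
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 0.5])
values = np.array([0.8, 0.6, 0.4, 0.0])    # V(s_0..s_3); placeholder critic outputs
print(gae(rewards, values))
```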
 
4b - Policy Gradients Revisited
- Supervised learning: maximize sum{i} log p(y_i|x_i)
 - RL:
    
- y_i ~ p(.|x_i)
 - maximize sum{i} A_i * log p(y_i|x_i)
        
- A_i should be discounted (somehow - exponential may not be the best idea)
 
 
 - Code
 
5 - Natural Policy Gradients, TRPO, PPO
- Vanilla Policy Gradient Methods
    
- Get a policy gradient estimate and plug it into SGD
 - Lots of ways to get better advantage estimate
 - But once you get it, how do you update the policy?
 - Issues:
        
- Hard to choose stepsize
            
- The input data in RL is non-stationary: the observation and reward distributions change as the policy changes
 - A bad step is more damaging than in supervised learning, since it affects the future visitation distribution
 
 - Sample efficiency
            
- Only one gradient step per env sample
 
 
            
 - Much of modern ML is about reducing learning to a numerical optimization problem
        
- Supervised learning: minimize training error
 
- Q-learning can reuse all transitions seen so far, but it optimizes the wrong objective (Bellman error, as opposed to the performance of the policy)
 - Policy gradients do optimize policy performance, but there is no explicit optimization problem being solved (old data is not reused)
 - Instead of just following the gradient per observation, let’s solve an optimization problem
 
 - What Loss to optimize?
    
- Policy gradient
        
- g-hat = E-hat_t[ grad_theta log pi_theta(a_t|s_t) * A-hat_t ]
 
 - Can differentiate the following loss
        
- L^{PG}(theta) = E-hat_t[ log pi_theta(a_t|s_t) * A-hat_t ]
 - but don’t want to optimize it too far (advantage is noisy)
 
 - Equivalently differentiate
        
- L^{IS}_{theta_old}(theta) = E-hat_t[ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A-hat_t ]
 - IS = Importance sampling - What would the expectation be under distribution A but when I collected samples from distribution B
 - We want to choose a policy pi_th such that the expected advantage under that policy is high (but collecting samples from old policy)
            
- But this doesn’t solve the problem that steps need to be small and we don’t want to optimize the objective fully
 
 
 - Trust Region Policy Optimization
        
- I have some function I want to optimize, and a local approximation to it (which becomes very inaccurate outside the trust region). Maximize this local approximation subject to the constraint that we don't move too far from the current point. The distance could be, for example, the Euclidean distance between the old and new parameters; in our case it is the KL divergence between the old and new policies' action distributions. We don't care about distance in parameter space; we care about the distributions the parameters define.
 - We could also do a penalized optimization (as opposed to constrained)
 
 
        
 - TRPO algorithm
    
- for iteration=1,2,..
        
- run the current policy for T timesteps or N trajectories
 - Estimate advantage function for all timesteps for all trajectories
            
- Maximize surrogate loss subject to constraint
 - maximize{theta} sum{n} ( pi_theta(a_n|s_n) / pi_theta_old(a_n|s_n) * A-hat_n ) subject to KL_{pi_theta_old}(pi_theta) <= delta
 
 
 - We can solve this approximately but efficiently using conjugate gradient
 
        
 - Solving KL penalized problem (non-linear optimization)
    
- Goal: Maximize[th] L_pi_th_old(pi_th) - b*KL_pi_th_old(pi_th)
 - Repeatedly make an approximation to it and solve the subproblem
        
- Linear approx. to L and quadratic to KL
 
 - This is called the natural policy gradient
- Cheap gradient principle - computing the gradient of a function costs only a small multiple of computing the function itself - but forming the Hessian of L explicitly is expensive, roughly #params times the cost of a gradient
 - Use conjugate gradients to do Hessian-free optimization (only Hessian-vector products are needed)
 
 - Summary
    
- Wanted to write down an optimization problem to get a more sample-efficient and robust policy update
        
- Suggested optimizing a surrogate loss L^PG or L^IS - not good on its own because it doesn't constrain the size of the update
 - Use KL divergence to constrain the size of the update - but that gives a constrained optimization problem
 - Corresponds to a natural gradient step F^-1*g under a linear approximation of the objective and a quadratic approximation of the KL
 - can solve for this step approximately using conjugate gradient method
 
 
        
 - Proximal Policy Optimization: KL Penalty Version
    
- Useful if you don't want to implement the conjugate gradient method
 - Use penalty instead of constraint
 - max{theta} sum{n=1:N} ( pi_theta(a_n|s_n) / pi_theta_old(a_n|s_n) * A-hat_n ) - C * KL_{pi_theta_old}(pi_theta)
 - Algo:
 - For iteration=1,2,…
        
- Run policy for T timesteps or N trajectories
 - Estimate advantage function at all timesteps
 - Do SGD on above objective for some number of epochs
- If the KL is too high, increase b; if the KL is too low, decrease b (see the sketch after this list)
 
 - Treat the target KL divergence as the hyperparameter - the penalty coefficient b is adapted to keep each update tight
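A tiny sketch of that adaptive penalty coefficient; the 1.5x tolerance band and the factor of 2 are common placeholder choices, not values given in the lecture:

```python
def adapt_kl_coef(beta, measured_kl, target_kl, factor=2.0):
    """Increase the penalty if the measured KL overshoots the target, decrease it if it undershoots."""
    if measured_kl > 1.5 * target_kl:
        beta *= factor
    elif measured_kl < target_kl / 1.5:
        beta /= factor
    return beta

beta = 1.0
beta = adapt_kl_coef(beta, measured_kl=0.05, target_kl=0.01)   # KL too high -> beta doubles
print(beta)
```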
 
 - Connection between trust region problem and other things
    
- Linear-quadratic approximation + penalty -> natural gradient
 - No constraint -> policy iteration
 - Euclidean penalty instead of KL -> vanilla policy gradient
 
 - Limitation of TRPO
    
- Hard to use with architectures that have multiple outputs (e.g., a shared network for policy and value), and empirically it performs poorly on tasks requiring deep CNNs and RNNs
 - The conjugate gradient step makes the implementation more complicated
 
 - Alternate to TRPO
    
- KFAC - Kronecker-factored approximate Fisher
        
- Do a blockwise approximation to the Fisher information matrix
 - Combined with A2C, it gives very good results on Atari
 
 - Clipping objective
        
- form a lower bound on the surrogate objective via clipped importance ratios (see the PPO-clip sketch below)
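A minimal NumPy sketch of that clipped-ratio objective (the clip parameter eps=0.2 and the example numbers are placeholders):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                       # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # take the minimum so the objective never rewards pushing the ratio far outside [1-eps, 1+eps]
    return np.minimum(ratio * advantages, clipped * advantages).mean()

logp_old = np.log(np.array([0.5, 0.2, 0.8]))
logp_new = np.log(np.array([0.6, 0.1, 0.9]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```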
 
 
        
 
6 - Nuts and bolts of Deep RL
- How to make the learning problem easier
    
- Use a smaller problem at first
 - Feature engineering
 - Shape the reward function
 
 - POMDP Design
    
- Visualize random policy - does it ever do the right thing?
 - Baseline
        
- Cross-entropy method
 - well-tuned policy gradient method
 - well-tuned Q-learning + SARSA method
 
 
 - Ongoing development and tuning
    
- If too sensitive to each hyperparameter, it’s not robust enough
 - health indicator
        
- value function fit quality
 - policy entropy
 - Update size in output space and parameter space
 - Standard diagnostics for deep networks
 
 - always use multiple random seeds
 
 - General tuning strategies for RL
    
- Whitening / standardizing data
        
- Compute running estimate of mean and stdev (over all data you’ve seen)
 - Rescale the rewards but don’t shift mean
 
 - Important parameters
        
- Discount
 - Action frequency
 
 - Entropy as Diagnostic
        
- Premature drop in policy entropy -> no learning
 - Alleviate by using entropy bonus or KL penalty
 - Compute entropy analytically for discrete action spaces
 - KL as diagnostic
            
- KL spike -> drastic loss of performance
 
 
 - explained variance = 1 - Var(empirical return - predicted value) / Var(empirical return) (see the sketch below)
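A one-function sketch of this diagnostic (the returns and predictions below are placeholder numbers); 1 means the value function explains the returns perfectly, 0 means no better than a constant:

```python
import numpy as np

def explained_variance(predicted_values, empirical_returns):
    var_returns = np.var(empirical_returns)
    if var_returns == 0:
        return float('nan')
    return 1.0 - np.var(empirical_returns - predicted_values) / var_returns

returns = np.array([1.0, 2.0, 3.0, 4.0])
values = np.array([1.1, 1.9, 3.2, 3.8])   # placeholder value-function predictions
print(explained_variance(values, returns))
```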
 
        
 - Policy initialization
    
- Zero or tiny final layer, to maximize entropy
 
 - Q-learning
    
- Optimize memory usage to have a large replay buffer
 - Learning rate schedules
 - Exploration schedules
 - DQN converges slowly