Let us consider the simplified case of a one-step MDP to see the policy gradient theorem in use. Every episode is just a single step: the episode starts, an action is taken, and the episode ends with a reward for that step. We present a simplified version of the policy gradient theorem for this one-step setting, with simplified proofs. We then present a simple example, rock, paper, scissors, which lasts only a single step.

Policy Gradient Theorem

Definitions

Markov Decision Process (1)

A Markov decision process $M$ is a tuple
$$M = (S, A, R_a)$$
where

  • $S$ is the finite set of states
  • $A$ is the finite set of actions, and $A_s$ is the set of actions available in state $s \in S$
  • $R_a(s)$ is the reward for taking action $a$ in state $s$

Suppose that $M$ is a one-step MDP. The episode starts at state $s \in S$ and ends after one step with reward $r$.

Policy

Given a state $s$, the policy $\pi(s,a) \in [0,1]$ gives the probability that we take action $a$ from state $s$.

Let $\theta \in \mathbb{R}^n$ be a set of parameters that determine our policy. We denote such a policy by $\pi_\theta$.

Theorem

Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$:
$$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$$

Suppose we start at state $s$. Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$$

Notation Clarifications

  • The notation $\mathbb{E}_{\pi_\theta} [r]$ differs from the standard way of writing expectations in that the expectation carries a subscript $\pi_\theta$. Usually we have a random variable $X$ and write $\mathbb{E}[X] = \sum_{i=1}^k x_i p_i$. In our case, $r$ denotes the random variable of the reward for the action taken. We could instead have used $r_{\pi_\theta}$ as the random variable, where the probability of each action is given by $\pi_\theta$, and written the expectation as $\mathbb{E}[r_{\pi_\theta}]$ for the reward obtained by acting according to $\pi_\theta$.
  • The notation $\mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$ contains a term, $\nabla_\theta \log \pi_\theta(s,a)$, that is not written as a random variable. $\mathbb{E}_{\pi_\theta}$ is the expectation over all actions that can be taken, so $\nabla_\theta \log \pi_\theta(s,a)$ should be interpreted as the random variable obtained by fixing the state $s$ and letting $a$ range over all possible actions. We are slightly abusing notation by using the same symbols for the function and the random variable.

Proof

By our definition of the expectation of the reward, we have
$$\begin{aligned} J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\ &= \sum_{a \in A} \pi_\theta(s,a) R_a(s) \end{aligned}$$

Taking the gradient, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left[ \sum_{a \in A} \pi_\theta(s,a) R_a(s) \right] \\ &= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s) \end{aligned}$$
The reward function does not depend on the parameter $\theta$, so we do not take its gradient. The reward we receive is determined by the action, which is determined by $\theta$, but the reward function itself is independent of $\theta$.

The gradient of the policy can be written in terms of a log using the identity $\frac{\nabla f}{f} = \nabla \log f$ as follows:
$$\begin{aligned} \nabla_\theta \pi_\theta(s, a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s,a)} \\ &= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s, a) \end{aligned}$$

Using the log-gradient substitution, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s) \\ &= \sum_{a \in A} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s, a) \; R_a(s) \\ &= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right] \end{aligned}$$
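
As a quick illustration of how this expectation is used in practice, here is a minimal Python sketch that compares the exact gradient $\sum_a \nabla_\theta \pi_\theta(s,a) R_a(s)$ with a Monte Carlo estimate of $\mathbb{E}_{\pi_\theta}[(\nabla_\theta \log \pi_\theta(s,a)) r]$. The softmax policy, the reward values and the helper names (`pi`, `grad_log_pi`) are illustrative assumptions, not part of the theorem.

```python
import numpy as np

# A sketch of the sample-based estimator suggested by the theorem: draw
# actions from pi_theta, observe rewards, and average
# grad log pi_theta(s, a) * r.  The softmax policy and the reward values
# below are illustrative placeholders, not part of the theorem itself.

rng = np.random.default_rng(0)

theta = np.array([0.1, -0.2, 0.3])     # one logit per action
rewards = np.array([1.0, 0.0, -1.0])   # R_a(s) for the fixed start state s

def pi(theta):
    """Softmax policy over the three actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(s, a) for a softmax policy: e_a - pi(theta)."""
    return np.eye(len(theta))[a] - pi(theta)

# Exact gradient: sum_a grad_theta pi_theta(s, a) * R_a(s)
probs = pi(theta)
exact = sum(grad_log_pi(theta, a) * probs[a] * rewards[a] for a in range(3))

# Monte Carlo estimate of E_pi[ grad log pi_theta(s, a) * r ]
actions = rng.choice(3, size=50_000, p=probs)
estimate = np.mean([grad_log_pi(theta, a) * rewards[a] for a in actions], axis=0)

print(exact, estimate)  # the two should be close
```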

Rock Paper Scissors MDP

As an example of a one-step MDP, we will use the game of rock, paper, scissors. We play the game for one step and we either win, lose or draw.

For the first example of the game, we will assume that

  • we know what the opponent is going to play, and
  • the opponent will play rock (since we specify a starting state in our theorem; we will generalize to any starting state below).

States

We have 3 states for what we know the opponent will play.

  • $s_{r}$: the opponent plays rock
  • $s_{p}$: the opponent plays paper
  • $s_{s}$: the opponent plays scissors

Actions

We play either rock, paper or scissors.

  • $a_r$: play rock
  • $a_p$: play paper
  • $a_s$: play scissors

Rewards

We get +1 if we win the game, 0 if we draw the game and -1 if we lose the game.

Policy

Let $\theta \in \mathbb{R}$ be the parameter for our policy. We define $\pi_\theta$ as follows:
$$\begin{aligned} \pi_\theta(s_r, a_p) &= \sigma(\theta) \\ \pi_\theta(s_r, a_r) &= \frac{1 - \sigma(\theta)}{2} \\ \pi_\theta(s_r, a_s) &= \frac{1 - \sigma(\theta)}{2} \end{aligned}$$
where $\sigma(x)$ is the sigmoid function given by
$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The sigmoid function maps $(-\infty, \infty)$ to $(0, 1)$. Thus, the above policy is a valid probability distribution, as the sum over all actions is 1.

When $\theta \rightarrow -\infty$, $\pi_\theta(s_r, a_p) \rightarrow 0$, and when $\theta \rightarrow \infty$, $\pi_\theta(s_r, a_p) \rightarrow 1$. Thus, the smaller the value of $\theta$, the smaller the probability that we choose the action $a_p$ for paper.

Policy Gradient

Since the opponent plays rock, we have reward 1 for playing paper, 0 for playing rock and -1 for playing scissors, so
$$\begin{aligned} \nabla_\theta J(\theta) &= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s) \\ &= 1 \cdot \nabla_\theta \pi_\theta(s_r, a_p) + (-1) \cdot \nabla_\theta \pi_\theta(s_r, a_s) \\ &= \sigma^\prime(\theta) + \frac{\sigma^\prime(\theta)}{2} \\ &= \frac{3}{2} \sigma^\prime(\theta) \end{aligned}$$
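
As a sanity check, the following sketch compares this closed-form derivative with a finite-difference approximation of $J(\theta)$; the helper names `sigmoid` and `J` are choices made for this example.

```python
import numpy as np

# Finite-difference sanity check of dJ/dtheta = (3/2) * sigma'(theta).
# The helper names `sigmoid` and `J` are choices made for this sketch.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def J(theta):
    # Expected reward against rock: +1 for paper, 0 for rock, -1 for scissors.
    p_paper = sigmoid(theta)
    p_rock = p_scissors = (1.0 - sigmoid(theta)) / 2.0
    return 1.0 * p_paper + 0.0 * p_rock - 1.0 * p_scissors

theta, eps = 0.5, 1e-6
numeric = (J(theta + eps) - J(theta - eps)) / (2 * eps)
analytic = 1.5 * sigmoid(theta) * (1.0 - sigmoid(theta))  # (3/2) * sigma'(theta)

print(numeric, analytic)  # should agree to several decimal places
```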

We do the parameter update using the rule
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha \in \mathbb{R}$ is the learning rate.

Since $\sigma^\prime > 0$, $\theta$ always increases under this update. Thus, $\pi_\theta(s_r, a_p)$ goes towards 1 and the probabilities of the other actions go towards 0. Hence, the policy gradient learns to play paper against the opponent's rock.
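
A tiny gradient-ascent loop (a sketch, with an arbitrary learning rate and iteration count) shows this behavior numerically: $\theta$ keeps growing and $\pi_\theta(s_r, a_p)$ approaches 1.

```python
import numpy as np

# A small gradient-ascent loop on theta using the closed-form gradient
# (3/2) * sigma'(theta).  The learning rate and iteration count are arbitrary.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta, alpha = 0.0, 0.5
for _ in range(200):
    theta += alpha * 1.5 * sigmoid(theta) * (1.0 - sigmoid(theta))

print(theta, sigmoid(theta))  # theta grows and pi_theta(s_r, a_p) approaches 1
```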

Generalization (1) - With Transition Probability

We can generalize the MDP to have a transition probability.

In the first definition of the MDP, if we are in state $s$ and we take action $a$, we transition to a state $s^\prime$ and obtain a single reward for the action.

The generalization is that taking an action $a$ now gives us a probability of transitioning to each of the states in $S$. Before, an action from state $s$ would only take you to one specific state $s^\prime$. Now, there is a probability of transitioning to every state in $S$. If the probability of transitioning to state $s^\prime$ is 1 and the others are 0, it is effectively the same as without the generalization.

For the reward, if we are in state $s$ and we transition to state $s^\prime$, the reward is given by $R_a(s, s^\prime)$.

A Markov decision process (2) $M$ is a tuple
$$M = (S, A, P_a, R_a)$$
where

  • $S$ is the finite set of states
  • $A$ is the finite set of actions, and $A_s$ is the set of actions available in state $s \in S$
  • $P_a(s, s^\prime)$ is the probability of transitioning from state $s$ to state $s^\prime$ by taking action $a$
  • $R_a(s, s^\prime)$ is the reward for transitioning from state $s$ to $s^\prime$ by taking action $a$

Policy Gradient Theorem

Let $M$ be an MDP (2) as defined above. Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$:
$$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$$

Suppose we start at state $s$. Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$$

Proof

The definition of the MDP has changed; thus, when we take an action $a$, we transition from state $s$ to state $s^\prime$ with probability $P_a(s, s^\prime)$.

Thus, we have to sum over all subsequent states that we might go to after an action is taken.
$$\begin{aligned} J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\ &= \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \end{aligned}$$

Now, taking the gradient and noting that the transition probabilities and rewards do not depend on $\theta$, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \right)\\ &= \sum_{a \in A} \sum_{s^\prime \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \sum_{a \in A} \sum_{s^\prime \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right] \end{aligned}$$

Rock Paper Scissors

In the previous example of rock, paper, scissors, we knew what the opponent was going to play. With the new formulation of the MDP, we now divide the state into what we believe the opponent will play and what the opponent plays after we perform our action.

States

We have 6 states: 3 for before we play and 3 for after we play. The 3 states before we play stand for what we think the opponent will play. The 3 states after we play stand for what the opponent actually played.

  • $s_{r_1}$: we think the opponent will play rock
  • $s_{p_1}$: we think the opponent will play paper
  • $s_{s_1}$: we think the opponent will play scissors
  • $s_{r_2}$: the opponent played rock
  • $s_{p_2}$: the opponent played paper
  • $s_{s_2}$: the opponent played scissors

Actions

The actions are the same as before. We have $a_r$, $a_p$ and $a_s$ for playing rock, paper and scissors respectively.

Transition Probabilities

The transition probabilities are non-zero only from $s_{r_1}, s_{p_1}, s_{s_1}$ to $s_{r_2}, s_{p_2}, s_{s_2}$. If our guess were random, then all these transition probabilities would be $\frac{1}{3}$.

For example, a random transition probability could be
$$\begin{aligned} P(s_{r_1}, s_{r_2}) &= \frac{1}{3} \\ P(s_{r_1}, s_{p_2}) &= \frac{1}{3} \\ P(s_{r_1}, s_{s_2}) &= \frac{1}{3} \end{aligned}$$

Rewards

The reward is similar. We still get +1 for winning, 0 for drawing and -1 for losing the game. However, in this case, the reward depends on the starting state, the ending state and the action taken.

We can illustrate the rewards in terms of a matrix. Suppose our action was $a_r$. Then,
$$\begin{matrix} & s_{r_2} & s_{p_2} & s_{s_2} \\ s_{r_1} & 0 & -1 & 1 \\ s_{p_1} & 0 & -1 & 1 \\ s_{s_1} & 0 & -1 & 1 \end{matrix}$$
Note that only the ending state and the action determine the reward. Whatever we believe the opponent is going to play does not affect the reward.
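
One possible way to encode this reward function in code (an illustrative layout, not something fixed by the MDP definition) is as a $3 \times 3 \times 3$ array indexed by action, start state and end state:

```python
import numpy as np

# One possible encoding of R_a(s, s') as a 3 x 3 x 3 array indexed by
# (action, start state, end state), with indices 0, 1, 2 meaning rock,
# paper, scissors.  The start-state index is carried along even though
# only the action and the end state determine the reward.

beats = np.array([
    #  s_r2  s_p2  s_s2
    [   0,   -1,    1],  # a_r: rock draws rock, loses to paper, beats scissors
    [   1,    0,   -1],  # a_p: paper beats rock, draws paper, loses to scissors
    [  -1,    1,    0],  # a_s: scissors loses to rock, beats paper, draws scissors
])

# R[a, s1, s2] = beats[a, s2] for every belief state s1
R = np.repeat(beats[:, np.newaxis, :], 3, axis=1)
print(R[0])  # the matrix shown above for action a_r
```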

Policy

Let $\theta \in \mathbb{R}^3$. We will denote $\theta = [\theta_1, \theta_2, \theta_3]$.

We assume that we start at state $s_{r_1}$. Then, the policy is given by
$$\begin{aligned} \pi_\theta(s_{r_1}, a_r) &= \frac{e^{\theta_1}}{\sum_{i=1}^3 e^{\theta_i}} \\ \pi_\theta(s_{r_1}, a_p) &= \frac{e^{\theta_2}}{\sum_{i=1}^3 e^{\theta_i}} \\ \pi_\theta(s_{r_1}, a_s) &= \frac{e^{\theta_3}}{\sum_{i=1}^3 e^{\theta_i}} \end{aligned}$$

Thus, the higher the value of $\theta_i$, the more likely we are to take the corresponding action.

Policy Gradient

Since $\theta$ has 3 parameters, the gradient is a $3 \times 1$ vector.

$$\nabla_\theta \pi_\theta(s_{r_1}, a_r) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} \frac{e^{\theta_1} (e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$

$$\nabla_\theta \pi_\theta(s_{r_1}, a_p) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_2} (e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$

$$\nabla_\theta \pi_\theta(s_{r_1}, a_s) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_3} (e^{\theta_1} + e^{\theta_2})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$
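
These expressions are just the entries of the softmax Jacobian $\operatorname{diag}(\pi) - \pi \pi^\top$ written out one by one. The following sketch checks them against finite differences; `softmax` and `softmax_jacobian` are helper names chosen for this example.

```python
import numpy as np

# Finite-difference check of the softmax gradients written out above.
# `softmax` and `softmax_jacobian` are helper names chosen for this sketch.

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_jacobian(theta):
    """Matrix with entries d pi_i / d theta_j; row i holds the entries of
    grad_theta pi_theta(s_{r_1}, a_i)."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.2, -0.1, 0.4])
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    numeric[:, j] = (softmax(theta + d) - softmax(theta - d)) / (2 * eps)

print(np.allclose(numeric, softmax_jacobian(theta)))  # expected: True
```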

Optimality Under Random Choice

Suppose we do not know how the opponent plays and thus all transition probabilities are $\frac{1}{3}$.

Let us first look at the partial derivative of $J(\theta)$ with respect to $\theta_1$ instead of the whole gradient $\nabla_\theta J(\theta)$.

$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \frac{1}{3} \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} R_a(s, s^\prime) \\ &= \frac{1}{3} \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} \sum_{s^\prime \in S} R_a(s, s^\prime) \\ &= 0 \end{aligned}$$

The policy only depends on the starting state and not the state we transition to, while the reward depends on the action and the state we end up in. Since, for each action, there is one ending state where we win and one where we lose, the rewards sum to 0 over the ending states. Thus, we can choose any value of $\theta$ and we will have the same expected reward, since the gradient is 0 everywhere.
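
We can also verify this numerically: with uniform transition probabilities, a finite-difference gradient of the expected reward comes out as (approximately) zero for any $\theta$. The reward array `beats` and the function names below are assumptions of this sketch.

```python
import numpy as np

# Numerical check that the gradient of the expected reward vanishes when
# the opponent plays uniformly at random.  `beats`, `softmax` and `J` are
# helper names assumed for this sketch.

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

beats = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # beats[a, s2]

def J(theta, P):
    """Expected reward from s_{r_1}: sum_a sum_s' pi(a) P(s') R_a(s, s')."""
    pi = softmax(theta)
    return sum(pi[a] * P[s2] * beats[a, s2] for a in range(3) for s2 in range(3))

theta = np.array([0.7, -0.3, 1.2])
P_uniform = np.full(3, 1.0 / 3.0)

eps = 1e-6
grad = np.array([
    (J(theta + eps * np.eye(3)[j], P_uniform)
     - J(theta - eps * np.eye(3)[j], P_uniform)) / (2 * eps)
    for j in range(3)
])
print(grad)  # approximately [0, 0, 0] for any theta
```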

Policy Improvement Under Scissors Only

Suppose that the opponent only plays scissors. Then, we should always play rock; thus, $\theta_1$ should be as large as possible and $\theta_2, \theta_3$ should be as small as possible.

$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s^\prime) R_a(s, s^\prime)\\ &= \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s_{s_2}) R_a(s, s_{s_2})\\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} R_{a_r}(s, s_{s_2}) + \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} R_{a_p}(s, s_{s_2})\\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} \end{aligned}$$

The first step follows from the definition. The second step follows because the opponent only plays scissors, so all transition probabilities that do not end in scissors are 0. The third step follows from the fact that only the actions rock and paper result in a nonzero reward: if both the opponent and the agent play scissors, it is a draw with reward 0. The fourth step just substitutes a reward of 1 for winning and -1 for losing.

Thus, the derivative with respect to $\theta_1$ is
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1}\\ &= \frac{e^{\theta_1} (e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \left( -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \right) \\ &= \frac{e^{\theta_1} (2e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ &> 0 \end{aligned}$$

Next, for $\theta_2$, we have
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_2} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_2} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_2}\\ &= -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \frac{e^{\theta_2} (e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ &< 0 \end{aligned}$$

Finally, for $\theta_3$, we have
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_3} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_3} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_3}\\ &= -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} + \frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ &= \frac{e^{\theta_3}(e^{\theta_2}-e^{\theta_1})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{aligned}$$

This is negative whenever $\theta_1 > \theta_2$, which holds after the first update since $\theta_1$ increases and $\theta_2$ decreases; hence $\frac{\partial J(\theta)}{\partial \theta_3} < 0$.

Thus, $\theta_1$ increases and $\theta_2, \theta_3$ decrease when updating $\theta$ along the policy gradient. Hence, the policy comes to favor playing rock over the other two alternatives.
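
The following sketch runs gradient ascent with the exact gradient for this scissors-only opponent (the learning rate, iteration count and helper names are arbitrary choices of this sketch) and shows the probability mass concentrating on rock.

```python
import numpy as np

# Gradient ascent on theta when the opponent always plays scissors
# (all transition probability mass on s_{s_2}).  The learning rate,
# iteration count and helper names are choices made for this sketch.

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

beats = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # beats[a, s2]
P_scissors = np.array([0.0, 0.0, 1.0])

def grad_J(theta, P):
    # grad J = sum_a sum_s' (grad pi(a)) P(s') R_a(s, s'), using the softmax
    # Jacobian diag(pi) - pi pi^T, where jac[a, j] = d pi_a / d theta_j.
    pi = softmax(theta)
    jac = np.diag(pi) - np.outer(pi, pi)
    expected_r = beats @ P            # expected reward of each action
    return jac.T @ expected_r

theta, alpha = np.zeros(3), 0.5
for _ in range(500):
    theta = theta + alpha * grad_J(theta, P_scissors)

print(theta)           # theta_1 grows while theta_2 and theta_3 shrink
print(softmax(theta))  # probability mass concentrates on a_r (rock)
```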

Generalization (2) - No Specified Start State

In the theorem statements above, we have had to specify a start state.

In order not to have to specify a start state, we define a distribution $d(s)$ that gives the probability of starting at state $s$.

Policy Gradient Theorem (2)

Let $M$ be an MDP (2). Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$:
$$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$$

Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$$

Proof

The random variable $r$ covers the reward starting from any state $s \in S$; the initial state is no longer fixed but drawn from $d(s)$.

$$\begin{aligned} J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \end{aligned}$$

Now, taking the gradient and noting that the initial distribution, transition probabilities and rewards do not depend on $\theta$, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \right)\\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right] \end{aligned}$$
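
A sample-based sketch of this general case: draw a start state from $d$, an action from $\pi_\theta(s, \cdot)$, a next state from $P_a(s, \cdot)$, and average $\nabla_\theta \log \pi_\theta(s,a)\, r$. The tabular softmax parameterization, the uniform $d$ and $P$, and the reward array below are assumptions made for illustration; this is not the code from the repository linked below.

```python
import numpy as np

# A sketch of the fully general estimator: sample a start state from d,
# an action from pi_theta(s, .), a next state from P_a(s, .), and use
# grad log pi_theta(s, a) * r as the gradient sample.  The tabular softmax
# policy, the uniform d and P, and the reward array are assumptions of
# this sketch.

rng = np.random.default_rng(0)
n_states, n_actions = 3, 3

theta = np.zeros((n_states, n_actions))                        # logits per start state
d = np.full(n_states, 1.0 / n_states)                          # start-state distribution
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)   # P_a(s, s')
beats = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
R = np.repeat(beats[:, np.newaxis, :], n_states, axis=1)       # R_a(s, s')

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_grad(theta):
    s = rng.choice(n_states, p=d)
    p = pi(theta, s)
    a = rng.choice(n_actions, p=p)
    s2 = rng.choice(n_states, p=P[a, s])
    g = np.zeros_like(theta)
    g[s] = (np.eye(n_actions)[a] - p) * R[a, s, s2]   # grad log pi * r
    return g

grad_estimate = np.mean([sample_grad(theta) for _ in range(20_000)], axis=0)
print(grad_estimate)  # with uniform P and symmetric rewards, approximately zero
```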

Python Code Example

https://github.com/mochan-b/policy-gradient-rock-paper-scissors

Using $\theta \in \mathbb{R}^9$, we have a probability for every action and initial state. The example can start from any state (which only signifies our belief of what the opponent will play), and it transitions to the state corresponding to the opponent's play after our action is taken. The code shows the improvement of the $\theta$ parameters using policy gradient updates.

Conclusion

This is a very simple version of the policy gradient theorem where the episodes are only a single step.

The rock, paper, scissors example illustrates how we can use the policy gradient to improve the policy. The Python example gives the full implementation and the full policy improvement.

Policy gradient algorithms are most useful when the states are continuous. Here, our states are discrete, so this problem could have been solved with policy iteration and dynamic programming; but we use policy gradients to illustrate the technique.