# Policy Gradient Theorem for One Step MDPs

We use a **one-step MDP** to see the policy gradient theorem in use. Every episode is just a single step: the episode starts, we take an action, and the episode ends with a reward for that step. We present a simplified version of the policy gradient theorem for this one-step episode setting with simplified proofs. We then present a simplified example of rock, paper, scissors where each episode lasts a single step.

## Policy Gradient Theorem

### Definitions

#### Markov Decision Process (1)

A **Markov decision process** $M$ is a tuple

$M = (S, A, R_a)$

where

- $S$ is the finite set of **states**
- $A$ is the finite set of **actions**, and $A_s$ is the set of actions for state $s \in S$
- $R_a(s)$ is the **reward** for taking action $a$ from state $s$

Suppose that $M$ is a **one-step MDP**. The episode starts at state $s \in S$ and ends after one step with reward $r$.

#### Policy

Given a state $s$, the **policy** $\pi(s,a) \in [0,1]$ gives us the probability that we will take action $a$ from state $s$.

Let $\theta \in \mathbb{R}^n$ be some set of parameters that determine our **policy**. We denote such a policy by $\pi_\theta$.

### Theorem

Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.

$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$

Suppose we start at state $s$. Then,

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$

#### Notation Clarifications

- The notation $\mathbb{E}_{\pi_\theta} [r]$ differs from the standard way of writing expectations in that we have a subscript $\pi_\theta$ on the expectation. Usually, we have a random variable $X$ and then $\mathbb{E}[X] = \sum_{i=1}^k x_i p_i$. In our case, $r$ denotes the random variable of the reward for the action taken. We could instead have used a random variable $r_{\pi_\theta}$, where the probability of each action is given by $\pi_\theta$, and written the expectation as $\mathbb{E}[r_{\pi_\theta}]$, the expected reward for taking actions according to $\pi_\theta$.
- The notation $\mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$ contains a term ($\nabla_\theta \log \pi_\theta(s,a)$) that is not itself a random variable. $\mathbb{E}_{\pi_\theta}$ is the expectation over all actions that can be taken. The $\nabla_\theta \log \pi_\theta(s,a)$ in this context should be interpreted as a random variable with the state $s$ given and $a$ ranging over all possible actions. We are slightly abusing notation by using the same symbols for the function and for the random variable.

## Proof

By our definition of expectation of the reward, we have

$\begin{aligned}
J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\
&= \sum_{a \in A} \pi_\theta(s,a) R_a(s)
\end{aligned}$

Taking the gradient, we have

$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \left[ \sum_{a \in A} \pi_\theta(s,a) R_a(s) \right] \\
&= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s)
\end{aligned}$

The reward function does not depend on the parameter $\theta$, so we do not take its gradient. The reward we receive is determined by the action, which is determined by $\theta$, but the reward function itself is independent of $\theta$.

The gradient of the policy can be written in terms of log using the identity $\frac{\nabla f}{f} = \nabla \log f$ as follows

$\begin{aligned}
\nabla_\theta \pi_\theta(s, a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s, a)
\end{aligned}$

Using the log gradient substitution, we have

$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s) \\
&= \sum_{a \in A} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s, a) \; R_a(s) \\
&= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]
\end{aligned}$
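The last line is the score-function form of the gradient, and we can check it numerically: sampling actions from $\pi_\theta$ and averaging $\left( \nabla_\theta \log \pi_\theta(s,a) \right) r$ should converge to the analytic gradient of $J$. A minimal sketch in Python, using a hypothetical two-action one-step MDP with a softmax policy (the rewards and parameter values are our own choices):

```python
import math
import random

# Hypothetical one-step MDP: a single state and two actions with fixed rewards.
rewards = [1.0, -1.0]

def policy(theta):
    """Softmax policy over the two actions."""
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]

def exact_grad(theta):
    """Analytic gradient of J(theta) = sum_a pi(a) R(a)."""
    pi = policy(theta)
    # For a softmax, d pi_a / d theta_i = pi_a * (delta_ai - pi_i).
    return [sum(pi[a] * ((1.0 if a == i else 0.0) - pi[i]) * rewards[a]
                for a in range(2)) for i in range(2)]

def monte_carlo_grad(theta, n=200_000, seed=0):
    """Score-function estimate: average of grad log pi(a) * r over samples."""
    rng = random.Random(seed)
    pi = policy(theta)
    acc = [0.0, 0.0]
    for _ in range(n):
        a = 0 if rng.random() < pi[0] else 1
        # grad_i log pi(a) for a softmax is delta_ai - pi_i.
        for i in range(2):
            acc[i] += ((1.0 if a == i else 0.0) - pi[i]) * rewards[a]
    return [v / n for v in acc]

theta = [0.3, -0.2]
g_exact = exact_grad(theta)
g_mc = monte_carlo_grad(theta)
```

With 200,000 samples the Monte Carlo estimate agrees with the analytic gradient to a couple of decimal places, which is exactly what the theorem predicts.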

## Rock Paper Scissors MDP

As an example of a one-step MDP, we will use the game of rock, paper, scissors. We play the game for one step and either win, lose, or draw.

For the first example of the game, we will assume that

- we know what the opponent is going to play, and
- the opponent will play rock (since we specify a starting state in our theorem; we will generalize to any starting state below).

### States

We have 3 states for what we know the opponent will play.

- $s_{r}$: opponent plays rock
- $s_{p}$: opponent plays paper
- $s_{s}$: opponent plays scissors

### Actions

We play either rock, paper or scissors.

- $a_r$ : play rock
- $a_p$ : play paper
- $a_s$ : play scissors

### Rewards

We get +1 if we win the game, 0 if we draw, and -1 if we lose.

### Policy

Let $\theta \in \mathbb{R}$ be the parameter for our policy. We define $\pi_\theta$ as the following:

$\begin{aligned}
\pi_\theta(s_r, a_p) &= \sigma(\theta) \\
\pi_\theta(s_r, a_r) &= \frac{1 - \sigma(\theta)}{2} \\
\pi_\theta(s_r, a_s) &= \frac{1 - \sigma(\theta)}{2} \\
\end{aligned}$

where $\sigma(x)$ is the sigmoid function given by

$\sigma(x) = \frac{1}{1+e^{-x}}$

The sigmoid function maps $(-\infty, \infty)$ to $(0, 1)$. The above policy is therefore a valid probability distribution, as the probabilities over all actions sum to 1.

When $\theta \rightarrow -\infty$, $\pi_\theta(s_r, a_p) \rightarrow 0$, and when $\theta \rightarrow \infty$, $\pi_\theta(s_r, a_p) \rightarrow 1$. Thus, the smaller the value of $\theta$, the smaller the probability of choosing the action $a_p$ for paper.
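As a quick check, the following sketch (our own code) implements this policy and verifies that it is a valid distribution for any $\theta$, with $\pi_\theta(s_r, a_p)$ approaching 0 or 1 at the extremes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def policy(theta):
    """The policy at state s_r from the text: paper gets sigma(theta),
    rock and scissors split the remaining probability equally."""
    p_paper = sigmoid(theta)
    p_other = (1.0 - p_paper) / 2.0
    return {"rock": p_other, "paper": p_paper, "scissors": p_other}

pi = policy(0.0)  # with theta = 0, paper has probability 1/2
```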

### Policy Gradient

Since the opponent plays rock, the reward is 1 for playing paper, -1 for playing scissors, and 0 for playing rock. Then,

$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{a \in A} \nabla_\theta \pi_\theta(s,a) \; R_a(s) \\
&= 1 \cdot \nabla_\theta \pi_\theta(s_r, a_p) + (-1) \cdot \nabla_\theta \pi_\theta(s_r, a_s) \\
&= \sigma^\prime(\theta) + \frac{\sigma^\prime(\theta)}{2} \\
&= \frac{3}{2} \sigma^\prime(\theta)
\end{aligned}$

We do the parameter update using the rule

$\theta = \theta + \alpha \nabla_\theta J(\theta)$

where $\alpha \in \mathbb{R}$ is the learning rate.

Since $\sigma^\prime > 0$, $\theta$ always increases under this update. Thus, $\pi_\theta(s_r, a_p)$ goes towards 1 and the probabilities of the other actions go towards 0. The policy gradient thus learns to play paper against the rock that the opponent plays.
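We can watch this update drive the policy toward always playing paper. The sketch below (our own code, with an arbitrary learning rate) ascends the analytic gradient $\frac{3}{2}\sigma^\prime(\theta)$ derived above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_J(theta):
    # From the derivation above: dJ/dtheta = (3/2) * sigma'(theta),
    # where sigma'(x) = sigma(x) * (1 - sigma(x)).
    s = sigmoid(theta)
    return 1.5 * s * (1.0 - s)

theta, alpha = 0.0, 1.0  # arbitrary starting parameter and learning rate
for _ in range(2000):
    theta += alpha * grad_J(theta)

p_paper = sigmoid(theta)  # pi(s_r, a_p) after training
```

$\theta$ increases at every step since $\sigma^\prime > 0$, and $\pi_\theta(s_r, a_p)$ climbs toward 1, although progress slows as $\sigma^\prime(\theta)$ shrinks.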

## Generalization (1) - With Transition Probability

We can generalize the MDP to have a transition probability.

In the first definition of the MDP, if we are in state $s$ and take action $a$, we would transition to a fixed state $s^\prime$ and obtain a single reward for the action.

The generalization is that taking an action $a$ now gives us a probability of transitioning to each of the states in $S$. Before, an action from state $s$ could only take you to one specific state $s^\prime$; now there is a probability of transitioning to every state in $S$. If the probability of transitioning to state $s^\prime$ is 1 and all the others are 0, this is effectively the same as without the generalization.

For reward, if we are in state $s$ and we transition to state $s^\prime$, the reward is given by $R_a(s, s^\prime)$.

A **Markov decision process (2)** $M$ is a tuple

$M = (S, A, P_a, R_a)$

where

- $S$ is the finite set of **states**
- $A$ is the finite set of **actions**, and $A_s$ is the set of actions for state $s \in S$
- $P_a(s, s^\prime)$ is the **transition probability** from state $s$ to state $s^\prime$ when taking action $a$
- $R_a(s, s^\prime)$ is the **reward** for transitioning from state $s$ to $s^\prime$

### Policy Gradient Theorem

Let $M$ be an MDP (2) as defined above. Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.

$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$

Suppose we start at state $s$. Then,

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$

### Proof

The definition of the MDP has changed: when we take an action $a$, we now transition from state $s$ to state $s^\prime$ with probability $P_a(s, s^\prime)$.

Thus, we have to sum over all subsequent states that we might go to after an action is taken.

$\begin{aligned}
J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\
&= \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime)
\end{aligned}$

Now, taking the gradient and noting that the transition probability and the reward do not depend on $\theta$, we have

$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \right)\\
&= \sum_{a \in A} \sum_{s^\prime \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\
&= \sum_{a \in A} \sum_{s^\prime \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\
&= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]
\end{aligned}$

### Rock Paper Scissors

In the previous example of rock, paper, scissors, we knew what the opponent was going to play. With the new formulation of the MDP, we now divide the states into what we believe the opponent will play and what the opponent actually plays after we perform our action.

#### States

We have 6 states: 3 for before we play and 3 for after we play. The 3 before we play stand for what we think the opponent will play. The 3 after we play are what the opponent actually played.

- $s_{r_1}$: we think opponent plays rock
- $s_{p_1}$: we think opponent plays paper
- $s_{s_1}$: we think opponent plays scissors
- $s_{r_2}$: opponent played rock
- $s_{p_2}$: opponent played paper
- $s_{s_2}$: opponent played scissors

#### Actions

The actions are the same as before. We have $a_r$, $a_p$ and $a_s$ for playing rock, paper and scissors respectively.

#### Transition Probabilities

The transition probabilities are non-zero only from $s_{r_1}, s_{p_1}, s_{s_1}$ to $s_{r_2}, s_{p_2}, s_{s_2}$. If our guess carried no information about the opponent's play, all the transition probabilities would be $\frac{1}{3}$.

For example, uniformly random transition probabilities from $s_{r_1}$ could be

$\begin{aligned}
P(s_{r_1}, s_{r_2}) &= \frac{1}{3} \\
P(s_{r_1}, s_{p_2}) &= \frac{1}{3} \\
P(s_{r_1}, s_{s_2}) &= \frac{1}{3} \\
\end{aligned}$

#### Rewards

The rewards are similar: we still get +1 for winning, 0 for drawing, and -1 for losing the game. However, in this case the reward depends on the starting state, the ending state, and the action taken.

We can illustrate the rewards in terms of a matrix. Suppose our action is $a_r$. Then,

$\begin{matrix}
& s_{r_2} & s_{p_2} & s_{s_2} \\
s_{r_1} & 0 & -1 & 1 \\
s_{p_1} & 0 & -1 & 1 \\
s_{s_1} & 0 & -1 & 1 \\
\end{matrix}$

Note that only the ending state and the action determine the reward. What we believed the opponent was going to play does not affect the reward.

#### Policy

Let $\theta \in \mathbb{R}^3$. We will denote $\theta = [\theta_1, \theta_2, \theta_3]$.

We assume that we start at state $s_{r_1}$. Then, the policy is given by

$\begin{aligned}
\pi_\theta(s_{r_1}, a_r) &= \frac{e^{\theta_1}}{\sum_{i=1}^3 e^{\theta_i}} \\
\pi_\theta(s_{r_1}, a_p) &= \frac{e^{\theta_2}}{\sum_{i=1}^3 e^{\theta_i}} \\
\pi_\theta(s_{r_1}, a_s) &= \frac{e^{\theta_3}}{\sum_{i=1}^3 e^{\theta_i}} \\
\end{aligned}$

Thus, the higher the value of $\theta_i$, the more likely we are to take the corresponding action.

#### Policy Gradient

Since we have 3 parameters in $\theta$, the gradient is a $3 \times 1$ vector.

$\nabla_\theta \pi_\theta(s_{r_1}, a_r) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_r)}{\partial \theta_3} \\ \end{bmatrix} = \begin{bmatrix} \frac{e^{\theta_1} (e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \end{bmatrix}$

$\nabla_\theta \pi_\theta(s_{r_1}, a_p) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_p)}{\partial \theta_3} \\ \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_2} (e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \end{bmatrix}$

$\nabla_\theta \pi_\theta(s_{r_1}, a_s) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r_1}, a_s)}{\partial \theta_3} \\ \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_3} (e^{\theta_1} + e^{\theta_2})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \end{bmatrix}$
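These entries are instances of the softmax derivative $\frac{\partial \pi_a}{\partial \theta_i} = \pi_a (\delta_{ai} - \pi_i)$, and we can sanity-check them against finite differences. A small sketch (our own code, with arbitrary parameter values):

```python
import math

def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]

def grad_pi(theta, a):
    """Analytic gradient of pi_a w.r.t. theta: pi_a * (delta_ai - pi_i)."""
    pi = softmax(theta)
    return [pi[a] * ((1.0 if a == i else 0.0) - pi[i]) for i in range(3)]

def fd_grad_pi(theta, a, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    g = []
    for i in range(3):
        tp, tm = list(theta), list(theta)
        tp[i] += eps
        tm[i] -= eps
        g.append((softmax(tp)[a] - softmax(tm)[a]) / (2 * eps))
    return g

theta = [0.5, -0.3, 0.1]
errors = [max(abs(x - y) for x, y in zip(grad_pi(theta, a), fd_grad_pi(theta, a)))
          for a in range(3)]
```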

### Optimality Under Random Choice

Suppose we do not know how the opponent plays and thus all transition probabilities are $\frac{1}{3}$.

Let us first look at the partial derivative of $J(\theta)$ with respect to $\theta_1$ instead of the whole $\nabla_\theta J(\theta)$.

$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s^\prime) R_a(s, s^\prime) \\ &= \frac{1}{3} \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} R_a(s, s^\prime) \\ &= \frac{1}{3} \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} \sum_{s^\prime \in S} R_a(s, s^\prime) \\ &= 0 \end{aligned}$

The policy depends only on the starting state and not on the state we transition to, while the reward depends on both the starting and ending states. For each action, over the three possible ending states we win once, lose once, and draw once, so the rewards sum to 0. Thus, any value of $\theta$ gives the same expected reward, since the gradient is 0 everywhere.
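This can be confirmed numerically: with uniform transitions, $J(\theta)$ evaluates to 0 for any parameter values. A sketch (our own encoding of the reward table, with rows and columns ordered rock, paper, scissors):

```python
import math

# R[a][s2]: reward for our action a when the opponent ends up playing s2.
R = [[0, -1, 1],    # we play rock:     draw, lose, win
     [1, 0, -1],    # we play paper:    win, draw, lose
     [-1, 1, 0]]    # we play scissors: lose, win, draw

def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]

def J(theta):
    """Expected reward under uniform transition probability 1/3."""
    pi = softmax(theta)
    return sum(pi[a] * (1 / 3) * R[a][s2] for a in range(3) for s2 in range(3))

values = [J([0.0, 0.0, 0.0]), J([2.0, -1.0, 0.5]), J([-3.0, 4.0, 1.0])]
```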

### Policy Improvement Under Scissors Only

Suppose that the opponent only plays scissors. Then, we should always play rock and thus, $\theta_1$ should be as large as possible and $\theta_2, \theta_3$ should be as small as possible.

$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \sum_{a \in A} \sum_{s^\prime \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s^\prime) R_a(s, s^\prime)\\ &= \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s_{s_2}) R_a(s, s_{s_2})\\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} R_{a_r}(s, s_{s_2}) + \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} R_{a_p}(s, s_{s_2})\\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1}\\ \end{aligned}$

The first step follows from the definition. The second step follows because the opponent plays only scissors, so every transition that does not end in scissors has probability 0. The third step drops the $a_s$ term: if both the opponent and the agent play scissors, it is a draw with reward 0, leaving only the rock and paper terms. The fourth step substitutes the reward of 1 for winning with rock and -1 for losing with paper.

Thus, the derivative for $\theta_1$ is

$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_1} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1}\\
&= \frac{e^{\theta_1} (e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \left( -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \right) \\
&= \frac{e^{\theta_1} (2e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ &> 0
\end{aligned}$

Next, for $\theta_2$, we have

$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_2} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_2} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_2}\\
&= -\frac{e^{\theta_1+\theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \frac{e^{\theta_2} (e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\
&< 0
\end{aligned}$

Finally, for $\theta_3$, we have

$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_3} &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_3} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_3}\\
&= -\frac{e^{\theta_1+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} + \frac{e^{\theta_2+\theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\
&= \frac{e^{\theta_3}(e^{\theta_2}-e^{\theta_1})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2}
\end{aligned}$

Once the updates have made $\theta_1 > \theta_2$, we have $e^{\theta_2} - e^{\theta_1} < 0$ and hence $\frac{\partial J(\theta)}{\partial \theta_3} < 0$.

Thus, $\theta_1$ increases while $\theta_2$ and $\theta_3$ decrease when updating $\theta$ along the policy gradient. The policy therefore favors playing rock over the other two alternatives.
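Running gradient ascent confirms this. The sketch below (our own code, with an arbitrary learning rate) uses the closed-form softmax gradient; with the opponent fixed on scissors, the reward is +1 for rock, -1 for paper, and 0 for scissors:

```python
import math

def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]

# Opponent always plays scissors: rewards for our actions rock, paper, scissors.
R = [1.0, -1.0, 0.0]

def grad_J(theta):
    # dJ/dtheta_i = sum_a pi_a * (delta_ai - pi_i) * R_a
    pi = softmax(theta)
    return [sum(pi[a] * ((1.0 if a == i else 0.0) - pi[i]) * R[a]
                for a in range(3)) for i in range(3)]

theta = [0.0, 0.0, 0.0]
alpha = 0.5  # arbitrary learning rate for the sketch
for _ in range(5000):
    g = grad_J(theta)
    theta = [t + alpha * gi for t, gi in zip(theta, g)]

pi = softmax(theta)  # the probability of rock should dominate
```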

## Generalization (2) - No specified start state

In the above theorem statements, we were required to specify a start state.

To avoid specifying a start state, we define a distribution $d(s)$ that gives the probability of starting at state $s$.

### Policy Gradient Theorem (2)

Let $M$ be an MDP (2). Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.

$J(\theta) = \mathbb{E}_{\pi_\theta} [r]$

Then,

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]$

### Proof

The random variable $r$ now covers the reward starting from any state $s \in S$, since no initial state is specified.

$\begin{aligned} J(\theta) &= \mathbb{E}_{\pi_\theta} [r] \\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \end{aligned}$

Now, taking the gradient and noting that the initial distribution, the transition probabilities, and the rewards do not depend on $\theta$, we have

$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \right)\\
&= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\
&= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s^\prime \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s^\prime) R_a(s, s^\prime) \\
&= \mathbb{E}_{\pi_\theta} \left[ \left( \nabla_\theta \log \pi_\theta(s,a) \right) r \right]
\end{aligned}$
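The generalized identity can also be checked numerically by comparing the analytic gradient (the expression in the second-to-last line, expanded over $d$, $P_a$, and $R_a$) against finite differences of $J$. A sketch with a made-up start distribution and transition matrix (all numbers are our own; $\theta \in \mathbb{R}^9$ holds 3 parameters per belief state):

```python
import math

# Made-up start distribution d(s) and transition matrix P[s][s2];
# R[a][s2] is the rock-paper-scissors reward table from the text.
d = [0.5, 0.3, 0.2]
P = [[0.6, 0.2, 0.2],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]
R = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def softmax(x):
    z = [math.exp(v) for v in x]
    total = sum(z)
    return [v / total for v in z]

def J(theta):
    """theta is a flat list of 9 parameters, 3 per belief state."""
    total = 0.0
    for s in range(3):
        pi = softmax(theta[3 * s:3 * s + 3])
        for a in range(3):
            for s2 in range(3):
                total += d[s] * pi[a] * P[s][s2] * R[a][s2]
    return total

def grad_J(theta):
    """Analytic gradient: d(s) * pi_a * (delta_ai - pi_i) * sum_s' P * R."""
    g = [0.0] * 9
    for s in range(3):
        pi = softmax(theta[3 * s:3 * s + 3])
        for a in range(3):
            er = sum(P[s][s2] * R[a][s2] for s2 in range(3))
            for i in range(3):
                g[3 * s + i] += d[s] * pi[a] * ((1.0 if a == i else 0.0) - pi[i]) * er
    return g

def fd_grad(theta, eps=1e-6):
    """Central finite differences of J for comparison."""
    g = []
    for i in range(9):
        tp, tm = list(theta), list(theta)
        tp[i] += eps
        tm[i] -= eps
        g.append((J(tp) - J(tm)) / (2 * eps))
    return g

theta = [0.2, -0.5, 0.1, 0.7, 0.0, -0.3, 0.4, -0.1, 0.25]
max_err = max(abs(a - b) for a, b in zip(grad_J(theta), fd_grad(theta)))
```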

## Python Code Example

https://github.com/mochan-b/policy-gradient-rock-paper-scissors

Using $\theta \in \mathbb{R}^9$, we have a parameter for every action and initial state, so the example can start from any state (which signifies only our belief about what the opponent will play). After the action is taken, it transitions to the state corresponding to the opponent's play. We show the improvement of the $\theta$ parameters using policy gradients and policy improvement.

## Conclusion

This is a very simple version of the policy gradient theorem where the episodes are only a single step.

The example of rock, paper, scissors illustrates how we can use the policy gradient to improve the policy. The Python example gives the full implementation and the full policy improvement.

Policy gradient algorithms are best suited to continuous state spaces. Here our states are discrete, and this could have been done with policy iteration and dynamic programming; but we use policy gradients to illustrate the technique.