to see the policy gradient theorem in use.
Every episode is just a single step: the episode starts, we take an action, and the episode ends with a reward for that step.
We present a simplified version of the policy gradient theorem for this one-step episode setting, with simplified proofs.
We then present a simplified example of rock, paper, scissors that lasts only a single step.
A Markov decision process $M$ is a tuple
$$M = (S, A, R_a)$$
where
- $S$ is the finite set of states
- $A$ is the finite set of actions, and $A_s$ is the set of actions available from state $s \in S$
- $R_a(s)$ is the reward for taking action $a$ from state $s$
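As a concrete sketch, such a tuple can be written down directly as plain Python data (the states, actions, and rewards below are made-up placeholders, not part of the definition):

```python
# A minimal sketch of the tuple M = (S, A, R_a) as plain Python data.
# The concrete states, actions, and rewards are hypothetical placeholders.
S = ["s0", "s1"]                  # finite set of states
A = {"s0": ["a0", "a1"],          # A_s: actions available in each state
     "s1": ["a0"]}
R = {("s0", "a0"): 1.0,           # R_a(s): reward for taking action a in state s
     ("s0", "a1"): -1.0,
     ("s1", "a0"): 0.0}
```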
Suppose that $M$ is a one-step MDP. The episode starts at state $s \in S$ and ends after one step with reward $r$.
Given a state $s$, the policy $\pi(s, a) \in [0, 1]$ gives us the probability that we will take action $a$ from state $s$.
Let $\theta \in \mathbb{R}^n$ be a set of parameters that determine our policy. We denote such a policy by $\pi_\theta$.
Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r]$$
Suppose we start at state $s$. Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$$
- The notation $\mathbb{E}_{\pi_\theta}[r]$ differs from the standard way of writing expectations because of the subscript $\pi_\theta$ on the expectation. Usually, we have a random variable $X$ and write $\mathbb{E}[X] = \sum_{i=1}^{k} x_i p_i$. In our case, $r$ denotes the random variable of the reward for the action taken. We could instead have used $r_{\pi_\theta}$ as the random variable, where the probability of each action is given by $\pi_\theta$, and written the expectation of the reward for acting according to $\pi_\theta$ as $\mathbb{E}[r_{\pi_\theta}]$.
- The notation $\mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$ contains the term $\nabla_\theta \log \pi_\theta(s, a)$, which is not written as a random variable. $\mathbb{E}_{\pi_\theta}$ is the expectation over all actions that can be taken. The term $\nabla_\theta \log \pi_\theta(s, a)$ in this context should be interpreted as a random variable with the state $s$ given and $a$ ranging over all possible actions. We are slightly abusing notation by using the same symbols for the function and the random variable.
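To make the notation concrete, here is a small sketch that computes $\mathbb{E}_{\pi_\theta}[r]$ both as the weighted sum over actions and as a sample average of rewards for actions drawn from the policy (the probabilities and rewards are hypothetical numbers):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])   # pi_theta(s, a) for a fixed state s (assumed values)
R = np.array([1.0, 0.0, -1.0])   # R_a(s) for each action a (assumed values)

# E_{pi_theta}[r] written out as the weighted sum over actions
exact = np.sum(pi * R)

# The same expectation estimated by sampling actions according to pi_theta
rng = np.random.default_rng(0)
actions = rng.choice(len(pi), size=100_000, p=pi)
estimate = R[actions].mean()

print(exact, estimate)           # the two values should be close
```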
By our definition of the expected reward, we have
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{a \in A} \pi_\theta(s, a) R_a(s)$$
Taking the gradient, we have
$$\nabla_\theta J(\theta) = \nabla_\theta \left[ \sum_{a \in A} \pi_\theta(s, a) R_a(s) \right] = \sum_{a \in A} \nabla_\theta \pi_\theta(s, a) R_a(s)$$
The reward function does not depend on the parameter $\theta$, so we do not take its gradient. The reward we receive is determined by the action, which is determined by $\theta$, but the reward function itself is independent of $\theta$.
The gradient of the policy can be written in terms of a log using the identity $\nabla \log f = \frac{\nabla f}{f}$ as follows:
$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$$
Using the log gradient substitution, we have
$$\nabla_\theta J(\theta) = \sum_{a \in A} \nabla_\theta \pi_\theta(s, a) R_a(s) = \sum_{a \in A} \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) R_a(s) = \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$$
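As a quick numerical sanity check of this identity, the sketch below compares the sum $\sum_{a} \nabla_\theta \pi_\theta(s, a) R_a(s)$ with a sampled estimate of $\mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$. The softmax parameterization and all the numbers are assumptions made just for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 0.0, -1.0])       # R_a(s) for three actions (assumed values)
theta = np.array([0.1, -0.3, 0.2])   # parameters of an assumed softmax policy

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Left-hand side: sum_a grad pi_theta(s, a) R_a(s), via the softmax Jacobian
p = pi(theta)
jacobian = np.diag(p) - np.outer(p, p)   # entry [a, i] = d pi_theta(s, a) / d theta_i
grad_exact = jacobian.T @ R

# Right-hand side: E_{pi_theta}[(grad log pi_theta(s, a)) r], estimated by sampling actions
actions = rng.choice(3, size=200_000, p=p)
grad_log_pi = np.eye(3)[actions] - p     # grad_theta log pi_theta(s, a) for each sample
grad_mc = (grad_log_pi * R[actions, None]).mean(axis=0)

print(grad_exact, grad_mc)               # the two gradients should be close
```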
As an example of a one-step MDP, we will use the game of rock, paper, scissors. We play the game for one step and we either win, lose, or draw.
For the first example of the game, we will assume that
- we know what the opponent is going to play, and
- the opponent will play rock (since we specify a starting state in our theorem; we will generalize to any starting state below).
We have 3 states for what we know the opponent will play.
- $s_r$: opponent plays rock
- $s_p$: opponent plays paper
- $s_s$: opponent plays scissors
We either play rock, paper, or scissors.
- $a_r$: play rock
- $a_p$: play paper
- $a_s$: play scissors
We get +1 points if we win the game, 0 if we draw the game, and -1 if we lose the game.
Let $\theta \in \mathbb{R}$ be the parameter for our policy. We define $\pi_\theta$ as follows:
$$\begin{aligned} \pi_\theta(s_r, a_p) &= \sigma(\theta) \\ \pi_\theta(s_r, a_r) &= \frac{1 - \sigma(\theta)}{2} \\ \pi_\theta(s_r, a_s) &= \frac{1 - \sigma(\theta)}{2} \end{aligned}$$
where $\sigma(x)$ is the sigmoid function given by
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid function maps $(-\infty, \infty)$ to $(0, 1)$. The above policy is therefore a valid probability distribution, as the sum over all actions is 1.
When $\theta \to -\infty$, $\pi_\theta(s_r, a_p) \to 0$, and when $\theta \to \infty$, $\pi_\theta(s_r, a_p) \to 1$. Thus, the smaller the value of $\theta$, the smaller the probability that we choose the action $a_p$ for paper.
Then, we have reward 1 for playing paper and -1 for playing scissors,
$$\nabla_\theta J(\theta) = \sum_{a \in A} \nabla_\theta \pi_\theta(s_r, a) R_a(s_r) = 1 \cdot \nabla_\theta \pi_\theta(s_r, a_p) + (-1) \cdot \nabla_\theta \pi_\theta(s_r, a_s) = \sigma'(\theta) + \frac{\sigma'(\theta)}{2} = \frac{3}{2} \sigma'(\theta)$$
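As a small sanity check of this derivative, we can compare it against a finite-difference approximation of $J(\theta)$ (the test value $\theta = 0.7$ is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def J(theta):
    # Expected reward from s_r: +1 for paper, 0 for rock, -1 for scissors
    p_paper = sigmoid(theta)
    p_other = (1.0 - sigmoid(theta)) / 2.0   # probability of rock and of scissors
    return 1.0 * p_paper + 0.0 * p_other + (-1.0) * p_other

theta, eps = 0.7, 1e-6
grad_fd = (J(theta + eps) - J(theta - eps)) / (2 * eps)          # numerical gradient
grad_analytic = 1.5 * sigmoid(theta) * (1.0 - sigmoid(theta))    # (3/2) * sigma'(theta)
print(grad_fd, grad_analytic)                                    # should agree closely
```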
We do the parameter update using the rule
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha \in \mathbb{R}$ is the learning rate.
Since $\sigma' > 0$, $\theta$ always increases. Thus, $\pi_\theta(s_r, a_p)$ goes towards 1 and the probabilities of the other actions go towards 0. In other words, the policy gradient learns to play paper against the rock that the opponent plays.
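A minimal sketch of this update loop, using the analytic gradient $\frac{3}{2}\sigma'(\theta)$ derived above (the initial $\theta$, the learning rate, and the number of steps are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 0.0   # arbitrary starting parameter
alpha = 0.5   # learning rate (illustrative choice)

for _ in range(200):
    grad = 1.5 * sigmoid(theta) * (1.0 - sigmoid(theta))   # grad J(theta) = (3/2) sigma'(theta)
    theta = theta + alpha * grad                           # gradient ascent update

# sigmoid(theta) is pi_theta(s_r, a_p); after the updates it is close to 1
print(theta, sigmoid(theta))
```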
We can generalize the MDP to have a transition probability.
In the first definition of the MDP, if we are in state $s$ and take action $a$, we transition to a state $s'$ and obtain a single reward for the action.
The generalization is that taking an action $a$ now gives us a probability of transitioning to each of the states in $S$. Before, an action from state $s$ would take us to one specific state $s'$; now, there is a probability of transitioning to every state in $S$. If the probability of transitioning to state $s'$ is 1 and to all other states is 0, it is effectively the same as without the generalization.
For the reward, if we are in state $s$ and transition to state $s'$ by taking action $a$, the reward is given by $R_a(s, s')$.
A Markov decision process (2) $M$ is a tuple
$$M = (S, A, P_a, R_a)$$
where
- $S$ is the finite set of states
- $A$ is the finite set of actions, and $A_s$ is the set of actions available from state $s \in S$
- $P_a(s, s')$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$
- $R_a(s, s')$ is the reward for transitioning from state $s$ to $s'$ by taking action $a$
Let $M$ be an MDP(2) as defined above. Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r]$$
Suppose we start at state $s$. Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$$
The definition of the MDP has changed: when we take an action $a$, we have a transition probability $P_a(s, s')$ of going from state $s$ to state $s'$.
Thus, we have to sum over all subsequent states that we might go to after an action is taken.
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{a \in A} \sum_{s' \in S} \pi_\theta(s, a) P_a(s, s') R_a(s, s')$$
Now, taking the gradient and noting that the transition probability and the reward do not depend on $\theta$, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{a \in A} \sum_{s' \in S} \pi_\theta(s, a) P_a(s, s') R_a(s, s') \right) \\ &= \sum_{a \in A} \sum_{s' \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s') R_a(s, s') \\ &= \sum_{a \in A} \sum_{s' \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s') R_a(s, s') \\ &= \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right] \end{aligned}$$
In the previous example of rock, paper, scissors, we knew what the opponent was going to play. With the new formulation of the MDP, we now divide the state into what we believe the opponent will play and what the opponent actually plays after we perform our action.
We have 6 states, 3 for before we play and 3 for after we play. The 3 states before we play stand for what we think the opponent will play. The 3 states after we play are what the opponent actually played.
- $s_{r1}$: we think the opponent plays rock
- $s_{p1}$: we think the opponent plays paper
- $s_{s1}$: we think the opponent plays scissors
- $s_{r2}$: opponent played rock
- $s_{p2}$: opponent played paper
- $s_{s2}$: opponent played scissors
The actions are the same as before. We have $a_r$, $a_p$, and $a_s$ for playing rock, paper, and scissors respectively.
The transition probability is non-zero from $s_{r1}, s_{p1}, s_{s1}$ to $s_{r2}, s_{p2}, s_{s2}$. If our guess was random, then all the transition probabilities would be $\frac{1}{3}$.
For example, a random transition probability could be
$$P(s_{r1}, s_{r2}) = \frac{1}{3}, \quad P(s_{r1}, s_{p2}) = \frac{1}{3}, \quad P(s_{r1}, s_{s2}) = \frac{1}{3}$$
The reward is similar. We still get +1 for winning, 0 for drawing, and -1 for losing the game. However, in this case, the reward depends on the starting state, the ending state, and the action taken.
We can illustrate the rewards in terms of a matrix. Suppose our action was $a_r$. Then,
$$\begin{array}{c|ccc} & s_{r2} & s_{p2} & s_{s2} \\ \hline s_{r1} & 0 & -1 & 1 \\ s_{p1} & 0 & -1 & 1 \\ s_{s1} & 0 & -1 & 1 \end{array}$$
Note that only the action and the ending state determine the reward. What we believe the opponent is going to play does not affect the reward, and with equally random transition probabilities it does not affect the expected reward either.
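In code, these uniform transition probabilities and the reward table for playing rock can be written down directly (a small sketch; the array layout is our own choice):

```python
import numpy as np

# Transition probabilities P(s, s') from the belief states (rows: s_r1, s_p1, s_s1)
# to the opponent's actual play (columns: s_r2, s_p2, s_s2), here uniformly random.
P = np.ones((3, 3)) / 3

# Rewards R_a(s, s') for action a_r (playing rock): draw against rock, lose to paper,
# win against scissors, regardless of which belief state we started in.
R_rock = np.array([[0.0, -1.0, 1.0],
                   [0.0, -1.0, 1.0],
                   [0.0, -1.0, 1.0]])
```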
Let $\theta \in \mathbb{R}^3$. We will denote $\theta = [\theta_1, \theta_2, \theta_3]$.
We assume that we start at state $s_{r1}$. Then, the policy is given by
$$\begin{aligned} \pi_\theta(s_{r1}, a_r) &= \frac{e^{\theta_1}}{\sum_{i=1}^{3} e^{\theta_i}} \\ \pi_\theta(s_{r1}, a_p) &= \frac{e^{\theta_2}}{\sum_{i=1}^{3} e^{\theta_i}} \\ \pi_\theta(s_{r1}, a_s) &= \frac{e^{\theta_3}}{\sum_{i=1}^{3} e^{\theta_i}} \end{aligned}$$
Thus, the higher the value of $\theta_i$, the more likely we are to take the corresponding action.
Since we have 3 parameters in $\theta$, the gradient is a $3 \times 1$ vector.
$$\nabla_\theta \pi_\theta(s_{r1}, a_r) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r1}, a_r)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r1}, a_r)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r1}, a_r)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} \frac{e^{\theta_1}(e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1 + \theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_1 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$
$$\nabla_\theta \pi_\theta(s_{r1}, a_p) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r1}, a_p)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r1}, a_p)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r1}, a_p)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1 + \theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_2}(e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$
$$\nabla_\theta \pi_\theta(s_{r1}, a_s) = \begin{bmatrix} \frac{\partial \pi_\theta(s_{r1}, a_s)}{\partial \theta_1} \\ \frac{\partial \pi_\theta(s_{r1}, a_s)}{\partial \theta_2} \\ \frac{\partial \pi_\theta(s_{r1}, a_s)}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} -\frac{e^{\theta_1 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ -\frac{e^{\theta_2 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \\ \frac{e^{\theta_3}(e^{\theta_1} + e^{\theta_2})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \end{bmatrix}$$
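These entries are easy to get wrong by hand, so here is a small finite-difference check of $\nabla_\theta \pi_\theta(s_{r1}, a_r)$ against the expression above (the test values of $\theta$ are arbitrary):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.3, -0.2, 0.5])            # arbitrary test parameters
p = softmax(theta)

# Analytic gradient of pi_theta(s_r1, a_r) = e^{theta_1} / sum_i e^{theta_i},
# i.e. the vector computed by hand above, written in terms of the probabilities.
grad_analytic = np.array([p[0] * (1 - p[0]), -p[0] * p[1], -p[0] * p[2]])

# Finite-difference approximation of the same gradient
eps = 1e-6
grad_fd = np.zeros(3)
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    grad_fd[i] = (softmax(theta + d)[0] - softmax(theta - d)[0]) / (2 * eps)

print(grad_analytic, grad_fd)                 # the two vectors should agree closely
```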
Optimality Under Random Choice
Suppose we do not know how the opponent plays, and thus all transition probabilities are $\frac{1}{3}$.
Let us first look at the partial derivative of $J(\theta)$ with respect to $\theta_1$ instead of the whole gradient $\nabla_\theta J(\theta)$.
$$\frac{\partial J(\theta)}{\partial \theta_1} = \sum_{a \in A} \sum_{s' \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s') R_a(s, s') = \frac{1}{3} \sum_{a \in A} \sum_{s' \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} R_a(s, s') = \frac{1}{3} \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} \sum_{s' \in S} R_a(s, s') = 0$$
The policy depends only on the starting state and not on the state we transition to, while the reward depends on both the starting state and the ending state. Since for each action we win against one ending state and lose against one, the sum of the rewards over ending states is 0. Thus, we can choose any value of $\theta$ and we will have the same expected reward, since the gradient is 0 everywhere.
Suppose that the opponent only plays scissors. Then, we should always play rock, and thus $\theta_1$ should be as large as possible and $\theta_2, \theta_3$ should be as small as possible.
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_1} &= \sum_{a \in A} \sum_{s' \in S} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s') R_a(s, s') \\ &= \sum_{a \in A} \frac{\partial \pi_\theta(s, a)}{\partial \theta_1} P_a(s, s_{s2}) R_a(s, s_{s2}) \\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} R_{a_r}(s, s_{s2}) + \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} R_{a_p}(s, s_{s2}) \\ &= \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} \end{aligned}$$
The first step follows from the definition. The second step follows because the opponent only plays scissors, so all transitions that do not end in scissors have probability 0. The third step follows from the fact that $P_a(s, s_{s2}) = 1$ and that playing rock or paper results in a positive or negative reward respectively, while playing scissors against the opponent's scissors is a draw and thus has reward 0. The fourth step just substitutes the reward of 1 for winning and -1 for losing.
Thus, the derivative with respect to $\theta_1$ is
$$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_1} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_1} = \frac{e^{\theta_1}(e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \left( -\frac{e^{\theta_1 + \theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} \right) = \frac{e^{\theta_1}(2e^{\theta_2} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} > 0$$
Next, for $\theta_2$, we have
$$\frac{\partial J(\theta)}{\partial \theta_2} = \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_2} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_2} = -\frac{e^{\theta_1 + \theta_2}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} - \frac{e^{\theta_2}(e^{\theta_1} + e^{\theta_3})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} < 0$$
Finally, for $\theta_3$, we have
$$\frac{\partial J(\theta)}{\partial \theta_3} = \frac{\partial \pi_\theta(s, a_r)}{\partial \theta_3} - \frac{\partial \pi_\theta(s, a_p)}{\partial \theta_3} = -\frac{e^{\theta_1 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} + \frac{e^{\theta_2 + \theta_3}}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2} = \frac{e^{\theta_3}(e^{\theta_2} - e^{\theta_1})}{(e^{\theta_1} + e^{\theta_2} + e^{\theta_3})^2}$$
Since $\theta_1$ increases and $\theta_2$ decreases under the updates, we have $\theta_1 > \theta_2$ after the first update, and hence $\frac{\partial J(\theta)}{\partial \theta_3} < 0$.
Thus, $\theta_1$ increases and $\theta_2, \theta_3$ decrease when updating $\theta$ along the policy gradient, and the policy favors playing rock over the other two alternatives.
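A minimal sketch of the corresponding update loop, computing the exact gradient from the sums above for an opponent who only plays scissors (the learning rate and step count are arbitrary choices):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Opponent only plays scissors; the reward R[a, s'] depends only on our action a
# (rows: rock, paper, scissors) and the opponent's play s' (columns in the same order).
P = np.array([0.0, 0.0, 1.0])
R = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

theta = np.zeros(3)      # start from a uniform policy
alpha = 0.5              # learning rate (illustrative choice)

for _ in range(500):
    p = softmax(theta)
    jacobian = np.diag(p) - np.outer(p, p)     # entry [a, i] = d pi(s_r1, a) / d theta_i
    q = R @ P                                  # q[a] = sum_{s'} P(s, s') R_a(s, s')
    theta = theta + alpha * (jacobian.T @ q)   # gradient ascent on J(theta)

# theta_1 has grown and theta_2, theta_3 have shrunk; pi_theta(s_r1, a_r) is close to 1
print(theta, softmax(theta))
```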
In the above theorem statements, we have been required to specify a start state.
In order not to have to specify a start state, we define a distribution $d(s)$ that represents the probability of starting at state $s$.
Let $M$ be an MDP(2). Let $r$ be the random variable of the reward we get from taking an action. Let $J(\theta)$ be the expected reward from following policy $\pi_\theta$ with parameters $\theta$.
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r]$$
Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$$
The random variable $r$ now covers the reward starting from any state $s \in S$; the initial state is not specified but drawn from $d(s)$.
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s' \in S} \pi_\theta(s, a) P_a(s, s') R_a(s, s')$$
Now, taking the gradient and noting that the initial distribution, the transition probabilities, and the rewards do not depend on $\theta$, we have
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s' \in S} \pi_\theta(s, a) P_a(s, s') R_a(s, s') \right) \\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s' \in S} \nabla_\theta \pi_\theta(s, a) P_a(s, s') R_a(s, s') \\ &= \sum_{s \in S} d(s) \sum_{a \in A} \sum_{s' \in S} \left[ \nabla_\theta \log \pi_\theta(s, a) \right] \pi_\theta(s, a) P_a(s, s') R_a(s, s') \\ &= \mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right] \end{aligned}$$
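As a rough numerical check of this general statement, the sketch below compares the exact gradient from the sums above with a sampled estimate of $\mathbb{E}_{\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(s, a)\right) r\right]$. Everything in it is an illustrative assumption: a uniform start distribution $d(s)$, an opponent who plays scissors 80% of the time, and a softmax policy with one row of parameters per belief state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: uniform start distribution d(s) over the belief states,
# an opponent who plays scissors 80% of the time regardless of our action, and the
# usual win/draw/lose reward that depends only on our action and the opponent's play.
d = np.ones(3) / 3
P = np.tile([0.1, 0.1, 0.8], (3, 1))          # P[s, s']: opponent's play distribution
R = np.array([[ 0.0, -1.0,  1.0],             # R[a, s']: rows = rock, paper, scissors
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])
theta = rng.normal(size=(3, 3))               # one row of softmax logits per belief state

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)   # pi[s, a]

pi = policy(theta)
q = P @ R.T                                   # q[s, a] = sum_{s'} P[s, s'] R[a, s']

# Exact gradient: sum_s d(s) sum_a grad pi_theta(s, a) q[s, a]
grad_exact = np.zeros((3, 3))
for s in range(3):
    jac = np.diag(pi[s]) - np.outer(pi[s], pi[s])   # d pi(s, a) / d theta[s, i]
    grad_exact[s] = d[s] * (jac.T @ q[s])

# Monte-Carlo estimate: average of (grad log pi_theta(s, a)) * r over sampled episodes
n = 100_000
grad_mc = np.zeros((3, 3))
for _ in range(n):
    s = rng.choice(3, p=d)                    # sample a start state from d(s)
    a = rng.choice(3, p=pi[s])                # sample an action from the policy
    s_next = rng.choice(3, p=P[s])            # sample the opponent's play
    r = R[a, s_next]
    score = np.zeros((3, 3))
    score[s] = np.eye(3)[a] - pi[s]           # grad_theta log pi_theta(s, a)
    grad_mc += score * r
grad_mc /= n

print(np.round(grad_exact, 3))
print(np.round(grad_mc, 3))                   # close to the exact gradient, up to sampling noise
```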
https://github.com/mochan-b/policy-gradient-rock-paper-scissors
Using $\theta \in \mathbb{R}^9$, we have a probability for every action and initial state, so the example can start from any state (which only signifies our belief about what the opponent will play). It then transitions to the state corresponding to the opponent's play after the action is taken. We show the improvement of the $\theta$ parameters using policy gradients and policy improvement.
This is a very simple version of the policy gradient theorem where the episodes are only a single step.
The example of rock, paper, scissors illustrates how we can use the policy gradient to improve the policy. The Python code in the linked repository gives the full example and the full policy improvement.
Policy gradient algorithms are best when the states are continuous. Here, our states are discrete, and this could have been done with policy iteration and dynamic programming, but we use policy gradients to illustrate the technique.