In this article we will explore the estimation methods MLE and MAP. From the data we have, we would like to estimate a model that best explains the data. MLE and MAP describe techniques to find the parameters of a distribution given the data. We start with the probability definitions that build the foundation for MLE and MAP, show their mathematical definitions, and then give examples and code samples for them. This article aims to give a good foundation for the workings of MLE and MAP.

Introduction

Let us first recap a few definitions from probability that we will use to build up to the definitions of MLE/MAP. Since the discrete and continuous cases differ, we have to give slightly different definitions for each case.

We will define the probability space, which leads to random variables and distributions. Conditional probability and Bayes theorem give us the ability to define the posterior distribution. The posterior distribution then gives us all the tools to explore MLE/MAP.

I have not made the effort to make the definitions mathematically rigorous or to follow a known set of definitions. These are just here to give a rough idea of what the terms mean and to get us to MLE/MAP without a lot of hand-wavy explanations.

Probability

Probability, in the common understanding, takes a set of possible outcomes and gives a number between 0 and 1 for how likely each is to occur. A fair coin toss has probability \frac{1}{2} of heads and probability \frac{1}{2} of tails. A fair die has probability \frac{1}{6} of each face coming up, and the probability of rolling at least a 3 is \frac{2}{3}.

Probability Space

To represent this mathematically, we take the triplet (\Omega, \mathcal{F}, P) to define a probability space that gives probabilities to possible outcomes. The three items are defined as follows:

  • \Omega: Sample space, the set of all possible outcomes. For the fair coin, \Omega = \{H, T\} where H represents the coin landing heads up and T represents tails. For a fair die, \Omega = \{1, 2, 3, 4, 5, 6\}. For the income of a person, \Omega = [\$0, \infty) and has an infinite number of elements.
  • \mathcal{F}: Event space, a set of subsets of \Omega. If \Omega is finite, \mathcal{F} is the power-set of \Omega, that is, the set of all subsets of \Omega. Since taking the power-set of an infinite set can lead to problems, we cannot define \mathcal{F} as the power-set of the infinite set \Omega. Instead, we define it as a set of subsets of \Omega on which the following function P works. For the fair coin, \mathcal{F} = \{ \{\}, \{H\}, \{T\}, \{H, T\} \}
  • P: Probability function that takes an event and assigns to it a real number that we call its probability. Thus, P : \mathcal{F} \rightarrow \mathbb{R}. For the fair coin, P(\{\}) = 0, P(\{H\}) = \frac{1}{2} and P(\{H, T\}) = 1.

Random Variable

Let X be a random variable. Then, X is defined as a function from a space of possibilities \Omega to another space E.

X : \Omega \rightarrow E

For example, the random variable of a fair coin toss has the space of possibilities heads or tails, \{H, T\}, as above. We could choose to represent tails as 0 and heads as 1. Then, X would be the random variable of the number of heads during a single coin toss. Then, X(T) = 0 and X(H) = 1.

A common notation used is P(X = 1). This is a short-hand: using X with a probability space, we find all elements \omega \in \Omega such that X(\omega) = 1 and take the probability of that set.

P(X = 1) = P(\{ \omega \in \Omega : X(\omega) = 1 \})

Another commonly used, equivalent notation is P_X(x):

P_X(x) = P(X = x) = P(\{ \omega \in \Omega : X(\omega) = x \})

The random variable of the number of heads from 3 coin tosses of a fair coin can be expressed as
X : \Omega^3 \rightarrow \mathbb{R}
and, as an example, X((H, H, T)) = 2

Probability Distribution Function

The probability distribution of a random variable is the function P^\prime : E \rightarrow \mathbb{R}, where E is the output set of the random variable, and P^\prime is naturally induced from X. So, instead of taking values from \mathcal{F} to a real number, we take them from the output set E to a real number. For example, the distribution of the number of heads after 3 coin tosses takes \{0, 1, 2, 3\} \rightarrow \mathbb{R}.

For example, consider the probability of 2 heads and 1 tail in 3 coin tosses with P(H) = \frac{1}{2} and P(T) = \frac{1}{2}. The probability is given by the binomial distribution
P(X=2) = \binom{3}{2} \cdot \left(\frac{1}{2}\right)^2 \cdot \left(\frac{1}{2}\right)^1 = 3 \cdot \left(\frac{1}{2}\right)^3 = \frac{3}{8}
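
As a quick sanity check, here is a minimal Python sketch (assuming SciPy is available) that evaluates the same binomial probability:

```python
# Probability of exactly 2 heads in 3 tosses of a fair coin.
from scipy.stats import binom

p = binom.pmf(k=2, n=3, p=0.5)
print(p)  # 0.375 == 3/8
```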

Here, we are dealing with data, which are samples coming from an experiment. We can think of the probability distribution in a different way: given a sample, the probability distribution function gives the probability of observing that sample under the distribution.

Inference

Estimation is the problem where we have data and need to find the distribution of the data. MLE and MAP are different methods for doing this. Estimation becomes possible once we have the likelihood, but before defining it we first review conditional probability and Bayes theorem.

Conditional Probability

The conditional probability P(A|B) is conceptually the probability of A given that B has already occurred. It is defined as

P(A|B) = \frac{P(A \cap B)}{P(B)}
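
For example, for a fair die, let A be the event of rolling a 6 and B the event of rolling an even number. Then P(A \cap B) = \frac{1}{6}, P(B) = \frac{1}{2}, and so P(A|B) = \frac{1/6}{1/2} = \frac{1}{3}.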

To define the conditional probability for random variables X and Y, we first define the joint probability distribution f of X and Y.

Before that, we assume that the domains of both random variables are the same probability space \Omega.

Then, the joint probability distribution is defined as
f_{X,Y}(x,y) = f(X = x \text{ and } Y = y)
where f is induced from the probability function P of the common probability space. From our notation above, we can write this as

f_{X,Y}(x,y) = P(\{ \omega \in \Omega : X(\omega) = x \} \cap \{ \omega \in \Omega : Y(\omega) = y \})

We also write f(X = x) for the marginal distribution of X, obtained by integrating over all values of Y:

f(X = x) = \int_Y f_{X|Y}(x \mid y) \, g(y) \, dy

where g is the distribution in Y only:
g(y) = P(\{ \omega \in \Omega : Y(\omega) = y \})

Then, the conditional probability for random variables X and Y is defined as

f(Y = y \mid X = x) = \frac{f(X = x \text{ and } Y = y)}{f(X = x)} = \frac{f_{X,Y}(x,y)}{f_X(x)}

Another common notation for conditional probability is f_{Y|X}(y|x), and thus we can write

f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

If the random variables come from different probability spaces \Omega_1 and \Omega_2, we can take \Omega = \Omega_1 \times \Omega_2. Then, we can define the joint probability space between any two random variables.

Bayes Theorem

This leads to Bayes theorem,

P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}

By substituting the definition of conditional probability, we can prove the above statement.

\frac{P(A \cap B)}{P(A)} = P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} = \frac{\frac{P(A \cap B)}{P(B)} \cdot P(B)}{P(A)} = \frac{P(A \cap B)}{P(A)}

Bayes theorem for random variables is similar and stated as,

f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y) \, f_Y(y)}{f_X(x)}

and can be proved using similar substitution as above,

\frac{f_{X,Y}(x,y)}{f_X(x)} = f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y) \, f_Y(y)}{f_X(x)} = \frac{\frac{f_{X,Y}(x,y)}{f_Y(y)} \, f_Y(y)}{f_X(x)} = \frac{f_{X,Y}(x,y)}{f_X(x)}

Likelihood

Given a set of observations, we want to estimate the distribution that best fits the data.

Instead of considering all possible distributions, we will consider the simpler problem of looking only at a single family of distributions. For example, we can consider the length of a task to come from a normal distribution. For a neural network, we can fix the architecture and connections and only worry about the weights.

By considering a family of distributions, a distribution can be represented by a random variable of the parameters of the distribution. For example, for the normal family, we consider the random variable of the normal distribution's parameters, the mean and the standard deviation. Let \Phi be the set of distributions in the family; then our random variable is \Theta : \Phi \rightarrow \mathbb{R}^2. For neural networks, if there are d weights in the network, \Theta : \Phi \rightarrow \mathbb{R}^d.

Suppose we are using n observations for inference. Let \Omega be the set of possible observations; thus, our observations come from \Omega^n. Let X be our random variable and \mathbf{x} = (x_1, x_2, \ldots, x_n) be our set of observations. Using \Theta as the random variable describing the parameters of our distribution, we want to find the function that gives us the parameter \theta of the distribution from the data. The function would be something like \mathcal{L}_{\Theta|X}(\theta \mid \mathbf{x}), which gives us the probability of a distribution parameter given the data.

However, the function \mathcal{L}_{\Theta|X}(\theta \mid \mathbf{x}) does not make sense as a conditional probability for random variables, because we required the domains of \Theta and X to be the same in our definition. To fix this, we consider the domain \Omega^n \times \Phi, and our random variables become \tilde{X} : \Omega^n \times \Phi \rightarrow \mathbb{R}^n and \tilde{\Theta} : \Omega^n \times \Phi \rightarrow \mathbb{R}^d. Then, given that our observation probability space is (\Omega^n, \mathcal{F}, P), we define \mathcal{L}(\tilde{X} = \mathbf{x} \times \theta) = P(X = \mathbf{x}). If the probability space for the distributions is (\Phi, \mathcal{G}, Q), then we define \mathcal{L}(\tilde{\Theta} = \mathbf{x} \times \theta) = Q(\Theta = \theta). Thus, we can define the likelihood function as
\mathcal{L}_{\tilde{\Theta}|\tilde{X}}(\theta \mid \mathbf{x})

The likelihood function gives the probability that our given set of observations \mathbf{x} came from a distribution with parameters \theta.

By using Bayes theorem for random variables, using f for the joint probability distribution and discarding the tilde in our random variable names,
\mathcal{L}_{\Theta|X}(\theta \mid \mathbf{x}) = f_{X|\Theta}(\mathbf{x} \mid \theta) \, \frac{f_\Theta(\theta)}{f_X(\mathbf{x})}

f_\Theta(\theta) is called the prior distribution and is the distribution of our parameters. If we have no idea what it could be, this is just a uniform distribution and a constant value. In cases where we have a prior distribution for \Theta, it is not constant.

The term f_X(\mathbf{x}) = \int_\Phi f(\mathbf{x} \mid v) \, g(v) \, dv, where g is the distribution of \Theta. Because we integrate over all values of \Theta, this term does not depend on \theta, and since our observation data is fixed, we can think of it as a constant.

Note that f_{X|\Theta}(\mathbf{x} \mid \theta) is the probability of the random variable X given that the distribution has parameters \theta. The domain of X is all n of our observations, not just one. In inference, we would get a single observation x and return P(x \mid \theta). Here, f is a different function that takes n observations \mathbf{x} and provides a probability for them.

When \theta is the probability of the coin toss landing heads, X is the number of heads observed and f is the binomial distribution of x heads out of n tosses.

When \theta holds the parameters of the normal distribution and the observations are independent, f is simply the product of n normal densities

f(\mathbf{x} \mid \theta = \mu, \sigma^2) = \prod_{i=1}^n f(x_i \mid \mu, \sigma^2)
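
As a small illustration, here is a sketch (assuming NumPy and SciPy; the observations and parameter values are hypothetical) that evaluates this product of normal densities:

```python
# Joint density of i.i.d. normal observations as a product of
# individual normal densities.
import numpy as np
from scipy.stats import norm

x = np.array([1.3, 1.2, 1.4])  # hypothetical observations
mu, sigma = 1.3, 0.1           # hypothetical parameter values

joint_density = norm.pdf(x, loc=mu, scale=sigma).prod()
print(joint_density)
```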

In deep learning, \mathbf{x} is all of our training samples. In this case, f is the cross-entropy, which we will discuss below. The basic idea is that the training samples produce one distribution, whereas the output of the neural network using \theta as the parameters produces another distribution; cross-entropy measures the difference between the two.

Example

Let us go back to the coin toss, where the probability of heads P(H) is 0.5. The coin toss is a random variable, and we have an observation of two heads and one tail. Then,
P(H^2T \mid p_H = 0.5) = \binom{3}{2} \cdot 0.5^2 \cdot 0.5 = \frac{3}{8}

Since the likelihood is the same as the density function times a constant,
\mathcal{L}(\theta \mid H^2T) = c \; P(H^2T \mid p_H = \theta)
and thus, we have that
\mathcal{L}(0.5 \mid H^2T) = c \; P(H^2T \mid p_H = 0.5) = 0.375 \; c

Also note that,
\mathcal{L}\left(\frac{2}{3} \mid H^2T\right) = c \; P\left(H^2T \mid p_H = \frac{2}{3}\right) = 3 \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \, c = \frac{4}{9} c \approx 0.44 c

It seems that P(H) = \frac{2}{3} is more likely than P(H) = \frac{1}{2} given the data we have seen.

In fact, we can graph the likelihood against the values of \theta and see that it attains its maximum value of \frac{4}{9} c at \theta = \frac{2}{3}.
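
A small sketch of this graph (assuming NumPy and matplotlib are available) evaluates \mathcal{L}(\theta \mid H^2T) on a grid of \theta values and locates the maximum:

```python
# Evaluate L(theta | HHT) = C(3,2) * theta^2 * (1 - theta) on a grid.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 201)
likelihood = 3 * theta**2 * (1 - theta)

print(theta[np.argmax(likelihood)])  # ~2/3
print(likelihood.max())              # ~4/9

plt.plot(theta, likelihood)
plt.xlabel("theta")
plt.ylabel("likelihood")
plt.show()
```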

Estimation Using Likelihood

The distribution \mathcal{L}(\theta \mid \mathbf{x}) gives us all the information needed to do inference. Note that we are omitting the random variable subscripts; the term above should be \mathcal{L}_{\tilde{\Theta}|\tilde{X}}(\theta \mid \mathbf{x}). Also note that we have already selected the family of distributions, and \theta is the parameter that picks the exact distribution out of the family.

From Bayes theorem and our argument that f_X(\mathbf{x}) is a constant, letting c = \frac{1}{f_X(\mathbf{x})}, we have that
\mathcal{L}(\theta \mid \mathbf{x}) = c \; f(\mathbf{x} \mid \theta) \; f(\theta)
Here again we are dropping the random variable subscripts for cleaner notation.

Since \mathcal{L}(\theta \mid \mathbf{x}) is a distribution but we want a single parameter, we can take a statistical property like the mean or the mode of the distribution to select a value \theta^\star as our estimate.

  • Mode: \argmax_{\theta} \mathcal{L}(\theta \mid \mathbf{x}) - MLE or MAP
  • Mean: \mathbb{E}[\theta \mid \mathbf{x}] = \int \theta \, \mathcal{L}(\theta \mid \mathbf{x}) \, d\theta

If we have no knowledge of \theta, then f(\theta) is a uniform distribution and is constant. In this case, denoting c_1 = c \; f(\theta),
\mathcal{L}(\theta \mid \mathbf{x}) = c_1 \; f(\mathbf{x} \mid \theta)

If f(\theta) is not constant, we call it a prior distribution. This means that we have some guess or information about the distribution of \theta. The likelihood then refines this prior using the dataset \mathbf{x}.

Using the arg max without a prior (or with a uniform prior) is called the maximum likelihood estimate (MLE). Using a prior distribution, we call it the maximum a posteriori (MAP) estimate.

MLE Estimation

Given the likelihood function \mathcal{L}(\theta \mid \mathbf{x}), the MLE for \theta is given by
\theta_{MLE} = \argmax_\theta \mathcal{L}(\theta \mid \mathbf{x})

For computational reasons, we often work with the logarithm of the likelihood instead; since \log is monotonically increasing, the maximizer is unchanged.
\theta_{MLE} = \argmax_\theta \left[ \log \mathcal{L}(\theta \mid \mathbf{x}) \right]
Thus, we will be maximizing the log likelihood.

MLE for the Example

In the above example, we tossed a coin three times and observed two heads and one tail. We want to find P(H), the probability of landing heads, based on the MLE principle.

As we calculated before,
\mathcal{L}(\theta \mid H^2T) = \binom{3}{2} \theta^2 (1-\theta)
and thus, using the MLE principle, we will have

\theta^\prime = \argmax_\theta \left[ \binom{3}{2} \theta^2 (1-\theta) \right] = \argmax_\theta \left[ \theta^2 (1-\theta) \right]

We can calculate \theta_{MLE} by taking the derivative and finding where it is 0.
Thus,
\frac{d \mathcal{L}(\theta \mid H^2T)}{d\theta} \propto \frac{d \, \theta^2 (1-\theta)}{d\theta} = 2\theta - 3\theta^2 = 0
and solving for \theta, we have \theta = 0 or \theta = \frac{2}{3}.

Thus, with MLE, \theta_{MLE} = \frac{2}{3}, which is simply the fraction of heads observed in the experiment.
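
We can also recover this result numerically. Here is a minimal sketch (assuming SciPy) that minimizes the negative log likelihood for the two-heads, one-tail observation:

```python
# Coin-toss MLE by minimizing the negative log likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta):
    # 2 heads and 1 tail; the binomial coefficient is a constant
    # and does not affect the argmax, so we drop it.
    return -(2 * np.log(theta) + np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood,
                         bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.6667 == 2/3
```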

MLE for normal distributions

Suppose that we measure the length of a task and record the values 1.3, 1.2 and 1.4.

To calculate the MLE estimate, we first specify that the length of a task is normally distributed. We want to use MLE to find the mean \mu and the standard deviation \sigma.

The density of the normal distribution is given by
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and thus, the density of the measurements 1.3, 1.2 and 1.4, denoted x_1, x_2 and x_3, is given by
f(x_1, x_2, x_3 \mid \mu, \sigma^2) = \prod_{i=1}^3 \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2 \pi \sigma^2)^{-\frac{3}{2}} \prod_{i=1}^3 e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2 \pi \sigma^2)^{-\frac{3}{2}} e^{-\sum_{i=1}^3 \frac{(x_i-\mu)^2}{2\sigma^2}}

The likelihood function is the same, and thus,
\mathcal{L}(\mu, \sigma^2 \mid x_1, x_2, x_3) = (2 \pi \sigma^2)^{-\frac{3}{2}} e^{-\sum_{i=1}^3 \frac{(x_i-\mu)^2}{2\sigma^2}}
And the log likelihood is given as the following:
\log \mathcal{L}(\mu, \sigma^2 \mid x_1, x_2, x_3) = -\frac{3}{2}\log(2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^3 (x_i-\mu)^2

MLE for \mu

To get the MLE estimate for \mu, we take the derivative with respect to \mu first, which gives us
\frac{d \log \mathcal{L}(\mu, \sigma^2 \mid x_1, x_2, x_3)}{d\mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^3 -2(x_i - \mu) = \frac{1}{\sigma^2} \sum_{i=1}^3 (x_i - \mu) = \frac{1}{\sigma^2} \left[ \sum_{i=1}^3 x_i - 3\mu \right]
Setting the derivative to 0 and solving for \mu, we have
\mu = \frac{1}{3} \sum_{i=1}^3 x_i
Thus, \mu is just the sample average.

MLE for \sigma

To get the MLE for \sigma, we take the derivative of the log likelihood with respect to \sigma, which gives us
\frac{d \log \mathcal{L}(\mu, \sigma^2 \mid x_1, x_2, x_3)}{d\sigma} = -\frac{3}{2} \cdot \frac{1}{2 \pi \sigma^2} \cdot 2\pi (2\sigma) - \frac{(-2)}{2\sigma^3} \sum_{i=1}^3 (x_i-\mu)^2 = -\frac{3}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^3 (x_i-\mu)^2
Setting the derivative to 0 and solving for \sigma, we have
\sigma^2 = \frac{1}{3} \sum_{i=1}^3 (x_i-\mu)^2
which is just the sample variance.

Generalizing for any number of samples

In the general case with n samples x_1, x_2, \ldots, x_n, we can calculate
\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i
and
\sigma_{MLE}^2 = \frac{1}{n} \sum_{i=1}^n (x_i-\mu)^2

Thus, the MLE estimates for the normal distribution are the sample mean and the (biased) sample variance.
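
As an illustration, here is a short sketch (assuming NumPy) that computes these closed-form estimates for the three task-length measurements:

```python
# Closed-form normal MLE: sample mean and biased (divide-by-n) variance.
import numpy as np

x = np.array([1.3, 1.2, 1.4])

mu_mle = x.mean()
sigma2_mle = ((x - mu_mle) ** 2).mean()  # same as np.var(x, ddof=0)

print(mu_mle, sigma2_mle)  # 1.3, ~0.00667
```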

MLE for Deep Learning

Cross-Entropy

Given two probability distributions p and q, the cross entropy of q with respect to p is given by

H(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)

and for the continuous case,
H(p,q) = -\int_{\mathcal{X}} p(x) \log q(x) \, dx
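
Here is a minimal sketch of the discrete version (assuming NumPy; the two distributions are hypothetical):

```python
# Discrete cross entropy H(p, q) = -sum over x of p(x) log q(x).
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    q = np.clip(q, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])  # ground-truth (one-hot) distribution
q = np.array([0.7, 0.2, 0.1])  # hypothetical model output
print(cross_entropy(p, q))     # -log(0.7) ~= 0.357
```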

Cross Entropy for Multi Class Classification

Let us take the case of multi-class classification. Here, the network output is a classification label into one of C classes.

In our case, we use the cross entropy H(p,q) where p is the ground truth distribution (from the actual training data) and q is the distribution output by the neural network.

Given one data point x, suppose the ground truth label is c. Let s_c be the output of the neural network for class c. Since we have the ground truth that c is the correct label, the probabilities of all other labels are 0, and the cross entropy only involves -\log s_c. For the entire data set, we have the cross-entropy
f(\mathbf{x} \mid \theta) = -\sum_{i=1}^n \log(s_{c_i})
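
A minimal sketch of this dataset-level sum (assuming NumPy; probs stands in for hypothetical softmax outputs and labels for the ground-truth class indices):

```python
# Dataset cross entropy: sum of -log s_{c_i}, where s_{c_i} is the
# predicted probability of the true class for sample i.
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],   # hypothetical network outputs
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])         # ground-truth class indices

nll = -np.log(probs[np.arange(len(labels)), labels]).sum()
print(nll)
```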

MAP Estimation

Given that we have a prior distribution, the likelihood becomes
\mathcal{L}(\theta \mid \mathbf{x}) = c \; f_{X|\Theta}(\mathbf{x} \mid \theta) \; f_\Theta(\theta)
and taking the mode of the distribution, we have
\theta_{MAP} = \argmax_\theta f_{X|\Theta}(\mathbf{x} \mid \theta) \, f_\Theta(\theta)
and with the logarithm version, we have
\theta_{MAP} = \argmax_\theta \left[ \log f_{X|\Theta}(\mathbf{x} \mid \theta) + \log f_\Theta(\theta) \right]

Note that f_\Theta(\theta) is the probability of the parameter being \theta. This is another function that has to be provided for the estimation computation.

Example

We will build on the normal distribution example above. Assume that we have a prior distribution for the mean, f(\mu) = \mathcal{N}(\mu_0, \sigma_0^2). Note that we only have a prior for the mean and not for \sigma.

The mean \mu is itself a random variable with its own distribution; its parameters \mu_0 and \sigma_0 are fixed values, not to be confused with the \mu and \sigma that we are trying to estimate.

From above,
\log f(\mathbf{x} \mid \mu, \sigma^2) = -\frac{n}{2}\log(2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2
and that
\log f(\mu) = -\frac{1}{2}\log(2 \pi \sigma_0^2) - \frac{1}{2\sigma_0^2} (\mu-\mu_0)^2

Taking the derivative of the sum of the two terms above with respect to \mu, we have

\frac{d \log \mathcal{L}(\mu, \sigma^2 \mid \mathbf{x})}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) - \frac{\mu - \mu_0}{\sigma_0^2} = \frac{1}{\sigma^2} \sum_{i=1}^n x_i + \frac{\mu_0}{\sigma_0^2} - \mu \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) = \frac{1}{\sigma^2 \sigma_0^2} \left( \sigma_0^2 \sum_{i=1}^n x_i + \mu_0 \sigma^2 - \mu \left( n\sigma_0^2 + \sigma^2 \right) \right)

Setting the derivative to 0 and solving for \mu, we have
\mu_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^n x_i + \mu_0 \sigma^2}{n\sigma_0^2 + \sigma^2}

We see that the prior distribution pulls the MAP estimate of the mean toward \mu_0.
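
Here is a short sketch (assuming NumPy) of this closed-form MAP estimate, reusing the task-length measurements and a hypothetical prior; \sigma^2 is treated as known and fixed at its MLE value:

```python
# Closed-form MAP estimate of the mean with prior N(mu_0, sigma_0^2)
# and known observation variance sigma^2.
import numpy as np

x = np.array([1.3, 1.2, 1.4])  # observations from the MLE example
sigma2 = 0.02 / 3              # known sigma^2, fixed at the MLE value
mu0, sigma02 = 1.0, 0.01       # hypothetical prior mean and variance

mu_map = (sigma02 * x.sum() + mu0 * sigma2) / (len(x) * sigma02 + sigma2)
print(mu_map)  # ~1.245: pulled from the MLE 1.3 toward the prior mean 1.0
```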

Conclusion

This article gave a quick introduction to MLE and MAP and how they are derived from Bayes theorem for random variables. These are general estimation methods, and using them still involves finding the right likelihood function for the problem, one that can give the best estimate of its parameters.

Slides for the presentation on this topic