MLE and MAP

In this article we will explore the estimation methods of MAP and MLE. From the data we have, we would like to estimate a model that best explains the data. MAP and MLE describe techniques for finding the parameters of a distribution given the data. We start with the probability definitions that MLE and MAP build on, then give the mathematical definitions, followed by examples and code samples. The aim is to give a good foundation for the workings of MAP and MLE.

Introduction

Let us first recap a few definitions from probability that we will use to build up to the definitions of MLE/MAP. Since discrete and continuous definitions are different, we have to give slightly different definitions for each of the cases.

We will define a probability space, which leads to random variables and distributions. Conditional probability and Bayes theorem give us the ability to define the posterior distribution. The posterior distribution then gives us all the tools to explore MLE/MAP.

I have not made the effort to make the definitions mathematically super-accurate or to follow a known set of definitions. They are just here to give a rough idea of what the terms mean and to get us to MLE/MAP without a lot of hand-wavy explanations.

Probability

Probability in the common understanding takes a set of possible outcomes and gives a number between 0 and 1 describing how likely it is to occur. A fair coin toss has probability of heads $\frac{1}{2}$ and probability of tails $\frac{1}{2}$ . For a fair die, the probability of each face coming up is $\frac{1}{6}$ and the probability of rolling at least a 3 is $\frac{4}{6} = \frac{2}{3}$ .

Probability Space

To mathematically represent this, we take the triplet of $(\Omega, \mathcal{F}, P)$ to define a probability space that gives probabilities to possible outcomes. The three items are defined as follows:

$\Omega$ : The sample space, the set of all possible outcomes. For the fair coin, $\Omega = \{H,T\}$ where $H$ represents the coin landing heads up and $T$ represents tails. For a fair die, $\Omega = \{1, 2, 3, 4, 5, 6\}$ . For the income of a person in dollars, $\Omega = \left[ 0, \infty \right)$ , which has an infinite number of elements.
$\mathcal{F}$ : The event space, a set of subsets of $\Omega$ . If $\Omega$ is finite, $\mathcal{F}$ is the power set of $\Omega$ , that is, the set of all subsets of $\Omega$ . Since taking the power set of an infinite set can lead to problems, we cannot in general define $\mathcal{F}$ as the power set of an infinite $\Omega$ . Instead, we just define it as a set of subsets of $\Omega$ on which the following function $P$ is defined. For the fair coin, $\mathcal{F} = \{ \{\}, \{H\}, \{T\}, \{H, T\} \}$ .
$P$ : The probability function, which takes an event and assigns to it a real number between 0 and 1 that we call its probability. Thus, $P : \mathcal{F} \rightarrow \mathbb{R}$ . For the fair coin, $P(\{\}) = 0$ , $P(\{H\}) = P(\{T\}) = \frac{1}{2}$ and $P(\{H,T\}) = 1$ .
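
The triplet can be made concrete for a finite sample space. The following is a minimal sketch, assuming every outcome is equally likely; the helper names `power_set` and `uniform_probability` are made up for this illustration and are not part of any library.

```python
from fractions import Fraction
from itertools import chain, combinations


def power_set(omega):
    """Return every subset of omega as a frozenset (the event space F)."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def uniform_probability(omega):
    """Build P for a finite sample space with equally likely outcomes."""
    outcomes = set(omega)

    def P(event):
        # Probability of an event = (number of outcomes in it) / |omega|.
        return Fraction(len(set(event) & outcomes), len(outcomes))

    return P


coin = {"H", "T"}
coin_events = power_set(coin)   # the 4 events of the fair coin
P_coin = uniform_probability(coin)

die = {1, 2, 3, 4, 5, 6}
P_die = uniform_probability(die)
even_faces = {face for face in die if face % 2 == 0}

print(len(coin_events), P_coin({"H"}), P_die(even_faces))
```

Using `Fraction` keeps the probabilities exact, which makes it easy to compare them against the hand-computed values.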

Random Variable

Let $X$ be a random variable. Then, $X$ is defined as a function from a space of possibilities $\Omega$ to another space $E$ .

$X : \Omega \rightarrow E$

For example, the random variable of a fair coin toss has the space of possibilities as heads or tails $\{H,T\}$ as above. We could choose to represent tails as 0 and heads as 1. Then, $X$ would be the random variable of the number of heads during a single coin toss. Then, $X(T) = 0$ and $X(H) = 1$ .

A common notation used is $P(X=1)$ . This is shorthand for using $X$ with a probability space: we find all elements $\omega \in \Omega$ such that $X(\omega) = 1$ and take the probability of that set.

$P(X = 1) = P(\{ \omega \in \Omega \text{ and } X(\omega) = 1 \})$

Another commonly used notation is $P_X(x)$ , which is equivalent.

$P_X(x) = P(X= x) = P(\{ \omega \in \Omega \text{ and } X(\omega) = x \})$

The random variable of the number of heads from 3 coin tosses of a fair coin can be expressed as
$X : \Omega^3 \rightarrow \mathbb{R}$
and as an example $X(H \times H \times T) = 2$

Probability Distribution Function

The probability distribution of a random variable is the function $P^\prime : E \rightarrow \mathbb{R}$ that is naturally induced from $X$ , where $E$ is the output set of the random variable. So, instead of mapping events in $\mathcal{F}$ to real numbers, it maps values in the output set $E$ to real numbers. For example, the distribution of the number of heads after 3 coin tosses is a function $\{ 0,1,2,3 \} \rightarrow \mathbb{R}$ .

For example, consider the probability of 2 heads and 1 tail in 3 coin tosses with $P(H) = \frac{1}{2}$ and $P(T) = \frac{1}{2}$ . Since the outcome is discrete, this is strictly a probability mass function rather than a density, and it is given by the binomial distribution
$P\left(X=2\right) = \binom{3}{2} \cdot \left(\frac{1}{2}\right)^2 \cdot \left(\frac{1}{2}\right)^1 = 3 \cdot \left(\frac{1}{2}\right)^3 = \frac{3}{8}$
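
As a check on the binomial value, here is a small sketch that enumerates all eight sequences of three fair tosses, applies the random variable "number of heads" to each, and compares the induced distribution with the binomial formula. The names `omega3`, `X`, and `binomial_pmf` are chosen for this example only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

# Sample space of three tosses: all 8 ordered sequences of H and T.
omega3 = list(product("HT", repeat=3))


def X(outcome):
    """Random variable: number of heads in a sequence of tosses."""
    return outcome.count("H")


# Induced distribution P_X(x) for a fair coin, where every sequence
# has the same probability 1/8.
counts = Counter(X(w) for w in omega3)
p_x = {x: Fraction(k, len(omega3)) for x, k in counts.items()}


def binomial_pmf(k, n, p):
    """Binomial probability of k heads in n tosses with P(H) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


print(p_x[2], binomial_pmf(2, 3, Fraction(1, 2)))
```

Both values come out as 3/8, matching the calculation by hand.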

Here, we are dealing with data, which are samples coming from an experiment. We can think of the probability distribution in a different way: given a sample, the probability distribution function gives the probability (or, in the continuous case, the density) of observing that sample under the distribution.

Inference

Estimation is the problem where we have data and need to find the distribution of the data. MAP/MLE are different methods for doing this. Estimation is possible after we have the likelihood but before that we review conditional probability and Bayes theorem.

Conditional Probability

The conditional probability $P(A|B)$ is conceptually the probability of $A$ given that $B$ has already occurred. It is defined as

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

To define the conditional probability for random variables $X$ and $Y$ , we first define the joint probability distribution $f$ of $X$ and $Y$ .

Before even that, we assume that both random variables are defined on the same sample space $\Omega$ .

Then, the joint probability distribution is defined as
$f_{X,Y} (x,y) = f(X = x \text{ and } Y = y)$
where the probability is taken with the probability function $P$ of the common probability space. Using the notation defined above, we can write this as

$f_{X,Y} (x,y) = P(\{ w \in \Omega \text{ and } X(w) = x\} \cap \{ v \in \Omega \text{ and } Y(v) = y \})$

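We also write $f(X=x)$ for the marginal distribution of $X$ , obtained by integrating the joint distribution over all values of $Y$ :

$f(X=x) = \int_Y f_{X,Y}(x,y) \, dy$

In terms of the conditional distribution defined below, this is the same as $\int_Y f_{X|Y}(x|y) \, g(y) \, dy$ , where $g$ is the probability in $Y$ only,
$g(y) = P(\{ v \in \Omega \text{ and } Y(v) = y \})$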

Then, the conditional probability for random variables $X$ and $Y$ is defined as

$f(Y=y | X = x) = \frac{f(X =x \text{ and } Y = y)}{f(X=x)} = \frac{f_{X,Y}(x,y)}{f_X(x)}$

Another common notation for conditional probability is $f_{Y|X}(y|x)$ and thus, we can write

$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$

If they are from different sample spaces $\Omega_1$ and $\Omega_2$ , we can take $\Omega = \Omega_1 \times \Omega_2$ . In this way, we can define a joint probability distribution for any two random variables.

Bayes Theorem

This leads to Bayes theorem,

$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}$

By substituting the conditional probability, we can prove the above statement.

$\frac{P(A \cap B)}{P(A)} = P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} = \frac{\frac{P(A \cap B)}{P(B)} \cdot P(B)}{P(A)} = \frac{P(A \cap B)}{P(A)}$

Bayes theorem for random variables is similar and stated as,

$f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y) f_Y(y)}{f_X(x)}$

and can be proved using similar substitution as above,

$\frac{f_{X,Y}(x,y)}{f_X(x)} = f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y) f_Y(y)}{f_X(x)} = \frac{\frac{f_{X,Y}(x,y)}{f_Y(y)} f_Y(y)}{f_X(x)} = \frac{f_{X,Y}(x,y)}{f_X(x)}$
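
To see the definitions working together, the sketch below uses a small, made-up joint distribution over two binary random variables. It computes the conditional distribution directly from the joint and again through Bayes theorem, and the two agree. The numbers in `joint` are hypothetical and chosen only so that they sum to 1.

```python
from fractions import Fraction

# Hypothetical joint distribution f_{X,Y}(x, y) with X, Y in {0, 1}.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(x):
    """f_X(x): sum the joint over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """f_Y(y): sum the joint over all values of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_y_given_x(y, x):
    """f_{Y|X}(y|x) straight from the definition."""
    return joint[(x, y)] / marginal_x(x)


def cond_x_given_y(x, y):
    """f_{X|Y}(x|y) straight from the definition."""
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    """f_{Y|X}(y|x) computed through Bayes theorem."""
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)


direct = cond_y_given_x(1, 0)
via_bayes = bayes_y_given_x(1, 0)
print(direct, via_bayes)
```

In the discrete case the integral for the marginal becomes a sum, which is what `marginal_x` and `marginal_y` compute.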

Likelihood

Given a set of observations, we want to estimate the distribution that best fits the data.

Instead of considering all possible distributions, we will consider the simpler problem of looking only at a single family of distributions. For example, for the length of a task, we can assume it comes from a normal distribution. For a neural network, we can fix the architecture and connections and only worry about the weights.

By considering a family of distributions, the distribution can be represented by a random variable for the parameters of the distribution. For example, for a normal distribution, the parameters are the mean and standard deviation. Let $\Phi$ be the family of distributions; our random variable is then $\Theta : \Phi \rightarrow \mathbb{R}^2$ . For neural networks, if there are $d$ weights in the neural network, $\Theta : \Phi \rightarrow \mathbb{R}^d$ .

Suppose we are using $n$ observations for inference. Let $\Omega$ be the set of possible observations, so that our observations come from $\Omega^n$ . Let $X$ be our random variable and $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be our set of observations. Using $\Theta$ as the random variable that describes the parameters of our distribution, we want to find the function that gives us the parameter $\theta$ of the distribution from the data. The function would be something like $\mathcal{L}_{\Theta|X}(\theta|x)$ , which gives the probability of a distribution parameter given the data.

However, the function $\mathcal{L}_{\Theta|X}(\theta|x)$ does not make sense as a conditional probability for random variables, because our definition required $\Theta$ and $X$ to share the same domain. To fix this, we consider the domain $\Omega^n \times \Phi$ and the random variables $\tilde{X} : \Omega^n \times \Phi \rightarrow \mathbb{R}^n$ and $\tilde{\Theta} : \Omega^n \times \Phi \rightarrow \mathbb{R}^d$ . Then, given that our observation probability space is $(\Omega^n, \mathcal{F}, P)$ , we define $\mathcal{L}(\tilde{X} = \mathbf{x} \times \theta) = P(X =\mathbf{x})$ . If the probability space for the distributions is $(\Phi, \mathcal{G}, Q)$ , then we define $\mathcal{L}(\tilde{\Theta} = \mathbf{x} \times \theta) = Q(\Theta =\theta)$ . With both random variables on the same domain, we can define the likelihood function as
$\mathcal{L}_{\tilde{\Theta}|\tilde{X}}(\theta|\mathbf{x})$

The likelihood function gives the probability that our given set of statistical observations $\mathbf{x}$ came from a set of statistical parameters $\theta$ .

By using Bayes theorem for random variables and using $f$ for the joint probability distribution and discarding the tilde in our random variable names,
$\mathcal{L}_{\Theta|X} (\theta | x) = f_{X|\Theta}(\mathbf{x} | \theta) \frac{f_\Theta(\theta)}{f_X(\mathbf{x})}$

$f_\Theta(\theta)$ is called the prior distribution and is the distribution of our parameters. If we have no idea what it could be, we take it to be a uniform distribution, which is a constant value. In cases where we have prior knowledge about $\Theta$ , it is not constant.

The term $f_X(\mathbf{x}) = \int_\Phi f(\mathbf{x}|v) g(v) dv$ where $g$ is the distribution of $\Theta$ . Because we integrate over all values of $\Theta$ , this term does not depend on $\theta$ , and since our observation data is fixed, we can treat it as a constant.

Note that $f_{X|\Theta}(\mathbf{x}|\theta)$ is the probability of the random variable $X$ given that the distribution has parameters $\theta$ . The domain of $X$ is all $n$ of our observations: not one observation but all of them. In inference (prediction), we would get a single observation $x$ and return $P(x|\theta)$ . Here $f$ is a different function that takes all $n$ observations $\mathbf{x}$ and provides a probability for them.

When $\theta$ is the probability of the coin toss landing heads and $X$ is the number of heads observed, $f$ is the binomial distribution of $x$ heads out of $n$ tosses.

When $\theta$ are the parameters of the normal distribution and the observations are independent, $f$ is simply the product of $n$ normal densities

$f(\mathbf{x} | \theta = \mu, \sigma^2) = \prod_{i=1}^n f(x_i | \mu, \sigma^2)$

In deep learning, $\mathbf{x}$ is all our training samples. In this case, the negative logarithm of $f$ is the cross-entropy, which we will discuss below. The basic idea is that the training samples produce one distribution, whereas the output of the neural network using $\theta$ as the parameters produces another. Cross-entropy measures how well the second distribution matches the first, and it is smallest when the two agree.

Example

Let us go back to the coin toss where the probability of head $P(H)$ is 0.5. Thus, the coin toss is a random variable and we have an observation of two heads and one tail. Then,
$P(H^2T|p_H = 0.5) = {3 \choose 2} \cdot 0.5^2 0.5 = \frac{3}{8}$

Since the likelihood is the same as the density function times a constant,
$\mathcal{L}(\theta|H^2T) = c \; P(H^2T | p_H = \theta)$
and thus, we have that
$\mathcal{L}(0.5|H^2T) = c \; P(H^2T | p_H = 0.5) = 0.375 \; c$

Also note that,
$\mathcal{L}\left(\frac{2}{3}|H^2T\right) = c \; P\left(H^2T | p_H = \frac{2}{3}\right) = 3 \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} c = \frac{4}{9}c \approx 0.44c$

It seems that $P(H) = \frac{2}{3}$ is more likely than $P(H) = \frac{1}{2}$ from the data we have seen.

Actually, we can graph the likelihood against the values of $\theta$ . We see that it is maximized at $\theta = \frac{2}{3}$ , where it takes the value $\frac{4}{9}c$ .
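
The graph can be reproduced numerically. The sketch below evaluates the likelihood of two heads and one tail on a grid of candidate values for $\theta$ , taking the constant $c = 1$ for simplicity (an assumption made here; any positive constant gives the same maximizer). The function name `likelihood` and the grid resolution are choices for this example.

```python
from math import comb


def likelihood(theta, heads=2, tosses=3):
    """Binomial probability of the observed data as a function of theta."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)


# Evaluate on a grid of 1001 candidate values for P(H) in [0, 1].
grid = [i / 1000 for i in range(1001)]
values = [likelihood(t) for t in grid]
best_theta = grid[values.index(max(values))]

print(likelihood(0.5), likelihood(2 / 3), best_theta)
```

The grid maximum lands next to $\theta = \frac{2}{3}$ , where the likelihood is $\frac{4}{9}$ , while $\theta = \frac{1}{2}$ gives the smaller value 0.375.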

Estimation Using Likelihood

The distribution $\mathcal{L}(\theta|\mathbf{x})$ gives us all the information to do inference. Note that we are omitting the random variable terms and the above term should be $\mathcal{L}_{\tilde{\Theta}|\tilde{X}}(\theta|x)$ . Also note that we have already selected the family of distributions and $\theta$ is the parameter that gives the exact distribution out of the family of distributions.

From Bayes theorem and our argument that $f_X(\mathbf{x})$ is a constant, letting $c=\frac{1}{f_X(\mathbf{x})}$ , we have that
$\mathcal{L} (\theta | x) = c \; f(\mathbf{x} | \theta) \; f(\theta)$
Here again we are dropping all of the random variables for cleaner notation.

Since $\mathcal{L}(\theta|x)$ is a distribution over $\theta$ but we want a single value, we can take a statistical property like the mean or mode of the distribution to select a value $\theta^\star$ as our estimate.

Mode: $\argmax_{\theta} \mathcal{L} (\theta|\mathbf{x})$ - MLE or MAP
Mean : $\mathbb{E}[\theta | \mathbf{x}] = \int \theta \; \mathcal{L} (\theta|\mathbf{x}) \; d\theta$
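
The two choices generally give different estimates. As a rough numerical sketch, assume the coin example with two heads and one tail and a uniform prior, so the distribution over $\theta$ is proportional to $\theta^2 (1-\theta)$ . The code normalizes it on a grid with the midpoint rule, then reads off the mode and the mean, where the mean is the integral of $\theta$ weighted by the normalized distribution. The grid size and the names are assumptions of this example.

```python
def unnormalized(theta):
    """theta^2 (1 - theta): two heads and one tail, constants dropped."""
    return theta**2 * (1 - theta)


n_cells = 10_000
width = 1 / n_cells

# Midpoints of n_cells equal cells covering [0, 1].
thetas = [(i + 0.5) * width for i in range(n_cells)]
weights = [unnormalized(t) for t in thetas]

# Normalize so that the density integrates to 1 on the grid.
total = sum(weights) * width
density = [w / total for w in weights]

mode = thetas[density.index(max(density))]
mean = sum(t * d for t, d in zip(thetas, density)) * width

print(round(mode, 3), round(mean, 3))
```

The mode sits at about $\frac{2}{3}$ while the mean is about $\frac{3}{5}$ , so the choice of summary statistic changes the estimate even for the same data.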

If we have no knowledge of $\theta$ , then $f(\theta)$ is a uniform distribution and is constant. In this case denoting $c_1=c \; f(\theta)$ ,
$\mathcal{L} (\theta | x) = c_1 \; f(\mathbf{x} | \theta)$

If $f(\theta)$ is not constant, the prior distribution is informative. This means that we have some guess or information about the distribution of $\theta$ . The function $\mathcal{L}(\theta|\mathbf{x})$ then updates the prior using the dataset $\mathbf{x}$ ; with the prior included, it is what is usually called the posterior distribution.

Taking the arg max without a prior (or with a uniform prior) gives the maximum likelihood estimate (MLE). Taking it with a prior distribution gives the maximum a posteriori (MAP) estimate.

MLE Estimation

Given the likelihood function $\mathcal{L}(\theta|\mathbf{x})$ with a uniform prior, the MLE for $\theta$ is given by
$\theta_{MLE} = \argmax_\theta \mathcal{L}(\theta|\mathbf{x})$

For computational reasons, we can also maximize the logarithm of the function. Since $\log$ is strictly increasing, the maximizing $\theta$ is the same.
$\theta_{MLE} = \argmax_\theta \left[ \log \mathcal{L}(\theta|\mathbf{x}) \right]$
Thus, we will be maximizing the log likelihood.
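
One concrete computational reason is floating-point underflow: a product of many small densities rounds to zero, while the sum of their logarithms stays finite. The sketch below illustrates this with 2000 synthetic samples drawn from a standard normal distribution. The sample size, the seed, and the helper names are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) ** 2 / (2 * sigma**2)
    return math.exp(-z) / math.sqrt(2 * math.pi * sigma**2)


def normal_logpdf(x, mu, sigma):
    """Logarithm of the density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Product of 2000 densities, each below 0.4: underflows to 0.0.
likelihood = math.prod(normal_pdf(x, 0.0, 1.0) for x in samples)

# Sum of 2000 log densities: a finite negative number.
log_likelihood = sum(normal_logpdf(x, 0.0, 1.0) for x in samples)

print(likelihood, log_likelihood)
```

Once the product has collapsed to zero it can no longer distinguish between parameter values, whereas the log likelihood still can.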

MLE for the Example

In the above example, we tossed a coin three times and observed two heads and one tail. We want to find $P(H)$ , the probability of landing heads, based on the MLE principle.

As we calculated before,
$\mathcal{L}(\theta | H^2 T) = \binom{3}{2} \theta^2 (1-\theta)$
and thus, using the MLE principle, we will have

$\theta_{MLE} = \argmax_\theta \left[ \binom{3}{2} \theta^2 (1-\theta) \right] = \argmax_\theta \left[ \theta^2 (1-\theta) \right]$

We can calculate $\theta_{MLE}$ by taking the derivative and finding where it is 0. Dropping the constant factor $\binom{3}{2}$ , which does not change where the maximum is, we have
$\frac{d}{d \theta} \left[ \theta^2 (1-\theta) \right] = 2\theta - 3 \theta^2 = 0$
and solving for $\theta$ , we have $\theta = 0$ or $\theta = \frac{2}{3}$ . At $\theta = 0$ the likelihood is 0, so that solution is a minimum and the maximum is at $\theta = \frac{2}{3}$ .

Thus, with MLE, $\theta_{MLE} = \frac{2}{3}$ , which is simply the fraction of heads observed in the experiment.

MLE for normal distributions

Suppose that we measure the length of a task and record the values 1.3, 1.2 and 1.4.

To calculate the MLE estimate, we first specify that the length of a task is normally distributed. We want to use MLE to find the mean $\mu$ and the standard deviation $\sigma$ .

The density of the normal distribution is given by
$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and thus, the density of the measurements 1.3, 1.2 and 1.4 denoted as $x_1$ , $x_2$ and $x_3$ is given by
$f(x_1, x_2, x_3 | \mu, \sigma^2) = \prod_{i=1}^3 \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \\ = {(2 \pi \sigma^2)}^{\frac{-3}{2}} \prod_{i=1}^3 e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \\ = {(2 \pi \sigma^2)}^{-\frac{3}{2}} e^{-\sum_{i=1}^3 \frac{(x_i-\mu)^2}{2\sigma^2}}$

The likelihood function is the same and thus,
$\mathcal{L}(\mu, \sigma^2 | x_1, x_2, x_3 ) = {(2 \pi \sigma^2)}^{-\frac{3}{2}} e^{-\sum_{i=1}^3 \frac{(x_i-\mu)^2}{2\sigma^2}}$
And, the log likelihood is given as the following:
$\log{\mathcal{L}(\mu, \sigma^2 | x_1, x_2, x_3 )} = -\frac{3}{2}\log{(2 \pi \sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^3 (x_i-\mu)^2$

MLE for $\mu$

To get the MLE estimate for $\mu$ , we take the derivative with respect to $\mu$ first which gives us,
$\frac{d\log{\mathcal{L}(\mu, \sigma^2 | x_1, x_2, x_3 )}}{d\mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^3 -2 (x_i - \mu) \\ = \frac{1}{\sigma^2} \sum_{i=1}^3 (x_i-\mu) \\ = \frac{1}{\sigma^2} \left[ \sum_{i=1}^3x_i - 3\mu \right]$
Setting the derivative to 0 and solving for $\mu$ , we have
$\mu = \frac{1}{3}\sum_{i=1}^3 x_i$
And, thus $\mu$ is just the average.

MLE for $\sigma$

To get the MLE for $\sigma$ , we take the derivative of the log likelihood with respect to $\sigma$ , which gives us,
$\frac{d\log{\mathcal{L}(\mu, \sigma^2 | x_1, x_2, x_3 )}}{d\sigma} = -\frac{3}{2} \frac{1}{2 \pi \sigma^2} 2\pi (2\sigma) - \frac{(-2)}{2 \sigma^3} \sum_{i=1}^3 (x_i-\mu)^2 \\ = -\frac{3}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^3 (x_i-\mu)^2$
Setting the derivative to 0 and solving for $\sigma$ we have,
$\sigma^2 =\frac{1}{3}\sum_{i=1}^3 (x_i-\mu)^2$
which is just the sample variance (the version that divides by $n$ rather than $n-1$ ).

Generalizing for any number of samples

In the general case with $n$ -samples $x_1, x_2, \ldots, x_n$ , we can calculate
$\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$
and
$\sigma_{MLE}^2 =\frac{1}{n}\sum_{i=1}^n (x_i-\mu_{MLE})^2$

Thus, the MLE estimate for the normal distribution is the sample mean and the sample variance (with divisor $n$ ), whose square root gives the standard deviation.
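
The closed-form result can be checked on the three measurements 1.3, 1.2 and 1.4. This is a minimal sketch using only the standard library; `normal_mle` and `log_likelihood` are names chosen here. It compares the variance with `statistics.pvariance`, which also divides by $n$ .

```python
import math
import statistics


def normal_mle(xs):
    """Closed-form MLE for a normal: mean and divide-by-n variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def log_likelihood(xs, mu, var):
    """Log likelihood of independent normal samples with mean mu, variance var."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)


data = [1.3, 1.2, 1.4]
mu_mle, var_mle = normal_mle(data)

print(mu_mle, var_mle, statistics.pvariance(data))
print(log_likelihood(data, mu_mle, var_mle))
```

Moving either parameter away from the estimate lowers the log likelihood, which is a quick sanity check that the closed form really is a maximum for this data.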

MLE for Deep Learning

Cross-Entropy

Given two probability distributions $p$ and $q$ , the cross entropy of $q$ with respect to $p$ is given by

$H(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)$

and for the continuous case,
$H(p,q) = -\int_X p(x) \log q(x) dx$

Cross Entropy for Multi Class Classification

Let us take the case of multi-class classification. Here, the network output is a classification label into one of $C$ classes.

In our case, we use the cross entropy $H(p,q)$ where $p$ is the ground truth distribution (actual training data) and $q$ is the distribution output from the neural network.

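Given one data point $x$ , suppose the ground truth label is $c$ . Let $s_c$ be the probability that the neural network outputs for class $c$ (for example, after a softmax). Since the ground truth says $c$ is the correct label, $p$ puts probability 1 on $c$ and 0 on all other labels. Then, the cross entropy only involves $-\log s_c$ . For the entire data set, where $c_i$ is the label of the $i$ -th sample, the total cross-entropy is the negative log likelihood
$-\log f(\mathbf{x}|\theta) = -\sum_{i=1}^n \log(s_{c_i})$
so minimizing the cross-entropy is the same as maximizing the likelihood.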
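
The following sketch makes the connection concrete for a tiny, made-up batch. It assumes the network produces raw scores (logits) that are turned into probabilities with a softmax, which is one common choice and not something fixed by the derivation above. The logits, labels, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), skipping terms where p(x) = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for 3 samples and C = 3 classes, with their labels.
logits = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3], [-0.5, 0.0, 2.5]]
labels = [0, 1, 2]
num_classes = 3

probs = [softmax(z) for z in logits]
one_hot = [
    [1.0 if c == label else 0.0 for c in range(num_classes)]
    for label in labels
]

# Cross-entropy against one-hot targets, summed over the batch.
total_ce = sum(cross_entropy(p, q) for p, q in zip(one_hot, probs))

# Negative log likelihood: minus the log probability of each true label.
neg_log_lik = -sum(math.log(q[label]) for q, label in zip(probs, labels))

print(total_ce, neg_log_lik)
```

The two printed numbers agree, because with one-hot targets every term of the cross-entropy except the one for the true label is multiplied by 0.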

MAP Estimation

Given that we have the prior distribution, the likelihood becomes
$\mathcal{L} (\theta | x) = c \; f_{X|\Theta}(\mathbf{x} | \theta) \; f_\Theta(\theta)$
and taking the mode of the distribution, we have
$\theta_{MAP} = \argmax_\theta f_{X|\Theta}(\mathbf{x}|\theta) f_\Theta(\theta)$
and with the logarithm version, we have
$\theta_{MAP} = \argmax_\theta \left[ \log f_{X|\Theta}(\mathbf{x}|\theta) + \log f_\Theta(\theta) \right]$

Note that $f_\Theta(\theta)$ is the prior probability of the parameter being $\theta$ . This is another function that has to be provided for the estimation computation.

Example

We will be adding to the example above for the normal distribution. Assume that we have a prior distribution for the mean, $f(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$ . Note that we only have a prior for the mean and not for $\sigma$ .

The mean $\mu$ is itself now treated as a random variable with its own normal distribution. The parameters $\mu_0, \sigma_0$ of that distribution are fixed in advance and are separate from the $\mu$ and $\sigma$ that we are trying to estimate.

From above,
$\log{f(\mathbf{x} | \mu, \sigma^2)} = -\frac{n}{2}\log{(2 \pi \sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2$
and that
$\log(f(\mu)) = -\frac{\log{(2 \pi \sigma_0^2)}}{2} - \frac{1}{2\sigma_0^2} (\mu-\mu_0)^2$

Adding the two terms above and taking the derivative with respect to $\mu$ , we have

$\frac{d\log{\mathcal{L}(\mu, \sigma^2 | \mathbf{x} )}}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) - \frac{\mu - \mu_0}{\sigma_0^2} \\ = \frac{1}{\sigma^2} \sum_{i=1}^n x_i + \frac{\mu_0}{\sigma_0^2} - \mu\left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \\ = \frac{1}{\sigma^2 \sigma_0^2} \left( \sigma_0^2 \sum_{i=1}^n x_i + \mu_0 \sigma^2 - \mu \left( n\sigma_0^2 + \sigma^2 \right) \right)$

Setting the derivative to 0 and solving for $\mu$ we have
$\mu_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^n x_i + \mu_0 \sigma^2}{n\sigma_0^2 + \sigma^2}$

We see that the prior distribution has affected the estimated MAP mean.
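
To get a feel for the size of the effect, here is a small sketch that evaluates the formula for $\mu_{MAP}$ on the measurements 1.3, 1.2 and 1.4. The observation variance and the prior parameters below are hypothetical values picked for illustration, and the observation variance $\sigma^2$ is treated as known.

```python
def mu_mle(xs):
    """MLE of the mean: the sample average."""
    return sum(xs) / len(xs)


def mu_map(xs, sigma2, mu0, sigma0_2):
    """MAP estimate of the mean with known variance sigma2 and prior N(mu0, sigma0_2)."""
    n = len(xs)
    return (sigma0_2 * sum(xs) + mu0 * sigma2) / (n * sigma0_2 + sigma2)


data = [1.3, 1.2, 1.4]
sigma2 = 0.01               # assumed known observation variance
mu0, sigma0_2 = 1.0, 0.01   # hypothetical prior on the mean

estimate = mu_map(data, sigma2, mu0, sigma0_2)

# A very wide prior carries almost no information about the mean.
vague = mu_map(data, sigma2, mu0, 1e6)

print(estimate, vague, mu_mle(data))
```

With these numbers the MAP estimate lies between the prior mean 1.0 and the sample mean 1.3. With the very wide prior it is practically equal to the MLE, which matches the earlier remark that a uniform prior reduces MAP to MLE.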

Conclusion

This article gives a quick introduction to MLE and MAP and how they are derived from Bayes theorem for random variables. These are general methods of estimation, and using them still involves finding the right likelihood function for the problem, one that can give the best estimate of its parameters.
