Let us first recap a few definitions from probability that we will use to build up to the definitions of MLE/MAP. Since discrete and continuous definitions are different, we have to give slightly different definitions for each of the cases.
We will define a probability space, which leads to random variables and distributions. Conditional probability and Bayes' theorem give us the ability to define the posterior distribution. The posterior distribution then gives us all the tools to explore MLE/MAP.
I have not made the effort to make the definitions mathematically super-accurate or follow a known set of definitions. These are just here to give a rough idea of what the terms mean and to get us to MLE/MAP that does not require a lot of hand-wavey explanations.
Probability in the common understanding takes a set of possible outcomes and gives a number between 0 and 1 for how likely it is to occur. A fair coin toss has probability of heads $\frac{1}{2}$ and probability of tails $\frac{1}{2}$. For a fair die, the probability of each face coming up is $\frac{1}{6}$ and the probability of rolling at least a 3 is $\frac{4}{6} = \frac{2}{3}$.
To mathematically represent this, we take the triplet $(\Omega, \mathcal{F}, P)$ to define a probability space that gives probabilities to possible outcomes. The three items are defined as follows:
- $\Omega$: The sample space, the set of all possible outcomes. For the fair coin, $\Omega = \{H, T\}$ where $H$ represents the coin landing heads up and $T$ represents tails. For a fair die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. For the income of a person, $\Omega = \mathbb{R}$ and has an infinite number of elements.
- $\mathcal{F}$: The event space, a set of subsets of $\Omega$. If $\Omega$ is finite, $\mathcal{F}$ is the power set of $\Omega$, that is, the set of all subsets of $\Omega$. Since taking the power set of an infinite set can lead to problems, we cannot define $\mathcal{F}$ as the power set of an infinite $\Omega$. Instead, we will just define it as the set of subsets of $\Omega$ on which the following function $P$ works. For the fair coin, $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}$.
- $P$: The probability function, which takes an event and assigns it a real number that we call its probability. Thus, $P: \mathcal{F} \to [0, 1]$. For the fair coin, $P(\emptyset) = 0$, $P(\{H\}) = P(\{T\}) = \frac{1}{2}$, and $P(\{H, T\}) = 1$.
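As a rough illustration of the triplet for the fair die, here is a minimal Python sketch. The names `omega`, `power_set`, `event_space` and `prob` are made up for this example, and it only works because the sample space is finite.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for a fair six-sided die.
omega = frozenset(range(1, 7))

def power_set(s):
    """Return every subset of s as a frozenset (the event space when s is finite)."""
    items = sorted(s)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(c) for c in subsets}

event_space = power_set(omega)

def prob(event):
    """Probability function: every outcome of the fair die carries weight 1/6."""
    if event not in event_space:
        raise ValueError("not an event in the event space")
    return Fraction(len(event), len(omega))

at_least_three = frozenset(o for o in omega if o >= 3)
print(len(event_space))      # number of events in the power set
print(prob(at_least_three))  # probability of rolling at least a 3
```

Using `Fraction` keeps the probabilities exact, so the result can be compared directly with the hand calculation.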
Let $X$ be a random variable. Then, $X$ is defined as a function $X: \Omega \to E$ from a space of possibilities $\Omega$ to another space $E$.
For example, the random variable of a fair coin toss has the space of possibilities $\Omega = \{H, T\}$, heads or tails, as above. We could choose to represent tails as 0 and heads as 1. Then, $X$ would be the random variable of the number of heads during a single coin toss, with $X(H) = 1$ and $X(T) = 0$.
A common notation is $P(X = x)$. This is shorthand for using $X$ with a probability space $(\Omega, \mathcal{F}, P)$: we find all elements $\omega \in \Omega$ such that $X(\omega) = x$ and take the probability of that set.
$$P(X = x) = P(\{\omega \in \Omega : X(\omega) = x\})$$
Another commonly used, equivalent notation is $P_X(x)$.
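The random variable of the number of heads from 3 tosses of a fair coin can be expressed as
$$X: \{H, T\}^3 \to \{0, 1, 2, 3\}, \qquad X(\omega) = \text{number of } H \text{ in } \omega$$
and as an example
$$P(X = 2) = P(\{HHT, HTH, THH\}) = \frac{3}{8}$$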
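The three-toss random variable can be checked by brute-force enumeration. This is only a sketch: `omega`, `X` and `prob_X_equals` are illustrative names, and every outcome is assumed equally likely because the coin is fair.

```python
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three tosses, each equally likely for a fair coin.
omega = ["".join(t) for t in product("HT", repeat=3)]

def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count("H")

def prob_X_equals(x):
    """P(X = x): probability of the set of outcomes that X maps to x."""
    matching = [w for w in omega if X(w) == x]
    return Fraction(len(matching), len(omega))

# The induced distribution of X over {0, 1, 2, 3}.
distribution = {x: prob_X_equals(x) for x in range(4)}
print(distribution)
```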
Probability Distribution Function
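The probability distribution of a random variable $X$ is the function $P_X: E \to [0, 1]$, where $E$ is the output set of the random variable function and $P_X$ is naturally induced from $P$ by $P_X(x) = P(X = x)$. So, instead of mapping events in $\mathcal{F}$ to a real number, we map values in the output set $E$ to a real number. For example, the distribution of the number of heads after 3 coin tosses takes values from $E = \{0, 1, 2, 3\}$.
For example, consider the probability density function (strictly, a probability mass function, since $X$ is discrete) of 2 heads and 1 tail in 3 coin tosses, with $n = 3$ and $k = 2$. For a coin whose probability of heads is $\theta$, the probability density is given by the binomial distribution
$$P(X = k) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}, \qquad P(X = 2) = \binom{3}{2} (0.5)^2 (0.5)^1 = 0.375$$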
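The binomial formula is short enough to evaluate directly. The helper name `binomial_pmf` and the parameter name `theta` are assumptions made for this sketch.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Probability of exactly k heads in n independent tosses with P(heads) = theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Two heads and one tail from three tosses of a fair coin.
p_two_heads = binomial_pmf(2, 3, 0.5)
print(p_two_heads)
```

Summing `binomial_pmf(k, 3, 0.5)` over `k = 0..3` gives 1, as expected for a distribution.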
Here, we are dealing with data, which are samples coming from an experiment. We can think of the probability distribution in a different way: given a sample, the probability density function tells us how probable that sample is under the distribution.
Estimation is the problem where we have data and need to find the distribution of the data. MAP and MLE are different methods for doing this. Estimation is possible once we have the likelihood, but before that we review conditional probability and Bayes' theorem.
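The conditional probability $P(A \mid B)$ is conceptually the probability of $A$ given that $B$ has already occurred. For events $A, B \in \mathcal{F}$ with $P(B) > 0$, it is defined as
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$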
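To define the conditional probability for random variables $X$ and $Y$, we first define the joint probability distribution of $X$ and $Y$.
Before even that, we assume that both random variables have the same probability space $(\Omega, \mathcal{F}, P)$ as their domain.
Then, the joint probability distribution is defined as
$$P_{X,Y}(x, y) = P(\{\omega \in \Omega : X(\omega) = x \text{ and } Y(\omega) = y\})$$
where $P$ is the probability function from the common probability space $(\Omega, \mathcal{F}, P)$. From our definition of the above notation, we can write $P_{X,Y}(x, y)$ as
$$P_{X,Y}(x, y) = P(X = x, Y = y)$$
We also denote by $P_X(x)$ the values of $P_{X,Y}$ summed over all of $Y$, defined as
$$P_X(x) = \sum_{y} P_{X,Y}(x, y)$$
and we define $P_Y(y)$ as the probability in $Y$ only
$$P_Y(y) = \sum_{x} P_{X,Y}(x, y)$$
Then, the conditional probability for random variables $X$ and $Y$ is defined as
$$P_{X \mid Y}(x \mid y) = \frac{P_{X,Y}(x, y)}{P_Y(y)}$$
Another common notation for conditional probability is $P(X = x \mid Y = y)$ and thus, we can write
$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$$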
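If they are from different probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, we can take $\Omega = \Omega_1 \times \Omega_2$. Then, we can define the joint probability space between any two random variables.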
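To make the joint, marginal and conditional definitions concrete, here is a small sketch on two tosses of a fair coin. The choice of random variables (`X` = whether the first toss is heads, `Y` = total number of heads) and all function names are assumptions made for the example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Common sample space: two tosses of a fair coin, each outcome equally likely.
omega = list(product("HT", repeat=2))
weight = Fraction(1, len(omega))

def X(w):
    """1 if the first toss is heads, else 0."""
    return 1 if w[0] == "H" else 0

def Y(w):
    """Total number of heads in the two tosses."""
    return sum(1 for t in w if t == "H")

# Joint distribution P(X = x, Y = y), built by summing outcome weights.
joint = defaultdict(Fraction)
for w in omega:
    joint[(X(w), Y(w))] += weight

def marginal_x(x):
    """P(X = x): sum the joint over all values of Y."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def marginal_y(y):
    """P(Y = y): sum the joint over all values of X."""
    return sum(p for (_, yv), p in joint.items() if yv == y)

def conditional_x_given_y(x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    return joint.get((x, y), Fraction(0)) / marginal_y(y)

print(conditional_x_given_y(1, 1))  # first toss heads, given exactly one head overall
```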
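This leads to Bayes' theorem,
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
By substituting the definition of conditional probability, we can prove the above statement.
$$\frac{P(B \mid A) \, P(A)}{P(B)} = \frac{P(A \cap B)}{P(A)} \cdot \frac{P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)} = P(A \mid B)$$
Bayes' theorem for random variables is similar and stated as,
$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x) \, P(X = x)}{P(Y = y)}$$
and can be proved using a similar substitution as above,
$$\frac{P(Y = y \mid X = x) \, P(X = x)}{P(Y = y)} = \frac{P(X = x, Y = y)}{P(X = x)} \cdot \frac{P(X = x)}{P(Y = y)} = \frac{P(X = x, Y = y)}{P(Y = y)} = P(X = x \mid Y = y)$$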
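Bayes' theorem is easy to check numerically on the fair die. In this sketch the events `A` (an even roll) and `B` (a roll of at least 3) are arbitrary choices for illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair six-sided die

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

def cond(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)

A = {2, 4, 6}     # even roll
B = {3, 4, 5, 6}  # roll of at least 3

lhs = cond(A, B)                      # P(A | B) from the definition
rhs = cond(B, A) * prob(A) / prob(B)  # right-hand side of Bayes' theorem
print(lhs, rhs)
```

Both sides come out as the same exact fraction, which is what the substitution argument predicts.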
Given a set of observations, we want to estimate the distribution that best fits the data.
Instead of considering all possible distributions, we will consider the simpler problem of looking only at a single family of distributions. For example, for the length of a task, we can consider it to come from a normal distribution. For a neural network, we can fix the architecture and connections and only worry about the weights.
By considering a family of distributions, the distribution can be represented by a random variable over the parameters of the distribution. For example, for a normal distribution, we consider the random variable of the normal parameters, the mean and the standard deviation. Let $\Omega_\Theta$ be the set of parameters of the family of distributions, and let our random variable be $\Theta$. For neural networks, if there are $n$ weights in the neural network, $\Omega_\Theta = \mathbb{R}^n$.
Suppose we are using observations for inference. Let $\Omega_X$ be the set of observations, so that our observations come from $\Omega_X$. Let $X$ be our random variable and $x$ be our set of observations. Using $\Theta$ as the random variable that describes the parameters of our distribution, we want to find the function that gives us the parameter of the distribution from the data. The function would be something like $P(\Theta = \theta \mid X = x)$, which gives us the probability of the distribution parameter $\theta$ given the data.
However, this function does not make sense as a conditional probability for random variables, because our definition required the domains of $X$ and $\Theta$ to be the same. To fix this, we will consider the domain $\Omega = \Omega_X \times \Omega_\Theta$, and our random variables will be $\tilde{X}$ and $\tilde{\Theta}$. Given that our observation probability space is $(\Omega_X, \mathcal{F}_X, Q_X)$, we define $\tilde{X}(\omega_X, \omega_\Theta) = X(\omega_X)$. If the probability space for the distributions is $(\Omega_\Theta, \mathcal{F}_\Theta, Q_\Theta)$, then we define $\tilde{\Theta}(\omega_X, \omega_\Theta) = \Theta(\omega_\Theta)$. Thus, we can define the likelihood function as
$$L(\theta \mid x) = P(\tilde{X} = x \mid \tilde{\Theta} = \theta)$$
The likelihood function gives the probability of our given set of statistical observations $x$ under a set of statistical parameters $\theta$.
By using Bayes' theorem for random variables, using $P$ for the joint probability distribution, and discarding the tildes in our random variable names,
$$P(\Theta = \theta \mid X = x) = \frac{P(X = x \mid \Theta = \theta) \, P(\Theta = \theta)}{P(X = x)}$$
$P(\Theta = \theta)$ is called the prior distribution and is the distribution of our parameters. If we have no idea what it could be, we take it to be a uniform distribution, which is a constant value. In cases where we have prior knowledge about $\theta$, it is not constant.
The term $P(X = x)$ is the distribution of $X$, where
$$P(X = x) = \int P(X = x \mid \Theta = \theta) \, P(\Theta = \theta) \, d\theta$$
Because we integrate over all values of $\theta$, this term does not depend on $\theta$, and since our observation data is fixed, we can think of it as a constant.
Note that $P(X = x \mid \Theta = \theta)$ is the probability of the random variable $X$ given that the distribution has parameters $\theta$. The domain of $X$ is all of our observations: not one observation but all of them. In inference, we would get a single observation and return a prediction for it. Here, $P(X = x \mid \Theta = \theta)$ is a different function that takes all the observations and provides a probability for them.
When $\theta$ is the probability of the coin toss landing heads, then $x$ is the number of heads observed and $P(X = x \mid \Theta = \theta)$ is the binomial distribution of $k$ heads out of $n$ tosses.
When $\theta = (\mu, \sigma)$ are the parameters of the normal distribution and the observations $x = (x_1, \ldots, x_n)$ are independent, $P(X = x \mid \Theta = \theta)$ is simply the product of normal densities
$$P(X = x \mid \Theta = \theta) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
In deep learning, $x$ is all our training samples. In this case, the negative logarithm of $P(X = x \mid \Theta = \theta)$ is the cross-entropy, which we will discuss below. The basic idea is that the training samples produce one distribution, whereas the output of the neural network using $\theta$ as the parameters produces another distribution. Cross-entropy measures the difference between the two distributions.
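Let us go back to the coin toss where the probability of heads is 0.5. The number of heads in three tosses is a random variable $X$, and we have an observation of two heads and one tail. Then,
$$P(X = 2 \mid \Theta = 0.5) = \binom{3}{2} (0.5)^2 (0.5)^1 = 0.375$$
Since the likelihood, viewed as a function of $\theta$, is the same as the density function up to a constant factor,
$$L(\theta \mid x) = P(X = 2 \mid \Theta = \theta) = 3 \theta^2 (1 - \theta)$$
and thus, we have that
$$L(0.5 \mid x) = 0.375$$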
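Also note that,
$$L(0.6 \mid x) = P(X = 2 \mid \Theta = 0.6) = 3 (0.6)^2 (0.4) = 0.432$$
It seems that $\theta = 0.6$ is more likely than $\theta = 0.5$ from the data we have seen.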
In fact, we can graph the likelihood against the values of $\theta$. We see that it is maximized at $\theta = \frac{2}{3}$.
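Without a plot, the same picture can be seen by evaluating the likelihood on a grid of candidate values. This is a rough sketch: the grid resolution of 0.001 and the names `likelihood`, `grid` and `best_theta` are arbitrary choices.

```python
from math import comb

def likelihood(theta, heads=2, tosses=3):
    """Likelihood of theta given the observed number of heads (binomial model)."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)

# Evaluate the likelihood on an evenly spaced grid over [0, 1].
grid = [i / 1000 for i in range(1001)]
best_theta = max(grid, key=likelihood)

print(likelihood(0.5))  # likelihood at the fair-coin value
print(best_theta)       # grid point with the largest likelihood, close to 2/3
```

A grid search only finds the maximum up to the grid spacing, which is why the result is close to, rather than exactly, $\frac{2}{3}$.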
Estimation Using Likelihood
The distribution $P(\theta \mid x)$ gives us all the information to do inference. Note that we are omitting the random variable terms, and the above term should be $P(\Theta = \theta \mid X = x)$. Also note that we have already selected the family of distributions, and $\theta$ is the parameter that picks out the exact distribution from that family.
From Bayes' theorem and our argument that $P(x)$ is a constant, letting $c = \frac{1}{P(x)}$, we have that
$$P(\theta \mid x) = c \, L(\theta \mid x) \, P(\theta)$$
Here again we are dropping all of the random variables for cleaner notation.
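Since $P(\theta \mid x)$ is a distribution, we want a single parameter $\hat{\theta}$. We can take a statistical property like the mean or mode of the distribution to select a value as our estimate.
- Mode: $\hat{\theta} = \arg\max_\theta P(\theta \mid x)$ - MLE or MAP
- Mean: $\hat{\theta} = E[\theta \mid x] = \int \theta \, P(\theta \mid x) \, d\theta$
If we have no knowledge of $\theta$, then $P(\theta)$ is a uniform distribution and is constant. In this case, denoting $c' = c \, P(\theta)$,
$$P(\theta \mid x) = c' \, L(\theta \mid x)$$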
If $P(\theta)$ is not constant, we call it a prior distribution. This means that we have some guess or information about the distribution of $\theta$. The posterior will be an improvement over the current prior given the dataset $x$.
Taking the arg max without a prior (or with a uniform prior) is called maximum likelihood estimation (MLE). Taking it with a prior distribution gives what we call a maximum a posteriori (MAP) estimate.
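Given the likelihood function $L(\theta \mid x)$, the MLE for $\theta$ is given by
$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta \mid x)$$
For computational reasons, we can also define the function in log space.
$$\ell(\theta \mid x) = \log L(\theta \mid x)$$
Since the logarithm is strictly increasing, it has the same maximizer, and thus we will be maximizing the log-likelihood.
$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta \mid x)$$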
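One of the computational reasons is floating-point underflow: a product of many small densities rounds to zero, while the sum of their logarithms stays finite. The sketch below uses made-up data (2000 evenly spaced values) and made-up parameters purely to show the effect.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical data: 2000 evenly spaced values between 0 and 2.
data = [2 * i / 1999 for i in range(2000)]
mu, sigma = 1.0, 1.0

# Direct product of densities: every factor is below 0.4, so the product underflows.
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x, mu, sigma)

# Sum of log densities: the same quantity in log space, which stays representable.
log_likelihood = sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # a finite negative number
```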
MLE for the Example
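In the above example, we tossed a coin three times and observed two heads and one tail. We want to find the probability $\theta$ of landing heads based on the MLE principle.
As we calculated before,
$$L(\theta \mid x) = 3 \theta^2 (1 - \theta)$$
and thus, using the MLE principle, we will have
$$\hat{\theta}_{MLE} = \arg\max_\theta 3 \theta^2 (1 - \theta)$$
We can calculate $\frac{dL}{d\theta}$ and find the maximum where it is 0.
$$\frac{dL}{d\theta} = 6\theta - 9\theta^2 = 3\theta (2 - 3\theta) = 0$$
and solving for $\theta$, we have $\theta = 0$ or $\theta = \frac{2}{3}$. Since $L(0 \mid x) = 0$, the maximum is at $\theta = \frac{2}{3}$.
Thus, with MLE, $\hat{\theta}_{MLE} = \frac{2}{3}$, which is simply the mean number of heads per toss in the experiment.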
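The same calculation can be done for $k$ heads out of $n$ tosses, and it is a little easier in log space. The following is a sketch of that derivation, writing $\ell(\theta)$ for the log-likelihood and assuming $0 < k < n$ so that the maximum lies strictly inside $(0, 1)$:

$$\begin{aligned}
\ell(\theta) &= \log \binom{n}{k} + k \log \theta + (n - k) \log (1 - \theta) \\
\frac{d\ell}{d\theta} &= \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0 \\
k (1 - \theta) &= (n - k) \, \theta \\
\hat{\theta}_{MLE} &= \frac{k}{n}
\end{aligned}$$

With $n = 3$ and $k = 2$ this gives $\frac{2}{3}$, matching the result above.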
MLE for normal distributions
Suppose that we measure the length of a task and record the values 1.3, 1.2 and 1.4.
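To calculate the MLE estimate, we first specify that the length of a task is normally distributed. We want to use MLE to find the mean $\mu$ and the standard deviation $\sigma$.
The density of the normal distribution is given by
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
and thus, the density of the measurements 1.3, 1.2 and 1.4, denoted as $x_1$, $x_2$ and $x_3$ and assumed independent, is given by
$$f(x_1, x_2, x_3 \mid \mu, \sigma) = \prod_{i=1}^{3} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The likelihood function is the same and thus,
$$L(\mu, \sigma \mid x_1, x_2, x_3) = \prod_{i=1}^{3} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
And, the log likelihood is given as the following:
$$\ell(\mu, \sigma \mid x_1, x_2, x_3) = -3 \log \sigma - \frac{3}{2} \log(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{3} (x_i - \mu)^2$$
To get the MLE estimate for $\mu$, we take the derivative with respect to $\mu$ first, which gives us,
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{3} (x_i - \mu)$$
Setting the derivative to 0 and solving for $\mu$, we have
$$\hat{\mu} = \frac{x_1 + x_2 + x_3}{3} = \frac{1.3 + 1.2 + 1.4}{3} = 1.3$$
And, thus $\hat{\mu}$ is just the average.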
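To get the MLE for $\sigma$, we take the derivative of the log likelihood $\ell$ with respect to $\sigma$, and this gives us,
$$\frac{\partial \ell}{\partial \sigma} = -\frac{3}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{3} (x_i - \mu)^2$$
Setting the derivative to 0 and solving for $\sigma$ we have,
$$\hat{\sigma}^2 = \frac{1}{3} \sum_{i=1}^{3} (x_i - \hat{\mu})^2 = \frac{0.02}{3} \approx 0.0067$$
which is just the sample variance (the version that divides by the number of samples rather than by one less).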
Generalizing for any number of samples
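In the general case with $n$ samples $x_1, \ldots, x_n$, we can calculate
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2}$$
Thus, the MLE estimate for the normal distribution is the sample mean and the sample standard deviation (computed with divisor $n$).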
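For the three task lengths above, the closed-form estimates can be computed with Python's standard library. This is a minimal sketch: `statistics.pstdev` is used because it divides by $n$, which matches the MLE, and the helper `log_likelihood` is an illustrative name.

```python
import math
from statistics import fmean, pstdev

samples = [1.3, 1.2, 1.4]

# Closed-form MLE: sample mean and population standard deviation (divisor n).
mu_hat = fmean(samples)
sigma_hat = pstdev(samples)

def log_likelihood(mu, sigma, data):
    """Log-likelihood of independent normal samples with parameters mu and sigma."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - squared_error / (2 * sigma**2)

print(mu_hat, sigma_hat)
print(log_likelihood(mu_hat, sigma_hat, samples))
```

Nudging either parameter away from the estimate should lower the log-likelihood, which is a quick sanity check that the stationary point is a maximum.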
MLE for Deep Learning
Given two probability distributions $p$ and $q$, the cross entropy of $q$ with respect to $p$ is given by
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
and for the continuous case,
$$H(p, q) = -\int p(x) \log q(x) \, dx$$
Cross Entropy for Multi Class Classification
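Let us take the case of multi-class classification. Here, the network output is a classification label into one of $K$ classes.
In our case, we use the cross entropy $H(p, q)$ where $p$ is the ground truth distribution (actual training data) and $q$ is the distribution output from the neural network.
Given one data point $x_i$, suppose the ground truth label is $y_i$. Let $q_\theta(k \mid x_i)$ be the output of the neural network for the $k$-th class. Since we have the ground truth that $y_i$ is the correct label, the probabilities of all other labels are 0. Then, the cross entropy only involves $q_\theta(y_i \mid x_i)$. For the entire data set of $N$ points we have the cross-entropy as
$$-\sum_{i=1}^{N} \log q_\theta(y_i \mid x_i)$$
This is exactly the negative log likelihood of the training labels under the network with parameters $\theta$, so minimizing the cross-entropy is the same as computing the MLE for $\theta$.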
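The link between cross-entropy and likelihood can be seen on a toy batch. In this sketch the logits and labels are made up, and a softmax is assumed as the way the network turns its raw outputs into class probabilities.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross entropy of q with respect to p, skipping terms where p is zero."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical network outputs for 3 data points and K = 3 classes.
logits = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3], [-0.5, 0.0, 2.5]]
labels = [0, 1, 2]  # ground truth class index for each data point

total_cross_entropy = 0.0
negative_log_likelihood = 0.0
for z, y in zip(logits, labels):
    q = softmax(z)
    one_hot = [1.0 if k == y else 0.0 for k in range(len(q))]
    total_cross_entropy += cross_entropy(one_hot, q)
    negative_log_likelihood += -math.log(q[y])

print(total_cross_entropy, negative_log_likelihood)
```

Because the ground truth distribution is one-hot, only the term for the correct label survives, so the two totals agree.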
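Given that we have a prior distribution $P(\theta)$, the posterior becomes
$$P(\theta \mid x) = c \, L(\theta \mid x) \, P(\theta)$$
and taking the mode of the distribution, we have
$$\hat{\theta}_{MAP} = \arg\max_\theta L(\theta \mid x) \, P(\theta)$$
and with the logarithm version, where $\ell = \log L$, we have
$$\hat{\theta}_{MAP} = \arg\max_\theta \left[ \ell(\theta \mid x) + \log P(\theta) \right]$$
Note that $P(\theta)$ is the probability of the parameter being $\theta$. This is another function that has to be provided for the estimation computation.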
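We will be adding to the example above for the normal distribution. Assume that we have a prior distribution for the mean, $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Note that we only have a prior for the mean $\mu$ and not for $\sigma$.
The prior on the mean is itself another normal distribution, with its own parameters $\mu_0$ and $\sigma_0$ that are not related to the $\mu$ and $\sigma$ that we are trying to estimate. For $n$ samples, the two terms of the log posterior are
$$\ell(\mu, \sigma \mid x) = -n \log \sigma - \frac{n}{2} \log(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2, \qquad \log P(\mu) = -\log \sigma_0 - \frac{1}{2} \log(2\pi) - \frac{(\mu - \mu_0)^2}{2\sigma_0^2}$$
Taking the derivative of both of the above terms with respect to $\mu$, we have
$$\frac{\partial}{\partial \mu} \left[ \ell(\mu, \sigma \mid x) + \log P(\mu) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) - \frac{\mu - \mu_0}{\sigma_0^2}$$
Setting the derivative to 0 and solving for $\mu$ we have
$$\hat{\mu}_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \sigma_0^2 + \sigma^2}$$
We see that the prior distribution has affected the estimated MAP mean: it is a weighted average of the sample mean and the prior mean $\mu_0$.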
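To see the size of the effect, here is a small numeric sketch for the three task lengths. The prior values `mu_0 = 1.0` and `sigma_0 = 0.1` are hypothetical, and $\sigma$ is assumed fixed at its MLE value; neither choice comes from the example above.

```python
from statistics import fmean, pvariance

samples = [1.3, 1.2, 1.4]
n = len(samples)

# Assumption: treat sigma^2 as known and fixed at its MLE value (divisor n).
sigma_sq = pvariance(samples)

# Hypothetical normal prior on the mean.
mu_0, sigma_0 = 1.0, 0.1

mu_mle = fmean(samples)

# MAP estimate of the mean under a normal likelihood and a normal prior on the mean.
mu_map = (sigma_0**2 * sum(samples) + sigma_sq * mu_0) / (n * sigma_0**2 + sigma_sq)

# Derivative of the log posterior with respect to mu, which should vanish at mu_map.
gradient = sum(x - mu_map for x in samples) / sigma_sq - (mu_map - mu_0) / sigma_0**2

print(mu_mle, mu_map)
```

The MAP mean lands between the prior mean and the sample mean, and it moves toward the sample mean as more data arrives or as `sigma_0` grows.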
This article gives a quick introduction to MLE and MAP and how they are derived from the Bayes theorem for random variables. These methods are general methods of estimation and using them still involves finding the right likelihood function for the problem that can give the best estimate of its parameters.