Likelihoods are:
the link between: Data -> Probability Models -> Inference
the most important tool, in practice, for fitting models … i.e. when we estimate parameters we like to Maximize the Likelihood,
Note: there are good theoretical reasons for this - MLE’s are proven to be CONSISTENT and have the LOWEST VARIANCE of any other estimator
provide a flexible / general framework for comparing / selecting models, via Likelihood Ratio Tests, Information Criteria (AIC, BIC, etc.)
And, very strangely:
They are almost NEVER TAUGHT in undergraduate statistics classes!
Though they are EASY TO UNDERSTAND.
We are basically peeking under the hood of a lot of statistical machinery.
Likelihoods turn Probabilities on their heads!
Let’s say the average height of a human male is \[X \sim {\cal N}(\mu_0 = 6, \sigma_0 = 0.5)\] This is a probability model, i.e. it tells us that: \[P(x < X < x+dx) = f(x|\mu, \sigma)dx\] where \(f(x|\mu, \sigma)\), is the density function of the normal distribution with \(\mu\) and \(\sigma\).
This probability statement means: If you take (for example) a single 7 foot tall person, you can say with certainty that:
\[f(7|\mu=6, \sigma=0.5) = 0.11 1/foot\]
dnorm(7,6,.5)
## [1] 0.1079819
But the per-foot “probability” that he’s around 7 feet tall is
\[f(7|\mu=6, \sigma=0.5) = 0.11 1/foot\]
dnorm(7,6,.5)
## [1] 0.1079819
Whats a “1/foot”? Convert it to “per inch”:
dnorm(7,6,.5)/12
## [1] 0.008998494
Just under 1% probability that a person is within one inch of 7 feet.
Say, you don’t know what those parameters are, but you are curious! So you collect one data point: a single man, and he is 7 feet tall.
\[ X_1 = \{7\}\]
To “flip this on its head”“, we ask a new question: What is the likelihood that the mean and standard deviation of human heights are 6 and 0.5, GIVEN that we observed a man who is \(X_1 = 7\) feet tall?
We write this as: \[ {\cal L}(\mu, \sigma | x).\]
It is “flipped on its head” because it as a function of the parameters given an observation, rather than as the probability of an observation given some parameters. But, by definition, the likelihood is numerically equal to the probability:
\[L_0 = {\cal L}(\mu_0, \sigma_0 | X_1) = f(X_1|\mu_0, \sigma_0) = 0.11\]
In colloquial language, a “likelihood” is sort of similar to a “probability” - i.e. the statement: “Event A is likely.” seems to have similar meaning to the statement: “Event A has high probability.”
In statistics, the “likelihood” and the “probability” are, in fact EQUAL - but there is an inversion of what is considered “known” and “unknown: - A probability tells you something about a random event given parameter values.
- The likelihood tells you something about parameter values given observations.
In practice - statistical inference is about having the data and guessing the parameters. Thus the concept of Likelihoods is extremely useful and natural.
In contrast to Probabilities, the actual raw value of the likelihood is almost NEVER of interest - We ONLY care about its value when compared with likelihoods of different models.
for example, we can compare the likelihood of the parameters \(\mu=6\) and \(\sigma=0.5\), GIVEN the observation \(X_1 = 7\), with an alternative probability model … say, \(\mu_1 = 7\) and \(\sigma_1 = 0.001\). That likelihood is given by:
\[ L_1 = {\cal L}(\mu_1, \sigma_1 | X_1 = 7) = f(7 | 7, .001)\]
(L1 <- dnorm(7,7,.001))
## [1] 398.9423
\(L_1\) is, clearly, much much greater than our original likelihood, i.e. this set of parameters is much likelier than the original set of parameters. Indeed, the ratio \(L_1 / L_0 = 3693\) ….
Let’s sample 5 individuals from a professional basketball arena, and record their heights: 6.6, 6.8, 7.0, 7.2, 7.4 feet.
At this point, I might have reasonable suspicion that our null-model might not be appropriate for this subset of humans.
Visualize these data-points against our null-model:
mu0 <- 6
sigma0 <- 0.5
X <- c(6.6,6.8,7.0,7.2,7.4)
curve(dnorm(x, mu0, sigma0), xlim=c(4,8))
points(X, dnorm(X,mu0,sigma0), pch=19, col=2)
points(X, dnorm(X,mu0,sigma0), type="h", col=2)
Now, we can compute the likelihood of the null parameters given all of these observations.The likelihood (a joint likelihood) is just the product of the likelihood for each of these points, because it is equal to the joint density distribution … this is because the Probability of \(n\) independent events is the product of the probabilities:
\[{\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]
(L0 <- prod(dnorm(X, mu0, sigma0)))
## [1] 6.596596e-06
This is a very small number, but again - it is meaningless without having a comparison. Let’s compute the joint likelihood of our alternative model:
mu1 <- 7
sigma1 <- 0.001
(L1 <- prod(dnorm(X, mu1, sigma1)))
## [1] 0
To machine power, the second model is MUCH LESS likely than the first model! Illustrate this:
curve(dnorm(x, mu1, sigma1), xlim=c(6.2,7.8), n=1000)
points(X, dnorm(X,mu1,sigma1), pch=19, col=2)
points(X, dnorm(X,mu1,sigma1), type="h", col=2)
You see that the points away from 7 have extremely low probability, which brings the likelihood of this model way down.
So - how do we find the parameters that maximize the likelihood?
These parameters are called the Maximum Likelihood Estimators (MLE’s), and are the “best” parameters in that they are the most precise. Remember, every estimate is a random variable and therefore comes with some variance. It can be shown that MLE’s have the smallest of those possible variances:
\[ \{ \widehat\theta_\mathrm{mle}\} \subseteq \{ \underset{\theta\in\Theta}{\operatorname{arg\,max}}\ {\cal L}(\theta\,|\,X_1,\ldots,X_n) \}\]
Translating to English:
the MLE estimators of parameters \(\theta\) are those values of \(\theta\) for which the likelihood function is maximized.
Let’s look at a range of possible values for \(\widehat{\mu}\) and \(\widehat{\sigma}\). We can do this pretty efficiently in R:
First, pick some values to explore:
mus <- seq(6,8,.05)
sigmas <- seq(.1,1,.02)
Now, write a function that computes the likelihood (which, as we recall, is a function of \(\mu\) and \(\sigma\), and “assumes” \(\bf X\))
Likelihood <- function(mu, sigma)
prod(dnorm(X, mu, sigma))
And compute this likelihood for all the combinations of \(\mu\) and \(\sigma\) above, using the mighty outer()
function:
L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))
Note (as a technical aside) that to get the Likelihood()
function to work within outer()
, it had to be “vectorized”. Happily, there is a function (Vectorize()
) that does just that.
Visualize our likelihood profile over those ranges of \(\mu\) and \(\sigma\)
Clearly, there is a sharp peak in the likelihood around \(\widehat{\mu} = 7\) and somewhere just under \(\widehat{\sigma} = 0.3\). From our limited sets of values, we can find the values at the maximum:
(max.indices <- which(L.matrix == max(L.matrix), arr.ind = TRUE))
## row col
## [1,] 21 10
(mu.hat <- mus[max.indices[1]])
## [1] 7
(sigma.hat <- sigmas[max.indices[2]])
## [1] 0.28
And the (usually irrelevant) value of the likelihood is
max(L.matrix)
## [1] 0.4580006
We don’t need to do this search ourselves - that’s why we invented electronic computing devices. The powerhouse function for optimizing functions in R is optim()
, but it takes some getting used to. The (minimal) syntax is:
optim(p, FUN, ...)
where:
p
is a required vector of your initial guess for the parameters.FUN(p, ...)
is a function that takes as its first argument a vector of parameters p
...
refers to other arguments passed to FUN
. Most importantly, this will be data!See how I make it work below:
optim(c(6,1), function(p) -Likelihood(p[1], p[2]))
## $par
## [1] 7.0000003 0.2828426
##
## $value
## [1] -0.4582359
##
## $counts
## function gradient
## 89 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
So - lots of output, but the key numbers we’re interested in are the top two, under $par
. These are: \(\widehat{\mu} = 7.000\) and \(\widehat{\sigma} = 0.2828\). Good! it is what we expected.
The $value
is very close to the (negative of the) maximum likelihood that we “hand-calculated”.
PlotLikelihood <- function(mu, sigma, X, ...)
{curve(dnorm(x, mu, sigma), ...)
points(X, dnorm(X,mu,sigma), pch=19, col=2)
points(X, dnorm(X,mu,sigma), type="h", col=2)}
par(mfrow=c(1,3), bty="l")
PlotLikelihood(mu = 6, sigma=0.5, X, xlim=c(4,8))
title("Null Model")
PlotLikelihood(mu = 7, sigma=0.001, X, xlim=c(6.2,7.8), n = 1000)
title("Model 1")
pars <- optim(c(6,1), function(p) -Likelihood(p[1], p[2]))$par
PlotLikelihood(mu = pars[1], sigma=pars[2], X, xlim=c(5,9))
title("MLE Model")
You can get a feel for the constraints on the MLE model: if \(\sigma\) were smaller, the contribution of the outlying points would become much smaller and bring the likelihood down; if \(\sigma\) were any larger, then the flattening would bring down too far the contribution of the central points.
Now … you might think that’s a whole lot of crazy to estimate a simple MEAN and VARIANCE.
You may have guessed that the best estimates are the sample mean:
\[\widehat{\mu} = \overline{X}\] and the sample standard deviation:
\[\widehat{\sigma^2} = s_x^2 = {1\over n-1}\sum (X_i - \overline{X})^2\]
The mean estimate matches, but the sample standard deviation doesn’t quite:
sd(X)
## [1] 0.3162278
This suggests that the MLE of the variance - for all of its great qualities - is at least somewhat biased (remember: \(E(s_x^2) = \sigma^2\)).
Can you recognize what analytical formula gives the MLE of \(\sigma\)?
Note: this is a special case where the MLE is actually computable by hand, but in general it is not - and best to obtain numerically, as the examples that follow show.
For a combination of practical and theoretical reasons, what we actually maximize is not the Likelihood but the log-Likelihood. The maximum will be in the same place for both functions, but the latter is MUCH easier to work with.
Likelihood: \[{\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]
Log-likelihood: \[{\cal l}(\mu_0, \sigma_0 | {\bf X}) = \log({\cal L}) = \sum_{i=1}^n \log(f(X_i|\mu_0, \sigma_0)) \]
Sums are much easier to manipulate with algebra and handle computaitonally. Also, they help turn very very very small numbers and very very very large numbers to very very very ordinary numbers.
The number of particles in the universe is about 10^80 … its log is 184. The ratio of the mass of an electron to the size of the sun is 10^-60 … its log is -134.
L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))
image(mus, sigmas, L.matrix, col=topo.colors(1000))
contour(mus, sigmas, L.matrix, add=TRUE, col="white")
ll.matrix <- outer(mus, sigmas, Vectorize(logLikelihood))
image(mus, sigmas, ll.matrix, col=topo.colors(1000))
contour(mus, sigmas, ll.matrix, add=TRUE, nlevels=100, col=rgb(0,0,0,.2))
The peak of the (log)-likelihood surface gives you point estimates of parameters.
Likelihood theory provides an additional enormously handy (asymptotically correct) result with respect to standard errors around the estimates. Specifically (in words): The variance around the point estimates is equal to the negative reciprocal of the second derivative of the log-likelihood at the maximum.
This is actually very intuitive! The sharper the peak, the MORE NEGATIVE the second derivative, the SMALLER the (POSITIVE) variance. The flatter the peak, the LESS NEGATIVE the second derivative, i.e. the LARGER the variance.
\[\Sigma(\theta) = {\cal I}(\theta)^{-1}\]
Often (e.g. in these examples) we have several parameters here, so the jargon is fancier, but the idea is the same:
the Hessian is an n-dimensional second derivative (a matrix)
the Fisher Information (\(\cal{I}\)) is the Hessian of the log likelihood.
the inverse is the n-dimensional equivalent of “reciprocal”.
\(\Sigma\) is the variance-covariance matrix of the parameter estimates.
Ask optim()
to compute the Hessian:
logLikelihood <- function(mu, sigma)
sum(log(dnorm(X, mu, sigma)))
(param.fit <- optim(c(1,1), function(p) -logLikelihood(p[1], p[2]), hessian=TRUE))
## $par
## [1] 6.9999642 0.2828735
##
## $value
## [1] 0.7803712
##
## $counts
## function gradient
## 97 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] 62.48637509 0.01580303
## [2,] 0.01580303 124.94594137
The inverse of a matrix (in R) is solve(M)
:
(Sigma <- solve(param.fit$hessian))
## [,1] [,2]
## [1,] 1.600349e-02 -2.024104e-06
## [2,] -2.024104e-06 8.003462e-03
The square root of the diagonal gives standard errors
(se <- sqrt(diag(Sigma)))
## [1] 0.12650490 0.08946207
And the confidence intervals are just:
cbind(hat = param.fit$par, CI.low = param.fit$par - 1.96*se, CI.high = param.fit$par + 1.96*se)
## hat CI.low CI.high
## [1,] 6.9999642 6.7520146 7.2479138
## [2,] 0.2828735 0.1075279 0.4582192
Is the engine under a great many hoods of R functions.
John Nash: creator (and skeptic) of optim
see: http://www.ibm.com/developerworks/library/ba-optimR-john-nash/
Model 0 and Model 1 are NESTED
(i.e. Model 0 is a special case of Model 1) with \(k_0\) and \(k_1\) parameters.
Compute MLE’s: \(\widehat{\theta_0}\) and \(\widehat{\theta_1}\)
important: the data \(X\) must be identical!
Likelihood Ratio Test Statistic: \[\Lambda = -2 \log \left( \frac{L_0}{L_1} \right) = 2 (l_1 - l_0)\]
under Null hypothesis (i.e. Model 1) has distribution \(\Lambda \sim Chi^2_{d.f. = k_1 - k_0}\)
Competing models: \[M0: Y_i = \beta_0 + \epsilon_i\] \[M1: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\] where \(\epsilon\) are iid Gaussian.
Defining (negative) log-likelihood functions
LL0 <- function(par = c(beta0, sigma), X, Y){
-sum(dnorm(Y, mean = par['beta0'], sd = par['sigma'], log=TRUE) )
}
LL1 <- function(par = c(beta0, beta1, sigma), X, Y){
Y.hat <- par['beta0'] + par['beta1']*X
-sum(dnorm(Y, mean = Y.hat, sd = par['sigma'], log=TRUE) )
}
Obtaining estimates
fit0 <- optim(c(beta0=0, sigma=1), LL0, X = X, Y = Y)
fit1 <- optim(c(beta0 = 0, beta1 = 0, sigma = 1), LL1, X = X, Y = Y)
Performing test:
LRT <- 2*(-fit1$value + fit0$value)
1-pchisq(LRT, 1)
## [1] 0.1149729
Compare Akaike Information Criterion (AIC):
\[ AIC = - 2 \log(\cal L) + 2 k\]
c(AIC0 = 2*(fit0$value + 2*2),
AIC1 = 2*(fit1$value + 2*3))
## AIC0 AIC1
## 60.47236 61.98786
Or … Bayesian Information Criterion (BIC) \[ BIC = - 2 \log(\cal L) + k \log(n) \]
c(BIC0 = 2*(fit0$value + 2 * log(length(X))),
BIC1 = 2*(fit1$value + 3 * log(length(X))))
## BIC0 BIC1
## 64.45529 67.96225
Which? Why? - a complicated debate, different underlying assumptions But generally - if you want to be more parsimonious (i.e. protect from overfitting) BIC is a better bet.
If you have data …
and you can write a probability model …
in terms of parameters …
no matter how strange or arbitrary seeming …
then there’s a good chance you can estimate those parameters …
and compare competing models …
and probably obtain confidence intervals.
==========================
Likelihood ratio test = pregnancy test
==========================
\(p\)-value of birth: 0.08
==========================
\(p\)-value of birth: 1e-11