Likelihood Theory And Review

Elie Gurarie
August 22, 2016







Why I'm Going Over This

Likelihoods are:

  • the link between: Data -> Probability Models -> Inference
  • the most important tool, in practice, for fitting models … i.e. when we estimate parameters we like to Maximize the Likelihood,
    • Note: there are good theoretical reasons for this - MLE's are proven to be CONSISTENT and have the LOWEST VARIANCE of any other estimator
  • provide a flexible / general framework for comparing / selecting models, via Likelihood Ratio Tests, Information Criteria (AIC, BIC, etc.)

And, very strangely:

  • They are NEVER TAUGHT in most undergraduate Statistics classes!
  • Though they are EASY TO UNDERSTAND.

We are basically peeking under the hood of a lot of statistical machinery.

Very briefly:

Likelihoods turn Probabilities on their heads!

Probability statements

Let's say the average height of a human male is \[ X \sim {\cal N}(\mu_0 = 6, \sigma_0 = 0.5) \] This is a probability model, i.e. it tells us that: \[ P(x < X < x+dx) = f(x|\mu, \sigma)dx \] where \( f(x|\mu, \sigma) \), is the density function of the normal distribution with \( \mu \) and \( \sigma \).

This probability statement means: If you take (for example) a single 7 foot tall person, you can say with certainty that:

  1. the probability that a person is exactly 7 feet tall is exactly 0
  2. but the per-foot “probability” that he's around 7 feet tall is

\[ f(7|\mu=6, \sigma=0.5) = 0.11 1/foot \].

dnorm(7,6,.5)
[1] 0.1079819

Probability statements

But the per-foot “probability” that he's around 7 feet tall is

\[ f(7|\mu=6, \sigma=0.5) = 0.11 1/foot \]

dnorm(7,6,.5)
[1] 0.1079819

Whats a “1/foot”? Convert it to “per inch”:

dnorm(7,6,.5)/12
[1] 0.008998494

Just under 1% probability that a person is within one inch of 7 feet.

Defining a Likelihood

Say, you don't know what those parameters are, but you are curious! So you collect one data point: a single man, and he is 7 feet tall.

\[ X_1 = \{7\} \]

So, to “flip this on its head”“, we ask a new question: What is the likelihood that the mean and standard deviation of human heights are 6 and 0.5, respectively, GIVEN that we observed a man who is \( X_1 = 7 \) feet tall?

We write this as: \[ {\cal L}(\mu, \sigma | x). \]

It is "flipped on its head” because it as a function of the parameters given an observation, rather than as the probability of an observation given some parameters. But, by definition, they are equal:

\[ L_0 = {\cal L}(\mu_0, \sigma_0 | X_1) = f(X_1|\mu_0, \sigma_0) = 0.11 \]

Likelihoods and Probabilities

In colloquial language, a “likelihood” is sort of similar to a “probability” - i.e. the statement: “Event A is likely.” seems to have similar meaning to the statement: “Even A has high probability.”

In statistics, the “likelihood” and the “probability” are, in fact EQUAL - but there is an inversion of what is considered “known” and “unknown:

  • A probability tells you something about a random event given parameter values.
  • The likelihood tells you something about parameter values given observations.

In practice - statistical inference is about having the data and guessing the parameters. Thus the concept of Likelihoods is extremely useful and natural.

Using the Likelihood concept

In contrast to Probabilities, the actual raw value of the likelihood is almost NEVER of interest - We ONLY care about its value when compared with likelihoods of different models.

For example, we can compare the likelihood of the parameters \( \mu=6 \) and \( \sigma=0.5 \), GIVEN the observation \( X_1 = 7 \), with an alternative probability model … say, \( \mu_1 = 7 \) and \( \sigma_1 = 0.001 \). That likelihood is given by:

\[ L_1 = {\cal L}(\mu_1, \sigma_1 | X_1 = 7) = f(7 | 7, .001) \]

(L1 <- dnorm(7,7,.001))
[1] 398.9423

\( L_1 \) is, clearly, much much greater than our original likelihood, i.e. this set of parameters is much likelier than the original set of parameters. Indeed, the ratio \( L_1 / L_0 = 3693 \) ….

  • so we can say that model 2 is more than 3000x more likely than Model 1!
  • (Of course, the likelihood ratio test for these two likelihoods has zero power, because we only have one data point. But we can still quantify their relative likelihoods.)

Joint Likelihood

Let's sample 5 individuals from a professional basketball arena, and record their heights: 6.6, 6.8, 7.0, 7.2, 7.4 feet.

At this point, I might have reasonable suspicion that our null-model might not be appropriate for this subset of humans.

Visualize these data-points against our null-model:

mu0 <- 6
sigma0 <- 0.5
X <- c(6.6,6.8,7.0,7.2,7.4)
curve(dnorm(x, mu0, sigma0), xlim=c(4,8))
points(X, dnorm(X,mu0,sigma0), pch=19, col=2)
points(X, dnorm(X,mu0,sigma0), type="h", col=2)

plot of chunk unnamed-chunk-7

Joint Likelihood II

Now, we compute the likelihood of the null parameters given ALL observations. The joint likelihood is just the product of each likelihood/probability because it is equal to the joint density distribution (which, for independent random variables, is the product of the densities):

\[ {\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]

(L0 <- prod(dnorm(X, mu0, sigma0)))
[1] 6.596596e-06

This is a very small number, but again - meaningless without a comparison. The joint likelihood of the alternative model:

mu1 <- 7
sigma1 <- 0.001
(L1 <- prod(dnorm(X, mu1, sigma1)))
[1] 0

To machine power, the second model is much LESS likely than the first model! Illustrate this:

curve(dnorm(x, mu1, sigma1), xlim=c(6.2,7.8), n=1000)
points(X, dnorm(X,mu1,sigma1), pch=19, col=2)
points(X, dnorm(X,mu1,sigma1), type="h", col=2)

plot of chunk unnamed-chunk-10

The points away from 7 have extremely low probability, which brings the likelihood of this model way down.

Defining the Maximum Likelihood Estimator

So - how do we find the parameters that maximize the likelihood?

These parameters are called the Maximum Likelihood Estimators (MLE's), and are the “best” parameters in that they are the most precise. Remember, every estimate is a random variable and therefore comes with some variance. It can be shown that MLE's have the smallest of those possible variances:

\[ \{ \widehat\theta_\mathrm{mle}\} \subseteq \{ \underset{\theta\in\Theta}{\operatorname{arg\,max}}\ {\cal L}(\theta\,|\,X_1,\ldots,X_n) \} \]

Translating to English:

the MLE estimators of parameters \( \theta \) are those values of \( \theta \) for which the likelihood function is maximized.

The Likelihood profile

Let's look at a range of possible values for \( \widehat{\mu} \) and \( \widehat{\sigma} \). We can do this pretty efficiently in R:

First, pick some values to explore:

mus <- seq(6,8,.05)
sigmas <- seq(.1,1,.02)

Now, write a function that computes the likelihood (which, as we recall, is a function of \( \mu \) and \( \sigma \), and “assumes” \( \bf X \))

Likelihood <- function(mu, sigma)
  prod(dnorm(X, mu, sigma))

And compute this likelihood for all the combinations of \( \mu \) and \( \sigma \) above, using the mighty outer() function:

L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))

Note (as a technical aside) that to get the Likelihood() function to work within outer(), it had to be “vectorized”. Happily, there is a function (Vectorize()) that does just that.

The Likelihood profile

Visualize our likelihood profile over those ranges of \( \mu \) and \( \sigma \)

image(mus, sigmas, L.matrix, col=topo.colors(100))
contour(mus, sigmas, L.matrix, add=TRUE)
persp(mus, sigmas, L.matrix, axes=TRUE, shade=TRUE, bor=NA)

plot of chunk unnamed-chunk-14

Visualizing the Likelihood profile

plot of chunk unnamed-chunk-15

Clearly, there is a sharp peak in the likelihood around \( \widehat{\mu} = 7 \) and somewhere just under \( \widehat{\sigma} = 0.3 \). From our limited sets of values, we can find the values at the maximum:

(max.indices <- which(L.matrix == max(L.matrix), arr.ind = TRUE))
     row col
[1,]  21  10
(mu.hat <- mus[max.indices[1]])
[1] 7
(sigma.hat <- sigmas[max.indices[2]])
[1] 0.28

And the (usually irrelevant) value of the likelihood

max(L.matrix) = 0.4580006

Numerically finding the MLE

We don't need to do this search ourselves - that's why we invented electronic computing devices. The powerhouse function for optimizing functions in R is optim(), but it takes some getting used to. The (minimal) syntax is:

optim(p, FUN, ...)

where:

  1. p is your initial guess for the parameters
  2. FUN(p, ...) is a function that takes as its first argument a vector of parameters p
  3. The ... refers to any other object that is passed to FUN. Most importantly, this will be data!

See how I make it work below:

optim(c(6,1), function(p) -Likelihood(p[1], p[2]))
$par
[1] 7.0000003 0.2828426

$value
[1] -0.4582359

$counts
function gradient 
      89       NA 

$convergence
[1] 0

$message
NULL

So - lots of output, but the key numbers we're interested in are the top two, under $par. These are: \( \widehat{\mu} = 7.000 \) and \( \widehat{\sigma} = 0.2828 \). This is good, it is what we basically expected. The $value is very close to the (negative of the) maximum likelihood that we “hand-calculated”.

Comparing models side by side

PlotLikelihood <- function(mu, sigma, X, ...)
{curve(dnorm(x, mu, sigma), ...)
  points(X, dnorm(X,mu,sigma), pch=19, col=2)
  points(X, dnorm(X,mu,sigma), type="h", col=2)}

par(mfrow=c(1,3), bty="l")

PlotLikelihood(mu = 6, sigma=0.5, X, xlim=c(4,8))
title("Null Model")

PlotLikelihood(mu = 7, sigma=0.001, X, xlim=c(6.2,7.8), n = 1000)
title("Model 1")

pars <- optim(c(6,1), function(p) -Likelihood(p[1], p[2]))$par
PlotLikelihood(mu = pars[1], sigma=pars[2], X, xlim=c(5,9))
title("MLE Model")

plot of chunk unnamed-chunk-20

You can get a feel for the constraints on the MLE model: if \( \sigma \) were smaller, the contribution of the outlying points would become much smaller and bring the likelihood down; if \( \sigma \) were any larger, then the flattening would bring down too far the contribution of the central points.

Comparing with Method of Moment Estimators (MME's)

Now … you might think that's a whole lot of crazy to estimate a simple MEAN and VARIANCE.

You may have guessed that the best estimates are the sample mean:

\[ \widehat{\mu} = \overline{X} \] and the sample standard deviation:

\[ \widehat{\sigma^2} = s_x^2 = {1\over n-1}\sum (X_i - \overline{X})^2 \]

The mean estimate matches, but the sample standard deviation doesn't quite: </small?

sd(X)
[1] 0.3162278

This suggests that the MLE of the variance - for all of its great qualities - is at least somewhat biased (remember: \( E(s_x^2) = \sigma^2 \)).

Can you recognize what analytical formula gives the MLE of \( \sigma \)?

Note: this is a special case where the MLE is actually computable by hand, but in general it is not - and best to obtain numerically, as the examples that follow show.

Log-likelihoods

For a combination of practical and theoretical reasons, what we actually maximize is not the Likelihood but the log-Likelihood. The maximum will be in the same place for both functions, but the latter is MUCH easier to work with.

Likelihood: \[ {\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]

Log-likelihood: \[ {\cal l}(\mu_0, \sigma_0 | {\bf X}) = \log({\cal L}) = \sum_{i=1}^n \log(f(X_i|\mu_0, \sigma_0)) \]

Sums are much easier to manipulate with algebra and handle computaitonally. Also, they help turn very very very small numbers and very very very large numbers to very very very ordinary numbers.

The number of particles in the universe is about 1080 … its log is 184. The ratio of the mass of an electron to the size of the sun is 10-60 … its log is -134.

Log-likelihoods: illustrated

L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))
image(mus, sigmas, L.matrix, col=topo.colors(1000))
contour(mus, sigmas, L.matrix, add=TRUE, col="white")

plot of chunk unnamed-chunk-22

ll.matrix <- outer(mus, sigmas, Vectorize(logLikelihood))
image(mus, sigmas, ll.matrix, col=topo.colors(1000))
contour(mus, sigmas, ll.matrix, add=TRUE, nlevels=100, col=rgb(0,0,0,.2))

plot of chunk unnamed-chunk-23

Ok ... back to Movement

We have seen that movement data often has:

  1. Skewed, Positive, step Length Distribution
  2. Wrapped turning angle distributions, clustered around 0o

This combination is known as the Correlated Random Walk (CRW) model, probably the most commonly used basic movement model in ecology.

Typical step-length model - the Weibull distribution: \[ f(x;\alpha, \beta) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}e^{-(x/\beta)^{k}} \] \( \alpha \) and \( \beta \) are the shape and scale parameter, respectively.

curve(dweibull(x, 1, 2), xlim=c(0,6), ylim=c(0,1.2), ylab="density", xlab="", col=2)
curve(dweibull(x, 2, 2), add=TRUE, col=3)
curve(dweibull(x, 6, 2), add=TRUE, col=4)
legend("topright", col=2:4, lwd=2, legend=c(1,2,6), title="Shape parameter", bty="n", cex=1.5)

plot of chunk unnamed-chunk-24

Typical turning angle model - wrapped Cauchy distribution: \[ f(\theta;\mu,\kappa)=\frac{1}{2\pi}\,\,\frac{\sinh\kappa}{\cosh\kappa-\cos(\theta-\mu)} \] \( \mu \) is the mean angle (usually 0) and \( \kappa \) is clustering parameter- equal to \( E(cos(\theta)) \).

require(CircStats) # NOT in base!
curve(dwrpcauchy(x, 0, 0), xlim=c(-pi,pi), ylim=c(0,1.5), lwd=2, ylab="density", xlab="", col=2)
curve(dwrpcauchy(x, 0, 0.5),  add=TRUE, col=3)
curve(dwrpcauchy(x, 0, 0.8),add=TRUE, col=4, n=1001)
legend("topright", col=2:4, legend=c(0,0.5,0.8), title="Shape parameter", bty="n", cex=1.5, lwd=2)

plot of chunk unnamed-chunk-25

MLE of Weibull parameters I

Some sample data:

Y <- cumsum(arima.sim(n=100, model=list(ar=.7)))
Z <- X + 1i*Y
plot(Z, type="o", asp=1)

plot of chunk SRW

A function that returns the likelihood as a function of the parameters and data

Weibull.Like <- function(p, Z)
{ 
  S = Mod(diff(Z))
  -sum(dweibull(S, p[1], p[2], log=TRUE))
}

Note 1: we use log() because it is much easier to SUM LOGS than it is to MULTIPLY SMALL NUMBERS.


Note 2: we return the NEGATIVE of the likelihood because the optim() function likes to MINIMIZE rather than MAXIMIZE.

MLE of Weibull parameters II

Run the optimization:

(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z))
$par
[1] 2.717573 2.164185

$value
[1] 114.6824

$counts
function gradient 
      85       NA 

$convergence
[1] 0

$message
NULL

Visually assess the fit:

hist(Mod(diff(Z)), freq=FALSE, col="grey", breaks=10)
curve(dweibull(x, Weibull.fit$par[1], Weibull.fit$par[2]), add=TRUE, col=2, lwd=2)

plot of chunk unnamed-chunk-28

Not Bad!

Exercise

Use optim to estimate the \( \kappa \) parameter in the Wrapped cauchy distribution Visualize the quality of the fits by comparing distribution curves to histograms.

Follow the template for the Weibull:

Weibull.Like <- function(p, Z)
{ 
  S = Mod(diff(Z))
  -sum(dweibull(S, p[1], p[2], log=TRUE))
}
(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z))

Confidence Intervals: Theory

The peak of the likelihood surface gives you point estimates of the parameters.

Likelihood theory provides an additional enormously handy (asymptotically correct) result with respect to standard errors around the estimates. Specifically (in words): The variance around the point estimates is equal to the negative reciprocal of the second derivative of the log-likelihood at the maximum.

This is actually very intuitive! The sharper the peak, the MORE NEGATIVE the second derivative, the SMALLER the (POSITIVE) variance. The flatter the peak, the LESS NEGATIVE the second derivative, i.e. the LARGER the variance.

\[ \Sigma(\theta) = {\cal I}(\theta)^{-1} \]

Often (e.g. in these examples) we have several parameters here, so the jargon is fancier, but the idea is the same:

  • the Hessian is an n-dimensional second derivative (a matrix)
  • the Fisher Information (\( \cal{I} \)) is the Hessian of the log likelihood.
  • the inverse is the n-dimensional equivalent of “reciprocal”.
  • \( \Sigma \) is the variance-covariance matrix of the parameter estimates.

Confidence Intervals: Application

Ask optim() to compute the Hessian:

(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z, hessian=TRUE))
$par
[1] 2.717573 2.164185

$value
[1] 114.6824

$counts
function gradient 
      85       NA 

$convergence
[1] 0

$message
NULL

$hessian
          [,1]      [,2]
[1,]  22.82690 -17.81262
[2,] -17.81262 156.12353

The inverse of a matrix (in R) is solve(M), and since Weibull.Like() actually returns the negative of the log-likelihood, that's all we need:

(Sigma <- solve(Weibull.fit$hessian))
           [,1]        [,2]
[1,] 0.04808939 0.005486670
[2,] 0.00548667 0.007031176

The square root of the diagonal gives standard errors

(se <- sqrt(diag(Sigma)))
[1] 0.21929294 0.08385211

And the confidence intervals are just:

cbind(hat = Weibull.fit$par, CI.low = Weibull.fit$par  - 1.96*se, CI.high = Weibull.fit$par  + 1.96*se)
          hat   CI.low  CI.high
[1,] 2.717573 2.287759 3.147387
[2,] 2.164185 1.999834 2.328535

EXERCISE


Following the template below:

  1. Estimate the Weibull step-length and wrapped Cauchy turning angle parameters for your movement data of choice.

  2. Visualize the quality of the fits by comparing distribution curves to histograms.

  3. Extra Credit: Compute standard errors around those estimates.

For discussion: What do YOU think about the CRW model?

Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z, hessian=TRUE)
Sigma <- solve(Weibull.fit$hessian)
se <- sqrt(diag(Sigma))
cbind(hat = Weibull.fit$par, CI.low = Weibull.fit$par  - 1.96*se, CI.high = Weibull.fit$par  + 1.96*se)
          hat   CI.low  CI.high
[1,] 2.717573 2.287759 3.147387
[2,] 2.164185 1.999834 2.328535

Optim





Is the engine under a great many hoods of R functions.

John Nash: creator (and skeptic) of optim

see: http://www.ibm.com/developerworks/library/ba-optimR-john-nash/

Comparing models with likelihoods

  1. Likelihood Ratio Test:
  • Model 0 and Model 1 are NESTED

  • (i.e. Model 0 is a special case of Model 1) with \( k_0 \) and \( k_1 \) parameters.

  • Compute MLE's: \( \widehat{\theta_0} \) and \( \widehat{\theta_1} \)

  • Compute likelihoods: \( {\cal L_0(\theta_0|X)} \) and \( {\cal L_1(\theta_1|X)} \)

  • important: the data \( X \) must be identical!

  • Likelihood Ratio Test Statistic: \[ \Lambda = -2 \log \left( \frac{L_0}{L_1} \right) = 2 (l_1 - l_0) \]

  • under Null hypothesis (i.e. Model 1) has distribution \( \Lambda \sim \chi^2_{d.f. = k_1 - k_0} \)

LRT Example

plot of chunk unnamed-chunk-35

Competing models: \[ M0: Y_i = \beta_0 + \epsilon_i \] \[ M1: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] where \( \epsilon \) are iid Gaussian.

Defining (negative) log-likelihood functions

LL0 <- function(par = c(beta0, sigma), X, Y){
  -sum(dnorm(Y, mean = par['beta0'], sd = par['sigma'], log=TRUE) )
}
LL1 <- function(par = c(beta0, beta1, sigma), X, Y){
  Y.hat <- par['beta0'] + par['beta1']*X
  -sum(dnorm(Y, mean = Y.hat, sd = par['sigma'], log=TRUE) )
}

Obtaining estimates

fit0 <- optim(c(beta0=0, sigma=1), LL0, X = X, Y = Y)
fit1 <- optim(c(beta0 = 0, beta1 = 0, sigma = 1), LL1, X = X, Y = Y)

Performing test:

LRT <- 2*(-fit1$value + fit0$value)
1-pchisq(LRT, 1)
[1] 1.581083e-05

Non-nested models

Compare Akaike Information Criterion (AIC):
\[ AIC = - 2 \log(\cal L) + 2 k \]

c(AIC0  = 2*(fit0$value + 2*2),
  AIC1  = 2*(fit1$value + 2*3))
    AIC0     AIC1 
81.96812 67.33096 

Or … Bayesian Information Criterion (BIC) \[ BIC = - 2 \log(\cal L) + k \log(n) \]

c(BIC0  = 2*(fit0$value + 2 * log(length(X))),
  BIC1  = 2*(fit1$value + 3 * log(length(X))))
    BIC0     BIC1 
85.95105 73.30535 

Which? Why? - a complicated debate, different underlying assumptions But generally - if you want to be more parsimonious (i.e. protect from overfitting) BIC is a better bet.

Likelihood maximization: A summary

  • If you have data

  • and you can write a probability model

  • in terms of parameters

  • no matter how strange or arbitrary seeming …

  • then there's a good chance you can estimate those parameters

  • and compare competing models

  • and probably obtain confidence intervals.