Elie Gurarie
August 22, 2016
Likelihoods are:
And, very strangely:
We are basically peeking under the hood of a lot of statistical machinery.
Let's say the average height of a human male is \[ X \sim {\cal N}(\mu_0 = 6, \sigma_0 = 0.5) \] This is a probability model, i.e. it tells us that: \[ P(x < X < x+dx) = f(x|\mu, \sigma)dx \] where \( f(x|\mu, \sigma) \), is the density function of the normal distribution with \( \mu \) and \( \sigma \).
This probability statement means: If you take (for example) a single 7 foot tall person, you can say with certainty that:
\[ f(7|\mu=6, \sigma=0.5) = 0.11 1/foot \].
dnorm(7,6,.5)
[1] 0.1079819
But the per-foot “probability” that he's around 7 feet tall is
\[ f(7|\mu=6, \sigma=0.5) = 0.11 1/foot \]
dnorm(7,6,.5)
[1] 0.1079819
Whats a “1/foot”? Convert it to “per inch”:
dnorm(7,6,.5)/12
[1] 0.008998494
Just under 1% probability that a person is within one inch of 7 feet.
Say, you don't know what those parameters are, but you are curious! So you collect one data point: a single man, and he is 7 feet tall.
\[ X_1 = \{7\} \]
So, to “flip this on its head”“, we ask a new question: What is the likelihood that the mean and standard deviation of human heights are 6 and 0.5, respectively, GIVEN that we observed a man who is \( X_1 = 7 \) feet tall?
We write this as: \[ {\cal L}(\mu, \sigma | x). \]
It is "flipped on its head” because it as a function of the parameters given an observation, rather than as the probability of an observation given some parameters. But, by definition, they are equal:
\[ L_0 = {\cal L}(\mu_0, \sigma_0 | X_1) = f(X_1|\mu_0, \sigma_0) = 0.11 \]
In colloquial language, a “likelihood” is sort of similar to a “probability” - i.e. the statement: “Event A is likely.” seems to have similar meaning to the statement: “Even A has high probability.”
In statistics, the “likelihood” and the “probability” are, in fact EQUAL - but there is an inversion of what is considered “known” and “unknown:
In practice - statistical inference is about having the data and guessing the parameters. Thus the concept of Likelihoods is extremely useful and natural.
In contrast to Probabilities, the actual raw value of the likelihood is almost NEVER of interest - We ONLY care about its value when compared with likelihoods of different models.
For example, we can compare the likelihood of the parameters \( \mu=6 \) and \( \sigma=0.5 \), GIVEN the observation \( X_1 = 7 \), with an alternative probability model … say, \( \mu_1 = 7 \) and \( \sigma_1 = 0.001 \). That likelihood is given by:
\[ L_1 = {\cal L}(\mu_1, \sigma_1 | X_1 = 7) = f(7 | 7, .001) \]
(L1 <- dnorm(7,7,.001))
[1] 398.9423
\( L_1 \) is, clearly, much much greater than our original likelihood, i.e. this set of parameters is much likelier than the original set of parameters. Indeed, the ratio \( L_1 / L_0 = 3693 \) ….
Let's sample 5 individuals from a professional basketball arena, and record their heights: 6.6, 6.8, 7.0, 7.2, 7.4 feet.
At this point, I might have reasonable suspicion that our null-model might not be appropriate for this subset of humans.
Visualize these data-points against our null-model:
mu0 <- 6
sigma0 <- 0.5
X <- c(6.6,6.8,7.0,7.2,7.4)
curve(dnorm(x, mu0, sigma0), xlim=c(4,8))
points(X, dnorm(X,mu0,sigma0), pch=19, col=2)
points(X, dnorm(X,mu0,sigma0), type="h", col=2)
Now, we compute the likelihood of the null parameters given ALL observations. The joint likelihood is just the product of each likelihood/probability because it is equal to the joint density distribution (which, for independent random variables, is the product of the densities):
\[ {\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]
(L0 <- prod(dnorm(X, mu0, sigma0)))
[1] 6.596596e-06
This is a very small number, but again - meaningless without a comparison. The joint likelihood of the alternative model:
mu1 <- 7
sigma1 <- 0.001
(L1 <- prod(dnorm(X, mu1, sigma1)))
[1] 0
To machine power, the second model is much LESS likely than the first model! Illustrate this:
curve(dnorm(x, mu1, sigma1), xlim=c(6.2,7.8), n=1000)
points(X, dnorm(X,mu1,sigma1), pch=19, col=2)
points(X, dnorm(X,mu1,sigma1), type="h", col=2)
The points away from 7 have extremely low probability, which brings the likelihood of this model way down.
So - how do we find the parameters that maximize the likelihood?
These parameters are called the Maximum Likelihood Estimators (MLE's), and are the “best” parameters in that they are the most precise. Remember, every estimate is a random variable and therefore comes with some variance. It can be shown that MLE's have the smallest of those possible variances:
\[ \{ \widehat\theta_\mathrm{mle}\} \subseteq \{ \underset{\theta\in\Theta}{\operatorname{arg\,max}}\ {\cal L}(\theta\,|\,X_1,\ldots,X_n) \} \]
Translating to English:
the MLE estimators of parameters \( \theta \) are those values of \( \theta \) for which the likelihood function is maximized.
Let's look at a range of possible values for \( \widehat{\mu} \) and \( \widehat{\sigma} \). We can do this pretty efficiently in R:
First, pick some values to explore:
mus <- seq(6,8,.05)
sigmas <- seq(.1,1,.02)
Now, write a function that computes the likelihood (which, as we recall, is a function of \( \mu \) and \( \sigma \), and “assumes” \( \bf X \))
Likelihood <- function(mu, sigma)
prod(dnorm(X, mu, sigma))
And compute this likelihood for all the combinations of \( \mu \) and \( \sigma \) above, using the mighty outer()
function:
L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))
Note (as a technical aside) that to get the Likelihood()
function to work within outer()
, it had to be “vectorized”. Happily, there is a function (Vectorize()
) that does just that.
Visualize our likelihood profile over those ranges of \( \mu \) and \( \sigma \)
image(mus, sigmas, L.matrix, col=topo.colors(100))
contour(mus, sigmas, L.matrix, add=TRUE)
persp(mus, sigmas, L.matrix, axes=TRUE, shade=TRUE, bor=NA)
Clearly, there is a sharp peak in the likelihood around \( \widehat{\mu} = 7 \) and somewhere just under \( \widehat{\sigma} = 0.3 \). From our limited sets of values, we can find the values at the maximum:
(max.indices <- which(L.matrix == max(L.matrix), arr.ind = TRUE))
row col
[1,] 21 10
(mu.hat <- mus[max.indices[1]])
[1] 7
(sigma.hat <- sigmas[max.indices[2]])
[1] 0.28
And the (usually irrelevant) value of the likelihood
max(L.matrix)
= 0.4580006
We don't need to do this search ourselves - that's why we invented electronic computing devices. The powerhouse function for optimizing functions in R is optim()
, but it takes some getting used to. The (minimal) syntax is:
optim(p, FUN, ...)
where:
p
is your initial guess for the parametersFUN(p, ...)
is a function that takes as its first argument a vector of parameters p
...
refers to any other object that is passed to FUN
. Most importantly, this will be data!See how I make it work below:
optim(c(6,1), function(p) -Likelihood(p[1], p[2]))
$par
[1] 7.0000003 0.2828426
$value
[1] -0.4582359
$counts
function gradient
89 NA
$convergence
[1] 0
$message
NULL
So - lots of output, but the key numbers we're interested in are the top two, under $par
. These are: \( \widehat{\mu} = 7.000 \) and \( \widehat{\sigma} = 0.2828 \). This is good, it is what we basically expected. The $value
is very close to the (negative of the) maximum likelihood that we “hand-calculated”.
PlotLikelihood <- function(mu, sigma, X, ...)
{curve(dnorm(x, mu, sigma), ...)
points(X, dnorm(X,mu,sigma), pch=19, col=2)
points(X, dnorm(X,mu,sigma), type="h", col=2)}
par(mfrow=c(1,3), bty="l")
PlotLikelihood(mu = 6, sigma=0.5, X, xlim=c(4,8))
title("Null Model")
PlotLikelihood(mu = 7, sigma=0.001, X, xlim=c(6.2,7.8), n = 1000)
title("Model 1")
pars <- optim(c(6,1), function(p) -Likelihood(p[1], p[2]))$par
PlotLikelihood(mu = pars[1], sigma=pars[2], X, xlim=c(5,9))
title("MLE Model")
You can get a feel for the constraints on the MLE model: if \( \sigma \) were smaller, the contribution of the outlying points would become much smaller and bring the likelihood down; if \( \sigma \) were any larger, then the flattening would bring down too far the contribution of the central points.
Now … you might think that's a whole lot of crazy to estimate a simple MEAN and VARIANCE.
You may have guessed that the best estimates are the sample mean:
\[ \widehat{\mu} = \overline{X} \] and the sample standard deviation:
\[ \widehat{\sigma^2} = s_x^2 = {1\over n-1}\sum (X_i - \overline{X})^2 \]
The mean estimate matches, but the sample standard deviation doesn't quite: </small?
sd(X)
[1] 0.3162278
This suggests that the MLE of the variance - for all of its great qualities - is at least somewhat biased (remember: \( E(s_x^2) = \sigma^2 \)).
Can you recognize what analytical formula gives the MLE of \( \sigma \)?
Note: this is a special case where the MLE is actually computable by hand, but in general it is not - and best to obtain numerically, as the examples that follow show.
For a combination of practical and theoretical reasons, what we actually maximize is not the Likelihood but the log-Likelihood. The maximum will be in the same place for both functions, but the latter is MUCH easier to work with.
Likelihood: \[ {\cal L}(\mu_0, \sigma_0 | {\bf X}) = \prod_{i=1}^n f(X_i|\mu_0, \sigma_0) \]
Log-likelihood: \[ {\cal l}(\mu_0, \sigma_0 | {\bf X}) = \log({\cal L}) = \sum_{i=1}^n \log(f(X_i|\mu_0, \sigma_0)) \]
Sums are much easier to manipulate with algebra and handle computaitonally. Also, they help turn very very very small numbers and very very very large numbers to very very very ordinary numbers.
The number of particles in the universe is about 1080 … its log is 184. The ratio of the mass of an electron to the size of the sun is 10-60 … its log is -134.
L.matrix <- outer(mus, sigmas, Vectorize(Likelihood))
image(mus, sigmas, L.matrix, col=topo.colors(1000))
contour(mus, sigmas, L.matrix, add=TRUE, col="white")
ll.matrix <- outer(mus, sigmas, Vectorize(logLikelihood))
image(mus, sigmas, ll.matrix, col=topo.colors(1000))
contour(mus, sigmas, ll.matrix, add=TRUE, nlevels=100, col=rgb(0,0,0,.2))
We have seen that movement data often has:
Typical step-length model - the Weibull distribution: \[ f(x;\alpha, \beta) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}e^{-(x/\beta)^{k}} \] \( \alpha \) and \( \beta \) are the shape and scale parameter, respectively.
curve(dweibull(x, 1, 2), xlim=c(0,6), ylim=c(0,1.2), ylab="density", xlab="", col=2)
curve(dweibull(x, 2, 2), add=TRUE, col=3)
curve(dweibull(x, 6, 2), add=TRUE, col=4)
legend("topright", col=2:4, lwd=2, legend=c(1,2,6), title="Shape parameter", bty="n", cex=1.5)
Typical turning angle model - wrapped Cauchy distribution: \[ f(\theta;\mu,\kappa)=\frac{1}{2\pi}\,\,\frac{\sinh\kappa}{\cosh\kappa-\cos(\theta-\mu)} \] \( \mu \) is the mean angle (usually 0) and \( \kappa \) is clustering parameter- equal to \( E(cos(\theta)) \).
require(CircStats) # NOT in base!
curve(dwrpcauchy(x, 0, 0), xlim=c(-pi,pi), ylim=c(0,1.5), lwd=2, ylab="density", xlab="", col=2)
curve(dwrpcauchy(x, 0, 0.5), add=TRUE, col=3)
curve(dwrpcauchy(x, 0, 0.8),add=TRUE, col=4, n=1001)
legend("topright", col=2:4, legend=c(0,0.5,0.8), title="Shape parameter", bty="n", cex=1.5, lwd=2)
Some sample data:
Y <- cumsum(arima.sim(n=100, model=list(ar=.7)))
Z <- X + 1i*Y
plot(Z, type="o", asp=1)
A function that returns the likelihood as a function of the parameters and data
Weibull.Like <- function(p, Z)
{
S = Mod(diff(Z))
-sum(dweibull(S, p[1], p[2], log=TRUE))
}
Note 1: we use
log()
because it is much easier to SUM LOGS than it is to MULTIPLY SMALL NUMBERS.
Note 2: we return the NEGATIVE of the likelihood because the
optim()
function likes to MINIMIZE rather than MAXIMIZE.
Run the optimization:
(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z))
$par
[1] 2.717573 2.164185
$value
[1] 114.6824
$counts
function gradient
85 NA
$convergence
[1] 0
$message
NULL
Visually assess the fit:
hist(Mod(diff(Z)), freq=FALSE, col="grey", breaks=10)
curve(dweibull(x, Weibull.fit$par[1], Weibull.fit$par[2]), add=TRUE, col=2, lwd=2)
Not Bad!
optim
to estimate the \( \kappa \) parameter in the Wrapped cauchy distribution Visualize the quality of the fits by comparing distribution curves to histograms.Follow the template for the Weibull:
Weibull.Like <- function(p, Z)
{
S = Mod(diff(Z))
-sum(dweibull(S, p[1], p[2], log=TRUE))
}
(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z))
The peak of the likelihood surface gives you point estimates of the parameters.
Likelihood theory provides an additional enormously handy (asymptotically correct) result with respect to standard errors around the estimates. Specifically (in words): The variance around the point estimates is equal to the negative reciprocal of the second derivative of the log-likelihood at the maximum.
This is actually very intuitive! The sharper the peak, the MORE NEGATIVE the second derivative, the SMALLER the (POSITIVE) variance. The flatter the peak, the LESS NEGATIVE the second derivative, i.e. the LARGER the variance.
\[ \Sigma(\theta) = {\cal I}(\theta)^{-1} \]
Often (e.g. in these examples) we have several parameters here, so the jargon is fancier, but the idea is the same:
Ask optim()
to compute the Hessian:
(Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z, hessian=TRUE))
$par
[1] 2.717573 2.164185
$value
[1] 114.6824
$counts
function gradient
85 NA
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2]
[1,] 22.82690 -17.81262
[2,] -17.81262 156.12353
The inverse of a matrix (in R) is solve(M)
, and since Weibull.Like()
actually returns the negative of the log-likelihood, that's all we need:
(Sigma <- solve(Weibull.fit$hessian))
[,1] [,2]
[1,] 0.04808939 0.005486670
[2,] 0.00548667 0.007031176
The square root of the diagonal gives standard errors
(se <- sqrt(diag(Sigma)))
[1] 0.21929294 0.08385211
And the confidence intervals are just:
cbind(hat = Weibull.fit$par, CI.low = Weibull.fit$par - 1.96*se, CI.high = Weibull.fit$par + 1.96*se)
hat CI.low CI.high
[1,] 2.717573 2.287759 3.147387
[2,] 2.164185 1.999834 2.328535
Following the template below:
Estimate the Weibull step-length and wrapped Cauchy turning angle parameters for your movement data of choice.
Visualize the quality of the fits by comparing distribution curves to histograms.
Extra Credit: Compute standard errors around those estimates.
For discussion: What do YOU think about the CRW model?
Weibull.fit <- optim(c(1,1), Weibull.Like, Z=Z, hessian=TRUE)
Sigma <- solve(Weibull.fit$hessian)
se <- sqrt(diag(Sigma))
cbind(hat = Weibull.fit$par, CI.low = Weibull.fit$par - 1.96*se, CI.high = Weibull.fit$par + 1.96*se)
hat CI.low CI.high
[1,] 2.717573 2.287759 3.147387
[2,] 2.164185 1.999834 2.328535
Is the engine under a great many hoods of R functions.
John Nash: creator (and skeptic) of optim
see: http://www.ibm.com/developerworks/library/ba-optimR-john-nash/
Model 0 and Model 1 are NESTED
(i.e. Model 0 is a special case of Model 1) with \( k_0 \) and \( k_1 \) parameters.
Compute MLE's: \( \widehat{\theta_0} \) and \( \widehat{\theta_1} \)
Compute likelihoods: \( {\cal L_0(\theta_0|X)} \) and \( {\cal L_1(\theta_1|X)} \)
important: the data \( X \) must be identical!
Likelihood Ratio Test Statistic: \[ \Lambda = -2 \log \left( \frac{L_0}{L_1} \right) = 2 (l_1 - l_0) \]
under Null hypothesis (i.e. Model 1) has distribution \( \Lambda \sim \chi^2_{d.f. = k_1 - k_0} \)
Competing models: \[ M0: Y_i = \beta_0 + \epsilon_i \] \[ M1: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] where \( \epsilon \) are iid Gaussian.
Defining (negative) log-likelihood functions
LL0 <- function(par = c(beta0, sigma), X, Y){
-sum(dnorm(Y, mean = par['beta0'], sd = par['sigma'], log=TRUE) )
}
LL1 <- function(par = c(beta0, beta1, sigma), X, Y){
Y.hat <- par['beta0'] + par['beta1']*X
-sum(dnorm(Y, mean = Y.hat, sd = par['sigma'], log=TRUE) )
}
Obtaining estimates
fit0 <- optim(c(beta0=0, sigma=1), LL0, X = X, Y = Y)
fit1 <- optim(c(beta0 = 0, beta1 = 0, sigma = 1), LL1, X = X, Y = Y)
Performing test:
LRT <- 2*(-fit1$value + fit0$value)
1-pchisq(LRT, 1)
[1] 1.581083e-05
Compare Akaike Information Criterion (AIC):
\[ AIC = - 2 \log(\cal L) + 2 k \]
c(AIC0 = 2*(fit0$value + 2*2),
AIC1 = 2*(fit1$value + 2*3))
AIC0 AIC1
81.96812 67.33096
Or … Bayesian Information Criterion (BIC) \[ BIC = - 2 \log(\cal L) + k \log(n) \]
c(BIC0 = 2*(fit0$value + 2 * log(length(X))),
BIC1 = 2*(fit1$value + 3 * log(length(X))))
BIC0 BIC1
85.95105 73.30535
Which? Why? - a complicated debate, different underlying assumptions But generally - if you want to be more parsimonious (i.e. protect from overfitting) BIC is a better bet.
If you have data …
and you can write a probability model …
in terms of parameters …
no matter how strange or arbitrary seeming …
then there's a good chance you can estimate those parameters …
and compare competing models …
and probably obtain confidence intervals.