is according to some a radically different and modern approach to modeling data that:
according to others
… provides the best excuse in 150 years for statisticians/analysts to polarize into antagonistic camps (frequenists vs. Bayesianists) and engage in (often, in my opinion) pointless, pedantic, pseudo-philosophical arguments.
We have to go back to some basic definitions in PROBABILITY THEORY.
hair & eye color:
(HE <- HairEyeColor[,,2] + HairEyeColor[,,1])
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
(HE.p <- round(prop.table(HE),3))
## Eye
## Hair Brown Blue Hazel Green
## Black 0.115 0.034 0.025 0.008
## Brown 0.201 0.142 0.091 0.049
## Red 0.044 0.029 0.024 0.024
## Blond 0.012 0.159 0.017 0.027
What’s the probability of event A given event B.
\[\text{Pr}(A|B)\]
e.g.: \[Prob(Eye = blue | Hair = blond)\]
How do you compute this?
\(Pr(Eyes = blue \& Hair = blond) = 0.159\)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.115 0.034 0.025 0.008
## Brown 0.201 0.142 0.091 0.049
## Red 0.044 0.029 0.024 0.024
## Blond 0.012 0.159 0.017 0.027
\(Pr(Hair = blond) = 0.215\)
rowSums(HE.p)
## Black Brown Red Blond
## 0.182 0.483 0.121 0.215
\[\text{Pr}(A|B) = {Pr(A, B) \over Pr(B)}\]
Probability of having blue eyes given that your hair is blond:
HE.p[4,2]/rowSums(HE.p)[4]
## Blond
## 0.7395349
Very High!
Compare to overall probability of blue eyes.
\[\text{Pr}(A|B) = \text{Pr}(B|A)?\]
Blond haired given you’re blue eyed:
HE.p[4,2]/colSums(HE.p)[2]
## Blue
## 0.4368132
Obviously not! But what is the relationship?
\[\text{Pr}(A|B) = {\text{Pr}(B|A) P(A) \over P(B)}\]
Straightforward, non-controvertial, fundamental law of probabilities:
“Bayes’ theorem is to the theory of probability what Pythagoras’s theorem is to geometry.” - Harold Jeffreys
Consider a disease (e.g. ebola) that exists with a prevalance of 0.0001 in the population: \[P(E) = 0.0001\]
There is a test, which has a 99% positive identification rate, i.e. if you DO have ebola, a positive result has probability 0.99: \[P(+|E) = 0.99\]
The test also has a 1% false positive rate, i.e. if you DON’T have ebola, the result will be negative. \[P(+|!E) = 0.01\]
YOU test positive! What’s the probability you have ebola?
Probabilities:
\[P(E) = 1e(-4)\] \[P(+|E) = 0.99\] \[P(+| !E) = 0.01\]
Probabilities:
\[P(E) = 1e(-4)\] \[P(+|E) = 0.99\] \[P(+| !E) = 0.05\]
\[P(E|+) = {P(+|E) P(E) \over P(+)}\]
\[P(E|+) = {P(+|E) P(E) \over (P(+|!E) P(!E) + P(+|E) P(E)}\]
\[P(E|+) = {0.99 \times 1e(-4) \over 0.99 \times 1e(-4) + 1e(-2) \times (1 - 1e(-4)}\] \[P(E|+) \approx {1e-4 \over 1e-4 + 1e-2}\] \[P(E|+) \approx 0.0099\]
Perhaps … lower than you expected?
\[\text{Pr}(A|B) = {\text{Pr}(B|A) P(A) \over P(B)}\]
where: \(A\) - is anything and \(B\) - is anything
\[\pi(\theta | {\bf X}) = {p({\bf X} | \theta) \pi (\theta) \over p({\bf X})}\]
where: \({\bf X}\) - is DATA and \({\theta}\) - is parameters
Shockingly (or intuitively?) PARAMETERS are random!?
The goal is to obtain a probability distribution of parameter values given data:
basing that inference by adjusting a “naive” guess:
by multiplying (updating):
What is this thing called?
Yes! the Likelihood!
\[\pi(\theta | {\bf X}) = {p({\bf X} | \theta) \pi (\theta) \over p({\bf X})}\]
In the ebola example, we want to infer the parameter \(p_{you\,have\,ebola}\).
The prior: \(\pi(\theta) = Pr(E) = p_{ebola\,in\,population}\).
The data: \({\bf X} = +\)
The likelihood: \(p({\bf X} | \theta) = P(+|E)\)
The updating of the prior: \(p({\bf X} | \theta) \pi(\theta) = P(+|E) P(E)\)
The normalizing of the prior: \[p(X) = \int_{\theta} p(X|\theta^*) \pi(\theta^*) d\theta^*\] \[= P(+|!E) P(!E) + P(+|E) P(E)\]
Note: this normalization is - here and in other cases - the biggest pain! But thankfully, also, not always necessary.
In words:
We took the very strong prior probability distribution \(p_{you\,have\,ebola} = 0.0001\) and “nudged” it with our new, slightly imperfect observation to \(p_{you\,have\,ebola} = 0.01\).
\[\pi(\theta | {\bf X}) \propto p({\bf X} | \theta) \pi (\theta)\]
Is (always) to compute the posterior distribution of the parameters: \(\pi(\theta | {\bf X})\)
What does that mean for a distribution? What are point estimates? How do we quantify uncertainty?
There are options!
A point estimate can be the mean of the posterior: \(\widehat{\theta} = E(\theta | {\bf X}) = \int \theta \pi(\theta|{\bf X})\, d\theta\),
or median or mode.
Uncertainty is given by a Credible Interval, i.e. \(a\) and \(b\) such that: \(\int_a^b \pi(\theta|{\bf X}) = 1 - \alpha\)
You are allowed to interpret this as the Probability of a parameter being within this interval! (to the great relief of statistics students who are understanably confused about the interpretation of Confidence Intervals).
Data: \[{\bf X} = \{H,H,H,H,H\}\]
Uninformative prior: \[\theta \sim \text{Unif}(0,1)\] \[\pi(\theta) = 1; 0 < \theta < 1\]
Likelihood: \[X \sim \text{Binomial}(n=5,\theta)\] \[f(X|\theta) = \theta^5\]
Posterior: \[\pi(\theta|X) \propto 1 \times \theta^5 = 6 \theta^5\]
fillcurve <- function(dist, min=-.2, col="grey", bor=1, lwd =2, max=1.2, n=1e5, ...){
x <- c(min, seq(min, max, length=1000), max)
y <- c(0, dist(seq(min, max, length=1000)), 0)
plot(x, y, type="n", ylab="",...)
polygon(x, y, col=col, bor=bor, lwd=lwd)
}
pi1 <- function(theta) ifelse(theta >= 0 & theta <= 1, 6*theta^5, 0)
fillcurve(dunif, xlab=expression(theta), main="Prior")
fillcurve(pi1, xlab=expression(theta), main="Posterior")
With this new posterior distribution we can report some results with respect to our parameter!
Mean: \[\widehat{\theta} = \int_0^1 \theta \times 6 \theta^5 d\theta = 6/7 = 0.857\]
Median: \[\tilde{\theta}: \int_0^{\tilde{\theta}} 6 \theta^5 d\theta = 1/2\] \[ \tilde{\theta}^6 = {1 \over 2}; \tilde{\theta} = {1 \over 2}^{1/6} = 0.89\]
Mode: \[\breve{\theta} = \text{argmax}_\theta (f(\theta)) = 1\]
\[\int_0^{\{a,b\}} \pi(\theta|X) d\theta = \{0.025, 0.975\}\]
Here: \(a^6 = 0.025;\,\, a = 0.54; b^6 = 0.975;\,\, b = 0.996\)
\(a\) and \(b\) such that \(\pi(a) = \pi(b)\) and \[\int_a^b \pi(\theta|X) d\theta = 0.95\]
Here: \[ 0.607 - 1.0 \]
MLE of \(\widehat{\theta}_{mle} = 1\), but C.I. = 0!?
Bayes allows you to quantify your belief in the parameter’s value where a frequentist approach is mostly useless.
This is because you ALREADY had a model. All the data did was “update” it!
This is often amazing to people who first encounter Bayesian modeling - it CAN estimate parameters with VERY HIGH parameter/data ratios.
Probably - if someone flips a coin you’ll assume that it’s a fair coin. But maybe you’ve heard of “magic” double-headed or double-tailed coins?
Data: \(\{H,H,H,H,H\}\)
Prior: \(\pi(\theta) = \{0, 1/2, 1\}\) with Prob: \(\{.005, .99, .005\}\)
Exercise 1: Compute the posterior.
Exercise 2: How many consecutive coin flips will it take for the subjective probability that it is a 2-headed coin to be creater than the subjective probability that it is a fair coin?
Question: What happens to the posterior if you observe a single tail - even after a long string of heads?
This illustrates that the stronger the initial belief, the harder it is to shift it! Bayesian statistics is therefore a nice model of an adaptive learning process.
Obviously a large part of Bayesian models has to do with picking priors.
The Uniform(0,1) prior for probabilities is, in fact, a special case of a Beta(a,b) distribution. Even more remarkably, the posterior distribution (for a Binomial process), is ALSO a Beta distribution.
\[ \pi(\theta) = \frac{1}{B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} \]
Prior: \(\alpha = \beta = 1\); Posterior: \(\alpha = 6\) , \(\beta = 1\) .
We say that the Beta is the conjugate prior for the p parameter in a Binomial distribution. Conjugate priors are VERY HANDY - both mathematically and computationally, and as you parameterize / try to fit Bayesian models, it is good to be aware of them.
A summary of conjugate priors here: http://www.johndcook.com/conjugate_prior_diagram.html
The coin problem is a rare analytically tractable one.
If you have a few more parameters, and maybe odd distributions, you might be able to numerically integrate.
But if you have a lot of parameters, this is a near impossble operation to perform!
Thus, though the theory dates to the 1600’s, it has been difficult to implement more broadly … until the developement of Markov Chain Monte Carlo techniques.