In frequentist inference, observations are random while parameters are fixed, unknown quantities.
Probabilities are interpreted as the expected long-run frequency over many, many, many observations of a random process.
Has this ever troubled you? The idea that there is a “TRUE”, say, regression coefficient on the interaction between salinity and temperature in predicting the presence of a fish?
In the Bayesian approach, probabilities are baldly interpreted as a subjective measure of belief.
Yikes! But also, to many, appealing.
EVERYTHING - the response variables, the parameters, and the hypotheses - are RANDOM VARIABLES. We use data to determine the certainty (Credibility) of an estimate.
That salinity X temperature interaction reflects not something absolute, but something (intrinsically sloppy) that we have learned about Solea solea.
Frequentist
Guarantee correct probabilities of error, taking into account the sampling, the sample sizes, and the models.
MINUS: Requires convoluted interpretation of confidence intervals, Type I and Type II errors
PLUS: More intrinsically "objective" and logically consistent.
Bayesian
Analyze the extent to which more information improves our beliefs about a system.
MINUS: It’s all about beliefs! … with major consequences.
PLUS: More intuitive interpretation and implementation, e.g. this is the probability of this hypothesis, and here is the probability that this parameter is equal to this value. Probably closer to how humans naturally interpret the world.
\[\pi(\theta | {\bf X}) \propto p({\bf X} | \theta) \pi (\theta)\]
Some typical Bayes shorthand: posterior \(\propto\) likelihood \(\times\) prior.
Note immediately: the goal of the whole exercise is to compute the posterior distribution of the parameters: \(\pi(\theta | {\bf X})\)
A point estimate is the mean of the posterior: \(\widehat{\theta} = E(\theta | {\bf X}) = \int \theta \pi(\theta|{\bf X})\, d\theta\)
(or median or mode).
A Credible Interval is \(a\) and \(b\) such that: \(\int_a^b \pi(\theta|{\bf X})\, d\theta = 1 - \alpha\)
You are allowed to interpret this as the probability of a parameter being within this interval! (To the great relief of undergraduate stats students who - like all normal humans - always misinterpret Confidence Intervals.)
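As a concrete illustration (a sketch; the Beta(6, 1) posterior used here is the one that arises in the coin-flipping example below), the posterior mean and an equal-tailed 95% credible interval come straight from the posterior distribution in R:
post.mean <- 6 / (6 + 1)                     # E(theta | X) for a Beta(a, b) posterior is a/(a + b)
cred.int  <- qbeta(c(0.025, 0.975), 6, 1)    # a and b such that the posterior probability between them is 0.95
post.mean
cred.int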
The idea of MCMC is to “sample” from parameter values \(\theta_i\) in such a way that the resulting distribution approximates the posterior distribution.
Recall that a Markov chain is a random process whose next state depends only on its current state, and that (if ergodic) it converges to a stationary distribution.
The “trick” is to find sampling rules (MCMC algorithms) that asymptotically approach the correct distribution.
There are several such (related) algorithms - e.g. Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo / NUTS, all discussed below.
A large, evolving literature and body of work!
You’ve flipped 5 heads. Your initial “guess” for \(\theta\) is Uniform. How do these data shift your prior?
MCMC:
prior <- function(theta) dunif(theta)                  # Uniform(0, 1) prior
likelihood <- function(theta)                          # likelihood of 5 heads in 5 flips
  ifelse(theta >= 0 & theta <= 1, theta^5, 0)
theta <- 0.3                                           # starting value
n <- 1000                                              # chain length
p.old <- prior(theta) * likelihood(theta)
thetas <- theta
while(length(thetas) <= n){
  theta.new <- theta + rnorm(1, 0, 0.05)               # propose a small random step
  p.new <- prior(theta.new) * likelihood(theta.new)
  if(p.new > p.old | runif(1) < p.new/p.old){          # Metropolis accept/reject rule
    theta <- theta.new
    p.old <- p.new
  }
  thetas <- c(thetas, theta)                           # record the current value (repeated if the proposal was rejected)
}
Plot:
plot(thetas, type="o", pch=21, bg="darkgrey")                       # trace of the chain
hist(thetas[-(1:100)], freq=FALSE, col="grey", border="darkgrey")   # drop a burn-in of 100
curve(6*x^5, add=TRUE, col=2, lwd=3)                                # analytical posterior: Beta(6, 1) density
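As a quick sanity check (a sketch), compare the chain's summaries to the analytical Beta(6, 1) posterior:
mean(thetas[-(1:100)])                          # MCMC estimate of E(theta | X); analytically 6/7
quantile(thetas[-(1:100)], c(0.025, 0.975))     # MCMC 95% credible interval; compare qbeta(c(0.025, 0.975), 6, 1)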
Generalize the MCMC from above to estimate the probability from an experiment with any number of trials and any number of successes.
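A minimal sketch of one way to start (the values of k and N here are illustrative placeholders for your own successes and trials):
likelihood <- function(theta, k = 7, N = 10)             # k successes in N trials (illustrative defaults)
  ifelse(theta >= 0 & theta <= 1, theta^k * (1 - theta)^(N - k), 0)
# The rest of the sampler is unchanged; under a Uniform prior the posterior is
# Beta(k + 1, N - k + 1), so you can overlay dbeta(x, k + 1, N - k + 1) to check your chain.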
You typically also thin the chain to reduce autocorrelation, using acf() as a guide. To guarantee that you are converging to the correct distribution, you typically obtain multiple chains (e.g. 4) starting from different locations.
The effective sample size (\(N_{eff}\)) is the total length of the raw chains, minus the burn-in, divided by the thinning interval.
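A rough sketch of that bookkeeping on the coin-flip chain from above (the burn-in of 100 and thinning interval of 10 are illustrative choices, not rules):
burnin <- 100; thin <- 10
acf(thetas)                                                     # autocorrelation suggests a thinning interval
thetas.thin <- thetas[seq(burnin + 1, length(thetas), by = thin)]
length(thetas.thin)                                             # ~ (raw length - burn-in) / thinning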
The coda R package helps analyze MCMC chains. An example is the line dataset - a Bayesian fit of a linear regression (\(\alpha, \beta, \sigma\)).
require(lattice); require(coda); data(line); xyplot(line)
Clip the burn-in portion:
xyplot(line[[1]], start=10)
Check out the posterior distributions (also assessing convergence):
densityplot(line)
The correlation between the parameters, and the auto-correlations within the chains
levelplot(line[[2]])
acfplot(line)
summary(line)
##
## Iterations = 1:200
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 200
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## alpha 2.9876 0.4984 0.02492 0.02375
## beta 0.7992 0.3367 0.01683 0.01597
## sigma 0.9681 0.7413 0.03707 0.05653
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## alpha 1.9650 2.7321 3.0188 3.2420 3.877
## beta 0.1431 0.6039 0.7963 0.9925 1.470
## sigma 0.4250 0.6158 0.7912 1.0752 2.560
The MCMCpack package contains a nice set of models.
Package: MCMCpack
Version: 1.3-3
Date: 2013-5-1
Title: Markov chain Monte Carlo (MCMC) Package
Author: Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
Maintainer: Jong Hee Park
Depends: R (>= 2.10.0), coda (>= 0.11-3), MASS, stats
Description: This package contains functions to perform Bayesian inference using posterior simulation for a number of statistical models. Most simulation is done in compiled C++ written in the Scythe Statistical Library Version 1.0.3. All models return coda mcmc objects that can then be summarized using the coda package. MCMCpack also contains some useful utility functions, including some additional density functions and pseudo-random number generators for statistical distributions, a general purpose Metropolis sampling algorithm, and tools for visualization.
License: GPL-3
URL: http://mcmcpack.wustl.edu
Look at the list of models it can fit: ls("package:MCMCpack")
require(MCMCpack); require(MASS)    # the birthwt data come from MASS
birthwt.mcmc <- MCMClogit(low ~ age + as.factor(race) + smoke, data = birthwt)
xyplot(birthwt.mcmc)
densplot(birthwt.mcmc)
summary(birthwt.mcmc)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## (Intercept) -0.98190 0.9109 0.009109 0.038909
## age -0.03812 0.0341 0.000341 0.001403
## as.factor(race)2 1.03851 0.5002 0.005002 0.020253
## as.factor(race)3 1.08242 0.4185 0.004185 0.017995
## smoke 1.12993 0.3922 0.003922 0.016334
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## (Intercept) -2.72363 -1.60072 -1.00723 -0.31925 0.87166
## age -0.11025 -0.06131 -0.03675 -0.01468 0.02623
## as.factor(race)2 0.06971 0.69756 1.03927 1.38325 2.01703
## as.factor(race)3 0.30514 0.79286 1.06132 1.36351 1.94954
## smoke 0.38584 0.85418 1.14129 1.39460 1.88826
***
summary(glm(low~age+as.factor(race)+smoke, data=birthwt, family="binomial"))
##
## Call:
## glm(formula = low ~ age + as.factor(race) + smoke, family = "binomial",
## data = birthwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4211 -0.9171 -0.5687 1.3687 2.0707
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00755 0.86166 -1.169 0.24228
## age -0.03488 0.03340 -1.044 0.29634
## as.factor(race)2 1.01141 0.49342 2.050 0.04039 *
## as.factor(race)3 1.05673 0.40596 2.603 0.00924 **
## smoke 1.10055 0.37195 2.959 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 218.86 on 184 degrees of freedom
## AIC: 228.86
##
## Number of Fisher Scoring iterations: 4
For this application, there is no very good reason to use Bayesian modeling, UNLESS:
- you are categorically / philosophically a "Bayesian";
- you have real prior information on the regression coefficients (which is - basically - unheard of).
A major drawback: you can always be criticized for:
- the priors;
- the finicky tuning parameters (thinning / convergence / burn-in).
But see some of the more complex models that MCMCpack can fit (e.g. hierarchical logit: MCMChlogit).
Metropolis-Hastings is great, simple, and general. BUT … sensitive to step size. AND … can be too slow, because it can end up rejecting a great many steps.
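One way to see that sensitivity is a quick sketch reusing prior() and likelihood() from the coin-flipping example above (run.chain and the step values here are illustrative, not part of the lecture code):
run.chain <- function(step, n = 1000, theta = 0.3) {
  p.old <- prior(theta) * likelihood(theta)
  thetas <- theta
  accepted <- 0
  while (length(thetas) <= n) {
    theta.new <- theta + rnorm(1, 0, step)               # proposal sd = step
    p.new <- prior(theta.new) * likelihood(theta.new)
    if (p.new > p.old | runif(1) < p.new/p.old) {
      theta <- theta.new; p.old <- p.new; accepted <- accepted + 1
    }
    thetas <- c(thetas, theta)
  }
  accepted / n                                           # acceptance rate
}
sapply(c(0.005, 0.05, 0.5, 5), run.chain)                # tiny steps: nearly all accepted but the chain barely moves;
                                                         # huge steps: almost everything is rejected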
In Gibbs sampling, rather than accept/reject with appropriate probabilities, you march around the parameter space using the appropriate conditional probabilities: \[\theta_{1,i+1} \sim p(\theta_{1} | X, \theta_{2,i})\] i.e. you draw the step directly from THAT distribution.
Then you fix the new \(\theta_1\) and draw the next \(\theta_2\) from the distribution conditioned on it: \[\theta_{2,i+1} \sim p(\theta_{2} | X, \theta_{1,i+1})\]
Much faster (and therefore more common) than Metropolis-Hastings, with a far higher effective sample size for the same number of iterations!
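As an illustrative sketch of the idea (a made-up example, not from the lecture): a Gibbs sampler for a standard bivariate normal with correlation rho, where both full conditionals are known normal distributions:
rho <- 0.8
n <- 5000
theta1 <- theta2 <- numeric(n)
for (i in 2:n) {
  # draw theta1 from p(theta1 | theta2), then theta2 from p(theta2 | the new theta1)
  theta1[i] <- rnorm(1, rho * theta2[i - 1], sqrt(1 - rho^2))
  theta2[i] <- rnorm(1, rho * theta1[i],     sqrt(1 - rho^2))
}
plot(theta1, theta2, pch = ".")    # the samples trace out the correlated target distribution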
BUGS (OpenBUGS, WinBUGS) is Bayesian inference Using Gibbs Sampling
JAGS is “just another Gibbs sampler”
Hamiltonian Monte Carlo (HMC) - a variant of Metropolis-Hastings that uses gradients of the posterior to propose moves, and is therefore faster and handles correlation between parameters better.
No-U-Turn Sampler (NUTS) - stops each trajectory when it starts to curl back on itself, which speeds things up even more by not requiring a fixed trajectory length. This is what Stan uses (see http://arxiv.org/pdf/1111.4246v1.pdf).
You might want to create your own model to fit using Bayesian MCMC rather than rely on existing models. For this purpose, there are several tools to choose from.
BUGS / WinBUGS / OpenBUGS (Bayesian inference Using Gibbs Sampling) - the granddaddy (since 1989) of Bayesian sampling tools. WinBUGS is proprietary; OpenBUGS is poorly supported. A lot of point-and-click, with interfacing possible from R.
JAGS (Just Another Gibbs Sampler) accepts a model string written in an R-like syntax, compiles it, and generates MCMC samples from the model using Gibbs sampling. Can be used within R with the rjags package.
Stan (named for Stanislaw Ulam) is a fairly new program similar to JAGS - somewhat faster, somewhat more robust, and growing rapidly. It generates C++ code from a pseudo-R/C syntax. **We’ll be using Stan in lab! So install it! http://mc-stan.org/rstan.html**
Laplace’s Demon - an intriguing-looking set of Bayesian tools, all within R: http://www.bayesian-inference.com/software
For a discussion / comparison, check out: http://www.sumsar.net/blog/2013/06/three-ways-to-run-bayesian-models-in-r/
Stanislaw Ulam (1909-1984) holding a mechanical Monte Carlo simulator of neutron diffusion.
To fit a model using STAN, the steps are: (1) write the model in STAN code and save it to a file, (2) compile it with stan_model(), and (3) draw posterior samples with sampling().
STAN code is a sort of hybrid between R (e.g. with handy distribution functions) and C (i.e. you have to declare your variables). Each model definition comes with three blocks:
1. The data block:
data {
  int<lower=0> n;    // sample size (a non-negative integer)
  vector[n] y;       // Y Vector
  vector[n] x;       // X Vector
}
This specifies the raw data that you are going to enter. In this case, it is just y and x, both of which are (numeric) vectors of length n, which is an integer that cannot be less than 0.
2. The parameters block:
parameters {
  real beta0;            // intercept
  real beta1;            // slope
  real<lower=0> sigma;   // residual standard deviation
}
These list the parameters you want to estimate: the intercept, the slope, and the residual standard deviation.
3. The model block:
model {
  // a useful intermediate variable:
  vector[n] yhat;
  // Priors - (self-explanatory)
  sigma ~ inv_gamma(0.001, 0.001);
  beta0 ~ normal(150, 150);
  beta1 ~ normal(0, 10);
  // Likelihood
  for(i in 1:n){
    yhat[i] <- beta0 + beta1 * (x[i] - mean(x));
  }
  y ~ normal(yhat, sigma);
}
Note: refer frequently to the manual! https://github.com/stan-dev/stan/releases/download/v2.9.0/stan-reference-2.9.0.pdf
You save your model in a separate file (I called it lm.stan … note: newer versions of RStudio provide STAN syntax coloring!), and then compile the model with the stan_model() command.
require(rstan)                           # provides stan_model() and sampling()
lm_stan <- stan_model(file = "lm.stan")
This command takes the model you described and encodes and compiles a NUTS sampler in C++. That’s wonderful! Believe me, it is a great, great pain (without lots of experience) to write that C++ code yourself, and it is guaranteed to be way faster than the equivalent in R. Now you have your own Bayesian MCMC model in your workspace.
Note: this step can be quite slow!
The key function here is sampling(). Note, also, that to give the data to your model, it has to be in the form of a list.
Simulating some data:
beta0 <- 4
beta1 <- -.5
sigma <- 2
X <- runif(100,0,20)
Y <- rnorm(100, beta0+beta1*X, sigma)
Data <- list(n = length(X), y=Y, x=X)
Perform the sampling!
lm_stan.fit <- sampling(lm_stan, Data)
lots of output here as it computes
print(lm_stan.fit, digits = 2)
## Inference for Stan model: lm.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## beta0 4.08 0.01 0.41 3.28 3.81 4.09 4.35 4.92 1312
## beta1 -0.49 0.00 0.03 -0.56 -0.51 -0.49 -0.47 -0.42 1327
## sigma 1.90 0.00 0.14 1.64 1.80 1.89 1.99 2.20 1356
## lp__ -113.67 0.04 1.26 -116.94 -114.26 -113.34 -112.74 -112.23 956
## Rhat
## beta0 1
## beta1 1
## sigma 1
## lp__ 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jun 04 14:31:27 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
To apply the coda family of diagnostic tools, you need to extract the chains from the STAN fitted object and re-create them as an mcmc.list.
require(plyr)
lm.chains <- extract(lm_stan.fit, inc_warmup=FALSE, permuted=FALSE, pars = c("beta0", "beta1", "sigma"))
lm.mcmc <- mcmc.list(alply(lm.chains, 2, mcmc))
plot(lm.mcmc)
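Once the chains are an mcmc.list, the usual coda diagnostics apply; for example (a sketch continuing from the lm.mcmc object above):
summary(lm.mcmc)          # posterior means, SDs and quantiles
effectiveSize(lm.mcmc)    # effective sample size per parameter
gelman.diag(lm.mcmc)      # potential scale reduction factor (Rhat); ~1 at convergence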
More on this in lab!
[L]a théorie des probabilités n’est au fond, que le bon sens réduit au calcul … Par là, elle devient le supplément le plus heureux à l’ignorance et à la faiblesse de l’esprit humain… [I]l n’est point de science plus digne de nos méditations, et qu’il soit plus utile de faire entrer dans le système de l’instruction publique.
[T]he theory of probabilities is no more than common sense reduced to calculation… It is therefore the most happy supplement to the ignorance and weakness of the human spirit. There is no science more worthy of our meditations, or more useful to include in our system of public instruction.