What is Frequentist?

In frequentist inference, observations are random while parameters are fixed, unknown quantities.

Probabilities are interpreted as the expected frequency of many-many-many observations of a random process.

Has this ever troubled you? The idea that there is a “TRUE”, say, regression coefficient on the interaction between salinity and temperature in predicting the presence of a fish?

What is Bayesian?

In the Bayesian approach probability are baldly interpreted as a subjective measure of belief.

Yikes! But also, to many, appealing.

EVERYTHING - the response variables and the parameters and hypotheses are all RANDOM VARIABLES. We use data to determine the certainty (Credibility) of an estimate.

That salinity X temperature interaction reflects not something absolute, but something (intrinsically sloppy that) we have learned about Solea solea.

Philosophical Goals

Frequentist

Guarantee correct probabilities of error taking into account the sampling, the sample sizes, the models.

  • MINUS: Requires convoluted interpretation of confidence intervals, Type I and Type II errors

  • PLUS: More intrinsically ``objective’’ and logically consistent.


Bayesian

Analyze the extent to which more information improves our beliefs about a system.

  • MINUS: It’s all about beliefs! … with major consequences.

  • PLUS: More intuitive interpretation and implementation, e.g. this is the probability of this hypothesis, and here is the probability that this parameter is equal to this value. Probably closer to how humans naturally interpret the world.

Practically: Why Bayesian?

  • Complex models with limited data, e.g. Hierarchical models, where
\(Y \sim Dist(\theta)\); \(\theta \sim Dist(\gamma)\) AND \(\gamma \sim Dist (\nu)\)
  • Actual prior knowledge (surprisingly rare)

Bayes Rule (redacted):

\[\pi(\theta | {\bf X}) \propto p({\bf X} | \theta) \pi (\theta)\]

Some typical Bayes shorthand:

  • \(\pi()\) and \(p()\) represent probability functions (mass or density)
  • \(\pi(\theta)\) is the prior distribution of our “ideas” about \(\theta\)
  • \(\pi(\theta|{\bf X})\) is the posterior distribution of our “ideas” about \(\theta\), given the information provided by the data.
  • \(p(X | \theta)\) is the probability of the data given parameter values … i.e. the Likelihood Function

Note immediately:

  • This is not an equality - but no matter because it is a distribution that can be normalized. The missing demonimator is \(p({\bf X})\) (a very difficult thing to compute).
  • From a frequentist perspective, it is ABSURD to talk about the \(\pi(\theta)\) - since that is assumed to be fixed.
  • The biggest bugaboo of Bayes is in determining the prior distribution. What should the prior be? What influence does it have?

The Goal:

is to compute the posterior distribution of the parameters: \(\pi(\theta | {\bf X})\)

A point estimate is the mean of the posterior: \(\widehat{\theta} = E(\theta | {\bf X}) = \int \theta \pi(\theta|{\bf X})\, d\theta\)
(or median or mode).

A Credible Interval is \(a\) and \(b\) such that: \(\int_a^b \pi(\theta|{\bf X}) = 1 - \alpha\)

You are allowed to interpret this as the Probability of a parameter being within this interval! (to the great relief of undergraduate stats students who - like any normal humans - always misinterpret Confidence Interval).

Computation

Pierre-Simon Laplace (1749-1827) (see: Sharon Bertsch McGrayne: The Theory That Would Not Die)

  • Some problems are analytically tractable, e.g. the Binomal likelihood- Beta prior. (See Bayes Lecture Part I).
    • These cases are rare and rely on nice conjugate pairs.
  • If you have a few parameters, and odd distributions, you might be able to numerically multiply / integrate the prior and likelihood (aka grid approximation). See (Bayes Lab Part I).
    • But if you have a lot of parameters, this is a near impossible operation to perform!
  • Though the theory dates to the 1700’s, and even its interpretation for inference dates to the early 1800’s, it has been difficult to implement more broadly … until the development of Markov Chain Monte Carlo techniques.

MCMC

The idea of MCMC is to “sample” from parameter values \(\theta_i\) in such a way that the resulting distribution approximates the posterior distribution.

Recall that Markov Chain is a random process that depends only on its previous state, and that (if ergodic), leads to a stationary distributoin.

The “trick” is to find sampling rules (MCMC algorithms) that asymptotically approach the correct distribution.

There are several such (related) algorithms:

  • Metropolis-Hastings
  • Gibbs Sampling
  • No U-Turn Sampling (NUTS)
  • Reversible Jump

A large, evolving literature and body of work!

Metropolis-Hastings Algorithm

  1. Start somewhere: \(\theta_0\), and compute \(p_0 = \pi(\theta_0 | {\bf X})\)
  2. Make a jump to a new candidate location: \(\theta_1^* = \theta_0 + J_1\), where the Jumps are some random step (e.g. normal)
  3. Compute the posterior: \(p_1 = \pi(\theta_0 | {\bf X})\)
  4. IF \(p_1 > p_0\) - Accept \(\theta_1^*\)
  5. IF \(p_1 < p_0\) - Accept \(\theta_1^*\) with probability \(p_1 / p_0\)
  6. Go to step 2

Metropolis-Hastings: Coin Example

You’ve flipped 5 heads. Your initial “guess” for \(\theta\) is Uniform. How do these data shift your prior?

MCMC:

prior <- function(theta) dunif(theta)
likelihood <- function(theta) 
  ifelse(theta >= 0 & theta <=1, theta^5, 0)
theta <- 0.3
n <- 1000
p.old <- prior(theta)*likelihood(theta)
thetas <- theta
while(length(thetas) <= n){
  theta.new <- theta + rnorm(1,0,0.05)
  p.new <- prior(theta.new)*likelihood(theta.new)
  if(p.new > p.old | runif(1) < p.new/p.old){
    theta <- theta.new
    thetas <- c(thetas,theta)
    p.old <- p.new
  }
}

Plot:

plot(thetas, type="o", pch=21, bg="darkgrey")
hist(thetas[-(1:100)], freq=FALSE, col="grey", bor="darkgrey")
curve(6*x^5, add=TRUE, col=2, lwd=3)

Exercise:

Generalize the MCMC from above to estimate the probability from an experiment with any number of trials and any number of successes.

The Sampling Chain: Trimming, Thinning, Multiple Chains

  • That initial transition “towards” stationarity is called the Burn-In period, and must be trimmed
    • how? eyeballing!
  • The sampling procedure is (obviously) auto-correlated. Therefore, the final posterior distribution is typically thinned.
    • how? typically eyeballing, using acf() as guide
  • To guarantee that you are converging to the correct distribution, you typically obtain mutiple chains (e.g. 4) starting from different locations.

  • The effective sample size (\(N_{eff}\)) is the total length of the raw chains, minus the trimmed bit, divided by the thinning.

coda: MCMC diagnostics

The coda R package helps analyze MCMC chains. An example is line - a Bayesian fit of a linear regression (\(\alpha, \beta, \sigma\)).

require(lattice); require(coda); data(line); xyplot(line)


Clip the burn-in portion:

xyplot(line[[1]], start=10)

coda(): MCMC diagnostics

Check out the posterior distributions (also assessing convergence):

densityplot(line)


The correlation between the parameters, and the auto-correlations within the chains

levelplot(line[[2]])
acfplot(line)

coda(): Summary statistics

summary(line)
## 
## Iterations = 1:200
## Thinning interval = 1 
## Number of chains = 2 
## Sample size per chain = 200 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean     SD Naive SE Time-series SE
## alpha 2.9876 0.4984  0.02492        0.02375
## beta  0.7992 0.3367  0.01683        0.01597
## sigma 0.9681 0.7413  0.03707        0.05653
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%    50%    75% 97.5%
## alpha 1.9650 2.7321 3.0188 3.2420 3.877
## beta  0.1431 0.6039 0.7963 0.9925 1.470
## sigma 0.4250 0.6158 0.7912 1.0752 2.560

Tools for Running MCMC (within R)

MCMCpack contains a nice set of models

Package: MCMCpack
Version: 1.3-3
Date: 2013-5-1
Title: Markov chain Monte Carlo (MCMC) Package
Author: Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
Maintainer: Jong Hee Park 
Depends: R (>= 2.10.0), coda (>= 0.11-3), MASS, stats
Description: This package contains functions to perform Bayesian
        inference using posterior simulation for a number of
        statistical models. Most simulation is done in compiled C++
        written in the Scythe Statistical Library Version 1.0.3. All
        models return coda mcmc objects that can then be summarized
        using the coda package.  MCMCpack also contains some useful
        utility functions, including some additional density functions
        and pseudo-random number generators for statistical
        distributions, a general purpose Metropolis sampling algorithm,
        and tools for visualization.
License: GPL-3
SystemRequirements: gcc (>= 4.0)
URL: http://mcmcpack.wustl.edu
Packaged: 2013-05-01 00:37:02 UTC; parkjonghee
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-05-01 07:02:08
Built: R 3.0.1; x86_64-w64-mingw32; 2013-06-22 15:56:51 UTC; windows
Archs: i386, x64

Look at the list of models it can fit: ls(package:MCMCpack)

Logistic regression: Low Infant Birth Weight

birthwt.mcmc <- MCMClogit(low~age+as.factor(race)+smoke, data=birthwt)
xyplot(birthwt.mcmc)


densplot(birthwt.mcmc)

MCMC vs. GLM Logistic Regression

summary(birthwt.mcmc)
## 
## Iterations = 1001:11000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                      Mean     SD Naive SE Time-series SE
## (Intercept)      -0.98190 0.9109 0.009109       0.038909
## age              -0.03812 0.0341 0.000341       0.001403
## as.factor(race)2  1.03851 0.5002 0.005002       0.020253
## as.factor(race)3  1.08242 0.4185 0.004185       0.017995
## smoke             1.12993 0.3922 0.003922       0.016334
## 
## 2. Quantiles for each variable:
## 
##                      2.5%      25%      50%      75%   97.5%
## (Intercept)      -2.72363 -1.60072 -1.00723 -0.31925 0.87166
## age              -0.11025 -0.06131 -0.03675 -0.01468 0.02623
## as.factor(race)2  0.06971  0.69756  1.03927  1.38325 2.01703
## as.factor(race)3  0.30514  0.79286  1.06132  1.36351 1.94954
## smoke             0.38584  0.85418  1.14129  1.39460 1.88826

***

summary(glm(low~age+as.factor(race)+smoke, data=birthwt, family="binomial"))
## 
## Call:
## glm(formula = low ~ age + as.factor(race) + smoke, family = "binomial", 
##     data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4211  -0.9171  -0.5687   1.3687   2.0707  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -1.00755    0.86166  -1.169  0.24228   
## age              -0.03488    0.03340  -1.044  0.29634   
## as.factor(race)2  1.01141    0.49342   2.050  0.04039 * 
## as.factor(race)3  1.05673    0.40596   2.603  0.00924 **
## smoke             1.10055    0.37195   2.959  0.00309 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 218.86  on 184  degrees of freedom
## AIC: 228.86
## 
## Number of Fisher Scoring iterations: 4

MCMC vs. GLM Logistic Regression

For this application, there is no very good reason to use Bayesian modeling, UNLESS - you are a categorically / philosophically a “Bayesian”. - you have real prior information on regression coefficient (which is - basically - unheard of).

A major drawback - you can always be criticized for: - the priors - the finicky tuning parameters (thinning / convergence / burn-in)

But see some of the more complex models that MCMC can fit (e.g. hierarchical logit MCMChlogit)

Metropolis-Hastings

Metropolis-Hastings is great, simple, and general. BUT … sensitive to step size. AND … can be too slow, because it cab ends up rejecting a great many steps.

Gibbs sampling


In Gibbs sampling, rather than accept/reject with appropriate probabilities, you march around a parameter space with the appropriate conditional probabilities: \[\theta_{1,i+1} \sim p(\theta_{1,i} |X, \theta_{2,i})\] and draw a step from THAT distribution.

Then you fix the new \(\theta_1\) and draw the next \(\theta_2\) from the distribution conditioned on \(\theta_2\): \[\theta_{2,i+1} \sim p(\theta_{2,i} | X, \theta_{1,i})\]

Much faster (and therefore more common) than Metropolis-Hastings. Note the far higher effective sample size!

BUGS (OpenBUGS, WinBUGS) is Bayesian inference Using Gibbs Sampling

JAGS is “just another Gibbs sampler”

Other Samplers

Hamiltonian Monte Carlo (HMC) - is a Metropolis-Hastings that climbs gradients and is therefore faster and better with correlation between parameters.

No-U Turn Sampler (NUTS) - stops the MCMC when it is curling up on itself too much - which speeds things even more by not requiring a fixed length. This is what STAN uses (see http://arxiv.org/pdf/1111.4246v1.pdf).


(Hoffman and Gelman 2011)

Other Tools

You might want to create your own model to fit using Bayesian MCMC rather than rely on existing models. For this purpose, there are several tools to choose from.

  • BUGS / WinBUGS / OpenBUGS (Bayesian inference Using Gibbs Sampling) - granddaddy (since 1989) of Bayesian sampling tools. WinBUGS is proprietary. OpenBUGS is poorly supported. A lot of point and click with interfacing in R.

  • JAGS (Just Another Gibbs Sampler) accepts a model string written in an R-like syntax and that compiles and generate MCMC samples from this model using Gibbs sampling. Can be used within R with the rjags package.

  • Stan (named for Stanislaw Ulam) a fairly new program similar to JAGS - somewhat faster, somewhat more robust, growing rapidly. Generates C++ code from pseudo-R/C syntax. **We’ll be using Stan in lab! So install it! http://mc-stan.org/rstan.html**

  • Laplace’s Demon Intriguing looking set of Bayesian tools all within R: http://www.bayesian-inference.com/software

For a discussion / comparison, check out: http://www.sumsar.net/blog/2013/06/three-ways-to-run-bayesian-models-in-r/

STAN


Stanislaw Ulam (1909-1984) holding a mechanical Monte Carlo simulator of neutrino diffusion.


To fit a model using STAN, the steps are:

  1. generate a STAN syntax pseudo-code for the model (same in JAGS and BUGS)
  2. run an R command that compiles the model in C++
  3. use the generated function to fit your data

STAN example - Linear Regression

STAN code is a sort of hybrid between R (e.g. with handy distribution functions) and C (i.e. you have to declare your variables). Each model definition comes with three blocks:

1. The data block:

data {
  int n; //
  vector[n] y; // Y Vector
  vector[n] x; // X Vector
}

This specifies the raw data that you are going to enter. In this case, is is just Y and X, both of which are (numeric) vectors of length n, which is an integer that cannot be less than 0.

2. The parameter block
parameters{
  real beta0;  // intercept
  real beta1;  // slope
  real sigma; // residual variance
}

These list the parameters you want to estimate: the intercept, slope and variance.


3. The model block
model{
  // a useful intermediate variable:
    vector[n] yhat;   
  
  // Priors - (self-explanatory)
    sigma ~ inv_gamma(0.001, 0.001); 
    beta0 ~ normal(150,150);
    beta1 ~ normal(0, 10);
  
  // Likelihood
    for(i in 1:n){
      yhat[i] <- beta0 + beta1 * (x[i] - mean(x));}
    y ~ normal(yhat, sigma); 
}

Note:

  • you CAN vectorize, but looping is just as fast
  • lots of semicolons
  • many distributions (and functions like “mean”) are available

Refer frequently to the manual! https://github.com/stan-dev/stan/releases/download/v2.9.0/stan-reference-2.9.0.pdf

2. Compile the model in R

You save your model in a separate file (I called it lm.stan … note: newer versions of Rstudio provides STAN syntax coloring!) And then compile the model with the stan_model() command.

lm_stan <- stan_model(file = "lm.stan")

This command is taking the model you described and encoding and compiling a NUTS sampler in C++. That’s wonderful! Believe me it is a great, great pain (wihtout lots of experience) to write that C++ code yourself, and it is guaranteed to be way faster than the equivalent in R. Now you have your own Bayesian MCMC model in your workspace.

Note: this step can be quite slow!

3. Run the model in R

The key function here is sampling(). Note, also, that to give the data to your model, it has to be in the form of a list.

Simulating some data:

beta0 <- 4
beta1 <- -.5
sigma <- 2

X <- runif(100,0,20)
Y <- rnorm(100, beta0+beta1*X, sigma)
Data <- list(n = length(X), y=Y, x=X)

Perform the sampling!

lm_stan.fit <- sampling(lm_stan, Data)

lots of output here as it computes


print(lm_stan.fit, digits = 2)
## Inference for Stan model: lm.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## beta0    4.08    0.01 0.41    3.28    3.81    4.09    4.35    4.92  1312
## beta1   -0.49    0.00 0.03   -0.56   -0.51   -0.49   -0.47   -0.42  1327
## sigma    1.90    0.00 0.14    1.64    1.80    1.89    1.99    2.20  1356
## lp__  -113.67    0.04 1.26 -116.94 -114.26 -113.34 -112.74 -112.23   956
##       Rhat
## beta0    1
## beta1    1
## sigma    1
## lp__     1
## 
## Samples were drawn using NUTS(diag_e) at Thu Jun 04 14:31:27 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

MCMC diagnostics

To apply the coda family of diagnostic tools, you need to extract the chains from the STAN fitted object, and re-create it is as an mcmc.list.

require(plyr)
lm.chains <- extract(lm_stan.fit, inc_warmup=FALSE, permuted=FALSE, pars = c("beta0", "beta1", "sigma")) 
lm.mcmc <- mcmc.list(alply(lm.chains, 2, mcmc))

plot(lm.mcmc)

More interesting MCMC’s …

in lab!

  • change point - fitting the lilac data?
  • hierarchical (maybe example with stomach contents of fish?)

Pierre-Simon Laplace (1749-1827)

Essai Philosophique sur les Probabilités - 1814

[L]a théorie des probabilités n’est au fond, que le bon sens réduit au calcul … Par là, elle devient le supplément le plus heureux à l’ignorance et à la faiblesse de l’esprit humain… [I]l n’est point de science plus digne de nos méditations, et qu’il soit plus utile de faire entrer dans le système de l’instruction publique.

[T]he theory of probabilities is no more than common sense reduced to calculation… It is therefore the most happy supplement to the ignorance and weakness of the human spirit. There is no science more worthy of our meditations, or more useful to include in our system of public instruction.