Last lecture we talked about modeling correlations in the residuals of a linear model:
\[Y_i = \beta X_{ij} + \epsilon_{ij}\] where \[\epsilon \sim {\cal N}(0, \Sigma)\] and \(\Sigma\) is the variance-covariance matrix
Remember that for AR(1):
\[ \text{Cor}(\epsilon_t, \epsilon_{t-h}) = \phi^h \]
So the residual variance-covariance for an AR(1) is just: \[\Sigma = \sigma^2 \begin{bmatrix} 1 & \phi & \phi^2 & \phi^3 & ... & \phi^{n-1} \\ \phi & 1 & \phi & \phi^2 & ... & \phi^{n-2} \\ \phi^2 & \phi & 1 & \phi & ... & \phi^{n-3} \\ \phi^3 & \phi^2 & \phi & 1 & & \vdots \\ \vdots & \vdots & \vdots & & \ddots & \vdots\\ \phi^{n-1} & \phi^{n-2} & \phi^{n-3} & ... & ...& 1 \end{bmatrix}\]
\(\Sigma\) is estimated with two parameters, \(\phi\) and \(\sigma^2\)
The Gaussian (most commonly modeled) spatial correlation structure quantifies the correlation between (the residuals) of two points, assuming that it onnly depends on their distance:
\[\text{Cor}(\epsilon_{i,j}, \epsilon_{i,j}) = \exp \left(- \left({r_{i,j} \over \lambda}\right)^2\right) \] Where \(r_{ij} = ||{x_i,y_i}|| - ||{x_j,y_j}||\), i.e. the distance between the two points, and \(\lambda\) is a single parameter that quantifies the scale of spatial correlation. Note that:
The last fact is a useful way to interpret \(\lambda\) … as long as points are within distance \(\lambda\) of each other, they will be affecting each importantly.
As with serial autocorrelation (in time), the biggest risk of not accounting for spatial correlations is that you might detect spurious results, i.e. p-values will be artificially too low.
The gls
and nlme
functions in R can fit this (and a few other) spatial correlation structures in a linear modeling framework. As an example, here is one I found online which measures soil thickness as a function of quality at different locations
spdata <- read.csv("https://terpconnect.umd.edu/~egurarie/teaching/Biol709/data/soildata.csv")
## east north thick soil
## 1 70.66202 47.164981 0.70810547 4.512566
## 2 60.53693 77.351026 0.01785133 2.519441
## 3 87.18226 3.530827 1.11171322 8.133484
## 4 83.21457 21.109702 1.11715551 4.535713
## 5 75.42334 81.719902 0.36837944 5.274376
## 6 89.70406 22.805068 1.03328925 4.590200
A straightworward linear model suggests that quality (which is measured at the surface) is a significant predictor of thickness (which is harder to measure):
soil.lm <- lm(thick~soil, spdata)
summary(soil.lm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.47128554 0.06475729 7.277722 8.512716e-11
## soil 0.03572402 0.01385732 2.577989 1.142602e-02
The syntax to perform this fit is identical using gls
, which by default fits a simple linear model:
require(nlme)
soil.gls <- gls(thick~soil, spdata)
cbind(summary(soil.gls)$coef, anova(soil.gls))
## summary(soil.gls)$coef numDF F-value p-value
## (Intercept) 0.47128554 1 483.045263 1.170154e-39
## soil 0.03572402 1 6.646029 1.142602e-02
The coefficients (and p-values) are equal.
Is done as follows:
soil.gls <- gls(thick~soil, spdata,
correlation = corGaus(1,~east + north))
summary(soil.gls)
## Generalized least squares fit by REML
## Model: thick ~ soil
## Data: spdata
## AIC BIC logLik
## -324.0889 -313.7491 166.0445
##
## Correlation Structure: Gaussian spatial correlation
## Formula: ~east + north
## Parameter estimate(s):
## range
## 19.87503
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.6409182 0.07577223 8.458484 0.0000
## soil -0.0000164 0.00006041 -0.271171 0.7868
##
## Correlation:
## (Intr)
## soil -0.039
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.22586599 -0.79832354 -0.02467059 0.80239334 1.65427242
##
## Residual standard error: 0.2879282
## Degrees of freedom: 100 total; 98 residual
Note the syntax
corGauss(1, ~east+west)
Read about how this works in the ?corGauss
help file and see a list of other correlation structures in nlme
in the ?corClasses
help file
The lme
syntax is nearly identical. But: - because because it is a more general (and powerful) function, it is a bit more sluggish. - because it is designed for mixed effects models, it DEMANDS a “random effect”, which in this case is th column of dummy 1’s
require(plyr)
soil.lme <- lme(fixed = thick ~ soil, data = mutate(spdata, dummy = 1),
correlation = corGaus(1,~east + north),
random = ~ 1 | dummy, method = "ML")
summary(soil.lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: mutate(spdata, dummy = 1)
## AIC BIC logLik
## -343.0238 -329.998 176.5119
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 9.492085e-06 0.2852258
##
## Correlation Structure: Gaussian spatial correlation
## Formula: ~east + north | dummy
## Parameter estimate(s):
## range
## 19.87889
## Fixed effects: thick ~ soil
## Value Std.Error DF t-value p-value
## (Intercept) 0.6409115 0.07583082 98 8.451861 0.0000
## soil -0.0000163 0.00006036 98 -0.270433 0.7874
## Correlation:
## (Intr)
## soil -0.039
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.2469318 -0.8058647 -0.0248815 0.8100185 1.6699683
##
## Number of Observations: 100
## Number of Groups: 1
But the conclusions are the same.
Load the Ozone data in the plyr
function. This is a 24x24x72 numeric array which contains monthly ozone averages on a coarse (24 x 24 grid) covering Central America, from Jan 1995 to Dec 2000. Below, for example, is a plot of ozone concentrations on August 1995:
require(plyr); require(maps); require(fields)
O3 <- plyr::ozone
lat <- as.numeric(row.names(O3[,,1]))
long <- as.numeric(colnames(O3[,,1]))
image.plot(long,lat,O3[,,8])
map("world", add=TRUE, lwd=2, fill=FALSE)
Note that these data can be easily “reshaped” to a more classic data frame using the melt
function in reshape2
:
require(reshape2)
O3.df <- melt(O3)
head(O3.df)
## lat long time value
## 1 -21.2 -113.8 1 260
## 2 -18.7 -113.8 1 258
## 3 -16.2 -113.8 1 258
## 4 -13.7 -113.8 1 254
## 5 -11.2 -113.8 1 252
## 6 -8.7 -113.8 1 252
To add the “correct” month and year to this data frame:
O3.df <- mutate(O3.df, year = 1995 + floor((time-1)/12), month = time-floor((time-1)/12))
An important additional tweak - we don’t actually want to do spatial analysis in latitude/longitude. We need to convert to a consistent unit (e.g. in km). This is a fairly large area, and the distortion of the earth might distort any projection. But a quick way to deal with this is as follows:
require(PBSmapping)
ll <- with(O3.df, data.frame(X = long, Y = lat))
attributes(ll)$projection <- "LL"
xy <- convUL(ll, km=TRUE)
O3.df <- data.frame(O3.df, xy)
For the analysis below, pick any month and year combination.
Map the ozone concentration for that month/year combination.
Perform a straightforward linear regression of ozone concentrations against East and North. Does there appear to be a East-West or North-South trend in either of those directions?
Estimate the spatial scale of autocorrelation in ozone concentrations.
Does taking that correlation into account change your conclusions about longitudinal or latitudinal trends in ozone concentration?
Bonus Problem: Write a function that performs this analysis for any year month combination. Perform it across all months. Are there any consistent annual patterns in (a) spatial scale of correlation and (b) longitudinal and latitudinal trends in ozone?