Processing data is the least loved and often most time-consuming aspect of any data analysis. It is also unavoidable and extremely important. Fortunately, there are several tools - including some more recent R packages - that greatly facilitate the processing of data.
Several packages are particularly useful for data processing:
- plyr - for manipulating data frames and lists (key functions: mutate, ddply, ldply, dlply)
- lubridate - for manipulating time objects
- rgdal - for projecting coordinates
- maps and mapdata - for quick and easy maps
- magrittr - for piping

Sneaky R trick: to load a lot of packages at once, make a list:
pcks <- list('plyr', 'lubridate', 'ggmap', 'maptools', 'rgdal', 'maps', 'mapdata')
and then “apply” the require function to the list:
sapply(pcks, require, character.only = TRUE)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Probably the ideal format (in practice) for loading data into R is comma-separated values (.csv) - these are lightweight (all-text) files that can be opened in Excel - or, conversely, exported from Excel - and the function that reads them into R, read.csv(), is generally unfussy (though in many countries commas serve as decimal points and semicolons as delimiters, which can cause great confusion).
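For those semicolon-delimited, comma-decimal files, base R provides read.csv2(), or you can set the separator and decimal explicitly in read.csv(). A quick sketch (the file name "euro_data.csv" is just a placeholder):

```r
# semicolon-delimited file with commas as decimal points
# (the file name here is hypothetical)
dat <- read.csv2("euro_data.csv")
# equivalently, being explicit:
dat <- read.csv("euro_data.csv", sep = ";", dec = ",")
```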
In the real world, however, raw data doesn’t always conform with those ideals. We’re going to look at some tricks that will let us load, reformat, clean, and tidy data efficiently and consistently in R. We’ll use some boreal caribou and wolf data provided by Allicia Kelly as examples.
Here are the first few rows of one of the files:
kable(head(caribou.raw))
Animal.Id | Herd | Species | PTT | Gender | Region | Collar.Type | Latitude | Longitude | Location.Class | Location.Date | SerialDate |
---|---|---|---|---|---|---|---|---|---|---|---|
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13019 | -116.8509 | G | 2/26/2017 16:00 | 42790.67 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13285 | -116.8553 | G | 2/27/2017 8:00 | 42791.33 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13302 | -116.8535 | G | 2/27/2017 0:01 | 42791.00 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13388 | -116.8587 | G | 2/27/2017 16:00 | 42791.67 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.14203 | -116.8717 | G | 2/28/2017 0:00 | 42792.00 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.14324 | -116.8708 | G | 3/1/2017 8:00 | 42793.33 |
The read.csv command will do a good, quick job of loading the data and retaining the columns:
caribou.raw <- read.csv("./_data/SouthSlaveBoreal.csv")
Note that I am reading the file from a “relative directory”, i.e., relative to my working directory, which you can check with getwd():
getwd()
## [1] "C:/Users/elie/Box Sync/teaching/NWT2018/materials"
and set with setwd()
(or do either by point-and-clicking in the Session menu in RStudio). In my working directory I have a folder called _data which contains the raw data file.
Okay, great, now that it’s loaded we can see what it’s made of. A few useful commands for working with a data frame: 1. The dimensions (rows and columns):
dim(caribou.raw)
## [1] 118324 12
2. The column names:
names(caribou.raw)
## [1] "Animal.Id" "Herd" "Species" "PTT"
## [5] "Gender" "Region" "Collar.Type" "Latitude"
## [9] "Longitude" "Location.Class" "Location.Date" "SerialDate"
3. The first few rows:
head(caribou.raw)
## Animal.Id Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.13019 -116.8509 G
## 2 South Slave Telonics - GPS/Argos 60.13285 -116.8553 G
## 3 South Slave Telonics - GPS/Argos 60.13302 -116.8535 G
## 4 South Slave Telonics - GPS/Argos 60.13388 -116.8587 G
## 5 South Slave Telonics - GPS/Argos 60.14203 -116.8717 G
## 6 South Slave Telonics - GPS/Argos 60.14324 -116.8708 G
## Location.Date SerialDate
## 1 2/26/2017 16:00 42790.67
## 2 2/27/2017 8:00 42791.33
## 3 2/27/2017 0:01 42791.00
## 4 2/27/2017 16:00 42791.67
## 5 2/28/2017 0:00 42792.00
## 6 3/1/2017 8:00 42793.33
4. The View command:
View(caribou.raw)
will open the data in a new window. Or you can just click on the object in the “Environment” panel in RStudio. 5. Finally - and most comprehensively - the str command, which reveals the “structure” of the data frame:
str(caribou.raw)
## 'data.frame': 118324 obs. of 12 variables:
## $ Animal.Id : Factor w/ 107 levels "BWCA-WL-13-01",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Herd : Factor w/ 7 levels "Big Island","Buffalo Lake",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Species : Factor w/ 1 level "Boreal Caribou": 1 1 1 1 1 1 1 1 1 1 ...
## $ PTT : Factor w/ 100 levels "0-2365369","0-2367124",..: 20 20 20 20 20 20 20 20 20 20 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Region : Factor w/ 1 level "South Slave": 1 1 1 1 1 1 1 1 1 1 ...
## $ Collar.Type : Factor w/ 3 levels "Telonics - GPS/Argos",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Latitude : num 60.1 60.1 60.1 60.1 60.1 ...
## $ Longitude : num -117 -117 -117 -117 -117 ...
## $ Location.Class: Factor w/ 1 level "G": 1 1 1 1 1 1 1 1 1 1 ...
## $ Location.Date : Factor w/ 8299 levels "1/1/2016 0:00",..: 3656 3702 3681 3688 3711 4043 3737 4403 4027 4020 ...
## $ SerialDate : num 42791 42791 42791 42792 42792 ...
Note that some of the variables are factors and some are numeric. For example, Herd is a factor, which contains a fixed number of levels:
levels(caribou.raw$Herd)
## [1] "Big Island" "Buffalo Lake" "Buffalo West"
## [4] "Hay River Lowlands" "Mackenzie" "Pine Point"
## [7] "West Buffalo"
These data are relatively clean. For example, the latitudes and longitudes are numeric (as opposed to, e.g., "60.1 N, 117 W") - which is already something. But there are a few things that we need to do to clean them up.
Note that some of the columns in the original data frame had spaces in their names: “Animal Id”, “Collar Type”, “Location Date”. In R, spaces are not allowed in object names or column headings, so R automatically replaced them with “.”’s. Some of these are unwieldy, so I’ll fix them with the rename function:
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
You can count factors very conveniently with the table command:
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3148
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
## West Buffalo
## 134
Note that there is a “West Buffalo” herd - with very little data - and a “Buffalo West” herd with lots of data. I suspect they are, in fact, the same. One way to confirm is to plot one on top of the other. Here’s how:
with(subset(caribou, Herd == "Buffalo West"), plot(Longitude, Latitude, col = "blue"))
with(subset(caribou, Herd == "West Buffalo"), points(Longitude, Latitude, col = "red"))
So … as a first step of data manipulation, let’s just assign all the “West Buffalo” to “Buffalo West”.
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3282
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
## West Buffalo
## 0
Note that the factor data type remembers that there was once a “West Buffalo”. To force it to forget:
caribou$Herd <- droplevels(caribou$Herd)
Good - now we just have 6 herds.
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3282
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
One of the scourges of movement data is dates and times, which are complicated and poorly standardized for a whole bunch of reasons related to inconsistent formatting, days and weeks and months not dividing nicely into years, etc. Also, the formatting is all over the place … in sensible places it is YYYY-MM-DD, but in North America it is often MM/DD/YYYY.
When you read data in using read.csv, all strings that aren’t entirely numeric automatically become “factors”. The standard universal date-time format for data is called POSIX. You used to have to muck around with the as.POSIXct and as.POSIXlt commands in R, which could be quite clunky, e.g.:
head(as.POSIXlt(caribou$Time, format = "%m/%d/%Y %H:%M", tz = "UTC"))
## [1] "2017-02-26 16:00:00 UTC" "2017-02-27 08:00:00 UTC"
## [3] "2017-02-27 00:01:00 UTC" "2017-02-27 16:00:00 UTC"
## [5] "2017-02-28 00:00:00 UTC" "2017-03-01 08:00:00 UTC"
The lubridate package simplifies all of this with very simple commands like mdy_hm (and its siblings, e.g. ymd_hm), which will parse a month/day/year hour:minute string with much less fuss:
require(lubridate)
caribou$Time <- mdy_hm(caribou$Time)
We can confirm that the times are now actual times by, for example, looking at the range:
range(caribou$Time)
## [1] "2015-04-01 00:00:00 UTC" "2017-03-30 22:01:00 UTC"
So the data were collected between April 2015 and March 2017 (lots of data for 2 years!). One other nice feature of lubridate is the ease of getting day of year, hour, month, etc. from the data, e.g.:
hour(caribou$Time)[1:10]
## [1] 16 8 0 16 0 8 8 0 16 0
month(caribou$Time)[1:10]
## [1] 2 2 2 2 2 3 2 3 3 3
yday(caribou$Time)[1:10]
## [1] 57 58 58 58 59 60 59 61 60 60
You can also get time intervals with the very basic diff (difference) function:
caribou$Time[1:10]
## [1] "2017-02-26 16:00:00 UTC" "2017-02-27 08:00:00 UTC"
## [3] "2017-02-27 00:01:00 UTC" "2017-02-27 16:00:00 UTC"
## [5] "2017-02-28 00:00:00 UTC" "2017-03-01 08:00:00 UTC"
## [7] "2017-02-28 08:00:00 UTC" "2017-03-02 00:00:00 UTC"
## [9] "2017-03-01 16:00:00 UTC" "2017-03-01 00:00:00 UTC"
diff(caribou$Time[1:10])
## Time differences in hours
## [1] 16.000000 -7.983333 15.983333 8.000000 32.000000 -24.000000
## [7] 40.000000 -8.000000 -16.000000
In the very first rows, there are some negative time jumps! This suggests that the data are not ordered the way we would expect. Note also that (at least for the first few rows) the time intervals are pretty consistently multiples of 8 hours. Let’s do a very basic plot of time for only the first individual:
with(subset(caribou, ID == ID[1]), plot(Time, type="l"))
Yikes! There’s crazy jumping around. We should sort these data, first for one caribou. The easiest way to do this is with the arrange function in the plyr package.
require(plyr)
c1 <- subset(caribou, ID == ID[1])
head(c1)
## ID Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.13019 -116.8509 G
## 2 South Slave Telonics - GPS/Argos 60.13285 -116.8553 G
## 3 South Slave Telonics - GPS/Argos 60.13302 -116.8535 G
## 4 South Slave Telonics - GPS/Argos 60.13388 -116.8587 G
## 5 South Slave Telonics - GPS/Argos 60.14203 -116.8717 G
## 6 South Slave Telonics - GPS/Argos 60.14324 -116.8708 G
## Time SerialDate
## 1 2017-02-26 16:00:00 42790.67
## 2 2017-02-27 08:00:00 42791.33
## 3 2017-02-27 00:01:00 42791.00
## 4 2017-02-27 16:00:00 42791.67
## 5 2017-02-28 00:00:00 42792.00
## 6 2017-03-01 08:00:00 42793.33
head(arrange(c1, Time))
## ID Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.88155 -116.8018 G
## 2 South Slave Telonics - GPS/Argos 60.88181 -116.8016 G
## 3 South Slave Telonics - GPS/Argos 60.88147 -116.8020 G
## 4 South Slave Telonics - GPS/Argos 60.88044 -116.8005 G
## 5 South Slave Telonics - GPS/Argos 60.88104 -116.7970 G
## 6 South Slave Telonics - GPS/Argos 60.88190 -116.8018 G
## Time SerialDate
## 1 2015-04-01 00:00:00 42093.00
## 2 2015-04-01 08:00:00 42093.33
## 3 2015-04-01 16:00:00 42093.67
## 4 2015-04-02 00:00:00 42094.00
## 5 2015-04-02 08:00:00 42094.33
## 6 2015-04-02 16:00:00 42094.67
This looks a lot better! Plot it to confirm:
c1 <- arrange(c1, Time)
plot(c1$Time)
We can get an idea of the time intervals between observations (we need to use the difftime function to coerce the units to hours):
table(round(difftime(c1$Time[-1], c1$Time[-nrow(c1)], unit = "hour")))
##
## 0 8 16 24 32 40 48 56 64 72 80 104 112 120 128
## 271 1570 7 16 13 5 12 25 9 4 2 1 2 2 1
## 144
## 1
Mainly eight-hour intervals, which is good. A few gaps, which is expected - the longest 144 hours (6 days). The more concerning thing in the data is that there are 0’s - which are most likely errors and should be removed. Let’s check a couple to make sure that the rows are truly duplicated:
zeroes <- which(diff(c1$Time) == 0)
c1[c(zeroes[1], zeroes[1]+0:2),]
## ID Herd Species PTT Gender
## 1032 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1032.1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1033 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1034 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1032 South Slave Telonics - GPS/Argos 60.51420 -116.9698 G
## 1032.1 South Slave Telonics - GPS/Argos 60.51420 -116.9698 G
## 1033 South Slave Telonics - GPS/Argos 60.51506 -116.9646 G
## 1034 South Slave Telonics - GPS/Argos 60.51892 -116.9414 G
## Time SerialDate
## 1032 2016-07-03 42552
## 1032.1 2016-07-03 42552
## 1033 2016-07-03 42552
## 1034 2016-07-03 42552
The times are duplicated but the locations are not … suggesting that the error is probably in the recording of the times. This is an annoying problem to deal with - which do we keep and which do we toss? The easiest thing to do is to keep only the first observation and reject the other duplicates. This is very easy to do with the duplicated function:
c1.unique <- c1[!duplicated(c1[,"Time"]),]
We should now have no replicates in the time:
range(diff(c1.unique$Time))
## Time differences in mins
## [1] 1 8640
No 0 minute intervals (though a few as small as 1 and 2 minutes! … this might cause problems later).
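If those very short intervals do cause trouble downstream, one option is to also drop fixes that follow the previous fix too closely. A sketch, using a hypothetical 5-minute threshold:

```r
# time since previous fix, in minutes (Inf for the first fix, so it is kept)
dt <- c(Inf, as.numeric(difftime(c1.unique$Time[-1],
                                 c1.unique$Time[-nrow(c1.unique)],
                                 units = "min")))
# keep only fixes at least 5 minutes after the previous one
# (the 5-minute cutoff is an arbitrary choice for illustration)
c1.thinned <- c1.unique[dt >= 5, ]
```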
We’re going to have to run these fixes on each individual caribou. With the above, we have our key processing steps for these particular caribou data. Assuming that future data will arrive in exactly the same format as the data here, we’d like a compact piece of replicable code that starts from the raw file and performs all of those steps: renaming the columns, pooling the two Buffalo herds, converting the times, and sorting and de-duplicating the times within each individual. Here it is:
caribou.raw <- read.csv("./_data/SouthSlaveBoreal.csv")
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
caribou$Herd <- droplevels(caribou$Herd)
caribou$Time <- mdy_hm(caribou$Time)
caribou <- ddply(caribou, "ID", arrange, Time)
caribou <- ddply(caribou, "ID", subset, !duplicated(Time))
The ddply command (from the plyr package) applies a function (one that takes a dataframe and returns a dataframe) to each piece of the large dataframe caribou, subsetted by ID, and stitches the results back together. Because of the number of individuals, this is far more convenient than looping over each one by hand.
Sneaky R trick: The magrittr package allows you to “pipe” all (or a lot) of these processing commands into a single line using the funky %>% operator:
require(magrittr)
caribou <- caribou.raw %>% rename(c(Animal.Id = "ID", Location.Date = "Time")) %>% mutate(Time = mdy_hm(Time)) %>% ddply("ID", subset, !duplicated(Time))
Pretty nifty! Definitely compact.
We’ll get into fancier visualizations later, but for data quality control it is important to quickly plot the data to make sure there are no egregious errors.
The lat-long range of the data:
(ylim <- range(caribou$Latitude))
## [1] 59.64522 62.54530
(xlim <- range(caribou$Longitude))
## [1] -121.6614 -113.6695
Here’s a super-quicky map of Canada and the range of the data:
par(mgp = c(1.5,.25,0), tck = 0.01, cex.axis= 0.8)
map("worldHires", xlim = c(-150,-80), ylim = c(45,75), col="grey")
map("worldHires", "Canada", fill = TRUE, col = "grey", bor = "darkgrey", add = TRUE)
axis(1); axis(2)
rect(xlim[1], ylim[1], xlim[2], ylim[2], col = alpha("red", .5), bor = "red")
And a figure with a bunch of the individual caribou locations, color-coded (for now) by herd.
Herd.names <- levels(caribou$Herd)
n.herds <- length(Herd.names)
# a nicer palette
palette(rainbow(n.herds))
with(caribou[seq(1, nrow(caribou),10),],
plot(Longitude, Latitude, col = alpha(as.integer(Herd), .2), pch = 19, cex = 0.3))
# add Great Slave Lake
map("worldHires", add = TRUE, col = "brown")
legend("bottomleft", legend= Herd.names, col= 1:n.herds, pch = 19, cex = 0.7)
The data look pretty clean!
It is nice to have Lat-Long data for seeing where your data are located in the world. But degrees latitude and longitude are not of consistent size, and movement statistics are best used in a consistent format, with “real” distances between points - whether in kilometers or meters.
When interfacing with spatial data - projections are particularly important, since GIS data will have a strict coordinate system and it is important that your data share that coordinate system.
A widely used type of projection for this purpose is the “Universal Transverse Mercator” (UTM), which will transform your data to meters (or km) north and east of one of 60 reference points at 6-degree intervals around the equator. One function that does this is convUL in PBSmapping. It’s a bit awkward to use, as the data must have column names “X” and “Y” and an attribute called “projection” set to “LL” so that it knows it’s converting from lat-long:
require(PBSmapping)
ll <- with(caribou, data.frame(X = Longitude, Y = Latitude))
attributes(ll)$projection <- "LL"
xy <- convUL(ll, km=TRUE)
Let’s plot this one - but only for one of the herds:
caribou <- cbind(caribou, xy)
palette(rainbow(10))
with(subset(caribou, Herd == "Buffalo West"),
plot(X,Y, asp=1, col = ID, pch = 19, cex= 0.5, main = "Buffalo West - UTM"))
Note that the function determined that the best zone for these data would be zone 11. That is an easy variable to change.
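For example, to override the automatic choice, you can set a "zone" attribute on the lat-long data before converting (a sketch based on PBSmapping's attribute convention; treat it as an assumption and check the package documentation):

```r
# force UTM zone 12 rather than the automatically selected zone 11
attr(ll, "zone") <- 12
xy12 <- convUL(ll, km = TRUE)
```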
We can now say that the total extent of the caribou data (in km) is:
diff(range(xy$X))
## [1] 427.2115
diff(range(xy$Y))
## [1] 323.4098
So several hundred km north-south and east-west. We can add these columns to our original data frame and get some herd-specific summaries:
ddply(caribou, "Herd", summarize,
n.loc = length(X),
n.ind = length(unique(ID)),
time.range = diff(range(Time)),
x.range.km = diff(range(X)), y.range.km = diff(range(Y)))
## Herd n.loc n.ind time.range x.range.km y.range.km
## 1 Big Island 22 1 6.918056 days 3.192469 4.250903
## 2 Buffalo Lake 7063 4 729.666667 days 97.611953 123.638830
## 3 Buffalo West 3280 11 106.928472 days 48.274491 80.395766
## 4 Hay River Lowlands 46169 47 729.668056 days 341.930584 162.681226
## 5 Mackenzie 36367 30 729.917361 days 337.323586 169.819666
## 6 Pine Point 14022 14 729.667361 days 180.713336 125.181815
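One immediate payoff of projected coordinates is that step lengths come straight from the Pythagorean theorem. A sketch for a single individual (assuming, as above, that the data are sorted by time within each ID):

```r
# step lengths (km) for one caribou, from the projected X-Y columns
c1.xy <- subset(caribou, ID == ID[1])
steps <- sqrt(diff(c1.xy$X)^2 + diff(c1.xy$Y)^2)
summary(steps)
```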
A more comprehensive projection standard is the “PROJ4” system, which GIS users should be aware of. In particular when you are working with spatial files from other sources (e.g. habitat layers, bathymetry models), you need to make sure that your coordinates match up! This is an excellent resource for exploring projections: https://epsg.io/. The GNWT apparently likes using the Lambert Conformal Conic projection (https://epsg.io/102002) which has a PROJ4 code that looks like this: +proj=lcc +lat_1=50 +lat_2=70 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
require(rgdal)
NWT.proj4 <- "+proj=lcc +lat_1=50 +lat_2=70 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
XY <- project(cbind(caribou$Longitude, caribou$Latitude), NWT.proj4)
caribou$X <- XY[,1]
caribou$Y <- XY[,2]
with(subset(caribou, Herd == "Buffalo West"),
plot(X,Y, asp=1, col = ID, cex= 0.5, pch = 19,
main = "Buffalo West: Lambert Conic Canada"))
The distances between neighboring points in this projection should be similar to UTM, but note that the data have turned - so North is no longer up, and East is no longer to the right. This could be an issue if you’re interested in ranges in either of those directions.
I like shorter names for my individuals, so I often add a nickname column, in this case a contraction of the herd name and a number. The code to do this is below (no need to worry too much about the details):
require(magrittr)
HerdNames <- levels(caribou$Herd)
HerdShort <- c("BI", "BL", "BW", "HR", "MK", "PP")
caribou$HerdShort <- HerdShort[match(caribou$Herd, HerdNames)]
caribou <- ddply(caribou, "Herd", mutate, nickname = paste(HerdShort, as.integer(droplevels(ID)), sep=".")) %>% mutate(nickname = factor(nickname))
Check out the complete list of ID’s, herds and nicknames (not outputted here):
unique(caribou[,c("ID","Herd","HerdShort","nickname")])
There are a few clearly wrong points in the Mackenzie data set. You can see them in the figures earlier. Or, here, note the extreme values at the low end of the sorted longitudes:
plot(sort(caribou$Longitude)[1:20])
We’ll remove all (six) points that are west of 121 degrees W.
We’ll also get rid of the Big Island … there are very few of them and the total duration is under a week:
ddply(caribou, "Herd", summarize, duration = diff(range(Time)))
## Herd duration
## 1 Big Island 6.918056 days
## 2 Buffalo Lake 729.666667 days
## 3 Buffalo West 106.928472 days
## 4 Hay River Lowlands 729.668056 days
## 5 Mackenzie 729.917361 days
## 6 Pine Point 729.667361 days
So:
caribou <- subset(caribou, Longitude > -121 & Herd != "Big Island")
caribou$Herd <- droplevels(caribou$Herd)
Starting with the raw caribou data, we want to do the complete processing. And - to make it even more convenient - we’ll wrap it all in a function, of the raw data and the projection, that we can reuse later:
processCaribou <- function(caribou.raw, projection){
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
caribou$Time <- mdy_hm(caribou$Time)
# pool herds
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
caribou$Herd <- droplevels(caribou$Herd)
# remove duplicates
caribou <- ddply(caribou, "ID", subset, !duplicated(Time))
caribou <- ddply(caribou, "ID", arrange, Time)
XY <- project(cbind(caribou$Longitude, caribou$Latitude),
projection)
caribou$X <- XY[,1]
caribou$Y <- XY[,2]
# create nicknames
HerdNames <- levels(caribou$Herd)
HerdShort <- c("BI", "BL", "BW", "HR", "MK", "PP")
caribou$HerdShort <- HerdShort[match(caribou$Herd, HerdNames)]
caribou <- ddply(caribou, "Herd", mutate, nickname = paste(HerdShort, as.integer(droplevels(ID)), sep=".")) %>% mutate(nickname = factor(nickname))
# get rid of outliers
caribou <- subset(caribou, Longitude > -121 & Herd != "Big Island")
caribou$Herd <- droplevels(caribou$Herd)
# this line attaches the projection as an "attribute" of the data
attributes(caribou) <- c(attributes(caribou), projection = projection)
return(caribou)
}
Let’s test this, transforming to UTM (Zone 11) using its proj4 code:
UTM11.proj <- "+proj=utm +zone=11 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
caribou <- processCaribou(caribou.raw, UTM11.proj)
with(caribou,
plot(X/1000, Y/1000, col = alpha(as.integer(Herd), .2),
pch = 19, cex = 0.3, asp = 1, main = "All caribou"))
# to show that the projection is "carried" with the data
mtext(side = 3, attr(caribou, "projection"))
legend("bottomleft", legend= levels(caribou$Herd),
col= 1:length(levels(caribou$Herd)), pch = 19)
The most efficient way to save the processed data - which is now formatted correctly - is to use the all-powerful save command:
save(caribou, file = "./_data/caribou.rda")
Next time you fire up your R for analysis, you can forget everything we did up here and just load the file:
load(file = "./_data/caribou.rda")
A LOT of this stuff is really taken care of by storing data on Movebank (https://www.movebank.org). Then, with the move package in R, loading the data would look something like this:
mylogin <- movebankLogin(username = "xxxx", password = "xxxx")
caribou <- getMovebankData(study = "ABoVE: NWT South Slave Boreal Woodland Caribou", login = mylogin)
The data would be loaded with projections, time stamps, and other ancillary data - nominally all controlled for quality. See this page for more details (and an example): https://terpconnect.umd.edu/~egurarie/teaching/MovementAtICCB2017/LoadingMovebankData.html