Processing data is the least loved and often most time-consuming aspect of any data analysis. It is also unavoidable and extremely important. Fortunately, there are several tools - including some more recent R packages - that greatly facilitate the processing of data.
Several packages are particularly useful for data processing:
- plyr - for manipulating data frames and lists (key functions: mutate, ddply, ldply, dlply)
- lubridate - for manipulating time objects
- rgdal - for projecting coordinates
- maps and mapdata - for quick and easy maps
- magrittr - for piping

Sneaky R trick: to load a lot of packages at once, make a list:
pcks <- list('plyr', 'lubridate', 'ggmap', 'maptools', 'rgdal', 'maps', 'mapdata')
and then “apply” the require function to the list:
sapply(pcks, require, character.only = TRUE)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Probably the ideal format (in practice) for loading data into R is comma-separated values (.csv) - these are lightweight (all-text) files that can be opened in Excel - or, conversely, exported from Excel - and the function that reads them into R, read.csv(), is generally unfussy (though in many countries commas serve as decimal points and semicolons as delimiters, which can cause great confusion).
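For those semicolon-delimited, comma-decimal files, base R provides read.csv2(), or you can set the separator and decimal explicitly in read.csv(). A quick sketch (the file name "euro_data.csv" is just a placeholder):

```r
# semicolon-delimited file with commas as decimal points
# (the file name here is hypothetical)
dat <- read.csv2("euro_data.csv")
# equivalently, being explicit:
dat <- read.csv("euro_data.csv", sep = ";", dec = ",")
```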
In the real world, however, raw data doesn’t always conform with those ideals. We’re going to look at some tricks that will let us load, reformat, clean, and tidy data efficiently and consistently in R. We’ll use some boreal caribou and wolf data provided by Allicia Kelly as examples.
Here are the first few rows of one of the files:
kable(head(caribou.raw))
Animal.Id | Herd | Species | PTT | Gender | Region | Collar.Type | Latitude | Longitude | Location.Class | Location.Date | SerialDate |
---|---|---|---|---|---|---|---|---|---|---|---|
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13019 | -116.8509 | G | 2/26/2017 16:00 | 42790.67 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13285 | -116.8553 | G | 2/27/2017 8:00 | 42791.33 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13302 | -116.8535 | G | 2/27/2017 0:01 | 42791.00 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.13388 | -116.8587 | G | 2/27/2017 16:00 | 42791.67 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.14203 | -116.8717 | G | 2/28/2017 0:00 | 42792.00 |
BWCA-WL-13-01 | Hay River Lowlands | Boreal Caribou | 123551 | Female | South Slave | Telonics - GPS/Argos | 60.14324 | -116.8708 | G | 3/1/2017 8:00 | 42793.33 |
The read.csv command will do a good, quick job of loading the data and retaining the columns:
caribou.raw <- read.csv("./_data/SouthSlaveBoreal.csv")
Note that I am reading the file from a “relative directory”, i.e., relative to my working directory, which you can check with getwd():
getwd()
## [1] "C:/Users/elie/Box Sync/teaching/NWT2018/materials"
and set with setwd()
(or do either by point-and-clicking in the Session menu in RStudio). In my working directory I have a folder called _data which contains the raw data file.
Okay, great, now that it’s loaded we can see what it’s made of. A few useful commands for working with a data frame: 1. The dimensions (rows and columns):
dim(caribou.raw)
## [1] 118324 12
2. The column names:
names(caribou.raw)
## [1] "Animal.Id" "Herd" "Species" "PTT"
## [5] "Gender" "Region" "Collar.Type" "Latitude"
## [9] "Longitude" "Location.Class" "Location.Date" "SerialDate"
3. The first few rows:
head(caribou.raw)
## Animal.Id Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.13019 -116.8509 G
## 2 South Slave Telonics - GPS/Argos 60.13285 -116.8553 G
## 3 South Slave Telonics - GPS/Argos 60.13302 -116.8535 G
## 4 South Slave Telonics - GPS/Argos 60.13388 -116.8587 G
## 5 South Slave Telonics - GPS/Argos 60.14203 -116.8717 G
## 6 South Slave Telonics - GPS/Argos 60.14324 -116.8708 G
## Location.Date SerialDate
## 1 2/26/2017 16:00 42790.67
## 2 2/27/2017 8:00 42791.33
## 3 2/27/2017 0:01 42791.00
## 4 2/27/2017 16:00 42791.67
## 5 2/28/2017 0:00 42792.00
## 6 3/1/2017 8:00 42793.33
4. The View command:
View(caribou.raw)
will open the data in a new window. Or you can just click on the object in the “Environment” panel in RStudio. 5. Finally - and most comprehensively - the str command, which reveals the “structure” of the data frame:
str(caribou.raw)
## 'data.frame': 118324 obs. of 12 variables:
## $ Animal.Id : Factor w/ 107 levels "BWCA-WL-13-01",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Herd : Factor w/ 7 levels "Big Island","Buffalo Lake",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Species : Factor w/ 1 level "Boreal Caribou": 1 1 1 1 1 1 1 1 1 1 ...
## $ PTT : Factor w/ 100 levels "0-2365369","0-2367124",..: 20 20 20 20 20 20 20 20 20 20 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Region : Factor w/ 1 level "South Slave": 1 1 1 1 1 1 1 1 1 1 ...
## $ Collar.Type : Factor w/ 3 levels "Telonics - GPS/Argos",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Latitude : num 60.1 60.1 60.1 60.1 60.1 ...
## $ Longitude : num -117 -117 -117 -117 -117 ...
## $ Location.Class: Factor w/ 1 level "G": 1 1 1 1 1 1 1 1 1 1 ...
## $ Location.Date : Factor w/ 8299 levels "1/1/2016 0:00",..: 3656 3702 3681 3688 3711 4043 3737 4403 4027 4020 ...
## $ SerialDate : num 42791 42791 42791 42792 42792 ...
Note that some of the variables are factors and some are numeric. For example, Herd is a factor, which contains a fixed number of levels:
levels(caribou.raw$Herd)
## [1] "Big Island" "Buffalo Lake" "Buffalo West"
## [4] "Hay River Lowlands" "Mackenzie" "Pine Point"
## [7] "West Buffalo"
These data are relatively clean. For example, the latitudes and longitudes are numeric (as opposed to, e.g., "60.1 N, 117 W") - which is already something. But there are a few things that we need to do to clean them up.
Note that some of the columns in the original data frame had spaces in their names: “Animal Id”, “Collar Type”, “Location Date”. In R, spaces are not allowed in object names or column headings, so R automatically replaced them with “.”’s. Some of these are unwieldy, so I’ll fix them with the rename function:
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
You can count factors very conveniently with the table command:
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3148
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
## West Buffalo
## 134
Note that there is a “West Buffalo” herd - with very little data - and a “Buffalo West” herd with lots of data. I suspect they are, in fact, the same. One way to confirm is to plot one on top of the other. Here’s how:
with(subset(caribou, Herd == "Buffalo West"), plot(Longitude, Latitude, col = "blue"))
with(subset(caribou, Herd == "West Buffalo"), points(Longitude, Latitude, col = "red"))
So … as a first step of data manipulation, let’s just assign all the “West Buffalo” to “Buffalo West”.
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3282
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
## West Buffalo
## 0
Note that the factor data type remembers that there was once a “West Buffalo”. To force it to forget:
caribou$Herd <- droplevels(caribou$Herd)
Good - now we just have 6 herds.
table(caribou$Herd)
##
## Big Island Buffalo Lake Buffalo West
## 22 8141 3282
## Hay River Lowlands Mackenzie Pine Point
## 51770 39246 15863
One of the scourges of movement data is dates and times, which are complicated and poorly standardized for a whole bunch of reasons related to inconsistent formatting, days and weeks and months not dividing nicely into years, etc. Also, the formatting is all over the place … in sensible places it is YYYY-MM-DD, but in North America it is often MM/DD/YYYY.
When you read data in using read.csv, all strings that aren’t entirely numeric automatically become “factors”. The standard universal date-time format for data is called POSIX. You used to have to muck around with the as.POSIXct and as.POSIXlt commands in R, which could be quite clunky, e.g.:
head(as.POSIXlt(caribou$Time, format = "%m/%d/%Y %H:%M", tz = "UTC"))
## [1] "2017-02-26 16:00:00 UTC" "2017-02-27 08:00:00 UTC"
## [3] "2017-02-27 00:01:00 UTC" "2017-02-27 16:00:00 UTC"
## [5] "2017-02-28 00:00:00 UTC" "2017-03-01 08:00:00 UTC"
The lubridate package simplifies all of this with very simple commands like mdy_hm (and its siblings, e.g. ymd_hm), which will parse a month/day/year hour:minute string with much less fuss:
require(lubridate)
caribou$Time <- mdy_hm(caribou$Time)
We can confirm that the times are now actual times by, for example, looking at the range:
range(caribou$Time)
## [1] "2015-04-01 00:00:00 UTC" "2017-03-30 22:01:00 UTC"
So the data were collected between April 2015 and March 2017 (lots of data for 2 years!). One other nice feature of lubridate is the ease of getting day of year, hour, month, etc. from the data, e.g.:
hour(caribou$Time)[1:10]
## [1] 16 8 0 16 0 8 8 0 16 0
month(caribou$Time)[1:10]
## [1] 2 2 2 2 2 3 2 3 3 3
yday(caribou$Time)[1:10]
## [1] 57 58 58 58 59 60 59 61 60 60
You can also get time intervals with the very basic diff (difference) function:
caribou$Time[1:10]
## [1] "2017-02-26 16:00:00 UTC" "2017-02-27 08:00:00 UTC"
## [3] "2017-02-27 00:01:00 UTC" "2017-02-27 16:00:00 UTC"
## [5] "2017-02-28 00:00:00 UTC" "2017-03-01 08:00:00 UTC"
## [7] "2017-02-28 08:00:00 UTC" "2017-03-02 00:00:00 UTC"
## [9] "2017-03-01 16:00:00 UTC" "2017-03-01 00:00:00 UTC"
diff(caribou$Time[1:10])
## Time differences in hours
## [1] 16.000000 -7.983333 15.983333 8.000000 32.000000 -24.000000
## [7] 40.000000 -8.000000 -16.000000
In the very first rows, there are some negative time jumps! This suggests that the data are not ordered the way we would expect. Note also that (at least for the first few rows) the time intervals are pretty consistently multiples of 8 hours. Let’s do a very basic plot of time for only the first individual:
with(subset(caribou, ID == ID[1]), plot(Time, type="l"))
Yikes! There’s crazy jumping around. We should sort these data, first for one caribou. The easiest way to do this is with the arrange function in the plyr package.
require(plyr)
c1 <- subset(caribou, ID == ID[1])
head(c1)
## ID Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.13019 -116.8509 G
## 2 South Slave Telonics - GPS/Argos 60.13285 -116.8553 G
## 3 South Slave Telonics - GPS/Argos 60.13302 -116.8535 G
## 4 South Slave Telonics - GPS/Argos 60.13388 -116.8587 G
## 5 South Slave Telonics - GPS/Argos 60.14203 -116.8717 G
## 6 South Slave Telonics - GPS/Argos 60.14324 -116.8708 G
## Time SerialDate
## 1 2017-02-26 16:00:00 42790.67
## 2 2017-02-27 08:00:00 42791.33
## 3 2017-02-27 00:01:00 42791.00
## 4 2017-02-27 16:00:00 42791.67
## 5 2017-02-28 00:00:00 42792.00
## 6 2017-03-01 08:00:00 42793.33
head(arrange(c1, Time))
## ID Herd Species PTT Gender
## 1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 2 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 3 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 4 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 5 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 6 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1 South Slave Telonics - GPS/Argos 60.88155 -116.8018 G
## 2 South Slave Telonics - GPS/Argos 60.88181 -116.8016 G
## 3 South Slave Telonics - GPS/Argos 60.88147 -116.8020 G
## 4 South Slave Telonics - GPS/Argos 60.88044 -116.8005 G
## 5 South Slave Telonics - GPS/Argos 60.88104 -116.7970 G
## 6 South Slave Telonics - GPS/Argos 60.88190 -116.8018 G
## Time SerialDate
## 1 2015-04-01 00:00:00 42093.00
## 2 2015-04-01 08:00:00 42093.33
## 3 2015-04-01 16:00:00 42093.67
## 4 2015-04-02 00:00:00 42094.00
## 5 2015-04-02 08:00:00 42094.33
## 6 2015-04-02 16:00:00 42094.67
This looks a lot better! Plot it to confirm:
c1 <- arrange(c1, Time)
plot(c1$Time)
We can get an idea of the time intervals between observations (we need to use the difftime function to coerce the units to hours):
table(round(difftime(c1$Time[-1], c1$Time[-nrow(c1)], unit = "hour")))
##
## 0 8 16 24 32 40 48 56 64 72 80 104 112 120 128
## 271 1570 7 16 13 5 12 25 9 4 2 1 2 2 1
## 144
## 1
Mainly eight-hour intervals, which is good. A few gaps, which is expected - the longest 144 hours (6 days). The more concerning thing in the data is that there are 0’s - which are most likely errors and should be removed. Let’s check a couple to make sure that the rows are truly duplicated:
zeroes <- which(diff(c1$Time) == 0)
c1[c(zeroes[1], zeroes[1]+0:2),]
## ID Herd Species PTT Gender
## 1032 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1032.1 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1033 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## 1034 BWCA-WL-13-01 Hay River Lowlands Boreal Caribou 123551 Female
## Region Collar.Type Latitude Longitude Location.Class
## 1032 South Slave Telonics - GPS/Argos 60.51420 -116.9698 G
## 1032.1 South Slave Telonics - GPS/Argos 60.51420 -116.9698 G
## 1033 South Slave Telonics - GPS/Argos 60.51506 -116.9646 G
## 1034 South Slave Telonics - GPS/Argos 60.51892 -116.9414 G
## Time SerialDate
## 1032 2016-07-03 42552
## 1032.1 2016-07-03 42552
## 1033 2016-07-03 42552
## 1034 2016-07-03 42552
The times are duplicated but the locations are not … suggesting that the error is probably in the recording of the times. This is an annoying problem to deal with - which do we keep and which do we toss? The easiest thing to do is to keep only the first observation and reject the other duplicates. This is very easy to do with the duplicated function:
c1.unique <- c1[!duplicated(c1[,"Time"]),]
We should now have no replicates in the time:
range(diff(c1.unique$Time))
## Time differences in mins
## [1] 1 8640
No 0 minute intervals (though a few as small as 1 and 2 minutes! … this might cause problems later).
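If those very short intervals do cause trouble downstream, one option is to also drop fixes that follow the previous fix too closely. A sketch, using a hypothetical 5-minute threshold:

```r
# time since previous fix, in minutes (Inf for the first fix, so it is kept)
dt <- c(Inf, as.numeric(difftime(c1.unique$Time[-1],
                                 c1.unique$Time[-nrow(c1.unique)],
                                 units = "min")))
# keep only fixes at least 5 minutes after the previous one
# (the 5-minute cutoff is an arbitrary choice for illustration)
c1.thinned <- c1.unique[dt >= 5, ]
```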
We’re going to have to run these fixes on each individual caribou. With the above, we have our key processing steps for these particular caribou data. Assuming that future data will arrive in exactly the same format as the data here, we’d like a compact piece of replicable code that starts from the raw file and performs all of those steps: renaming the columns, pooling the two Buffalo herds, converting the times, and sorting and de-duplicating the times within each individual. Here it is:
caribou.raw <- read.csv("./_data/SouthSlaveBoreal.csv")
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
caribou$Herd <- droplevels(caribou$Herd)
caribou$Time <- mdy_hm(caribou$Time)
caribou <- ddply(caribou, "ID", arrange, Time)
caribou <- ddply(caribou, "ID", subset, !duplicated(Time))
The ddply command (from the plyr package) applies a function (one that takes a dataframe and returns a dataframe) to each piece of the large dataframe caribou, subsetted by ID, and stitches the results back together. Because of the number of individuals, this is far more convenient than looping over each one by hand.
Sneaky R trick: The magrittr package allows you to “pipe” all (or a lot) of these processing commands into a single line using the funky %>% operator:
require(magrittr)
caribou <- caribou.raw %>% rename(c(Animal.Id = "ID", Location.Date = "Time")) %>% mutate(Time = mdy_hm(Time)) %>% ddply("ID", subset, !duplicated(Time))
Pretty nifty! Definitely compact.
We’ll get into fancier visualizations later, but for data quality control it is important to quickly plot the data to make sure there are no egregious errors.
The lat-long range of the data:
(ylim <- range(caribou$Latitude))
## [1] 59.64522 62.54530
(xlim <- range(caribou$Longitude))
## [1] -121.6614 -113.6695
Here’s a super-quicky map of Canada and the range of the data:
par(mgp = c(1.5,.25,0), tck = 0.01, cex.axis= 0.8)
map("worldHires", xlim = c(-150,-80), ylim = c(45,75), col="grey")
map("worldHires", "Canada", fill = TRUE, col = "grey", bor = "darkgrey", add = TRUE)
axis(1); axis(2)
rect(xlim[1], ylim[1], xlim[2], ylim[2], col = alpha("red", .5), bor = "red")
And a figure with a bunch of the individual caribou locations, color-coded (for now) by herd.
Herd.names <- levels(caribou$Herd)
n.herds <- length(Herd.names)
# a nicer palette
palette(rainbow(n.herds))
with(caribou[seq(1, nrow(caribou),10),],
plot(Longitude, Latitude, col = alpha(as.integer(Herd), .2), pch = 19, cex = 0.3))
# add Great Slave Lake
map("worldHires", add = TRUE, col = "brown")
legend("bottomleft", legend= Herd.names, col= 1:n.herds, pch = 19, cex = 0.7)
The data look pretty clean!
It is nice to have Lat-Long data for seeing where your data are located in the world. But degrees latitude and longitude are not of consistent size, and movement statistics are best used in a consistent format, with “real” distances between points - whether in kilometers or meters.
When interfacing with spatial data - projections are particularly important, since GIS data will have a strict coordinate system and it is important that your data share that coordinate system.
A widely used type of projection for this purpose is the “Universal Transverse Mercator” (UTM), which will transform your data to meters (or km) north and east of one of 60 reference points at 6-degree intervals around the equator. One function that does this is convUL in PBSmapping. It’s a bit awkward to use, as the data must have column names “X” and “Y” and an attribute called “projection” set to “LL” so that it knows it’s converting from lat-long:
require(PBSmapping)
ll <- with(caribou, data.frame(X = Longitude, Y = Latitude))
attributes(ll)$projection <- "LL"
xy <- convUL(ll, km=TRUE)
Let’s plot this one - but only for one of the herds:
caribou <- cbind(caribou, xy)
palette(rainbow(10))
with(subset(caribou, Herd == "Buffalo West"),
plot(X,Y, asp=1, col = ID, pch = 19, cex= 0.5, main = "Buffalo West - UTM"))
Note that the function determined that the best zone for these data would be zone 11. That is an easy variable to change.
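For example, to override the automatic choice, you can set a "zone" attribute on the lat-long data before converting (a sketch based on PBSmapping's attribute convention; treat it as an assumption and check the package documentation):

```r
# force UTM zone 12 rather than the automatically selected zone 11
attr(ll, "zone") <- 12
xy12 <- convUL(ll, km = TRUE)
```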
We can now say that the total extent of the caribou data (in km) is:
diff(range(xy$X))
## [1] 427.2115
diff(range(xy$Y))
## [1] 323.4098
So several hundred km north-south and east-west. We can add these columns to our original data frame and get some herd-specific summaries:
ddply(caribou, "Herd", summarize,
n.loc = length(X),
n.ind = length(unique(ID)),
time.range = diff(range(Time)),
x.range.km = diff(range(X)), y.range.km = diff(range(Y)))
## Herd n.loc n.ind time.range x.range.km y.range.km
## 1 Big Island 22 1 6.918056 days 3.192469 4.250903
## 2 Buffalo Lake 7063 4 729.666667 days 97.611953 123.638830
## 3 Buffalo West 3280 11 106.928472 days 48.274491 80.395766
## 4 Hay River Lowlands 46169 47 729.668056 days 341.930584 162.681226
## 5 Mackenzie 36367 30 729.917361 days 337.323586 169.819666
## 6 Pine Point 14022 14 729.667361 days 180.713336 125.181815
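One immediate payoff of projected coordinates is that step lengths come straight from the Pythagorean theorem. A sketch for a single individual (assuming, as above, that the data are sorted by time within each ID):

```r
# step lengths (km) for one caribou, from the projected X-Y columns
c1.xy <- subset(caribou, ID == ID[1])
steps <- sqrt(diff(c1.xy$X)^2 + diff(c1.xy$Y)^2)
summary(steps)
```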
A more comprehensive projection standard is the “PROJ4” system, which GIS users should be aware of. In particular when you are working with spatial files from other sources (e.g. habitat layers, bathymetry models), you need to make sure that your coordinates match up! This is an excellent resource for exploring projections: https://epsg.io/. The GNWT apparently likes using the Lambert Conformal Conic projection (https://epsg.io/102002) which has a PROJ4 code that looks like this: +proj=lcc +lat_1=50 +lat_2=70 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
require(rgdal)
NWT.proj4 <- "+proj=lcc +lat_1=50 +lat_2=70 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
XY <- project(cbind(caribou$Longitude, caribou$Latitude), NWT.proj4)
caribou$X <- XY[,1]
caribou$Y <- XY[,2]
with(subset(caribou, Herd == "Buffalo West"),
plot(X,Y, asp=1, col = ID, cex= 0.5, pch = 19,
main = "Buffalo West: Lambert Conic Canada"))
The distances between neighboring points in this projection should be similar to UTM, but note that the data have turned - so North is no longer up, and East is no longer to the right. This could be an issue if you’re interested in ranges in either of those directions.
I like shorter names for my individuals, so I often add a nickname column, in this case a contraction of the herd name and a number. The code to do this is below (no need to worry too much about the details):
require(magrittr)
HerdNames <- levels(caribou$Herd)
HerdShort <- c("BI", "BL", "BW", "HR", "MK", "PP")
caribou$HerdShort <- HerdShort[match(caribou$Herd, HerdNames)]
caribou <- ddply(caribou, "Herd", mutate, nickname = paste(HerdShort, as.integer(droplevels(ID)), sep=".")) %>% mutate(nickname = factor(nickname))
Check out the complete list of ID’s, herds and nicknames (not outputted here):
unique(caribou[,c("ID","Herd","HerdShort","nickname")])
There are a few clearly wrong points in the Mackenzie data set. You can see them in the figures earlier. Or, here, note the extreme values at the low end of the sorted longitudes:
plot(sort(caribou$Longitude)[1:20])
We’ll remove all (six) points that are west of 121 degrees W.
We’ll also get rid of the Big Island … there are very few of them and the total duration is under a week:
ddply(caribou, "Herd", summarize, duration = diff(range(Time)))
## Herd duration
## 1 Big Island 6.918056 days
## 2 Buffalo Lake 729.666667 days
## 3 Buffalo West 106.928472 days
## 4 Hay River Lowlands 729.668056 days
## 5 Mackenzie 729.917361 days
## 6 Pine Point 729.667361 days
So:
caribou <- subset(caribou, Longitude > -121 & Herd != "Big Island")
caribou$Herd <- droplevels(caribou$Herd)
Starting with the raw caribou data, we want to do the complete processing. And - to make it even more convenient - we’ll wrap it all in a function, of the raw data and the projection, that we can reuse later:
processCaribou <- function(caribou.raw, projection){
caribou <- rename(caribou.raw, c(Animal.Id = "ID", Location.Date = "Time"))
caribou$Time <- mdy_hm(caribou$Time)
# pool herds
caribou$Herd[caribou$Herd == "West Buffalo"] <- "Buffalo West"
caribou$Herd <- droplevels(caribou$Herd)
# remove duplicates
caribou <- ddply(caribou, "ID", subset, !duplicated(Time))
caribou <- ddply(caribou, "ID", arrange, Time)
XY <- project(cbind(caribou$Longitude, caribou$Latitude),
projection)
caribou$X <- XY[,1]
caribou$Y <- XY[,2]
# create nicknames
HerdNames <- levels(caribou$Herd)
HerdShort <- c("BI", "BL", "BW", "HR", "MK", "PP")
caribou$HerdShort <- HerdShort[match(caribou$Herd, HerdNames)]
caribou <- ddply(caribou, "Herd", mutate, nickname = paste(HerdShort, as.integer(droplevels(ID)), sep=".")) %>% mutate(nickname = factor(nickname))
# get rid of outliers
caribou <- subset(caribou, Longitude > -121 & Herd != "Big Island")
caribou$Herd <- droplevels(caribou$Herd)
# this line attaches the projection as an "attribute" of the data
attributes(caribou) <- c(attributes(caribou), projection = projection)
return(caribou)
}
Let’s test this, transforming to UTM (Zone 11) using its proj4 code:
UTM11.proj <- "+proj=utm +zone=11 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
caribou <- processCaribou(caribou.raw, UTM11.proj)
with(caribou,
plot(X/1000, Y/1000, col = alpha(as.integer(Herd), .2),
pch = 19, cex = 0.3, asp = 1, main = "All caribou"))
# to show that the projection is "carried" with the data
mtext(side = 3, attr(caribou, "projection"))
legend("bottomleft", legend= levels(caribou$Herd),
col= 1:length(levels(caribou$Herd)), pch = 19)
The most efficient way to save the processed data - which is now formatted correctly - is to use the all-powerful save command:
save(caribou, file = "./_data/caribou.rda")
Next time you fire up your R for analysis, you can forget everything we did up here and just load the file:
load(file = "./_data/caribou.rda")
A LOT of this stuff is really taken care of by storing data on Movebank (https://www.movebank.org). Then, with the move package in R, loading the data would look something like this:
mylogin <- movebankLogin(username = "xxxx", password = "xxxx")
caribou <- getMovebankData(study = "ABoVE: NWT South Slave Boreal Woodland Caribou", login = mylogin)
The data would be loaded with projections, time stamps, and other ancillary data - nominally all controlled for quality. See this page for more details (and an example): https://terpconnect.umd.edu/~egurarie/teaching/MovementAtICCB2017/LoadingMovebankData.html