Processing data is the least loved and often most time-consuming aspect of any data analysis. It is also unavoidable and extremely important. There are several tools - including some more recent R packages - that greatly facilitate the processing of data.
Several packages are particularly useful for data processing:
- plyr - manipulating data frames and lists (functions: mutate, ddply, ldply, dlply)
- lubridate - manipulating time objects
- rgdal - projecting coordinates
- maps and mapdata - quick and easy maps
- magrittr - for piping
Sneaky R trick: For loading a lot of packages - make a list:
pcks <- list('plyr', 'lubridate', 'ggmap', 'maptools', 'rgdal', 'maps', 'mapdata')
and then “apply” the require function to the list:
sapply(pcks, require, character.only = TRUE)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Probably the ideal format (in practice) for loading data into R is comma-separated values (.csv) - these are lightweight (all-text) files that can be opened in Excel - or, conversely, exported from Excel - and the function that reads them into R, read.csv(), is generally unfussy (though in many countries commas serve as decimal points and semicolons as delimiters - which can cause great confusion).
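If you do run into one of those European-style files, base R has read.csv2(), which assumes exactly that semicolon/comma-decimal convention. A minimal sketch, using a made-up two-line file (the values here are invented for illustration):

```r
# A made-up "European-style" csv: semicolons as delimiters, commas as decimals
tmp <- tempfile(fileext = ".csv")
writeLines(c("Latitude;Longitude", "40,355;18,17217"), tmp)

# read.csv2() expects the semicolon / comma-decimal convention,
# so the columns come in as proper numerics:
d <- read.csv2(tmp)
d$Latitude  # numeric 40.355, not the character string "40,355"
```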
In the real world, however, raw data doesn’t always conform to those ideals. We’re going to look at some tricks that will let us load, reformat, clean, and tidy data efficiently and consistently in R, using some short-toed eagles - a European/African species of lizard- and snake-eating eagle - as an example. Note: The data are courtesy of Dr. Ugo Mellone in Spain and for internal use only!
Here are the first few rows of one of the files:
Date Time Latitude(N) Longitude(E) Speed Course Altitude(m)
2010-07-19 15:00 40.35500 18.17217 0 272 103
2010-07-19 16:00 40.35467 18.17233 0 329 95
2010-07-19 17:00 40.35483 18.17233 0 246 90
2010-08-09 07:00 0.00000 0.00000 0 124 27
2010-08-09 09:00 0.00000 0.00000 0 92 27
2010-08-09 11:00 0.00000 0.00000 0 96 27
2010-08-14 13:00 40.56167 16.18200 0 126 591
These data are tab-separated, but the more general read.table function handles that with the "\t" separator (remember to include header=TRUE).
eagle <- read.table("./data/eagle.txt", sep="\t", header=TRUE)
head(eagle)
## Date Time Latitude.N. Longitude.E. Speed Course Altitude.m.
## 1 2010-07-19 15:00 40.35500 18.17217 0 272 103
## 2 2010-07-19 16:00 40.35467 18.17233 0 329 95
## 3 2010-07-19 17:00 40.35483 18.17233 0 246 90
## 4 2010-08-09 07:00 0.00000 0.00000 0 124 27
## 5 2010-08-09 09:00 0.00000 0.00000 0 92 27
## 6 2010-08-09 11:00 0.00000 0.00000 0 96 27
The str (“structure”) command is often the most useful for immediately getting a sense of the data:
str(eagle)
## 'data.frame': 881 obs. of 7 variables:
## $ Date : Factor w/ 115 levels "2010-07-19","2010-08-09",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ Time : Factor w/ 24 levels "00:00","01:00",..: 16 17 18 8 10 12 14 16 18 14 ...
## $ Latitude.N. : num 40.4 40.4 40.4 0 0 ...
## $ Longitude.E.: num 18.2 18.2 18.2 0 0 ...
## $ Speed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Course : int 272 329 246 124 92 96 126 104 96 290 ...
## $ Altitude.m. : int 103 95 90 27 27 27 591 27 27 647 ...
These data are - actually - relatively clean. For example, the Latitudes and Longitudes are numeric - which is already something. But there is almost NO dataset that doesn’t need some processing eventually.
The key location columns are those containing latitude and longitude, currently named Latitude.N. and Longitude.E. The parentheses in the original column headers (which are invalid in R column names) have been replaced with “.”. In any case, we don’t really need the “.N.” and “.E.”. Let’s use the rename function in the plyr package to fix that:
require(plyr)
eagle2 <- rename(eagle, c("Latitude.N." = "Latitude",
"Longitude.E." = "Longitude",
"Altitude.m." = "Altitude"))
Note - it’s bad form to rename an object with the same name as before, which is why I called this eagle2. It’s also not great form to have a whole bunch of eagle 1-2-3’s, because that’s confusing. We’ll see later how to combine all the data processing into a single command.
Let’s just do the most straightforward plot of the Long-Lat:
plot(eagle2$Longitude, eagle2$Latitude, type="o")
R-minder: you can do the same plot either by calling plot within the “environment” of the data frame:
with(eagle2, plot(Longitude, Latitude, type="o"))
or by passing the two columns within the call:
plot(eagle2[,c("Longitude","Latitude")], type="o")
The former, in particular, is handy as you can have a whole sequence of commands inside the with command and only actually refer to the eagle once.
There’s a nice, long migration here, but clearly also some screwy things! It looks like some of the errors are related to zeros … let’s count those:
with(eagle2, table(Longitude==0, Latitude==0))
##
## FALSE TRUE
## FALSE 755 0
## TRUE 0 126
Yikes! 126 of them. At least there are (apparently) no “true” 0’s. Eliminating these points, the plot looks like this:
with(subset(eagle2, Longitude != 0), plot(Longitude, Latitude, type="o"))
We have a choice now in our clean-up: Do we completely remove the rows where Lat-Long = 0 or do we leave them in there as NA’s? As a general principle, it is better not to remove data, especially because there are other columns in there that contain potentially important information (e.g. Altitude).
eagle3 <- eagle2
eagle3$Longitude[eagle3$Longitude == 0] <- NA
eagle3$Latitude[eagle3$Latitude == 0] <- NA
Sneaky R trick: Replacing values within a vector within a data frame can often be tedious in R. We can combine the above into one command using named references:
eagle3[eagle2$Longitude == 0, c("Longitude","Latitude")] <- NA
There is also a replace command (in base R) that operates on vectors. For example:
l <- replace(eagle2$Longitude, eagle2$Longitude == 0, "Hello!")
head(l)
## [1] "18.17217" "18.17233" "18.17233" "Hello!" "Hello!" "Hello!"
This command can be leveraged nicely with the mutate function (in plyr), which is super useful for editing columns in a data frame:
e <- mutate(eagle2, Longitude = replace(Longitude, Longitude == 0, "or me"),
Latitude = replace(Latitude, Latitude == 0, "not me"))
head(e)
## Date Time Latitude Longitude Speed Course Altitude
## 1 2010-07-19 15:00 40.355 18.17217 0 272 103
## 2 2010-07-19 16:00 40.35467 18.17233 0 329 95
## 3 2010-07-19 17:00 40.35483 18.17233 0 246 90
## 4 2010-08-09 07:00 not me or me 0 124 27
## 5 2010-08-09 09:00 not me or me 0 92 27
## 6 2010-08-09 11:00 not me or me 0 96 27
Or - you can make a custom function that does exactly what you want:
fixme <- function(x, v1 = 0, v2 = NA){ x[x==v1] <- v2; x}
e <- mutate(eagle2, Longitude = fixme(Longitude),
Latitude = fixme(Latitude))
head(e)
## Date Time Latitude Longitude Speed Course Altitude
## 1 2010-07-19 15:00 40.35500 18.17217 0 272 103
## 2 2010-07-19 16:00 40.35467 18.17233 0 329 95
## 3 2010-07-19 17:00 40.35483 18.17233 0 246 90
## 4 2010-08-09 07:00 NA NA 0 124 27
## 5 2010-08-09 09:00 NA NA 0 92 27
## 6 2010-08-09 11:00 NA NA 0 96 27
This last approach is most useful if you have some complicated fix-its that you’re after. Just remember - you still always have to assign the output of mutate.
But we still have at least one truly obvious and inexplicable outlier. How do we find it? We could identify it directly just by looking at the coordinates:
which(eagle3$Longitude < -60)
## [1] 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865
## [18] 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881
Yikes! Looks like a bunch of them, and all near the end. Let’s replace those as well and plot the final result:
eagle4 <- mutate(eagle3,
Longitude = replace(Longitude, Longitude < -60, NA),
Latitude = replace(Latitude, is.na(Longitude), NA))
with(subset(eagle4, !is.na(Longitude)), plot(Longitude, Latitude, type="o"))
Ok, looks pretty clean! There’s still one funny-looking outlier at the very bottom. But is it really a mistake? How would you find out?
Now - shockingly - we’ll violate two of our new rules. First - we’ll throw out all the rows with NA’s - to simplify our lives down the road. Second - we’ll name the new clean file eagle again.
eagle <- subset(eagle4, !is.na(Longitude))
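An equivalent base-R way to drop rows with missing coordinates - handy when several columns are involved - is complete.cases(). A tiny sketch with made-up data standing in for eagle4:

```r
# Made-up stand-in for eagle4, with one row of missing coordinates:
ee <- data.frame(Longitude = c(18.2, NA, 16.2),
                 Latitude  = c(40.4, NA, 40.6))
# complete.cases() is TRUE only for rows with no NA in the listed columns:
ee[complete.cases(ee[, c("Longitude", "Latitude")]), ]
```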
One of the scourges of movement data is dates and times, which are complicated and messy for a whole bunch of reasons related to inconsistent formatting, days and weeks and months not dividing nicely into years, etc.
When you read data in using read.table, all character strings are (by default, in older versions of R) converted to “factors”. A standard universal date-time format for data is POSIX. You used to have to muck around with the as.POSIXct / as.POSIXlt commands in R, which could be quite clunky, e.g.:
head(as.POSIXlt(paste(eagle$Date, eagle$Time), format = "%Y-%m-%d %H:%M", tz = "UTC"))
## [1] "2010-07-19 15:00:00 UTC" "2010-07-19 16:00:00 UTC"
## [3] "2010-07-19 17:00:00 UTC" "2010-08-14 13:00:00 UTC"
## [5] "2010-08-20 13:00:00 UTC" "2010-08-20 15:00:00 UTC"
The lubridate package can generate these objects much more quickly, without specifying the formatting. The functions ymd and hm below “intelligently” decipher the year-month-day and hour-minute formatting of the data.
require(lubridate)
head(ymd(eagle$Date) + hm(eagle$Time))
## [1] "2010-07-19 15:00:00 UTC" "2010-07-19 16:00:00 UTC"
## [3] "2010-07-19 17:00:00 UTC" "2010-08-14 13:00:00 UTC"
## [5] "2010-08-20 13:00:00 UTC" "2010-08-20 15:00:00 UTC"
Slick, right! In our data, the format was at least the (very sensible) YYYY-MM-DD. In the US, dates are very often written down as MM/DD/YYYY (and, infuriatingly, always saved that way by Excel). For lubridate … no problem:
dates <- c("5/7/1976", "10-24:1979", "1.1.2000")
mdy(dates)
## [1] "1976-05-07 UTC" "1979-10-24 UTC" "2000-01-01 UTC"
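One catch: lubridate can’t guess the ordering for you - the same ambiguous string parses two different ways, so the month-first vs day-first call is yours to make:

```r
library(lubridate)
# The same ambiguous string, parsed under the two orderings:
mdy("5/7/1976")  # month-day-year: May 7, 1976
dmy("5/7/1976")  # day-month-year: July 5, 1976
```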
Again, let’s mutate the data, adding a true date-time column:
eagle <- mutate(eagle, Date = ymd(Date), Time = hm(Time), DateTime = Date+Time)
head(eagle)
## Date Time Latitude Longitude Speed Course Altitude
## 1 2010-07-19 15H 0M 0S 40.35500 18.17217 0 272 103
## 2 2010-07-19 16H 0M 0S 40.35467 18.17233 0 329 95
## 3 2010-07-19 17H 0M 0S 40.35483 18.17233 0 246 90
## 7 2010-08-14 13H 0M 0S 40.56167 16.18200 0 126 591
## 10 2010-08-20 13H 0M 0S 40.56117 16.17767 0 290 647
## 11 2010-08-20 15H 0M 0S 40.56117 16.17767 0 349 662
## DateTime
## 1 2010-07-19 15:00:00
## 2 2010-07-19 16:00:00
## 3 2010-07-19 17:00:00
## 7 2010-08-14 13:00:00
## 10 2010-08-20 13:00:00
## 11 2010-08-20 15:00:00
One other nice feature of lubridate is the ease of extracting the hour, month, day of year, etc. from date-time objects, e.g.:
hour(eagle$DateTime)[1:10]
## [1] 15 16 17 13 13 15 17 19 21 5
month(eagle$DateTime)[1:10]
## [1] 7 7 7 8 8 8 8 8 8 8
yday(eagle$DateTime)[1:10]
## [1] 200 200 200 226 232 232 232 234 234 235
It is often very useful to simply plot the Time:
plot(eagle$DateTime, type="o", pch=19, col=rgb(0,0,0,.3), cex=0.5)
You can see when the gaps occurred, when the GPS collar had the biggest problems, and how the data acquisition was distributed over time.
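You can also quantify those gaps directly by differencing successive times. A base-R sketch, with a few made-up time stamps standing in for eagle$DateTime:

```r
# Made-up stand-in for eagle$DateTime:
tt <- as.POSIXct(c("2010-07-19 15:00", "2010-07-19 16:00",
                   "2010-08-09 07:00"), tz = "UTC")
dt <- diff(tt)                               # gaps between successive fixes
which(dt > as.difftime(1, units = "days"))   # which intervals exceed a day?
```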
We’ll get into fancier visualizations later, but for a long-ranging migrant like this one, it might be nice just to see where this bird is going on a globe. This can be done in two lines using the maps and mapdata packages:
require(mapdata) # also loads maps
map("world", xlim = c(-20,30), ylim = c(5,55), fill=TRUE, col="grey", bor=NA)
with(eagle, lines(Longitude, Latitude, pch=19, type="o", col=rgb(0,0,1,.3), cex=0.5))
box()
An impressive migration from Italy across southern France and Spain to the Sahel.
It is nice to have Lat-Long data for seeing where your data are located in the world. But degrees of latitude and longitude are not of consistent size, and movement statistics are best computed in a consistent format, with “real” distances between points - whether in kilometers or meters.
One widely used projection for this purpose is “Universal Transverse Mercator” (UTM), which transforms your data to meters north and east of a reference point in one of 60 zones, each spanning 6 degrees of longitude. One function that does this is convUL in PBSmapping. It’s a bit awkward to use, as the data must have column names “X” and “Y” and an attribute called “projection” set to “LL”, so that it knows it is converting from lat-long:
require(PBSmapping)
ll <- with(eagle, data.frame(X = Longitude, Y = Latitude))
attributes(ll)$projection <- "LL"
xy <- convUL(ll, km=TRUE)
plot(xy, asp=1, type="o")
Note that the function determined that the best zone for these data would be zone 31. That is an easy variable to change. Note, too, that we can now say that the total extent of the eagle’s migration was:
diff(range(xy$X))
## [1] 2053.71
diff(range(xy$Y))
## [1] 3741.004
So over 3700 km north-south, and 2000 km east-west.
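And with the coordinates in consistent km units, step lengths between successive fixes are just Pythagoras on the differences - a sketch with made-up points standing in for xy$X and xy$Y:

```r
# Made-up projected coordinates (km), standing in for xy$X and xy$Y:
X <- c(0, 3, 3)
Y <- c(0, 4, 4)
# Euclidean distance between successive fixes:
steps <- sqrt(diff(X)^2 + diff(Y)^2)
steps  # 5 0
```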
A more comprehensive projection standard is the “PROJ4” system, which GIS users should be aware of. In particular, when you are working with spatial files from other sources (e.g. habitat layers, bathymetry models), you need to make sure that your coordinates match up! From the EPSG registry (https://epsg.io/32631) we can identify a good projection string: +proj=utm +zone=31 +datum=WGS84 +units=m +no_defs
require(rgdal)
XY <- project(as.matrix(eagle[,c("Longitude", "Latitude")]),"+proj=utm +zone=31 +datum=WGS84 +units=m +no_defs")
For an eagle moving on such a large scale, the differences will be very minor. How would you compare those differences?
Usually, there is more than one individual that data have been collected on, and all of those preprocessing steps need to be applied to all of them. To avoid too much copying / pasting and consequent error introduction, it is best to write a single function that performs all these steps at once.
Note, in the function below, that all of the mutate commands can be combined in one command:
getUTM <- function(lon, lat){
xy <- project(cbind(lon, lat), "+proj=utm +zone=31 +datum=WGS84 +units=m +no_defs")
colnames(xy) <- c("X", "Y")
return(xy)}  # note: project() returns meters
cleanEagleData <- function(e){
e <- rename(e, c("Latitude.N." = "Latitude",
"Longitude.E." = "Longitude",
"Altitude.m." = "Altitude"))
e <- mutate(e,
Longitude = replace(Longitude, Longitude == 0 | Longitude < -60, NA),
Latitude = replace(Latitude, is.na(Longitude), NA),
Date = ymd(Date), Time = hm(Time), DateTime = Date + Time)
e <- data.frame(e, getUTM(e$Longitude, e$Latitude))
return(e)
}
e <- cleanEagleData(read.table("./data/eagle.txt", sep="\t", header=TRUE))
Super sneaky R trick: There’s a funky technique called “piping” in the magrittr package which allows you to do many steps as a single step. Check out the examples and documentation, but in the meantime, compare the code below with the code above:
require(magrittr)
cleanEagleData <- function(e){
rename(e, c("Latitude.N."="Latitude", "Longitude.E."="Longitude", "Altitude.m." = "Altitude")) %>%
mutate(Longitude = replace(Longitude, Longitude == 0 | Longitude < -60, NA),
Latitude = replace(Latitude, is.na(Longitude), NA),
Date = ymd(Date), Time = hm(Time), DateTime = Date + Time) %>%
{data.frame(., getUTM(.$Longitude, .$Latitude))}
}
There are several eagle datasets collected into the ./data/eagles/
folder. They have awkward names and it would be nice to automate loading all of those data.
You can access file lists from within R:
list.files("./data/eagles")
## [1] "44787g.txt" "80422g.txt" "80423g.txt" "80424g.txt"
If we save this list of files, we can automate the entire procedure. There are a few options; here is one using a simple loop:
directory <- "./data/eagles/"
files <- list.files(directory)
Eagles <- list()
for(i in 1:length(files)){
myfile <- files[i]
mydata <- data.frame(ID = myfile,
read.table(paste0(directory, myfile), sep="\t", header=TRUE))
Eagles[[myfile]] <- cleanEagleData(mydata)
}
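An alternative to the explicit loop is lapply(), which builds the list directly. A self-contained sketch of the same pattern, using toy tab-separated files in a temporary directory (the file contents here are made up, and the cleaning step is omitted):

```r
# Make two toy tab-separated files in a fresh temp directory:
dir <- file.path(tempdir(), "toy-eagles")
dir.create(dir, showWarnings = FALSE)
for (f in c("44787g.txt", "80422g.txt"))
  write.table(data.frame(x = 1:3), file.path(dir, f),
              sep = "\t", row.names = FALSE)

files <- list.files(dir)
# One lapply() replaces the whole loop; then name the list by file:
Dats <- lapply(files, function(f)
  data.frame(ID = f, read.table(file.path(dir, f), sep = "\t", header = TRUE)))
names(Dats) <- files
```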
Note how we were able to add items to the list named by the file name:
names(Eagles)
## [1] "44787g.txt" "80422g.txt" "80423g.txt" "80424g.txt"
Are you annoyed by the “.txt”s? Good. Because so am I:
names(Eagles) <- substr(files, 1, regexpr("[.]", files)-1)
names(Eagles)
## [1] "44787g" "80422g" "80423g" "80424g"
Eagles <- lapply(Eagles, function(e){
e$ID <- substr(e$ID, 1, regexpr("[.]", e$ID) - 1)
return(e)})
Notice that this loop placed all of the eagles into a single object - a list. This can be handy - later - for applying functions that summarize all of the data. Again, the plyr package comes in handy. Its ldply function takes a list (the l), applies a function to each element of the list, and returns a data frame (the d). We’ll be doing lots of this later, but here’s a quick example:
ldply(Eagles, summarize, n.loc = length(X),
X.range = diff(range(X))/1000,
Y.range = diff(range(Y))/1000,
T.start = min(Date),
T.end = max(Date))
## .id n.loc X.range Y.range T.start T.end
## 1 44787g 978 1343.4551 4258.763 2009-07-08 2009-11-25
## 2 80422g 1915 684.6241 4295.544 2008-07-04 2009-02-21
## 3 80423g 8181 1329.0142 4374.840 2008-07-11 2012-08-06
## 4 80424g 5840 2019.1740 4270.519 2009-07-08 2011-09-01
Incidentally, if you want to just collapse the whole list into a data frame:
Eagles.df <- ldply(Eagles)
str(Eagles.df)
## 'data.frame': 16914 obs. of 11 variables:
## $ .id : chr "44787g" "44787g" "44787g" "44787g" ...
## $ Date : POSIXct, format: "2009-07-08" "2009-07-08" ...
## $ Time :Class 'Period' atomic [1:16914] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "year")= num [1:978] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "month")= num [1:978] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "day")= num [1:978] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "hour")= num [1:978] 8 9 10 11 12 13 14 16 6 8 ...
## .. ..- attr(*, "minute")= num [1:978] 0 0 0 0 0 0 0 0 0 0 ...
## $ Latitude : num NA 38.4 38.4 38.4 38.4 ...
## $ Longitude: num NA -1.01 -1.01 -1.01 -1.01 ...
## $ Speed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Course : int 164 122 200 192 73 190 322 329 68 31 ...
## $ Altitude : int 15 698 700 0 700 699 700 700 706 701 ...
## $ DateTime : POSIXct, format: "2009-07-08 08:00:00" "2009-07-08 09:00:00" ...
## $ X : num 166021 149261 149246 149261 149246 ...
## $ Y : num 0 4253427 4253428 4253427 4253428 ...
Note that .id becomes the default name for the eagle IDs.
The most efficient way to save the processed data - which is now formatted correctly, and is a list, which does not automatically lend itself to being stored as a table (though see above) - is to use the all-powerful save command:
save(Eagles, file = "Eagles.rda")
Next time you fire up R for analysis, you can forget everything we did up here and just load the file:
load(file = "Eagles.rda")
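A close cousin worth knowing is saveRDS()/readRDS(): it stores a single object and - unlike load() - returns it, so you pick the name at load time. A quick sketch with a made-up stand-in for the Eagles list:

```r
EaglesDemo <- list(a = 1:3, b = "eagle")  # stand-in for the real Eagles list
f <- tempfile(fileext = ".rds")
saveRDS(EaglesDemo, f)
reloaded <- readRDS(f)           # assign to whatever name you like
identical(reloaded, EaglesDemo)  # TRUE
```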
So useful! Grab that data like that eagle grabs snakes.