Dates and Times in R Without Losing Your Sanity
Working with dates and times in R can be frustrating! This isn't R's fault — dates and times are naturally complicated.
One must consider time zones, leap years, leap seconds, Daylight Savings, hundreds of potential date and time formats, and other quirky complexities.
The goal of this article is to give you the tools and knowledge to deal with dates and times in R so you can avoid common mistakes, saving your hair and extending your lifespan by a month or two.
(Note: this article was originally posted on my blog.)
Dates
The easiest way to make a date in R is to use as.Date(x)
, where x
is some date in yyyy-mm-dd format. For example,
> mydate <- as.Date("2018-01-01")
> mydate
[1] "2018-01-01"
> class(mydate)
[1] "Date"
Similarly, we can make a vector of dates by passing a vector of characters to as.Date()
.
> mydates <- as.Date(c("2018-01-01", "2018-05-31"))
> mydates
[1] "2018-01-01" "2018-05-31"
Perhaps the hardest, most common issue that arises when dealing with dates is that they often come in weird or cryptic formats. In these cases, you'll need to tell as.Date()
the format of the raw date values so that R can convert them into the canonical yyyy-mm-dd format. For example:
# year month day without spaces
> ugh <- c("20180205", "20170303")
> as.Date(ugh, format = "%Y%m%d")
[1] "2018-02-05" "2017-03-03"
# month day year with slashes
> really <- c("2/5/18", "3/3/18")
> as.Date(really, format = "%m/%d/%y")
[1] "2018-02-05" "2018-03-03"
# day month year with underscores
> whywouldsomeonedothis <- c("5_2_18", "3_3_18")
> as.Date(whywouldsomeonedothis, format = "%d_%m_%y")
[1] "2018-02-05" "2018-03-03"
# month day year with full month names
> annoying <- c("February 5, 2018", "March 3, 2018")
> as.Date(annoying, format = "%B %d, %Y")
[1] "2018-02-05" "2018-03-03"
You might be wondering how I figured out those format strings. The key here is to read R's documentation for strptime()
(see ?strptime
). In it, you'll find all of the special conversion specifications that tell R what to look for.
Here's a snippet from the docs:
Let's pause and consider what a date in R actually is. Consider:
> as.numeric(as.Date("2018-01-01"))
[1] 17532
That's right — you can cast a date to numeric. That's because internally, R stores a date as "number of days since January 1st, 1970."
Now that we know how to make dates in R, let's learn how to do stuff with them.
Add or subtract days from a date
> mydate <- as.Date("2018-01-01")
> mydate + 1
[1] "2018-01-02"
> mydate - 7
[1] "2017-12-25"
> mydate + c(0,1,2,3,4,5,6)
[1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05" "2018-01-06" "2018-01-07"
Get the time between two dates
> d1 <- as.Date("2017-01-01")
> d2 <- as.Date("2018-01-01")
> d2 - d1
Time difference of 365 days
> difftime(d2, d1, units = "day")
Time difference of 365 days
> difftime(d2, d1, units = "hour")
Time difference of 8760 hours
Determine the weekday of a date
> mydate <- as.Date("2018-01-01")
> weekdays(mydate)
[1] "Monday"
Change the format of a date
> mydate <- Sys.Date() # Get today's date
> myprettydate <- format(mydate, format = "%B %d, %Y")
> mydate
[1] "2018-04-27"
> myprettydate
[1] "April 27, 2018"
Note, however, that myprettydate
is type character, not Date.
> class(mydate)
[1] "Date"
> class(myprettydate)
[1] "character"
lubridate
The lubridate package in R lets us do a lot of other cool things with dates. Let's give it a go.
# install.packages(lubridate)
> library(lubridate)
Extract the day of month and month of year
> day(mydate)
[1] 27
> month(mydate)
[1] 4
Adding and Subtracting Months
Conceptually, this is a tricky topic. What does it mean to "add one month" to a date? For example, I think we'd all agree that one month after 2018-01-01 is 2018-02-01. But what's one month after 2018-01-31?
Base R provides a very handy seq.Date()
method that can be used to create a sequence of dates that differ by days, months, years, etc.
For example, if we create a sequence of dates that differ by one month, starting from 2018-01-31, we get:
> seq.Date(as.Date("2018-01-31"), length.out = 3, by = "month")
[1] "2018-01-31" "2018-03-03" "2018-03-31"
In this case, R is literally adding 1 to the month component of each date so that the resulting (attempted) dates are 2018-01-31, 2018-02-31, 2018-03-31, and 2018-04-31.
Obviously, dates like 2018-02-31 and 2018-04-31 don't exist, so R counts the difference between the day component and the last valid date in each month, and then adds that amount to the last valid month to get a valid date.
For example, 2018-02-31 is (in a sense) three days after 2018-02-28, so R resolves the date to 2018-03-03.
Alternatively, lubridate provides functionality for adding months to a date that behaves slightly differently.
> as.Date("2018-01-31") %m+% months(c(0,1,2))
[1] "2018-01-31" "2018-02-28" "2018-03-31"
In this case, 2018-02-31 resolves to 2018-02-28 so that the month component of the result is truly one month ahead of 2018-01-31.
Datetimes
Now that we've discussed dates, let's move on to the hard stuff — datetimes. Let's start by making a datetime from scratch.
> mydatetime <- as.POSIXct("2018-01-01 05:30:00")
> mydatetime
[1] "2018-01-01 05:30:00 CST"
Notice that R prints mydatetime
with the timezone CST. That's because my operating system was set up with the America/Chicago timezone (I'm based in New Orleans) and R is assuming it's the timezone I want.
Use Sys.timezone()
to see what timezone your OS is using. If we want to make a datetime with a specific timezone, we can use the tz
argument of as.POSIXct()
. For example:
> dtTokyo <- as.POSIXct("2018-01-01 5:00:00", tz = "Asia/Tokyo")
> dtCalifornia <- as.POSIXct("2018-01-01 03:30:00", tz = "America/Los_Angeles")
> dtSydney <- as.POSIXct("2018-01-01 22:30:00", tz = "Australia/Sydney")
For a list of timezones, you can plug into the tz
argument, see this Wikipedia article.
Note that dtCalifornia
and dtSydney
are actually "the same," meaning, if you were in LA at 3:30 on 2018-01-01 and you called your friend in Sydney, your friend's clock would show 22:30 (10:30 PM).
> dtCalifornia == dtSydney
[1] TRUE
In R, a vector of datetimes must all have the same timezone attribute. So, if you want to store the above datetimes together and retain the timezone information, the best approach is to store them in a dataframe with a DateTime column and a Timezone column, like so:
> df <- data.frame(
+ DateTime = c(dtTokyo, dtCalifornia, dtSydney),
+ Timezone = c(attr(dtTokyo, "tzone"), attr(dtCalifornia, "tzone"), attr(dtSydney, "tzone")),
+ stringsAsFactors = F
+ )
> df
DateTime Timezone
1 2017-12-31 14:00:00 Asia/Tokyo
2 2018-01-01 05:30:00 America/Los_Angeles
3 2018-01-01 05:30:00 Australia/Sydney
Notice that R actually converts the DateTime column to "America/Chicago" time (my local timezone). While this might be nice for me, my colleagues in California and New York might not appreciate it.
Instead, a better approach is to store the datetimes in Univeral Coordinated Time (UTC). UTC is like a reference point for which all other timezone representations are based upon.
> df$DateTime
[1] "2017-12-31 14:00:00 CST" "2018-01-01 05:30:00 CST" "2018-01-01 05:30:00 CST"
> attr(df$DateTime, "tzone") <- "UTC"
> df$DateTime
[1] "2017-12-31 20:00:00 UTC" "2018-01-01 11:30:00 UTC" "2018-01-01 11:30:00 UTC"
The important takeaway here is that we are NOT altering the datetime values. We are simply altering the way those datetimes are displayed. With that said, the timezone information is still important.
For example, suppose we want to add a column to our dataframe called Date indicating the date of each instance.
> df$Date <- as.Date(df$DateTime)
> df
DateTime Timezone Date
1 2017-12-31 20:00:00 Asia/Tokyo 2017-12-31
2 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
3 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
The above is probably NOT what we want. While those dates are correct in reference to UTC time, the first row for Tokyo shows Date 2017-12-31.
However, in Tokyo, that date would actually be recognized as 2018-01-01. Fortunately as.Date()
has a tz
parameter where we can specify a timezone, although it's not vectorized so we have to apply it row-by-row.
> dats <- mapply(FUN = function(datetime, timezone) as.Date(datetime, tz = timezone), df$DateTime, df$Timezone)
> dats <- as.Date(dats, origin = "1970-01-01")
> df$Date <- dats
> df
DateTime Timezone Date
1 2017-12-31 20:00:00 Asia/Tokyo 2018-01-01
2 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
3 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
# dats <- as.Date(dats, origin = "1970-01-01") is necessary here because mapply() resolves the dates to their numeric values.
In practice, I would typically do this with the data.table package, which is much faster for large datasets.
> library(data.table)
> dt <- as.data.table(df)
> dt[, Date := as.Date(DateTime, tz = Timezone[1L]), by = Timezone]
> dt
DateTime Timezone Date
1: 2017-12-31 20:00:00 Asia/Tokyo 2018-01-01
2: 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
3: 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
Daylight savings can be a pain when working with datetimes. Consider this weird behavior: in case you weren't aware, the beginning of Daylight Savings in US Central skips 2AM.
> seq.POSIXt(from = as.POSIXct("2018-03-11 01:00:00", tz = "America/Chicago"), length.out = 6, by = "15 mins")
[1] "2018-03-11 01:00:00 CST" "2018-03-11 01:15:00 CST" "2018-03-11 01:30:00 CST" "2018-03-11 01:45:00 CST"
[5] "2018-03-11 03:00:00 CDT" "2018-03-11 03:15:00 CDT"
Similarly, the end of Daylight Savings repeats 1AM.
> seq.POSIXt(from = as.POSIXct("2018-11-04 01:00:00", tz = "America/Chicago"), length.out = 6, by = "15 mins")
[1] "2018-11-04 01:00:00 CDT" "2018-11-04 01:15:00 CDT" "2018-11-04 01:30:00 CDT" "2018-11-04 01:45:00 CDT"
[5] "2018-11-04 01:00:00 CST" "2018-11-04 01:15:00 CST"
With these things in mind, take note of how R handles the following:
> as.POSIXct("2018-03-11 02:15:00", tz = "America/Chicago") # Turns into 12:00AM
[1] "2018-03-11 CST"
> as.POSIXct("2018-11-04 01:15:00", tz = "America/Chicago")
[1] "2018-11-04 01:15:00 CDT"
> as.POSIXct("2018-11-04 01:15:00", tz = "America/Chicago") + 60*60 # Add 1 hour (60*60 seconds)
[1] "2018-11-04 01:15:00 CST"
To make things more confusing, not all regions observe Daylight Savings. Some regions do a half-hour rollback and others do a 45-minute rollback. Due to the whacky and inconsistent behavior of datetimes, it's often easiest to work with UTC if possible (UTC does not observe daylight savings).
Lastly, let's take a look at some common and useful tasks involving datetimes.
?strptime
)
Formatting (see > mydatetime <- Sys.time()
> format(mydatetime, "%Y-%m-%d, %I:%M %p", usetz = T)
[1] "2018-04-27, 12:42 PM CDT"
> format(mydatetime, "%I:%M %p", usetz = T)
[1] "12:42 PM CDT"
> format(mydatetime, "%I:%M:%S %p", usetz = T)
[1] "12:42:04 PM CDT"
Adding time
> mydatetime <- as.POSIXct("2018-11-04 00:00:00", tz = "America/Chicago")
> mydatetime + 60 # add 60 seconds
[1] "2018-11-04 00:01:00 CDT"
> mydatetime + 24*60*60 # add 24 hours (NOTE this is the end of Daylight Savings)
[1] "2018-11-04 23:00:00 CST"
> mydatetime + lubridate::days(1) # add 1 day
[1] "2018-11-05 CST"
Note that leap seconds exist, but R's POSIXct class ignores them.
Create a sequence of datetimes
> mydatetime <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
> seq.POSIXt(from = mydatetime, length.out = 3, by = "5 mins")
[1] "2018-01-01 00:00:00 UTC" "2018-01-01 00:05:00 UTC" "2018-01-01 00:10:00 UTC"
Rounding datetimes
> library(lubridate)
> mydatetime <- as.POSIXct("2018-01-01 12:35:00", tz = "UTC")
> round_date(mydatetime, unit = "hour") # nearest hour
[1] "2018-01-01 13:00:00 UTC"
> floor_date(mydatetime, unit = "hour") # nearest hour not ahead
[1] "2018-01-01 12:00:00 UTC"
> ceiling_date(mydatetime, unit = "hour") # nearest hour not behind
[1] "2018-01-01 13:00:00 UTC"
In closing, R has the tools you need to do just about anything involving dates and times. However, it's your responsibility to think about things like leap years, Daylight Savings, timezones etc. in order to avoid common traps.
Use UTC and the lubridate package often.
Cheers!