## Thursday, February 12, 2015

### Parsing Months in R

As part of a recent analytics project, I needed to convert strings containing (English) names of months to the corresponding cardinal values (1 for January, ..., 12 for December). The strings came from a CSV file, and were translated by R to a factor when the file was read. The factor had more than 12 levels: to the literal-minded (which includes R), "August" and "August " (the latter with a trailing space) are different months.
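To make the problem concrete, here is a minimal sketch using made-up data. The vector `raw` is a hypothetical stand-in, not the project's actual CSV column. It shows how stray whitespace and capitalization inflate the number of factor levels:

```r
# Hypothetical stand-in for a month column read from a CSV file.
raw <- factor(c("August", "August ", "May", "may"))

# R treats every distinct string as its own level.
nlevels(raw)   # 4
levels(raw)

# Stripping leading/trailing whitespace merges "August" and "August ",
# but "May" and "may" still differ by capitalization.
cleaned <- gsub("^\\s+|\\s+$", "", as.character(raw))
unique(cleaned)
```

Trimming alone is therefore not enough, which is why a more tolerant parser is attractive.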

So I wanted a solution that was moderately robust with respect to extra spaces, capitalization, and abbreviation. A Google search turned up several solutions involving string manipulation, none of which entirely appealed to me. So I rolled my own, which I'm posting here. As usual, the code is licensed under a Creative Commons license (see the right-hand margin for details).

A few notes about the code:
• I used the lubridate package for its month() function, which extracts the month index from a date object. I know that some people dislike loading packages they don't absolutely need (memory consumption, namespace clashes, ...). I find lubridate::month() pleasantly robust, but if you want to avoid loading lubridate, I suggest you try one of the other methods posted on the Web.
• My code loads the magrittr package so that I can "pipeline" commands. If you load a package (such as dplyr) that in turn loads magrittr, you're covered. If you prefer the pipeR package, a minimal amount of tweaking should produce a version that works with pipeR. If you just want to avoid loading anything, the same logic will work; you just need to change the piping into nested function calls.
• I make no claim that this is the most efficient, most robust or most elegant solution. It just seems to work for me.

The code includes a small example of its use.

```r
library(lubridate)
library(magrittr)
#
# Function monthIndex converts English-language string
# representations of a month name to the equivalent
# cardinal value (1 for January, ..., 12 for December).
#
# Argument:
#   x  a character vector, or object that can be
#      coerced to a character vector
#
# Value:
#   a numeric vector of the same length as x,
#   containing the ordinals of the months named
#   in x (NA if the entry in x cannot be deciphered)
monthIndex <-
  function(x) {
    x                        %>%
      # strip any periods
      gsub("\\.", "", .)     %>%
      # turn it into a full date string
      paste0(" 1, 2001")     %>%
      # turn the full string into a date
      as.Date("%t%B %d, %Y") %>%
      # extract the month as an integer
      month
  }
#
# Unit test.
#
x <- c("Sep", "May", " July ", "huh?",
       "august", "dec ", "Oct. ")
monthIndex(x) # 9 5 7 NA 8 12 10
```
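
For readers who would rather not load lubridate or magrittr, the same logic can be sketched in base R with nested or sequential calls. The name `monthIndexBase` is hypothetical, and swapping lubridate's month() for `format(d, "%m")` is my substitution, not part of the code above. As with the original, month names are parsed according to the current locale, so this assumes an English one.

```r
# Sketch of a dependency-free variant; monthIndexBase is a hypothetical
# name used for illustration only.
monthIndexBase <- function(x) {
  # strip periods, then append a day and year to form a full date string
  full <- paste0(gsub("\\.", "", x), " 1, 2001")
  # parse the string; entries that cannot be parsed become NA
  d <- as.Date(full, format = "%t%B %d, %Y")
  # format() yields the two-digit month, which as.integer() converts
  as.integer(format(d, "%m"))
}

# Same test vector as in the example above.
x <- c("Sep", "May", " July ", "huh?",
       "august", "dec ", "Oct. ")
monthIndexBase(x)
```

The trade-off is that the result is an integer vector built from a formatted string rather than lubridate's return value, which should not matter for most uses.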