Thursday, February 12, 2015

Parsing Months in R

As part of a recent analytics project, I needed to convert strings containing (English) names of months to the corresponding cardinal values (1 for January, ..., 12 for December). The strings came from a CSV file, and were translated by R to a factor when the file was read. The factor had more than 12 levels: to the literal-minded (which includes R), "August" and "August " (the latter with a trailing space) are different months.
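To see the problem concretely, here is a small sketch using made-up data (the vector `raw` is hypothetical, not taken from the project's CSV file). It shows how stray spaces and capitalization inflate the number of factor levels.

```r
# Hypothetical input: the same month typed three different ways, plus May.
raw <- c("August", "August ", "May", "august")

# R treats each distinct string as its own level.
f <- factor(raw)
nlevels(f)  # 4 levels for only 2 actual months

# Trimming whitespace and lowercasing collapses the duplicates,
# but still leaves strings rather than month numbers.
cleaned <- factor(tolower(trimws(raw)))
nlevels(cleaned)  # 2
```

Cleaning the strings this way removes the spurious levels, but mapping names and abbreviations to numbers still needs a separate step.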

So I wanted a solution that was moderately robust with respect to extra spaces, capitalization, and abbreviation. A Google search turned up several solutions involving string manipulation, none of which entirely appealed to me. So I rolled my own, which I'm posting here. As usual, the code is licensed under a Creative Commons license (see the right-hand margin for details).

A few notes about the code:
  • I used the lubridate package to provide a function (month()) for extracting the month index from a date object. I know that some people dislike loading packages they don't absolutely need (memory consumption, namespace clashes, ...). I find the lubridate::month() function pleasantly robust, but if you want to avoid loading lubridate, I suggest you try one of the other methods posted on the Web.
  • My code loads the magrittr package so that I can "pipeline" commands. If you load a package (such as dplyr) that in turn loads magrittr, you're covered. If you prefer the pipeR package, a minimal amount of tweaking should produce a version that works with pipeR. If you just want to avoid loading anything, the same logic will work; you just need to change the piping into nested function calls.
  • I make no claim that this is the most efficient, most robust or most elegant solution. It just seems to work for me.
The code includes a small example of its use.

#
# Load libraries.
#
library(lubridate)
library(magrittr)
#
# Function monthIndex converts English-language string
# representations of a month name to the equivalent
# cardinal value (1 for January, ..., 12 for December).
#
# Argument:
#   x  a character vector, or object that can be
#      coerced to a character vector
#
# Value:
#   a numeric vector of the same length as x,
#   containing the month numbers (1-12) of the
#   months named in x (NA if the entry in x
#   cannot be deciphered)
monthIndex <- 
  function(x) {
    x                        %>%
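      # strip any periods (e.g., "Oct." becomes "Oct")
      gsub("\\.", "", .)     %>%
      # append a dummy day and year to form a full date string
      paste0(" 1, 2001")     %>%
      # turn the full string into a date; %t skips leading
      # whitespace, and %B matches full or abbreviated month
      # names in the current locale (so an English locale is
      # assumed here)
      as.Date("%t%B %d, %Y") %>%
      # extract the month as an integer
      month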
  }
#
# Unit test.
#
x <- c("Sep", "May", " July ", "huh?",
       "august", "dec ", "Oct. ")
monthIndex(x) # 9 5 7 NA 8 12 10
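
For readers who want to avoid loading any packages, the nested-call approach mentioned in the notes above might look like the following. This is only a sketch: the name monthIndexBase is made up for illustration, and it replaces lubridate::month() with base R's as.POSIXlt(), which stores months as 0 through 11. Like the original, it assumes an English locale.

```r
# Base-R sketch of monthIndex: no lubridate, no magrittr.
# (monthIndexBase is a hypothetical name, not part of any package.)
monthIndexBase <- function(x) {
  # as.character() handles factors; strip any periods
  cleaned <- gsub(".", "", as.character(x), fixed = TRUE)
  # append a dummy day and year, then parse as a date
  dates <- as.Date(paste0(cleaned, " 1, 2001"), format = "%t%B %d, %Y")
  # POSIXlt counts months from 0, so add 1; unparseable entries stay NA
  as.POSIXlt(dates)$mon + 1L
}

# Same test vector as above, this time as a factor,
# as it might arrive from a CSV file.
x <- factor(c("Sep", "May", " July ", "huh?",
              "august", "dec ", "Oct. "))
monthIndexBase(x)
```

In an English locale this should give the same result as monthIndex on the test vector.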