Wednesday, June 26, 2024

Locating Date Columns in R

I've been working on a script that pulls data from an online spreadsheet (made with Google Sheets) shared by a committee. (Yes, I know the definition of "camel": a horse designed by a committee. The situation is what it is.) Once inhaled, the data resides in a data frame (actually, a tibble, but why quibble). At a certain point the script needs to compute for each row the maximum entry from a collection of date columns, skipping missing values.

Assuming I have the names of the date columns in a variable date_fields and the data in a data frame named df, the computation itself is simple. I use the dplyr library, so the following line of code

df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))

produces a copy of the data frame with an extra column "Latest" containing the most recent date from any of the date fields.

That, as it turns out, was the easy part. The hard part was populating the variable date_fields. Ordinarily I would just define it at the front of the scripts, plugging in either the names or the indices of the date columns. The problem is that the spreadsheet "evolves" as committee members other than myself make changes. If they add or delete a column, the indices of the date columns will change, breaking code based on indices. If they rename one of the date columns, that will break code based on a static vector of names. So I decided to scan the spreadsheet after loading it to find the date fields.

It turned out to be harder than it looked. After tripping over various problems, I searched online and found someone who asked a similar question. The best answer pointed to a function in an R library named dataPreparation. I did not want to install another library just to use one function one time, so I futzed around a bit more and came up with the following function, which takes a data frame as input and returns a list of the names of columns that are dates (meaning that if you run the str() command on the data frame, they will be listed as dates). It requires the lubridate library, which I find myself commonly using. There may be more elegant ways to get the job done, but it works.

library(lubridate)
# INPUT: a tibble
# OUTPUT: a vector containing the names of the columns that contain dates
dateColumns <- function(x) {
  # Get a vector of logical values (true if column is a date) with names.
  temp <- sapply(names(x), function(y) pull(x, y) |> is.Date())
  # Return the column names for which is.Date is true.
  which(temp) |> names()
}

No comments:

Post a Comment

Due to intermittent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Operations Research Stack Exchange.