Thursday, June 27, 2024

A Sorting Adventure in R

As part of an R application I have been working on, I need to grab some data from an online spreadsheet keyed by a column of labels, add a column of new values, and print the labels and new values as a two-column array whose row order matches the spreadsheet. The construction process involves breaking the original data into chunks, filling in the new values for each chunk separately, and gluing the chunks back together. The spreadsheet is sorted by the label column, so to match it (allowing a user to view the two side by side and see the same labels in the same order), I just needed to sort the output data frame by the label column ... or so I thought.

Since I was using the dplyr library already, and dplyr provides an arrange command to sort rows based on one or more columns, I started out with the following simple code (where df is the two-column data frame I created):

df |> dplyr::arrange(Label) |> print()

Unfortunately, the result was not sorted in the same order as the spreadsheet. Some of the labels began with numbers (1000), in one case a number containing a colon (10:20), and arrange listed those in the opposite order from the spreadsheet. Some of the names had funky capitalization (say, "BaNaNa"), and arrange treated capital letters as preceding lower case letters. Interestingly, the base R sort function sorted the labels in the same order the spreadsheet used. More interestingly, so did the following hack, suggested in a response to a question I posted online:

df |> dplyr::arrange(-dplyr::desc(Label)) |> print()

The desc function tells arrange to use descending order (the default being ascending order). So -desc tells arrange not to use descending order, meaning use ascending order, which is where we started ... and yet it somehow fixes the ordering problem.

The culprit turns out to be the default locale setting for arrange, which is "C". I'm guessing that means the C programming language. I filed a bug report in the dplyr repository on GitHub and was quickly told that the behavior was correct for locale "C" and that the solution to my problem was to feed arrange the optional argument .locale = "en". That did in fact fix things. The code now produces the expected sort order. Meanwhile, my bug report led to a new one about the difference in sort orders between arrange and desc. Depending on how that report is resolved, the -desc trick may stop working in the future.
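For the record, the supported fix is a one-argument change. (The .locale argument arrived in dplyr 1.1.0, and supplying a locale name like "en" requires the stringi package to be installed.)

# Sort using English collation rules rather than the default "C" locale.
df |> dplyr::arrange(Label, .locale = "en") |> print()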

Wednesday, June 26, 2024

Locating Date Columns in R

I've been working on a script that pulls data from an online spreadsheet (made with Google Sheets) shared by a committee. (Yes, I know the definition of "camel": a horse designed by a committee. The situation is what it is.) Once inhaled, the data resides in a data frame (actually, a tibble, but why quibble). At a certain point the script needs to compute for each row the maximum entry from a collection of date columns, skipping missing values.

Assuming I have the names of the date columns in a variable date_fields and the data in a data frame named df, the computation itself is simple. I use the dplyr library, so the following line of code

df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))

produces a copy of the data frame with an extra column "Latest" containing the most recent date from any of the date fields.

That, as it turns out, was the easy part. The hard part was populating the variable date_fields. Ordinarily I would just define it at the front of the script, plugging in either the names or the indices of the date columns. The problem is that the spreadsheet "evolves" as committee members other than myself make changes. If they add or delete a column, the indices of the date columns will change, breaking code based on indices. If they rename one of the date columns, that will break code based on a static vector of names. So I decided to scan the spreadsheet after loading it to find the date fields.

It turned out to be harder than it looked. After tripping over various problems, I searched online and found someone who had asked a similar question. The best answer pointed to a function in an R library named dataPreparation. I did not want to install another library just to use one function one time, so I futzed around a bit more and came up with the following function, which takes a data frame as input and returns a vector of the names of the columns that are dates (meaning that if you run the str() command on the data frame, they will be listed as dates). It requires the lubridate library (plus dplyr for the pull function), both of which I find myself commonly using. There may be more elegant ways to get the job done, but it works.

library(dplyr)     # for pull()
library(lubridate) # for is.Date()
# INPUT: a tibble (or data frame)
# OUTPUT: a character vector of the names of the columns that contain dates
dateColumns <- function(x) {
  # Get a named vector of logical values (TRUE if the column is a date).
  temp <- sapply(names(x), function(y) pull(x, y) |> is.Date())
  # Return the column names for which is.Date() is TRUE.
  which(temp) |> names()
}
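Putting the two pieces together, here is a short usage sketch (assuming df already holds the inhaled spreadsheet data):

# Find the date columns by name, then compute the latest date in each row.
date_fields <- dateColumns(df)
df <- df |>
  rowwise() |>
  mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE)) |>
  ungroup()  # drop the rowwise grouping when done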

Tuesday, April 30, 2024

From IP to CP

Someone asked on OR Stack Exchange how to convert an integer programming model into a constraint programming model. I think you can reasonably say that it involves a "paradigm shift", for a couple of reasons.

The first paradigm shift has to do with how you frame the problem, mainly in terms of the decision variables. Math programmers are trained to turn discrete decisions with a logical flavor into binary variables. Discrete quantities, such as how many bins of a certain type to use or how many workers to assign to a task, are expressed as general integer variables, but most other things end up turning into a slew of binary variables. The problem being solved in the ORSE question illustrates this nicely.

The problem is as follows. You have $N$ participants in a tournament involving some kind of racing. Importantly, $N$ is guaranteed to be an even number. There is one track with two lanes, and races are spread over $N-1$ days. Every participant races head to head with every other participant exactly once, and nobody races twice in the same day. For whatever reason, the left lane is preferable to the right lane, and so there is a "fairness" constraint that nobody is assigned the left lane on more than $M$ consecutive days. For some reason, the author also imposed a second fairness constraint that nobody be assigned to the right lane on more than $M$ consecutive days. Dimensions for the author's problem were $N=20$ and $M=2.$

The model has to assign participant pairs (races) to days and also make lane assignments. To decide against whom I must race on a given day, someone building an IP model will use binary variables to select my opponent. Similarly, they will use binary variables to select my lane assignment each day. So the author of the question had in his IP model a variable array opp[Competitors][Competitors][Tracks][Days] taking value 1 "if competitor 'c1' races with 'c2' on track 't' on day 'd'".

CP models are more flexible in their use of variables, and in particular general integer variables. So to decide my opponent on a given day, I can just use an integer variable array indexed by day, where the value is the index number of my opponent on the given day. Similarly, I could (and would) use an integer variable indexed by day to indicate my lane assignment that day, although in this case that variable does turn out to be binary, since there are only two lanes.

The second paradigm shift has to do with constraints, and it ties to what solver you are using. IP models have a very limited constraint "vocabulary". They all understand linear equalities and inequalities, and some understand some combination of SOS1, SOS2, second order cone and implication constraints. That's pretty much it. CP solvers have a richer "vocabulary" of constraints, but with the caveat that not many of those constraints are universal. I would wager that every CP solver has the "all different" constraint, and they must have the usual arithmetic comparisons ($=,\neq,\lt,\le,\gt,\ge$). Beyond that, it pays to check in advance.

I wrote a CP model (in Java) using IBM's CP Optimizer (CPO) to solve the scheduling problem. Details of the model can be sussed out from the Java code, but I will mention a few pertinent points here.

  • I did use an integer variable array to determine, for each combination of participant and day, the participant's opponent that day, as well as an integer array giving the lane assignment (0 or 1) for each combination of participant and day.
  • To make sure that, on any day, the opponent of X's opponent is X, I used CPO's inverse constraint. The constraint inverse(f, g) says that f[g[x]] = x and g[f[x]] = x for any x in the domain of the inner function.
  • To ensure that nobody raced the same opponent twice, I used allDiff, which is CPO's version of the all different constraint.
  • We have to do something to force opponents in a race to be in different lanes. Let $x_{i,d}$ and $y_{i,d}$ denote respectively the opponent and lane assignment for participant $i$ on day $d.$ In mathematical terms, the constraint we want is $y_{x_{i,d},d} \neq y_{i,d}.$ Indexing a variable with another variable is impossible in an IP model. In CPO, I used the element constraint to do just that.
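Pulling those bullets together, here is a compact sketch of the constraints in mathematical form. This is my own paraphrase (with the convention that $y_{i,d}=0$ means the left lane); the Java code remains the authoritative version, and the objective and symmetry-breaking constraint described below are omitted.

$$ \begin{align*} & x_{i,d}\in\{1,\dots,N\}\setminus\{i\},\quad y_{i,d}\in\{0,1\} && \forall i,d\\ & x_{x_{i,d},d}=i\ \text{(inverse)} && \forall i,d\\ & \text{allDiff}(x_{i,1},\dots,x_{i,N-1}) && \forall i\\ & y_{x_{i,d},d}\neq y_{i,d}\ \text{(element)} && \forall i,d\\ & 1\le\sum_{d'=d}^{d+M}y_{i,d'}\le M && \forall i,\ 1\le d\le N-1-M \end{align*} $$

Since each participant has $N-1$ possible opponents and $N-1$ racing days, the allDiff constraint forces everyone to race everyone else exactly once. The window constraints capture both fairness rules: in any $M+1$ consecutive days, each participant gets at least one right lane (the sum is at least 1) and at least one left lane (the sum is at most $M$).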

I added an objective function, namely to minimize the difference between the most and fewest times any participant gets assigned the preferred left lane. I also added one constraint to mitigate symmetry. Since any solution remains a solution (with the same objective value) under any permutation of the participant indices, I froze the first day's schedule as $1$ vs. $N$, $2$ vs. $N-1$, $3$ vs. $N-2,$ etc.

On my decent but not screamingly fast PC, CPO found a feasible solution almost instantly and a solution with objective value 1 in under a second. In that solution, every participant gets the left lane either nine or ten times out of the 19 racing days. It's not hard to prove that 1 is the optimal value (you cannot have everybody get exactly the same number of left lane assignments), but don't tell CPO that -- it was still chugging along when it hit my five minute time limit.

My Java code is available from my repository under a Creative Commons 4.0 open source license.

Sunday, April 21, 2024

Where Quadratic, Positive Definite and Binary Meet

A comment by Rob Pratt (of SAS) on OR Stack Exchange pointed out two things that are glaringly obvious in hindsight but that somehow I keep forgetting. Both pertain to an expression of the form $x'Qx + c'x,$ either in an objective function or in a second order cone constraint, where $x$ is a vector of variables and $Q$ and $c$ are parameters.

The first observation does not depend on the nature of the $x$ variables. We can without loss of generality assume that $Q$ is symmetric. If it is not, replace $Q$ with $\hat{Q} = \frac{1}{2}\left(Q + Q'\right),$ which is symmetric. A wee bit of algebra should convince you that $x'\hat{Q}x = x'Qx.$

The second observation is specific to the case where the $x$ variables are binary (which was the case in the ORSE question which drew the comment from Rob). When minimizing an objective function of the form $x'Qx + c'x$ or when using it in a second order cone constraint of the form $x'Qx + c'x \le 0,$ you want the $Q$ matrix to be positive definite. When $x$ is binary, this can be imposed easily.

Suppose that $x$ is binary and $Q$ is symmetric but not positive definite. The following argument uses the euclidean 2-norm. Let $$\Lambda = \max_{\parallel y \parallel = 1} -y'Qy,$$ so that $y'Qy \ge -\Lambda$ for any unit vector $y.$ Under the assumption that $Q$ is not positive definite, $\Lambda \ge 0.$ Choose some $\lambda > \Lambda$ and set $\hat{Q} = Q + \lambda I,$ where $I$ is the identity matrix of appropriate dimension. For any nonzero vector $y,$

$$ \begin{align*} y'\hat{Q}y & =y'Qy+\lambda y'Iy\\ & =\parallel y\parallel^{2}\left(\frac{y'}{\parallel y\parallel}Q\frac{y}{\parallel y\parallel}+\lambda\right)\\ & \ge\parallel y\parallel^{2}\left(-\Lambda+\lambda\right)\\ & >0. \end{align*} $$

So $\hat{Q}$ is positive definite. Of course, $x'\hat{Q}x \neq x'Qx,$ but this is where the assumption that $x$ is binary sneaks in. For $x_i$ binary we have $x_i^2 = x_i.$ So

$$ \begin{align*} x'\hat{Q}x & =x'Qx+\lambda x'Ix\\ & =x'Qx+\lambda\sum_{i}x_{i}^{2}\\ & =x'Qx+\lambda e'x \end{align*} $$

where $e=(1,\dots,1).$ That means the original expression $x'Qx + c'x$ is equal to $x'\hat{Q}x+(c-\lambda e)'x,$ giving us an equivalent expression with a positive definite quadratic term.
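As a numerical sanity check, here is a small R sketch of the whole construction. It uses the fact that, for symmetric $Q,$ $\Lambda$ is the negative of the smallest eigenvalue. The particular Q, c0 and choice of lambda are made-up illustrations.

# An asymmetric, non-positive-definite quadratic coefficient matrix.
Q <- matrix(c(1, 4, 0, -3), nrow = 2)
c0 <- c(2, -1)  # linear coefficients

# Observation 1: symmetrize without changing x'Qx.
Qs <- (Q + t(Q)) / 2

# Observation 2: Lambda = -(smallest eigenvalue of Qs); pick any lambda > Lambda.
lambda <- max(0, -min(eigen(Qs, symmetric = TRUE)$values)) + 1
Qhat <- Qs + lambda * diag(nrow(Qs))
min(eigen(Qhat, symmetric = TRUE)$values)  # strictly positive, so Qhat is PD

# For binary x, x'Qx + c0'x = x'Qhat x + (c0 - lambda e)'x.
x <- c(1, 0)  # any 0-1 vector
drop(t(x) %*% Qs %*% x) + sum(c0 * x)                 # original value
drop(t(x) %*% Qhat %*% x) + sum((c0 - lambda) * x)    # same value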

Thursday, April 11, 2024

Finding Duplicated Records in R

Someone asked a question about finding which records (rows) in their data frame are duplicated by other records. If you just want to know which records are duplicates, base R has a duplicated() function that will do just that. It occurred to me, though, that the questioner might have wanted to know not just which records were duplicates but also which records were the corresponding "originals". Here's a bit of R code that creates a small data frame with duplicated rows and then identifies original/duplicate pairs by row number.


library(dplyr)

# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))

# Find the indices of duplicated rows.
dup <- df |> duplicated() |> which()

# Split the source data into two data frames.
df1 <- df[-dup, ]  # originals (rows 1, 2 and 4)
df2 <- df[dup, ]   # duplicates (rows 3, 5, 6 and 7)

# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)

# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)

# Result:
#   Original Duplicate
# 1        1         5
# 2        1         7
# 3        2         3
# 4        2         6

The key here is that the inner_join function pairs rows from the two data frames (originals and duplicates) based on matching values in the "by" columns. The default value of "by" (NULL) tells it to match on all the columns the two data frames have in common -- which in this case is all the columns of the source data frame. The resulting data frame has the columns from the source data frame (here "a" and "b") plus the columns unique to each input ("Original" and "Duplicate"). Originals with no duplicates (such as row 4 here) simply drop out of the join. We use the select() command to drop the source columns and keep just the indices of the original and duplicate rows.

Monday, April 8, 2024

File Access in RStudio

I've been spending a fair bit of time in RStudio Desktop recently, much of it related to my work with INFORMS Pro Bono Analytics. I really like RStudio as a development environment for R code, including Shiny apps. It does, however, come with the occasional quirk. One of those has to do with how RStudio accesses the file system.

I tripped over this a couple of times recently when I wanted to open an R file that I had dropped in the /tmp directory on my Linux Mint system. The Files tab in RStudio appeared to be limited to the directory tree under my home directory; there was no way to browse to system directories like /tmp. A related quirk affects setting the default working directory (Tools > Global Options... > General > Basic > R Sessions): RStudio does not let you type in a directory name (perhaps a defense against typos?), and the Browse... button will not leave your home tree.

Initially I decided this was not important enough to worry about, but then I saw a post on the Posit Community forum by someone who was stuck trying to work from home due to a related issue. So I did a little experimentation and found a workaround, at least for the first problem (accessing files in places like /tmp). If I run setwd("/tmp") in the Console tab (which sets the working directory for the current R session), then click the More menu in the Files tab and select Go To Working Directory, the Files tab now browses /tmp, and I can navigate up to the system root directory and then down to anywhere within reason.

Changing the default starting directory is not something I actually care to do, but I'll document it here in case a reader might wish to do so. You can go to the IDE configuration directory (~/.config/rstudio on Linux and OS X, %appdata%\RStudio on Windows), open the rstudio-prefs.json file in a text editor, and change the value of the "initial_working_directory" entry to whatever starting directory you want. Save it, (re)start RStudio Desktop, and hopefully you begin in the right place.
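For example, to start sessions in /tmp, the relevant entry would look something like the following (the rest of the file, not shown here, is left untouched):

{
  "initial_working_directory": "/tmp"
}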


Friday, February 9, 2024

Another R Quirk

For the most part I like programming in R, but it is considerably quirkier than any other language I have used. I'm pretty sure that is what led to the development of what is known now as the "Tidyverse". The Tidyverse in turn introduces other quirks, as I've pointed out in a previous post.

One of the quirks in base R caused me a fair amount of grief recently. The context was an interactive program (written in Shiny, although that is beside the point here). At one point in the program the user would be staring at a table (the display of a data frame) and would select rows and columns for further analysis. The program would reduce the data frame to those rows and columns, and pass the reduced data frame to functions that would do things to it.

The program worked well until I innocently selected a bunch of rows and one column for analysis. That crashed the program with a rather cryptic (to me) error message saying that some function I was unaware of was not designed to work with a vector.

I eventually tracked down the line where the code died. The function I was unaware of apparently was baked into a library function I was using. As for the vector part, that was the result of what I would characterize as a "quirk" (though perhaps "booby trap" might be more accurate). I'll demonstrate using the mtcars data frame that automatically loads with R.

Consider the following code chunk.

rows <- 1:3
cols <- c("mpg", "cyl")
temp <- mtcars[rows, cols]
str(temp)

This extracts a subset of three rows and two columns from mtcars and presents it as a data frame.

'data.frame':    3 obs. of  2 variables:
 $ mpg: num  21 21 22.8
 $ cyl: num  6 6 4

So far, so good. Now suppose we choose only one column and rerun the code.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols]
str(temp)

Here is the result.

num [1:3] 21 21 22.8

Our data frame just became a vector. That was what caused the crash in my program.

Since I was using the dplyr library elsewhere, there was an easy fix once I knew what the culprit was.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, ] |> select(all_of(cols))
str(temp)

The result, as expected, is a data frame.

 'data.frame':    3 obs. of  1 variable:
 $ mpg: num  21 21 22.8

There will be situations where you grab one column of a data frame and want it to be a vector, and situations (such as mine) where you want it to be a data frame, so the designers of the language have to choose which route to go. I just wish they had opted to retain structure (in this case data frame) until explicitly dropped, rather than drop it without warning.
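For completeness: base R does provide an explicit way to keep the structure, namely the drop argument to the bracket operator, although the default behavior remains the trap described above. The base R version of the fix would have been:

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols, drop = FALSE]  # drop = FALSE retains the data frame
str(temp)

# 'data.frame':    3 obs. of  1 variable:
#  $ mpg: num  21 21 22.8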