Sunday, April 21, 2024

Where Quadratic, Positive Definite and Binary Meet

A comment by Rob Pratt (of SAS) on OR Stack Exchange pointed out two things that are glaringly obvious in hindsight but that somehow I keep forgetting. Both pertain to an expression of the form $x'Qx + c'x,$ either in an objective function or in a second order cone constraint, where $x$ is a vector of variables and $Q$ and $c$ are parameters.

The first observation does not depend on the nature of the $x$ variables. We can without loss of generality assume that $Q$ is symmetric. If it is not, replace $Q$ with the symmetric matrix $\hat{Q} = \frac{1}{2}\left(Q + Q'\right),$ which is symmetric. A wee bit of algebra should convince you that $x'\hat{Q}x = x'Qx.$

The second observation is specific to the case where the $x$ variables are binary (which was the case in the ORSE question which drew the comment from Rob). When minimizing an objective function of the form $x'Qx + c'x$ or when using it in a second order cone constraint of the form $x'Qx + c'x \le 0,$ you want the $Q$ matrix to be positive definite. When $x$ is binary, this can be imposed easily.

Suppose that $x$ is binary and $Q$ is symmetric but not positive definite. The following argument uses the euclidean 2-norm. Let $$\Lambda = \max_{\parallel y \parallel = 1} -y'Qy,$$ so that $y'Qy \ge -\Lambda$ for any unit vector $y.$ Under the assumption that $Q$ is not positive definite, $\Lambda \ge 0.$ Choose some $\lambda > \Lambda$ and set $\hat{Q} = Q + \lambda I,$ where $I$ is the identity matrix of appropriate dimension. For any nonzero vector $y,$

$$ \begin{align*} y'\hat{Q}y & =y'Qy+\lambda y'Iy\\ & =\parallel y\parallel^{2}\left(\frac{y'}{\parallel y\parallel}Q\frac{y}{\parallel y\parallel}+\lambda\right)\\ & \ge\parallel y\parallel^{2}\left(-\Lambda+\lambda\right)\\ & >0. \end{align*} $$

So $\hat{Q}$ is positive definite. Of course, $x'\hat{Q}x \neq x'Qx,$ but this is where the assumption that $x$ is binary sneaks in. For $x_i$ binary we have $x_i^2 = x_i.$ So

$$ \begin{align*} x'\hat{Q}x & =x'Qx+\lambda x'Ix\\ & =x'Qx+\lambda\sum_{i}x_{i}^{2}\\ & =x'Qx+\lambda e'x \end{align*} $$

where $e=(1,\dots,1).$ That means the original expression $x'Qx + c'x$ is equal to $x'\hat{Q}x+(c-\lambda e)'x,$ giving us an equivalent expression with a positive definite quadratic term.

Thursday, April 11, 2024

Finding Duplicated Records in R

Someone asked a question about finding which records (rows) in their data frame are duplicated by other records. If you just want to know which records are duplicates, base R has a duplicated() function that will do just that. It occurred to me, though, that the questioner might have wanted to know not just which records were duplicates but also which records were the corresponding "originals". Here's a bit of R code that creates a small data frame with duplicated rows and then identifies original/duplicate pairs by row number.


library(dplyr)

# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))

# Find the indices of duplicated rows.
dup <- df |> duplicated() |> which()

# Split the source data into two data frames.
df1 <- df[-dup, ]  # originals (rows 1, 2 and 4)
df2 <- df[dup, ]   # duplicates (rows 3, 5, 6 and 7)

# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)

# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)

# Result:
#   Original Duplicate
# 1        1         5
# 2        1         7
# 3        2         3
# 4        2         6

The key here is that the inner_join function pairs rows from each data frame (originals and duplicates) based on matching values in the "by" columns. The default value of "by" (NULL) tells it to match by all the columns the two data frames have in common -- which in the is case is all the columns in the source data frame. The resulting data frame will have the columns from the source data frame (here "a" and "b") plus the columns unique to each data frame ("Original" and "Duplicate"). We use the select() command to drop the source columns and just keep the indices of the original and duplicate rows.

Monday, April 8, 2024

File Access in RStudio

I've been spending a fair bit of time in RStudio Desktop recently, much of it related to my work with INFORMS Pro Bono Analytics. I really like RStudio as a development environment for R code, including Shiny apps. It does, however, come with the occasional quirk. One of those has to do with how RStudio accesses the file system.

I tripped over this a couple of times recently when I wanted to open an R file that I had dropped in the /tmp directory on my Linux Mint system. The Files tab in RStudio appeared to be limited to the directory tree under my home directory. There was no way to browse to system directories like /tmp. Similarly, there is a way to set the default working directory (Tools > Global Options... > General > Basic > R Sessions). RStudio does not let you type in a directory name (perhaps a defense against typos?), and the Browse... button will not leave your home tree.

Initially I decided this was not important enough to worry about, but then I saw a post on the Posit Community forum by someone who was stuck trying to work from home due to a related issue. So I did a little experimentation and found a workaround, at least for the first problem (accessing files in places like /tmp). If I run setwd("/tmp") in the Console tab (which sets the working directory for the current R session), then click the More menu in the Files tab and select Go To Working Directory, the Files tab now browses /tmp, and I can navigate up to the system root directory and then down to anywhere within reason.

Changing the default starting directory is not something I actually care to do, but I'll document it here in case a reader might wish to do so. You can go to the IDE configuration directory (~/.config/rstudio on Linux and OS X, %appdata%\RStudio on Windows), open the rstudio-prefs.json file in a text editor, and change the value of the "initial_working_directory" entry to whatever starting directory you want. Save it, (re)start RStudio Desktop, and hopefully you begin in the right place.


Friday, February 9, 2024

Another R Quirk

For the most part I like programming in R, but it is considerably quirkier than any other language I have used. I'm pretty sure that is what led to the development of what is known now as the "Tidyverse". The Tidyverse in turn introduces other quirks, as I've pointed out in a previous post.

One of the quirks in base R caused me a fair amount of grief recently. The context was an interactive program (written in Shiny, although that is beside the point here). At one point in the program the user would be staring at a table (the display of a data frame) and would select rows and columns for further analysis. The program would reduce the data frame to those rows and columns, and pass the reduced data frame to functions that would do things to it.

The program worked well until I innocently selected a bunch of rows and one column for analysis. That crashed the program with a rather cryptic (to me) error message saying that some function I was unaware of was not designed to work with a vector.

I eventually tracked down the line where the code died. The function I was unaware of apparently was baked into a library function I was using. As for the vector part, that was the result of what I would characterize as a "quirk" (though perhaps "booby trap" might be more accurate). I'll demonstrate using the mtcars data frame that automatically loads with R.

Consider the following code chunk.

rows <- 1:3
cols <- c("mpg", "cyl")
temp <- mtcars[rows, cols]
str(temp)

This extracts a subset of three rows and two columns from mtcars and presents it as a data frame.

'data.frame':    3 obs. of  2 variables:
 $ mpg: num  21 21 22.8
 $ cyl: num  6 6 4

So far, so good. Now suppose we choose only one column and rerun the code.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols]
str(temp)

Here is the result.

num [1:3] 21 21 22.8

Our data frame just became a vector. That was what caused the crash in my program.

Since I was using the dplyr library elsewhere, there was an easy fix once I knew what the culprit was.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, ] |> select(all_of(cols))
str(temp)

The result, as expected, is a data frame.

 'data.frame':    3 obs. of  1 variable:
 $ mpg: num  21 21 22.8

There will be situations where you grab one column of a data frame and want it to be a vector, and situations (such as mine) where you want it to be a data frame, so the designers of the language have to choose which route to go. I just wish they had opted to retain structure (in this case data frame) until explicitly dropped, rather than drop it without warning.