
Monday, July 14, 2025

A Shinyvalidate Hack

Lately I've been coding a Shiny web application (in R) in which users are confronted by several forms. Since their inputs go into a database, some amount of input validation seems prudent (to put it mildly). I came across the shinyvalidate package, which is proving very useful. You can specify a variety of rules for input fields (starting with whether they are required and moving on to various filters on what is allowed). If the input fails to meet the validation rules, messages are printed in red under the input controls (something you've probably seen before if you ever tried to submit a web form with something missing or glaringly incorrect).

The remainder of this post probably won't make much sense unless you have some experience with shinyvalidate. (I'm not promising it will make sense if you do.) I ran into a bit of a snag with a few input fields where the validation rules involved a disjunction. In one case the field was the URL of an organization's web site, and in the other it was their contact email address. Both fields are "required" in my application, with required in quotes because it is possible for an organization not to have a web site and/or not to have a contact email address. (For instance, some organizations have web sites but no email address, expecting people to use a contact form on their home page.) The approach I'm taking is to allow "none" as a valid response for both web and email addresses if there is none. So I want to enforce the rule that the input value is either "none" or a valid address.

The shinyvalidate package provides mechanisms for enforcing conjunctions of rules (satisfy all of these requirements) but apparently does not have an explicit mechanism for enforcing disjunctions (either this or that). It does let you create custom functions to validate inputs, but I wanted to use their built-in web (sv_url()) and email (sv_email()) functions so that I would not have to go down the regular expression rabbit hole. That meant writing a function that combined a simple test (x != "none", where x is the value submitted in a text field of the form) with their functions.

It took me a while to figure out that while sv_url and sv_email take a few arguments (including an optional customized error message), they do not actually take the form input (what I'm calling x) as an argument. Instead, the output value of sv_url() or sv_email() is itself a function with the form input as its sole argument. Once that sank in, making the custom validators was trivial. Here is my URL validator.

urlOrNone <- function(x) {
  if (x != "none") sv_url("Please enter a valid URL, including the http/https prefix.")(x)
}

If x either equals "none" or satisfies the sv_url validator, the return value is NULL, which allows the input to pass validation. Otherwise, the user gets the custom error message and the input is disallowed. 
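The email validator is the same idea. Here is a sketch, assuming the same "none" convention (sv_email, like sv_url, takes a custom message as its first argument):

emailOrNone <- function(x) {
  if (x != "none") sv_email("Please enter a valid email address, or \"none\".")(x)
}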

Monday, April 14, 2025

Retaining Libraries During R Upgrades

Today I was able to upgrade R to version 4.5, and was reminded in the process of a tedious "feature" of the upgrade.

R libraries are organized into two distinct groups. If you use RStudio, look at the "Packages" tab. You will see the heading "User Library" followed by whatever packages you have installed manually. Scroll down and you will get to another heading, "System Library", followed by another group of packages. These are the packages that were automatically installed when you installed R itself.

The upgrade from R 4.4 to R 4.5 was very easy, since it comes as a system package, at least on Ubuntu and Linux Mint (and presumably other Linux distributions). I'm not sure about Windows, macOS etc. The Mint update manager offered me updates to several system packages (r-base, r-base-core, r-base-html and r-recommended) among the morning's gaggle of updates. I just installed those and R 4.4.3 was replaced by R 4.5.0. That part could not be easier.

After the updates were done, I opened RStudio and looked to see if any packages there needed updates. The "System Library" group was there, and none of them needed updates (no shock since they had just been installed during the upgrade). The "User Library" did not exist. I should have known this was coming based on previous R upgrades, but I forgot.

You can of course reinstall all your previously installed libraries manually (if you can remember which ones they were), or you can just wait until something doesn't work due to a missing library and install it then. I prefer to reinstall them all at once, and I most definitely do not have the list memorized. The fix is easy if you know how to do it (and remember that you have to do it).

The first step is to open a file manager and navigate to the system directory where your libraries for the previous R version are stored. They will still be there. If you do not know where they are hiding, you can run the command .libPaths() and get a list of the directories in which R will look for libraries. One of them will contain the R version number. (It is consistently the first entry in the list when I do this, but I do not know if that will always be true.) In my case, the entry is "/home/paul/R/x86_64-pc-linux-gnu-library/4.5", which means I want to open "/home/paul/R/x86_64-pc-linux-gnu-library" in the file manager. There I find two directories, one for the previous version ("/home/paul/R/x86_64-pc-linux-gnu-library/4.4") with lots of subdirectories and one for the new version ("/home/paul/R/x86_64-pc-linux-gnu-library/4.5") that is empty. All it takes is copying or moving the contents of the older directory to the newer one. Once you have confirmed that the new R version can see the libraries (for instance, by observing that the "User Library" section has returned to the "Packages" tab in RStudio), you can delete the folder for the older version.
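If you would rather do the copying from the R console than a file manager, something like the following sketch should work (using the paths above; substitute whatever .libPaths() reports on your system):

old_lib <- "/home/paul/R/x86_64-pc-linux-gnu-library/4.4"
new_lib <- "/home/paul/R/x86_64-pc-linux-gnu-library/4.5"
# Copy every package directory from the old user library into the new one.
file.copy(list.files(old_lib, full.names = TRUE), new_lib, recursive = TRUE)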

With that done, you will want to check for updates to the "User Library" packages. Several that I had installed needed updates today after moving to R 4.5. Updating them is done in the usual way.

I wonder if either R or RStudio has an "inherit libraries from previous version" function stashed away that would automate this? If so, I haven't found it.

Thursday, June 27, 2024

A Sorting Adventure in R

As part of an R application on which I have been working, I need to grab some data from an online spreadsheet keyed by a column of labels, add a column with new values, and print out the labels and new values in a two-column array whose order matches the spreadsheet. The construction process involves breaking the original data into chunks, filling in the new values in each chunk separately, and gluing the chunks back together. The spreadsheet is sorted by the label column, so to match the original spreadsheet (allowing a user to view them side by side and see the same labels in the same order), I just needed to sort the output data frame by the label column ... or so I thought.

Since I was already using the dplyr library, which provides an arrange command to sort rows based on one or more columns, I started out with the following simple code (where df is the two-column data frame I created):

df |> dplyr::arrange(Label) |> print()

Unfortunately, the result was not sorted in the same order as the spreadsheet. Some of the labels began with numbers (1000) and in one case a number containing a colon (10:20), and arrange listed them in the opposite order from that of the spreadsheet. Some of the names had funky capitalization (say, "BaNaNa"), and arrange treated capital letters as preceding lower case letters. Interestingly, the base R sort function sorted the labels in the same order that the spreadsheet used. More interestingly, the following hack suggested in a response to a question I posted online also matched the spreadsheet:

df |> dplyr::arrange(-dplyr::desc(Label)) |> print()

The desc function tells arrange to use descending order (the default being ascending order). So -desc tells arrange not to use descending order, meaning use ascending order, which is where we started ... and yet it somehow fixes the ordering problem.

The culprit turns out to be the default locale setting for arrange, which is "C". I'm guessing that means the C programming language. I filed a bug report in the dplyr repository on GitHub and was quickly told that the behavior was correct for locale "C" and that the solution to my problem was to feed arrange the optional argument .locale = "en". That did in fact fix things. The code now produces the expected sort order. Meanwhile, my bug report led to a new one about the difference in sort orders between arrange and desc. Depending on how that report is resolved, the -desc trick may stop working in the future.
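For the record, the fixed version of my one-liner is the following.

df |> dplyr::arrange(Label, .locale = "en") |> print()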

Wednesday, June 26, 2024

Locating Date Columns in R

I've been working on a script that pulls data from an online spreadsheet (made with Google Sheets) shared by a committee. (Yes, I know the definition of "camel": a horse designed by a committee. The situation is what it is.) Once inhaled, the data resides in a data frame (actually, a tibble, but why quibble). At a certain point the script needs to compute for each row the maximum entry from a collection of date columns, skipping missing values.

Assuming I have the names of the date columns in a variable date_fields and the data in a data frame named df, the computation itself is simple. I use the dplyr library, so the following line of code

df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))

produces a copy of the data frame with an extra column "Latest" containing the most recent date from any of the date fields.

That, as it turns out, was the easy part. The hard part was populating the variable date_fields. Ordinarily I would just define it at the front of the scripts, plugging in either the names or the indices of the date columns. The problem is that the spreadsheet "evolves" as committee members other than myself make changes. If they add or delete a column, the indices of the date columns will change, breaking code based on indices. If they rename one of the date columns, that will break code based on a static vector of names. So I decided to scan the spreadsheet after loading it to find the date fields.

It turned out to be harder than it looked. After tripping over various problems, I searched online and found someone who asked a similar question. The best answer pointed to a function in an R library named dataPreparation. I did not want to install another library just to use one function one time, so I futzed around a bit more and came up with the following function, which takes a data frame as input and returns a vector of the names of columns that are dates (meaning that if you run the str() command on the data frame, they will be listed as dates). It requires the lubridate library (and dplyr, for pull()), which I find myself commonly using. There may be more elegant ways to get the job done, but it works.

library(dplyr)      # provides pull()
library(lubridate)  # provides is.Date()
# INPUT: a tibble
# OUTPUT: a vector containing the names of the columns that contain dates
dateColumns <- function(x) {
  # Get a vector of logical values (true if column is a date) with names.
  temp <- sapply(names(x), function(y) pull(x, y) |> is.Date())
  # Return the column names for which is.Date is true.
  which(temp) |> names()
}
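With that function in hand, populating the variable is a one-liner:

date_fields <- dateColumns(df)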

Thursday, April 11, 2024

Finding Duplicated Records in R

Someone asked a question about finding which records (rows) in their data frame are duplicated by other records. If you just want to know which records are duplicates, base R has a duplicated() function that will do just that. It occurred to me, though, that the questioner might have wanted to know not just which records were duplicates but also which records were the corresponding "originals". Here's a bit of R code that creates a small data frame with duplicated rows and then identifies original/duplicate pairs by row number.


library(dplyr)

# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))

# Find the indices of duplicated rows.
dup <- df |> duplicated() |> which()

# Split the source data into two data frames.
df1 <- df[-dup, ]  # originals (rows 1, 2 and 4)
df2 <- df[dup, ]   # duplicates (rows 3, 5, 6 and 7)

# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)

# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)

# Result:
#   Original Duplicate
# 1        1         5
# 2        1         7
# 3        2         3
# 4        2         6

The key here is that the inner_join function pairs rows from each data frame (originals and duplicates) based on matching values in the "by" columns. The default value of "by" (NULL) tells it to match by all the columns the two data frames have in common -- which in this case is all the columns in the source data frame. The resulting data frame will have the columns from the source data frame (here "a" and "b") plus the columns unique to each data frame ("Original" and "Duplicate"). We use the select() command to drop the source columns and just keep the indices of the original and duplicate rows.

Friday, February 9, 2024

Another R Quirk

For the most part I like programming in R, but it is considerably quirkier than any other language I have used. I'm pretty sure that is what led to the development of what is known now as the "Tidyverse". The Tidyverse in turn introduces other quirks, as I've pointed out in a previous post.

One of the quirks in base R caused me a fair amount of grief recently. The context was an interactive program (written in Shiny, although that is beside the point here). At one point in the program the user would be staring at a table (the display of a data frame) and would select rows and columns for further analysis. The program would reduce the data frame to those rows and columns, and pass the reduced data frame to functions that would do things to it.

The program worked well until I innocently selected a bunch of rows and one column for analysis. That crashed the program with a rather cryptic (to me) error message saying that some function I was unaware of was not designed to work with a vector.

I eventually tracked down the line where the code died. The function I was unaware of apparently was baked into a library function I was using. As for the vector part, that was the result of what I would characterize as a "quirk" (though perhaps "booby trap" might be more accurate). I'll demonstrate using the mtcars data frame that automatically loads with R.

Consider the following code chunk.

rows <- 1:3
cols <- c("mpg", "cyl")
temp <- mtcars[rows, cols]
str(temp)

This extracts a subset of three rows and two columns from mtcars and presents it as a data frame.

'data.frame':    3 obs. of  2 variables:
 $ mpg: num  21 21 22.8
 $ cyl: num  6 6 4

So far, so good. Now suppose we choose only one column and rerun the code.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols]
str(temp)

Here is the result.

num [1:3] 21 21 22.8

Our data frame just became a vector. That was what caused the crash in my program.

Since I was using the dplyr library elsewhere, there was an easy fix once I knew what the culprit was.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, ] |> select(all_of(cols))
str(temp)

The result, as expected, is a data frame.

 'data.frame':    3 obs. of  1 variable:
 $ mpg: num  21 21 22.8

There will be situations where you grab one column of a data frame and want it to be a vector, and situations (such as mine) where you want it to be a data frame, so the designers of the language have to choose which route to go. I just wish they had opted to retain structure (in this case data frame) until explicitly dropped, rather than drop it without warning.
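For completeness: base R does provide an explicit way to keep the structure, via the drop argument to the subscript operator, which would have avoided needing dplyr here.

temp <- mtcars[rows, cols, drop = FALSE]  # stays a data frame, even with one column
str(temp)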


Monday, November 6, 2023

Displaying a Nested Data Frame

Almost anything (within reason) you want to do with data can be done in an R application ... if you can just find the correct libraries and functions.

A program I'm writing in R, using the Shiny web framework, needs to both display and export a data frame. Ordinarily, this would be straightforward: I would use datatable(), renderDT() and dataTableOutput() from the DT library for display, and something like write.csv from the utils library to export the data frame in a spreadsheet format. Unfortunately, none of that works this time. The problem is that the data frame comes from an HTTP POST request, with the data arriving as a JSON object. The JSON object contains nested structures, which results in a data frame where some cells contain lists or data frames. Even after applying the flatten() function from the jsonlite library, the nesting persisted.

The data frame would render in Shiny as a table, but cells holding nested structures would just say something like "list()" or maybe "1 entry". The write.csv function refused to write the table to a file. I noticed, though, that if I viewed the data frame in RStudio (using the View() command), the nested structures would still say something like "list()", but clicking them would open them in a new window and hovering over them with the mouse would give an indication of the content. A quick check of the help entry for View() showed that it uses the format.data.frame() function from the base library. So I tried applying that function to my data frame before displaying or exporting it, and both problems were fixed. The displayed table showed the contents of each cell, and the flattened table could be exported to a CSV file. The structure of the nested data frames and lists was not preserved, but the user could at least see what was in those cells, which was good enough for my application.
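In code, the fix boiled down to something like the following sketch (df standing in for my actual flattened data frame, and the file name being arbitrary):

df <- format.data.frame(df)  # render nested cells as character strings
write.csv(df, "export.csv", row.names = FALSE)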

Monday, October 30, 2023

Publishing a Shiny Application

I've twice encountered problems while trying to deploy Shiny applications on the shinyapps.io hosting platform, and since the fix was the same in both cases, I think a pattern is emerging. In both cases, the issue has to do with use of the includeMarkdown function to display help text written in Markdown and stored in a file within the application. The includeMarkdown function is provided by the htmltools library, which is apparently loaded automatically by the shiny library. So my code explicitly specifies library(shiny) but does not load htmltools.

In both problem cases, the code ran fine on my PC but not on the server. Why? A little "fine print" in the documentation of the includeMarkdown function mentions that it requires the markdown package. I have that installed on my PC, and apparently it gets loaded automatically. The server deployment system, though, apparently does not realize the need for it. So on the server my code loads the shiny library explicitly, and that causes the server to include the htmltools library but not the markdown library.

The solution is trivial: just add library(markdown) in the source code. The hard part is remembering to do it, given that it is unnecessary on my PC.
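In other words, the top of the app source should contain both of these lines, even though the second is redundant on my PC:

library(shiny)
library(markdown)  # needed so the deployment system bundles the markdown package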

Saturday, October 28, 2023

The Trouble with Tibbles

Apologies to Star Trek fans for the title pun, but I couldn't resist. (If you're not a Trekkie, see here for clarification.)

I've been working on an interactive R application (using Shiny, although that's probably irrelevant here). There are places where the code needs to loop through a data frame, looking for consecutive rows where either three text fields match or two match and one does not. Matching rows are copied into new data frames. The data I'm testing the code on has a bit over 9,000 rows, and the time spent on this process can take upwards of nine seconds -- not an eternity, but a bit annoying when you are sitting there waiting for it to hatch.

I decided to use the profiler in RStudio to see where time was being eaten up. Almost all the nine seconds was blamed on two steps. The biggest time suck was doing case-insensitive string comparisons on the three fields, which did not come as a big surprise. I went into the profiling process thinking the other big time suck would be adding a data frame to a growing list of data frames, but that was actually quite fast. To my surprise, the number two consumer of time was "df <- temp[n, ]", which grabs row n from data frame "temp" and turns it into a temporary data frame named "df". How could such a simple operation take so long?

I had a hunch that turned out to be correct. Somewhere earlier in the code, my main data frame (from which "temp" was extracted) became a tibble. Tibbles are modernized, souped-up versions of data frames, with extra features but also occasional extra overhead/annoyances. One might call them the backbone of the Tidyverse. The Tidyverse has its adherents and its detractors, and I don't want to get into the middle of that. I'm mostly happy to work with the Tidyverse, but in this case using tibbles became a bit of a problem.

So I tweaked the line of code that creates "temp" to "temp <- ... %>% as.data.frame()", where ... was the existing code. Lo and behold, the time spent on "df <- temp[n, ]" dropped to a little over half a second. Somewhat surprisingly, the time spent on the string comparisons dropped even more. So the overall processing time fell from around 9 seconds to under 1.3 seconds, speeding up the code by a factor of almost 7.

I'll need to keep this in mind if I bump into any other places where my code seems unusually slow.


Tuesday, April 18, 2023

Node Coloring

Someone posted a question on OR Stack Exchange about coloring nodes in a sensor network. The underlying problem has something to do with placing sensors, but in mathematical terms the poster wanted to color the nodes of a weighted undirected graph with a fixed number of colors, while minimizing the sum of the edge weights of edges connecting nodes of the same color.

There is an obvious MIP model for this problem, which was posted as an answer to an earlier version of the question. If $N$ and $C$ are the index sets for nodes and colors respectively, $E\subseteq N\times N$ is the set of edges, $w_{ij}$ is the objective weight of edge $(i,j)\in E$ and $x_{n,c}$ is a binary variable indicating whether node $n$ is assigned color $c$, the problem can be written as a binary quadratic program:

$$\min\ \sum_{(n,m)\in E}w_{nm}\sum_{c\in C}x_{nc}x_{mc}\\
\textrm{s.t. }\sum_{c\in C}x_{nc}  =1\quad\forall n\in N\\
\phantom{\textrm{s.t. }}x_{nc} \in\left\{ 0,1\right\} \quad\forall n\in N,c\in C.$$

 

The model can be linearized by introducing continuous variables $y_{nm}\ge 0$ to represent the objective contribution of each edge $(n,m)\in E$:

 

$$\min\ \sum_{(n,m)\in E}w_{nm} y_{nm}\\
\textrm{s.t. }\sum_{c\in C}x_{nc}=1 \quad \forall n\in N\\
\phantom{\textrm{s.t. }}y_{nm}\ge x_{nc} + x_{mc} - 1 \quad \forall (n,m)\in E, \forall c\in C\\
\phantom{\textrm{s.t. }}x_{nc} \in\left\{ 0,1\right\} \quad\forall n\in N,c\in C\\
\phantom{\textrm{s.t. }}y_{nm}\ge 0 \quad \forall (n,m)\in E.$$


There is some symmetry in the model: given any feasible solution, another solution of identical cost is obtained by permuting the colors. It is possible to add constraints to remove that symmetry.


The person posting the question was looking for alternatives to a MIP model. An easy choice for me would be a genetic algorithm. I wanted to test this in R, using the GA library for the genetic algorithm. The GA library allows three types of chromosomes: binary or real vectors, or permutations of an index vector. If I were coding my own GA, the chromosome would be a vector in $\lbrace 1,\dots,C\rbrace^N.$ Within the available chromosome types in the library, the obvious choice was to use a vector $x\in [0,C]^N$ and then set the color of node $i$ to $\lceil x_i \rceil\in \lbrace 1,\dots,C\rbrace.$ I suspect, though, that the symmetry issue described above might have negative repercussions for a GA. If two "good" solutions were effectively the same (other than a permutation of the color indices), their offspring (which would be a mix of the two color numbering methods) might be a mess.
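For what it's worth, the decoding step is a one-liner, modulo clamping the (measure zero) edge case $x_i = 0$; this is my sketch, not something from the GA library:

# Convert a real-valued chromosome x in [0, C]^N into color indices in 1..C.
decode <- function(x) pmax(ceiling(x), 1)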


A couple of responders to the OR SE question suggested heuristics, as did I. My choice was to start with a random generation heuristic (assign colors randomly), apply an improvement heuristic (go through the nodes in random order and, for each node, reassign it to the cheapest color given the rest of the solution), and repeat (changing the order in which nodes are examined on each pass) until one pass produced no improvement. Since the heuristic is fairly fast, it can be done multiple times (new random restarts) with the best solution retained. Heuristics like this are very easy to code.
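Here is a sketch of one restart of that heuristic, assuming the graph is given as a symmetric weight matrix W with a zero diagonal and that colors are numbered 1 through C:

# One random start plus improvement passes for the node coloring heuristic.
# W: symmetric edge weight matrix with zero diagonal; C: number of colors.
improveColoring <- function(W, C) {
  N <- nrow(W)
  color <- sample.int(C, N, replace = TRUE)  # random initial coloring
  repeat {
    improved <- FALSE
    for (i in sample.int(N)) {  # visit the nodes in random order
      # Cost of each color for node i = total weight of edges to like-colored nodes.
      cost <- vapply(seq_len(C), function(cl) sum(W[i, color == cl]), numeric(1))
      best <- which.min(cost)
      if (cost[best] < cost[color[i]]) {
        color[i] <- best
        improved <- TRUE
      }
    }
    if (!improved) break  # stop when a full pass yields no improvement
  }
  color
}

Wrapping this in a loop over random restarts and keeping the coloring with the smallest total within-color weight completes the heuristic.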


I coded all three approaches in R, using CPLEX (with the ompr and ROI libraries) for the MIP model, the GA library for the genetic algorithm, and nothing special for the improvement heuristic. I also threw in the tictoc library to do some timings. The test problem was a randomly constructed $10\times 20$ complete grid graph (200 nodes; 19,900 edges), to be colored using five colors.


To break symmetry in the MIP model, I forced the colors to be ordered so that the lowest node index receiving any color was smaller than the lowest node index receiving the next color. I let CPLEX run for five minutes using default settings. It made what I would call reasonable progress on the incumbent objective value, but the formulation is apparently very weak (at least on the test example). The lower bound barely moved, and the gap was 99.97% after five minutes.


I mostly used default parameters for the GA, although I let it run for 2,000 generations with a population of size 100. It found a much better solution (in 38 seconds) than CPLEX found in five minutes. To be fair, though, CPLEX was balancing improving the incumbent with working on the bound. Had I changed the CPLEX parameters to emphasize improving the incumbent, it likely would have done better (though likely not as well as the GA).


The improvement heuristic surprised me a bit. 600 random starts needed only 37 seconds (and I'm not the poster boy for optimized R code), and the best solution it found had an objective value about 5% better than what the GA produced. I did a small number of runs with different random seeds (but the same problem size), and the improvement heuristic consistently edged out the GA. I suppose the moral here is to try simple heuristics before going "high tech".


You are welcome to play with my R notebook, which contains both code and results.

Thursday, April 13, 2023

The rJava Curse Strikes Again

Apparently I have not needed the rJava R package in a while, because when I wanted to install an R package today that has rJava as a dependency, it was not there. So I tried to install it (this is on Linux Mint), and of course it failed to install. I have a long history of installation battles with rJava (see here, here and here in chronological order ... or better still don't traumatize yourself by reading them). Why should this time be different?

All my previous battles involved older versions of Java with apparently different directory locations or structures, and none of the previous fixes worked. After considerable aggravation, I found a very helpful post by "datawookie" that nearly got the job done. I did in fact get an error message about "jni" and used the trick in datawookie's post of setting JAVA_HOME to the correct path (in my case to Open JDK 17) as an argument to javareconf. When I then attempted to install rJava, I got a different error ("could not find -lbz2"), which prompted me to install libbz2.dev ("sudo apt-get install libbz2.dev"), after which R was finally able to install rJava (woo-hoo!).

I'd say this is getting ridiculous, but we passed that milestone years ago.

Monday, March 20, 2023

Annoying Message with R GA Package

This is a small technical note (mainly a reminder to myself) about using the GA package for genetic algorithms in R, specifically within an R Markdown document (such as an R notebook). The GA::ga() function includes an optional argument to turn on parallel evaluation of the fitness function. The argument value can be true or false (default), which are self-explanatory, or a number of cores, or a character string ("snow" or "multicore") for how parallelization is to be done. The default is "multicore" except on Windows, where "snow" is apparently the only option. When I choose to turn on parallel objective evaluation, I take the easy route and just set it to TRUE.

When the code runs in a terminal, I just get a sequence of log entries, one per generation, and then it terminates. That's also what I see when the command is run in an R Markdown document ... until the document is rendered to HTML. In the HTML output, interspersed with each line of log output is a block of four identical copies of the following.

loaded GA and set parent environment

The fact that there are four copies is presumably because my PC has four cores. Turning off the monitor option gets rid of the progress printouts but does not eliminate the repetitive messages. Switching the parallel mode to "snow" does get rid of them, but I'm not sure what the broader implications of that are.

After significant spelunking, I traced the messages to the doParallel R package, which is apparently being used to implement parallel processing under the "multicore" option. In any event, the way to get rid of them is to add the option "message = FALSE" to the R chunk in which ga() is being executed. Barring other chunk options, this would change {r} to {r, message = FALSE} for that one chunk. (You can also set it as a global option in the document.) Happily, the monitor output (stats for each generation) still prints out. It just gets rid of those <expletive deleted> parallel processing messages.

Tuesday, March 7, 2023

Another GA for Clustering

Someone asked on OR Stack Exchange about clustering nodes, a subject that comes up from time to time. The premise is that you have $N$ points (nodes), and for each pair $i \neq j$ there is (or may be) an edge with weight $w_{i,j} \in [0, 1]$ (where $w_{i,j} = 0$ if there is no edge between $i$ and $j$). The objective is to create exactly $G$ groups (clusters), with no group having cardinality greater than $M,$ such that the sum of within-group edge weights is maximized. It is simple to write an integer programming model for the problem, and the question includes one, but the author was looking for something that did not require an IP solver. The question specifies, as an example, dimensions of $N=500$ points, $G=8$ groups, and a maximum group size of $M=70.$

Since I have suggested genetic algorithms for previous clustering problems (see here and here), it probably will not shock you that I once again looked for a GA approach, using a "random key" algorithm (where the chromosome gets decoded into a feasible solution). In this case, the chromosome consists of $N+G$ values between 0 and 1. The first $G$ values are used to select the sizes $n_1,\dots, n_G$ of the groups. The last $N$ values are converted into a permutation of the point indices $1,\dots, N.$ The first $n_1$ entries in the permutation are the indices of the points in group 1, the next $n_2$ entries give the points in group 2, and so on.

The tricky part here is converting the first portion of the chromosome into the group sizes. We know that the maximum group size is $M,$ and it is easy to deduce that the minimum group size is $m = N - (G-1)M.$ So the minimum and maximum fractions of the population to assign to any group are $a=m/N$ and $b=M/N,$ where $$m = N - (G-1)M \implies a = 1 - (G-1)b.$$ Now suppose that I have values $\pi_1, \dots,\pi_G \in (a,b)$ and assign group sizes $n_i = \pi_i \cdot N,$ which clearly meet the minimum and maximum size limits. (I'm cheating here, but I'll fix it in a minute.) We want $\sum_{i=1}^G n_i = N,$ which means we need $\sum_i \pi_i = 1.$

To get the $\pi_i,$ we take a rather roundabout route. Assume that the first $G$ entries in the chromosome are $x_1,\dots,x_G \in (0,1).$ Set $$y_i = 1 - \frac{x_i}{\sum_{j=1}^G x_j} \in (0,1)$$ and observe that $\sum_{i=1}^G y_i = G-1.$ Finally, set $\pi_i = a + (b-a) y_i \in (a, b).$ We have $$\sum_i \pi_i = G\cdot a + (b-a) \sum_i y_i\\ = G\cdot a + (b-a)(G-1) = (G-1)b + a = 1.$$

Now to confess to my cheating. I said the $i$-th group size would be $n_i=\pi_i \cdot N,$ ignoring the minor problem that the left side is an integer and the right side almost surely is not. So of course we will round $\pi_i \cdot N$ to get $n_i.$ After rounding, though, we can no longer be sure that $\sum_i n_i = N.$ So we iteratively fix it. If $\sum_i n_i < N,$ we add 1 to the smallest $n_i$ and recheck. If $\sum_i n_i > N,$ we subtract 1 from the largest $n_i$ and recheck. Eventually, we end up with a feasible set of group sizes.
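In code, the size computation looks something like this sketch (x being the first $G$ chromosome entries):

# Decode the first G chromosome entries into feasible group sizes.
groupSizes <- function(x, N, G, M) {
  b <- M / N               # maximum fraction of points per group
  a <- 1 - (G - 1) * b     # minimum fraction of points per group
  y <- 1 - x / sum(x)      # the y values sum to G - 1
  n <- round((a + (b - a) * y) * N)
  # Repair rounding errors so the sizes sum to N.
  while (sum(n) < N) { i <- which.min(n); n[i] <- n[i] + 1 }
  while (sum(n) > N) { i <- which.max(n); n[i] <- n[i] - 1 }
  n
}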

All this is encoded in an R notebook I wrote, including a sample problem. Whether the GA gets a "good" (near optimal) partition or not I cannot say, since I did not write the equivalent MIP model. I can say, though, that it gets a sequence of progressively improving partitions.

Monday, January 30, 2023

A Random Tree Generator

For a problem I was noodling with in R, I needed to generate random trees (meaning connected, acyclic, layered, undirected graphs, not classification trees or anything statistical). There are a bunch of libraries for R that have various capabilities for constructing and/or massaging graphs and networks (see the CRAN Task View for graphical models). After spending some time poking through the docs for some of the libraries listed in the task view, and unsuccessfully trying to install one, I decided it would be faster just to roll my own code.

Along the way I succumbed to the software developer's maxim: if it works, it needs more features. Still, I wound up with a function that is not horribly complicated and seems to work. Given how many nodes and how many layers you want in the tree, it outputs a matrix of edges forming the desired tree. There are optional arguments for how you want the nodes labeled (the default is 1, 2, ... but you can supply a vector of labels) and in what order the labels should be assigned (top to bottom/left to right raster scan of the graph or randomly), as well as how you want the output matrix organized (randomly, by label order, or by layer order).

The code is in an R notebook that demonstrates the use of the function. It imports the DiagrammeR library in order to plot the resulting trees, but the function does not require DiagrammeR and in fact has no dependencies. You can view the notebook here, and if you want you can download the code from it (see the select box in the upper right of the notebook). The code is licensed under the Creative Commons Attribution 3.0 Unported license.

Tuesday, November 1, 2022

Holt Double Exponential Smoothing

Given a time series $x_t,\, t=1,2,\dots,$ presumed to contain linear trend, Holt's double exponential smoothing method computes estimates $s_t$ of the level (mean) at time $t$ and $b_t$ of the (presumed constant) slope using the following formulas: $$s_t = \alpha x_t + (1-\alpha)(s_{t-1} + b_{t-1})$$ and $$b_t = \beta(s_t - s_{t-1}) + (1-\beta)b_{t-1}$$ where $\alpha,\beta \in (0,1)$ are smoothing weights. A recent question on OR Stack Exchange asked why the second formula is based on the level estimate and not the observed value. In other words, the proposed alternative to the trend update was $$b_t = \beta(x_t - x_{t-1}) + (1-\beta)b_{t-1}.$$
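For concreteness, here is a bare-bones implementation of the two recursions (my own sketch; the initialization of the level and trend estimates is a common convention, not part of Holt's formulas):

# Holt double exponential smoothing of series x with weights alpha and beta.
holt <- function(x, alpha, beta) {
  n <- length(x)
  s <- numeric(n)      # level estimates
  b <- numeric(n)      # trend estimates
  s[1] <- x[1]         # initialize level at the first observation
  b[1] <- x[2] - x[1]  # initialize trend from the first difference
  for (t in 2:n) {
    s[t] <- alpha * x[t] + (1 - alpha) * (s[t - 1] + b[t - 1])
    b[t] <- beta * (s[t] - s[t - 1]) + (1 - beta) * b[t - 1]
  }
  list(level = s, trend = b)
}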

The intuition for doing it Holt's way is fairly simple. If exponential smoothing is working as intended (meaning smoothing things), then the difference in level estimates $s_t - s_{t-1}$ should be less variable than the difference in observed values $x_t - x_{t-1}.$ A formal proof probably involves induction arguments, requiring more functioning brain cells than I had available at the time, so I was a bit loose mathematically in my answer on OR SE. Just to confirm the intuition, I did some Monte Carlo simulations in R. The notebook containing the experimental setup, including code, is available here. It requires the dplyr and ggplot2 library packages.

The following plots show confidence intervals over time for the errors in the level and trend estimates using both Holt's formula and what I called the "variant" method. They are from a single experiment (100 independent time series with identical slope and intercept, smoothed both ways), but other trials with different random number seeds and changes to the variability of the noise produced similar results.

[Plot: confidence intervals for the error in the level estimates]

[Plot: confidence intervals for the error in the trend estimates]

In both cases, the estimates start out a bit wobbly (and the Holt estimates may actually be a bit noisier), but over time both stabilize. There does not seem to be much difference between the two approaches in how noisy the level estimates are, at least in this run. The Holt estimates may have slightly narrower confidence intervals, but that is not clear, and the difference if any seems pretty small. The Holt trend estimates, however, are considerably less noisy than those of the variant method, supporting the intuitive argument.

Friday, July 15, 2022

Left Matrix Inverses in R

The following question popped up on OR Stack Exchange: given an $m\times n$ matrix $A$ with $m > n,$ how can one find all left inverses of $A$ in R? The author mentions that dimensions in their case would be around 200x10.

A left inverse is a matrix $B\in \mathbb{R}^{n\times m}$ such that $B A = I.$ In order for a left inverse to exist, $A$ must have full column rank. If it does not, $Ax = 0$ for some nonzero $x\in \mathbb{R}^n,$ in which case $$x = Ix = (BA)x = B(Ax) = B\cdot 0 = 0,$$ a contradiction.

Henceforth we assume that $A$ has full column rank, in which case the left null space $\mathcal{N} \subseteq \mathbb{R}^m$ will have dimension $m-n.$ Now suppose that $C\in \mathbb{R}^{n\times m}$ has the property that every row of $C$ belongs to $\mathcal{N}.$ Then $CA = 0$ and so $(B+C)A = BA+0 = I,$ making $B+C$ another left inverse of $A.$ That means $A$ has an uncountably infinite number of left inverses. Conversely, if $DA=I$ then the rows of $D-B$ belong to $\mathcal{N}.$ So the set of left inverses of $A$ can be fully characterized by any individual left inverse and a basis for the left null space of $A.$

Getting this information in R is remarkably easy. There are multiple ways to compute a left inverse, but if you have the pracma library loaded, then the function pracma::pinv() can be used to compute the Moore-Penrose pseudoinverse, which is a left inverse of $A.$ To get a basis for the left null space of $A,$ we apply the function pracma::nullspace() to $A^T,$ which computes a basis for the right null space of the transpose of $A,$ and then transpose the results.
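In code, for a random matrix of the dimensions mentioned in the question:

library(pracma)
set.seed(123)
A <- matrix(rnorm(200 * 10), nrow = 200)  # full column rank almost surely
B <- pinv(A)                  # Moore-Penrose pseudoinverse: one left inverse
max(abs(B %*% A - diag(10)))  # should be numerically zero
V <- t(nullspace(t(A)))       # rows form a basis for the left null space
max(abs(V %*% A))             # should also be numerically zero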

I have a small R notebook that demonstrates this on a random 200x10 matrix.

Sunday, February 6, 2022

Taxi Dispatch

A question on OR Stack Exchange pertains to a taxi stand. The assumptions are as follows.

  • There are $t$ taxis operating between points A and B, with all passenger traffic going from A to B. For concreteness, I'll refer to a "taxi stand" with a "line" of taxis, similar to what you might see at an airport or popular hotel.
  • The round-trip travel time for a taxi (taking passengers from A to B and deadheading back from B to A) is deterministic with value $T$. (Note: Deterministic travel time is the author's assumption, not mine.)
  • Each taxi has a capacity for $p$ passengers and is dispatched only when full.
  • Passengers arrive according to a Poisson process with parameter (rate) $\lambda$.

The author was looking for the mean time between dispatches.

It has been a very long time since I taught queuing theory, and I have forgotten most of what I knew. My best guess for the mean time between dispatches, under the assumption that the system reached steady state, was $p/\lambda$, which would be the long-run average time required for $p$ passengers to arrive. To achieve steady state, you need the average arrival rate ($\lambda$) to be less than the maximum full-blast service rate ($t p/T$). If arrivals occur faster than that, in the long term you will be dispatching a taxi every $T/t$ time units while either the queue explodes or there is significant balking/reneging.

To test whether my answer was plausible (under the assumption of a steady state), I decided to write a little discrete event simulation (DES). There are special purpose DES programs, but it's been way too long since I used one (or had a license), so I decided to try writing one in R. A little googling turned up at least a couple of DES packages for R, but I did not want to commit much time to learning their syntax for a one-off experiment, plus I was not sure if any of them handled batch processing. So I wrote my own from scratch.

My code is not exactly a paragon of efficiency. Given the simplicity of the problem and the fact that I would only be running it once or twice, I went for minimal programmer time as opposed to minimal run time. That said, it can run 10,000 simulated passenger arrivals in a few seconds. The program uses one data frame as what is known in DES (or at least was back in my day) as an "event calendar", holding in chronological order three types of events: passengers joining the queue; taxis being dispatched; and taxis returning.

The simulation starts with all taxis in line and no passengers present. I did not bother with a separate "warm-up" period to achieve steady-state, which would be required if I were striving for more accuracy. For a typical trial run ($t=5$, $p=4$, $T=20$, $\lambda=0.7$), my guess at the mean time between dispatches worked out to 5.714 and the mean time computed from the run was 5.702. So I'm inclined to trust my answer (and to quit while I'm ahead).

If anyone is curious (and will forgive the lack of efficiency), you can take a look at my R notebook (which includes the code).

Tuesday, December 21, 2021

Installing Rcplex in RStudio

This is an update to a previous post ("Installing Rcplex and cplexAPI"). Rcplex is an interface from R to the CPLEX optimization solver. A recent release (it's now at version 0.3-5) has made installation and updating much easier (no hacking of installer code required), provided you configure your environment correctly.

I use the RStudio IDE for almost everything related to R, including updating packages. A recent list of packages waiting to be updated included Rcplex 0.3-4, but when I told RStudio to update them all it balked at Rcplex. The workaround is actually quite simple. What follows is how I got it to work in Linux Mint, but the same steps should work with any Linux distribution and I assume can be modified to work on Windows or Mac OS.

  1. Find where the CPLEX binary lives. In my case, the path was "/home/paul/<stuff>/CPLEX_Studio201/cplex/bin/x86-64_linux/cplex".
  2. In a terminal, create an environment variable CPLEX_BIN containing that path (unless you already have one). This can be done using the export command, as in "export CPLEX_BIN=/home/paul/<stuff>/CPLEX_Studio201/cplex/bin/x86-64_linux/cplex". Note that what I am exporting includes the name of the executable (that last "cplex"), not just the path to it. (I goofed on my first try by just specifying the path.)
  3. Now invoke RStudio from the terminal and do the package update (or installation if you are getting it for the first time) as normal.

If you are not an RStudio user, you can install or update the package manually from a terminal by issuing the same export command followed by "R CMD INSTALL <path to download>/Rcplex_0.3-5.tar.gz" (updating the version number as necessary).

This was much less painful than previous installations. Thanks to the developers for maintaining the package and updating the installation scripts!

Update (04/23/2023): A recent update to the base R package (to 4.3.0) forced me to reinstall Rcplex, and of course I ran into problems. The current version of Rcplex in CRAN is 0.3-6. I followed the steps described above, and the C compiler barfed up some messages about multiple definitions of things and exited with an error code. Luckily, in the course of searching for explanations of the errors and fixes I tripped over the development page for Rcplex, where I found version 0.3-7, which had been released just a few days ago. That version installed correctly. Big thanks to the developers for continuing to maintain it.


Monday, April 12, 2021

A Math Puzzle as a Network

There is a standard type of math puzzle that has been around at least since I was a child. The details vary, but the concept is consistent. You are typically given a few initially empty containers of various (integer) capacities, an essentially infinite reservoir of something that goes in the containers, and a goal (integer) for how much of that something you want to end up with. You have to figure out how to reach the goal without having any measuring instruments, meaning that your operations are limited to emptying a container into the reservoir, filling a container from the reservoir, or moving content from one container to another until you empty the source or fill the destination, whichever happens first. (All this is done under the assumption of no spillage, meaning the originator of the puzzle did not have me in mind.) I think I've seen a variant that involves cutting things, where your ability to measure where to cut is limited to stacking pieces you already have as a guide to the piece you want to cut.

A question popped up on Mathematics Stack Exchange about how to solve one of these puzzles using dynamic programming (DP) with backward recursion. The problem at hand involves two jugs, of capacities seven and three liters respectively, and a lake, with the desired end state being possession of exactly five liters of water. The obvious (at least to me) state space for DP would be the volume of water in each jug, resulting in 32 possible states ($\lbrace 0,\dots,7\rbrace \times \lbrace 0,\dots,3 \rbrace$). Assuming the objective function is to reach the state $(5,0)$ with a minimal number of operations, the problem can be cast just as easily as a shortest path problem on a digraph, in which each node is a possible state of the system, each arc has weight 1, and arcs fall into one of the categories mentioned in the previous paragraph.

I was looking for an excuse to try out the igraph package for R, and this was it. In my R notebook, a node label "5|2" would indicate the state where the larger jug contains five liters and the smaller jug contains two. Arcs are labeled with one of the following: "EL" (empty the larger jug); "FL" (fill the larger jug); "ES" (empty the smaller jug); "FS" (fill the smaller jug); "PLS" (pour the larger jug into the smaller jug); or "PSL" (pour the smaller jug into the larger jug).

Assuming I did not screw up the digraph setup, a total of nine operations are required to get the job done. If you are interested, you can see my code (and the solution) in this R notebook.
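For anyone who wants the flavor without opening the notebook, here is a condensed sketch of the approach (my compression, not the notebook verbatim):

library(igraph)
lab <- function(l, s) paste(l, s, sep = "|")  # node label, e.g. "5|2"
edges <- NULL
for (l in 0:7) for (s in 0:3) {
  from <- lab(l, s)
  pls <- min(l, 3 - s)  # amount poured from the larger jug into the smaller
  psl <- min(s, 7 - l)  # amount poured from the smaller jug into the larger
  moves <- rbind(c(lab(0, s), "EL"), c(lab(7, s), "FL"),
                 c(lab(l, 0), "ES"), c(lab(l, 3), "FS"),
                 c(lab(l - pls, s + pls), "PLS"),
                 c(lab(l + psl, s - psl), "PSL"))
  for (r in seq_len(nrow(moves)))
    if (moves[r, 1] != from) edges <- rbind(edges, c(from, moves[r, 1], moves[r, 2]))
}
edges <- as.data.frame(edges)
names(edges) <- c("from", "to", "op")
g <- graph_from_data_frame(edges)  # directed graph, all arcs weight 1
p <- shortest_paths(g, from = "0|0", to = "5|0", output = "epath")
E(g)$op[p$epath[[1]]]  # the sequence of nine operations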

Friday, October 16, 2020

Multilogit Fit via LP

 A recent question on OR Stack Exchange has to do with getting an $L_1$ regression fit to some data. (I'm changing notation from the original post very slightly to avoid mixing sub- and super-scripts.) The author starts with $K$ observations $y_1, \dots, y_K$ of the dependent variable and seeks to find $x_{i,k} \ge 0$ ($i=1,\dots,N$, $k=1,\dots,K$) so as to minimize the $L_1$ error $$\sum_{k=1}^K \left|y_k - \sum_{i=1}^N \frac{e^{x_{i,k}}}{\sum_{j=1}^K e^{x_{i,j}}}\right|.$$ The author was looking for a way to linearize the objective function.

The solution I proposed there begins with a change of variables: $$z_{i,k}=\frac{e^{x_{i,k}}}{\sum_{j=1}^K e^{x_{i,j}}}.$$ The $z$ variables are nonnegative and must obey the constraint $$\sum_{k=1}^{K}z_{i, k}=1\quad\forall i=1,\dots,N.$$ With this change of variables, the objective becomes $$\sum_{k=1}^K \left|y_k - \sum_{i=1}^N z_{i,k} \right|.$$ Add nonnegative variables $w_k$ ($k=1,\dots, K$) and the constraints $$-w_k \le y_k - \sum_{i=1}^N z_{i,k} \le w_k \quad \forall k=1,\dots,K,$$ and the objective simplifies to minimizing $\sum_{k=1}^K w_k$, leaving us with an easy linear program to solve.

That leaves us with the problem of getting from the LP solution $z$ back to the original variables $x$. It turns out the transformation from $x$ to $z$ is invariant with respect to the addition of constant offsets. More precisely, for any constants $\lambda_i$ ($i=1,\dots,N$), if we set $$\hat{x}_{i,k}=x_{i,k} + \lambda_i \quad \forall i,k$$ and perform the $x\rightarrow z$ transformation on $\hat{x}$, we get $$\hat{z}_{i,k}=\frac{e^{\lambda_{i}}e^{x_{i,k}}}{\sum_{j=1}^{K}e^{\lambda_{i}}e^{x_{i,j}}}=z_{i,k}\quad\forall i,k.$$ This allows us to convert from $z$ back to $x$ as follows. For each $i$, set $j_0=\textrm{argmin}_j z_{i,j}$ and note that $$\log\left(\frac{z_{i,k}}{z_{i,j_0}}\right) = x_{i,k} - x_{i, j_0}.$$ Given the invariance to constant offsets, we can set $x_{i, j_0} = 0$ and use the log equation to find $x_{i,k}$ for $k \neq j_0$.
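The recovery step is short enough to show inline; a sketch for a single row of the LP solution:

# Recover one row of x from the corresponding row of z (up to a constant offset).
recoverX <- function(z_row) {
  j0 <- which.min(z_row)
  log(z_row / z_row[j0])  # sets the entry at position j0 to zero
}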

Well, almost. I dealt one card off the bottom of the deck. There is nothing stopping the LP solution $z$ from containing zeros, which will automatically be the smallest elements since $z \ge 0$. That means the log equation involves dividing by zero, which has been known to cause black holes to erupt in awkward places. We can fix that with a slight fudge: in the LP model, change $z \ge 0$ to $z \ge \epsilon$ for some small positive $\epsilon$ and hope that the result is not far from optimal.

I tested this with an R notebook. In it, I generated values for $y$ uniformly over $[0, 1]$, fit $x$ using the approach described above, and also fit it using a genetic algorithm for comparison purposes. In my experiment (with dimensions $K=100$, $N=10$), the GA was able to match the LP solution if I gave it enough time. Interestingly, the GA solution was dense (all $x_{i,j} > 0$) while the LP solution was quite sparse (34 of 1,000 values of $x_{i,j}$ were nonzero). As shown in the notebook (which you can download here), the LP solution could be made dense by adding positive amounts $\lambda_i$ as described above, while maintaining the same objective value. I tried to make the GA solution sparse by subtracting $\lambda_i = \min_k x_{i,k}$ from the $i$-th row of $x$. It preserved nonnegativity of $x$ and maintained the same objective value, but reduced density only from 1 to 0.99.