OR in an OB World

RStudio Republish Menu

2025-06-11T14:56:00.002-04:00

I'm a big fan of the RStudio IDE for coding in R, but like all software it has one or two quirks. Someone recently asked about one on the Posit Community forum. It's one that has annoyed me a bit.

When editing certain kinds of files (Shiny applications, RMarkdown or Quarto documents, ...), the IDE gives you a drop-down menu to publish (or republish) the document. The menu lets you publish to an existing account or to define a new account and publish there. It lists accounts to which you previously published this app or document. There is also an option to clear the list of previous locations. Here comes the gripe: there is no option to delete selectively one or more previous locations from the list. All you can do is clear the entire list.

It turns out that you can partially but not completely winnow the list, but it takes a file browser or terminal/command prompt and a bit of exertion. Since I've only published Shiny apps, I'll document the steps for that. The process for a Quarto or RMarkdown document is presumably similar but perhaps not identical.

The first step is to navigate to the folder containing the document or application. It should contain a folder named "rsconnect", which is where you want to go next. In there, you should find a folder for each site (server) on which the app has been published. In each of those folders you should find a folder for each name/account under which the app was published on that server. Delete the folders corresponding to the locations you want to delete from the "republish" menu. If you are deleting locations on more than one server, repeat for each pertinent server folder.

For example, I wrote a Shiny application for a colleague to use in a course. The app is installed on shinyapps.io in two places, a paid account used by my colleague and my free developer account. The app files (ui.R and server.R -- it predates the option to use a single consolidated app.R file) live in folder X on my PC. So I go there and drill down to X/rsconnect. Since both installations are on the same service, there is only one folder there, X/rsconnect/shinyapps.io. Inside that folder are two folders, one bearing the name of the course account (call that A) and the other bearing the name of my developer account (call that B). If I want to remove just my developer account from the republish list, I delete folder X/rsconnect/shinyapps.io/B but leave A in place.
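For anyone who would rather script the cleanup than click through a file browser, here is a minimal Python sketch of the same steps. It assumes the folder layout described above (rsconnect/<server>/<account> inside the app folder). The function names are hypothetical helpers, not part of RStudio or any R package.

```python
import shutil
from pathlib import Path


def list_republish_targets(app_dir):
    """Return sorted (server, account) pairs found under <app_dir>/rsconnect."""
    root = Path(app_dir) / "rsconnect"
    if not root.is_dir():
        return []
    return sorted(
        (server.name, account.name)
        for server in root.iterdir() if server.is_dir()
        for account in server.iterdir() if account.is_dir()
    )


def remove_republish_target(app_dir, server, account):
    """Delete <app_dir>/rsconnect/<server>/<account>; return True if it existed."""
    target = Path(app_dir) / "rsconnect" / server / account
    if not target.is_dir():
        return False       # nothing recorded for that server/account
    shutil.rmtree(target)  # removes only the local record, not the deployed app
    return True
```

In the example above, `remove_republish_target("X", "shinyapps.io", "B")` would drop the developer account from the list and leave A alone.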

Note that this does not remove the app from any server on which it is currently published. For that you need to log into the server and do something. On shinyapps.io, you find the app in your administrative panel, sleep it, archive it and then delete it.

Routing With Sequencing

2025-04-28T15:13:00.050-04:00

The motivation for this post comes from a sequence of questions posted on OR Stack Exchange (including this one), having to do with a mixed integer programming model for routing an electric vehicle (EV) serving various customers. One difference from the basic single vehicle routing models with which I'm familiar is that the EV has to visit a charging station periodically during the route. That is easy to accommodate. Where it gets funky is that the modeler needs to know within the model which customer was last on the route, because the EV is required to go to the nearest charging station after its last stop. I'll take that a step further and require the model to provide (via variables) the position of each node (first, second, third) in the route sequence. This might be useful if, for example, the model had to enforce a rule that customer X must be among the first three customers served. Identifying just the last customer node is easier, as I'll describe at the end.

Attempts by the author of the original question followed the usual pattern for vehicle routing. Assume that there is a single vehicle, each customer must be visited once, and there are no time windows complicating things. You have a digraph containing nodes for each customer and each charging station. You typically start by assigning a binary variable $x_{ij}$ to each arc $(i, j),$ taking value 1 if and only if the vehicle crosses that arc, and proceed from there.

To collect sequencing information, I would normally employ the Miller-Tucker-Zemlin formulation of subtour elimination constraints. The MTZ approach adds a nonnegative auxiliary variable $u_i$ for each node $i$ together with the constraints $$u_j \ge u_i + x_{ij} - M(1 - x_{ij})$$ for each pair of distinct nodes $i\neq j.$ This says that if we cross arc $(i, j),$ the value of $u_j$ must be at least one higher than the value of $u_i,$ preventing any loops. If $n$ is the number of nodes and we are willing to start numbering with $u_s=0$ ($s$ being the starting node for the tour), we can choose $M=n-1.$ The MTZ constraints are intended to prevent subtours, but as a side effect the $u$ variables number the stops in the order they occur.
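To see the constraints at work without a solver, here is a small Python sketch that checks the MTZ inequalities for a given set of used arcs and a given assignment of the $u$ values, with $M = n - 1.$ The function names are hypothetical. The sketch also assumes the usual MTZ convention of skipping the constraint for arcs entering the start node, since otherwise the arc closing the tour would be ruled out.

```python
def mtz_positions(tour):
    """Number the nodes of a tour 0, 1, 2, ... in visiting order."""
    return {node: k for k, node in enumerate(tour)}


def mtz_satisfied(nodes, arcs_used, u, start):
    """Check u_j >= u_i + x_ij - M (1 - x_ij) for all i != j with j != start."""
    big_m = len(nodes) - 1
    for i in nodes:
        for j in nodes:
            if i == j or j == start:
                continue  # assumption: arcs back into the start node are exempt
            x = 1 if (i, j) in arcs_used else 0
            if u[j] < u[i] + x - big_m * (1 - x):
                return False
    return True


nodes = ["s", "a", "b", "c"]
tour_arcs = {("s", "a"), ("a", "b"), ("b", "c"), ("c", "s")}
subtour_arcs = {("s", "a"), ("a", "s"), ("b", "c"), ("c", "b")}
u = mtz_positions(["s", "a", "b", "c"])
print(mtz_satisfied(nodes, tour_arcs, u, "s"))     # single tour: constraints hold
print(mtz_satisfied(nodes, subtour_arcs, u, "s"))  # loop b -> c -> b: violated
```

For the second arc set no choice of $u$ can work, because the loop between b and c would require both $u_c \ge u_b + 1$ and $u_b \ge u_c + 1.$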

This would work for the EV problem if there were a rule that the EV cannot use the same charging station twice during a tour. If the vehicle can stop more than once at the same charging station (which I assume would normally be the case), we cannot use the MTZ constraints because a repeat visit to a charging station would create a subtour. This also complicates (I think) the use of subtour elimination constraints to prevent disjoint subtours. Fortunately, there are at least two "reasonable" (in my opinion) workarounds, both using the MTZ constraints. Unfortunately both are clunky.

The first workaround is to create multiple clones of each charging station node. So if node $s$ represents a charging station, we introduce additional nodes $s', s'', s''' \dots$ that are all charging stations, all in the same location (meaning time / distance / charge consumption between node $i$ and any of the clones is the same). For any arcs $(i, s)$ and $(s, j)$ we add arcs $(i, s'), (s', j), (i, s''), (s'', j)$ etc. We do not require that every charging node be entered (unlike customer nodes, which must all be visited), but we do limit each charging node to at most one entry. That removes the threat of loops and lets us use the MTZ approach. Besides making the digraph larger, this forces the modeler to guess how many clones of each charging node will be needed.
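The cloning step is mechanical enough to sketch in a few lines of Python. This is only an illustration: the arc dictionary, the function name and the scheme of naming clone $k$ of station $s$ as the tuple `(s, k)` are all assumptions of mine, and the number of copies is exactly the guess mentioned above.

```python
def clone_stations(arcs, stations, copies):
    """Return an arc dict in which each station gains `copies` clones.

    arcs maps (tail, head) to a cost; stations is a set of node names.
    Clone k of station s is named (s, k), a hypothetical naming scheme.
    """
    def versions(node):
        # A station expands to itself plus its clones; other nodes are unchanged.
        if node in stations:
            return [node] + [(node, k) for k in range(1, copies + 1)]
        return [node]

    expanded = {}
    for (tail, head), cost in arcs.items():
        for t in versions(tail):
            for h in versions(head):
                expanded[(t, h)] = cost  # clones inherit the original arc's cost
    return expanded


arcs = {("i", "s"): 4, ("s", "j"): 6, ("i", "j"): 9}
bigger = clone_stations(arcs, {"s"}, copies=2)
print(len(bigger))  # 7 arcs: the original 3 plus 2 clones of each station arc
```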

The other approach is to change to a multigraph, with fewer nodes but more arcs. We include only customer nodes, plus dummy start and end nodes. Arcs from the start node to each customer (within EV range) are the same as before. The end node is a stand-in for the closest charging station to the last customer visited. The arc from any customer node to the end node has the time / distance / charge consumption required to reach the closest charging station. (The closest charging station to each customer node is computed before building the model.)

For each pair of customer nodes $i \neq j,$ the arc $(i, j)$ (if it exists) represents moving directly from $i$ to $j.$ For select charging nodes $s$ we add another arc from $i$ to $j$, which I will denote $<i, s, j>,$ that represents going from $i$ to $s,$ recharging, and then proceeding from $s$ to $j.$ Each of those arcs produces another MTZ constraint. As usual, every customer node should be entered/exited exactly once.

If the number of charging stations is small, we can create extra arcs $<i, s, j>$ for every combination of two customers and a charging station, weeding out those that are infeasible (meaning an EV with a full charge could not get from $i$ to $s$ or from $s$ to $j$). To get a smaller model, we can throw out arcs that are dominated by other arcs. For $<i, s, j>$ to dominate $<i, s', j>,$ you would need power consumption from $i$ to $s$ to be no greater than power consumption from $i$ to $s'$ and power consumption from $s$ to $j$ to be no greater than power consumption from $s'$ to $j.$ (If other criteria, such as mileage or transit time, appear in the objective function then they would also factor into the determination of dominance.)
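Here is a rough Python sketch of the feasibility and dominance screening for a single pair of customers. The inputs (a dictionary of charge consumption per leg and a full-charge capacity) and the function name are hypothetical. One assumption beyond the text: I require the dominating station to be strictly better on at least one leg, so that two stations with identical consumption do not eliminate each other.

```python
def undominated_stations(i, j, stations, consumption, capacity):
    """Return stations s for which arc <i, s, j> is feasible and not dominated.

    consumption[(a, b)] is the charge used going from a to b; capacity is the
    charge available after a full recharge (both hypothetical inputs).
    """
    feasible = [
        s for s in stations
        if consumption[(i, s)] <= capacity and consumption[(s, j)] <= capacity
    ]

    def dominates(s, t):
        # s is no worse than t on both legs and strictly better on one of them
        no_worse = (consumption[(i, s)] <= consumption[(i, t)]
                    and consumption[(s, j)] <= consumption[(t, j)])
        better = (consumption[(i, s)] < consumption[(i, t)]
                  or consumption[(s, j)] < consumption[(t, j)])
        return no_worse and better

    return [t for t in feasible if not any(dominates(s, t) for s in feasible)]


consumption = {
    ("i", "s1"): 3, ("s1", "j"): 4,
    ("i", "s2"): 5, ("s2", "j"): 4,   # dominated by s1
    ("i", "s3"): 2, ("s3", "j"): 7,   # trades a short first leg for a long second
    ("i", "s4"): 12, ("s4", "j"): 1,  # first leg exceeds a full charge
}
print(undominated_stations("i", "j", ["s1", "s2", "s3", "s4"], consumption, 10))
# ['s1', 's3']
```

If mileage or transit time also appears in the objective, the `dominates` test would need matching comparisons for those quantities.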

One advantage of this approach is that it would let us dispense with the $u$ variables and the MTZ constraints if the only reason for them was to enable constraints forcing the EV to end at the charging station closest to the last customer, since that is baked into the arcs leading to the dummy end node. If we want to use the cloned charging station approach, we can also use a dummy end node linked to each customer by an arc representing the link to that customer's nearest charging station to enforce the desired ending rule for the tour.

Retaining Libraries During R Upgrades

2025-04-14T18:01:00.001-04:00

Today I was able to upgrade R to version 4.5, and was reminded in the process of a tedious "feature" of the upgrade.

R libraries are organized into two distinct groups. If you use RStudio, look at the "Packages" tab. You will see the heading "User Library" followed by whatever packages you have installed manually. Scroll down and you will get to another heading, "System Library", followed by another group of packages. These are the packages that were automatically installed when you installed R itself.

The upgrade from R 4.4 to R 4.5 was very easy, since it comes as a system package, at least on Ubuntu and Linux Mint (and presumably other Linux distributions). I'm not sure about Windows, macOS etc. The Mint update manager offered me updates to several system packages (r-base, r-base-core, r-base-html and r-recommended) among the morning's gaggle of updates. I just installed those and R 4.4.3 was replaced by R 4.5.0. That part could not be easier.

After the updates were done, I opened RStudio and looked to see if any packages there needed updates. The "System Library" group was there, and none of them needed updates (no shock since they had just been installed during the upgrade). The "User Library" did not exist. I should have known this was coming based on previous R upgrades, but I forgot.

You can of course reinstall all your previously installed libraries manually (if you can remember which ones they were), or you can just wait until something doesn't work due to a missing library and install it then. I prefer to reinstall them all at once, and I most definitely do not have the list memorized. The fix is easy if you know how to do it (and remember that you have to do it).

The first step is to open a file manager and navigate to the system directory where your libraries for the previous R version are stored. They will still be there. If you do not know where they are hiding, you can run the command .libPaths() and get a list of the directories in which R will look for libraries. One of them will contain the R version number. (It is consistently the first entry in the list when I do this, but I do not know if that will always be true.) In my case, the entry is "/home/paul/R/x86_64-pc-linux-gnu-library/4.5", which means I want to open "/home/paul/R/x86_64-pc-linux-gnu-library" in the file manager. There I find two directories, one for the previous version ( "/home/paul/R/x86_64-pc-linux-gnu-library/4.4") with lots of subdirectories and one for the new version ( "/home/paul/R/x86_64-pc-linux-gnu-library/4.5") that is empty. All it takes is copying or moving the contents of the older directory to the newer one. Once you have confirmed that the new R version can see the libraries (for instance, by observing that the "User Library" section has returned to the "Packages" tab in RStudio), you can delete the folder for the older version.
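If you would rather script the copy than drag folders around, something like the following Python sketch should do it. It assumes the layout described above, with one subdirectory per R version under a common library root. The function name is a hypothetical helper, and the paths in the usage note are just the ones from my machine.

```python
import shutil
from pathlib import Path


def carry_over_packages(library_root, old_version, new_version):
    """Copy package folders from <root>/<old> to <root>/<new>; return names copied."""
    old_dir = Path(library_root) / old_version
    new_dir = Path(library_root) / new_version
    new_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for package in sorted(old_dir.iterdir()):
        destination = new_dir / package.name
        # Skip stray files and anything already present in the new library.
        if package.is_dir() and not destination.exists():
            shutil.copytree(package, destination)
            copied.append(package.name)
    return copied
```

For the directories above, the call would be `carry_over_packages("/home/paul/R/x86_64-pc-linux-gnu-library", "4.4", "4.5")`. The sketch copies rather than moves, so the old folder stays in place until you have confirmed that the new R version sees the packages.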

With that done, you will want to check for updates to the "User Library" packages. Several that I had installed needed updates today after moving to R 4.5. Updating them is done in the usual way.

I wonder if either R or RStudio has an "inherit libraries from previous version" function stashed away that would automate this? If so, I haven't found it.

Boolean Grid II

2025-03-03T18:16:00.001-05:00

As the title implies, this is a sequel to my previous post, to deal with a variety of odds and ends.

There's a saying that you cannot teach an old dog new tricks. That's untrue; it's just that dogs as old as I am are slow learners. When I first started tangling with integer programs, there was a rule that you never made the model dimensions (number of rows or columns) any larger than you absolutely had to, partly to conserve memory and partly so as not to slow down pivoting. Once solvers switched from Gaussian pivoting (the way I learned to do it by hand) to matrix factoring, keeping dimensions down took a back seat to reducing matrix density.

Similarly, once upon a time I learned (the hard way) that symmetry in my model would slow down pruning of the search tree. In the previous post, I alluded to research on exploiting symmetry and said something about solvers having symmetry detection. Imre Polik mentioned in a comment that Xpress already detects the symmetry by default. CPLEX might need some encouragement to do so. Near the start of the Xpress output, it describes the model as symmetric and lists statistics on "orbits" (groups of variables whose values can be permuted). It is unclear to me whether Xpress exploits that information by using "orbital branching" (a relatively recent development) or in some other way. Early in the CPLEX output I see a message that it is "detecting symmetries", but the model dimensions do not change and there are no further mentions of symmetry.

Moving on, Imre suggested in his comment to the previous post that another possible antisymmetry constraint is to assert a dominance inequality between the number of true values in the top row and the number in the left column. This is compatible with my original two constraints, and so I added a version of it (combining Imre's constraint with mine) to my code. Rob Pratt suggested yet another possibility, an inequality between the subdiagonal and superdiagonal. I'm not convinced that one plays nice with my original constraints, meaning that if you added Rob's constraint to my original two you might legislate the optimal solution out of existence, so I added it to my code by itself. Also, it only works when the grid is square. Finally, since both solvers have parameters to control how hard they work to detect symmetry, I added an option to my code to skip adding any constraints and just crank up the solver's response.

The results are summarized in the following graph, which shows the optimality gap (best bound at the left end, best incumbent at the right end) for each solver and modeling option. The dashed vertical line is the optimal solution (from Erwin's post).

The "Bilateral" and "Trilateral" models are my original two constraints and my two constraints plus Imre's, respectively. The "Diagonal" model uses Rob's constraint. "None" is just the model by itself and "Solver" is the unmodified model plus a solver parameter setting to get it to work harder detecting symmetries. Note that the results for Xpress with "None" and Xpress with "Solver" are pretty much identical, confirming Imre's assertion that Xpress would detect the symmetry on its own. CPLEX saw a bit of improvement in both incumbent and bound going from "None" to "Solver", so apparently the nudge helped there. Only two runs found the optimal solution, both times CPLEX with some help from antisymmetry constraints. None of the runs got the best bound anywhere near tight.

Returning to my "old dog, new tricks" theme, the takeaway for me is that before I go nuts trying to constrain away symmetry in a model, I need to investigate whether the solver can recognize it and, if so, whether it can eliminate or even exploit the symmetry.

Lastly, I belatedly realized that Erwin got his proven optimum quickly because he modified the model to use equality constraints in the interior of the grid and inequalities only in the two outermost rows/columns on each edge. I added that to my code as well, and yes, it gets a proven optimum incredibly fast. I was a bit leery about assuming that redundant coverage would only be required near the boundary, but per some comments by Rob on Erwin's post, 227 is indeed the (known) optimal value for a 32x32 grid.

A Boolean Grid

2025-02-28T18:21:00.001-05:00

A recent blog post by Erwin Kalvelagen discusses a very straightforward integer programming problem. You have a rectangular grid of boolean variables, where a variable's neighbors are the variables immediately above, below, to the left or to the right of it. The sole constraint is that, for any cell, at least one of that cell's variable or its neighbor variables must be true. The objective is to minimize the number of true cells in the grid. Erwin coded the model in GAMS and ran it for a 32x32 grid. He reported that he got an incumbent value of 227 in about 65 seconds but had trouble getting to optimality. (This might be a good time to point out that Erwin's computer is probably better than mine, since he is a consultant.)

I was curious whether a couple of redundant constraints would help. The problem suffers (if that is the correct term) from symmetry. Draw a grid and color in the cells of an optimal solution. Now flip the grid, switching either top with bottom, left with right, or both. The colored cells still form an optimal solution. What is the harm of symmetry? Think about the branch-and-bound (or, if you prefer, branch-and-cut) algorithm and specifically how it prunes nodes based on bound. When you find a new incumbent solution, you prune any node whose bound is no better than that solution. Typically the node bound will be at least slightly loose (meaning, in a minimization problem, that the bound will be strictly less than the objective value of the best feasible solution lurking in that node). In the context of the current problem, when a feasible solution is found, there will be at least three other feasible solutions with the same objective value, obtained by reversing the indexing of rows, columns or both. Each of them will likely be in a node of the search tree whose bound is at least slightly better than their true objective value, meaning that none of those nodes can be pruned right away, even if they do not contain an even better solution.

So symmetry can slow pruning and also slow improvement of the best bound. There has been research on how to exploit symmetry in IP models, but as far as I know that work has to be baked into a solver to be used. At least some solvers have some built-in capability to recognize and deal with symmetry, but I'm not sure how well that works. I usually keep an eye out for symmetry and, if I think it might be slowing improvement of the bound, see if I can constrain away some of it.

In this case, the symmetry I identified can be removed by adding just two constraints. One is that the number of true cells in the top row should not exceed the number in the bottom row (so that the vertical flip is ruled out unless the top and bottom rows are tied). Similarly, the other is that the number of true cells in the left column should not exceed the number in the right column (ruling out the horizontal flip). Since these constraints shrink the feasible region, it would not be surprising if they slow down identification of improved incumbents. By enlarging the model, they also slow down (very slightly?) the rate at which nodes are solved. The hope is that faster bound improvements compensate for that.
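To make the symmetry argument concrete, here is a small Python sketch (not the Java code I actually ran; the function names are made up for illustration). It checks the coverage constraint, generates the flips of a grid, and tests the two antisymmetry constraints. Grids are lists of rows containing 0/1 values.

```python
def is_covered(grid):
    """True if every cell is true or has a true neighbor above, below, left or right."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            cells = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if not any(0 <= a < rows and 0 <= b < cols and grid[a][b]
                       for a, b in cells):
                return False
    return True


def satisfies_antisymmetry(grid):
    """Top row count <= bottom row count and left column count <= right column count."""
    top, bottom = sum(grid[0]), sum(grid[-1])
    left = sum(row[0] for row in grid)
    right = sum(row[-1] for row in grid)
    return top <= bottom and left <= right


def flips(grid):
    """The grid together with its vertical, horizontal and combined flips."""
    vertical = grid[::-1]
    horizontal = [row[::-1] for row in grid]
    both = [row[::-1] for row in grid[::-1]]
    return [grid, vertical, horizontal, both]


grid = [[1, 1],
        [0, 0]]
print([is_covered(g) for g in flips(grid)])              # every flip is feasible
print([satisfies_antisymmetry(g) for g in flips(grid)])  # only some flips survive
```

A vertical flip leaves the left and right column counts unchanged, and a horizontal flip leaves the top and bottom row counts unchanged. So at least one of the four versions of any solution always satisfies both constraints, which is why adding them cannot cut off the optimal value.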

I ran the 32x32 case (coded in Java) using two solvers, FICO Xpress MP (version 44.01.01) and IBM CPLEX (version 22.1.2), both with and without the antisymmetry constraints. Each run was limited to 15 minutes (wall clock time) and used default settings for all parameters. Here are the results, in the format "best solution/best bound".

| | Xpress MP | CPLEX |
|---|---|---|
| With antisymmetry | 231 / 215.73 | 227 / 214.40 |
| Without antisymmetry | 228 / 215.95 | 229 / 214.38 |

There is enough randomness in IP solvers that I would not read much into a single iteration of each model. CPLEX actually did a bit better on the incumbent with the antisymmetry constraints included, which surprises me. Xpress had a very slightly worse bound with them included, which also surprises me. The bottom line seems to be that the antisymmetry constraints do not help much (which I find disappointing) and that, as Erwin noted, the problem is a bit stubborn.

As always, my code is available for download.

CPLEX Drops Documentation

2025-02-10T14:42:00.000-05:00

CPLEX Studio 22.1.2 is now available for download, at least on most (possibly not all) supported platforms. The previous version was 22.1.1, and judging from the numbering I assume this is a fairly minor update. That's the good news. The bad news is that the new version no longer installs documentation.

Previous versions created a folder named "doc" in the main folder (parallel to "concert", "cplex", "cpoptimizer" etc.). Under the "doc" folder, if you drilled down a couple of levels, were all sorts of manuals. Particularly important to me were the "refjavacplex" and "refjavacpoptimizer" folders, which contained Javadoc documentation that could be integrated into an IDE (NetBeans in my case) for Java programming. It was also convenient to have the reference manuals installed locally, so that I could bookmark them in my browser and access them even if I was working offline.

Version 22.1.2 does not install the "doc" folder. The reference manuals are still available online if you know where to look (https://www.ibm.com/docs/en/icos/22.1.2), but I do not see a way to make that work as Javadoc in an IDE (and, of course, it is only available when you are online, unless you want to web-scrape it to get a local copy).

I've submitted an "idea" to restore the documentation to the download. If you too want it back, please vote for the suggestion. I would have no objection to it being a separate download, but taking away the Javadoc really seems a bit unfriendly to me.

Meanwhile, I am linking my IDE to the Javadoc for version 22.1.1 and hoping that nothing much has changed.

Minimizing Flow Support (II)

2025-01-29T15:02:00.000-05:00

In yesterday's post I described a graph optimization problem (from a post on OR Stack Exchange) and how to generate random test instances. You are given a digraph with a single commodity flowing through it. There may be multiple supply and demand nodes (with supply and demand in balance), and arcs have neither costs nor capacity limits. The problem is to find a subgraph with all the original nodes and a minimal number of arcs (the "support" of the flow) such that there is a feasible flow pattern to satisfy all demands. In this post, I will discuss a couple of mixed integer programming (MIP) models for the problem.

I will use the following notation. The nodes are $n_1, \dots, n_N$, and the supply or demand at node $n_i$ is given by $s_i,$ where $s_i > 0$ at supply nodes and $s_i < 0$ at demand nodes. The total supply is given by $F.$ The set of arcs is $A,$ and at each node $n_i$ we denote by $\delta^+(n_i)$ and $\delta^-(n_i)$ respectively the set of arcs flowing into and out of $n_i.$ The common elements of my two MIP models are the following.

Variable $x_a \in \lbrace 0,1\rbrace$ is 1 if and only if arc $a\in A$ is selected for the subgraph.
Variable $y_a \in [0, F]$ is the flow volume over arc $a \in A.$
The objective is to minimize the total number of arcs selected:

$$\textrm{minimize }\sum_{a\in A} x_a.$$
For each node $n_i$ we require that flows in and out combined with any supply or demand balance:

$$\sum_{a \in \delta^-(n_i)} y_a - \sum_{a\in \delta^+(n_i)} y_a = s_i.$$

The models differ in the remaining requirement, that there be no flow on any arcs that were not selected (i.e., $y_a = 0$ if $x_a =0$). Those constraints are added as follows.

In the "big M" model, for each $a\in A$ we add the constraint $y_a \le Fx_a.$
In the "indicators" model, for each $a\in A$ we add an if-then constraint $x_a = 0 \implies y_a = 0.$
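As a sanity check on the notation, the following Python sketch verifies a candidate solution against the flow balance constraints and the big M linking constraints $y_a \le Fx_a.$ It is a plain feasibility checker with hypothetical names, not a solver model and not the Java code linked in these posts.

```python
def check_flow_support(nodes, arcs, supply, x, y, total_supply, tol=1e-9):
    """Check flow balance and the big M linking constraints y_a <= F * x_a.

    arcs is a list of (tail, head) pairs; x and y are dicts keyed by arc;
    supply[n] is positive at supply nodes and negative at demand nodes.
    """
    for n in nodes:
        outflow = sum(y[a] for a in arcs if a[0] == n)
        inflow = sum(y[a] for a in arcs if a[1] == n)
        if abs(outflow - inflow - supply[n]) > tol:
            return False  # flow balance violated at node n
    for a in arcs:
        if y[a] < -tol or y[a] > total_supply * x[a] + tol:
            return False  # flow on an unselected arc, or outside [0, F]
    return True


nodes = [1, 2, 3]
arcs = [(1, 2), (2, 3), (1, 3)]
supply = {1: 5, 2: 0, 3: -5}
direct = {"x": {(1, 2): 0, (2, 3): 0, (1, 3): 1},
          "y": {(1, 2): 0, (2, 3): 0, (1, 3): 5}}
print(check_flow_support(nodes, arcs, supply, direct["x"], direct["y"], 5))
```

In this toy instance, shipping all five units over the direct arc gives a support of size one, while routing through node 2 would need two arcs.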

I tested both models using two different solvers, IBM CPLEX 22.1.1 and FICO Xpress 9.5 Optimizer (version 44.01.01). Before running a "production" problem I ran all four combinations of model and solver on a small instance using the solvers' respective tuning routines. In three cases, the default solver settings seemed best. In one case (CPLEX on the big M model), the tuner suggested a couple of nondefault parameter settings, but they fared poorly on the production problem. So I used default parameter settings on all the production runs.

The production problem had 25 supply nodes, 34 demand nodes and 41 transit nodes (nodes where supply was 0) and 526 arcs. Total supply was $F=1000.$ I gave each combination of model and solver a one hour time limit (on my slightly vintage PC). I was mainly curious about how the two models would compare, and secondarily on how the two solvers would compare. Of course, running one test instance for one hour per combination is far from probative, but my curiosity has its bounds. Here is what I found.

| Solver | Model | Incumbent | Lower bound | Gap (%) |
|---|---|---|---|---|
| CPLEX | big M | 55 | 49.5 | 10 |
| CPLEX | indicators | 54 | 48 | 11 |
| Xpress | big M | 57 | 49 | 16 |
| Xpress | indicators | 58 | 49 | 18 |

There are some differences among combinations in the results (which, again, might not bear up under multiple tests), but what I found a bit interesting was that the gap never made it below 10% in any combination, even though I consider the test problem to be not particularly large. (Also slightly interesting was that only roughly 10% of the arcs in the graph were needed.)

I will note one difference in the solvers. I'm not sure if it is a function of different default parameter settings or different memory management. Within the one hour run time limit, neither attempt with Xpress ran into memory issues. In contrast, one of the CPLEX runs exhausted system memory (and hung the system) before the hour was up. So I did the other CPLEX run with a limit of 9500 MB on the tree size (set via the parameter CPXPARAM_MIP_Limits_TreeMemory), and that run ended due to memory exhaustion before the hour was up. (Both CPLEX runs lasted only about 53 minutes.)

One of the main reasons I ran the tests was to see whether the model with indicators would be tighter than the big M model. A bit of wisdom I received a long time ago was that big M was better than indicators if you have insights into the model that allow you to use a not terribly large value of $M.$ Here, the worst case would be if the entire flow volume $F$ passed through a single arc, which lets me use $F$ (1000 in the test problem) as my value of $M.$ The big M runs did produce smaller gaps than their indicator counterparts, but not by much, and possibly not by a "statistically significant" amount (if you can even mention "statistical significance" while working with a sample size of 1 🥲).

Still, for now I will stick to preferring big M constraints with not-so-big values of $M$ over indicator constraints.

As mentioned in the previous post, you can find my code here, including a README.md file that explains the code structure.

Minimizing Flow Support (I)

2025-01-28T15:17:00.000-05:00

This is the first of hopefully two posts related to a question posted on Operations Research Stack Exchange. You are given a digraph through which a single commodity flows. Arcs have neither costs nor capacity limits. Each node $n_i$ is either a supply node (with supply $s_i>0$), a demand node (with demand $s_i<0$ treated as a "negative supply" in the optimization models), or what I will call a "transit node" ($s_i =0$) with neither supply nor demand. Key assumptions are that total supply equals total demand and that it is possible to find paths through the digraph satisfying all demands (and thus consuming all supplies).

The problem is to select a minimal number of arcs such that the reduced digraph (using all the original nodes but just the selected arcs) contains routes fulfilling all demands. In this post, I'll describe one way to generate random test instances of the problem. The following post will discuss modeling and solving the problem. I have Java code demonstrating both parts in my university GitLab repository.

My approach to generating a test instance starts with specification of the number of nodes in the graph and a lower bound for the number of arcs. (The lower bound might need to be exceeded in order to ensure that a feasible flow satisfying all demands exists.) My code also asks the user to specify an integer value $F$ for total supply (and total demand). I used integer flows just to make printing the supply/demand at each node and flow on each arc neat. You might prefer to use real valued flows and (since the problem is invariant with respect to the flow volume) just set the total supply/demand at 1.

In what follows, I will use the terms "upstream" and "downstream" to refer to nodes from which there are directed paths to a given node (upstream) or to which there are directed paths from a given node (downstream). To simplify explanation, I will treat demands as positive values. The construction process starts by creating the desired number of nodes and partitioning them into supply, demand and transit nodes. Since you need at least one supply node and at least one demand node, the first node is assigned as a supply node and the second as a demand node. (My code also assigns one node as a transit node, but if you do not care whether there are any transit nodes you can skip that.) The remaining nodes are randomly classified as supply, demand or transit. Since my code uses integer flow values, each supply (demand) node needs a supply (demand) of at least 1. If the partitioning process creates more than $F$ supply or demand nodes, the excess nodes are reclassified as transit nodes. If you are using real flow values, this can be skipped.

The next step is to allocate total supply (total demand) randomly across the supply (demand) nodes. Again, since I am using integer flows, my code first allocates one unit of supply or demand to each non-transit node, then allocates the remaining supply (demand) one unit at a time, randomly choosing with replacement a supply (demand) node to receive the next unit.
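The allocation step is short enough to show as a Python sketch. This is an illustration rather than a translation of my Java code, and the function name and arguments are hypothetical.

```python
import random


def allocate_units(node_ids, total, rng=None):
    """Split `total` integer units across node_ids, giving every node at least 1."""
    rng = rng or random.Random()
    if total < len(node_ids):
        raise ValueError("need at least one unit per node")
    amounts = {n: 1 for n in node_ids}      # one unit each to start
    for _ in range(total - len(node_ids)):
        amounts[rng.choice(node_ids)] += 1  # sampling with replacement
    return amounts


supplies = allocate_units(["n1", "n4", "n7"], total=10, rng=random.Random(1))
print(supplies, sum(supplies.values()))
```

The same function would be called a second time, with the same total, to spread demand across the demand nodes.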

With the nodes created, it is time to move on to creating arcs. My code allocates to each node of any type two initially empty sets of nodes, those upstream and downstream, and two initially empty sets of arcs, those entering and those leaving the node. It also assigns a temporary variable containing the excess supply if the node is a supply node or the unmet demand if the node is a demand node. Two sets of nodes are created, one containing supply nodes with unused supply (initially, all the supply nodes) and the other containing demand nodes with unmet demands (initially, all the demand nodes).

A list of all possible arcs (excluding arcs from a node to itself) is created and randomly shuffled. Arcs are now added from that list until enough directed paths exist to ensure that all demands can be met. As each arc $(a, b)$ is added to the digraph, it is added to the set of arcs exiting the tail node $a$ and to the set of arcs entering the head node $b.$ The set of nodes upstream of $b$ is updated to include $a$ and all nodes upstream of $a,$ and the set of nodes downstream of $a$ is updated to include $b$ and all nodes downstream of $b.$ Finally, nodes upstream of $b$ with unused supply and nodes downstream of $a$ with unmet demand are paired up. The lesser of the unused supply and unmet demand is subtracted from both the supply of the upstream node and the demand of the downstream node, and whichever has zero supply/demand left is removed from the set of nodes with unused supply/unmet demand. It is possible (and certain when the very last drop of supply/demand is accounted for) that both nodes will be removed from their respective sets.

Once there are no nodes left with excess supply/demand, we can be certain the digraph contains at least one feasible solution. All that remains is to randomly add arcs until the user's specified minimum number of arcs is met.
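The arc-building loop can be sketched in Python as follows. Again, this is a loose illustration with hypothetical names, not my Java code. One liberty taken: when an arc is added, the sketch updates the upstream and downstream sets of every node on either side of the new arc, which goes a bit beyond the update described above but keeps the sets fully consistent.

```python
import random
from itertools import permutations


def build_arcs(supply, demand, nodes, min_arcs, rng=None):
    """Add random arcs until all demand can be met, then pad up to min_arcs.

    supply and demand map node -> positive integer amount (demands treated
    as positive); nodes lists every node, transit nodes included.
    """
    rng = rng or random.Random()
    candidates = list(permutations(nodes, 2))  # every arc except self-loops
    rng.shuffle(candidates)
    upstream = {n: set() for n in nodes}
    downstream = {n: set() for n in nodes}
    unused = dict(supply)  # supply not yet matched to a demand
    unmet = dict(demand)   # demand not yet matched to a supply
    arcs = []
    while unmet and candidates:
        a, b = candidates.pop()
        arcs.append((a, b))
        above = upstream[a] | {a}    # nodes that can now reach b
        below = downstream[b] | {b}  # nodes now reachable from a
        for n in below:
            upstream[n] |= above
        for n in above:
            downstream[n] |= below
        # Pair supply above the new arc with demand below it.
        for s in [n for n in above if n in unused]:
            for d in [n for n in below if n in unmet]:
                if s not in unused:
                    break  # this supply node has been used up
                shipped = min(unused[s], unmet[d])
                unused[s] -= shipped
                unmet[d] -= shipped
                if unused[s] == 0:
                    del unused[s]
                if unmet[d] == 0:
                    del unmet[d]
    while len(arcs) < min_arcs and candidates:
        arcs.append(candidates.pop())  # pad with arbitrary remaining arcs
    return arcs


example = build_arcs({0: 3, 1: 2}, {2: 4, 3: 1}, list(range(6)), 8, random.Random(0))
print(len(example))
```

Because each pairing matches a supply node with a demand node it can reach, and arcs are uncapacitated, the matched amounts add up to a feasible flow once no unmet demand remains.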

Mint Madness

2025-01-17T18:02:00.001-05:00

I'm a long time, and generally quite content, user of the Linux Mint operating system. For quite a while now, it has had a very nice Software Manager program that lets you install or uninstall programs, as well as an older program named Synaptic that provides more fine-grained capabilities. Synaptic has not been updated in almost two years. To a lot of computer nerds, that means it needs to be replaced. To me, it's a reminder of the first sentence of the "Red-neck Repair Manual": "If it ain't broke, don't fix it."

Today I upgraded Mint from version 22 (Wilma) to version 22.1 (Xia). To my surprise, the upgrade uninstalled Synaptic and did not reinstall it. There was no warning in the Mint 22.1 release notes that this was going to happen (?!). Fortunately, it remains available in a repository and I was able to reinstall it (somewhat ironically, via Software Manager, its ostensible replacement).

I looked at the Mint message boards, and there was a fair bit of confusion and concern about this, along with some relief when users found out they could reinstall it. Looking for a rationale for the removal, all I could find were some vague assertions about there being better alternatives (Software Manager?) than the aged Synaptic, and speculation that the Mint developers were perhaps trying to push users to Software Manager or something new.

Here's the problem with this: Software Manager is not a replacement for Synaptic. It's fine for installing programs, but as best I can tell it is useless for installing libraries. For example, suppose that I want to install a program or library that will mess around with PDFs, and the installer balks because it cannot find the libpoppler library (or cannot find the correct version of it). With Synaptic, I can search "libpoppler" and see which versions I have installed and which are out there but not installed. Odds are a third-party program or library looking for it wants libpoppler-dev, and if I don't already have it, it's a couple of clicks to install it with Synaptic. If I search Software Manager, it won't find what I need. (As of today, at least, it just suggests Ruby-poppler, which it says contains Ruby bindings for libpoppler, as opposed to libpoppler itself.) There's a switch in Software Manager labeled "Search in package descriptions (even slower search)". I tried that and I can confirm that "even slower" was a massive understatement. "Glacial" might be more accurate. The extra time was for naught; it found an R library that uses libpoppler and one other similarly irrelevant hit, but nothing related to installing libpoppler.

I am at a total loss as to why the Mint folks would take away a convenient way to install libraries (not included in programs). There are other ways to install libraries (including directly from the command line), but I am not aware of any as easy to use as Synaptic ... which I fortunately still have.

MATE Misadventure

2024-12-07T15:03:00.000-05:00

My desktop computer runs Linux Mint with the MATE desktop environment. I know Cinnamon is fancier / trendier, but I'm used to MATE, plus I run it on my "vintage" laptop (which in the past struggled to run Cinnamon) and it's nice to have the same environment both places.

Until recently, the only issue I had with MATE was an occasional glitch in the panel. I "solved" that by creating a script to reset the panel, which I invoke manually as needed. Of late, though, I have run into a new problem. It may or may not be related to a recent operating system upgrade, which forced my rather long in the tooth PC from the 4.x kernel series to the 6.x kernel series.

Normally, when I power up my PC it opens a few applications (including Thunderbird and Firefox) in the foreground as well as the usual cornucopia of things in the background. That still happens most days, but every so often something goes splat. The desktop appears, complete with icons, but the MATE panel does not. More importantly, none of the application windows are visible (although, somewhat weirdly, notifications of incoming messages from Thunderbird will pop up, even though Thunderbird itself is invisible). So everything appears to be running, but nothing beyond the desktop (and those popups) is displaying.

I still have no explanation of why this happens, but I now have a workaround. It turns out that marco is the window manager in MATE. The first step (on a day when things were working) was to open a terminal, run pgrep marco followed by ps -l XXXX (where XXXX was the PID of the marco process), and take note of the command line options (in my case, "--composite --replace"). Next, I created a bash script containing the following:

pkill marco
marco --composite --replace &
disown

This kills the marco process that is acting up and starts a new one. As I discovered today, without the first line the script still works but leaves the original marco process in memory as an orphan. I bound the script to an unused key combination (using the "Keyboard Shortcuts" application in the Mint menu). So now if MATE screws up after booting, I just type the key combination and things are instantly fixed.

Android Silliness

2024-11-18T14:25:00.000-05:00

Once upon a time I bought an Insignia smart speaker (with Google Assistant baked in) for my bedroom. I could set an alarm that would stream my choice of radio station using verbal commands ("Hey Google, set a radio alarm for ..."). It worked so well that Google decided to fix it.

At some point I saw a brief item in a tech feed about Google eliminating radio alarms in favor of some sort of automation thing, but my speaker kept on working and I ignored the story ... to my own peril. A few weeks ago, I canceled the existing alarm. When I went to set a new alarm, the speaker's response was "I'm sorry, I don't know how to do that."

After a bit of research, I discovered that the Google Home app now has something called "automations". It turned out a bunch of predefined automations were set on my phone, none of which ever did anything (because I never uttered the necessary incantation?). None were anything I wanted, so I turned them all off and created a new one. Automations involve one or more "starters" (in my case, every day at 6:57 am), one or more "actions" (in my case, play the radio station I want) and a "configuration" (in my case, play it on the bedroom speaker). You can test the automation by tapping a button (and I did). It worked when triggered manually.

So began the adventure. On day 1, the new automation failed to do anything. I woke up on my own (belatedly), changed the time to five minutes or so after I awoke (call it 7:20 am), and it worked. So I reset it to 6:57 am, and the next day it again failed. Eventually I found a notification hidden away somewhere that the automation had failed due to scheduled down time being set. I went to the settings menu in the Home app, found "Digital wellbeing", and sure enough there was a downtime setting. It was set to expire at 8:00 am, leaving me confused as to why the 7:20 am test had worked, but whatever. I deleted it.

Next morning, no luck. I went to the settings for the speaker and found it had its own digital wellbeing setting, which I also deleted. Still no luck, and no notifications I could find as to why it did not work.

I'll skip over the details of a very unsatisfying online chat with a level 1 Google support person (who did not seem to grasp that the 7:20 test working implied that the speaker really was connected to the home WiFi) and an email exchange with a level 2 support person who wanted the speaker's serial number among other things. (That was two weeks ago. Nothing back so far.) With some experimentation, I discovered the following. If I set the alarm time prior to 7:00 am, it did not work at all. If I set it to exactly 7:00 am, it triggered at 7:00 am the first day and 7:02 am every day thereafter. If I set it to 7:01 am, it triggered at 7:01 am every day. (Note that the 7:01 setting triggered a minute before the 7:00 am setting did, which offended me as a mathematician due to the lack of monotonicity.)

Ultimately I got another notification about something to do with downtime, which made no sense to me since I had deleted the downtime settings for both phone and speaker. So I went into the phone settings (not the Home app, but the phone itself). After considerable vertical scrolling, I found "Digital Wellbeing & parental controls". Wallowing around in that, I found a "Bedtime mode" (which was turned off) and a "Do Not Disturb" menu. "Do Not Disturb" was turned off on the phone, but fortunately I got curious and burrowed into the menu. In the "General" section of that menu was a heading named "Schedules" saying I had three schedules set. Two of those were "Gaming" and "Game Dashboard" (no idea what they do or why they were turned on). I forget the title of the third one (I have since renamed it "sleeping"), but that turned out to be what was blocking the alarm. I left the start time at 11:00 pm and changed the end time to "6:45 am next day". It had been set to 7:00 am, which apparently was what caused issues. You can customize what things get blocked (via "Do Not Disturb behavior"), but I didn't bother. Curiously, there is a switch labeled "Alarm can override end time", which was (and remains) turned on. Since that did not allow the automation to trigger before 7:00 am, I assume it only applies to alarms and not automations.

With that tweak, the radio alarm via the speaker started working at 6:57 am, and so I am tentatively going to declare victory. Why we need digital wellbeing settings scattered all over the place under various names is a mystery to me, one of many.

Solver Parameters Matter

2024-11-10T16:37:00.002-05:00

Modern integer programming solvers come with a gaggle of parameters the user can adjust. There are so many possible parameter combinations that vendors are taking a variety of approaches to taming the beast. The first, of course, is for vendors to set default values that work pretty well most of the time. This is particularly important since many users probably stick to default settings unless they absolutely have to start messing with the parameters. (By "many" I mean "myself and likely others".) The second is to provide a "tuner" that the user can invoke. The tuner experiments with a subset of possible parameter settings to try to find a good combination. Third, I've seen some discussion and I think some papers or conference presentations on using machine learning to predict useful settings based on characteristics of the problem. I am not sure how far along that research is and whether vendors are yet implementing any of it.

In the past few days I got a rather vivid reminder of how much a single parameter tweak can affect things. Erwin Kalvelagen did a blog post on sorting a numeric vector using a MIP model (with an up front disclaimer that it "is not, per se, very useful"). He tested a couple of variants of a model on vectors of dimension 50, 100 and 200. I implemented the version of his model with the redundant constraint (which he found to speed things up) in Java using the current versions of both CPLEX and Xpress as solvers. The vector to sort was generated randomly with components distributed uniformly over the interval (0, 1). I tried a few random number seeds, and while times varied a bit, the results were quite consistent. Not wanting to devote too much time to this, I set a time limit of two minutes per solver run.

Using default parameters, both solvers handled the dimension 50 case, but CPLEX was about eight times faster. For dimension 100, CPLEX pulled off the sort in two seconds but Xpress still did not have a solution at the two minute mark. For dimension 200, CPLEX needed around 80 seconds and Xpress, unsurprisingly, struck out.

So CPLEX is faster than Xpress, right? Well, hang on a bit. On the advice of FICO's Daniel Junglas, I adjusted one of their parameters ("PreProbing") to a non-default value. This is one of a number of parameters that will cause the solver to spend more time heuristically hunting for a feasible or improved solution. Using my grandmother's adage "what's sauce for the goose is sauce for the gander," I tweaked an analogous parameter in CPLEX ("MIP.Strategy.Probe"). Sure enough, both solvers got faster on the problem (and Xpress was able to solve all three sizes), but the changes were more profound than that. On dimension 50, Xpress was between three and four times faster than CPLEX. On dimension 100, Xpress was again around four times faster. On dimension 200, Xpress won by a factor of slightly less than three.

So is Xpress actually faster than CPLEX on this problem? Maybe, maybe not. I only tweaked one parameter among several that could be pertinent. To me, at least, there is nothing about the problem that screams "you need more of this" or "you need less of that", other than the fact that the objective function is a constant (we are just looking for a feasible solution), which suggests that any changes designed to tighten the bound faster are likely to be unhelpful. I confess that I also lack a deep understanding of what most parameters do internally, although I have a pretty good grasp on the role of the time limit parameter.

So the reminder for me is that before concluding that one solver is better than another on a problem, or that a problem is too difficult for a particular solver, I need to put a bit of effort into investigating whether any parameter tweaks have substantial impact on performance.

Update: A post on the Solver Max blog answers a question I had (but did not investigate). With feasibility problems, any feasible solution is "optimal", so it is common to leave the objective as optimizing a constant (usually zero). Erwin and I both did that. The question that occurred to me (fleetingly) was whether an objective function could be crafted that would help the solver find a feasible solution faster. In this case, per the Solver Max post, the answer appears to be "yes".

Xpress and RStudio

2024-11-03T11:55:00.000-05:00

The following is probably specific to Linux systems. I recently installed FICO Xpress optimizer, which comes with an R library to provide an API for R code. FICO requires a license file (or a license server -- I went with a static file since I'm a single user) and adds an assortment of environment variables to the bash shell, including one pointing to the license file. So far, so good.

Xpress comes with example files, including example R scripts. So I cranked up RStudio, opened the simplest example ("first_lp_problem.R", which is just what it sounds like) and executed it line by line. The problem setup lines worked fine, but the first Xpress API call died with an error message saying it couldn't find the license file in directory "." (i.e., the current working directory). The same thing happened when I tried to source the file in the RStudio console.

To make a long story somewhat shorter, after assorted failed attempts to sort things out it occurred to me to run R in a terminal and source the example file there. That ran smoothly. So the problem was with RStudio, not with R. Specifically, it turns out that RStudio runs without loading any bash environment variables.

After assorted failed attempts at a fix (and pretty much wearing out Google), I found the following solution. In my home directory ("/home/paul", a.k.a. "~") I created a text file named ".Renviron". In it, I put the line "XPAUTH_PATH=/home/paul/.../xpauth.xpr", where "..." is a bunch of path info you don't need to know and "xpauth.xpr" is the name of the license file. If you already have a ".Renviron" file, you can just add this line to it. The example script now runs fine in RStudio. Note that there are a gaggle of other bash environment variables created by Xpress, none of which presumably are known to RStudio, but apparently the license file path is the only one needed by the API (at least so far). If I trip over any other omissions later on, presumably I can add them to ".Renviron".

Bounds and Reduced Costs

2024-09-16T15:20:00.001-04:00

I recently read a question from someone who had solved a linear program (minimization) using a reliable solver, extracted the primal and dual solutions, recomputed reduced costs manually and encountered a negative reduced cost in a supposed optimal solution. That appeared to be a bit paradoxical. I don't have enough information to be sure, but I suspect variable bounds are involved.

Fourteen years ago (!) I wrote a post about how to find the dual value of a bound on a variable when the bound is entered as a bound rather than a functional constraint. It belatedly occurred to me that an example might be helpful (and might shed some light on the paradoxical reduced cost), so here goes. Let's start with a simple LP.\begin{alignat*}{1} \max\ 5x+y\\ \textrm{s.t. }x+2y & \le 8\\ x & \le 2\\ y & \le 10 \end{alignat*} where $x\ge 0$ and $y\ge 0.$ The problem is easy to solve graphically, as shown below.

The optimal solution is $(x,y) = (2,3)$ with objective value 13. Let's assume we feed it to an LP solver as written (three constraints). The dual values of the constraints are $(0.5, 4.5, 0).$ In particular, note the dual value of 4.5 for the upper bound on $x$ (which is binding). If we increase the bound from $2$ to $2 + \epsilon,$ the optimal corner shifts from $(2,3)$ to $(2 + \epsilon, 3- \frac{1}{2}\epsilon),$ with a resulting change in objective value from 13 to $13+4.5\epsilon.$ Also note that the reduced cost of $x$ is $5 - 1 \times 0.5 - 1 \times 4.5 - 0 \times 0 = 0$ and the reduced cost of $y$ is $1 - 2\times 0.5 - 0 \times 4.5 -1 \times 0 = 0,$ which conforms to our expectation that basic variables have zero reduced costs.

Now suppose that I enter the variable bounds directly as bounds, rather than as constraints, so that the LP has a single inequality constraint. Most (all?) contemporary solvers allow this. The optimal solution is unchanged, as is the dual value (0.5) for the first (and now sole) constraint. What follows was confirmed using CPLEX, but I suspect the experience with other solvers will be similar. The only dual value available is the dual for the first constraint. CPLEX, as best I can tell, does not allow you to ask for a dual value associated with a variable bound. So does that mean that the reduced cost of $x$ is just $5 - 1 \times 0.5 = 4.5?$ Yes, that is what CPLEX reports as the reduced cost of $x,$ and correctly so.

The fact that at optimum a basic variable has zero reduced cost applies to the original simplex method. There is a variant of the simplex method for bounded variables, and in that version a variable can be nonbasic at its upper bound, with a nonzero (possibly favorable) reduced cost. Moreover, as I mentioned in that earlier post, in this approach the reduced cost of a variable at its bound is the dual value of the bound constraint. So it is not coincidence that the reduced cost of $x$ when entering its bound as a bound (4.5) matches the dual value of the bound when entering it as a constraint.
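The arithmetic in this example is easy to check by hand, but a few lines of Python make the comparison explicit. This is only a numerical check of the figures quoted above (no solver is called); the helper name `reduced_cost` and the data layout are choices made for the sketch.

```python
# Data for: max 5x + y  s.t.  x + 2y <= 8,  x <= 2,  y <= 10.
c = {"x": 5.0, "y": 1.0}
column = {"x": [1.0, 1.0, 0.0], "y": [2.0, 0.0, 1.0]}  # coefficients by constraint row
rhs = [8.0, 2.0, 10.0]
duals = [0.5, 4.5, 0.0]   # dual values quoted in the post

def reduced_cost(var, rows):
    """Objective coefficient minus the dual-weighted column entries over `rows`."""
    return c[var] - sum(duals[i] * column[var][i] for i in rows)

# Bounds entered as constraints: all three rows have duals.
as_constraints = {v: reduced_cost(v, rows=[0, 1, 2]) for v in c}
# Bounds entered as bounds: only the first row is a functional constraint.
as_bounds = {v: reduced_cost(v, rows=[0]) for v in c}
# Sanity check: the dual objective should match the primal optimum of 13.
dual_objective = sum(d * b for d, b in zip(duals, rhs))

print(as_constraints)   # both reduced costs are zero
print(as_bounds)        # x picks up the dual value of its bound
print(dual_objective)
```

With the bounds treated as bounds, the reduced cost of $x$ comes out to 4.5, the same number that appears as the dual of $x \le 2$ when that bound is entered as a constraint.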

A Bicriterion Movement Model

2024-09-08T16:31:00.001-04:00

A question on Operations Research Stack Exchange asks about a bicriterion routing problem. The underlying scenario is as follows. A service area is partitioned into a rectangular grid, with a robot beginning (and eventually ending) in a specified cell. Each move of the robot is to an adjacent cell (up, down, left or right but not diagonal). The robot must eventually visit each cell at least once and return whence it came. One criterion for the solution is the number of movements (equivalently, the amount of time, since we assume constant speed) required for the robot to make its rounds. In addition, each cell is assigned a nonnegative priority (weight), and the second criterion is the sum over all cells of the time of the first visit weighted by the priority of the cell. In other words, higher priority cells should be visited earlier. Both criteria are to be minimized.
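To make the two criteria concrete, here is a small Python sketch that scores a given route. It is not part of the optimization model; the function name `evaluate_route`, the 2 x 2 grid and the priorities are invented for illustration, and the starting cell is assumed to be visited at time 0.

```python
def evaluate_route(route, priority):
    """Return (moves, weighted_delay) for a closed tour on a grid.

    route: list of (row, col) cells, starting and ending at the same cell.
    priority: dict mapping every cell to its nonnegative weight.
    The start cell counts as visited at time 0 (an assumption of this sketch).
    """
    if route[0] != route[-1]:
        raise ValueError("route must return to its starting cell")
    first_visit = {}
    for t, cell in enumerate(route):
        if t > 0:
            (r0, c0), (r1, c1) = route[t - 1], cell
            if abs(r0 - r1) + abs(c0 - c1) != 1:
                raise ValueError("each move must go to an adjacent cell")
        first_visit.setdefault(cell, t)   # keep only the earliest visit time
    if set(first_visit) != set(priority):
        raise ValueError("route must visit every cell")
    moves = len(route) - 1
    weighted_delay = sum(priority[cell] * t for cell, t in first_visit.items())
    return moves, weighted_delay

# A 2 x 2 grid with made-up priorities, toured in both directions.
priority = {(0, 0): 0, (0, 1): 1, (1, 1): 5, (1, 0): 2}
clockwise = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
counterclockwise = clockwise[::-1]
print(evaluate_route(clockwise, priority))          # (4, 17)
print(evaluate_route(counterclockwise, priority))   # (4, 15)
```

Both tours use four moves, so they tie on the first criterion, but they differ on weighted delay, which is exactly the kind of tradeoff the second criterion is meant to capture.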

The problem can be modeled as either an integer program or a constraint program. The movement portion is quite straightforward to model. Balancing the two objectives is where things get interesting. One can optimize either criterion after setting a bound on how bad the other can be, or one can use lexicographic ordering of the criteria (optimize the primary objective while finding the best possible value of the secondary objective given that the primary must remain optimal), or one can optimize a weighted combination of the two objectives (and then play with the weights to explore the Pareto frontier). Weighted combinations are a somewhat tricky business when the objective functions being merged are not directly comparable. For instance, merging two cost functions is pretty straightforward (a dollar of cost is a dollar of cost, at least until the accountants get involved). Merging distance traveled and "priority" (or "urgency", or "weighted delay") is much less straightforward. In real life (as opposed to answering questions on ORSE), I would want to sit with the problem owner and explore acceptable tradeoffs. How much longer could a "good" route be if it reduced weighted delays by 1 unit?

I chose to use an integer program (in Java, with CPLEX as the optimization engine), since CPLEX directly supports lexicographic combinations of objectives. You can find the source code in my GitLab repository. A write-up of the model is in a PDF file here, and output for the test case in the ORSE question is in a text file here. The output includes one run where I minimized delay while limiting the distance to a middle value between the minimum possible distance and the distance obtained from lexicographic ordering with delay having the higher priority. It turned out that compromising a little on distance did not help the delay value.

LyX Upgrade Hiccup

2024-08-20T15:06:00.001-04:00

In recent months, I've upgraded LyX from version 2.3.6 to 2.4.0 RC3 (from the Canonical repos) to 2.4.1 (by compiling from source, as documented here). Along the way I discovered a glitch that apparently occurred during the upgrade from 2.3.6 to 2.4.0.

When you create a new document in LyX, it is initially created using a default template that you can customize. My default template, which I created in a previous millennium, starts all paragraphs flush left and uses what LaTeX calls a "small skip" to provide vertical separation between paragraphs. Yesterday, when I created a new document, I noticed that the first paragraph after a section heading would begin flush left but subsequent paragraphs were indented ... in the LyX GUI. When I compiled to a PDF document, however, every paragraph was flush left. The disconnect between what the GUI showed and what the compiled document contained was, shall we say, a trifle confusing.

Looking at the LaTeX output code (which you can do by selecting View > Code Preview Pane to open a preview of the code to be compiled), I noticed the following LaTeX commands being added to the document.

\setlength{\parskip}{\smallskipamount}

\setlength{\parindent}{0pt}

That reminded me to look at the template's document preamble (Document > Settings... > LaTeX Preamble), and sure enough those commands appeared there. Apparently I added them to the preamble of the default template back when dinosaurs still roamed the earth, and they've been there ever since. In resolving bug #4796, the developers changed how vertical spacing between paragraphs was handled. They now use the parskip LaTeX package. Since, for whatever reason, the template had Document > Settings... > Text Layout > Paragraph Separation set to Indentation: Default, the GUI in the current version indented paragraphs other than those immediately after section breaks. Meanwhile, those two lines in the preamble overrode that when creating the PDF output.

The fix was to create a new default template, which is extremely easy in LyX. You create a new document, customize document settings to your liking, go to Document > Settings... and click the Save as Document Defaults button. In my case, this meant changing Document > Settings... > Text Layout > Paragraph Separation to Vertical Space: Small Skip and deleting the aforementioned lines from the preamble. Now I just need to remember to check and, if necessary, replace the default template after future upgrades.

Installing LyX from Source

2024-08-20T14:39:00.003-04:00

I've been using LyX as my go-to document editor since its early days, and I really love it. Until recently, installation and upgrades (on Linux Mint) were a no-brainer. I could either install/update the package from the Canonical repositories (which tend to lag behind the current release) or get the latest version from a PPA (which, sadly, is no longer maintained). Yesterday I was running LyX 2.4.0 RC3, the version in the Canonical repo for Ubuntu 24.04 (the basis for the current version, 22, of Linux Mint), and ran into an issue. To see if it had already been resolved, I decided to upgrade to the current LyX version, 2.4.1. That entailed compiling it from source for the first time ever.

The process, while slow, was almost smooth. I did run into one speed bump, an error message (from the APT package manager, not from LyX) that was a trifle cryptic. Some serious googling turned up the solution. I thought I'd document the compilation process below in case other Mint (or Ubuntu) users want to try installing from source for the first time.

1. Open the Software Sources system app and make sure that Official Repositories > Optional Sources > Source code repositories is turned on. (On my first go-around it was turned off, which produced the cryptic error message I mentioned.)
2. Download the source tarball and signature file from the LyX download site. Verify the tarball as described here.
3. Unzip the tarball into a directory located wherever you want to keep the source code (if in fact you choose to keep it). I'm pretty sure you can delete the source folder after compiling and installing, but I'm hanging onto it for now.
4. In a terminal in that directory, run sudo apt-get build-dep lyx to install all necessary prerequisites. I found this useful tip in the LyX wiki. This was the step that generated the error message on the first attempt, due to my having skipped step 1 above.
5. Open the INSTALL.autoconf text file in the source folder and follow the directions (summarized here).
   a. Run configure (which takes a while).
   b. Run make (which takes an eternity).
   c. Optional: run make check and verify all tests were passed.
   d. Run make install to install the program.

Regarding the last step, I normally install LyX for all users on the system (which is just me, since this is my home PC). So I ran make install as root. If you just want it for the account under which you are logged in, you can run it without escalating privileges.

That's all there is to it.

Pipewire Sound Server

2024-08-04T12:40:00.000-04:00

I recently upgraded my PC and laptop to Linux Mint 22 (Wilma). It was a somewhat harrowing experience -- the laptop (which is my canary in the coal mine) updated fine, but the PC was bricked due to an issue with BIOS secure boot stuff. At power up there was a message involving some cryptic acronym I no longer remember. It had something to do with secure boot keys, and it prevented the PC from booting (even from a bootable USB drive). I spent an hour or two futzing with BIOS settings but eventually got secure boot and TPM turned off, which let the PC boot. I haven't had any problems since then.

One of the changes in Mint 22 is that the developers moved from PulseAudio to a newer sound (and video?) server called PipeWire. The changeover was initially invisible to me -- sound and video so far have worked flawlessly -- but necessitated changes to a couple of shell scripts I use. One of them is a convenience script that resets the speaker volume to my preferred level. It's handy when I have to crank volume up for some reason. The other is a script that launches Zoom. It turns up the volume before Zoom starts and then resets the volume when I exit from Zoom.

Fortunately, it wasn't hard to find the new commands (thank you Uncle Google). There's probably more than one way to control volume from the command line or a script, but I ended up using the WirePlumber library. I can't recall if it was installed automatically during the upgrade or if I had to add it. The key command is wpctl, which somewhat curiously does not seem to have a man page. Fortunately, wpctl --help will get you the information you need. My old scripts used

pacmd set-sink-volume 0 27500
pacmd set-sink-volume 1 27500
sudo alsactl store

to reset the speaker volume. (I had to try both sink 0 and sink 1 because, for some reason, the speakers would sometimes be assigned to 0 and sometimes to 1 during boot.) With PipeWire I use

wpctl set-volume @DEFAULT_AUDIO_SINK@ 0.42

to do the same thing.

There are a few improvements with the new approach. I apparently do not have to play "guess the sink" anymore. I do not need to escalate privileges (sudo) just to change the volume. Also, the volume setting is easier to interpret (0.42 is 42% of maximum -- I'm not sure how I settled on 27500 in the old approach, but it is not obvious to me that it equates to 42%).
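For what it is worth, the old number does appear to line up with the new one. The quick Python check below assumes that pacmd takes raw PulseAudio volume values, on whose scale 65536 means 100%; the function name is made up for the example.

```python
PA_VOLUME_NORM = 65536   # PulseAudio's raw value for 100% volume

def raw_to_fraction(raw):
    """Convert a raw PulseAudio volume to a fraction of full volume."""
    return raw / PA_VOLUME_NORM

print(round(raw_to_fraction(27500), 2))   # 0.42
```

So 27500 on the old scale is roughly 42% of full volume, which matches the 0.42 passed to wpctl.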

A Sorting Adventure in R

2024-06-27T12:00:00.001-04:00

As part of an R application on which I have been working, I need to grab some data from an online spreadsheet keyed by a column of labels, add a column with new values, and print out the labels and new values in a two column array so that it matches the spreadsheet (meaning the order is the same as in the spreadsheet). The construction process involves breaking the original data into chunks, filling in the new values in each chunk separately, and gluing the chunks back together. The spreadsheet is sorted by the label column, so to match the original spreadsheet (allowing a user to view them side by side and see the same labels in the same order), I just needed to sort the output data frame by the label column ... or so I thought.

Since I was using the dplyr library already, and dplyr provides an arrange command to sort rows based on one or more columns, I started out with the following simple code (where df is the two column data frame I created):

df |> dplyr::arrange(Label) |> print()

Unfortunately, the result was not sorted in the same order as the spreadsheet. Some of the labels began with numbers (1000) and in one case a number containing a colon (10:20), and arrange listed them in the opposite order from that of the spreadsheet. Some of the names had funky capitalization (say, "BaNaNa"), and arrange treated capital letters as preceding lower case letters. Interestingly, the base R sort function sorted the labels in the same order that the spreadsheet used. More interestingly, the following hack suggested in a response to a question I posted online also matched the spreadsheet:

df |> dplyr::arrange(-dplyr::desc(Label)) |> print()

The desc function tells arrange to use descending order (the default being ascending order). So -desc tells arrange not to use descending order, meaning use ascending order, which is where we started ... and yet it somehow fixes the ordering problem.

The culprit turns out to be the default locale setting for arrange, which is "C". I'm guessing that means the C programming language. I filed a bug report in the dplyr repository on GitHub and was quickly told that the behavior was correct for locale "C" and that the solution to my problem was to feed arrange the optional argument .locale = "en". That did in fact fix things. The code now produces the expected sort order. Meanwhile, my bug report led to a new one about the difference in sort orders between arrange and desc. Depending on how that report is resolved, the -desc trick may stop working in the future.
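The "C" locale is indeed the minimal locale defined by the C language standard, and it compares strings character code by character code. The Python sketch below is a language-neutral illustration of that ordering rather than anything dplyr does internally: Python's plain sorted() also compares code points, so for ASCII labels it reproduces the behavior described above. The labels are invented for the example.

```python
labels = ["apple", "Banana", "cherry", "1000", "10:20"]

# Plain sorted() compares Unicode code points, which for ASCII text matches
# the "C" locale: digits < ":" < capital letters < lower case letters.
c_style = sorted(labels)

# Folding case in the sort key removes the capital/lower-case split, but it
# does nothing about punctuation such as the colon in "10:20".
case_folded = sorted(labels, key=str.casefold)

print(c_style)       # ['1000', '10:20', 'Banana', 'apple', 'cherry']
print(case_folded)   # ['1000', '10:20', 'apple', 'Banana', 'cherry']
```

Case folding alone is therefore not a substitute for a proper collation locale, which is why passing .locale = "en" to arrange is the cleaner fix.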

Locating Date Columns in R

2024-06-26T18:44:00.002-04:00

I've been working on a script that pulls data from an online spreadsheet (made with Google Sheets) shared by a committee. (Yes, I know the definition of "camel": a horse designed by a committee. The situation is what it is.) Once inhaled, the data resides in a data frame (actually, a tibble, but why quibble). At a certain point the script needs to compute for each row the maximum entry from a collection of date columns, skipping missing values.

Assuming I have the names of the date columns in a variable date_fields and the data in a data frame named df, the computation itself is simple. I use the dplyr library, so the following line of code

df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))

produces a copy of the data frame with an extra column "Latest" containing the most recent date from any of the date fields.
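As a quick illustration, here is the same line applied to a toy tibble. The column names and dates are made up, and the trailing ungroup() is an optional addition that removes the rowwise grouping afterward.

```r
library(dplyr)

# Toy data (hypothetical) with two date columns and one missing date.
toy <- tibble(
  Label = c("a", "b"),
  Submitted = as.Date(c("2024-01-15", NA)),
  Reviewed = as.Date(c("2024-02-01", "2024-03-10"))
)
date_fields <- c("Submitted", "Reviewed")

# Row by row, take the latest non-missing date among the date columns.
result <- toy |>
  rowwise() |>
  mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE)) |>
  ungroup()
print(result)
```

One caveat: if every date in a row is missing, max() with na.rm = TRUE has nothing left to compare and issues a warning, so rows like that may need separate handling.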

That, as it turns out, was the easy part. The hard part was populating the variable date_fields. Ordinarily I would just define it at the front of the script, plugging in either the names or the indices of the date columns. The problem is that the spreadsheet "evolves" as committee members other than myself make changes. If they add or delete a column, the indices of the date columns will change, breaking code based on indices. If they rename one of the date columns, that will break code based on a static vector of names. So I decided to scan the spreadsheet after loading it to find the date fields.

It turned out to be harder than it looked. After tripping over various problems, I searched online and found someone who asked a similar question. The best answer pointed to a function in an R library named dataPreparation. I did not want to install another library just to use one function one time, so I futzed around a bit more and came up with the following function, which takes a data frame as input and returns a vector of the names of columns that are dates (meaning that if you run the str() command on the data frame, they will be listed as dates). It requires the lubridate library, which I find myself commonly using, along with dplyr (for the pull function). There may be more elegant ways to get the job done, but it works.

library(dplyr)     # for pull()
library(lubridate) # for is.Date()
# INPUT: a tibble
# OUTPUT: a vector containing the names of the columns that contain dates
dateColumns <- function(x) {
  # Get a vector of logical values (true if column is a date) with names.
  temp <- sapply(names(x), function(y) pull(x, y) |> is.Date())
  # Return the column names for which is.Date is true.
  which(temp) |> names()
}
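For comparison, here is a minimal sketch of the same idea in base R alone. It swaps in inherits() for is.Date(), assumes the date columns have class Date, and uses a made-up toy data frame.

```r
# Toy data (hypothetical column names) with two Date columns.
toy <- data.frame(
  Label = c("a", "b"),
  Submitted = as.Date(c("2024-01-15", NA)),
  Reviewed = as.Date(c("2024-02-01", "2024-03-10")),
  Score = c(1.5, 2.5)
)

# inherits(column, "Date") is TRUE exactly for columns of class Date.
is_date <- vapply(toy, inherits, logical(1), what = "Date")

# Keep the names of the columns flagged as dates.
date_fields <- names(toy)[is_date]
print(date_fields)
```

Columns stored as date-times (class POSIXct) would not be picked up by this check, so adjust the class name if the spreadsheet import produces those instead.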

From IP to CP

2024-04-30T17:04:00.002-04:00

Someone asked on OR Stack Exchange how to convert an integer programming model into a constraint programming model. I think you can reasonably say that it involves a "paradigm shift", for a couple of reasons.

The first paradigm shift has to do with how you frame the problem, mainly in terms of the decision variables. Math programmers are trained to turn discrete decisions with a logical flavor into binary variables. Discrete quantities, such as how many bins of a certain type to use or how many workers to assign to a task, are expressed as general integer variables, but most other things end up turning into a slew of binary variables. The problem being solved in the ORSE question illustrates this nicely.

The problem is as follows. You have $N$ participants in a tournament involving some kind of racing. Importantly, $N$ is guaranteed to be an even number. There is one track with two lanes, and races are spread over $N-1$ days. Every participant races head to head with every other participant exactly once, and nobody races twice in the same day. For whatever reason, the left lane is preferable to the right lane, and so there is a "fairness" constraint that nobody is assigned the left lane on more than $M$ consecutive days. For some reason, the author also imposed a second fairness constraint that nobody be assigned to the right lane on more than $M$ consecutive days. Dimensions for the author's problem were $N=20$ and $M=2.$

The model has to assign participant pairs (races) to days and also make lane assignments. To decide against whom I must race on a given day, someone building an IP model will use binary variables to select my opponent. Similarly, they will use binary variables to select my lane assignment each day. So the author of the question had in his IP model a variable array opp[Competitors][Competitors][Tracks][Days] taking value 1 "if competitor 'c1' races with 'c2' on track 't' on day 'd'".

CP models are more flexible in their use of variables, and in particular general integer variables. So to decide my opponent on a given day, I can just use an integer variable array indexed by day where the value is the index number of my opponent on the given day. Similarly, I could (and would) use an integer variable indexed by day to indicate my lane assignment that day, although in this case that variable does turn out to be binary, since there are only two lanes.

The second paradigm shift has to do with constraints, and it ties to what solver you are using. IP solvers have a very limited constraint "vocabulary". They all understand linear equalities and inequalities, and some understand some combination of SOS1, SOS2, second order cone and implication constraints. That's pretty much it. CP solvers have a richer "vocabulary" of constraints, but with the caveat that not many of those constraints are universal. I would wager that every CP solver has the "all different" constraint, and they must have the usual arithmetic comparisons ($=,\neq,\lt,\le,\gt,\ge$). Beyond that, it pays to check in advance.

I wrote a CP model (in Java) using IBM's CP Optimizer (CPO) to solve the scheduling problem. Details of the model can be sussed out from the Java code, but I will mention a few pertinent details here.

I did use an integer variable array to determine, for each combination of participant and day, the participant's opponent that day, as well as an integer array giving the lane assignment (0 or 1) for each combination of participant and day.
To make sure that, on any day, the opponent of X's opponent is X, I used CPO's inverse constraint. The constraint inverse(f, g) says that f[g[x]] = x and g[f[x]] = x for any x in the domain of the inner function.
To ensure that nobody raced the same opponent twice, I used allDiff, which is CPO's version of the all different constraint.
We have to do something to force opponents in a race to be in different lanes. Let $x_{i,d}$ and $y_{i,d}$ denote respectively the opponent and lane assignment for participant $i$ on day $d.$ In mathematical terms, the constraint we want is $y_{x_{i,d},d} \neq y_{i,d}.$ Indexing a variable with another variable is impossible in an IP model. In CPO, I used the element constraint to do just that.

I added an objective function, namely to minimize the difference between the most and fewest times any participant gets assigned the preferred left lane. I also added one constraint to mitigate symmetry. Since any solution remains a solution (with the same objective value) under any permutation of the participant indices, I froze the first day's schedule as $1\ v.\ N$, $2\ v.\ N-1$, $3\ v.\ N-2$ etc.

On my decent but not screamingly fast PC, CPO found a feasible solution almost instantly and a solution with objective value 1 in under a second. In that solution, every participant gets the left lane either nine or ten times out of the 19 racing days. It's not hard to prove that 1 is the optimal value (you cannot have everybody get exactly the same number of left lane assignments), but don't tell CPO that -- it was still chugging along trying to prove optimality when it hit my five minute time limit.
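The counting argument behind that bound can be spelled out as follows, where $L_i$ is notation introduced here (not in the model) for the number of days on which participant $i$ gets the left lane. Each of the $N-1$ days has $N/2$ races, and each race puts exactly one participant in the left lane, so

$$ \sum_{i=1}^{N} L_i = \frac{N}{2}\left(N-1\right) \quad\Longrightarrow\quad \frac{1}{N}\sum_{i=1}^{N} L_i = \frac{N-1}{2}. $$

Since $N$ is even, $(N-1)/2$ is not an integer, so the $L_i$ cannot all be equal and $\max_i L_i - \min_i L_i \ge 1.$ With $N=20$ there are $10 \times 19 = 190$ left lane assignments, an average of 9.5 per participant, which is consistent with everyone getting either nine or ten.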

My Java code is available from my repository under a Creative Commons 4.0 open source license.

Where Quadratic, Positive Definite and Binary Meet

2024-04-21T15:02:00.000-04:00

A comment by Rob Pratt (of SAS) on OR Stack Exchange pointed out two things that are glaringly obvious in hindsight but that somehow I keep forgetting. Both pertain to an expression of the form $x'Qx + c'x,$ either in an objective function or in a second order cone constraint, where $x$ is a vector of variables and $Q$ and $c$ are parameters.

The first observation does not depend on the nature of the $x$ variables. We can without loss of generality assume that $Q$ is symmetric. If it is not, replace $Q$ with the symmetric matrix $\hat{Q} = \frac{1}{2}\left(Q + Q'\right).$ A wee bit of algebra should convince you that $x'\hat{Q}x = x'Qx.$

The second observation is specific to the case where the $x$ variables are binary (which was the case in the ORSE question which drew the comment from Rob). When minimizing an objective function of the form $x'Qx + c'x$ or when using it in a second order cone constraint of the form $x'Qx + c'x \le 0,$ you want the $Q$ matrix to be positive definite. When $x$ is binary, this can be imposed easily.

Suppose that $x$ is binary and $Q$ is symmetric but not positive definite. The following argument uses the Euclidean 2-norm. Let $$\Lambda = \max_{\parallel y \parallel = 1} -y'Qy,$$ so that $y'Qy \ge -\Lambda$ for any unit vector $y.$ Under the assumption that $Q$ is not positive definite, $\Lambda \ge 0.$ Choose some $\lambda > \Lambda$ and set $\hat{Q} = Q + \lambda I,$ where $I$ is the identity matrix of appropriate dimension. For any nonzero vector $y,$

$$ \begin{align*} y'\hat{Q}y & =y'Qy+\lambda y'Iy\\ & =\parallel y\parallel^{2}\left(\frac{y'}{\parallel y\parallel}Q\frac{y}{\parallel y\parallel}+\lambda\right)\\ & \ge\parallel y\parallel^{2}\left(-\Lambda+\lambda\right)\\ & >0. \end{align*} $$

So $\hat{Q}$ is positive definite. Of course, $x'\hat{Q}x \neq x'Qx,$ but this is where the assumption that $x$ is binary sneaks in. For $x_i$ binary we have $x_i^2 = x_i.$ So

$$ \begin{align*} x'\hat{Q}x & =x'Qx+\lambda x'Ix\\ & =x'Qx+\lambda\sum_{i}x_{i}^{2}\\ & =x'Qx+\lambda e'x \end{align*} $$

where $e=(1,\dots,1).$ That means the original expression $x'Qx + c'x$ is equal to $x'\hat{Q}x+(c-\lambda e)'x,$ giving us an equivalent expression with a positive definite quadratic term.
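A small numerical check of the argument, as a sketch in base R. The matrix Q, the vector cvec and the margin of 1 added to $\Lambda$ are arbitrary choices for illustration. It relies on the fact that, for symmetric $Q$, $\Lambda$ equals the negative of the smallest eigenvalue.

```r
# A symmetric but indefinite Q (eigenvalues 4 and -2) and an arbitrary c.
Q <- matrix(c(1, 3, 3, 1), nrow = 2)
cvec <- c(2, -1)

# Lambda = max over unit vectors of -y'Qy = -(smallest eigenvalue of Q).
Lambda <- -min(eigen(Q, symmetric = TRUE)$values)
lambda <- Lambda + 1 # any value strictly greater than Lambda works

# Shifted quadratic term and adjusted linear term.
Qhat <- Q + lambda * diag(nrow(Q))
chat <- cvec - lambda # c - lambda * e

# Evaluate x'Qx + c'x at a point x.
quad <- function(x, Q, lin) drop(t(x) %*% Q %*% x + sum(lin * x))

# Enumerate all binary vectors of length 2 and compare the two expressions.
X <- as.matrix(expand.grid(0:1, 0:1))
orig <- apply(X, 1, quad, Q = Q, lin = cvec)
shifted <- apply(X, 1, quad, Q = Qhat, lin = chat)
print(cbind(X, orig, shifted))
print(eigen(Qhat, symmetric = TRUE)$values) # all positive
```

The two expressions agree at every binary point, while Qhat has only positive eigenvalues. At a non-binary point they would generally differ, since the step $x_i^2 = x_i$ no longer holds.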

Finding Duplicated Records in R

2024-04-11T16:38:00.003-04:00

Someone asked a question about finding which records (rows) in their data frame are duplicated by other records. If you just want to know which records are duplicates, base R has a duplicated() function that will do just that. It occurred to me, though, that the questioner might have wanted to know not just which records were duplicates but also which records were the corresponding "originals". Here's a bit of R code that creates a small data frame with duplicated rows and then identifies original/duplicate pairs by row number.

library(dplyr)

# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))

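# Flag duplicated rows (TRUE marks a row that repeats an earlier row).
dup <- df |> duplicated()

# Split the source data into two data frames. Logical indexing is safe even when
# there are no duplicates (df[-which(dup), ] would return zero rows in that case).
df1 <- df[!dup, ] # originals (rows 1, 2 and 4)
df2 <- df[dup, ]  # duplicates (rows 3, 5, 6 and 7)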

# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)

# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)

# Result:
#   Original Duplicate
# 1        1         5
# 2        1         7
# 3        2         3
# 4        2         6

The key here is that the inner_join function pairs rows from each data frame (originals and duplicates) based on matching values in the "by" columns. The default value of "by" (NULL) tells it to match by all the columns the two data frames have in common -- which in this case is all the columns in the source data frame. The resulting data frame will have the columns from the source data frame (here "a" and "b") plus the columns unique to each data frame ("Original" and "Duplicate"). We use the select() command to drop the source columns and just keep the indices of the original and duplicate rows.

File Access in RStudio

2024-04-08T14:53:00.000-04:00

I've been spending a fair bit of time in RStudio Desktop recently, much of it related to my work with INFORMS Pro Bono Analytics. I really like RStudio as a development environment for R code, including Shiny apps. It does, however, come with the occasional quirk. One of those has to do with how RStudio accesses the file system.

I tripped over this a couple of times recently when I wanted to open an R file that I had dropped in the /tmp directory on my Linux Mint system. The Files tab in RStudio appeared to be limited to the directory tree under my home directory. There was no way to browse to system directories like /tmp. The same restriction applies to the setting for the default working directory (Tools > Global Options... > General > Basic > R Sessions): RStudio does not let you type in a directory name (perhaps a defense against typos?), and the Browse... button will not leave your home tree.

Initially I decided this was not important enough to worry about, but then I saw a post on the Posit Community forum by someone who was stuck trying to work from home due to a related issue. So I did a little experimentation and found a workaround, at least for the first problem (accessing files in places like /tmp). If I run setwd("/tmp") in the Console tab (which sets the working directory for the current R session), then click the More menu in the Files tab and select Go To Working Directory, the Files tab now browses /tmp, and I can navigate up to the system root directory and then down to anywhere within reason.

Changing the default starting directory is not something I actually care to do, but I'll document it here in case a reader might wish to do so. You can go to the IDE configuration directory (~/.config/rstudio on Linux and OS X, %appdata%\RStudio on Windows), open the rstudio-prefs.json file in a text editor, and change the value of the "initial_working_directory" entry to whatever starting directory you want. Save it, (re)start RStudio Desktop, and hopefully you begin in the right place.

Another R Quirk

2024-02-09T14:12:00.000-05:00

For the most part I like programming in R, but it is considerably quirkier than any other language I have used. I'm pretty sure that is what led to the development of what is known now as the "Tidyverse". The Tidyverse in turn introduces other quirks, as I've pointed out in a previous post.

One of the quirks in base R caused me a fair amount of grief recently. The context was an interactive program (written in Shiny, although that is beside the point here). At one point in the program the user would be staring at a table (the display of a data frame) and would select rows and columns for further analysis. The program would reduce the data frame to those rows and columns, and pass the reduced data frame to functions that would do things to it.

The program worked well until I innocently selected a bunch of rows and one column for analysis. That crashed the program with a rather cryptic (to me) error message saying that some function I was unaware of was not designed to work with a vector.

I eventually tracked down the line where the code died. The function I was unaware of apparently was baked into a library function I was using. As for the vector part, that was the result of what I would characterize as a "quirk" (though perhaps "booby trap" might be more accurate). I'll demonstrate using the mtcars data frame that automatically loads with R.

Consider the following code chunk.

rows <- 1:3
cols <- c("mpg", "cyl")
temp <- mtcars[rows, cols]
str(temp)

This extracts a subset of three rows and two columns from mtcars and presents it as a data frame.

'data.frame': 3 obs. of 2 variables:
$ mpg: num 21 21 22.8
$ cyl: num 6 6 4

So far, so good. Now suppose we choose only one column and rerun the code.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols]
str(temp)

Here is the result.

num [1:3] 21 21 22.8

Our data frame just became a vector. That was what caused the crash in my program.

Since I was using the dplyr library elsewhere, there was an easy fix once I knew what the culprit was.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, ] |> select(all_of(cols))
str(temp)

The result, as expected, is a data frame.

'data.frame': 3 obs. of 1 variable:
$ mpg: num 21 21 22.8

There will be situations where you grab one column of a data frame and want it to be a vector, and situations (such as mine) where you want it to be a data frame, so the designers of the language have to choose which route to go. I just wish they had opted to retain structure (in this case data frame) until explicitly dropped, rather than drop it without warning.
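For what it's worth, base R does provide an explicit opt-out through the drop argument of the data frame subsetting operator. A minimal sketch using the same mtcars selection:

```r
rows <- 1:3
cols <- c("mpg")
# drop = FALSE keeps the data frame structure even when only one column is selected.
temp <- mtcars[rows, cols, drop = FALSE]
str(temp)
```

The default for this form of subsetting is to drop to a vector when a single column is selected, so the argument has to be supplied wherever a one-column selection is possible.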