Sunday, November 10, 2024

Solver Parameters Matter

Modern integer programming solvers come with a gaggle of parameters the user can adjust. There are so many possible parameter combinations that vendors are taking a variety of approaches to taming the beast. The first, of course, is for vendors to set default values that work pretty well most of the time. This is particularly important since many users probably stick to default settings unless they absolutely have to start messing with the parameters. (By "many" I mean "myself and likely others".) The second is to provide a "tuner" that the user can invoke. The tuner experiments with a subset of possible parameter settings to try to find a good combination. Third, I've seen some discussion and I think some papers or conference presentations on using machine learning to predict useful settings based on characteristics of the problem. I am not sure how far along that research is and whether vendors are yet implementing any of it.

In the past few days I got a rather vivid reminder of how much a single parameter tweak can affect things. Erwin Kalvelagen did a blog post on sorting a numeric vector using a MIP model (with an up front disclaimer that it "is not, per se, very useful"). He tested a couple of variants of a model on vectors of dimension 50, 100 and 200. I implemented the version of his model with the redundant constraint (which he found to speed things up) in Java using the current versions of both CPLEX and Xpress as solvers. The vector to sort was generated randomly with components distributed uniformly over the interval (0, 1). I tried a few random number seeds, and while times varied a bit, the results were quite consistent. Not wanting to devote too much time to this, I set a time limit of two minutes per solver run.

Using default parameters, both solvers handled the dimension 50 case, but CPLEX was about eight times faster. For dimension 100, CPLEX pulled off the sort in two seconds but Xpress still did not have a solution at the two minute mark. For dimension 200, CPLEX needed around 80 seconds and Xpress, unsurprisingly, struck out.

So CPLEX is faster than Xpress, right? Well, hang on a bit. On the advice of FICO's Daniel Junglas, I adjusted one of their parameters ("PreProbing") to a non-default value. This is one of a number of parameters that will cause the solver to spend more time heuristically hunting for a feasible or improved solution. Using my grandmother's adage "what's sauce for the goose is sauce for the gander," I tweaked an analogous parameter in CPLEX ("MIP.Strategy.Probe"). Sure enough, both solvers got faster on the problem (and Xpress was able to solve all three sizes), but the changes were more profound than that. On dimension 50, Xpress was between three and four times faster than CPLEX. On dimension 100, Xpress was again around four times faster. On dimension 200, Xpress won by a factor of slightly less than three.

So is Xpress actually faster than CPLEX on this problem? Maybe, maybe not. I only tweaked one parameter among several that could be pertinent. To me, at least, there is nothing about the problem that screams "you need more of this" or "you need less of that", other than the fact that the objective function is a constant (we are just looking for a feasible solution), which suggests that any changes designed to tighten the bound faster are likely to be unhelpful. I confess that I also lack a deep understanding of what most parameters do internally, although I have a pretty good grasp on the role of the time limit parameter.

So the reminder for me is that before concluding that one solver is better than another on a problem, or that a problem is too difficult for a particular solver, I need to put a bit of effort into investigating whether any parameter tweaks have substantial impact on performance.

Update: A post on the Solver Max blog answers a question I had (but did not investigate). With feasibility problems, any feasible solution is "optimal", so it is common to leave the objective as optimizing a constant (usually zero). Erwin and I both did that. The question that occurred to me (fleetingly) was whether an objective function could be crafted that would help the solver find a feasible solution faster. In this case, per the Solver Max post, the answer appears to be "yes".

Sunday, November 3, 2024

Xpress and RStudio

The following is probably specific to Linux systems. I recently installed the FICO Xpress optimizer, which comes with an R library to provide an API for R code. FICO requires a license file (or a license server -- I went with a static file since I'm a single user) and adds an assortment of environment variables to the bash shell, including one pointing to the license file. So far, so good.

Xpress comes with example files, including example R scripts. So I cranked up RStudio, opened the simplest example ("first_lp_problem.R", which is just what it sounds like) and executed it line by line. The problem setup lines worked fine, but the first Xpress API call died with an error message saying it couldn't find the license file in directory "." (i.e., the current working directory). The same thing happened when I tried to source the file in the RStudio console.

To make a long story somewhat shorter, after assorted failed attempts to sort things out it occurred to me to run R in a terminal and source the example file there. That ran smoothly. So the problem was with RStudio, not with R. Specifically, it turns out that RStudio runs without loading any bash environment variables.

After assorted failed attempts at a fix (and pretty much wearing out Google), I found the following solution. In my home directory ("/home/paul", a.k.a. "~") I created a text file named ".Renviron". In it, I put the line "XPAUTH_PATH=/home/paul/.../xpauth.xpr", where "..." is a bunch of path info you don't need to know and "xpauth.xpr" is the name of the license file. If you already have a ".Renviron" file, you can just add this line to it. The example script now runs fine in RStudio. Note that there are a gaggle of other bash environment variables created by Xpress, none of which presumably are known to RStudio, but apparently the license file path is the only one needed by the API (at least so far). If I trip over any other omissions later on, presumably I can add them to ".Renviron".
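
If you want to confirm whether a given environment variable is visible to RStudio, a quick check in the Console tab works. This is just a sketch: Sys.getenv() and Sys.setenv() are base R functions, and the path shown is a placeholder rather than my actual license location.

# Returns "" if the variable is not set in the current R session.
Sys.getenv("XPAUTH_PATH")

# Set it for the current session only (e.g., to test before editing ~/.Renviron).
Sys.setenv(XPAUTH_PATH = "/path/to/xpauth.xpr")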


Monday, September 16, 2024

Bounds and Reduced Costs

I recently read a question from someone who had solved a linear program (minimization) using a reliable solver, extracted the primal and dual solutions, recomputed reduced costs manually and encountered a negative reduced cost in a supposedly optimal solution. That appeared to be a bit paradoxical. I don't have enough information to be sure, but I suspect variable bounds are involved.

Fourteen years ago (!) I wrote a post about how to find the dual value of a bound on a variable when the bound is entered as a bound rather than a functional constraint. It belatedly occurred to me that an example might be helpful (and might shed some light on the paradoxical reduced cost), so here goes. Let's start with a simple LP.\begin{alignat*}{1} \max\ 5x+y\\ \textrm{s.t. }x+2y & \le 8\\ x & \le 2\\ y & \le 10 \end{alignat*} where $x\ge 0$ and $y\ge 0.$ The problem is easy to solve graphically, as shown below.

[Figure omitted: graphical solution of the LP, showing the feasible region and the optimal corner at (2, 3).]

The optimal solution is $(x,y) = (2,3)$ with objective value 13. Let's assume we feed it to an LP solver as written (three constraints). The dual values of the constraints are $(0.5, 4.5, 0).$ In particular, note the dual value of 4.5 for the upper bound on $x$ (which is binding). If we increase the bound from $2$ to $2 + \epsilon,$ the optimal corner shifts from $(2,3)$ to $(2 + \epsilon, 3- \frac{1}{2}\epsilon),$ with a resulting change in objective value from 13 to $13+4.5\epsilon.$ Also note that the reduced cost of $x$ is $5 - 1 \times 0.5 - 1 \times 4.5 - 0 \times 0 = 0$ and the reduced cost of $y$ is $1 - 2\times 0.5 - 0 \times 4.5 -1 \times 0 = 0,$ which conforms to our expectation that basic variables have zero reduced costs.
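
As a quick sanity check, the dual objective value $$8 \times 0.5 + 2 \times 4.5 + 10 \times 0 = 13$$ matches the primal optimum, which (by duality) confirms that both solutions are optimal.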

Now suppose that I enter the variable bounds directly as bounds, rather than as constraints, so that the LP has a single inequality constraint. Most (all?) contemporary solvers allow this. The optimal solution is unchanged, as is the dual value (0.5) for the first (and now sole) constraint. What follows was confirmed using CPLEX, but I suspect the experience with other solvers will be similar. The only dual value available is the dual for the first constraint. CPLEX, as best I can tell, does not allow you to ask for a dual value associated with a variable bound. So does that mean that the reduced cost of $x$ is just $5 - 1 \times 0.5 = 4.5?$ Yes, that is what CPLEX reports as the reduced cost of $x,$ and correctly so. 

The fact that at optimum a basic variable has zero reduced cost applies to the original simplex method. There is a variant of the simplex method for bounded variables, and in that version a variable can be nonbasic at its upper bound, with a nonzero (possibly favorable) reduced cost. Moreover, as I mentioned in that earlier post, in this approach the reduced cost of a variable at its bound is the dual value of the bound constraint. So it is no coincidence that the reduced cost of $x$ when entering its bound as a bound (4.5) matches the dual value of the bound when entering it as a constraint.
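
For reference, the optimality conditions for the bounded-variable simplex method in a maximization problem can be summarized as follows, writing $d_j$ for the reduced cost of $x_j$: $$d_j \begin{cases} = 0 & \textrm{if } x_j \textrm{ is basic,}\\ \le 0 & \textrm{if } x_j \textrm{ is nonbasic at its lower bound,}\\ \ge 0 & \textrm{if } x_j \textrm{ is nonbasic at its upper bound.} \end{cases}$$ In the example, $x$ is nonbasic at its upper bound with reduced cost $4.5 \ge 0,$ which is consistent. In a minimization problem the inequalities flip, so a variable sitting at its upper bound can legitimately show a negative reduced cost in an optimal solution -- which I suspect is the resolution of the "paradox" mentioned at the top of this post.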


Sunday, September 8, 2024

A Bicriterion Movement Model

A question on Operations Research Stack Exchange asks about a bicriterion routing problem. The underlying scenario is as follows. A service area is partitioned into a rectangular grid, with a robot beginning (and eventually ending) in a specified cell. Each move of the robot is to an adjacent cell (up, down, left or right but not diagonal). The robot must eventually visit each cell at least once and return whence it came. One criterion for the solution is the number of movements (equivalently, the amount of time, since we assume constant speed) required for the robot to make its rounds. In addition, each cell is assigned a nonnegative priority (weight), and the second criterion is the sum over all cells of the time of the first visit weighted by the priority of the cell. In other words, higher priority cells should be visited earlier. Both criteria are to be minimized.

The problem can be modeled as either an integer program or a constraint program. The movement portion is quite straightforward to model. Balancing the two objectives is where things get interesting. One can optimize either criterion after setting a bound on how bad the other can be, or one can use lexicographic ordering of the criteria (optimize the primary objective while finding the best possible value of the secondary objective given that the primary must remain optimal), or one can optimize a weighted combination of the two objectives (and then play with the weights to explore the Pareto frontier). Weighted combinations are a somewhat tricky business when the objective functions being merged are not directly comparable. For instance, merging two cost functions is pretty straightforward (a dollar of cost is a dollar of cost, at least until the accountants get involved). Merging distance traveled and "priority" (or "urgency", or "weighted delay") is much less straightforward. In real life (as opposed to answering questions on ORSE), I would want to sit with the problem owner and explore acceptable tradeoffs. How much longer could a "good" route be if it reduced weighted delays by 1 unit?

I chose to use an integer program (in Java, with CPLEX as the optimization engine), since CPLEX directly supports lexicographic combinations of objectives. You can find the source code in my GitLab repository. A write-up of the model is in a PDF file here, and output for the test case in the ORSE question is in a text file here. The output includes one run where I minimized delay while limiting the distance to a middle value between the minimum possible distance and the distance obtained from lexicographic ordering with delay having the higher priority. It turned out that compromising a little on distance did not help the delay value.

Tuesday, August 20, 2024

LyX Upgrade Hiccup

In recent months, I've upgraded LyX from version 2.3.6 to 2.4.0 RC3 (from the Canonical repos) to 2.4.1 (by compiling from source, as documented here). Along the way I discovered a glitch that apparently occurred during the upgrade from 2.3.6 to 2.4.0.

When you create a new document in LyX, it is initially created using a default template that you can customize. My default template, which I created in a previous millennium, starts all paragraphs flush left and uses what LaTeX calls a "small skip" to provide vertical separation between paragraphs. Yesterday, when I created a new document, I noticed that the first paragraph after a section heading would begin flush left but subsequent paragraphs were indented ... in the LyX GUI. When I compiled to a PDF document, however, every paragraph was flush left. The disconnect between what the GUI showed and what the compiled document contained was, shall we say, a trifle confusing.

Looking at the LaTeX output code (which you can do by selecting View > Code Preview Pane to open a preview of the code to be compiled), I noticed the following LaTeX commands being added to the document.

\setlength{\parskip}{\smallskipamount}

\setlength{\parindent}{0pt}

 

That reminded me to look at the template's document preamble (Document > Settings... > LaTeX Preamble), and sure enough those commands appeared there. Apparently I added them to the preamble of the default template back when dinosaurs still roamed the earth, and they've been there ever since. In resolving bug #4796, the developers changed how vertical spacing between paragraphs was handled. They now use the parskip LaTeX package. Since, for whatever reason, the template had Document > Settings... > Text Layout > Paragraph Separation set to Indentation: Default, the GUI in the current version indented paragraphs other than those immediately after section breaks. Meanwhile, those two lines in the preamble overrode that when creating the PDF output.

 

The fix was to create a new default template, which is extremely easy in LyX. You create a new document, customize document settings to your liking, go to Document > Settings... and click the Save as Document Defaults button. In my case, this meant changing Document > Settings... > Text Layout > Paragraph Separation to Vertical Space: Small Skip and deleting the aforementioned lines from the preamble. Now I just need to remember to check and, if necessary, replace the default template after future upgrades.

Installing LyX from Source

I've been using LyX as my go-to document editor since its early days, and I really love it. Until recently, installation and upgrades (on Linux Mint) were a no-brainer. I could either install/update the package from the Canonical repositories (which tend to lag behind the current release) or get the latest version from a PPA (which, sadly, is no longer maintained). Yesterday I was running LyX 2.4.0 RC3, the version in the Canonical repo for Ubuntu 22.04 (the basis for the current version, 22, of Linux Mint), and ran into an issue. To see if it had already been resolved, I decided to upgrade to the current LyX version, 2.4.1. That entailed compiling it from source for the first time ever.

The process, while slow, was almost smooth. I did run into one speed bump, an error message (from the APT package manager, not from LyX) that was a trifle cryptic. Some serious googling turned up the solution. I thought I'd document the compilation process below in case other Mint (or Ubuntu) users want to try installing from source for the first time.

  1. Open the Software Sources system app and make sure that Official Repositories > Optional Sources > Source code repositories is turned on. (On my first go-around it was turned off, which produced the cryptic error message I mentioned.)
  2. Download the source tarball and signature file from the LyX download site. Verify the tarball as described here.
  3. Unzip the tarball into a directory located wherever you want to keep the source code (if in fact you choose to keep it). I'm pretty sure you can delete the source folder after compiling and installing, but I'm hanging onto it for now.
  4. In a terminal in that directory, run sudo apt-get build-dep lyx to install all necessary prerequisites. I found this useful tip in the LyX wiki. This was the step that generated the error message on the first attempt, due to my having skipped step 1 above.
  5. Open the INSTALL.autoconf text file in the source folder and follow the directions (summarized here).
    1. Run configure (which takes a while).
    2. Run make (which takes an eternity).
    3. Optional: run make check and verify all tests were passed.
    4. Run make install to install the program.

Regarding the last step, I normally install LyX for all users on the system (which is just me, since this is my home PC). So I ran make install as root. If you just want it for the account under which you are logged in, you can run it without escalating privileges.

That's all there is to it.

Sunday, August 4, 2024

Pipewire Sound Server

I recently upgraded my PC and laptop to Linux Mint 22 (Wilma). It was a somewhat harrowing experience -- the laptop (which is my canary in the coal mine) updated fine, but the PC was bricked due to an issue with BIOS secure boot stuff. At power up there was a message involving some cryptic acronym I no longer remember. It had something to do with secure boot keys, and it prevented the PC from booting (even from a bootable USB drive). I spent an hour or two futzing with BIOS settings but eventually got secure boot and TPM turned off, which let the PC boot. I haven't had any problems since then.

One of the changes in Mint 22 is that the developers moved from PulseAudio to a newer sound (and video) server called PipeWire. The changeover was initially invisible to me -- sound and video so far have worked flawlessly -- but necessitated changes to a couple of shell scripts I use. One of them is a convenience script that resets the speaker volume to my preferred level. It's handy when I have to crank the volume up for some reason. The other is a script that launches Zoom. It turns up the volume before Zoom starts and then resets the volume when I exit from Zoom.

Fortunately, it wasn't hard to find the new commands (thank you Uncle Google). There's probably more than one way to control volume from the command line or a script, but I ended up using the WirePlumber library. I can't recall if it was installed automatically during the upgrade or if I had to add it. The key command is wpctl, which somewhat curiously does not seem to have a man page. Fortunately, wpctl --help will get you the information you need. My old scripts used 

pacmd set-sink-volume 0 27500
pacmd set-sink-volume 1 27500
sudo alsactl store

to reset the speaker volume. (I had to try both sink 0 and sink 1 because, for some reason, the speakers would sometimes be assigned to 0 and sometimes to 1 during boot.) With PipeWire I use

wpctl set-volume @DEFAULT_AUDIO_SINK@ 0.42

to do the same thing. 

There are a few improvements with the new approach. I apparently do not have to play "guess the sink" anymore. I do not need to escalate privileges (sudo) just to change the volume. Also, the volume setting is easier to interpret (0.42 is 42% of maximum -- I'm not sure how I settled on 27500 in the old approach, but it is not obvious to me that it equates to 42%).


Thursday, June 27, 2024

A Sorting Adventure in R

As part of an R application on which I have been working, I need to grab some data from an online spreadsheet keyed by a column of labels, add a column with new values, and print out the labels and new values in a two column array so that it matches the spreadsheet (meaning the order is the same as in the spreadsheet). The construction process involves breaking the original data into chunks, filling in the new values in each chunk separately, and gluing the chunks back together. The spreadsheet is sorted by the label column, so to match the original spreadsheet (allowing a user to view them side by side and see the same labels in the same order), I just needed to sort the output data frame by the label column ... or so I thought.

Since I was using the dplyr library already, and dplyr provides an arrange command to sort rows based on one or more columns, I started out with the following simple code (where df is the two column data frame I created):

df |> dplyr::arrange(Label) |> print()

Unfortunately, the result was not sorted in the same order as the spreadsheet. Some of the labels began with numbers (1000) and in one case a number containing a colon (10:20), and arrange listed them in the opposite order from that of the spreadsheet. Some of the names had funky capitalization (say, "BaNaNa"), and arrange treated capital letters as preceding lower case letters. Interestingly, the base R sort function sorted the labels in the same order that the spreadsheet used. More interestingly, the following hack suggested in a response to a question I posted online also matched the spreadsheet:

df |> dplyr::arrange(-dplyr::desc(Label)) |> print()

The desc function tells arrange to use descending order (the default being ascending order). So -desc tells arrange not to use descending order, meaning use ascending order, which is where we started ... and yet it somehow fixes the ordering problem.

The culprit turns out to be the default locale setting for arrange, which is "C". I'm guessing that means the C programming language. I filed a bug report in the dplyr repository on GitHub and was quickly told that the behavior was correct for locale "C" and that the solution to my problem was to feed arrange the optional argument .locale = "en". That did in fact fix things. The code now produces the expected sort order. Meanwhile, my bug report led to a new one about the difference in sort orders between arrange and desc. Depending on how that report is resolved, the -desc trick may stop working in the future.
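
For the record, the corrected version of the original one-liner is simply the following (same data frame df as before):

df |> dplyr::arrange(Label, .locale = "en") |> print()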

Wednesday, June 26, 2024

Locating Date Columns in R

I've been working on a script that pulls data from an online spreadsheet (made with Google Sheets) shared by a committee. (Yes, I know the definition of "camel": a horse designed by a committee. The situation is what it is.) Once inhaled, the data resides in a data frame (actually, a tibble, but why quibble). At a certain point the script needs to compute for each row the maximum entry from a collection of date columns, skipping missing values.

Assuming I have the names of the date columns in a variable date_fields and the data in a data frame named df, the computation itself is simple. I use the dplyr library, so the following line of code

df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))

produces a copy of the data frame with an extra column "Latest" containing the most recent date from any of the date fields.

That, as it turns out, was the easy part. The hard part was populating the variable date_fields. Ordinarily I would just define it at the front of the scripts, plugging in either the names or the indices of the date columns. The problem is that the spreadsheet "evolves" as committee members other than myself make changes. If they add or delete a column, the indices of the date columns will change, breaking code based on indices. If they rename one of the date columns, that will break code based on a static vector of names. So I decided to scan the spreadsheet after loading it to find the date fields.

It turned out to be harder than it looked. After tripping over various problems, I searched online and found someone who asked a similar question. The best answer pointed to a function in an R library named dataPreparation. I did not want to install another library just to use one function one time, so I futzed around a bit more and came up with the following function, which takes a data frame as input and returns a list of the names of columns that are dates (meaning that if you run the str() command on the data frame, they will be listed as dates). It requires the lubridate library, which I find myself commonly using. There may be more elegant ways to get the job done, but it works.

library(dplyr)      # for pull()
library(lubridate)  # for is.Date()
# INPUT: a tibble
# OUTPUT: a vector containing the names of the columns that contain dates
dateColumns <- function(x) {
  # Get a named vector of logical values (TRUE if the column holds dates).
  temp <- sapply(names(x), function(y) pull(x, y) |> is.Date())
  # Return the column names for which is.Date() is TRUE.
  which(temp) |> names()
}
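
Putting the pieces together, usage looks something like the following sketch (assuming df is the tibble loaded from the spreadsheet, with the dplyr and lubridate libraries loaded):

date_fields <- dateColumns(df)
df |> rowwise() |> mutate(Latest = max(c_across(all_of(date_fields)), na.rm = TRUE))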

Tuesday, April 30, 2024

From IP to CP

Someone asked on OR Stack Exchange how to convert an integer programming model into a constraint programming model. I think you can reasonably say that it involves a "paradigm shift", for a couple of reasons.

The first paradigm shift has to do with how you frame the problem, mainly in terms of the decision variables. Math programmers are trained to turn discrete decisions with a logical flavor into binary variables. Discrete quantities, such as how many bins of a certain type to use or how many workers to assign to a task, are expressed as general integer variables, but most other things end up turning into a slew of binary variables. The problem being solved in the ORSE question illustrates this nicely.

The problem is as follows. You have $N$ participants in a tournament involving some kind of racing. Importantly, $N$ is guaranteed to be an even number. There is one track with two lanes, and races are spread over $N-1$ days. Every participant races head to head with every other participant exactly once, and nobody races twice in the same day. For whatever reason, the left lane is preferable to the right lane, and so there is a "fairness" constraint that nobody is assigned the left lane on more than $M$ consecutive days. For some reason, the author also imposed a second fairness constraint that nobody be assigned to the right lane on more than $M$ consecutive days. Dimensions for the author's problem were $N=20$ and $M=2.$

The model has to assign participant pairs (races) to days and also make lane assignments. To decide against whom I must race on a given day, someone building an IP model will use binary variables to select my opponent. Similarly, they will use binary variables to select my lane assignment each day. So the author of the question had in his IP model a variable array opp[Competitors][Competitors][Tracks][Days] taking value 1 "if competitor 'c1' races with 'c2' on track 't' on day 'd'".

CP models are more flexible in their use of variables, and in particular general integer variables. So to decide my opponent on a given day, I can just use an integer variable array indexed by day, where the value is the index number of my opponent on the given day. Similarly, I could (and would) use an integer variable indexed by day to indicate my lane assignment that day, although in this case that variable does turn out to be binary, since there are only two lanes.

The second paradigm shift has to do with constraints, and it ties to what solver you are using. IP models have a very limited constraint "vocabulary". They all understand linear equalities and inequalities, and some understand some combination of SOS1, SOS2, second order cone and implication constraints. That's pretty much it. CP solvers have a richer "vocabulary" of constraints, but with the caveat that not many of those constraints are universal. I would wager that every CP solver has the "all different" constraint, and they must have the usual arithmetic comparisons ($=,\neq,\lt,\le,\gt,\ge$). Beyond that, it pays to check in advance.

I wrote a CP model (in Java) using IBM's CP Optimizer (CPO) to solve the scheduling problem. Details of the model can be sussed out from the Java code, but I will mention a few pertinent details here.

  • I did use an integer variable array to determine, for each combination of participant and day, the participant's opponent that day, as well as an integer array giving the lane assignment (0 or 1) for each combination of participant and day.
  • To make sure that, on any day, the opponent of X's opponent is X, I used CPO's inverse constraint. The constraint inverse(f, g) says that f[g[x]] = x and g[f[x]] = x for any x in the domain of the inner function.
  • To ensure that nobody raced the same opponent twice, I used allDiff, which is CPO's version of the all different constraint.
  • We have to do something to force opponents in a race to be in different lanes. Let $x_{i,d}$ and $y_{i,d}$ denote respectively the opponent and lane assignment for participant $i$ on day $d.$ In mathematical terms, the constraint we want is $y_{x_{i,d},d} \neq y_{i,d}.$ Indexing a variable with another variable is impossible in an IP model. In CPO, I used the element constraint to do just that.

I added an objective function, namely to minimize the difference between the most and fewest times any participant gets assigned the preferred left lane. I also added one constraint to mitigate symmetry. Since any solution remains a solution (with the same objective value) under any permutation of the participant indices, I froze the first day's schedule as $1\  v.\  N$, $2\  v.\  N-1$, $3\  v.\  N-2$ etc.

On my decent but not screamingly fast PC, CPO found a feasible solution almost instantly and a solution with objective value 1 in under a second. In that solution, every participant gets the left lane either nine or ten times out of the 19 racing days. It's not hard to prove that 1 is the optimal value (you cannot have everybody get exactly the same number of left lane assignments), but don't tell CPO that -- it was still chugging along, trying to prove optimality, when it hit my five minute time limit.

My Java code is available from my repository under a Creative Commons 4.0 open source license.

Sunday, April 21, 2024

Where Quadratic, Positive Definite and Binary Meet

A comment by Rob Pratt (of SAS) on OR Stack Exchange pointed out two things that are glaringly obvious in hindsight but that somehow I keep forgetting. Both pertain to an expression of the form $x'Qx + c'x,$ either in an objective function or in a second order cone constraint, where $x$ is a vector of variables and $Q$ and $c$ are parameters.

The first observation does not depend on the nature of the $x$ variables. We can without loss of generality assume that $Q$ is symmetric. If it is not, replace $Q$ with $\hat{Q} = \frac{1}{2}\left(Q + Q'\right),$ which is symmetric. A wee bit of algebra should convince you that $x'\hat{Q}x = x'Qx.$
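
For completeness, here is the wee bit of algebra. Since $x'Qx$ is a scalar, it equals its own transpose, so $x'Q'x = (x'Qx)' = x'Qx,$ and therefore $$x'\hat{Q}x = \frac{1}{2}\left(x'Qx + x'Q'x\right) = \frac{1}{2}\left(x'Qx + x'Qx\right) = x'Qx.$$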

The second observation is specific to the case where the $x$ variables are binary (which was the case in the ORSE question which drew the comment from Rob). When minimizing an objective function of the form $x'Qx + c'x$ or when using it in a second order cone constraint of the form $x'Qx + c'x \le 0,$ you want the $Q$ matrix to be positive definite. When $x$ is binary, this can be imposed easily.

Suppose that $x$ is binary and $Q$ is symmetric but not positive definite. The following argument uses the euclidean 2-norm. Let $$\Lambda = \max_{\parallel y \parallel = 1} -y'Qy,$$ so that $y'Qy \ge -\Lambda$ for any unit vector $y.$ Under the assumption that $Q$ is not positive definite, $\Lambda \ge 0.$ Choose some $\lambda > \Lambda$ and set $\hat{Q} = Q + \lambda I,$ where $I$ is the identity matrix of appropriate dimension. For any nonzero vector $y,$

$$ \begin{align*} y'\hat{Q}y & =y'Qy+\lambda y'Iy\\ & =\parallel y\parallel^{2}\left(\frac{y'}{\parallel y\parallel}Q\frac{y}{\parallel y\parallel}+\lambda\right)\\ & \ge\parallel y\parallel^{2}\left(-\Lambda+\lambda\right)\\ & >0. \end{align*} $$

So $\hat{Q}$ is positive definite. Of course, $x'\hat{Q}x \neq x'Qx,$ but this is where the assumption that $x$ is binary sneaks in. For $x_i$ binary we have $x_i^2 = x_i.$ So

$$ \begin{align*} x'\hat{Q}x & =x'Qx+\lambda x'Ix\\ & =x'Qx+\lambda\sum_{i}x_{i}^{2}\\ & =x'Qx+\lambda e'x \end{align*} $$

where $e=(1,\dots,1).$ That means the original expression $x'Qx + c'x$ is equal to $x'\hat{Q}x+(c-\lambda e)'x,$ giving us an equivalent expression with a positive definite quadratic term.
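
Here is a small numerical sanity check of the shift argument in R (the matrix and vector below are random and purely illustrative; they are not part of the original derivation):

# Random symmetric Q (very likely indefinite) and cost vector.
set.seed(123)
A <- matrix(rnorm(16), 4, 4)
Q <- (A + t(A)) / 2
cvec <- rnorm(4)
# Lambda = max of -y'Qy over unit vectors y = negative of the smallest eigenvalue of Q.
Lambda <- -min(eigen(Q, symmetric = TRUE)$values)
lambda <- Lambda + 1e-6          # anything strictly greater than Lambda will do
Qhat <- Q + lambda * diag(4)
chat <- cvec - lambda            # c - lambda * e
# Qhat is positive definite ...
min(eigen(Qhat, symmetric = TRUE)$values) > 0        # TRUE
# ... and for a binary x the two expressions agree.
x <- c(1, 0, 1, 1)
drop(t(x) %*% Q %*% x) + sum(cvec * x)
drop(t(x) %*% Qhat %*% x) + sum(chat * x)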

Thursday, April 11, 2024

Finding Duplicated Records in R

Someone asked a question about finding which records (rows) in their data frame are duplicated by other records. If you just want to know which records are duplicates, base R has a duplicated() function that will do just that. It occurred to me, though, that the questioner might have wanted to know not just which records were duplicates but also which records were the corresponding "originals". Here's a bit of R code that creates a small data frame with duplicated rows and then identifies original/duplicate pairs by row number.


library(dplyr)

# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))

# Find the indices of duplicated rows.
dup <- df |> duplicated() |> which()

# Split the source data into two data frames.
df1 <- df[-dup, ]  # originals (rows 1, 2 and 4)
df2 <- df[dup, ]   # duplicates (rows 3, 5, 6 and 7)

# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)

# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)

# Result:
#   Original Duplicate
# 1        1         5
# 2        1         7
# 3        2         3
# 4        2         6

The key here is that the inner_join function pairs rows from each data frame (originals and duplicates) based on matching values in the "by" columns. The default value of "by" (NULL) tells it to match by all the columns the two data frames have in common -- which in this case is all the columns in the source data frame. The resulting data frame will have the columns from the source data frame (here "a" and "b") plus the columns unique to each data frame ("Original" and "Duplicate"). We use the select() command to drop the source columns and just keep the indices of the original and duplicate rows.

Monday, April 8, 2024

File Access in RStudio

I've been spending a fair bit of time in RStudio Desktop recently, much of it related to my work with INFORMS Pro Bono Analytics. I really like RStudio as a development environment for R code, including Shiny apps. It does, however, come with the occasional quirk. One of those has to do with how RStudio accesses the file system.

I tripped over this a couple of times recently when I wanted to open an R file that I had dropped in the /tmp directory on my Linux Mint system. The Files tab in RStudio appeared to be limited to the directory tree under my home directory. There was no way to browse to system directories like /tmp. Similarly, there is a way to set the default working directory (Tools > Global Options... > General > Basic > R Sessions). RStudio does not let you type in a directory name (perhaps a defense against typos?), and the Browse... button will not leave your home tree.

Initially I decided this was not important enough to worry about, but then I saw a post on the Posit Community forum by someone who was stuck trying to work from home due to a related issue. So I did a little experimentation and found a workaround, at least for the first problem (accessing files in places like /tmp). If I run setwd("/tmp") in the Console tab (which sets the working directory for the current R session), then click the More menu in the Files tab and select Go To Working Directory, the Files tab now browses /tmp, and I can navigate up to the system root directory and then down to anywhere within reason.

Changing the default starting directory is not something I actually care to do, but I'll document it here in case a reader might wish to do so. You can go to the IDE configuration directory (~/.config/rstudio on Linux and OS X, %appdata%\RStudio on Windows), open the rstudio-prefs.json file in a text editor, and change the value of the "initial_working_directory" entry to whatever starting directory you want. Save it, (re)start RStudio Desktop, and hopefully you begin in the right place.
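
For example, after editing, the relevant entry in rstudio-prefs.json might look like the following (the file will contain other entries as well, and the directory shown is just an illustration, not a recommendation):

"initial_working_directory": "/home/paul/projects"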


Friday, February 9, 2024

Another R Quirk

For the most part I like programming in R, but it is considerably quirkier than any other language I have used. I'm pretty sure that is what led to the development of what is known now as the "Tidyverse". The Tidyverse in turn introduces other quirks, as I've pointed out in a previous post.

One of the quirks in base R caused me a fair amount of grief recently. The context was an interactive program (written in Shiny, although that is beside the point here). At one point in the program the user would be staring at a table (the display of a data frame) and would select rows and columns for further analysis. The program would reduce the data frame to those rows and columns, and pass the reduced data frame to functions that would do things to it.

The program worked well until I innocently selected a bunch of rows and one column for analysis. That crashed the program with a rather cryptic (to me) error message saying that some function I was unaware of was not designed to work with a vector.

I eventually tracked down the line where the code died. The function I was unaware of apparently was baked into a library function I was using. As for the vector part, that was the result of what I would characterize as a "quirk" (though perhaps "booby trap" might be more accurate). I'll demonstrate using the mtcars data frame that automatically loads with R.

Consider the following code chunk.

rows <- 1:3
cols <- c("mpg", "cyl")
temp <- mtcars[rows, cols]
str(temp)

This extracts a subset of three rows and two columns from mtcars and presents it as a data frame.

'data.frame':    3 obs. of  2 variables:
 $ mpg: num  21 21 22.8
 $ cyl: num  6 6 4

So far, so good. Now suppose we choose only one column and rerun the code.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols]
str(temp)

Here is the result.

num [1:3] 21 21 22.8

Our data frame just became a vector. That was what caused the crash in my program.

Since I was using the dplyr library elsewhere, there was an easy fix once I knew what the culprit was.

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, ] |> select(all_of(cols))
str(temp)

The result, as expected, is a data frame.

 'data.frame':    3 obs. of  1 variable:
 $ mpg: num  21 21 22.8

There will be situations where you grab one column of a data frame and want it to be a vector, and situations (such as mine) where you want it to be a data frame, so the designers of the language have to choose which route to go. I just wish they had opted to retain structure (in this case data frame) until explicitly dropped, rather than drop it without warning.
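
For completeness, base R does provide a way to keep the data frame structure when selecting a single column -- the drop argument to the indexing operator -- if you know to ask for it:

rows <- 1:3
cols <- c("mpg")
temp <- mtcars[rows, cols, drop = FALSE]
str(temp)

which yields

'data.frame':    3 obs. of  1 variable:
 $ mpg: num  21 21 22.8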