## Saturday, December 12, 2020

### Incorrect Program Listings in MythTV

I've been using MythTV as a personal video recorder (PVR) since 2012, and for the most part I have been quite satisfied ... but there have definitely been adventures. You can type Mythbuntu or MythTV in the blog search box to see a litany of posts about things that did not work as expected (or at all) (or threatened to implode the universe) and fixes I found. Today's post is a two part adventure regarding program listings.

I have an account with Schedules Direct, which lets me download XML files containing program listings for my cable provider. I access it from two different machines. On my main PC, I have FreeGuide, an open-source TV guide program, installed. Once a week I download the latest listings and use FreeGuide to decide what I want to record during the coming week. On the PC that acts as my PVR, I have MythTV set up to download the same files from Schedules Direct when I tell it to refill the program database, which I again do once a week when programming that week's recordings.

The first part of today's adventure has been going on for a long time. When programming the week's recording schedule, I would occasionally run into discrepancies between the two machines. That is, in certain time slots FreeGuide would report one program and the listings on the PVR would report a different program, even though both were working from identical downloaded data files. When this happened, the listings in FreeGuide were invariably correct and the listings on the PVR incorrect. The obvious workaround was to use the FreeGuide listings, but that meant that when I wanted to record a program that FreeGuide said was there and the PVR said was not, I had to set up a manual recording rule, which is doable but somewhat inconvenient.

I confirmed this by setting up a two line script on the PVR to let me download schedule data manually and overwrite the old data. The script is:

#!/bin/sh
mythfilldatabase --dd-grab-all

Note the option -dd-grab-all, which signals that all 14 days are to be downloaded and added to the database, updating any existing data. Running this from a terminal eliminated the inconsistencies between machines.

This brings me to the second part of today's adventure. I normally update the listings on the PVR machine by choosing the menu option to grab EPG (electronic program guide) data from the MythWelcome user interface. That was set up, back when I first installed MythTV, to run mythfilldatabase without any optional settings. I wanted to update that setting to add the -dd-grab-all option. The problem was, I could not find where to make the change. I did some googling (of course), and every post I found led to the same solution posted in the MythTV wiki: run mythtv-setup; go to the "General" section, and within that to the "Program Schedule Downloading Options" section; then use the second of the six settings there ("Guide Data Program") to set up the program or script to download the guide data. That sounds simple enough, but when I run mythtv-setup and go to that section only the first entry (the toggle for automatically updating listings) is present. The other five are nowhere to be found. I'm pretty sure they were there when I first installed MythTV, but they do not show up when I run the setup program on a machine that has already been configured. Possibly I need to run setup as a different user (the MythTV account "owner"?).

Anyway, I found a simple solution. The PVR machine runs MythWeb, a web interface to MythTV. I use MythWeb to program recordings from my main PC. It also has the ability to access settings (by clicking the button whose icon shows a key and a wrench). In the settings editor, I picked the button labeled "MythTV" and did some serious scrolling. Fortunately the settings are in alphabetic order. The one labeled "MythFillDatabasePath" has the path to the mythfilldatabase program. I added the --dd-grab-all option there, clicked the "Save" button at the bottom of the page, and that (hopefully) fixes the problem. Time will tell.

## Thursday, December 3, 2020

### New CPLEX MIP Emphasis

Brace yourself (or flee now) -- this is a rather long post.

## Introduction

IBM has announced the next version of CPLEX Studio (20.1), with planned availability around December 11, 2020. As to why the version number is jumping from 12.10 to 20.1, I have no idea ... but this is 2020, and I have no explanation for pretty much everything that has happened this year.

Among the changes in version 20.1, they have added a new value to the MIP emphasis parameter. Prior to 20.1, there were five possible values for the emphasis parameter:

• 0 = Balance optimality and feasibility(default)
• 1 = Emphasize feasibility
• 2 = Emphasize proven optimality
• 3 = Emphasize improving the best bound
• 4 = Emphasize finding "hidden" feasible solutions.

They have added the following new value:

• 5 = Emphasize heuristics (what Xavier Nodet calls "heuristics on steroids").

The motivation for this is fairly clear: commercial (i.e., paying) customers with difficulty MIP models are frequently less concerned about provable optimality than with getting the best solution they can find within some limited amount of run time. Exactly how the new setting differs from setting 4 (and, for that matter, how setting 4 differs from setting 1) is unclear to me, which is OK. (I'm worried that if I really understood all the nuances, my brain would explode.)

I've been part of the beta test program for 20.1, and I've tried the new setting on a few MIP models. Going in, I expected it to slow down throughput (the number of nodes digested per minute), since running lots of heuristics means spending more time at a typical node. The question is whether the extra time per node pays for itself by reducing sufficiently the number of nodes required to find a solution of a specified quality.

My first attempt was on a difficult problem that arose in some recently published research, and on that problem the setting was definitely not helpful. In fairness, though, there may be a good reason for that. The solution approach involves a variant of Benders decomposition, so the extra time spent on heuristics will frequently produce a "good" solution only to see it shot down by the subproblem (producing a feasibility cut violated by the solution). So the remainder of my tests were on MIP models that did not involve decomposition.

## Test case 1: Partition

The first test case is a MIP model that glues together sets to minimize the range of set sizes in a partition of a master set. It was originally posted here in August, 2020. The test problem is actually quite easy to solve, with an optimal value of 1 (meaning the cardinalities of the sets formed differ by at most 1).

I ran the problem with a 90 second time limit (irrelevant in most cases), using each of the emphasis settings 0, 1, 4 and 5. The following plot (a log-log plot to enhance readability) shows the progress under each setting.

MIPEmphasis 1 ("Feasibility") makes the earliest progress but does not reach the optimal value of 1 within 90 seconds. (At that point, the incumbent value is 5.) Although just shy of one second some of the other levels do a little better than default, overall the default setting reaches the optimal solution fastest and the new setting is worse than the "Hidden Feasibility" setting. We can check the time at which each run (other than with emphasis 1) finds the optimal solution to confirm this.

MIPEmphasis
Time
Default5.50
Hidden Feasibility21.19
Heuristics40.85

## Test case 2: Typewriter

The second test case is a MIP model for laying out the keyboard of a hypothetical 19th century typewriter. The problem was featured in a series of posts, and the model used here appeared in the last of those posts. As I noted in that post, I was unable to find a provably optimal solution in large part due to a slow moving best bound, so for this demonstration I set a 60 second run limit. The problem seeks to minimize a distance measure. Once again, I'll use a log-log plot to show progress.

All the emphasis settings produce a rapid reduction in the objective function early on. After about a second or so, emphasis 1 (feasibility) seems to do a bit better than the others. Settings 4 and 5 seem to lag a bit. Looking at the final objective values (at the 60 second cutoff), however, it seems that setting 4 (hidden feasibility) did best, and setting 5 (heuristics) slightly outperformed the other settings.

MIPEmphasis
Best
Default5650882
Feasibilty5660625
Heuristics5640159
Hidden Feasibility5517363

We can also look at node throughput. As a general rule, we would expect that increased use of heuristics would slow down node throughput. One possible exception would be settings that encouraged "diving" (local depth-first search), which might speed up processing of nodes during a dive.

The "heuristics" and "hidden feasibility" settings do in fact process fewer nodes in 60 seconds than does the default setting. The "feasibility" setting process about twice as many nodes as does the default setting, which may mean it does a fair bit of diving.

## Test case 3: Group Selection

The last example is a group selection problem from yet another earlier post. I tested two slightly different MIP models with five minute time limits. The first variant uses continuous variables for some inherently boolean quantities, while the second variant makes those variables explicitly integer-valued. The second variant seems to be a bit harder to solve, even though they are mathematically equivalent.

The problem is a maximization problem, and none of the runs come remotely near proof of optimality. As noted in the earlier post, nonlinear approaches yielded an objective value of 889.3463, which is apparently optimal.

Looking at progress in the incumbent value, we see that all methods make substantial progress at the root node but shortly after the root node appear to bog down. In the first model, there is not much difference among the emphasis settings.

In the second model, the feasibility setting is a bit faster than the other to reach its maximum, and the heuristics setting is slower.

In both cases, though, the new "heuristics" setting produces the best objective value after 300 seconds.

MIPEmphasis
Best
Default885.7781
Feasibilty885.7781
Heuristics889.3451
Hidden Feasibility889.3130

MIPEmphasis
Best
Default884.6917
Feasibilty884.6917
Heuristics889.3392
Hidden Feasibility889.3130

As for node throughput, the next two plots show that node throughput is clearly greater in the first variant (where inherently boolean variables are treated a continuous with domain [0, 1]), and the "feasibility" setting is again fastest in both variants, while the new "heuristics" setting is slowest.

## Conclusion

Testing on a small set of examples does not tell us much. On the group selection models, where progress is hard to come by after a short time, the new setting produced the best results, but was not much better than the old "hidden feasibility" setting. The same was true on the typewriter problem. So I am still waiting to encounter a problem instance where the new setting will provide a substantial improvement.

## Friday, October 16, 2020

### Multilogit Fit via LP

A recent question on OR Stack Exchange has to do with getting an $L_1$ regression fit to some data. (I'm changing notation from the original post very slightly to avoid mixing sub- and super-scripts.) The author starts with $K$ observations $y_1, \dots, y_K$ of the dependent variable and seeks to find $x_{i,k} \ge 0$ ($i=1,\dots,N$, $k=1,\dots,K$) so as to minimize the $L_1$ error $$\sum_{k=1}^K \left|y_k - \sum_{i=1}^N \frac{e^{x_{i,k}}}{\sum_{j=1}^K e^{x_{i,j}}}\right|.$$ The author was looking for a way to linearize the objective function.

The solution I proposed there begins with a change of variables: $$z_{i,k}=\frac{e^{x_{i,k}}}{\sum_{j=1}^K e^{x_{i,j}}}.$$ The $z$ variables are nonnegative and must obey the constraint $$\sum_{k=1}^{K}z_{i, k}=1\quad\forall i=1,\dots,N.$$ With this change of variables, the objective becomes $$\sum_{k=1}^K \left|y_k - \sum_{i=1}^N z_{i,k} \right|.$$ Add nonnegative variables $w_k$ ($k=1,\dots, K$) and the constraints $$-w_k \le y_k - \sum_{i=1}^N z_{i,k} \le w_k \quad \forall k=1,\dots,K,$$ and the objective simplifies to minimizing $\sum_{k=1}^K w_k$, leaving us with an easy linear program to solve.

That leaves us with the problem of getting from the LP solution $z$ back to the original variables $x$. It turns out the transformation from $x$ to $z$ is invariant with respect to the addition of constant offsets. More precisely, for any constants $\lambda_i$ ($i=1,\dots,N$), if we set $$\hat{x}_{i,k}=x_{i,k} + \lambda_i \quad \forall i,k$$ and perform the $x\rightarrow z$ transformation on $\hat{x}$, we get $$\hat{z}_{i,k}=\frac{e^{\lambda_{i}}e^{x_{i,k}}}{\sum_{j=1}^{K}e^{\lambda_{i}}e^{x_{i,j}}}=z_{i,k}\quad\forall i,k.$$ This allows us to convert from $z$ back to $x$ as follows. For each $i$, set $j_0=\textrm{argmin}_j z_{i,j}$ and note that $$\log\left(\frac{z_{i,k}}{z_{i,j_0}}\right) = x_{i,k} - x_{i, j_0}.$$ Given the invariance to constant offsets, we can set $x_{i, j_0} = 0$ and use the log equation to find $x_{i,k}$ for $k \neq j_0$.

Well, almost. I dealt one card off the bottom of the deck. There is nothing stopping the LP solution $z$ from containing zeros, which will automatically be the smallest elements since $z \ge 0$. That means the log equation involves dividing by zero, which has been known to cause black holes to erupt in awkward places. We can fix that with a slight fudge: in the LP model, change $z \ge 0$ to $z \ge \epsilon$ for some small positive $\epsilon$ and hope that the result is not far from optimal.

I tested this with an R notebook. In it, I generated values for $y$ uniformly over $[0, 1]$, fit $x$ using the approach described above, and also fit it using a genetic algorithm for comparison purposes. In my experiment (with dimensions $K=100$, $N=10$), the GA was able to match the LP solution if I gave it enough time. Interestingly, the GA solution was dense (all $x_{i,j} > 0$) while the LP solution was quite sparse (34 of 1,000 values of $x_{i,j}$ were nonzero). As shown in the notebook (which you can download here), the LP solution could be made dense by adding positive amounts $\lambda_i$ as described above, while maintaining the same objective value. I tried to make the GA solution sparse by subtracting $\lambda_i = \min_k x_{i,k}$ from the $i$-th row of $x$. It preserved nonnegativity of $x$ and maintained the same objective value, but reduce density only from 1 to 0.99.

## Wednesday, September 30, 2020

### A Greedy Heuristic Wins

A problem posted on OR Stack Exchange starts as follows: "I need to find two distinct values to allocate, and how to allocate them in a network of stores." There are $n$ stores (where, according to the poster, $n$ can be close to 1,000). The two values (let's call them $x_1$ and $x_2$) must be integer, with $x_1 \in \lbrace 1, \dots, k_1 \rbrace$ and $x_2 \in \lbrace k_1, \dots, k_2 \rbrace$ for given parameters $k_1 < k_2$. Additionally, there is an additional set of parameters $s_{i3}$ and a balance constraint saying $$0.95 g(k_1 e) \le g(x_1, x_2) \le 1.05 g(k_1 e)$$ where $$g(y) = \sum_{i=1} \frac{s_{i3}}{y_i}$$ for any allocation $y$ and $e = (1,\dots, 1).$

The cost function (to be minimized) has the form $$f(x_1, x_2) = a\sum_{i=1}^n \left[ s_{i1}\cdot \left( \frac{s_{i2}}{y_i} \right)^b \right]$$with $a$, $s_{i1}$, $s_{i2}$ and $b$ all parameters and $y_i \in \lbrace x_1, x_2 \rbrace$ is the allocation to store $i$. There are two things to note about $f$. First, the leading coefficient $a (> 0)$ can be ignored when looking for an optimum. Second, given choices $x_1$ and $x_2>x_1$, the cheaper choice at all stores will be $x_1$ if $b < 0$ and $x_2$ if $b > 0$.

It's possible that a nonlinear solver might handle this, but I jumped straight to metaheuristics and, in particular, my go-to choice among metaheuristics -- a genetic algorithm. Originally, genetic algorithms were intended for unconstrained problems, and were tricky to use with constrained problems. (You could bake a penalty for constraint violations into the fitness function, or just reject offspring that violated any constraints, but neither of those approaches was entirely satisfactory.) Then came a breakthrough, the random key genetic algorithm [1]. A random key GA uses a numeric vector $v$ (perhaps integer, perhaps byte, perhaps double precision) as the "chromosome". The user is required to supply a function that translates any such chromosome into a feasible solution to the original problem.

I did some experiments in R, using the GA package to implement a random key genetic algorithm. The package requires all "genes" (think "variables") to be the same type, so I used a double-precision vector of dimension $n_2$ for chromosomes. The last two genes have domains $(1, k_1 + 1)$ and $(k_1, k_2 + 1)$; the rest have domain $(0, 1)$. Decoding a chromosome $v$ proceeds as follows. First, $x_1 = \left\lfloor v_{n+1}\right\rfloor$ and $x_2 = \left\lfloor v_{n+2}\right\rfloor$, where $\left\lfloor z \right\rfloor$ denotes the "floor" (greatest lower integer) of $z$. The remaining values $v_1, \dots, v_{n}$ are sorted into ascending order, and their sort order is applied to the stores. So, for instance, if $v_7$ is the smallest of those genes and $v_{36}$ is the largest, then store $7$ will be first in the sorted list of stores and store $36$ will be last. (The significance of this sorting will come out in a second.)

Armed with this, my decoder initially assigns every store the cheaper choice between $x_1$ and $x_2$ and computes the value of $g()$. If $g()$ does not fall within the given limits, the decoder runs through the stores in their sorted order, switching the allocation to the more expensive choice and updating $g()$, until $g()$ meets the balance constraint. As soon as it does, we have the decoded solution. This cheats a little on the supposed guarantee of feasibility in a decoded solution, since there is a (small?) (nearly zero?) chance that the decoding process will fail with $g()$ jumping from below the lower bound to above the upper bound (or vice versa) after some swap. If it does, my code discards the solution. This did not seem to happen in my testing.

The GA seemed to work okay, but it occurred to me that I might be over-engineering the solution a bit. (This would not be the first time I did that.) So I also tried a simple greedy heuristic. Since $k_1$ and $k_2$ seem likely to be relatively small in the original poster's problem (whereas $n$ is not), my greedy heuristic loops through all valid combinations of $x_1$ and $x_2$. For each combination, it sets $v_1$ equal to the cheaper choice and $v_2$ equal to the more expensive choice, assigns the cheaper quantity $v_1$ to every store and computes $g()$. It also computes, for each store, the ratio $\frac{|\frac{s_{i3}}{v_{2}}-\frac{s_{i3}}{v_{1}}|}{s_{i1}\left(\left(\frac{s_{i2}}{v_{2}}\right)^{b}-\left(\frac{s_{i1}}{v_{1}}\right)^{b}\right)}$in which the numerator is the absolute change in balance at store $i$ when switching from the cheaper allocation $v_1$ to the more expensive allocation $v_2$, and the denominator is the corresponding change in cost. The heuristic uses these ratios to select stores in descending "bang for the buck" order, switching each store to the more expensive allocation until the balance constraint is met.

Both the GA decoder and the greedy heuristic share the approach of initially allocating every store the cheaper choice and then switching stores to the more expensive choice until balance is attained. My R notebook generates a random problem instance with $n=1,000$ and then solves it twice, first with the GA and then with the greedy heuristic. The greedy heuristic stops when all combinations of $x_1$ and $x_2$ have been tried. Stopping criteria for the GA are more arbitrary. I limited it to at most 1,000 generations (with a population of size 100) or 20 consecutive generations with no improvement, whichever came first.

The results on a typical instance were as follows. The GA ran for 49 seconds and got a solution with cost 1065.945. The greedy heuristic needed only 0.176 seconds to get a solution with cost 1051.735. This pattern (greedy heuristic getting a better solution in much less time) repeated across a range of random number seeds and input parameters, including switching between positive and negative values of $b$.

If you are interested, you can browse my R notebook (which includes both code and results).

[1] Bean, J. C. (1994). Genetic Algorithms and Random Keys for Sequencing and Optimization. ORSA Journal on Computing, 6, 154-160.

## Thursday, September 3, 2020

### Installing Rcplex and cplexAPI

I've previously mentioned solving MIP models in R, using CPLEX. In one post [1], I used the OMPR package, which provides a domain specific language for model construction. OMPR uses the ROI package, and in particular the ROI.plugin.cplex package, to communicate with CPLEX. That, in turn, uses the Rcplex package. In another post [2], I used Rcplex directly. Meanwhile, there is still another package, cplexAPI, that provides a low-level API to CPLEX.

Both Rcplex and cplexAPI will install against CPLEX Studio 12.8 and earlier, but neither one installs with CPLEX Studio 12.9 or 12.10. Fortunately, IBM's Daniel Junglas was able to hack solutions for both of them. I'll spell out the steps I used to get Rcplex working with CPLEX 12.10. You can find the solutions for both in the responses to this question on the IBM Decision Optimization community site. Version information for what follows is: Linux Mint 19.3; CPLEX Studio 12.10; R 3.6.3; and Rcplex 0.3-3. Hopefully essentially the same hack works with Windows.

1. Download Rcplex_0.3-3.tar.gz, put it someplace harmless (the Downloads folder in my case, but /tmp would be fine) and expand it, producing a folder named Rcplex.
2. Go to the Rcplex folder and open the 'configure' file in a text editor (one you would use for plain text files).
3. Line 1548 should read as follows:
CPLEX_LIBS="-L${CPLEXLIBDIR} ${AWK} 'BEGIN {FS = " = "} /^CLNFLAGS/ {print $2}'${CPLEX_MAKEFILE}"
.
Replace it with