Monday, November 6, 2023

Displaying a Nested Data Frame

Almost anything (within reason) you want to do with data can be done in an R application ... if you can just find the correct libraries and functions.

A program I'm writing in R, using the Shiny web framework, needs to both display and export a data frame. Ordinarily, this would be straightforward: I would use datatable(), renderDT() and dataTableOutput() from the DT library for display, and something like write.csv from the utils library to export the data frame in a spreadsheet format. Unfortunately, none of that works this time. The problem is that the data frame comes from an HTTP POST request, with the data arriving as a JSON object. The JSON object contains nested structures, which results in a data frame where some cells contain lists or data frames. Even after applying the flatten() function from the jsonlite library, the nesting persisted.

The data frame would render in Shiny as a table, but in cells containing nested structures the rendered table would just say something like "list()" or "1 entry". The write.csv function refused to write the table to a file. I noticed, though, that if I viewed the data frame in RStudio (using the View() command), the nested structures would still show as something like "list()", but clicking one would open it in a new window, and hovering over it with the mouse would give an indication of the content. A quick check of the help entry for View() showed that it uses the format.data.frame() function from the base library. So I tried applying that function to my data frame before displaying or exporting it, and both problems were fixed. The displayed table showed the contents of each cell, and the flattened table could be exported to a CSV file. The structure of the nested data frames and lists was not preserved, but the user could at least see what was in those cells, which was good enough for my application.
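Here is a minimal sketch of the pattern (hand-rolled for this post, not the application code; the JSON payload and file name are made up):

library(DT)
library(jsonlite)

# 'payload' stands in for the JSON body of the POST request
payload <- '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]'
df <- flatten(fromJSON(payload))   # the "tags" column is still a list
df <- format.data.frame(df)        # coerce every cell to displayable text

datatable(df)                                    # renders cell contents now
write.csv(df, "export.csv", row.names = FALSE)   # and the export works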

Monday, October 30, 2023

Publishing a Shiny Application

I've twice encountered problems while trying to deploy Shiny applications on the shinyapps.io hosting platform, and since the fix was the same in both cases, I think a pattern is emerging. In both cases, the issue has to do with use of the includeMarkdown function to display help text written in Markdown and stored in a file within the application. The includeMarkdown function is provided by the htmltools library, which is apparently loaded automatically by the shiny library. So my code explicitly specifies library(shiny) but does not load htmltools.

In both problem cases, the code ran fine on my PC but not on the server. Why? A little "fine print" in the documentation of the includeMarkdown function mentions that it requires the markdown package. I have that installed on my PC, and apparently it gets loaded automatically. The server deployment system, though, apparently does not realize the need for it. So on the server my code loads the shiny library explicitly, and that causes the server to include the htmltools library but not the markdown library.

The solution is trivial: just add library(markdown) to the source code. The hard part is remembering to do it, given that it is unnecessary on my PC.
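In other words, the app source ends up with something like this (a skeletal example, with help.md standing in for the real help file):

library(shiny)
library(markdown)   # not needed locally, but tells the deployment
                    # system that includeMarkdown()'s dependency exists

ui <- fluidPage(
  includeMarkdown("help.md")   # help text written in Markdown
)
server <- function(input, output, session) {}
shinyApp(ui, server)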

Saturday, October 28, 2023

The Trouble with Tibbles

Apologies to Star Trek fans for the title pun, but I couldn't resist. (If you're not a Trekkie, see here for clarification.)

I've been working on an interactive R application (using Shiny, although that's probably irrelevant here). There are places where the code needs to loop through a data frame, looking for consecutive rows where either three text fields match or two match and one does not. Matching rows are copied into new data frames. The data I'm testing the code on has a bit over 9,000 rows, and this process can take upwards of nine seconds -- not an eternity, but a bit annoying when you are sitting there waiting for it to finish.

I decided to use the profiler in RStudio to see where time was being eaten up. Almost all the nine seconds was blamed on two steps. The biggest time suck was doing case-insensitive string comparisons on the three fields, which did not come as a big surprise. I went into the profiling process thinking the other big time suck would be adding a data frame to a growing list of data frames, but that was actually quite fast. To my surprise, the number two consumer of time was "df <- temp[n, ]", which grabs row n from data frame "temp" and turns it into a temporary data frame named "df". How could such a simple operation take so long?

I had a hunch that turned out to be correct. Somewhere earlier in the code, my main data frame (from which "temp" was extracted) became a tibble. Tibbles are modernized, souped-up versions of data frames, with extra features but also occasional extra overhead/annoyances. One might call them the backbone of the Tidyverse. The Tidyverse has its adherents and its detractors, and I don't want to get into the middle of that. I'm mostly happy to work with the Tidyverse, but in this case using tibbles became a bit of a problem.

So I tweaked the line of code that creates "temp" to "temp <- ... %>% as.data.frame()", where ... was the existing code. Lo and behold, the time spent on "df <- temp[n, ]" dropped to a little over half a second. Somewhat surprisingly, the time spent on the string comparisons dropped even more. The overall processing time fell from around 9 seconds to under 1.3 seconds, speeding up the code by a factor of almost 7.
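The effect is easy to reproduce with a toy example along the following lines (not the application code):

library(tibble)

tb <- tibble(f1 = sample(letters, 9000, replace = TRUE),
             f2 = sample(letters, 9000, replace = TRUE),
             f3 = sample(letters, 9000, replace = TRUE))
df <- as.data.frame(tb)   # same data, plain data frame

# single-row extraction in a loop: the tibble version is much slower
system.time(for (n in seq_len(nrow(tb))) x <- tb[n, ])
system.time(for (n in seq_len(nrow(df))) x <- df[n, ])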

I'll need to keep this in mind if I bump into any other places where my code seems unusually slow.


Saturday, September 16, 2023

Unnecessary LyX Backup Files

If you edit a file from an older version of LyX and then save it, LyX will create an extra backup file with a lengthy name containing something about "version". As with other LyX backups, the file extension is ".lyx~". I believe this is a defense against the possibility that changes to the file format might muck up the file, and I have absolutely no problem with it. Let's call this the version backup.

That said, for a long time now it has been the case that every time I created a new LyX file, saved it, then edited it and saved the edit, I would get both the standard backup file and (after the first edit only) the version backup. Subsequent saves would only produce the usual backup. This was a minor annoyance, since I was perpetually tracking down and deleting the version backups to reduce drive clutter (and cut down the time it took to back up my hard drive, which I do regularly).

Turns out there was a simple explanation (and fix) for this, which I found in a thread on a support forum. Like many users, at some point I created a new file, customized a few things, and then in Document > Settings..., clicked "Save as Document Defaults" to make those settings the defaults for all new documents. Doing so creates a file named "defaults.lyx" in the templates directory (which, on my Linux Mint system, is ~/.lyx/templates). That file was saved in the file format version in effect at the time, meaning every new file I've created since then has started in that older format and then been upgraded to whatever the current version is ... leading to all those version backups.

The fix was simple. I just created a new document and set it as the default template, which updated the template to the current LyX format (544 as of this writing). Now I just have to remember to repeat this whenever a new LyX version comes out with a new file format. (At my age, "I just have to remember" should not be taken lightly.)


Tuesday, September 12, 2023

Randomly Generating a Connected Graph

Someone posted a question on Operations Research Stack Exchange about generating random test instances of a connected graph. (They specified a planar graph, but I am going to ignore that restriction, which I suspect was unintended and which I am pretty sure would make the problem much harder.) The premise is fairly straightforward. You start with a fixed set of nodes, partitioned into three subsets (supply, transshipment and demand nodes). The goal is to select a random set of edges of fixed cardinality such that (a) the resulting graph is connected and (b) no edge connects two supply nodes or a supply node and a demand node. Setting the number of edges too low will obviously make the problem infeasible. The author specified that the graph could be either directed or undirected. My solutions are coded for undirected graphs but can be adapted to directed graphs.

I proposed two solutions, both of which I coded and tested in Java. One uses a mixed integer programming (MIP) model, which my code solves using CPLEX. Let $E$ be the set of all valid edges, $N$ the number of nodes and $K$ the desired number of edges. The MIP model uses binary variables $x_{i,j}$ for all $(i,j)\in E.$ The requirement to include $K$ edges is handled by the constraint $$\sum_{(i,j)\in E} x_{i,j} = K$$ and randomness is achieved by minimizing $$\sum_{(i,j)\in E} c_{i,j} x_{i,j}$$ where the cost coefficients $c_{i,j}$ are generated randomly. 

That leaves the matter of forcing the graph to be connected. To do that, we introduce forward and backward flow variables $f_{i,j} \ge 0$ and $r_{i,j} \ge 0$ for each edge $(i,j),$ where $f_{i,j}$ is interpreted as flow from $i$ to $j$ and $r_{i,j}$ as flow from $j$ to $i.$ The concept is to single out one node (call it $s$) as a source for $N-1$ units of flow of some mystery commodity and assign every other node a demand of one unit of the commodity. For $i\neq s,$ the flow conservation constraint is $$\sum_{(j,i)\in E} (f_{j,i} - r_{j,i}) + \sum_{(i,j)\in E} (r_{i,j} - f_{i,j}) = 1,$$ which says that flow in minus flow out equals 1. To ensure that flow occurs only on selected edges, we add the constraints $$f_{i,j} \le (N-1) x_{i,j}$$ and $$r_{i,j} \le (N-1) x_{i,j}$$ for all $(i,j)\in E.$

The edge removal heuristic is a bit simpler. We start by selecting all eligible edges and shuffling the list randomly. While the number of edges in the graph exceeds $K,$ we pop the next edge from the list, remove it, and test whether the graph remains connected. If yes, the edge is permanently gone. If no, we replace the edge in the graph (but not on the list) and move to the next edge in the list. To confirm that removing edge $(i,j)$ leaves a connected graph, we start from node $i$ and find all nodes that can be reached from $i$ in one step (i.e., remain adjacent). Then we find all nodes adjacent to those nodes, and continue until either we encounter node $j$ (in which case the graph is still connected) or run out of nodes to test (in which case the graph is now disconnected).
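A bare-bones R version of the heuristic might look like the following (my own sketch, not the Java code from the repository); the graph is a two-column matrix of node indices, assumed to contain only eligible edges:

# Can node j still be reached from node i using the edges in e?
reachable <- function(e, i, j) {
  reached <- i
  repeat {
    nbrs <- unique(c(e[e[, 1] %in% reached, 2], e[e[, 2] %in% reached, 1]))
    new <- setdiff(nbrs, reached)
    if (j %in% new) return(TRUE)         # found j: still connected
    if (length(new) == 0) return(FALSE)  # ran out of nodes: disconnected
    reached <- c(reached, new)
  }
}

# Remove randomly chosen edges until only K remain, skipping any
# removal that would disconnect the graph.
thin_to_K_edges <- function(edges, K) {
  keep <- rep(TRUE, nrow(edges))
  for (cand in sample(nrow(edges))) {    # shuffled removal order
    if (sum(keep) <= K) break
    keep[cand] <- FALSE                  # tentatively remove the edge
    if (!reachable(edges[keep, , drop = FALSE],
                   edges[cand, 1], edges[cand, 2]))
      keep[cand] <- TRUE                 # removal disconnects: put it back
  }
  edges[keep, , drop = FALSE]
}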

The edge removal heuristic is simpler to explain, does not require a MIP solver and likely is faster. There are two potential disadvantages compared to the MIP approach. One is that the MIP approach is guaranteed to produce a connected graph in a single solve, whereas the edge removal heuristic could fail to produce one due to an "unlucky" choice of which edges to remove, requiring one or more restarts with different random number seeds. The other is that if $K$ is set too low (making a connected graph impossible), the MIP model will detect that the problem is infeasible. With the edge removal heuristic, you would not be able to distinguish infeasibility from bad luck (although if the heuristic failed multiple times with different random seeds, infeasibility would start to look likely). In very limited testing of my code, the edge removal heuristic was definitely faster than the MIP and achieved a valid result on the first pass each time.

Java code for both methods is in my code repository.

Saturday, August 12, 2023

The "Set" Card Game

Someone asked a question (actually a pair of questions) about the "Set" card game on OR Stack Exchange. This was the first I had heard of the game. For purposes of this post, the game play is irrelevant. The key is that it uses a deck of 81 cards, each of which has four "features" with three possible values for each feature. Every combination of feature values appears on exactly one card. Three cards form a "set" if, for each of the four features, either all three cards have the same value or all three cards have different values. As noted in the Wikipedia page, there are 1,080 possible "sets". (I'm putting "set" in quotes to avoid confusion with, you know, sets.) The OR SE question asked if an integer program (IP) could be used to generate all 1,080 sets.

The answer is yes, with the caveat that while you can do it, an IP model is not the most efficient way to do it. I won't repeat my IP model here, since (a) it's a bit of a space hog and (b) you can read it in my answer on ORSE. The model is pretty small (97 binary variables) and finds one "set" quickly. There are two ways to get it to find all possible "sets". Both involve adding a "no good" constraint for each solution found, i.e., a constraint that eliminates that solution but no others. The more tedious approach solves the progressively growing model 1,081 times (the last time producing a declaration of infeasibility, meaning all "sets" have been found). The less tedious approach requires a solver that supports a callback function letting you reject candidate solutions by adding a constraint cutting them off (the same "no good" constraint used in the first approach).
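For concreteness: if $x^*$ is the incumbent solution of a model with binary variables $x_v,$ the standard "no good" cut is $$\sum_{v : x^*_v = 1} (1 - x_v) + \sum_{v : x^*_v = 0} x_v \ge 1,$$ which forces at least one variable to change value and so excludes the incumbent solution and nothing else.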

Since there is no objective and the constraints are mostly logical in nature, constraint programming (CP) strikes me as a more intuitive approach than IP. The specifics of a CP model will depend on what types of constraints your CP solver supports, but I think the model I tried is likely to work with a lot of solvers. Since I coded the models in Java, I'll use 0-based indexing here. My variables (all integer-valued) are: $x_s\in \lbrace 0,\dots, 80\rbrace,$ the index of the card in slot $s\in \lbrace 0,1,2\rbrace$ of the "set"; and $y_{f,s}\in \lbrace 0,1,2\rbrace,$ the value of feature $f$ in the card in slot $s.$ The constraints are as follows:

  • allDifferent($x$) (no two slots can contain the same card);
  • $x_0 \lt x_1 \lt x_2$ (to avoid counting the same "set" more than once, we require that the cards be listed in ascending index order);
  • $y_{f,s} = a_{x_s,f},$ where $a_{c,f}$  is the value of feature $f$ in card $c$ (the value of a feature in a slot comes from the card in the slot); and
  • allDifferent$(\lbrace y_{f,0}, y_{f,1}, y_{f,2}\rbrace) \vee (y_{f,0}=y_{f,1} \wedge y_{f,1}=y_{f,2})\,\forall f$ (for every feature, either all three slots have different values or all three slots have the same value).

This exploits two constraints that CP solvers typically have but IP solvers do not. One is the "all different" constraint, which forces a set of variables to take values that do not repeat. The other is the ability to use a decision variable as an index to either a constant vector or another variable vector. That comes into play in the third constraint, where the variable $x_s$ (the index of the card in slot $s$) is used as a subscript of $a$ to get the feature value of that card.

Even CP is probably overkill for this problem, though. We can compute the 1,080 possible solutions by brute force. I used a recursive function for this.
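A quick illustration in R (the same brute force idea, though my actual code is recursive Java): coding the feature values as 0, 1 and 2, three values are all equal or all different exactly when their sum is divisible by 3, which makes the test a one-liner.

# Each card is the base-3 digit vector of its index 0..80 (81 x 4 matrix).
cards <- t(sapply(0:80, function(k) k %/% 3^(0:3) %% 3))
count <- 0
for (a in 1:79) for (b in (a + 1):80) for (d in (b + 1):81) {
  # all same or all different per feature <=> feature sums divisible by 3
  if (all((cards[a, ] + cards[b, ] + cards[d, ]) %% 3 == 0))
    count <- count + 1
}
count   # 1080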

Here is how each method stacks up in terms of computation time. As I said, I coded it in Java, using CPLEX 22.1.1 for the IP models and CP Optimizer 22.1.1 for the CP model.

  • brute force took between 0.1 and 0.2 seconds;
  • the CP model also took between 0.1 and 0.55 seconds;
  • solving the IP model once using a callback took between 1.4 and 3 seconds; and
  • solving the IP model 1,081 times (without a callback) took around 56 seconds, give or take.

Solution times for all methods are a bit random (they change when I rerun the program), so treat them as approximate. In the interest of full disclosure, I throttled CPLEX to a single thread for the callback approach. Using multiple threads would have required both synchronizing the code (a pain) and also checking for repeated solutions (since hypothetically two parallel threads could independently locate the same "set"). The moral, to the extent there is one, is that sometimes brute force beats more sophisticated approaches.

You can find my Java code here if interested.

Sunday, July 30, 2023

Visiting All Nodes

A question on Operations Research Stack Exchange asks about finding a shortest route through a connected graph that visits every node. This is distinct from the well-known traveling salesman problem (TSP), in that the TSP requires each node to be visited exactly once whereas the OR SE question explicitly allows nodes to be visited more than once.

The question reminded me of a conversation I had years ago with a doctoral student in our logistics program. He was working on routing deliveries (via truck) from a depot to various customers, and was trying to formulate it as a TSP using a graph based on a road map (buildings or intersections as nodes, roads as edges). I did my best at the time to convince him that the TSP was unnecessarily restrictive and might even make the problem infeasible. The second example in the OR SE question is a nice illustration of how that could happen. Generally speaking, in an actual routing problem there is nothing stopping you from passing by or through a previously visited location if that is the best route. There are a few exceptions, some in military applications (laying mines or blowing up bridges as you pass them) and some in civilian life (if you were speeding, you might not want to return to a place where a traffic cop would remember you), but for the most part I believe it holds.

My advice to the doctoral student was similar to the answer Rob Pratt posted on OR SE. Rethink the graph, keeping just the depot and customers as nodes and letting the edges in the new graph represent the shortest routes between pairs of nodes. So edge $(i,j)$ and edge $(k,\ell)$ connect different pairs of nodes but might represent physical routes that cross (you pass through the same intersection on both, albeit possibly in different directions) or overlap (both take US 127 north from mile marker 102 to mile marker 137). Building the edges in the new graph involves solving a shortest path problem between each pair of nodes (using the original graph). You can apply Dijkstra's algorithm to each pair of nodes, but I'm pretty sure applying the Floyd-Warshall algorithm would be faster. Pro tip: if you use Floyd-Warshall, don't bother to extract the route for each pair of nodes. Just get the distances, and keep the F-W solution handy. Once you have solved a TSP on the modified graph and know which edges will be used, you can extract the physical routes corresponding to just the used edges from the F-W solution.
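For reference, Floyd-Warshall is compact enough to sketch in a few lines of R (my illustration, not production code): D is the matrix of direct distances, with Inf where there is no edge and 0 on the diagonal, and the graph is assumed connected. The 'via' matrix records an intermediate node on each best path, which is what lets you extract the physical routes for just the edges the TSP solution uses.

floyd_warshall <- function(D) {
  n <- nrow(D)
  via <- matrix(NA_integer_, n, n)   # best known intermediate node, if any
  for (k in 1:n) for (i in 1:n) for (j in 1:n) {
    if (D[i, k] + D[k, j] < D[i, j]) {
      D[i, j] <- D[i, k] + D[k, j]   # going through k is shorter
      via[i, j] <- k
    }
  }
  list(dist = D, via = via)
}

# Recover the physical route i -> j by recursively expanding 'via'.
route <- function(fw, i, j) {
  k <- fw$via[i, j]
  if (is.na(k)) c(i, j) else c(route(fw, i, k), route(fw, k, j)[-1])
}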

I posted an explicit MIP model (which I will not repeat here, since it's a space hog) as an answer to the OR SE question. I suspect that Rob's approach (TSP on a modified graph) might be faster, particularly because it allows you to use a dedicated TSP solver. Although TSPs have a reputation for being difficult (being the poster child for NP-hard problems), bespoke solvers like Concorde can chew through rather large TSPs pretty quickly. Other than the upfront computational cost of finding the shortest path between every pair of nodes, the one concern I might have with the TSP approach is that it converts what might have been a sparse graph to a complete graph. So, like all things MIP, which approach is better is an empirical question.

Sunday, June 18, 2023

An Unbounded Bounded Feasible Region (II)

In my previous post, I cited a question on Operations Research Stack Exchange about an allegedly unbounded feasible region defined by a randomly generated quadratic constraint. In that post, I presented what I believe is a valid proof that, at least in theory, the feasible region should be bounded with probability 1.

Unfortunately, in our digital world, theory and computation sometimes diverge. To test whether I was correct, I coded several mixed integer quadratically constrained programming (MIQCP) models to assess the boundedness of $X$, and applied them to a small sample of test problems (using Java and CPLEX 22.1.1). I set the dimension of the test problems to $n=5$ (for no particular reason).

All three models contained the variable $x$ and the constraint $x^{\prime}Qx+q^{\prime}x \le -q_{0}.$ Two of the models attempted to answer the question directly by maximizing either the $L_1$ or $L_\infty$ norm of $x$ over $X$.

For the $L_\infty$ model, I added a continuous variable $y$ and binary variables $z_i$ and $w_i$ together with the constraints $$\sum_i z_i + \sum_i w_i = 1,$$ $$y \ge x_i \quad \forall i,$$ $$y \ge -x_i \quad \forall i,$$ $$ z_i = 1 \implies y = x_i$$ and $$w_i = 1 \implies y = -x_i.$$ The combined effect of these is to force $y = \vert x_i \vert$ for some $i$ where $$\vert x_i \vert = \max_j \vert x_j \vert = \parallel x \parallel_\infty.$$ The objective was to maximize $y.$

For the $L_1$ model, I added continuous variables $y_i \ge 0$ and constraints $y_i = \vert x_i \vert.$ (Internally, CPLEX adds binary variables and does big-M magic to linearize the absolute values.) The objective was to maximize $\sum_i y_i = \parallel x \parallel_1.$

My third approach was iterative. The model was almost the same as the $L_\infty$ model, except that rather than maximizing $y$ I set the objective to minimize 0 (meaning the first feasible solution wins) and set the lower bound of $y$ to some value $M.$ If the model found a feasible solution (an $x\in X$ such that $\parallel x \parallel_\infty \ge M$), I doubled $M$ and tried again, until either $M$ exceeded some upper limit or the solver said the problem was infeasible (meaning $X$ is bounded in $L_\infty$ norm by the last value of $M$).

You can find my Java code (self-contained other than needing CPLEX) in my repository. Here are the results from a handful of test runs.

Random Seed   Max L_infinity    Max L_1           Iterative
123           unbounded         unbounded         bounded (100)
456           unbounded         unbounded         bounded (100)
789           unbounded         unbounded         bounded (1600)
12345         unbounded         bounded (22.1)    bounded (100)
61623         bounded (12.4)    unbounded         bounded (100)
20230616      bounded (120.9)   unbounded         bounded (200)

The numbers in parentheses are upper bounds on the $L_1$ norm in the third column and the $L_\infty$ norm in the second and fourth columns. The bound found by the iterative method is never tight, so a bound of 100 means $\max_{x \in X} \parallel x \parallel_\infty \le 100.$ As you can see, the iterative method always found the ellipsoids to be bounded, consistent with the mathematical argument in the previous post. The other two models frequently found the problem to be "unbounded", though they did not always agree on that. This is a bit confusing (OK, very confusing). In particular, the "Max L_infinity" and "Iterative" models differ only in whether you are maximizing $y$ or looking for any solution with $y\ge M,$ so saying (when the seed is 123) that the supremum of $y$ is $\infty$ but $y$ cannot exceed 100 is grounds for another beer (or three).

Something is apparently going on under the hood in CPLEX that is beyond me. Meanwhile, I'm sticking to my belief that $X$ is always bounded.

Saturday, June 17, 2023

An Unbounded Bounded Feasible Region (I)

If you find the title of this post confusing, join the club! A question on Operations Research Stack Exchange, "Randomly constructing a bounded ellipsoid", sent me down a rabbit hole, eventually leading to this opus. I'm going to make a couple of notational tweaks to the original question, but the gist is as follows. We have an elliptical feasible region $X = \lbrace x \in \mathbb{R}^n : f(x) \le 0 \rbrace$ where $f(x) = x^\prime Q x + q^\prime x + q_0.$ (One of my tweaks is to absorb the author's factor $1/2$ into $Q.$) $Q\in \mathbb{R}^{n \times n},$ $q\in \mathbb{R}^n$ and $q_0\in \mathbb{R}$ are generated by sampling random numbers from the standard normal distribution. In the case of $Q,$ we sample an $n\times n$ matrix $H$ and then set $Q = \frac{1}{2} H^\prime H.$ (My introducing the symbol $H$ for the sampled matrix is the other notational tweak.) Note that $Q$ is automatically symmetric and positive semidefinite, and is positive definite with probability 1. (For it not to be positive definite, $H$ would have to have less than full rank, which has zero probability of occurring.) I should point out here that saying something has probability 1 or 0 assumes that the random number generator works as advertised.

The author of the question said that in their experience $X$ was unbounded "most of the time." That struck me as impossible, and after a bunch of scribbling on marker boards I finally came down to what I think is a correct argument that $X$ must be bounded. Let $\left\{ x_{1},\dots,x_{n}\right\} $ be an orthonormal basis of eigenvectors of $Q,$ with $Qx_{i}=\lambda_{i}x_{i}$ and $$x_i^\prime x_j =\begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

(I'll leave the proof that such a basis exists to the reader as an exercise.)

Now suppose that $X$ is unbounded, meaning that for an arbitrarily large $M$ we can find $x\in X$ such that $\parallel x\parallel = M.$ (Since $X$ is convex, if it contains points of arbitrarily large norm then it contains points of every sufficiently large norm, so we can require equality.) Write $x$ in terms of the basis: $x=\sum_{i}a_{i}x_{i}.$ Observe that $$M^{2}=\parallel x\parallel^{2}=x^{\prime}x=\sum_i \sum_j a_i a_j x_i^\prime x_j = \sum_{i}a_{i}^{2}\left(x_{i}^{\prime}x_{i}\right)=\sum_{i}a_{i}^{2}.$$Expanding $f(x),$ we have

\begin{align*} f(x) & =\left(\sum_{i}a_{i}x_{i}\right)^{\prime}Q\left(\sum_{i}a_{i}x_{i}\right)+q^{\prime}\left(\sum_{i}a_{i}x_{i}\right)+q_{0} \\ & =\sum_{i,j}a_{i}a_{j}\left(x_{i}^{\prime}Qx_{j}\right)+\sum_i a_{i}\left(q^{\prime}x_{i}\right)+q_{0} \\ & =\sum_{i,j}a_{i}a_{j}\lambda_{j}\left(x_{i}^{\prime}x_{j}\right)+\sum_{i}a_{i}\left(q^{\prime}x_{i}\right)+q_{0} \\ & =\sum_{i}a_{i}^{2}\lambda_{i}+\sum_{i}a_{i}\left(q^{\prime}x_{i}\right)+q_{0}. \end{align*}

Since $x\in X,$ $\sum_{i}a_{i}^{2}\lambda_{i}\le-\sum_{i}a_{i}\left(q^{\prime}x_{i}\right)-q_{0}.$ According to the Cauchy-Schwarz inequality, $\vert q^{\prime}x_{i}\vert\le\parallel q\parallel\parallel x_{i}\parallel=\parallel q\parallel,$ so we have $$\sum_{i}a_{i}^{2}\lambda_{i}\le-\sum_{i}a_{i}\left(q^{\prime}x_{i}\right)-q_{0}\le\sum_{i}\vert a_{i}\vert\parallel q\parallel+\vert q_{0}\vert.$$

On the other hand, if $\Lambda=\min_{i}\lambda_{i}>0,$ then $$\sum_{i}a_{i}^{2}\lambda_{i}\ge\Lambda\sum_{i}a_{i}^{2}=\Lambda M^{2}.$$ Combining these, $$\Lambda M^{2}\le\parallel q\parallel\sum_{i}\vert a_{i}\vert+\vert q_{0}\vert.\quad (1)$$

Now let $A=\max_{i}\vert a_{i}\vert$ and assume without loss of generality that $\vert a_{1}\vert=A.$ Since $M^{2}=\sum_{i}a_{i}^{2},$ $A^{2}=M^{2}-\sum_{i>1}a_{i}^{2}\le M^{2}$ and so $0<A\le M.$ Meanwhile, $M^{2}=\sum_{i}a_{i}^{2}\le nA^{2},$ which implies $A\ge\frac{M}{\sqrt{n}}.$

Dividing both sides of (1) by $A,$ we have $$\Lambda M\le\Lambda\frac{M^{2}}{A}\le\parallel q\parallel\sum_{i}\frac{\vert a_{i}\vert}{A}+\frac{\vert q_{0}\vert}{A}\le\parallel q\parallel n+\frac{\vert q_{0}\vert}{A}.\quad (2)$$

The left side of (2) increases without bound as we increase $M,$ while the right side stays bounded: $\parallel q\parallel n$ is constant, and $\vert q_{0}\vert/A$ shrinks as $M$ grows (recall $A\ge M/\sqrt{n}$). So for $M$ sufficiently large the inequality fails, a contradiction.


So, barring an error in the above, we have a mathematical proof that $X$ must be bounded. In the next post I will explore the computational side of things.

Monday, May 8, 2023

Finding a Connected Subgraph

A recent question on Mathematics Stack Exchange asked about using a mixed integer linear program (MILP) to find a connected subgraph of a specified size (vertex count) within a given graph. Specifically, it asked how to formulate constraints to enforce the requirement that the subgraph be connected.

As I have previously mentioned, one way to force a graph to be connected is to add flows between vertices, treating one vertex as a source and the other vertices as sinks. In the context of the current question, that works as follows. Let $V$ and $E$ be the sets of vertices and edges respectively in our (undirected) graph and $M<\vert V \vert$ the number of vertices required in the connected subgraph. Note that we do not assume the parent graph is connected, and in fact the sample graph in the Math SE post is not. Since flows are inherently directional, we replace each edge $e=(v,w)\in E$ with two arcs, $(v,w)$ and $(w,v).$ Let $A$ be the set of all arcs.

We designate (arbitrarily) one vertex in the subgraph to be the "root" (source) vertex, with a supply of $M-1$ thingies. (I was going to say "honest politicians", but then the problem becomes infeasible, hence "thingies".) Each of the remaining vertices in the subgraph consumes one thingy. Thingies flow along arcs, but only if both endpoints are selected. Since $M-1$ thingies enter the network at the root vertex and each of the remaining $M-1$ vertices consumes one thingy, the only way for flow to be balanced is if the selected vertices form a connected graph.

For each vertex $v\in V,$ let $x_v \in \lbrace 0, 1 \rbrace$ be 1 if and only if vertex $v$ is selected and let $y_v\in \lbrace 0, 1 \rbrace$ be 1 if and only if vertex $v$ is chosen to be the root vertex. For each arc $a\in A,$ let $f_a \in [0, M-1]$ be the flow across arc $a.$ The objective function is immaterial, since we just want a feasible solution, so we will minimize 0. The constraints are as follows:

  • The correct number of vertices must be chosen: $$\sum_{v\in V} x_v = M.$$
  • One vertex must be designated the root: $$\sum_{v \in V} y_v = 1.$$
  • The root vertex must be one of the chosen vertices: $$y_v \le x_v \quad \forall v\in V.$$
  • The flow on any arc must be 0 unless both head and tail are among the chosen vertices: $$f_{(v,w)}\le (M-1)x_v \quad \forall (v,w)\in A$$ and $$f_{(v,w)}\le (M-1)x_w \quad \forall (v,w)\in A.$$
  • The net flow out of any vertex (flow out - flow in) must be $M - 1$ for the source vertex, -1 for every other selected vertex and 0 otherwise: $$\sum_{(v,w)\in A} f_{(v,w)} - \sum_{(w,v)\in A} f_{(w,v)} = M\cdot y_v - x_v \quad \forall v\in V.$$

This works in a MILP model, but a MILP model is probably not the best way to solve the underlying problem. Consider the following algorithm, which is based on constructing a layered subgraph. Assume the vertices are in some order. Make the first vertex the root and create a graph containing just it (call this layer 0). Add up to $M-1$ vertices connected to the root by an edge (layer 1). If the graph now contains $M$ vertices, declare victory and stop. Otherwise, for each vertex in layer 1, find all vertices connected to it by an edge that are not in layers 0 or 1, and add as many of them as are needed to reach $M$ vertices (or all of them if you are still short). Repeat until done or until all vertices in layer 1 have been processed. The added vertices form layer 2, being two edges from the root. Repeat with each layer until either you have $M$ vertices (victory) or there are no vertices left to process.

If all connected vertices have been found and the count is less than $M,$ discard the current root and start over with the next vertex as root. Note that when processing subsequent roots, you can ignore any vertices already tried as root and just look at vertices later in the vertex ordering.
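Here is a condensed R rendition of the search (my Java version has more bells and whistles); the graph is given as an adjacency list, one integer vector of neighbors per vertex:

find_connected_subgraph <- function(adj, M) {
  tried <- rep(FALSE, length(adj))
  for (root in seq_along(adj)) {
    if (tried[root]) next        # in a component already known to be too small
    chosen <- root               # layer 0
    layer <- root
    while (length(chosen) < M && length(layer) > 0) {
      layer <- setdiff(unique(unlist(adj[layer])), chosen)  # next layer out
      chosen <- c(chosen, layer)
    }
    if (length(chosen) >= M) return(chosen[1:M])  # prefix of BFS order is connected
    tried[chosen] <- TRUE        # whole component has fewer than M vertices
  }
  NULL                           # no connected subgraph of size M exists
}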

I ginned up some Java code (open source, as always) to test both the MILP model and the layered graph approach, using CPLEX as the MILP solver. Both got the examples from the Math SE post correct in a trivial amount of time. On a randomly generated graph with 5,000 nodes and 624,713 edges (about 5% of the edges in a complete graph that size), with a target subgraph size of 1,000 nodes, the MILP model took about 8 seconds to build and 12 seconds to solve. The layered graph search needed 3 milliseconds. I tried a larger graph, but my computer ran out of heap space trying to build the model (mainly due to the expanding number of flow variables and related constraints).

So there are two things to take away. One is that flows and flow balance constraints can be used to enforce connection among vertices, even when there is no actual thing flowing. The other is that, much as I like MILP models, sometimes simpler approaches are better.


Friday, April 28, 2023

A Matrix Puzzle

A question on Mathematics Stack Exchange, "Placing number blocks so that the resulting matrix is symmetric", revolves around a square matrix whose entries are partitioned into blocks (submatrices). The submatrices need to be rearranged so that the resulting matrix is symmetric around its main diagonal. The author of the question asked if it were possible to do this using linear programming. I think the answer is no, but we can certainly do it with either an integer linear program or a constraint program. The author posts two examples ($4\times 4$ and $5\times 5$). I cobbled together Java code for a MIP model (using CPLEX) and a CP model (using CP Optimizer) and had no trouble solving both problems.

There are a couple of generalizations worth noting. First, while the example matrices contain integers, there is nothing preventing the two models from being used with other content types (real numbers, complex numbers, text). Some variable definitions would need modification, but conceptually both methods would work. Second, the author of the post used only one-dimensional blocks (row vectors or column vectors), but my code allows for arbitrary rectangular blocks. The code assumes that the "parent matrix" is square, but that would be easy enough to relax, so the same models (with minor adjustments) would work with arbitrary rectangular matrices.

I think that the puzzle makes a good vehicle for comparing MIP and CP applications to logic problems. In optimization problems, I suspect that MIP models often do a better job of computing objective bounds than do CP models. That is a non-issue here, since the problem is just to find a feasible solution. For logic problems and similar things (particularly scheduling), I think CP models tend to be more expressive, meaning certain types of constraints or relationships can be expressed more naturally with CP than with MIP (where the relationships turn into rather arcane and complicated adventures with binary variables). That applies here, where the CP model exploits the ability to use integer variables as subscripts of other variables.

As described in the PDF file, though, that subscripting ability has its limits. CP Optimizer will let you index a one-dimensional vector of variables using an integer variable, but not a two-dimensional array. In other words, x[y] is fine but x[y, z] is not (where x, y and z are all variables). The workaround I used is to flatten the 2-D matrix into a 1-D vector. So if the $N\times N$ matrix $x$ is flattened into the $N^2\times 1$ vector $\hat{x},$  $x_{y,z}$ becomes $\hat{x}_{N(y - 1) + z}.$
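A two-line R check of the index arithmetic (using 1-based indices, as in the formula above):

N <- 4
x <- matrix(1:(N * N), nrow = N, byrow = TRUE)   # a stand-in N x N matrix
xhat <- as.vector(t(x))                          # row-major flattening
y <- 3; z <- 2
c(x[y, z], xhat[N * (y - 1) + z])                # the two agree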

The models are a bit too verbose to squeeze into a blog post, so I wrote them up in a separate PDF file. My Java code (which requires recent versions of CPLEX and CP Optimizer) can be had from my code repository under a Creative Commons license.

Tuesday, April 18, 2023

Node Coloring

Someone posted a question on OR Stack Exchange about coloring nodes in a sensor network. The underlying problem has something to do with placing sensors, but in mathematical terms the poster wanted to color the nodes of a weighted undirected graph with a fixed number of colors, while minimizing the sum of the edge weights of edges connecting nodes of the same color.

There is an obvious MIP model for this problem, which was posted as an answer to an earlier version of the question. If $N$ and $C$ are the index sets for nodes and colors respectively, $E\subseteq N\times N$ is the set of edges, $w_{ij}$ is the objective weight of edge $(i,j)\in E$ and $x_{n,c}$ is a binary variable indicating whether node $n$ is assigned color $c$, the problem can be written as a binary quadratic program:

$$\min\ \sum_{(n,m)\in E}w_{nm}\sum_{c\in C}x_{nc}x_{mc}\\
\textrm{s.t. }\sum_{c\in C}x_{nc}  =1\quad\forall n\in N\\
\phantom{\textrm{s.t. }}x_{nc} \in\left\{ 0,1\right\} \quad\forall n\in N,c\in C.$$

 

The model can be linearized by introducing continuous variables $y_{nm}\ge 0$ to represent the objective contribution of each edge $(n,m)\in E$:

 

$$\min\ \sum_{(n,m)\in E}w_{nm} y_{nm}\\
\textrm{s.t. }\sum_{c\in C}x_{nc}=1 \quad \forall n\in N\\
\phantom{\textrm{s.t. }}y_{nm}\ge x_{nc} + x_{mc} - 1 \quad \forall (n,m)\in E, \forall c\in C\\
\phantom{\textrm{s.t. }}x_{nc} \in\left\{ 0,1\right\} \quad\forall n\in N,c\in C\\
\phantom{\textrm{s.t. }}y_{nm}\ge 0 \quad \forall (n,m)\in E.$$


There is some symmetry in the model: given any feasible solution, another solution of identical cost is obtained by permuting the colors. It is possible to add constraints to remove that symmetry.


The person posting the question was looking for alternatives to a MIP model. An easy choice for me would be a genetic algorithm. I wanted to test this in R, using the GA library for the genetic algorithm. The GA library allows three types of chromosomes: binary or real vectors, or permutations of an index vector. If I were coding my own GA, the chromosome would be a vector in $\lbrace 1,\dots,C\rbrace^N.$ Within the available chromosome types in the library, the obvious choice was to use a vector $x\in [0,C]^N$ and then set the color of node $i$ to $\lceil x_i \rceil\in \lbrace 1,\dots,C\rbrace.$ I suspect, though, that the symmetry issue described above might have negative repercussions for a GA. If two "good" solutions were effectively the same (other than a permutation of the color indices), their offspring (which would be a mix of the two color numbering methods) might be a mess.
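To make the encoding concrete, here is a stripped-down sketch of the fitness setup I have in mind (a toy random weight matrix standing in for the real data; ga() maximizes, so the same-color cost is negated):

library(GA)

N <- 50; C <- 5
W <- matrix(runif(N * N), N)
W <- (W + t(W)) / 2; diag(W) <- 0     # symmetric weights, no self-loops

fitness <- function(x) {
  colors <- pmax(ceiling(x), 1)       # decode [0, C]^N to colors 1..C
  same <- outer(colors, colors, "==") # TRUE where endpoints share a color
  -sum(W[same & upper.tri(W)])        # negate: we want to minimize this cost
}

res <- ga(type = "real-valued", fitness = fitness,
          lower = rep(0, N), upper = rep(C, N),
          popSize = 100, maxiter = 500)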


A couple of responders to the OR SE question suggested heuristics, as did I. My choice was to start with a random generation heuristic (assign colors randomly), apply an improvement heuristic (go through the nodes in random order and, for each node, reassign it to the cheapest color given the rest of the solution), and repeat (changing the order in which nodes are examined on each pass) until one pass produced no improvement. Since the heuristic is fairly fast, it can be done multiple times (new random restarts) with the best solution retained. Heuristics like this are very easy to code.
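The improvement pass itself is only a few lines of R (continuing the toy setup in the previous sketch):

improve_pass <- function(colors, W, C) {
  changed <- FALSE
  for (v in sample(length(colors))) {            # random node order
    cost <- sapply(1:C, function(k) sum(W[v, colors == k]))
    best <- which.min(cost)
    if (cost[best] < cost[colors[v]]) {          # strictly cheaper color found
      colors[v] <- best
      changed <- TRUE
    }
  }
  list(colors = colors, changed = changed)
}

# one restart: random coloring, then passes until none improves
colors <- sample(C, N, replace = TRUE)
repeat {
  res <- improve_pass(colors, W, C)
  colors <- res$colors
  if (!res$changed) break
}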


I coded all three approaches in R, using CPLEX (with the ompr and ROI libraries) for the MIP model, the GA library for the genetic algorithm, and nothing special for the improvement heuristic. I also threw in the tictoc library to do some timings. The test problem was a randomly constructed $10\times 20$ complete grid graph (200 nodes; 19,900 edges), to be colored using five colors.


To break symmetry in the MIP model, I forced the colors to be ordered so that the lowest node index receiving any color was smaller than the lowest node index receiving the next color. I let CPLEX run for five minutes using default settings. It made what I would call reasonable progress on the incumbent objective value, but the formulation is apparently very weak (at least on the test example). The lower bound barely moved, and the gap was 99.97% after five minutes.


I mostly used default parameters for the GA, although I let it run for 2,000 generations with a population of size 100. It found a much better solution (in 38 seconds) than CPLEX found in five minutes. To be fair, though, CPLEX was balancing improving the incumbent with working on the bound. Had I changed the CPLEX parameters to emphasize improving the incumbent, it likely would have done better (though likely not as well as the GA).


The improvement heuristic surprised me a bit. 600 random starts needed only 37 seconds (and I'm not the poster boy for optimized R code), and the best solution it found had an objective value about 5% better than what the GA produced. I did a small number of runs with different random seeds (but the same problem size), and the improvement heuristic consistently edged out the GA. I suppose the moral here is to try simple heuristics before going "high tech".


You are welcome to play with my R notebook, which contains both code and results.

Thursday, April 13, 2023

The rJava Curse Strikes Again

Apparently I have not needed the rJava R package in a while, because when I wanted to install an R package today that has rJava as a dependency, it was not there. So I tried to install it (this is on Linux Mint), and of course it failed to install. I have a long history of installation battles with rJava (see here, here and here in chronological order ... or better still don't traumatize yourself by reading them). Why should this time be different?

All my previous battles involved older versions of Java with apparently different directory locations or structures, and none of the previous fixes worked. After considerable aggravation, I found a very helpful post by "datawookie" that nearly got the job done. I did in fact get an error message about "jni" and used the trick in datawookie's post of setting JAVA_HOME to the correct path (in my case to Open JDK 17) as an argument to javareconf. When I then attempted to install rJava, I got a different error ("could not find -lbz2"), which prompted me to install the libbz2-dev package ("sudo apt-get install libbz2-dev"), after which R was finally able to install rJava (woo-hoo!).

I'd say this is getting ridiculous, but we passed that milestone years ago.

Monday, March 20, 2023

Annoying Message with R GA Package

This is a small technical note (mainly a reminder to myself) about using the GA package for genetic algorithms in R, specifically within an R Markdown document (such as an R notebook). The GA::ga() function includes an optional argument to turn on parallel evaluation of the fitness function. The argument value can be TRUE or FALSE (the default), which are self-explanatory, or a number of cores, or a character string ("snow" or "multicore") indicating how parallelization is to be done. The default is "multicore" except on Windows, where "snow" is apparently the only option. When I choose to turn on parallel objective evaluation, I take the easy route and just set it to TRUE.

When run in a terminal, I just get a sequence of log entries, one per generation, and then it terminates. That's also what I see when the command is run in an R Markdown document ... until the document is rendered to HTML. In the HTML output, interspersed with each line of log output is a block of four identical copies of the following.

loaded GA and set parent environment

The fact that there are four copies is presumably because my PC has four cores. Turning off the monitor option gets rid of the progress printouts but does not eliminate the repetitive messages. Switching the parallel mode to "snow" does get rid of them, but I'm not sure what the broader implications of that are.

After significant spelunking, I traced the messages to the doParallel R package, which is apparently being used to implement parallel processing under the "multicore" option. In any event, the way to get rid of them is to add the option "message = FALSE" to the R chunk in which ga() is being executed. Barring other chunk options, this would change {r} to {r, message = FALSE} for that one chunk. (You can also set it as a global option in the document.) Happily, the monitor output (stats for each generation) still prints out. It just gets rid of those <expletive deleted> parallel processing messages.

Tuesday, March 7, 2023

Another GA for Clustering

Someone asked on OR Stack Exchange about clustering nodes, a subject that comes up from time to time. The premise is that you have $N$ points (nodes), and for each pair $i \neq j$ there is (or may be) an edge with weight $w_{i,j} \in [0, 1]$ (where $w_{i,j} = 0$ if there is no edge between $i$ and $j$). The objective is to create exactly $G$ groups (clusters), with no group having cardinality greater than $M,$ such that the sum of within-group edge weights is maximized. It is simple to write an integer programming model for the problem, and the question includes one, but the author was looking for something that did not require an IP solver. The question specifies, as an example, dimensions of $N=500$ points, $G=8$ groups, and a maximum group size of $M=70.$

Since I have suggested genetic algorithms for previous clustering problems (see here and here), it probably will not shock you that I once again looked for a GA approach, using a "random key" algorithm (where the chromosome gets decoded into a feasible solution). In this case, the chromosome consists of $N+G$ values between 0 and 1. The first $G$ values are used to select the sizes $n_1,\dots, n_G$ of the groups. The last $N$ values are converted into a permutation of the point indices $1,\dots, N.$ The first $n_1$ entries in the permutation are the indices of the points in group 1, the next $n_2$ entries give the points in group 2, and so on.

The tricky part here is converting the first portion of the chromosome into the group sizes. We know that the maximum group size is $M,$ and it is easy to deduce that the minimum group size is $m = N - (G-1)M.$ So the minimum and maximum fractions of the population to assign to any group are $a=m/N$ and $b=M/N,$ where $$m = N - (G-1)M \implies a = 1 - (G-1)b.$$ Now suppose that I have values $\pi_1, \dots,\pi_G \in (a,b)$ and assign group sizes $n_i = \pi_i \cdot N,$ which clearly meet the minimum and maximum size limits. (I'm cheating here, but I'll fix it in a minute.) We want $\sum_{i=1}^G n_i = N,$ which means we need $\sum_i \pi_i = 1.$

To get the $\pi_i,$ we take a rather roundabout route. Assume that the first $G$ entries in the chromosome are $x_1,\dots,x_G \in (0,1).$ Set $$y_i = 1 - \frac{x_i}{\sum_{j=1}^G x_j} \in (0,1)$$ and observe that $\sum_{i=1}^G y_i = G-1.$ Finally, set $\pi_i = a + (b-a) y_i \in (a, b).$ We have $$\sum_i \pi_i = G\cdot a + (b-a) \sum_i y_i\\ = G\cdot a + (b-a)(G-1) = (G-1)b + a = 1.$$

Now to confess to my cheating. I said the $i$-th group size would be $n_i=\pi_i \cdot N,$ ignoring the minor problem that the left side is an integer and the right side almost surely is not. So of course we will round $\pi_i \cdot N$ to get $n_i.$ After rounding, though, we can no longer be sure that $\sum_i n_i = N.$ So we iteratively fix it. If $\sum_i n_i < N,$ we add 1 to the smallest $n_i$ and recheck. if $\sum_i n_i > N,$ we subtract 1 from the largest $n_i$ and recheck. Eventually, we end up with a feasible set of group sizes.
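In R, the whole decoding of the first $G$ genes might look like this (a sketch, not the notebook's exact code):

decode_sizes <- function(x, N, G, M) {
  m <- N - (G - 1) * M            # smallest feasible group size
  a <- m / N; b <- M / N
  y <- 1 - x / sum(x)             # each in (0,1), summing to G - 1
  sizes <- round((a + (b - a) * y) * N)
  while (sum(sizes) < N) {        # rounding repair: grow the smallest
    i <- which.min(sizes); sizes[i] <- sizes[i] + 1
  }
  while (sum(sizes) > N) {        # ... or shrink the largest
    i <- which.max(sizes); sizes[i] <- sizes[i] - 1
  }
  sizes
}

decode_sizes(runif(8), N = 500, G = 8, M = 70)  # the question's dimensions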

All this is encoded in an R notebook I wrote, including a sample problem. Whether the GA gets a "good" (near optimal) partition or not I cannot say, since I did not write the equivalent MIP model. I can say, though, that it gets a sequence of progressively improving partitions.

Monday, January 30, 2023

A Random Tree Generator

For a problem I was noodling with in R, I needed to generate random trees (meaning connected, acyclic, layered, undirected graphs, not classification trees or anything statistical). There are a bunch of libraries for R that have various capabilities for constructing and/or massaging graphs and networks (see the CRAN Task View for graphical models). After spending some time poking through the docs for some of the libraries listed in the task view, and unsuccessfully trying to install one, I decided it would be faster just to roll my own code.

Along the way I succumbed to the software developer's maxim: if it works, it needs more features. Still, I wound up with a function that is not horribly complicated and seems to work. Given how many nodes and how many layers you want in the tree, it outputs a matrix of edges forming the desired tree. There are optional arguments for how you want the nodes labeled (the default is 1, 2, ... but you can supply a vector of labels) and in what order the labels should be assigned (top to bottom/left to right raster scan of the graph or randomly), as well as how you want the output matrix organized (randomly, by label order, or by layer order).
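The heart of the function is not much more than the following (a pared-down sketch without the labeling and output-ordering options): each node gets a layer, with layer 1 reserved for the root and every layer guaranteed at least one node, and each node below the root is attached to a random node one layer up.

random_tree <- function(n_nodes, n_layers) {
  stopifnot(n_layers >= 2, n_layers <= n_nodes)
  # node 1 is the root (layer 1); leftover nodes land in random lower layers
  layer <- sort(c(1:n_layers,
                  1 + sample.int(n_layers - 1, n_nodes - n_layers,
                                 replace = TRUE)))
  edges <- t(sapply(2:n_nodes, function(v) {
    parents <- which(layer == layer[v] - 1)        # nodes one layer up
    c(parents[sample.int(length(parents), 1)], v)  # random parent for v
  }))
  edges   # (n_nodes - 1) x 2 matrix of edges
}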

The code is in an R notebook that demonstrates the use of the function. It imports the DiagrammeR library in order to plot the resulting trees, but the function does not require DiagrammeR and in fact has no dependencies. You can view the notebook here, and if you want you can download the code from it (see the select box in the upper right of the notebook). The code is licensed under the Creative Commons Attribution 3.0 Unported license.

Friday, January 27, 2023

Weird Caja Bug

In the past day or so I've tripped over a rather annoying bug in Caja, the file manager for the MATE desktop (which I use with Linux Mint, and which is also used by many Ubuntu users). Theoretically, the F2 key is a shortcut for Edit > Rename..., which lets you rename selected files. Usually, it works fine, but today (and at least once in the recent past) the F2 key had no effect.

At first I thought there was a problem with my keyboard, but no, the F2 key was being recognized. A Google search led me to a bug report from 2018 (!) where a diligent soul with screen name "terzag" discovered that the problem occurred when the Caja window displayed a vertical scrollbar and disappeared when the window was resized large enough to lose the scrollbar. Sure enough, that fix worked for me today ... and then things got weirder.

I did a little experimenting, and the bug reappeared when the scrollbar reappeared and disappeared whenever the scrollbar disappeared. Then I tried F2 with multiple file icons selected (which opens a dialog to bulk rename them), and it worked, despite there being a visible scrollbar. Since then, no matter how many times I close and reopen Caja, the bug does not manifest. So it is temporarily fixed, I guess, with emphasis on "temporarily".

I also encountered a possibly related bug, again reported years ago. If I hit either F2 or Edit > Rename... without first selecting one or more files, it closes (crashes?) all open Caja windows. I'm pretty sure that's a bug and not a "feature".

Hopefully this all gets fixed relatively soon ... although looking at bug reports from almost five years ago does not make me feel sanguine about the prospects.

Finding All Simple Cycles (III)

This is the third and (hopefully) final post about finding all simple cycles in an undirected graph. In the first post, I discussed the problem briefly and described a simple, efficient iterative algorithm. In the second post, I posed (but definitely did not recommend) a mixed integer linear program (MILP) to find all simple cycles. Here I will discuss an alternative MILP approach, specific to CPLEX. To make sense of it, you will definitely need to have read the second post. Java code demonstrating all three methods is available from my repository.

Recent versions of CPLEX have a feature called the solution pool, about which I have previously written. One use is to accumulate solutions (including suboptimal ones if desired) encountered along the way to the optimal solution of a MILP model. Another, using the IloCplex.populate() method in the Java API, is to just generate a gaggle of feasible solutions. So the approach I propose here is to build a MILP model for finding a single simple cycle, set the solution pool population limit parameter (the maximum number of solutions to retain) to something really large and the solution pool intensity parameter (how hard CPLEX works to churn out solutions) to its maximum value, and then use the populate() method to find simple cycles. This is implemented in the SingleCycleMIP.java class in my code.

Actually, I'm not sure setting the intensity parameter is necessary. On the small test graph, results were similar no matter what intensity setting I used. CPLEX stopped with 160 solutions, many of which were duplicates. After removing duplicates, there were 13 distinct simple cycles, which matches what the other methods found. The only difference was run time, which was a fraction of a second using either the default intensity or the maximum intensity (4) and about two seconds using the minimum intensity (1). Either way, it was much faster than the full MILP model, at least on this small test graph.

The MILP model used here is a subset of the model in the previous post. The model has no objective value (or equivalently minimizes the constant zero function). We drop the $u$ variables (which were used to ensure that cycles were distinct) and the $x$ variables (which signaled whether a slot in the solution was filled with a cycle or not). For all other variables, we drop the $c$ subscript (which slot is being filled), since we are only building one cycle. The "one anchor per cycle" constraint becomes $\sum_i w_i = 1,$ and other constraints involving $u$ and/or $x$ are dropped.

One issue with this model is that the same cycle can be discovered multiple times, either because its orientation is reversed (so the "clockwise" and "counterclockwise" versions are considered distinct) or possibly because minor differences (rounding error) in the flow variables make two copies of the same solution look distinct. Regardless of the reason, the fact that CPLEX terminated with 160 solutions when only 13 cycles exist tells you duplication is happening. Fortunately, we do not have to monkey with the model (or set overly finicky tolerance values); we can just filter the pool after the run and weed out the duplicates (which my code does). The alternative would be to use a callback (either an incumbent callback if you are using legacy callbacks, or the more current generic callback in "candidate" context) to weed out duplicates as they arise. Since the logic to be used in the callback is the same as the logic used in post-processing, I cannot see any virtue to using a callback, and it definitely would complicate the code.

With that I am hopefully done with this topic. :-)


Thursday, January 26, 2023

Finding All Simple Cycles (II)

Yesterday's post described my introduction (by way of a question on OR Stack Exchange) to the problem of how to find all simple cycles in a graph. In that post I described a simple, efficient iterative algorithm for finding them.

The OR SE question specifically asked about a mixed integer linear program (MILP) to find all simple cycles. While I would never take that approach myself, I got curious as to whether it was possible to construct such a model. Down the rabbit hole I went. The answer turns out to be yes. In this post, I'll describe the model (again, without advocating that it be used).

Assume that we have an undirected graph with vertices numbered $1,\dots,N$ and edge set $E.$ In what follows, $i$ and $j$ will always be vertex numbers and $c$ and $c'$ will index cycles. Although the graph is undirected, when we construct cycles we will give them an orientation (e.g., $3 \rightarrow 2 \rightarrow 5 \rightarrow 3$), which will be useful in ensuring that the result is an actual cycle (you can get from the starting node back to itself) and not the composition of multiple smaller cycles (e.g., not treating $3 \rightarrow 2 \rightarrow 4 \rightarrow 3$ and $5 \rightarrow 6 \rightarrow 9 \rightarrow 8 \rightarrow 5$ as if they form a single cycle).

The model depends on guessing a non-binding upper bound $C$ on the number of simple cycles in the graph. (If the solution to the model does contain $C$ cycles, your guess may have been too small. To be certain that all cycles were found, you would need to bump up $C$ and solve again, until the optimal solution found fewer than $C$ cycles.)

The model contains the following gazillion binary variables:

  • $x_{c}$ is an indicator for whether cycle $c\in\left\{ 1,\dots,C\right\} $
    is constructed;
  • $y_{i,c}$ is an indicator for whether vertex $i$ belongs to cycle $c$;
  • $z_{i,j,c}$ is an indicator for whether edge $[i,j]\in E$ is part of
    cycle $c$ with the tour of $c$ proceeding from $i$ to $j$;
  • $w_{i,c}$ is an indicator for whether vertex $i$ is the anchor (starting
    and ending point) for cycle $c$; and
  • $u_{i,j,c,c'}$ is an indicator that edge $[i,j]\in E$ (in either
    orientation) belongs to exactly one of cycles $c$ and $c'\neq c.$

We also have one set of variables that can be defined as either continuous or integer (continuous probably makes the model solve faster):

  • $f_{i,j,c}\ge 0$ is the "flow" on edge $[i,j]$ in the orientation $i\rightarrow j$ in cycle $c.$

The flow variables are an adaptation of the variables used in the Miller-Tucker-Zemlin formulation for the traveling salesman problem (TSP). We will use them to ensure that cycles found are in fact cycles.

The objective, at least, is straightforward. Since we want to find all possible simple cycles, we maximize the number of cycles constructed: $\max\sum_{c=1}^C x_{c}.$ Now come the constraints. (Brace yourself for some serious vertical scrolling.)

  • Edges can only belong to cycles containing both endpoints, and only
    in one orientation: \begin{align*}
    z_{i,j,c}+z_{j,i,c} & \le y_{i,c}\quad\forall[i,j]\in E,\forall c\\
    z_{i,j,c}+z_{j,i,c} & \le y_{j,c}\quad\forall[i,j]\in E,\forall c
    \end{align*}
  • Flow only occurs on edges that are used, and only in the orientation
    used: \begin{align*}
    f_{i,j,c} & \le Nz_{i,j,c}\quad\forall[i,j]\in E,\forall c\\
    f_{j,i,c} & \le Nz_{j,i,c}\quad\forall[i,j]\in E,\forall c
    \end{align*}
  • Constructed cycles have one anchor and unused cycles have none: \[
    \sum_{i=1}^N w_{i,c}=x_{c}\quad\forall c
    \]
  • A cycle's anchor must belong to the cycle: \[
    w_{i,c}\le y_{i,c}\quad\forall i,\forall c
    \]
  • In every cycle, at every node, there are either two incident edges
    (if the node belongs to the cycle) or none (if the node is not part
    of the cycle): \[
    \sum_{j:[i,j]\in E}(z_{i,j,c}+z_{j,i,c})=2y_{i,c}\quad\forall i,\forall c
    \]
  • In any used cycle, at any node in the cycle other than the anchor, flow out is at least one unit less than flow in (mimicking the MTZ formulation of the TSP): \[
    \sum_{j:[i,j]\in E}f_{i,j,c}\le\sum_{j:[i,j]\in E}f_{j,i,c}-y_{i,c}+Nw_{i,c}\quad\forall i,\forall c
    \]
  • To ensure a complete cycle, one unit of flow must return to the anchor node: \[
    \sum_{j:[i,j]\in E}f_{j,i,c}\ge w_{i,c}\quad\forall i,\forall c
    \]
  • The $u$ variables are defined in terms of the $z$ variables: \[
    u_{i,j,c,c'}\le z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\quad\forall[i,j]\in E,\forall c<c'
    \] \[
    u_{i,j,c,c'}\le2-\left(z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\right)\quad\forall[i,j]\in E,\forall c<c'
    \]
  • Any pair of cycles constructed must differ in at least one edge (which
    also prevents a cycle from repeating with the opposite orientation): \[
    \sum_{[i,j]\in E}u_{i,j,c,c'}\ge x_{c}+x_{c'}-1\quad\forall c<c'
    \]

It is also possible to add constraints to mitigate some of the symmetry in the model. Two come to mind:

  • Unused cycles must have higher indices than used cycles: \[
    x_{c}\le x_{c-1}\quad\forall c\in\left\{ 2,\dots,C\right\}
    \]
  • The anchor in a cycle must have the smallest index of any node
    in the cycle: \[
    w_{i,c}+y_{j,c}\le1\quad\forall i>j,\forall c
    \]

The first one is of low enough dimension that I automatically included it in my Java code. I made the second one optional in the code.

When I ran it against the 9 node, 12 edge graph in the OR SE question, CPLEX had a solution with 13 cycles (the full number) after about 25 seconds and 5,000 nodes -- way, way more time than the 13 milliseconds the iterative method needed. I set $C=20$ as the upper bound on the number of cycles, which of course is also the initial upper bound on the objective. After 60 seconds (the time limit I set), CPLEX's upper bound was still 20. So probable optimality will not be easy to come by.

I've got one more MILP model up my sleeve, coming in the next post. Meanwhile, as a reminder, my Java code is available here.

Wednesday, January 25, 2023

Finding All Simple Cycles (I)

This is the first of a sequence of posts related to the question of finding all simple cycles in an undirected graph. Motivation for the posts comes from a question on OR Stack Exchange. A simple cycle is a cycle with no repeated nodes (other than the starting/ending node).

Assume we have an undirected graph with nodes indexed from 1 to $N$ and with edge set $E.$ I will use the term "anchor" to refer to the node at which a tour of a simple cycle starts (and ends), and require (without loss of generality) that the anchor of every cycle be the lowest index node in the cycle. Since a cycle is the same whether you traverse it "clockwise" or "counterclockwise", I will also require without loss of generality that nodes in a cycle be listed so that the node immediately after the anchor has smaller index than the node immediately before the anchor. This is just to reduce production of duplicate cycles.

I came up with three approaches to finding all simple cycles. The first, described below, is a simple iterative scheme and by far the fastest. The other two, described in subsequent posts, use mixed integer linear programs (MILPs) -- not because they are efficient but because the OR SE question specifically asked about MILP approaches.

The iterative scheme is pretty simple. It uses a queue of paths (incomplete cycles) to store work in progress. I will use "terminus" to indicate the last node in a path. When the terminus equals the anchor, we have a cycle. Due to my definition of "anchor", we can say that no node on a path can have a lower index than the anchor of the path.

The iterative scheme loops over possible anchors from 1 to $N.$ For each anchor, we initialize the queue with all paths consisting of a single edge incident at the anchor and not incident at any node less than the anchor. While the queue is not empty, we pop a path and create one or more new paths by extending that path with each qualifying edge. A qualifying edge is one that meets the following conditions:

  • it is incident at the terminus of the current path;
  • it is not incident at any node already on the path (other than anchor);
  • it is not incident at any node less than the anchor; and
  • if it is incident at the anchor, it is not the same as the first edge of the path (i.e., we do not allow a "cycle" of the form $a\rightarrow b \rightarrow a$).

If the new edge leads to the anchor, we have a cycle. If not, we update the terminus and add the extended path to the queue, for further processing later. If a path cannot be extended because it is not a cycle and there are no qualifying edges left, we discard it.

Coding the algorithm is pretty straightforward, and it should work on any undirected graph (even those that are not connected). On the example from the OR SE post (9 nodes, 12 edges) it took a whopping 11 milliseconds to find all 13 simple cycles.
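For the curious, the scheme fits comfortably in a page of R (a sketch of the same logic; the tested implementation is the Java CycleFinder class mentioned below):

# edges: two-column matrix of node indices for a simple undirected graph
find_simple_cycles <- function(edges) {
  cycles <- list()
  for (anchor in 1:max(edges)) {
    # seed the queue with single-edge paths from the anchor that avoid
    # nodes with indices below the anchor
    nbrs <- c(edges[edges[, 1] == anchor, 2], edges[edges[, 2] == anchor, 1])
    queue <- lapply(nbrs[nbrs > anchor], function(v) c(anchor, v))
    while (length(queue) > 0) {
      path <- queue[[1]]; queue <- queue[-1]
      term <- path[length(path)]
      for (v in c(edges[edges[, 1] == term, 2],
                  edges[edges[, 2] == term, 1])) {
        if (v == anchor && length(path) > 2) {
          # cycle closed; keep one orientation only (second node < last)
          if (path[2] < term) cycles[[length(cycles) + 1]] <- path
        } else if (v > anchor && !(v %in% path)) {
          queue[[length(queue) + 1]] <- c(path, v)  # extend and requeue
        }
      }
    }
  }
  cycles
}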

Java code for this and the two MILP models (the latter requiring a recent version of CPLEX) can be found in my GitLab repository. The iterative algorithm is in the CycleFinder class. Stay tuned for the MILP models.