Monday, January 30, 2023

A Random Tree Generator

For a problem I was noodling with in R, I needed to generate random trees (meaning connected, acyclic, layered, undirected graphs, not classification trees or anything statistical). There are a bunch of libraries for R that have various capabilities for constructing and or massaging graphs and networks (see the CRAN Task View for graphical models). After spending some time poking through the docs for some of the libraries listed in the task view, and unsuccessfully trying to install one, I decided it would be faster just to roll my own code.

Along the way I succumbed to the software developer's maxim: if it works, it needs more features. Still, I wound up with a function that is not horribly complicated and seems to work. Given how many nodes and how many layers you want in the tree, it outputs a matrix of edges forming the desired tree. There are optional arguments for how you want the nodes labeled (the default is 1, 2, ... but you can supply a vector of labels) and in what order the labels should be assigned (top to bottom/left to right raster scan of the graph or randomly), as well as how you want the output matrix organized (randomly, by label order, or by layer order).

The code is in an R notebook that demonstrates the use of the function. It imports the DiagrammeR library in order to plot the resulting trees, but the function does not require DiagrammeR and in fact has no dependencies. You can view the notebook here, and if you want you can download the code from it (see the select box in the upper right of the notebook). The code is licensed under the Creative Commons Attribution 3.0 Unported license.

Friday, January 27, 2023

Weird Caja Bug

In the past day or so I've tripped over a rather annoying bug in Caja, the file manager for the MATE desktop (which I use with Linux Mint, and which is also used by many Ubuntu users). Theoretically, the F2 key is a shortcut for Edit > Rename..., which lets you rename selected files. Usually, it works fine, but today (and at least once in the recent past) the F2 key had no effect.

At first I thought there was a problem with my keyboard, but no, the F2 key was being recognized. A Google search led me to a bug report from 2018 (!) where a diligent soul with screen name "terzag" discovered that the problem occurred when the Caja window displayed a vertical scrollbar and disappeared when the window was resized large enough to lose the scrollbar. Sure enough, that fix worked for me today ... and then things got weirder.

I did a little experimenting, and the bug reappeared when the scrollbar reappeared and disappeared whenever the scrollbar disappeared. Then I tried F2 with multiple file icons selected (which opens a dialog to bulk rename them), and it worked, despite there being a visible scrollbar. Since then, no matter how many times I close and reopen Caja, the bug does not manifest. So it is temporarily fixed, I guess, with emphasis on "temporarily".

I also encountered a possibly related bug, again reported years ago. If I hit either F2 or Edit > Rename... without first selecting one or more files, it closes (crashes?) all open Caja windows. I'm pretty sure that's a bug and not a "feature".

Hopefully this all gets fixed relatively soon ... although looking at bug reports from almost five years ago does not make me feel sanguine about the prospects.

Finding All Simple Cycles (III)

This is the third and (hopefully) final post about finding all simple cycles in an undirected graph. In the first post, I discussed the problem briefly and described a simple, efficient iterative algorithm. In the second post, I posed (but definitely did not recommend) a mixed integer linear program (MILP) to find all simple cycles. Here I will discuss an alternative MILP approach, specific to CPLEX. To make sense of it, you will definitely need to have read the second post. Java code demonstrating all three methods is available from my repository.

Recent versions of CPLEX have a feature called the solution pool, about which I have previous written. One use is to accumulate solutions (including suboptimal ones if desired) encountered along the way to the optimal solution of a MILP model. Another, using the IloCplex.populate() method in the Java API, is to just generate a gaggle of feasible solutions. So the approach I propose here is to build a MILP model for finding a single simple cycle, set the solution pool population limit parameter (the maximum number of solutions to retain) to something really large and the solution pool intensity parameter (how hard CPLEX works to churn out solutions) to its maximum value, and then use the populate() method to find simple cycles. This is implemented in the class in my code. 

Actually, I'm not sure setting the intensity parameter is necessary. On the small test graph, results were similar no matter what intensity setting I used. CPLEX stopped with 160 solutions, many of which were duplicates. After removing duplicates, there were 13 distinct simple cycles, which matches what the other methods found. The only difference was run time, which was a fraction of a second using either the default intensity or the maximum intensity (4) and about two seconds using the minimum intensity (1). Either way, it was much faster than the full MILP model, at least on this small test graph.

The MILP model used here is a subset of the model in the previous post. The model has no objective value (or equivalently minimizes the constant zero function). We drop the $u$ variables (which were used to ensure that cycles were distinct) and the $x$ variables (which signaled whether a slot in the solution was filled with a cycle or not). For all other variables, we drop the $c$ subscript (which slot is being filled), since we are only building one cycle. The "one anchor per cycle" constraint becomes $\sum_i w_i = 1,$ and other constraints involving $u$ and/or $x$ are dropped.

One issue with this model is that the same cycle can be discovered multiple times, either because its orientation is reversed (so the "clockwise" and "counterclockwise" versions are considered distinct) or possibly because minor difference (rounding error) in the flow variables make two copies of the same solution look distinct. Regardless of the reason, the fact CPLEX terminated with 160 solutions when only 13 cycles exist tells you duplication is happening. Fortunately, we do not have to monkey with the model (or set overly finicky values for tolerance values); we can just filter the pool after the run and weed out the duplicates (which my code does). The alternative would be to use a callback (either an incumbent callback if you are using legacy callbacks, or the more current generic callback in "candidate" context) to weed out duplicates as they arise. Since the logic to be used in the callback is the same as the logic used in post-processing, I cannot see any virtue to using a callback, and it definitely would complicate the code.

With that I am hopefully done with this topic. :-)

Thursday, January 26, 2023

Finding All Simple Cycles (II)

Yesterday's post described my introduction (by way of a question on OR Stack Exchange) to the problem of how to find all simple cycles in a graph. In that post I described a simple, efficient iterative algorithm for finding them.

The OR SE question specifically asked about a mixed integer linear program (MILP) to find all simple cycles. While I would never take that approach myself, I got curious as to whether it was possible to construct such a model. Down the rabbit hole I went. The answer turns out to be yes. In this post, I'll describe the model (again, without advocating that it be used).

Assume that we have an undirected graph with vertices numbered $1,\dots,N$ and edge set $E.$ In what follows, $i$ and $j$ will always be vertex numbers and $c$ and $c'$ will index cycles. Although the graph is undirected, when we construct cycles we will give them an orientation (e.g., $3 \rightarrow 2 \rightarrow 5 \rightarrow 3$), which will be useful in ensuring that the result is an actual cycle (you can get from the starting node back to itself) and not the composition of multiple smaller cycles (e.g., not treating $3 \rightarrow 2 \rightarrow 4 \rightarrow 3$ and $5 \rightarrow 6 \rightarrow 9 \rightarrow 8 \rightarrow 5$ as if they form a single cycle).

The model depends on guessing a non-binding upper bound $C$ on the number of simple cycles in the graph. (If the solution to the model does contain $C$ cycles, your guess may have been too small. To be certain that all cycles were found, you would need to bump up $C$ and solve again, until the optimal solution found fewer than $C$ cycles.)

The model contains the following gazillion binary variables:

  • $x_{c}$ is an indicator for whether cycle $c\in\left\{ 1,\dots,C\right\} $
    is constructed;
  • $y_{i,c}$ is an indicator for whether vertex $i$ belongs to cycle $c$;
  • $z_{i,j,c}$ is an indicator for whether edge $[i,j]\in E$ is part of
    cycle $c$ with the tour of $c$ proceeding from $i$ to $j$;
  • $w_{i,c}$ is an indicator for whether vertex $i$ is the anchor (starting
    and ending point) for cycle $c$; and
  • $u_{i,j,c,c'}$ is an indicator that edge $[i,j]\in E$ (in either
    orientation) belongs to exactly one of cycles $c$ and $c'\neq c.$

We also have one set of variables that can be defined as either continuous or integer (continuous probably makes the model solve faster):

  • $f_{i,j,c}\ge 0$ is the ``flow'' on edge $[i,j]$ in the orientation $i\rightarrow j$ in cycle $c.$

The flow variables are an adaptation of the variables used in the Miller-Tucker-Zemlin formulation for the traveling salesman problem (TSP). We will use them to ensure that cycles found are in fact cycles.

The objective, at least, is straightforward. Since we want to find all possible simple cycles, we maximize the number of cycles constructed: $\max\sum_{c=1}^C x_{c}.$ Now come the constraints. (Brace yourself for some serious vertical scrolling.)

  • Edges can only belong to cycles containing both endpoints, and only
    in one orientation: \begin{align*}
    z_{i,j,c}+z_{j,i,c} & \le y_{i,c}\quad\forall[i,j]\in E,\forall c\\
    z_{i,j,c}+z_{j,i,c} & \le y_{j,c}\quad\forall[i,j]\in E,\forall c
  • Flow only occurs on edges that are used, and only in the orientation
    used: \begin{align*}
    f_{i,j,c} & \le Nz_{i,j,c}\quad\forall[i,j]\in E,\forall c\\
    f_{j,i,c} & \le Nz_{j,i,c}\quad\forall[i,j]\in E,\forall c
  • Constructed cycles have one anchor and unused cycles have none: \[
    \sum_{i=1}^N w_{i,c}=x_{c}\quad\forall c
  • A cycle's anchor must belong to the cycle: \[
    w_{i,c}\le y_{i,c}\quad\forall i,\forall c
  • In every cycle, at every node, there are either two incident edges
    (if the node belongs to the cycle) or none (if the node is not part
    of the cycle: \[
    \sum_{j:[i,j]\in E}(z_{i,j,c}+z_{j,i,c})=2y_{i,c}\quad\forall i,\forall c
  • In any used cycle, at any node in the cycle other than the anchor, flow out is at least one unit less than flow in (mimicking the MTZ formulation of the TSP): \[
    \sum_{j:[i,j]\in E}f_{i,j,c}\le\sum_{j:[i,j]\in E}f_{j,i,c}-y_{i,c}+Nw_{i,c}\quad\forall i,\forall c
  • To ensure a complete cycle, one unit of flow must return to the anchor node: \[
    \sum_{j:[i,j]\in E}f_{j,i,c}\ge w_{i,c}\quad\forall i,\forall c
  • The $u$ variables are defined in terms of the $z$ variables: \[
    u_{i,j,c,c'}\le z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\quad\forall[i,j]\in E,\forall c<c'
    \] \[
    u_{i,j,c,c'}\le2-\left(z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\right)\quad\forall[i,j]\in E,\forall c<c'
  • Any pair of cycles constructed must differ in at least one edge (which
    also prevents a cycle from repeating with the opposite orientation): \[
    \sum_{[i,j]\in E}u_{i,j,c,c'}\ge x_{c}+x_{c'}-1\quad\forall c<c'

It is also possible to add constraints to mitigate some of the symmetry in the model. Two come to mind:

  • Unused cycles must have higher indices than used cycles: \[
    x_{c}\le x_{c-1}\quad\forall c\in\left\{ 2,\dots,C\right\}
  • The anchor in a cycle must have the smallest index of any node
    in the cycle: \[
    w_{i,c}+y_{j,c}\le1\quad\forall i>j,\forall c

The first one is of low enough dimension that I automatically included it in my Java code. I made the second one optional in the code.

When I ran it against the 9 node, 12 edge graph in the OR SE question, CPLEX had a solution with 13 cycles (the full number) after about 25 seconds and 5,000 nodes -- way, way more time than the 13 milliseconds the iterative method needed. I set $C=20$ as the upper bound on the number of cycles, which of course is also the initial upper bound on the objective. After 60 seconds (the time limit I set), CPLEX's upper bound was still 20. So probable optimality will not be easy to come by.

I've got one more MILP model up my sleeve, coming in the next post. Meanwhile, as a reminder, my Java code is available here.

Wednesday, January 25, 2023

Finding All Simple Cycles (I)

This is the first of a sequence of posts related to the question of finding all simple cycles in an undirected graph. Motivation for the posts comes from a question on OR Stack Exchange. A simple cycle is a cycle with no repeated nodes (other than the starting/ending node).

Assume we have an undirected graph with nodes indexed from 1 to $N$ and with edge set $E.$ I will use the term "anchor" to refer to the node at which a tour of a simple cycle starts (and ends), and require (without loss of generality) that the anchor of every cycle be the lowest index node in the cycle. Since a cycle is the same whether you traverse it "clockwise" or "counterclockwise", I will also require without loss of generality that nodes in a cycle be listed so that the node immediately after the anchor has smaller index than the node immediately before the anchor. This is just to reduce production of duplicate cycles.

I came up with three approaches to finding all simple cycles. The first, which I will describe below, is a simple iterative scheme that is the fastest. The other two, described in subsequent posts, use mixed integer linear programs (MILPs) -- not because they are efficient but because the OR SE question specifically asked about MILP approaches. I will describe those in subsequent posts.

The iterative scheme is pretty simple. It uses a queue of paths (incomplete cycles) to store work in progress. I will use "terminus" to indicate the last node in a path. When the terminus equals the anchor, we have a cycle. Due to my definition of "anchor", we can say that no node on a path can have a lower index than the anchor of the path.

The iterative scheme loops over possible anchors from 1 to $N.$ For each anchor, we initialize the queue with all paths consisting of a single edge incident at the anchor and not incident at any node less than the anchor. While the queue is not empty, we pop a path and create one or more new paths by extending that path with each qualifying edge. A qualifying edge is one that meets the following conditions:

  • it is incident at the terminus of the current path;
  • it is not incident at any node already on the path (other than anchor);
  • it is not incident at any node less than the anchor; and
  • if it is incident at the anchor, it is not the same as the first edge of the path (i.e., we do not allow a "cycle" of the form $a\rightarrow b \rightarrow a$).

If the new edge leads to the anchor, we have a cycle. If not, we update the terminus and add the extended path to the queue, for further processing later. If a path cannot be extended because it is not a cycle and there are no qualifying edges left, we discard it.

Coding the algorithm is pretty straightforward, and it should work on any undirected graph (even those that are not connected). On the example from the OR SE post (9 nodes, 12 edges) it took a whopping 11 milliseconds to find all 13 simple cycles.

Java code for this and the two MILP models (the latter requiring a recent version of CPLEX) can be found in my GitLab repository. The iterative algorithm is in the CycleFinder class. Stay tuned for the MILP models.