Monday, March 20, 2023

Annoying Message with R GA Package

This is a small technical note (mainly a reminder to myself) about using the GA package for genetic algorithms in R, specifically within an R Markdown document (such as an R notebook). The GA::ga() function includes an optional argument to turn on parallel evaluation of the fitness function. The argument value can be true or false (default), which are self-explanatory, or a number of cores, or a character string ("snow" or "multicore") for how parallelization is to be done. The default is "multicore" except on Windows, where "snow" is apparently the only option. When I choose to turn on parallel objective evaluation, I take the easy route and just set it to TRUE.

When run in a terminal, I just get a sequence of log entries, one per generation, and then it terminates. That's also what I see when the command is run in an R Markdown document ... until the document is rendered to HTML. In the HTML output, interspersed with each line of log output is a block of four identical copies of the following.

loaded GA and set parent environment

The fact that there are four copies is presumably because my PC has four cores. Turning off the monitor option gets rid of the progress printouts but does not eliminate the repetitive messages. Switching the parallel mode to "snow" does get rid of them, but I'm not sure what the broader implications of that are.

After significant spelunking, I traced the messages to the doParallel R package, which is apparently being used to implement parallel processing under the "multicore" option. In any event, the way to get rid of them is to add the option "message = FALSE" to the R chunk in which ga() is being executed. Barring other chunk options, this would change {r} to {r, message = FALSE} for that one chunk. (You can also set it as a global option in the document.) Happily, the monitor output (stats for each generation) still prints out. It just gets rid of those <expletive deleted> parallel processing messages.

Tuesday, March 7, 2023

Another GA for Clustering

Someone asked on OR Stack Exchange about clustering nodes, a subject that comes up from time to time. The premise is that you have $N$ points (nodes), and for each pair $i \neq j$ there is (or may be) an edge with weight $w_{i,j} \in [0, 1]$ (where $w_{i,j} = 0$ if there is no edge between $i$ and $j$). The objective is to create exactly $G$ groups (clusters), with no group having cardinality greater than $M,$ such that the sum of within-group edge weights is maximized. It is simple to write an integer programming model for the problem, and the question includes one, but the author was looking for something that did not require an IP solver. The question specifies, as an example, dimensions of $N=500$ points, $G=8$ groups, and a maximum group size of $M=70.$

Since I have suggest genetic algorithms for previous clustering problems (see here and here), it probably will not shock you that I once again looked for a GA approach, using a "random key" algorithm (where the chromosome gets decoded into a feasible solution. In this case, the chromosome consists of $N+G$ values between 0 and 1. The first $G$ values are used to select the sizes $n_1,\dots, n_G$ of the groups. The last $N$ values are converted into a permutation of the point indices $1,\dots, N.$ The first $n_1$ entries in the permutation are the indices of the points in group 1, the next $n_2$ entries give the points in group 2, and so on.

The tricky part here is converting the first portion of the chromosome into the group sizes. We know that the maximum group size is $M,$ and it is easy to deduce that the minimum group size is $m = N - (G-1)M.$ So the minimum and maximum fractions of the population to assign to any group are $a=m/N$ and $b=M/N,$ where $$m = N - (G-1)M \implies a = 1 - (G-1)b.$$ Now suppose that I have values $\pi_1, \dots,\pi_G \in (a,b)$ and assign group sizes $n_i = \pi_i \cdot N,$ which clearly meet the minimum and maximum size limits. (I'm cheating here, but I'll fix it in a minute.) We want $\sum_{i=1}^G n_i = N,$ which means we need $\sum_i \pi_i = 1.$

To get the $\pi_i,$ we take a rather roundabout route. Assume that the first $G$ entries in the chromosome are $x_1,\dots,x_G \in (0,1).$ Set $$y_i = 1 - \frac{x_i}{\sum_{j=1}^G x_j} \in (0,1)$$ and observe that $\sum_{i=1}^G y_i = G-1.$ Finally, set $\pi_i = a + (b-a) y_i \in (a, b).$ We have $$\sum_i \pi_i = G\cdot a + (b-a) \sum_i y_i\\ = G\cdot a + (b-a)(G-1) = (G-1)b + a = 1.$$

Now to confess to my cheating. I said the $i$-th group size would be $n_i=\pi_i \cdot N,$ ignoring the minor problem that the left side is an integer and the right side almost surely is not. So of course we will round $\pi_i \cdot N$ to get $n_i.$ After rounding, though, we can no longer be sure that $\sum_i n_i = N.$ So we iteratively fix it. If $\sum_i n_i < N,$ we add 1 to the smallest $n_i$ and recheck. if $\sum_i n_i > N,$ we subtract 1 from the largest $n_i$ and recheck. Eventually, we end up with a feasible set of group sizes.

All this is encoded in an R notebook I wrote, including a sample problem. Whether the GA gets a "good" (near optimal) partition or not I cannot say, since I did not write the equivalent MIP model. I can say, though, that it gets a sequence of progressively improving partitions.

Monday, January 30, 2023

A Random Tree Generator

For a problem I was noodling with in R, I needed to generate random trees (meaning connected, acyclic, layered, undirected graphs, not classification trees or anything statistical). There are a bunch of libraries for R that have various capabilities for constructing and or massaging graphs and networks (see the CRAN Task View for graphical models). After spending some time poking through the docs for some of the libraries listed in the task view, and unsuccessfully trying to install one, I decided it would be faster just to roll my own code.

Along the way I succumbed to the software developer's maxim: if it works, it needs more features. Still, I wound up with a function that is not horribly complicated and seems to work. Given how many nodes and how many layers you want in the tree, it outputs a matrix of edges forming the desired tree. There are optional arguments for how you want the nodes labeled (the default is 1, 2, ... but you can supply a vector of labels) and in what order the labels should be assigned (top to bottom/left to right raster scan of the graph or randomly), as well as how you want the output matrix organized (randomly, by label order, or by layer order).

The code is in an R notebook that demonstrates the use of the function. It imports the DiagrammeR library in order to plot the resulting trees, but the function does not require DiagrammeR and in fact has no dependencies. You can view the notebook here, and if you want you can download the code from it (see the select box in the upper right of the notebook). The code is licensed under the Creative Commons Attribution 3.0 Unported license.

Friday, January 27, 2023

Weird Caja Bug

In the past day or so I've tripped over a rather annoying bug in Caja, the file manager for the MATE desktop (which I use with Linux Mint, and which is also used by many Ubuntu users). Theoretically, the F2 key is a shortcut for Edit > Rename..., which lets you rename selected files. Usually, it works fine, but today (and at least once in the recent past) the F2 key had no effect.

At first I thought there was a problem with my keyboard, but no, the F2 key was being recognized. A Google search led me to a bug report from 2018 (!) where a diligent soul with screen name "terzag" discovered that the problem occurred when the Caja window displayed a vertical scrollbar and disappeared when the window was resized large enough to lose the scrollbar. Sure enough, that fix worked for me today ... and then things got weirder.

I did a little experimenting, and the bug reappeared when the scrollbar reappeared and disappeared whenever the scrollbar disappeared. Then I tried F2 with multiple file icons selected (which opens a dialog to bulk rename them), and it worked, despite there being a visible scrollbar. Since then, no matter how many times I close and reopen Caja, the bug does not manifest. So it is temporarily fixed, I guess, with emphasis on "temporarily".

I also encountered a possibly related bug, again reported years ago. If I hit either F2 or Edit > Rename... without first selecting one or more files, it closes (crashes?) all open Caja windows. I'm pretty sure that's a bug and not a "feature".

Hopefully this all gets fixed relatively soon ... although looking at bug reports from almost five years ago does not make me feel sanguine about the prospects.

Finding All Simple Cycles (III)

This is the third and (hopefully) final post about finding all simple cycles in an undirected graph. In the first post, I discussed the problem briefly and described a simple, efficient iterative algorithm. In the second post, I posed (but definitely did not recommend) a mixed integer linear program (MILP) to find all simple cycles. Here I will discuss an alternative MILP approach, specific to CPLEX. To make sense of it, you will definitely need to have read the second post. Java code demonstrating all three methods is available from my repository.

Recent versions of CPLEX have a feature called the solution pool, about which I have previous written. One use is to accumulate solutions (including suboptimal ones if desired) encountered along the way to the optimal solution of a MILP model. Another, using the IloCplex.populate() method in the Java API, is to just generate a gaggle of feasible solutions. So the approach I propose here is to build a MILP model for finding a single simple cycle, set the solution pool population limit parameter (the maximum number of solutions to retain) to something really large and the solution pool intensity parameter (how hard CPLEX works to churn out solutions) to its maximum value, and then use the populate() method to find simple cycles. This is implemented in the class in my code. 

Actually, I'm not sure setting the intensity parameter is necessary. On the small test graph, results were similar no matter what intensity setting I used. CPLEX stopped with 160 solutions, many of which were duplicates. After removing duplicates, there were 13 distinct simple cycles, which matches what the other methods found. The only difference was run time, which was a fraction of a second using either the default intensity or the maximum intensity (4) and about two seconds using the minimum intensity (1). Either way, it was much faster than the full MILP model, at least on this small test graph.

The MILP model used here is a subset of the model in the previous post. The model has no objective value (or equivalently minimizes the constant zero function). We drop the $u$ variables (which were used to ensure that cycles were distinct) and the $x$ variables (which signaled whether a slot in the solution was filled with a cycle or not). For all other variables, we drop the $c$ subscript (which slot is being filled), since we are only building one cycle. The "one anchor per cycle" constraint becomes $\sum_i w_i = 1,$ and other constraints involving $u$ and/or $x$ are dropped.

One issue with this model is that the same cycle can be discovered multiple times, either because its orientation is reversed (so the "clockwise" and "counterclockwise" versions are considered distinct) or possibly because minor difference (rounding error) in the flow variables make two copies of the same solution look distinct. Regardless of the reason, the fact CPLEX terminated with 160 solutions when only 13 cycles exist tells you duplication is happening. Fortunately, we do not have to monkey with the model (or set overly finicky values for tolerance values); we can just filter the pool after the run and weed out the duplicates (which my code does). The alternative would be to use a callback (either an incumbent callback if you are using legacy callbacks, or the more current generic callback in "candidate" context) to weed out duplicates as they arise. Since the logic to be used in the callback is the same as the logic used in post-processing, I cannot see any virtue to using a callback, and it definitely would complicate the code.

With that I am hopefully done with this topic. :-)

Thursday, January 26, 2023

Finding All Simple Cycles (II)

Yesterday's post described my introduction (by way of a question on OR Stack Exchange) to the problem of how to find all simple cycles in a graph. In that post I described a simple, efficient iterative algorithm for finding them.

The OR SE question specifically asked about a mixed integer linear program (MILP) to find all simple cycles. While I would never take that approach myself, I got curious as to whether it was possible to construct such a model. Down the rabbit hole I went. The answer turns out to be yes. In this post, I'll describe the model (again, without advocating that it be used).

Assume that we have an undirected graph with vertices numbered $1,\dots,N$ and edge set $E.$ In what follows, $i$ and $j$ will always be vertex numbers and $c$ and $c'$ will index cycles. Although the graph is undirected, when we construct cycles we will give them an orientation (e.g., $3 \rightarrow 2 \rightarrow 5 \rightarrow 3$), which will be useful in ensuring that the result is an actual cycle (you can get from the starting node back to itself) and not the composition of multiple smaller cycles (e.g., not treating $3 \rightarrow 2 \rightarrow 4 \rightarrow 3$ and $5 \rightarrow 6 \rightarrow 9 \rightarrow 8 \rightarrow 5$ as if they form a single cycle).

The model depends on guessing a non-binding upper bound $C$ on the number of simple cycles in the graph. (If the solution to the model does contain $C$ cycles, your guess may have been too small. To be certain that all cycles were found, you would need to bump up $C$ and solve again, until the optimal solution found fewer than $C$ cycles.)

The model contains the following gazillion binary variables:

  • $x_{c}$ is an indicator for whether cycle $c\in\left\{ 1,\dots,C\right\} $
    is constructed;
  • $y_{i,c}$ is an indicator for whether vertex $i$ belongs to cycle $c$;
  • $z_{i,j,c}$ is an indicator for whether edge $[i,j]\in E$ is part of
    cycle $c$ with the tour of $c$ proceeding from $i$ to $j$;
  • $w_{i,c}$ is an indicator for whether vertex $i$ is the anchor (starting
    and ending point) for cycle $c$; and
  • $u_{i,j,c,c'}$ is an indicator that edge $[i,j]\in E$ (in either
    orientation) belongs to exactly one of cycles $c$ and $c'\neq c.$

We also have one set of variables that can be defined as either continuous or integer (continuous probably makes the model solve faster):

  • $f_{i,j,c}\ge 0$ is the ``flow'' on edge $[i,j]$ in the orientation $i\rightarrow j$ in cycle $c.$

The flow variables are an adaptation of the variables used in the Miller-Tucker-Zemlin formulation for the traveling salesman problem (TSP). We will use them to ensure that cycles found are in fact cycles.

The objective, at least, is straightforward. Since we want to find all possible simple cycles, we maximize the number of cycles constructed: $\max\sum_{c=1}^C x_{c}.$ Now come the constraints. (Brace yourself for some serious vertical scrolling.)

  • Edges can only belong to cycles containing both endpoints, and only
    in one orientation: \begin{align*}
    z_{i,j,c}+z_{j,i,c} & \le y_{i,c}\quad\forall[i,j]\in E,\forall c\\
    z_{i,j,c}+z_{j,i,c} & \le y_{j,c}\quad\forall[i,j]\in E,\forall c
  • Flow only occurs on edges that are used, and only in the orientation
    used: \begin{align*}
    f_{i,j,c} & \le Nz_{i,j,c}\quad\forall[i,j]\in E,\forall c\\
    f_{j,i,c} & \le Nz_{j,i,c}\quad\forall[i,j]\in E,\forall c
  • Constructed cycles have one anchor and unused cycles have none: \[
    \sum_{i=1}^N w_{i,c}=x_{c}\quad\forall c
  • A cycle's anchor must belong to the cycle: \[
    w_{i,c}\le y_{i,c}\quad\forall i,\forall c
  • In every cycle, at every node, there are either two incident edges
    (if the node belongs to the cycle) or none (if the node is not part
    of the cycle: \[
    \sum_{j:[i,j]\in E}(z_{i,j,c}+z_{j,i,c})=2y_{i,c}\quad\forall i,\forall c
  • In any used cycle, at any node in the cycle other than the anchor, flow out is at least one unit less than flow in (mimicking the MTZ formulation of the TSP): \[
    \sum_{j:[i,j]\in E}f_{i,j,c}\le\sum_{j:[i,j]\in E}f_{j,i,c}-y_{i,c}+Nw_{i,c}\quad\forall i,\forall c
  • To ensure a complete cycle, one unit of flow must return to the anchor node: \[
    \sum_{j:[i,j]\in E}f_{j,i,c}\ge w_{i,c}\quad\forall i,\forall c
  • The $u$ variables are defined in terms of the $z$ variables: \[
    u_{i,j,c,c'}\le z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\quad\forall[i,j]\in E,\forall c<c'
    \] \[
    u_{i,j,c,c'}\le2-\left(z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\right)\quad\forall[i,j]\in E,\forall c<c'
  • Any pair of cycles constructed must differ in at least one edge (which
    also prevents a cycle from repeating with the opposite orientation): \[
    \sum_{[i,j]\in E}u_{i,j,c,c'}\ge x_{c}+x_{c'}-1\quad\forall c<c'

It is also possible to add constraints to mitigate some of the symmetry in the model. Two come to mind:

  • Unused cycles must have higher indices than used cycles: \[
    x_{c}\le x_{c-1}\quad\forall c\in\left\{ 2,\dots,C\right\}
  • The anchor in a cycle must have the smallest index of any node
    in the cycle: \[
    w_{i,c}+y_{j,c}\le1\quad\forall i>j,\forall c

The first one is of low enough dimension that I automatically included it in my Java code. I made the second one optional in the code.

When I ran it against the 9 node, 12 edge graph in the OR SE question, CPLEX had a solution with 13 cycles (the full number) after about 25 seconds and 5,000 nodes -- way, way more time than the 13 milliseconds the iterative method needed. I set $C=20$ as the upper bound on the number of cycles, which of course is also the initial upper bound on the objective. After 60 seconds (the time limit I set), CPLEX's upper bound was still 20. So probable optimality will not be easy to come by.

I've got one more MILP model up my sleeve, coming in the next post. Meanwhile, as a reminder, my Java code is available here.

Wednesday, January 25, 2023

Finding All Simple Cycles (I)

This is the first of a sequence of posts related to the question of finding all simple cycles in an undirected graph. Motivation for the posts comes from a question on OR Stack Exchange. A simple cycle is a cycle with no repeated nodes (other than the starting/ending node).

Assume we have an undirected graph with nodes indexed from 1 to $N$ and with edge set $E.$ I will use the term "anchor" to refer to the node at which a tour of a simple cycle starts (and ends), and require (without loss of generality) that the anchor of every cycle be the lowest index node in the cycle. Since a cycle is the same whether you traverse it "clockwise" or "counterclockwise", I will also require without loss of generality that nodes in a cycle be listed so that the node immediately after the anchor has smaller index than the node immediately before the anchor. This is just to reduce production of duplicate cycles.

I came up with three approaches to finding all simple cycles. The first, which I will describe below, is a simple iterative scheme that is the fastest. The other two, described in subsequent posts, use mixed integer linear programs (MILPs) -- not because they are efficient but because the OR SE question specifically asked about MILP approaches. I will describe those in subsequent posts.

The iterative scheme is pretty simple. It uses a queue of paths (incomplete cycles) to store work in progress. I will use "terminus" to indicate the last node in a path. When the terminus equals the anchor, we have a cycle. Due to my definition of "anchor", we can say that no node on a path can have a lower index than the anchor of the path.

The iterative scheme loops over possible anchors from 1 to $N.$ For each anchor, we initialize the queue with all paths consisting of a single edge incident at the anchor and not incident at any node less than the anchor. While the queue is not empty, we pop a path and create one or more new paths by extending that path with each qualifying edge. A qualifying edge is one that meets the following conditions:

  • it is incident at the terminus of the current path;
  • it is not incident at any node already on the path (other than anchor);
  • it is not incident at any node less than the anchor; and
  • if it is incident at the anchor, it is not the same as the first edge of the path (i.e., we do not allow a "cycle" of the form $a\rightarrow b \rightarrow a$).

If the new edge leads to the anchor, we have a cycle. If not, we update the terminus and add the extended path to the queue, for further processing later. If a path cannot be extended because it is not a cycle and there are no qualifying edges left, we discard it.

Coding the algorithm is pretty straightforward, and it should work on any undirected graph (even those that are not connected). On the example from the OR SE post (9 nodes, 12 edges) it took a whopping 11 milliseconds to find all 13 simple cycles.

Java code for this and the two MILP models (the latter requiring a recent version of CPLEX) can be found in my GitLab repository. The iterative algorithm is in the CycleFinder class. Stay tuned for the MILP models.