
Wednesday, January 29, 2025

Minimizing Flow Support (II)

In yesterday's post I described a graph optimization problem (from a post on OR Stack Exchange) and how to generate random test instances. You are given a digraph with a single commodity flowing through it. There may be multiple supply and demand nodes (with supply and demand in balance), and arcs have neither costs nor capacity limits. The problem is to find a subgraph with all the original nodes and a minimal number of arcs (the "support" of the flow) such that there is a feasible flow pattern to satisfy all demands. In this post, I will discuss a couple of mixed integer programming (MIP) models for the problem.

I will use the following notation. The nodes are $n_1, \dots, n_N$, and the supply or demand at node $n_i$ is given by $s_i,$ where $s_i > 0$ at supply nodes and $s_i < 0$ at demand nodes. The total supply is given by $F.$ The set of arcs is $A,$ and at each node $n_i$ we denote by $\delta^+(n_i)$ and $\delta^-(n_i)$ respectively the set of arcs flowing into and out of $n_i.$ The common elements of my two MIP models are the following.

  • Variable $x_a \in \lbrace 0,1\rbrace$ is 1 if and only if arc $a\in A$ is selected for the subgraph.
  • Variable $y_a \in [0, F]$ is the flow volume over arc $a \in A.$
  • The objective is to minimize the total number of arcs selected:
       
    $$\textrm{minimize }\sum_{a\in A} x_a.$$
  • For each node $n_i$ we require that flows in and out combined with any supply or demand balance:
       
    $$\sum_{a \in \delta^-(n_i)} y_a - \sum_{a\in \delta^+(n_i)} y_a = s_i.$$

The models differ in the remaining requirement, that there be no flow on any arcs that were not selected (i.e., $y_a = 0$ if $x_a =0$). Those constraints are added as follows.

  • In the "big M" model, for each $a\in A$ we add the constraint $y_a \le Fx_a.$
  • In the "indicators" model, for each $a\in A$ we add an if-then constraint $x_a = 0 \implies y_a = 0.$

I tested both models using two different solvers, IBM CPLEX 22.1.1 and FICO Xpress 9.5 (Optimizer version 44.01.01). Before running a "production" problem, I ran all four combinations of model and solver on a small instance using the solvers' respective tuning routines. In three cases, the default solver settings seemed best. In one case (CPLEX on the big M model), the tuner suggested a couple of nondefault parameter settings, but they fared poorly on the production problem. So I used default parameter settings for all the production runs.

The production problem had 25 supply nodes, 34 demand nodes, 41 transit nodes (nodes where supply was 0) and 526 arcs. Total supply was $F=1000.$ I gave each combination of model and solver a one hour time limit (on my slightly vintage PC). I was mainly curious about how the two models would compare, and secondarily about how the two solvers would compare. Of course, running one test instance for one hour per combination is far from probative, but my curiosity has its bounds. Here is what I found.


Solver   Model        Incumbent   Lower bound   Gap (%)
CPLEX    big M        55          49.5          10
CPLEX    indicators   54          48            11
Xpress   big M        57          49            16
Xpress   indicators   58          49            18

There are some differences among combinations in the results (which, again, might not bear up under multiple tests), but what I found a bit interesting was that the gap never made it below 10% in any combination, even though I consider the test problem to be not particularly large. (Also slightly interesting was that only roughly 10% of the arcs in the graph were needed.)

I will note one difference between the solvers, though I am not sure whether it stems from different default parameter settings or different memory management. Within the one hour run time limit, neither attempt with Xpress ran into memory issues. In contrast, one of the CPLEX runs exhausted system memory (and hung the system) before the hour was up. So I did the other CPLEX run with a 9500 MB limit on the tree size (set via the parameter CPXPARAM_MIP_Limits_TreeMemory), and that run ended when it hit the tree limit before the hour was up. (Both CPLEX runs lasted only about 53 minutes.)

One of the main reasons I ran the tests was to see whether the indicator model would give tighter bounds than the big M model. A bit of wisdom I received a long time ago was that big M was better than indicators if you have insights into the model that allow you to use a not terribly large value of $M.$ Here, the worst case would be if the entire flow volume $F$ passed through a single arc, which lets me use $F$ (1000 in the test problem) as my value of $M.$ The big M runs did produce smaller gaps than their indicator counterparts, but not by much, and possibly not by a "statistically significant" amount (if you can even mention "statistical significance" while working with a sample size of 1 🥲).

Still, for now I will stick to preferring big M constraints with not-so-big values of $M$ over indicator constraints.

As mentioned in the previous post, you can find my code here, including a README.md file that explains the code structure.

Tuesday, January 28, 2025

Minimizing Flow Support (I)

This is the first of hopefully two posts related to a question posted on Operations Research Stack Exchange. You are given a digraph through which a single commodity flows. Arcs have neither costs nor capacity limits. Each node $n_i$ is either a supply node (with supply $s_i>0$), a demand node (with demand $s_i<0$ treated as a "negative supply" in the optimization models), or what I will call a "transit node" ($s_i =0$) with neither supply nor demand. Key assumptions are that total supply equals total demand and that it is possible to find paths through the digraph satisfying all demands (and thus consuming all supplies).

The problem is to select a minimal number of arcs such that the reduced digraph (using all the original nodes but just the selected arcs) contains routes fulfilling all demands. In this post, I'll describe one way to generate random test instances of the problem. The following post will discuss modeling and solving the problem. I have Java code demonstrating both parts in my university GitLab repository.

My approach to generating a test instance starts with specification of the number of nodes in the graph and a lower bound for the number of arcs. (The lower bound might need to be exceeded in order to ensure that a feasible flow satisfying all demands exists.) My code also asks the user to specify an integer value $F$ for total supply (and total demand). I used integer flows just to make printing the supply/demand at each node and flow on each arc neat. You might prefer to use real valued flows and (since the problem is invariant with respect to the flow volume) just set the total supply/demand at 1.

In what follows, I will use the terms "upstream" and "downstream" to refer to nodes from which there are directed paths to a given node (upstream) or to which there are directed paths from a given node (downstream). To simplify explanation, I will treat demands as positive values. The construction process starts by creating the desired number of nodes and partitioning them into supply, demand and transit nodes. Since you need at least one supply node and at least one demand node, the first node is assigned as a supply node and the second as a demand node. (My code also assigns one node as a transit node, but if you do not care whether there are any transit nodes you can skip that.) The remaining nodes are randomly classified as supply, demand or transit. Since my code uses integer flow values, each supply (demand) node needs a supply (demand) of at least 1. If the partitioning process creates more than $F$ supply or demand nodes, the excess nodes are reclassified as transit nodes. If you are using real flow values, this step can be skipped.

The next step is to allocate total supply (total demand) randomly across the supply (demand) nodes. Again, since I am using integer flows, my code first allocates one unit of supply or demand to each non-transit node, then allocates the remaining supply (demand) one unit at a time, randomly choosing with replacement a supply (demand) node to receive the next unit.
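
The allocation step is tiny in code. Here is a sketch of the supply side (the demand side is symmetric); the method name and signature are my own illustration, not my actual code.

```java
import java.util.Arrays;
import java.util.Random;

class SupplySketch {
    // Sketch: give each supply node one unit up front, then hand out the remaining
    // F - nSupply units one at a time, choosing a recipient uniformly with replacement.
    static int[] allocateSupply(int F, int nSupply, Random rng) {
        int[] supply = new int[nSupply];
        Arrays.fill(supply, 1);
        for (int unit = nSupply; unit < F; unit++) {
            supply[rng.nextInt(nSupply)]++;
        }
        return supply;
    }
}
```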

With the nodes created, it is time to move on to creating arcs. My code allocates to each node of any type two initially empty sets of nodes, those upstream and downstream, and two initially empty sets of arcs, those entering and those leaving the node. It also assigns a temporary variable containing the excess supply if the node is a supply node or the unmet demand if the node is a demand node. Two sets of nodes are created, one containing supply nodes with unused supply (initially, all the supply nodes) and the other containing demand nodes with unmet demands (initially, all the demand nodes).

A list of all possible arcs (excluding arcs from a node to itself) is created and randomly shuffled. Arcs are now added from that list until enough directed paths exist to ensure that all demands can be met. As each arc $(a, b)$ is added to the digraph, it is added to the set of arcs exiting the tail node $a$ and to the set of arcs entering the head node $b.$ The set of nodes upstream of $b$ is updated to include $a$ and all nodes upstream of $a,$ and the set of nodes downstream of $a$ is updated to include $b$ and all nodes downstream of $b.$ Finally, nodes upstream of $b$ with unused supply and nodes downstream of $a$ with unmet demand are paired up. The lesser of the unused supply and unmet demand is subtracted from both the supply of the upstream node and the demand of the downstream node, and whichever has zero supply/demand left is removed from the set of nodes with unused supply/unmet demand. It is possible (and certain when the very last drop of supply/demand is accounted for) that both nodes will be removed from their respective sets.
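
Here is a rough Java sketch of that update step. The `Node` class and the set bookkeeping are illustrative assumptions, not the structure of my actual code.

```java
import java.util.*;

class Node {
    final int id;
    long residual;  // > 0: unused supply, < 0: unmet demand, 0 otherwise
    final Set<Node> upstream = new HashSet<>();
    final Set<Node> downstream = new HashSet<>();
    Node(int id, long s) { this.id = id; this.residual = s; }
}

class ArcUpdate {
    /** Update reachability sets and pair off supply/demand after adding arc (a, b). */
    static void addArc(Node a, Node b, Set<Node> openSupply, Set<Node> openDemand) {
        // Everything downstream of b (including b) gains a and a's upstream set,
        // and everything upstream of a (including a) gains b and b's downstream set.
        Set<Node> newUp = new HashSet<>(a.upstream);
        newUp.add(a);
        Set<Node> newDown = new HashSet<>(b.downstream);
        newDown.add(b);
        for (Node d : newDown) d.upstream.addAll(newUp);
        for (Node u : newUp) u.downstream.addAll(newDown);
        // Pair unused supply upstream of b with unmet demand downstream of a.
        for (Node s : new ArrayList<>(openSupply)) {
            if (!b.upstream.contains(s)) continue;
            for (Node d : new ArrayList<>(openDemand)) {
                if (d.residual == 0 || !a.downstream.contains(d)) continue;
                long moved = Math.min(s.residual, -d.residual);
                s.residual -= moved;
                d.residual += moved;
                if (d.residual == 0) openDemand.remove(d);
                if (s.residual == 0) { openSupply.remove(s); break; }
            }
        }
    }
}
```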

Once there are no nodes left with excess supply/demand, we can be certain the digraph contains at least one feasible solution. All that remains is to randomly add arcs until the user's specified minimum number of arcs is met.



Tuesday, September 12, 2023

Randomly Generating a Connected Graph

Someone posted a question on Operations Research Stack Exchange about generating random test instances of a connected graph. (They specified a planar graph, but I am going to ignore that restriction, which I suspect was unintended and which I am pretty sure would make the problem much harder.) The premise is fairly straightforward. You start with a fixed set of nodes, partitioned into three subsets (supply, transshipment and demand nodes). The goal is to select a random set of edges of fixed cardinality such that (a) the resulting graph is connected and (b) no edge connects two supply nodes or a supply node and a demand node. Setting the number of edges too low will obviously make the problem infeasible. The author specified that the graph could be either directed or undirected. My solutions are coded for undirected graphs but can be adapted to directed graphs.

I proposed two solutions, both of which I coded and tested in Java. One uses a mixed integer programming (MIP) model, which my code solves using CPLEX. Let $E$ be the set of all valid edges, $N$ the number of nodes and $K$ the desired number of edges. The MIP model uses binary variables $x_{i,j}$ for all $(i,j)\in E.$ The requirement to include $K$ edges is handled by the constraint $$\sum_{(i,j)\in E} x_{i,j} = K$$ and randomness is achieved by minimizing $$\sum_{(i,j)\in E} c_{i,j} x_{i,j}$$ where the cost coefficients $c_{i,j}$ are generated randomly. 

That leaves the matter of forcing the graph to be connected. To do that, we introduce forward and backward flow variables $f_{i,j} \ge 0$ and $r_{i,j} \ge 0$ for each edge $(i,j),$ where $f_{i,j}$ is interpreted as flow from $i$ to $j$ and $r_{i,j}$ as flow from $j$ to $i.$ The concept is to single out one node (call it $s$) as a source for $N-1$ units of flow of some mystery commodity and assign every other node a demand of one unit of the commodity. For $i\neq s,$ the flow conservation constraint is $$\sum_{(j,i)\in E} (f_{j,i} - r_{j,i}) + \sum_{(i,j)\in E} (r_{i,j} - f_{i,j}) = 1,$$ which says that flow in minus flow out equals 1. To ensure that flow occurs only on selected edges, we add the constraints $$f_{i,j} \le (N-1) x_{i,j}$$ and $$r_{i,j} \le (N-1) x_{i,j}$$ for all $(i,j)\in E.$

The edge removal heuristic is a bit simpler. We start by selecting all eligible edges and shuffling the list randomly. While the number of edges in the graph exceeds $K,$ we pop the next edge from the list, remove it, and test whether the graph remains connected. If yes, the edge is permanently gone. If no, we replace the edge in the graph (but not on the list) and move to the next edge in the list. To confirm that removing edge $(i,j)$ leaves a connected graph, we start from node $i$ and find all nodes that can be reached from $i$ in one step (i.e., remain adjacent). Then we find all nodes adjacent to those nodes, and continue until either we encounter node $j$ (in which case the graph is still connected) or run out of nodes to test (in which case the graph is now disconnected).
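
Here is a compact sketch of the heuristic. The names and data structures (adjacency sets keyed by node, edges as two-element arrays) are illustrative; the real code is in my repository.

```java
import java.util.*;

class EdgeRemoval {
    /** Sketch: remove random edges while keeping the graph connected, stopping at K edges.
        'adj' maps each node to its neighbor set; 'edges' lists edges as {i, j} pairs. */
    static void thin(Map<Integer, Set<Integer>> adj, List<int[]> edges, int K, Random rng) {
        Collections.shuffle(edges, rng);
        Deque<int[]> candidates = new ArrayDeque<>(edges);
        int m = edges.size();
        while (m > K && !candidates.isEmpty()) {
            int[] e = candidates.pop();
            adj.get(e[0]).remove(e[1]);
            adj.get(e[1]).remove(e[0]);
            if (stillConnected(adj, e[0], e[1])) {
                m--;                         // the removal kept the graph connected
            } else {
                adj.get(e[0]).add(e[1]);     // put the edge back and move on
                adj.get(e[1]).add(e[0]);
            }
        }
        // If m > K here, this pass failed ("bad luck"); restart with a new shuffle.
    }

    /** Breadth-first search from i, succeeding as soon as j is reached. */
    static boolean stillConnected(Map<Integer, Set<Integer>> adj, int i, int j) {
        Set<Integer> seen = new HashSet<>(List.of(i));
        Deque<Integer> queue = new ArrayDeque<>(List.of(i));
        while (!queue.isEmpty()) {
            for (int next : adj.get(queue.poll())) {
                if (next == j) return true;
                if (seen.add(next)) queue.add(next);
            }
        }
        return false;
    }
}
```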

The edge removal heuristic is simpler to explain, does not require a MIP solver and likely is faster. There are two potential disadvantages compared to the MIP approach. One is that the MIP approach is done after a single solve, whereas the edge removal heuristic may fail to produce a connected graph due to an "unlucky" choice of which edges to remove, requiring one or more restarts with different random number seeds. The other is that if $K$ is set too low (making a connected graph impossible), the MIP model will detect that the problem is infeasible. With the edge removal heuristic, you would not be able to distinguish infeasibility from bad luck (although if the heuristic failed multiple times with different random seeds, infeasibility would start to look likely). In very, very limited testing of my code, the edge removal heuristic was definitely faster than the MIP and achieved a valid result in the first pass each time.

Java code for both methods is in my code repository.

Sunday, July 30, 2023

Visiting All Nodes

A question on Operations Research Stack Exchange asks about finding a shortest route through a connected graph that visits every node. This is distinct from the well-known traveling salesman problem (TSP), in that the TSP requires each node to be visited exactly once whereas the OR SE question explicitly allows nodes to be visited more than once.

The question reminded me of a conversation I had years ago with a doctoral student in our logistics program. He was working on routing deliveries (via truck) from a depot to various customers, and was trying to formulate it as a TSP using a graph based on a road map (buildings or intersections as nodes, roads as edges). I did my best at the time to convince him that the TSP was unnecessarily restrictive and might even make the problem infeasible. The second example in the OR SE question is a nice example of how that could happen. Generally speaking, in an actual routing problem there is nothing stopping you from passing by/through a previously visited location if that's the best route. There are a few exceptions to this, some in military applications (you were laying mines and/or blowing up bridges as you passed them) and some in normal life (you were speeding and might not want to return to some place where a traffic cop would remember you), but for the most part I believe it holds.

My advice to the doctoral student was similar to the answer Rob Pratt posted on OR SE. Rethink the graph, keeping just the depot and customers as nodes and letting the edges in the new graph represent the shortest routes between pairs of nodes. So edge $(i,j)$ and edge $(k,\ell)$ connect different pairs of nodes but might represent physical routes that cross (you pass through the same intersection on both, albeit possibly in different directions) or overlap (both take US 127 north from mile marker 102 to mile marker 137). Building the edges in the new graph involves solving a shortest path problem between each pair of nodes (using the original graph). You can apply Dijkstra's algorithm to each pair of nodes, but I'm pretty sure applying the Floyd-Warshall algorithm would be faster. Pro tip: if you use Floyd-Warshall, don't bother to extract the route for each pair of nodes. Just get the distances, and keep the F-W solution handy. Once you have solved a TSP on the modified graph and know which edges will be used, you can extract the physical routes corresponding to just the used edges from the F-W solution.
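
Here is a sketch of Floyd-Warshall augmented with a successor ("next") matrix, which is what lets you postpone route extraction until you know which edges the TSP actually uses. The initialization assumptions are noted in the comments; the class and method names are illustrative.

```java
import java.util.*;

class AllPairs {
    /** dist[i][j] starts as the direct edge length (a large value if no edge, 0 on the
        diagonal) and next[i][j] = j wherever an edge exists. Assumes a connected graph. */
    static void floydWarshall(double[][] dist, int[][] next) {
        int n = dist.length;
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (dist[i][k] + dist[k][j] < dist[i][j]) {
                        dist[i][j] = dist[i][k] + dist[k][j];
                        next[i][j] = next[i][k];  // the best i->j route now leaves i toward k
                    }
    }

    /** Recover the physical route for one node pair, called only for edges the TSP used. */
    static List<Integer> route(int[][] next, int i, int j) {
        List<Integer> path = new ArrayList<>(List.of(i));
        while (i != j) {
            i = next[i][j];
            path.add(i);
        }
        return path;
    }
}
```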

I posted an explicit MIP model (which I will not repeat here, since it's a space hog) as an answer to the OR SE question. I suspect that Rob's approach (TSP on a modified graph) might be faster, particularly because it allows you to use a dedicated TSP solver. Although TSPs have a reputation for being difficult (being the poster child for NP-hard problems), bespoke solvers like Concorde can chew through rather large TSPs pretty quickly. Other than the upfront computational cost of finding the shortest path between every pair of nodes, the one concern I might have with the TSP approach is that it converts what might have been a sparse graph to a complete graph. So, like all things MIP, which approach is better is an empirical question.

Monday, May 8, 2023

Finding a Connected Subgraph

A recent question on Mathematics Stack Exchange asked about using a mixed integer linear program (MILP) to find a connected subgraph of a specified size (vertex count) within a given graph. Specifically, it asked how to formulate constraints to enforce the requirement that the subgraph be connected.

As I have previously mentioned, one way to force a graph to be connected is to add flows between vertices, treating one vertex as a source and the other vertices as sinks. In the context of the current question, that works as follows. Let $V$ and $E$ be the sets of vertices and edges respectively in our (undirected) graph and $M<\vert V \vert$ the number of vertices required in the connected subgraph. Note that we do not assume the parent graph is connected, and in fact the sample graph in the Math SE post is not. Since flows are inherently directional, we replace each edge $e=(v,w)\in E$ with two arcs, $(v,w)$ and $(w,v).$ Let $A$ be the set of all arcs.

We designate (arbitrarily) one vertex in the subgraph to be the "root" (source) vertex, with a supply of $M-1$ thingies. (I was going to say "honest politicians", but then the problem becomes infeasible, hence "thingies".) Each of the remaining vertices in the subgraph consumes one thingy. Thingies flow along arcs, but only if both endpoints are selected. Since $M-1$ thingies enter the network at the root vertex and each of the remaining $M-1$ vertices consumes one thingy, the only way for flow to be balanced is if the selected vertices form a connected graph.

For each vertex $v\in V,$ let $x_v \in \lbrace 0, 1 \rbrace$ be 1 if and only if vertex $v$ is selected and let $y_v\in \lbrace 0, 1 \rbrace$ be 1 if and only if vertex $v$ is chosen to be the root vertex. For each arc $a\in A,$ let $f_a \in [0, M-1]$ be the flow across arc $a.$ The objective function is immaterial, since we just want a feasible solution, so we will minimize 0. The constraints are as follows:

  • The correct number of vertices must be chosen: $$\sum_{v\in V} x_v = M.$$
  • One vertex must be designated the root: $$\sum_{v \in V} y_v = 1.$$
  • The root vertex must be one of the chosen vertices: $$y_v \le x_v \quad \forall v\in V.$$
  • The flow on any arc must be 0 unless both head and tail are among the chosen vertices: $$f_{(v,w)}\le (M-1)x_v \quad \forall (v,w)\in A$$ and $$f_{(v,w)}\le (M-1)x_w \quad \forall (v,w)\in A.$$
  • The net flow out of any vertex (flow out - flow in) must be $M - 1$ for the source vertex, -1 for every other selected vertex and 0 otherwise: $$\sum_{(v,w)\in A} f_{(v,w)} - \sum_{(w,v)\in A} f_{(w,v)} = M\cdot y_v - x_v \quad \forall v\in V.$$

This works in a MILP model, but a MILP model is probably not the best way to solve the underlying problem. Consider the following algorithm, which is based on constructing a layered subgraph. Assume the vertices are in some order. Make the first vertex the root and create a graph containing just it (which we will call layer 0). Add up to $M-1$ vertices connected to the root by an edge (layer 1). If the graph now contains $M$ vertices, declare victory and stop. Otherwise, for each vertex in layer 1, find all vertices connected to it by an edge that are not in layers 0 or 1, and add as many of them as are needed to reach $M$ vertices (or all of them if you are still short). Repeat until done or until all vertices in layer 1 have been processed. The added vertices form layer 2, because they are two edges from the root. Repeat with each layer until either you have $M$ vertices (victory) or there are no vertices left to process.

If all connected vertices have been found and the count is less than $M,$ discard the current root and start over with the next vertex as root. Note that when processing subsequent roots, you can ignore any vertices already tried as root and just look at vertices later in the vertex ordering.
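
Here is a sketch of the layered search. For simplicity it omits the refinement of skipping vertices already tried as roots; the names are illustrative, not my actual code.

```java
import java.util.*;

class LayeredSearch {
    /** Try each vertex as root; grow layers (breadth-first) until M vertices are
        collected or the root's component is exhausted. Returns null if no
        connected subgraph with M vertices exists. */
    static Set<Integer> find(List<Set<Integer>> adj, int M) {
        for (int root = 0; root < adj.size(); root++) {
            Set<Integer> chosen = new LinkedHashSet<>(List.of(root));
            Deque<Integer> frontier = new ArrayDeque<>(List.of(root));
            while (chosen.size() < M && !frontier.isEmpty()) {
                for (int w : adj.get(frontier.poll())) {
                    if (chosen.size() == M) break;
                    if (chosen.add(w)) frontier.add(w);  // w enters the next layer
                }
            }
            if (chosen.size() == M) return chosen;       // victory
        }
        return null;
    }
}
```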

I ginned up some Java code (open source, as always) to test both the MILP model and the layered graph approach, using CPLEX as the MILP solver. Both got the examples from the Math SE post correct in a trivial amount of time. On a randomly generated graph with 5,000 nodes and 624,713 edges (about 5% of the edges in a complete graph that size), with a target subgraph size of 1,000 nodes, the MILP model took about 8 seconds to build and 12 seconds to solve. The layered graph search needed 3 milliseconds. I tried a larger graph, but my computer ran out of heap space trying to build the model (mainly due to the expanding number of flow variables and related constraints).

So there are two things to take away. One is that flows and flow balance constraints can be used to enforce connection among vertices, even when there is no actual thing flowing. The other is that, much as I like MILP models, sometimes simpler approaches are better.


Monday, January 30, 2023

A Random Tree Generator

For a problem I was noodling with in R, I needed to generate random trees (meaning connected, acyclic, layered, undirected graphs, not classification trees or anything statistical). There are a bunch of R libraries with various capabilities for constructing and/or massaging graphs and networks (see the CRAN Task View for graphical models). After spending some time poking through the docs for some of the libraries listed in the task view, and unsuccessfully trying to install one, I decided it would be faster just to roll my own code.

Along the way I succumbed to the software developer's maxim: if it works, it needs more features. Still, I wound up with a function that is not horribly complicated and seems to work. Given how many nodes and how many layers you want in the tree, it outputs a matrix of edges forming the desired tree. There are optional arguments for how you want the nodes labeled (the default is 1, 2, ... but you can supply a vector of labels) and in what order the labels should be assigned (top to bottom/left to right raster scan of the graph or randomly), as well as how you want the output matrix organized (randomly, by label order, or by layer order).

The code is in an R notebook that demonstrates the use of the function. It imports the DiagrammeR library in order to plot the resulting trees, but the function does not require DiagrammeR and in fact has no dependencies. You can view the notebook here, and if you want you can download the code from it (see the select box in the upper right of the notebook). The code is licensed under the Creative Commons Attribution 3.0 Unported license.

Friday, January 27, 2023

Finding All Simple Cycles (III)

This is the third and (hopefully) final post about finding all simple cycles in an undirected graph. In the first post, I discussed the problem briefly and described a simple, efficient iterative algorithm. In the second post, I posed (but definitely did not recommend) a mixed integer linear program (MILP) to find all simple cycles. Here I will discuss an alternative MILP approach, specific to CPLEX. To make sense of it, you will definitely need to have read the second post. Java code demonstrating all three methods is available from my repository.

Recent versions of CPLEX have a feature called the solution pool, about which I have previously written. One use is to accumulate solutions (including suboptimal ones if desired) encountered along the way to the optimal solution of a MILP model. Another, using the IloCplex.populate() method in the Java API, is to just generate a gaggle of feasible solutions. So the approach I propose here is to build a MILP model for finding a single simple cycle, set the solution pool population limit parameter (the maximum number of solutions to retain) to something really large and the solution pool intensity parameter (how hard CPLEX works to churn out solutions) to its maximum value, and then use the populate() method to find simple cycles. This is implemented in the SingleCycleMIP.java class in my code.
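
In outline, the populate workflow looks like the following sketch. Model construction is omitted, and the array `z` (standing for the model's binary edge variables) is an illustrative assumption, not the actual variable layout in my code.

```java
import ilog.concert.*;
import ilog.cplex.*;

class PoolSketch {
    // Sketch: enumerate simple cycles via the solution pool. 'cplex' holds the
    // already-built one-cycle model; 'z' its binary edge-orientation variables.
    static void enumerate(IloCplex cplex, IloIntVar[] z) throws IloException {
        cplex.setParam(IloCplex.Param.MIP.Pool.Capacity, 1000000); // keep a very large pool
        cplex.setParam(IloCplex.Param.MIP.Pool.Intensity, 4);      // enumerate aggressively
        if (cplex.populate()) {
            int nSols = cplex.getSolnPoolNsolns();
            for (int s = 0; s < nSols; s++) {
                double[] zVals = cplex.getValues(z, s);  // edge usage in pool solution s
                // ... decode zVals into a cycle and add it to a set to weed out duplicates ...
            }
        }
    }
}
```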

Actually, I'm not sure setting the intensity parameter is necessary. On the small test graph, results were similar no matter what intensity setting I used. CPLEX stopped with 160 solutions, many of which were duplicates. After removing duplicates, there were 13 distinct simple cycles, which matches what the other methods found. The only difference was run time, which was a fraction of a second using either the default intensity or the maximum intensity (4) and about two seconds using the minimum intensity (1). Either way, it was much faster than the full MILP model, at least on this small test graph.

The MILP model used here is a subset of the model in the previous post. The model has no objective value (or equivalently minimizes the constant zero function). We drop the $u$ variables (which were used to ensure that cycles were distinct) and the $x$ variables (which signaled whether a slot in the solution was filled with a cycle or not). For all other variables, we drop the $c$ subscript (which slot is being filled), since we are only building one cycle. The "one anchor per cycle" constraint becomes $\sum_i w_i = 1,$ and other constraints involving $u$ and/or $x$ are dropped.

One issue with this model is that the same cycle can be discovered multiple times, either because its orientation is reversed (so the "clockwise" and "counterclockwise" versions are considered distinct) or possibly because minor differences (rounding errors) in the flow variables make two copies of the same solution look distinct. Regardless of the reason, the fact that CPLEX terminated with 160 solutions when only 13 cycles exist tells you duplication is happening. Fortunately, we do not have to monkey with the model (or set overly finicky tolerance values); we can just filter the pool after the run and weed out the duplicates (which my code does). The alternative would be to use a callback (either an incumbent callback if you are using legacy callbacks, or the more current generic callback in "candidate" context) to weed out duplicates as they arise. Since the logic to be used in the callback is the same as the logic used in post-processing, I cannot see any virtue to using a callback, and it definitely would complicate the code.

With that I am hopefully done with this topic. :-)


Thursday, January 26, 2023

Finding All Simple Cycles (II)

Yesterday's post described my introduction (by way of a question on OR Stack Exchange) to the problem of how to find all simple cycles in a graph. In that post I described a simple, efficient iterative algorithm for finding them.

The OR SE question specifically asked about a mixed integer linear program (MILP) to find all simple cycles. While I would never take that approach myself, I got curious as to whether it was possible to construct such a model. Down the rabbit hole I went. The answer turns out to be yes. In this post, I'll describe the model (again, without advocating that it be used).

Assume that we have an undirected graph with vertices numbered $1,\dots,N$ and edge set $E.$ In what follows, $i$ and $j$ will always be vertex numbers and $c$ and $c'$ will index cycles. Although the graph is undirected, when we construct cycles we will give them an orientation (e.g., $3 \rightarrow 2 \rightarrow 5 \rightarrow 3$), which will be useful in ensuring that the result is an actual cycle (you can get from the starting node back to itself) and not the composition of multiple smaller cycles (e.g., not treating $3 \rightarrow 2 \rightarrow 4 \rightarrow 3$ and $5 \rightarrow 6 \rightarrow 9 \rightarrow 8 \rightarrow 5$ as if they form a single cycle).

The model depends on guessing a non-binding upper bound $C$ on the number of simple cycles in the graph. (If the solution to the model does contain $C$ cycles, your guess may have been too small. To be certain that all cycles were found, you would need to bump up $C$ and solve again, until the optimal solution found fewer than $C$ cycles.)

The model contains the following gazillion binary variables:

  • $x_{c}$ is an indicator for whether cycle $c\in\left\{ 1,\dots,C\right\} $
    is constructed;
  • $y_{i,c}$ is an indicator for whether vertex $i$ belongs to cycle $c$;
  • $z_{i,j,c}$ is an indicator for whether edge $[i,j]\in E$ is part of
    cycle $c$ with the tour of $c$ proceeding from $i$ to $j$;
  • $w_{i,c}$ is an indicator for whether vertex $i$ is the anchor (starting
    and ending point) for cycle $c$; and
  • $u_{i,j,c,c'}$ is an indicator that edge $[i,j]\in E$ (in either
    orientation) belongs to exactly one of cycles $c$ and $c'\neq c.$

We also have one set of variables that can be defined as either continuous or integer (continuous probably makes the model solve faster):

  • $f_{i,j,c}\ge 0$ is the "flow" on edge $[i,j]$ in the orientation $i\rightarrow j$ in cycle $c.$

The flow variables are an adaptation of the variables used in the Miller-Tucker-Zemlin formulation for the traveling salesman problem (TSP). We will use them to ensure that cycles found are in fact cycles.

The objective, at least, is straightforward. Since we want to find all possible simple cycles, we maximize the number of cycles constructed: $\max\sum_{c=1}^C x_{c}.$ Now come the constraints. (Brace yourself for some serious vertical scrolling.)

  • Edges can only belong to cycles containing both endpoints, and only
    in one orientation: \begin{align*}
    z_{i,j,c}+z_{j,i,c} & \le y_{i,c}\quad\forall[i,j]\in E,\forall c\\
    z_{i,j,c}+z_{j,i,c} & \le y_{j,c}\quad\forall[i,j]\in E,\forall c
    \end{align*}
  • Flow only occurs on edges that are used, and only in the orientation
    used: \begin{align*}
    f_{i,j,c} & \le Nz_{i,j,c}\quad\forall[i,j]\in E,\forall c\\
    f_{j,i,c} & \le Nz_{j,i,c}\quad\forall[i,j]\in E,\forall c
    \end{align*}
  • Constructed cycles have one anchor and unused cycles have none: \[
    \sum_{i=1}^N w_{i,c}=x_{c}\quad\forall c
    \]
  • A cycle's anchor must belong to the cycle: \[
    w_{i,c}\le y_{i,c}\quad\forall i,\forall c
    \]
  • In every cycle, at every node, there are either two incident edges
    (if the node belongs to the cycle) or none (if the node is not part
    of the cycle): \[
    \sum_{j:[i,j]\in E}(z_{i,j,c}+z_{j,i,c})=2y_{i,c}\quad\forall i,\forall c
    \]
  • In any used cycle, at any node in the cycle other than the anchor, flow out is at least one unit less than flow in (mimicking the MTZ formulation of the TSP): \[
    \sum_{j:[i,j]\in E}f_{i,j,c}\le\sum_{j:[i,j]\in E}f_{j,i,c}-y_{i,c}+Nw_{i,c}\quad\forall i,\forall c
    \]
  • To ensure a complete cycle, one unit of flow must return to the anchor node: \[
    \sum_{j:[i,j]\in E}f_{j,i,c}\ge w_{i,c}\quad\forall i,\forall c
    \]
  • The $u$ variables are defined in terms of the $z$ variables: \[
    u_{i,j,c,c'}\le z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\quad\forall[i,j]\in E,\forall c<c'
    \] \[
    u_{i,j,c,c'}\le2-\left(z_{i,j,c}+z_{j,i,c}+z_{i,j,c'}+z_{j,i,c'}\right)\quad\forall[i,j]\in E,\forall c<c'
    \]
  • Any pair of cycles constructed must differ in at least one edge (which
    also prevents a cycle from repeating with the opposite orientation): \[
    \sum_{[i,j]\in E}u_{i,j,c,c'}\ge x_{c}+x_{c'}-1\quad\forall c<c'
    \]

It is also possible to add constraints to mitigate some of the symmetry in the model. Two come to mind:

  • Unused cycles must have higher indices than used cycles: \[
    x_{c}\le x_{c-1}\quad\forall c\in\left\{ 2,\dots,C\right\}
    \]
  • The anchor in a cycle must have the smallest index of any node
    in the cycle: \[
    w_{i,c}+y_{j,c}\le1\quad\forall i>j,\forall c
    \]

The first one is of low enough dimension that I automatically included it in my Java code. I made the second one optional in the code.

When I ran it against the 9 node, 12 edge graph in the OR SE question, CPLEX had a solution with 13 cycles (the full number) after about 25 seconds and 5,000 nodes -- way, way more time than the 13 milliseconds the iterative method needed. I set $C=20$ as the upper bound on the number of cycles, which of course is also the initial upper bound on the objective. After 60 seconds (the time limit I set), CPLEX's upper bound was still 20. So probable optimality will not be easy to come by.

I've got one more MILP model up my sleeve, coming in the next post. Meanwhile, as a reminder, my Java code is available here.

Wednesday, January 25, 2023

Finding All Simple Cycles (I)

This is the first of a sequence of posts related to the question of finding all simple cycles in an undirected graph. Motivation for the posts comes from a question on OR Stack Exchange. A simple cycle is a cycle with no repeated nodes (other than the starting/ending node).

Assume we have an undirected graph with nodes indexed from 1 to $N$ and with edge set $E.$ I will use the term "anchor" to refer to the node at which a tour of a simple cycle starts (and ends), and require (without loss of generality) that the anchor of every cycle be the lowest index node in the cycle. Since a cycle is the same whether you traverse it "clockwise" or "counterclockwise", I will also require without loss of generality that nodes in a cycle be listed so that the node immediately after the anchor has smaller index than the node immediately before the anchor. This is just to reduce production of duplicate cycles.

I came up with three approaches to finding all simple cycles. The first, described below, is a simple iterative scheme and is by far the fastest of the three. The other two use mixed integer linear programs (MILPs) -- not because they are efficient but because the OR SE question specifically asked about MILP approaches. I will describe those in subsequent posts.

The iterative scheme is pretty simple. It uses a queue of paths (incomplete cycles) to store work in progress. I will use "terminus" to indicate the last node in a path. When the terminus equals the anchor, we have a cycle. Due to my definition of "anchor", we can say that no node on a path can have a lower index than the anchor of the path.

The iterative scheme loops over possible anchors from 1 to $N.$ For each anchor, we initialize the queue with all paths consisting of a single edge incident at the anchor and not incident at any node less than the anchor. While the queue is not empty, we pop a path and create one or more new paths by extending that path with each qualifying edge. A qualifying edge is one that meets the following conditions:

  • it is incident at the terminus of the current path;
  • it is not incident at any node already on the path (other than anchor);
  • it is not incident at any node less than the anchor; and
  • if it is incident at the anchor, it is not the same as the first edge of the path (i.e., we do not allow a "cycle" of the form $a\rightarrow b \rightarrow a$).

If the new edge leads to the anchor, we have a cycle. If not, we update the terminus and add the extended path to the queue, for further processing later. If a path cannot be extended because it is not a cycle and there are no qualifying edges left, we discard it.
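
Here is a condensed sketch of that loop. The data structures are illustrative, and my actual CycleFinder class differs in the details.

```java
import java.util.*;

class CycleSketch {
    /** Sketch: nodes are 1..n; adj.get(v) lists the edges incident at v as {u, w} pairs. */
    static List<List<Integer>> findAll(int n, Map<Integer, List<int[]>> adj) {
        List<List<Integer>> cycles = new ArrayList<>();
        for (int anchor = 1; anchor <= n; anchor++) {
            Deque<List<Integer>> queue = new ArrayDeque<>();
            // seed with single-edge paths leaving the anchor, avoiding lower-index nodes
            for (int[] e : adj.getOrDefault(anchor, List.of())) {
                int other = e[0] == anchor ? e[1] : e[0];
                if (other > anchor) queue.add(new ArrayList<>(List.of(anchor, other)));
            }
            while (!queue.isEmpty()) {
                List<Integer> path = queue.poll();
                int terminus = path.get(path.size() - 1);
                for (int[] e : adj.get(terminus)) {
                    int next = e[0] == terminus ? e[1] : e[0];
                    if (next == anchor && path.size() > 2) {
                        // keep one orientation: node after anchor < node before anchor
                        if (path.get(1) < terminus) {
                            List<Integer> cycle = new ArrayList<>(path);
                            cycle.add(anchor);
                            cycles.add(cycle);
                        }
                    } else if (next > anchor && !path.contains(next)) {
                        List<Integer> longer = new ArrayList<>(path);
                        longer.add(next);  // extend the path; process it later
                        queue.add(longer);
                    }
                }
            }
        }
        return cycles;
    }
}
```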

Coding the algorithm is pretty straightforward, and it should work on any undirected graph (even those that are not connected). On the example from the OR SE post (9 nodes, 12 edges) it took a whopping 11 milliseconds to find all 13 simple cycles.

Java code for this and the two MILP models (the latter requiring a recent version of CPLEX) can be found in my GitLab repository. The iterative algorithm is in the CycleFinder class. Stay tuned for the MILP models.

Monday, November 15, 2021

A Heuristic Surprise

The maximum cut problem is one of those combinatorics problems that are deceptively easy to articulate and yet NP-annoying to solve. Given an undirected graph $G=(V,E)$, find a partition of $V$ into disjoint subsets $A$ and $B$ such that the set of edges in $E$ with one endpoint in $A$ and the other in $B$ (the "cut set") is as large as is possible. There is also a weighted version, where you maximize the total weights of the cut set rather than its cardinality.

Someone asked on OR Stack Exchange about a variant of the problem, in which there is a specified size (which the author called a "budget") for one of the sets, i.e., a constraint $|A|=b$ for some given $b$. This generated various suggestions, including some heuristics. I had a couple of thoughts about possible heuristics, and in playing with some computational examples found myself rather surprised by the results.

Mixed Integer Program

It is of course quite straightforward to write an integer or mixed-integer linear program for the problem. I'll assume that $V=\lbrace 1,\dots,N\rbrace$ and that $2\le b \le N-1$. Let variable $x_i$ be 1 if vertex $i\in V$ is in $A$ and 0 if it is in $B$, and let variable $y_{i,j}$ be 1 if edge $(i,j)\in E$ is in the cut set ($i\in A$ and $j\in B$ or vice versa) and 0 if not ($i$ and $j$ both in $A$ or both in $B$).
 
Please allow me a slight digression here. The $y$ variables can be declared either continuous or binary. The combination of objective and constraints will force the values to 0 or 1 even if they are declared as continuous. Force of habit, dating back to when I was using very early generation software on not particularly powerful (by current standards) computers, has me declaring every variable as continuous unless it has to be integer. On the other hand, it has been suggested to me that declaring such variables integer-valued may help solvers tighten bounds or fix variables. That makes sense to me, but in this particular model I don't see anything that would suggest making $y$ integer would help the solver.

Digression over, we can write the MIP model as
\begin{alignat*}{1} \max\sum_{(i,j)\in E}y_{ij}\\ \textrm{s.t. }\sum_{i\in V}x_{i} & =b\\ y_{ij} & \le x_{i}+x_{j}\quad\forall(i,j)\in E\\ y_{ij} & \le2-x_{i}-x_{j}\quad\forall(i,j)\in E\\ x_{i} & \in\left\{ 0,1\right\} \quad\forall i\in V\\ y_{ij} & \in\left[0,1\right]\quad\forall(i,j)\in E. \end{alignat*}
 
The first constraint enforces the "budget", while the second and third constraints ensure that $y_{ij}=0$ if $i$ and $j$ are on the same side of the partition. The objective function will force $y_{ij}=1$ if $i$ and $j$ are on different sides. The MIP model is easy to solve for small instances and grows more difficult as the numbers of vertices and edges grow and as $b$ gets closer to $N/2$ (since the number of possible partitions satisfying the budget increases).

Genetic Algorithm

Since the max cut problem is NP-hard, I suspect the budgeted version also is, and in any case it is likely that solving the MIP to optimality can take too long (or too much memory) in many cases. That led me to thinking about heuristics. One easy heuristic is to solve the MIP model but not to optimality, stopping at some arbitrary time or memory/node limit. That requires a MIP solver. Another possibility (among many) is a genetic algorithm.

I was fooling around with the problem in R, which has a very handy genetic algorithm library (named, rather suggestively, GA). Among other things, the GA library allows you to use a permutation as a "chromosome" (solution encoding). So we can define a candidate solution to be a permutation of $1,\dots,N$, and its "fitness" is then the size of the cut set defined by setting $A$ equal to the first $b$ elements of the permutation.

Pairwise Swap

Another possibility that I suggested on OR SE was a simple pairwise swapping heuristic. Start by generating a random choice of $A$ (and, by implication, $B$) and calculate the size of the cut set. Now consider all pairs of vertices $i\in A$ and $j\in B$ and, for each pair, see what happens if you swap them (moving $i$ to $B$ and $j$ to $A$). If the size of the cut set increases, accept the swap; otherwise do not. This is a primitive example of a "neighborhood" search, and it can be used in fancier heuristics, including simulated annealing (where you occasionally accept a swap that actually makes the objective smaller, in order to get you out of your current neighborhood). To keep things simple, I suggested just doing pairwise swaps when the objective improved, with random restarts. That means assigning a time limit to the heuristic, and when you run out of acceptable swaps, start fresh with a new random solution until time is exhausted.
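
Here is a sketch of the core swapping loop, recomputing the cut size from scratch for simplicity (an incremental update of the objective would be considerably faster). My actual implementation is in R; this Java rendition is purely illustrative.

```java
class SwapHeuristic {
    /** One local search pass: keep trying pairwise swaps until no swap improves the cut.
        'adjacent' is a symmetric boolean adjacency matrix; inA[i] marks membership in A. */
    static int improve(boolean[][] adjacent, boolean[] inA) {
        int n = adjacent.length;
        int best = cutSize(adjacent, inA);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < n; i++) {
                if (!inA[i]) continue;
                for (int j = 0; j < n; j++) {
                    if (inA[j]) continue;
                    inA[i] = false;  // tentatively swap i out of A and j into A
                    inA[j] = true;
                    int value = cutSize(adjacent, inA);
                    if (value > best) { best = value; improved = true; break; }
                    inA[i] = true;   // no improvement: revert the swap
                    inA[j] = false;
                }
            }
        }
        return best;
    }

    /** Count edges with endpoints on opposite sides of the partition. */
    static int cutSize(boolean[][] adjacent, boolean[] inA) {
        int count = 0;
        for (int i = 0; i < adjacent.length; i++)
            for (int j = i + 1; j < adjacent.length; j++)
                if (adjacent[i][j] && inA[i] != inA[j]) count++;
        return count;
    }
}
```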

And the surprise is ...

I code the MIP model (solved with CPLEX), the GA and the swapping heuristic in R and ran them on a few random test cases, keeping dimensions small enough that CPLEX could solve the MIP reasonably quickly. Since the swapping heuristic relies on a time limit to stop it, I set a limit of two minutes. For the GA, I set a limit of 10,000 generations (which it never reached, as I expected) and a stagnation limit of 200 generations (meaning that, after 200 consecutive generations with no improvement in the best objective value, it would terminate). Early experiments suggested that the GA could stagnate pretty quickly, so I used an "island" GA (in which several smaller populations on "islands" evolve independently, with periodic migrations from one island to another to freshen the "genetic pools" on the islands).

My expectation was that the "sophisticated" GA would outperform the "primitive" swapping heuristic. Wrong! In the table below, I show the optimality gap (percent worse than the optimum, as computed by CPLEX) for the GA and the swap heuristic on six examples. Since the GA stagnates well before the two minute limit I gave to the swap heuristic, I also ran the GA with restarts (column "GA-R"), which restarted the GA with a new population each time it stagnated until the two minute limit was reached. The restarts improved the GA performance somewhat, but as you can see the swapping heuristic beat it on every one of the six test cases (and found the optimum in four out of six).


Nodes   Edges   Budget (b)   GA gap   GA-R gap   Swap gap
50      429     10           1.7%     1.7%       0.0%
50      490     20           3.0%     0.7%       0.0%
75      416     20           6.4%     4.3%       0.0%
75      416     30           6.8%     4.7%       0.4%
100     495     30           8.4%     7.8%       1.6%
100     990     25           9.1%     4.4%       0.0%

I don't expect the swapping heuristic to find the optimum this consistently on larger problems. Whether it would outperform the GA (with restarts) on tougher problems is an open question. Nonetheless, I take this as a reminder that sometimes simple heuristics can do a pretty good job.

My R notebook is available if you want to play with it. Fair warning: the notebook assumes you have CPLEX installed and loads seven R libraries (four for use with CPLEX, one for the GA, one to time execution of the heuristics and a separate one to set a time limit on the heuristics). If some of that does not interest you, you can of course edit out the related code and skip loading the corresponding libraries.

Wednesday, November 3, 2021

Smallest Polygon Containing A 2D Point Set

A question on OR Stack Exchange asked about an optimization model (specifically a MILP) for finding the "smallest hull" containing a finite set of points in the plane. The phrase "smallest hull" is rather fuzzy. In a follow-up comment, the author of the question clarified that "smallest" means smallest area. That leaves the meaning of "hull". Note that any Hamiltonian path through the points would contain all the points (with all of them on its "boundary") and would have area zero. Based on a sample illustration added by the author, I will assume that "hull" here means polygon (but not necessarily a convex one).

I'm pretty sure that a MILP model could be devised, but I suspect it would be rather large and might solve slowly. So I looked at the possibility of a heuristic solution, and came up with something that seems to do pretty well but definitely does not produce an optimal solution in all cases. I got it working in R. You can see the full notebook (which includes the code) here. (Warning: It's almost 6MB, due to the presence of some graphs.) The notebook uses three R libraries: ggplot2 and plotly, for plotting; and interp, for generating a tessellation (more about that below). The plotly library is used just once, to generate an interactive plot with "tooltips" identifying each point, which is mainly useful for debugging. So if you don't feel like installing it, you can just delete any references to it and still run the important parts of the code.

Suppose we start with the following 2D point set.

[Figure: the point set, with its convex hull]
The heuristic first finds the convex hull of the points (shown above) and a Delaunay triangulation of them, using the tri.mesh function from the interp library. The triangulation is a collection of nonoverlapping triangles, with all vertices taken from the point set, such that no point is in the interior of any triangle. The next plot shows what the triangulation looks like.

[Figure: Delaunay triangulation of the point set]

The heuristic now takes a lap around the boundary of the convex hull. For each line segment $[a,b]$, it locates the (unique) triangle containing that edge. Let's say the third vertex of that triangle is $c$. If $c$ is not currently on the boundary, the heuristic adds it to the boundary, replacing the segment $[a,b]$ with the segments $[a,c]$ and $[c,b]$. It then deletes the $a-b-c$ triangle from the polygon, shaving off some area, and moves on to the next boundary segment.
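
My notebook does this in R (leaning on the interp library for the triangulation), but the logic of one pass is easy to sketch. The Java rendition below is purely illustrative, with the triangulation given as a list of vertex-index triples; the outer loop just calls it until it returns false.

```java
import java.util.*;

class ShaveSketch {
    /** One lap around the boundary. 'boundary' holds the polygon's vertices in cyclic
        order; 'triangles' holds the remaining triangles as vertex-index triples.
        Returns true if any triangle was shaved off. */
    static boolean shaveOnce(List<Integer> boundary, List<int[]> triangles) {
        boolean changed = false;
        for (int k = 0; k < boundary.size(); k++) {
            int a = boundary.get(k);
            int b = boundary.get((k + 1) % boundary.size());
            for (int[] t : triangles) {
                if (!hasEdge(t, a, b)) continue;
                int c = t[0] + t[1] + t[2] - a - b;  // the triangle's third vertex
                if (!boundary.contains(c)) {
                    boundary.add(k + 1, c);          // replace [a,b] with [a,c] and [c,b]
                    triangles.remove(t);             // shave off the triangle's area
                    changed = true;
                }
                break;  // only one remaining triangle can contain boundary edge (a,b)
            }
        }
        return changed;
    }

    static boolean hasEdge(int[] t, int a, int b) {
        boolean hasA = t[0] == a || t[1] == a || t[2] == a;
        boolean hasB = t[0] == b || t[1] == b || t[2] == b;
        return hasA && hasB;
    }
}
```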

This loop is repeated until a full circuit around the boundary fails to make any changes, at which point the heuristic terminates. Here is what the final polygon looks like.

[Figure: the final polygon, shaded to show its interior]

What appear to be line segments sticking out in a few places are actually very skinny triangles. Clearly the heuristic did not select this polygon for its aesthetics.

The points were randomly generated in the unit square, which of course has area 1. For this example, the convex hull has area 0.865, while the final polygon has area 0.511, a reduction of almost 41%. At the same time, it is clear from the final plot that the solution is not optimal. There is one point left in the interior of the polygon, around $(0.493, 0.378)$. The heuristic cannot do anything about it because none of the triangles to which it belongs has an edge on the boundary. We could certainly add line segments from it to two boundary points, at $(0.483, 0.280)$ and $(0.450, 0.175)$, that currently share an edge. That would form a new triangle containing no other points, and we could drop the existing edge, add the two new edges to the boundary, and shed a little more area.

So the heuristic does not produce "nice looking" polygons, nor does it find an optimal solution, but my intuition after running a few examples is that it probably does reasonably well.

Monday, July 5, 2021

Vertex Numbering via GA

A question on OR Stack Exchange asks how to number vertices in a layered graph so that the endpoints of edges have similar numbers. Note that "numbering" the vertices here means numbering within layers (so that, for instance, every layer has a vertex numbered 1). We will assume that every node has a unique ID before we launch into the numbering process. The author of the question chose the sum of squared differences of endpoint labels for all edges as the objective function to minimize. The following image shows an example of a layered graph with a (probably suboptimal) numbering scheme. The numbers on the edges are their contributions to the objective function.

[Figure: an example layered graph with three layers]

The prolific Rob Pratt noted that the problem can be modeled as an assignment problem with a quadratic objective function, using binary variables. That model produces an exact solution, given sufficient time and memory.

Note that numbering the nodes within a layer is equivalent to picking one of the possible permutations of the node IDs. The author of the question indicated receptiveness to a metaheuristic, so I decided to try coding a random key genetic algorithm (RKGA) for the problem. I've mentioned RKGAs before (for instance, here and here). As I understand it, they were originally designed for sequencing / scheduling problems, where things need to be permuted optimally, so an RKGA seemed like a natural choice. I coded both the integer programming (IP) model and the RKGA in Java, using CPLEX 20.1 as the IP solver and Watchmaker Framework 0.7.1 for the GA. The Watchmaker Framework has not been under active development for quite a few years, but it works well.

To use an RKGA, you need to come up with a coding for a "chromosome" (candidate solution) and a mechanism for decoding the chromosome into a solution to the original problem (in this case separate vertex permutations for each graph layer) such that the decoded chromosome is always feasible. I chose as my chromosome representation a double-precision vector with elements between 0 and 1, having one element per vertex in the original graph. Double-precision is probably overkill, but I'm in the habit of using double-precision rather than single-precision, so it was the path of least resistance. To decode the chromosome, I first had to chop it up into smaller vectors (one per layer) and then extract the sort index of each smaller vector. So, using the image above as an example, a chromosome would be a double vector with 2 + 5 + 3 = 10 components. If the last three components were (0.623, 0.021, 0.444) the sort indices would be (3, 1, 2), yielding the vertex numbering for the last layer in the image. To convert a double vector into a sort index vector, I used a Java library (ValueOrderedMap, described in this post) that I wrote some time back.
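
The decoding step is essentially an argsort. My actual code uses the ValueOrderedMap class mentioned above, but a self-contained sketch looks like this:

```java
import java.util.*;
import java.util.stream.IntStream;

class DecodeSketch {
    /** Sketch: turn one layer's slice of the chromosome into a vertex numbering.
        Returns rank[k] = number (1-based) assigned to the layer's k-th vertex. */
    static int[] sortIndices(double[] slice) {
        int m = slice.length;
        Integer[] order = IntStream.range(0, m).boxed().toArray(Integer[]::new);
        Arrays.sort(order, Comparator.comparingDouble(k -> slice[k]));  // positions by key
        int[] rank = new int[m];
        for (int r = 0; r < m; r++) rank[order[r]] = r + 1;
        return rank;
    }
}
```

For the example above, `sortIndices(new double[] {0.623, 0.021, 0.444})` returns `{3, 1, 2}`, matching the numbering shown for the last layer.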

A GA can "stagnate", meaning cease to improve on the best known solution. One obvious reason for stagnation is that it cannot improve on an optimal solution, but stagnation can occur for other reasons. On small test problems, the GA tended to stagnate rather quickly, so I set a stagnation limit and put the GA in a loop that would restart it up to nine times or until a specified time limit (five minutes ... I wasn't very ambitious). On larger test problems, it was not as quick to stagnate, but it eventually did.

I used the GA's best solution as a starting solution for the IP model. On smaller test problems, CPLEX occasionally closed the gap to zero within five minutes but usually did not. On larger problems, CPLEX struggled to get the best bound above zero (100% gap) within five minutes. So I suspect the IP model is prone to loose bounds. An example of a "small" test problem is one with 29 vertices spread across five layers and 35 edges. (I worked mainly with sparse examples, to keep the size of the IP model down.)

Interestingly, in none of the trials did CPLEX improve on the initial solution provided by the GA. So it might be that the GA was consistently finding an optimum, although I cannot prove that. (On the few smaller instances where CPLEX reached provable optimality, the optimal solution was indeed the GA solution.) Note that, while the GA solution appeared optimal, that does not mean that each GA run produced an optimum. On the 29 vertex example mentioned above, CPLEX found a provable optimum with objective value 111. The GA ran 10 times (using about 92 seconds total), and of the 10 runs two stagnated at 111, five at 112, and three at 114. So even if the GA is prone to finding optimal solutions, it may require multiple starts to do so.