Showing posts with label network models.

Thursday, March 3, 2022

Finding Almost All Paths

A question posted on Stack Overflow (and subsequently deleted) led to a blog post by Erwin Kalvelagen on how to find all paths between two nodes in a directed graph (possibly with self-loops, i.e. arcs from a node to itself) subject to two constraints: no arc can be used more than once in a path; and there is an upper limit $M$ on the number of arcs used. Note that a path might visit a *node* more than once. It just cannot repeat an arc. The original question seems to have referred to an undirected graph, but Erwin's post works with directed graphs and so will I.

Erwin explored some mixed integer linear programming (MIP) models in his post, and a subsequent post on OR Stack Exchange led to more proposals of MIP models (including one from me). I also suggested that a "brute force" approach might be faster than any of the MIP models. In what follows, I will spell out both my MIP model and the brute force approach I used. Java code for both (which requires CPLEX and the Apache Commons Collections library) is in my code repository.

In what follows $A$ is the set of arcs in the graph, $s$ is the origin node for all paths, $t$ is the destination node for all paths, and $M$ is the maximum number of arcs to include in a path.

MIP model

 
The MIP model uses the following variables. 
  • $u_{a}\in\left\{ 0,1\right\} $ is 1 if and only if arc $a$ is used on the path.
  • $f_{a}\in\left\{ 0,1\right\} $ is 1 if and only if arc $a$ is the first arc on the path.
  • $\ell_{a}\in\left\{ 0,1\right\} $ is 1 if and only if arc $a$ is the last arc on the path.
  • $y_{ab}\in\left\{ 0,1\right\} $ is 1 if and only if arc $b$ immediately follows arc $a$ on the path.
  • $z_{a}\in\left[0,M\right]$ will be the number of arcs preceding arc $a$ on the path (0 if $a$ is not on the path).

Some of the variables can be eliminated (fixed at 0) at the outset.

  • $f_{a}=0$ if node $s$ is not the tail of arc $a.$
  • $\ell_{a}=0$ if node $t$ is not the head of arc $a.$
  • $y_{ab}=0$ if the head of arc $a$ is not the tail of arc $b.$

Since we are just interested in finding feasible solutions, the objective is to minimize 0.

Use of the $z$ variables mimics the Miller-Tucker-Zemlin approach to avoiding subtours in the traveling salesman problem. A common knock on the MTZ formulation for the TSP is that it has a somewhat loose continuous relaxation. Since we have a trivial objective (all feasible solutions are optimal), that is not a concern here.

The constraints are as follows.

  • There must be one first arc and one last arc.
    $$\sum_{a\in A}f_{a}=1,\qquad\sum_{a\in A}\ell_{a}=1.$$
  • At most $M$ arcs can be used.$$\sum_{a\in A}u_{a}\le M.$$
  • An arc is used if and only if it is either the first arc or follows another arc on the path.$$f_{a}+\sum_{b\in A}y_{ba}=u_{a}\quad\forall a\in A.$$
  • The last arc must be a used arc.$$\ell_{a}\le u_{a}\quad\forall a\in A.\tag{1}$$
  • The sequence value of an unused arc is 0.$$z_{a}\le Mu_{a}\quad\forall a\in A.$$
  • No arc can follow the last arc.$$\ell_{a}+\sum_{b\in A}y_{ab}\le1\quad\forall a\in A.\tag{2}$$
  • If an arc is used, either it is the last arc or another arc follows it.$$\ell_{a}+\sum_{b\in A}y_{ab}=u_{a}\quad\forall a\in A.\tag{3}$$
  • If an arc $b$ follows arc $a$, the sequence number of arc $b$ must be one higher than the sequence number of arc $a.$ $$z_{a}-z_{b}+\left(M+1\right)y_{ab}\le M\quad\forall a,b\in A,a\neq b.$$
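For concreteness, here is a partial sketch of how the model can be assembled with the CPLEX Java API. It is not the code from my repository: arc data are assumed to arrive as parallel arrays tail[] and head[] (illustrative names), and constraints (1) and (2) are omitted for a reason explained below.

```java
import ilog.concert.*;
import ilog.cplex.*;

/** A partial sketch of the MIP model above, not the repository code.
    Arcs are assumed given as parallel arrays tail[] and head[]. */
public class PathMIP {
    public static IloCplex build(int[] tail, int[] head, int s, int t, int M)
            throws IloException {
        int nArcs = tail.length;
        IloCplex cplex = new IloCplex();
        IloNumVar[] u = cplex.boolVarArray(nArcs);      // arc is used
        IloNumVar[] f = cplex.boolVarArray(nArcs);      // arc is first on the path
        IloNumVar[] ell = cplex.boolVarArray(nArcs);    // arc is last on the path
        IloNumVar[] z = cplex.numVarArray(nArcs, 0, M); // arcs preceding this arc
        IloNumVar[][] y = new IloNumVar[nArcs][nArcs];  // y[a][b]: b immediately follows a
        for (int a = 0; a < nArcs; a++) {
            if (tail[a] != s) f[a].setUB(0);            // only arcs out of s can be first
            if (head[a] != t) ell[a].setUB(0);          // only arcs into t can be last
            for (int b = 0; b < nArcs; b++) {
                y[a][b] = cplex.boolVar();
                // b must start where a ends; also, an arc cannot follow itself
                if (a == b || head[a] != tail[b]) y[a][b].setUB(0);
            }
        }
        cplex.addEq(cplex.sum(f), 1);                   // one first arc
        cplex.addEq(cplex.sum(ell), 1);                 // one last arc
        cplex.addLe(cplex.sum(u), M);                   // at most M arcs used
        for (int a = 0; a < nArcs; a++) {
            IloLinearNumExpr in = cplex.linearNumExpr();  // f_a + sum_b y_ba
            IloLinearNumExpr out = cplex.linearNumExpr(); // ell_a + sum_b y_ab
            in.addTerm(1, f[a]);
            out.addTerm(1, ell[a]);
            for (int b = 0; b < nArcs; b++) {
                in.addTerm(1, y[b][a]);
                out.addTerm(1, y[a][b]);
            }
            cplex.addEq(in, u[a]);                      // used iff first or follows another
            cplex.addEq(out, u[a]);                     // constraint (3); (1) and (2) are implied
            cplex.addLe(z[a], cplex.prod(M, u[a]));     // unused arcs get sequence value 0
            for (int b = 0; b < nArcs; b++) {
                if (b == a) continue;
                IloLinearNumExpr mtz = cplex.linearNumExpr(); // z_a - z_b + (M+1) y_ab <= M
                mtz.addTerm(1, z[a]);
                mtz.addTerm(-1, z[b]);
                mtz.addTerm(M + 1, y[a][b]);
                cplex.addLe(mtz, M);
            }
        }
        cplex.addMinimize();                            // trivial objective: minimize 0
        return cplex;
    }
}
```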

Over on OR Stack Exchange, Rob Pratt correctly pointed out that constraint (3) implies constraints (1) and (2). I've left them in the Java code anyway. The CPLEX presolver removes them automatically.

Finding all solutions 

To find all solutions, one possible approach is to solve whatever MIP model you choose, then add a "no-good" constraint that eliminates the solution just found (and only that one) and solve again, until eventually the aggregation of "no-good" constraints makes the model infeasible. What I did instead was to use the "populate" command in CPLEX, which accumulates a pool of solutions. Besides a time limit, I used two non-default parameter settings: I cranked up the solution pool capacity (the maximum number of solutions to find) to something high enough to let it find all solutions; and I set the solution pool intensity to its highest value (4), which tells CPLEX to aggressively seek out all feasible solutions.
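In the CPLEX Java API, that setup looks something like the following sketch. The parameter values are illustrative; note that, besides the pool capacity, the populate limit (MIP.Limits.Populate) also caps how many solutions a single call to populate will generate.

```java
import ilog.concert.IloException;
import ilog.cplex.IloCplex;

/** A minimal sketch of the populate settings described above; `cplex` is
    assumed to already hold the model, and the values are illustrative. */
public class PopulateAll {
    public static int populateAll(IloCplex cplex) throws IloException {
        cplex.setParam(IloCplex.Param.TimeLimit, 60);                  // safety net (seconds)
        cplex.setParam(IloCplex.Param.MIP.Pool.Capacity, 1_000_000);   // room for "all" solutions
        cplex.setParam(IloCplex.Param.MIP.Limits.Populate, 1_000_000); // let one call generate that many
        cplex.setParam(IloCplex.Param.MIP.Pool.Intensity, 4);          // aggressively seek feasible solutions
        cplex.populate();                                              // accumulate the solution pool
        return cplex.getSolnPoolNsolns();                              // solutions actually found
    }
}
```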

Brute force approach

 
The brute force approach is remarkably straightforward. We will use a queue of (incomplete) paths that I will call the "to-do list". Start by creating a length-one path for each arc with tail $s$, recording any whose head is $t$ (those are already $s$-$t$ paths), and add them to the to-do list. We now proceed in a loop until the to-do list is empty. At each iteration, we pull a path $P$ off the to-do list, identify the arcs whose tails match the head of the last arc in $P$, remove any arcs already on $P$, and for each surviving arc $a$ create a new path $P'$ by extending $P$ with $a$. If the head of arc $a$ is $t$, $P'$ is a new $s$-$t$ path, which we record. Either way, if $P'$ has length less than $M$, we add it to the to-do list, and eventually try to extend it further.
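Here is a minimal sketch of that loop in Java. It is not the repository code (which has more bells and whistles); arcs are again assumed given as parallel arrays tail[] and head[], and a path is stored as a list of arc indices.

```java
import java.util.*;

/** A minimal sketch of the brute force search, not the repository code. */
public class AllPaths {
    public static List<List<Integer>> findAll(int[] tail, int[] head, int s, int t, int M) {
        List<List<Integer>> paths = new ArrayList<>();   // completed s-t paths
        Deque<List<Integer>> toDo = new ArrayDeque<>();  // the "to-do list"
        for (int a = 0; a < tail.length; a++) {          // seed with arcs leaving s
            if (tail[a] != s) continue;
            List<Integer> p = List.of(a);
            if (head[a] == t) paths.add(p);              // a one-arc s-t path
            if (p.size() < M) toDo.add(p);
        }
        while (!toDo.isEmpty()) {
            List<Integer> p = toDo.poll();
            int end = head[p.get(p.size() - 1)];         // node where p currently ends
            for (int a = 0; a < tail.length; a++) {
                if (tail[a] != end || p.contains(a)) continue; // must continue p; no repeated arcs
                List<Integer> q = new ArrayList<>(p);
                q.add(a);                                // extend p by arc a
                if (head[a] == t) paths.add(q);          // record a new s-t path
                if (q.size() < M) toDo.add(q);           // maybe extend it further later
            }
        }
        return paths;
    }
}
```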
 

Do they work?

 
Would I be posting this if they didn't? :-) I tested both methods on the digraph from Erwin's post, which has 10 nodes and 22 arcs (including two self-loops). The source and sink are nodes 1 and 10 respectively. With $M=3$ there are four paths, and with $M=4$ there are nine paths; both methods find all of them. In both cases, brute force took about 1 ms. on my PC (using non-optimized, single-threaded Java code). CPLEX times were rather stochastic, but I think it fair to say that building the models took around 55+ ms. and solving them typically took 20 ms. or more.
When I set a maximum length ($M$) of 10, things got interesting. The brute force approach found 268 paths (in about 6 ms.), while CPLEX found only 33. Neither the time limit nor the pool size was a factor, so I assume this is just a limitation of the aggressive setting for pool intensity. That means that to find all possible solutions using CPLEX, the solve / add constraint / solve again approach would be necessary.

Monday, April 12, 2021

A Math Puzzle as a Network

There is a standard type of math puzzle that has been around at least since I was a child. The details vary, but the concept is consistent. You are typically given a few initially empty containers of various (integer) capacities, an essentially infinite reservoir of something that goes in the containers, and a goal (integer) for how much of that something you want to end up with. You have to figure out how to reach the goal without having any measuring instruments, meaning that your operations are limited to emptying a container into the reservoir, filling a container from the reservoir, or moving content from one container to another until you empty the source or fill the destination, whichever happens first. (All this is done under the assumption of no spillage, meaning the originator of the puzzle did not have me in mind.) I think I've seen a variant that involves cutting things, where your ability to measure where to cut is limited to stacking pieces you already have as a guide to the piece you want to cut.

A question popped up on Mathematics Stack Exchange about how to solve one of these puzzles using dynamic programming (DP) with backward recursion. The problem at hand involves two jugs, of capacities seven and three liters respectively, and a lake, with the desired end state being possession of exactly five liters of water. The obvious (at least to me) state space for DP would be the volume of water in each jug, resulting in 32 possible states ($\lbrace 0,\dots,7\rbrace \times \lbrace 0,\dots,3 \rbrace$). Assuming the objective function is to reach the state $(5,0)$ with a minimal number of operations, the problem can be cast just as easily as a shortest path problem on a digraph, in which each node is a possible state of the system, each arc has weight 1, and arcs fall into one of the categories mentioned in the previous paragraph.

I was looking for an excuse to try out the igraph package for R, and this was it. In my R notebook, a node label "5|2" would indicate the state where the larger jug contains five liters and the smaller jug contains two. Arcs are labeled with one of the following: "EL" (empty the larger jug); "FL" (fill the larger jug); "ES" (empty the smaller jug); "FS" (fill the smaller jug); "PLS" (pour the larger jug into the smaller jug); or "PSL" (pour the smaller jug into the larger jug).
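For anyone who would rather read code than prose, here is a minimal sketch of the same computation in Java rather than R (so no igraph). Since every arc has weight 1, breadth-first search over the 32 states computes the shortest path.

```java
import java.util.*;

/** A minimal sketch (in Java rather than R): breadth-first search over the
    32 jug states. With unit arc weights, BFS yields the shortest path. */
public class Jugs {
    static final int CAP_L = 7, CAP_S = 3;                 // jug capacities
    static int encode(int big, int small) { return big * (CAP_S + 1) + small; }

    public static void main(String[] args) {
        int start = encode(0, 0), goal = encode(5, 0);     // begin empty, end at "5|0"
        int[] dist = new int[(CAP_L + 1) * (CAP_S + 1)];   // 32 states
        Arrays.fill(dist, -1);
        dist[start] = 0;
        Deque<Integer> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            int state = queue.poll();
            int big = state / (CAP_S + 1), small = state % (CAP_S + 1);
            int pls = Math.min(big, CAP_S - small);        // amount moved by PLS
            int psl = Math.min(small, CAP_L - big);        // amount moved by PSL
            int[] next = { encode(0, small), encode(CAP_L, small),  // EL, FL
                           encode(big, 0), encode(big, CAP_S),      // ES, FS
                           encode(big - pls, small + pls),          // PLS
                           encode(big + psl, small - psl) };        // PSL
            for (int v : next)
                if (dist[v] < 0) { dist[v] = dist[state] + 1; queue.add(v); }
        }
        System.out.println("Operations needed: " + dist[goal]);  // nine, per the text below
    }
}
```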

Assuming I did not screw up the digraph setup, a total of nine operations are required to get the job done. If you are interested, you can see my code (and the solution) in this R notebook.

Thursday, January 28, 2021

A Monotonic Assignment Problem

 A question posted on Stack Overflow can be translated to an assignment problem with a few "quirks". First, the number of sources ($m$) is less than the number of sinks ($n$), so while every source is assigned to exactly one sink, not every sink is assigned to a source. Second, there are vectors $a\in\mathbb{R}^m$ and $b\in\mathbb{R}^n$ containing weights for each source and sink, and the cost of assigning source $i$ to sink $j$ is $a_i \times b_j$. Finally, there is a monotonicity constraint. If source $i$ is assigned to sink $j$, then source $i+1$ can only be assigned to one of the sinks $j+1,\dots,n$.

Fellow blogger Erwin Kalvelagen posted a MIP model for the problem and explored some approaches to solving it. A key takeaway is that for a randomly generated problem instance with $m=100$ and $n=1,000$, CPLEX needed about half an hour to get a provably optimal solution. After seeing Erwin's post, I did some coding to cook up a network (shortest path) solution in Java. Several people proposed similar models (in some cases essentially the same one) in comments on Erwin's post. Today, while I was stuck on a Zoom committee call and fighting with various drawing programs to get a legible diagram, Erwin produced a follow-up post showing the network solution (including the diagram I was struggling to produce ... so I'll refer readers to Erwin's post and forget about drawing it here).

The network is a layered digraph (nodes organized in layers, directed arcs from nodes in one layer to nodes in the next layer). It includes two dummy nodes (a start node in layer 0 and a finish node in layer $m+1$). All nodes in layer $i\in \lbrace 1,\dots,m \rbrace$ represent possible sink assignments for source $i$. The cost of an arc entering a node representing sink $j$ in layer $i$ is $a_i \times b_j$, regardless of the source of the arc. All nodes in layer $m$ connect to the finish node via an arc with cost 0. The objective value of any valid assignment is the sum of the arc costs in the path from start to finish corresponding to that assignment, and the optimal solution corresponds to the shortest path from start to finish.

The monotonicity restriction is enforced simply by omitting arcs from any node in layer $i$ to any node in layer $i+1$ whose sink index is not strictly larger. It also allows us to eliminate some nodes. In the first layer after the start node (where we assign source 1), the available sinks are $1,\dots,n-m+1$. The reason sinks $n-m+2,\dots,n$ are unavailable is that assigning source 1 to one of them and enforcing monotonicity would cause us to run out of sinks before we had made an assignment for every source. Similarly, nodes in layer $i>1$ begin with sink $i$ (because the first $i-1$ sinks must have been assigned or skipped in earlier layers) and end with sink $n-m+i$ (to allow enough sinks to cover the remaining $m-i$ sources).
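Because the cost of an arc depends only on its head node, the shortest path through this network can also be computed layer by layer with a running minimum over the previous layer, without materializing any arcs. Here is a minimal sketch of that variant (illustrative, and not the code in my repository, which builds the network explicitly as described below).

```java
/** A minimal sketch: since an arc's cost a[i]*b[j] depends only on its head,
    the shortest path reduces to a DP with prefix minima, in O(m*n) time. */
public class MonotoneAssignment {
    public static double minCost(double[] a, double[] b) {
        int m = a.length, n = b.length;
        double INF = Double.POSITIVE_INFINITY;
        double[] best = new double[n];             // best[j]: cheapest path ending at sink j
        java.util.Arrays.fill(best, INF);
        for (int i = 0; i < m; i++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, INF);
            double prefixMin = (i == 0) ? 0 : INF; // reaching the start node costs 0
            for (int j = i; j <= n - m + i; j++) { // the node range derived above
                if (j > 0 && best[j - 1] < prefixMin)
                    prefixMin = best[j - 1];       // cheapest predecessor using a sink < j
                if (prefixMin < INF)
                    next[j] = prefixMin + a[i] * b[j]; // assign source i to sink j
            }
            best = next;
        }
        double answer = INF;                       // zero-cost arcs into the finish node
        for (double v : best) answer = Math.min(answer, v);
        return answer;
    }
}
```

Recovering the actual assignments (not just the optimal cost) would only require keeping a parent pointer for each node.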

For the dimensions $m=100$, $n=1000$, the network has 90,102 nodes and 40,230,551 arcs. That may sound like a lot, but my Java code solves it in under four seconds, including the time spent setting up the network. I used the excellent (open-source) algs4 library, and specifically the AcyclicSP class for solving the shortest path problem. Erwin reports even faster solution time for his network model (0.9 seconds, coded in Python), albeit on different hardware. At any rate, he needed about half an hour to solve the MIP model, so the main takeaway is that for this problem the network model is much faster.

If anyone is interested, my Java code is available for download from my Git repository. The main branch contains just the network model, and the only dependency is the algs4 library. There is also a branch named "CPLEX" which contains the MIP model, in case you either want to compare speeds or just confirm that the network model is getting correct results. If you grab that branch, you will also need to have CPLEX installed.


Friday, April 24, 2020

Generating Random Digraphs

In a recent post, OR consultant and blogger Erwin Kalvelagen discussed generating a random sparse network in GAMS. More specifically, he starts with a fixed set of nodes and a desired number of arcs, and randomly generates approximately that number of arcs. His best results, in terms of execution time, came from exporting the dimensions to R, running a script there, writing out the arcs and importing them back into GAMS.

There are three possible issues with the script. Erwin acknowledged the first, which applies to all his approaches: after removing duplicates, he wound up with fewer than the targeted number of arcs. In many applications this would not be a problem, since you would be looking for "about 1% density" rather than "exactly 1% density". Still, there might be times when you need a specific number of arcs, period. You could supplement Erwin's method with a loop that tests the number of arcs and, if short, generates more arcs, removes duplicates, adds the survivors to the network and repeats.

The second possible issue is the occurrence of self-loops (arcs with head and tail the same, such as (5, 5)). Again, this may or may not be a problem in practice, depending on the application. I rarely encounter network applications where self-loops are expected, or even tolerated. Again, you could modify Erwin's code easily to remove self-loops, and it would not increase execution time much.

The third possible issue is that some nodes may be "orphans" (no arcs in or out), and others may be accessible only one way (either inward degree 0 or outward degree 0). Once again, the application will dictate whether this is a matter of concern.

I decided to test a somewhat different approach to generating the network (using R). It has the advantage of guaranteeing the targeted number of arcs, with no self-loops. (It does not address the issue of orphan nodes.) It has the disadvantage of being slower than Erwin's algorithm (but by what I would call a tolerable amount). My approach is based on assigning an integer index to every possible arc. Assume there are $n$ nodes, indexed $0, \dots, n-1$. (Erwin uses 1-based indexing, but it is trivial to adjust from 0-based to 1-based after the network has been generated.) There are $n^2$ arcs, including self-loops, indexed $0,\dots,n^2-1$. The arc with index $k$ is given by $$f(k) = (k \div n, k \bmod n),$$where $\div$ denotes the integer quotient (so that $7 \div 3 = 2$). A self-loop is an arc whose index $k$ satisfies $k\div n = k \bmod n$; those are precisely the arcs with indices $k=m(n + 1)$ for $m=0,\dots,n-1$. So my version of the algorithm is to start with the index set $\lbrace 0,\dots,n^2-1\rbrace$, remove the indices $0, n+1, 2n+2,\dots, n^2-1$, take a random subset of the survivors and apply $f()$ to them.
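Here is a minimal sketch of the indexing scheme in Java rather than R (the notebook linked below holds the actual R code). To keep it short, it rejects self-loop and duplicate indices on the fly rather than deleting them from the full index set up front; the result is the same: exactly the requested number of distinct arcs, none of them self-loops.

```java
import java.util.*;

/** A minimal sketch of the arc-indexing scheme, in Java rather than R. */
public class RandomDigraph {
    public static int[][] generate(int n, int arcs, long seed) {
        Random rng = new Random(seed);
        Set<Long> chosen = new HashSet<>();
        int[][] result = new int[arcs][2];
        int count = 0;
        while (count < arcs) {
            long k = (long) rng.nextInt(n) * n + rng.nextInt(n); // uniform in 0 .. n^2 - 1
            if (k % (n + 1) == 0) continue;   // k = m*(n+1) is a self-loop: skip it
            if (!chosen.add(k)) continue;     // duplicate index: skip it
            result[count][0] = (int) (k / n); // tail = k div n
            result[count][1] = (int) (k % n); // head = k mod n
            count++;
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] net = generate(5000, 250_000, 123); // 1% density, as in the text
        System.out.println("Generated " + net.length + " arcs");
    }
}
```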

I have an R Notebook that compares the two algorithms, using the same dimensions Erwin had: $n=5000$ nodes, with a target density of 1% (250,000 arcs). Timing is somewhat random, even though I set a fixed random number seed for each algorithm. The notebook includes both code and output. As expected, my code gets the targeted number of arcs, with no self-loops. Erwin's code, as expected, comes up a bit short on the number of arcs, and contains a few (easily removed) self-loops. Somewhat interestingly, in my test runs every node had both inward and outward degree at least 1 for both algorithms. I think that is a combination of a fairly large arc count and a bit of luck (the required amount of luck decreasing as the network density increases).

If orphans, or nodes with either no outbound or no inbound arcs, turn out to be problems, there is a fairly easy fix for both methods. First, randomly generate either one or two arcs incident on each node (depending on whether you need both inward and outward arcs everywhere). Then generate the necessary number of additional arcs by adjusting the sample size. As before, you might come up a few arcs short with Erwin's algorithm (unless you include a loop to add arcs until the target is reached). In my algorithm, you can easily calculate the indices of the initial set of arcs (the index of arc $(i,j)$ is $n\times i + j$) and then just remove those indices at the same time that you remove the indices of the self-loops, before generating the remaining arcs.

Tuesday, July 17, 2018

Selecting Box Sizes

Someone posted an interesting question about box sizes on Mathematics Stack Exchange. He (well, his girlfriend to be precise) has a set of historical documents that need to be preserved in boxes (apparently using a separate box for each document). He wants to find a solution that minimizes the total surface area of the boxes used, so as to minimize waste. The documents are square (I'll take his word for that) with dimensions given in millimeters.

To start, we can make a few simplifying assumptions.
  • The height of a box is not given, so we'll assume it is zero, and only consider the top and bottom surfaces of the box. For a box (really, envelope) with side $s$, that makes the total area $2s^2$. If the boxes have uniform height $h$, the area changes to $2s^2 + 4hs$, but the model and algorithm I'll pose are unaffected.
  • We'll charitably assume that a document with side $s$ fits in a box with side $s$. In practice, of course, you'd like the box to be at least slightly bigger, so that the document goes in and out with reasonable effort. Again, I'll let the user tweak the size formula while asserting that the model and algorithm work well regardless.
The problem also has three obvious properties.
  • Only document sizes need be considered as box sizes, i.e. for every selected size at least one document should fit "snugly".
  • The number of boxes you need at each selected size equals the number of documents too large to fit in a box of the next smaller selected size but capable of fitting in a box of this size.
  • You have to select the largest possible box size (since that is required to store the largest of the documents).
What interests me about this problem is that it can be a useful example of Maslow's Hammer: if all you have is a hammer, every problem looks like a nail. As an operations researcher (and, more specifically, practitioner of discrete optimization) it is natural to hear the problem and think in terms of general integer variables (number of boxes of each size), binary variables (is each possible box size used or not), assignment variables (mapping document sizes to box sizes) and so on. OR consultant and fellow OR blogger Erwin Kalvelagen did a blog post on this problem, laying out several LP and IP formulations, including a network model. I do recommend your reading it and contrasting it to what follows.

The first thought that crossed my mind was the possibility of solving the problem by brute force. The author of the original question supplied a data file with document dimensions. There are 1166 documents, with 384 distinct sizes. So the brute force approach would be to look at all $\binom{383}{2} = 73,153$ or $\binom{383}{3} = 9,290,431$ combinations of box sizes (in addition to the largest size), calculate the number of boxes of each size and their combined areas, and then choose the combination with the lowest total. On a decent PC, I'm pretty sure cranking through even 9 million plus combinations will only need a tolerable amount of time.
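For a sense of what brute force entails, here is a minimal sketch of the four-size enumeration in Java. The inputs are illustrative: sizes[] holds the distinct document sizes in increasing order, and cum[j] is the number of documents with size at most sizes[j]; areas are $2s^2$ (top and bottom of a zero-height box).

```java
/** A minimal sketch of the brute force enumeration for the four-size case
    (three chosen sizes plus the mandatory largest). */
public class BoxBruteForce {
    public static double bestFourSizes(double[] sizes, long[] cum) {
        int d = sizes.length;                      // d = 384 for the posted data
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < d - 3; i++)            // smallest chosen size
            for (int j = i + 1; j < d - 2; j++)    // second chosen size
                for (int k = j + 1; k < d - 1; k++) { // third; sizes[d-1] is always chosen
                    double cost = 2 * sizes[i] * sizes[i] * cum[i]
                            + 2 * sizes[j] * sizes[j] * (cum[j] - cum[i])
                            + 2 * sizes[k] * sizes[k] * (cum[k] - cum[j])
                            + 2 * sizes[d - 1] * sizes[d - 1] * (cum[d - 1] - cum[k]);
                    best = Math.min(best, cost);   // keep the cheapest combination
                }
        return best;
    }
}
```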

A slightly more sophisticated approach is to view the problem through the lens of a layered network. There are either three or four layers, representing progressively larger selected box sizes, plus a "layer 0" containing a start node. In the three or four layers other than "layer 0", you put one node for each possible box size, with the following restrictions:
  • the last layer contains only a single node, representing the largest possible box, since you know you are going to have to choose that size;
  • the smallest node in each layer is omitted from the following layer (since layers go in increasing size order); and
  • the largest node in each layer is omitted from the preceding layer (for the same reason).
Other than the last layer (and the zero-th one), the layers here will contain 381 nodes each if you allow four box sizes and 382 if you allow three box sizes. An arc connects the start node to every node in the first layer, and an arc connects every node (except the node in the last layer) to every node in the next higher layer where the head node represents a larger size box than the tail node. The cost of each arc is the surface area for a box whose size is given by the head node, multiplied by the number of documents too large to fit in a box given by the tail node but small enough to fit in a box given by the head node.
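Equivalently, the shortest path recursion over this layered network can be written as a small dynamic program, using the same illustrative inputs as the brute force sketch above; here k is the total number of box sizes allowed (three or four, including the largest).

```java
/** A minimal sketch of the layered-network recursion as a dynamic program;
    sizes[] ascending, cum[] cumulative document counts, as above. */
public class BoxSizesDP {
    public static double minArea(double[] sizes, long[] cum, int k) {
        int d = sizes.length;
        double INF = Double.POSITIVE_INFINITY;
        double[] best = new double[d];             // best[j]: cheapest way to end a layer at size j
        for (int j = 0; j < d; j++)                // layer 1: everything up to sizes[j] in one size
            best[j] = 2 * sizes[j] * sizes[j] * cum[j];
        for (int layer = 2; layer <= k; layer++) {
            double[] next = new double[d];
            java.util.Arrays.fill(next, INF);
            for (int j = layer - 1; j < d; j++)     // head node: box size sizes[j]
                for (int i = layer - 2; i < j; i++) // tail node: smaller size sizes[i]
                    if (best[i] < INF)
                        next[j] = Math.min(next[j],
                                best[i] + 2 * sizes[j] * sizes[j] * (cum[j] - cum[i]));
            best = next;
        }
        return best[d - 1];                        // the largest size must be selected
    }
}
```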

I wanted to confirm that the problem is solvable without special purpose software, so I coded it in Java 8. Although there are plenty of high quality open-source graph packages for Java, I wrote my own node, arc and network classes and my own coding of Dijkstra's shortest path algorithm just to prove a point about not needing external software. You are welcome to grab the source code (including the file of document sizes) from my Git repository if you like.

I ran both the three and four size cases and confirmed that my solutions had the same total surface areas that Erwin got, other than a factor of two (I count both top and bottom; he apparently counts just one of them). How long does it take to solve the problem using Dijkstra's algorithm? Including the time reading the data, the four box version takes about half a second on my decent but not workstation caliber PC. The three box version takes about 0.3 seconds, but of course gives a worse solution (since it is more tightly constrained). This is single-threaded, by the way. Both problem set up and Dijkstra's method are amenable to parallel threading, but that would be overkill given the run times.

So is it wrong to take a fancier modeling approach, along the lines of what Erwin did? Not at all. There are just trade-offs. The modeling approach produces more maintainable code (in the form of mathematical models, using some modeling language like GAMS or AMPL) that is also more easily modified if the use case changes. The brute force and basic network approaches I tried require no extra software (so no need to pay for it, no need to learn it, ...) and work pretty well for a "one-off" situation where maintainability is not critical.

Mainly, though, I just wanted to make a point that we should not overlook simple (or even brute force) solutions to problems when the problem dimensions are small enough to make them practical ... especially with computers getting more and more powerful each year.