A recent question on Mathematics Stack Exchange dealt with reducing the number of sets in a partition of a set of items. I'll repeat it here but with slightly different terminology from the original question. You start with $N$ items partitioned into $M$ disjoint sets. Your goal is to generate a smaller partition of $K < M$ sets (which I will henceforth call "collections" to distinguish them from the original sets). It is required that all items from any original set end up in the same collection (i.e., you cannot split the original sets). The criterion for success is that "the new [collection] sizes should be as close to even as possible".
This is easily done with an integer programming model. The author of the question thought about minimizing the variance in the collection sizes, which would work, but I'm fond of keeping things linear, so I will minimize the range of collection sizes. I'll denote the cardinality of original set $i$ by $n_i$. Let $x_{ij}$ be a binary variable which is 1 if set $i\in \lbrace 1,\dots, M\rbrace$ is assigned to collection $j\in \lbrace 1,\dots,K\rbrace$ and 0 if not. Let $y$ and $z$ denote the sizes of the smallest and largest collections. Finally, for $j\in \lbrace 1,\dots,K\rbrace$ let $s_j$ be the size (cardinality) of collection $j$. A MILP model for the problem is the following:
\begin{align} \min\,z-y\\ \textrm{s.t. }\sum_{j=1}^{K}x_{ij} & =1\;\; \forall i\in\left\{ 1,\dots,M\right\} \\ \sum_{i=1}^{M}n_{i}x_{ij} & =s_{j}\;\; \forall j\in\left\{ 1,\dots,K\right\} \\ s_{j} & \le z\;\; \forall j\in\left\{ 1,\dots,K\right\} \\ s_{j} & \ge y\;\; \forall j\in\left\{ 1,\dots,K\right\} \\ y,z,s_{\cdot} & \ge0\\ x_{\cdot\cdot} & \in\left\{ 0,1\right\} \end{align}
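For readers who want to see the model in code, here is a minimal sketch of how this formulation might be written with OMPR (the full version is in the notebook mentioned below). The names `set_sizes`, `M` and `K` are placeholders of my choosing, not necessarily what the notebook uses.

```r
library(magrittr)          # pipe operator
library(ompr)
library(ompr.roi)
library(ROI.plugin.cplex)  # ROI plugin that wraps Rcplex

# set_sizes: vector of original set cardinalities (n_i), length M
# K: number of collections
model <- MIPModel() %>%
  add_variable(x[i, j], i = 1:M, j = 1:K, type = "binary") %>%
  add_variable(s[j], j = 1:K, type = "continuous", lb = 0) %>%
  add_variable(y, type = "continuous", lb = 0) %>%
  add_variable(z, type = "continuous", lb = 0) %>%
  set_objective(z - y, "min") %>%
  # each original set is assigned to exactly one collection
  add_constraint(sum_expr(x[i, j], j = 1:K) == 1, i = 1:M) %>%
  # s_j is the size of collection j
  add_constraint(sum_expr(set_sizes[i] * x[i, j], i = 1:M) == s[j], j = 1:K) %>%
  # y and z bracket the collection sizes
  add_constraint(s[j] <= z, j = 1:K) %>%
  add_constraint(s[j] >= y, j = 1:K)

result <- solve_model(model, with_ROI(solver = "cplex"))
```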
The author of the question also indicated an interest in "fast greedy approximate solutions" (and did not specify problem dimensions). The first greedy heuristic that came to my mind was a simple one. Start with $K$ empty collections and sort the original sets into descending size order. Now assign each set, in turn, to the collection that currently has the smallest size (breaking ties whimsically). Why work from largest to smallest set? There will be times when you will want to offset a large set in one collection with two or more smaller sets in another collection, and that will be easier to do if you start big and keep the smaller sets in reserve as long as possible. Rob Pratt, owner of a rather massive reputation score on MSE, correctly noted that this is equivalent to the "longest processing time" heuristic for assigning jobs to machines so as to minimize makespan.
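In case it helps, here is a sketch of what that heuristic looks like in R. The function and variable names (`greedy_assign`, `sizes`) are mine for illustration; the notebook's code differs in detail.

```r
# Greedy (LPT-style) assignment: largest sets first, each to the currently
# smallest collection. `sizes` holds the set cardinalities n_i; K is the
# number of collections.
greedy_assign <- function(sizes, K) {
  order_desc <- order(sizes, decreasing = TRUE)  # largest sets first
  collection <- integer(length(sizes))           # collection index for each set
  load <- numeric(K)                             # current size of each collection
  for (i in order_desc) {
    j <- which.min(load)                         # currently smallest collection
    collection[i] <- j
    load[j] <- load[j] + sizes[i]
  }
  list(assignment = collection, sizes = load)
}

# Example: greedy_assign(c(40, 25, 25, 10), K = 2)
```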
I put together an R notebook to test this "greedy" heuristic against the optimization model (solved with CPLEX). The notebook uses Dirk Schumacher's OMPR package to build the MILP model, which in turn uses the ROI package (requiring the Rcplex package) to communicate with CPLEX. On a test run using nice, round values of $N$, $M$ and $K$ that all ended in zeros (and, in particular, where $K$ divided evenly into both $M$ and $N$), the greedy heuristic came very close to the optimal solution. When I switched to less round numbers ($N=5723$, $M=137$, $K=10$), though, the heuristic did not fare as well. It was fast (well under one second on my PC), but it produced a solution in which collection sizes ranged from 552 to 582 (a spread of 30), whereas CPLEX (in about 21 seconds) found an optimal solution in which every collection had size 572 or 573 (a spread of 1).

So I tacked on a second heuristic to refine the solution of the first one. The second heuristic attempts pairwise swaps of the smallest set in the smallest collection with a larger set from a larger collection (trying donor collections in descending size order). A swap is allowed only if it does not leave the donor collection smaller than the smallest collection was before the swap. The intuition is to shrink the range by making the smallest collection bigger while, if possible, shrinking the largest collection, and, if not, at least shrinking some collection larger than the smallest one. The refinement also ran in well under one second and reduced the range of collection sizes from 30 to 3: still not optimal, but likely good enough for the application the original questioner had in mind.
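For the curious, here is a rough sketch of how such a swap pass might be coded in R, building on the `greedy_assign` output above. It follows the description in the text, but the tie-breaking and the choice of which larger set to take from the donor are my own guesses, not necessarily what the notebook does.

```r
# Swap refinement: repeatedly try to trade the smallest set in the smallest
# collection for a larger set from a larger collection, never leaving the
# donor collection smaller than the smallest collection's starting size.
swap_refine <- function(sizes, assignment, K, max_passes = 100) {
  load <- tapply(sizes, factor(assignment, levels = 1:K), sum)
  load[is.na(load)] <- 0
  for (pass in seq_len(max_passes)) {
    small <- which.min(load)                 # smallest collection
    base  <- load[small]                     # its size before any swap
    in_small <- which(assignment == small)
    a <- in_small[which.min(sizes[in_small])]  # smallest set in that collection
    improved <- FALSE
    for (big in order(load, decreasing = TRUE)) {
      if (big == small) next
      in_big <- which(assignment == big)
      # candidate sets larger than set a whose removal keeps the donor
      # collection at least as large as the smallest collection started out
      cand <- in_big[sizes[in_big] > sizes[a] &
                     load[big] - sizes[in_big] + sizes[a] >= base]
      if (length(cand) > 0) {
        b <- cand[which.max(sizes[cand])]    # one simple choice of swap partner
        assignment[c(a, b)] <- c(big, small)
        load[small] <- load[small] - sizes[a] + sizes[b]
        load[big]   <- load[big]   - sizes[b] + sizes[a]
        improved <- TRUE
        break
      }
    }
    if (!improved) break                     # no eligible swap found; stop
  }
  list(assignment = assignment, sizes = load)
}
```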
You are free to use the R code (which can be extracted from the notebook linked above) under the Creative Commons license that governs the blog.