Friday, March 27, 2020

A Minimum Variance Partition Problem

Someone posed a question on Mathematics Stack Exchange that I answered there. This post is to explain the model I used (and generalize a bit).

In general terms, you start with a collection $\Omega=[\omega_1, \dots, \omega_n]$ of positive integers. I choose to think of the integers as weights. The mission is to partition $\Omega$ into subcollections $S_1, \dots, S_m$ such that (a) you use as few subcollections as possible (i.e., minimize $m$), (b) the variance of their weight totals is minimal, and (c) no subcollection in the partition has total weight greater than some specified constant $L$. (If, for example, $S_1 = [\omega_4, \omega_9, \omega_{10}]$ then the weight total for $S_1$ is $\omega_4 + \omega_9 + \omega_{10}$.)  Why am I saying "collections" rather than sets? This would be a set partitioning problem were it not for the fact that $\Omega$ may contain duplicate entries and so is technically a multiset.

The term "variance" is slightly ambiguous (population variance or sample variance), but happily that will not matter here. Let $$s_j =\sum_{i : \omega_i \in S_j} \omega_i$$be the weight of subcollection $S_j$. Since we are talking about a partition here (every item in exactly one subcollection), $$\sum_{j=1}^m s_j = \sum_{i=1}^n \omega_i = n \bar{\omega}$$where $\bar{\omega}$ is the (constant) mean of all the item weights, and so the mean of the subcollection weights is $$\bar{s} = \frac{n\bar{\omega}}{m},$$which for a given number $m$ of subcollections is constant. The variance of the subcollection weights is $$\frac{1}{m}\sum_{j=1}^m (s_j - \bar{s})^2=\frac{1}{m}\sum_{j=1}^m s_j^2-\bar{s}^2.$$To minimize that, we can just minimize $\sum_j s_j^2$ in the optimization model, and then do the arithmetic to get the variance afterward.

As originally articulated, the problem has two objectives: minimize the variance of the subcollection weights and minimize the number of subcollections used. Normally, a multiobjective problem involves setting priorities or assigning weights to objectives or something along those lines. For the original question, where there are only $n=12$ elements in $\Omega$ (and thus at most $m=12$ sets in the partition), a plausible approach is to find a minimum variance solution for each possible value of $m$ and present the user with the efficient frontier of the solution space, letting the user determine the trade-off between partition size and variance that most appeals to them. For the particular example in the SE question, I was a bit surprised to find that the variance increased monotonically with the number of sets in the partition, as seen in the following table.

# of Subcollections   Minimum Variance



1 infeasible
2 infeasible
3 infeasible
4 infeasible
5 672
6 1048.22
7 1717.39
8 3824.75
9 7473.73
10 9025.80
11 9404.81
12 11667.06

So the five subcollection partition is minimal both in number of subcollections and weight variance. I would be surprised if that generalized.

The QP formulation uses two sets of variables. For $i=1,\dots,n$ and $j=1,\dots,m$, $x_{ij}\in \lbrace 0, 1\rbrace$ determines whether $\omega_i$ belongs to $S_j$ ($x_{ij}=1$) or not ($x_{ij}=0$). For $j=1,\dots,m$, $s_j \ge 0$ is the total weight of subcollection $j$. The objective function is simply $$\min_{x,s} \sum_{j=1}^m s_j^2$$which happily is convex. The first constraint enforces the requirement that the subcollections form a partition (every element being used exactly once): $$\sum_{j=1}^m x_{ij} = 1 \quad\forall i\in \lbrace 1,\dots,n\rbrace.$$For a partition to be valid, we do not want any empty sets, which leads to the following constraint:$$\sum_{i=1}^n x_{ij} \ge 1\quad\forall j\in\lbrace 1,\dots,m\rbrace.$$ The next constraint defines the weight variables: $$s_j = \sum_{i=1}^n \omega_i x_{ij}\quad\forall j\in\lbrace 1,\dots,m\rbrace.$$What about the limit $L$ on the weight of any subcollection? We can enforce that by limiting the domain of the $s$ variables to $s_j\in [0,L]$ for all $j$.

Finally, there is symmetry in the model. One source is duplicate items in $\Omega$ ($\omega_i = \omega_{i'}$ for some $i\neq i'$). That happens in the data of the SE question, but it's probably not worth dealing with in an example that small. Another source of symmetry is that, given any partition, you get another partition with identical variance by permuting the indices of the subcollections. That one is easily fixed by requiring the subcollections to be indexed in ascending (or at least nondecreasing) weight order:$$s_j \le s_{j+1}\quad\forall j\in\lbrace 1,\dots,m-1\rbrace.$$Is it worth bothering with? With the anti-symmetry constraint included, total run time for all 12 models was between one and two seconds on my PC. Without the anti-symmetry constraint, the same solutions for the same 12 models were found in a bit over 100 minutes.

If you would like to see my Java code (which requires CPLEX), you can find it at https://gitlab.msu.edu/orobworld/mvpartition.