## Tuesday, March 7, 2023

### Another GA for Clustering

Someone asked on OR Stack Exchange about clustering nodes, a subject that comes up from time to time. The premise is that you have $N$ points (nodes), and for each pair $i \neq j$ there is (or may be) an edge with weight $w_{i,j} \in [0, 1]$ (where $w_{i,j} = 0$ if there is no edge between $i$ and $j$). The objective is to create exactly $G$ groups (clusters), with no group having cardinality greater than $M,$ such that the sum of within-group edge weights is maximized. It is simple to write an integer programming model for the problem, and the question includes one, but the author was looking for something that did not require an IP solver. The question specifies, as an example, dimensions of $N=500$ points, $G=8$ groups, and a maximum group size of $M=70.$
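To make the objective and constraints concrete, here is a minimal Python sketch (not code from the question or from my notebook). The function names, the 0-based point indices, and the toy weight matrix are all assumptions made for illustration, and the weight matrix is assumed to be symmetric.

```python
def within_group_weight(w, groups):
    """Sum w[i][j] over pairs of points in the same group.

    Each unordered pair is counted once, so w is assumed symmetric.
    """
    total = 0.0
    for members in groups:
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                total += w[i][j]
    return total


def is_feasible(groups, n_points, n_groups, max_size):
    """Exactly n_groups groups, none larger than max_size, each point used once."""
    assigned = sorted(p for members in groups for p in members)
    return (len(groups) == n_groups
            and all(len(members) <= max_size for members in groups)
            and assigned == list(range(n_points)))


# Hypothetical toy instance: 4 points, weight 0 means "no edge".
w = [
    [0.0, 0.5, 0.0, 0.25],
    [0.5, 0.0, 0.75, 0.0],
    [0.0, 0.75, 0.0, 0.25],
    [0.25, 0.0, 0.25, 0.0],
]
groups = [[0, 1], [2, 3]]
score = within_group_weight(w, groups)
ok = is_feasible(groups, n_points=4, n_groups=2, max_size=2)
```

A function like `within_group_weight` is what a GA would use as its fitness value once a chromosome has been decoded into groups.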

Since I have suggested genetic algorithms for previous clustering problems (see here and here), it probably will not shock you that I once again looked for a GA approach, using a "random key" algorithm (where the chromosome gets decoded into a feasible solution). In this case, the chromosome consists of $N+G$ values between 0 and 1. The first $G$ values are used to select the sizes $n_1,\dots, n_G$ of the groups. The last $N$ values are converted into a permutation of the point indices $1,\dots, N.$ The first $n_1$ entries in the permutation are the indices of the points in group 1, the next $n_2$ entries give the points in group 2, and so on.
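The second half of the decoding can be sketched in a few lines of Python. The post does not say how the last $N$ keys become a permutation; the sketch below assumes the common random-key convention of sorting the keys and reading off the indices. The function names are hypothetical, indices are 0-based rather than 1-based, and the group sizes are simply taken as given.

```python
def keys_to_permutation(keys):
    """Return point indices ordered by ascending key value.

    Assumed decoding: the point with the smallest key comes first.
    """
    return sorted(range(len(keys)), key=lambda i: keys[i])


def split_into_groups(perm, sizes):
    """Cut the permutation into consecutive blocks of the given sizes."""
    groups, start = [], 0
    for n in sizes:
        groups.append(perm[start:start + n])
        start += n
    return groups


# Hypothetical example: 5 points, group sizes 2 and 3.
point_keys = [0.7, 0.1, 0.9, 0.4, 0.2]
perm = keys_to_permutation(point_keys)
groups = split_into_groups(perm, sizes=[2, 3])
```

Because any vector of keys sorts into some permutation, every chromosome decodes to a valid assignment of points, provided the sizes sum to $N.$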

The tricky part here is converting the first portion of the chromosome into the group sizes. We know that the maximum group size is $M,$ and the minimum group size is $m = N - (G-1)M,$ which is what remains for one group when the other $G-1$ groups are all at maximum size ($m = 10$ in the example above). So the minimum and maximum fractions of the population to assign to any group are $a=m/N$ and $b=M/N,$ where $$m = N - (G-1)M \implies a = 1 - (G-1)b.$$ Now suppose that I have values $\pi_1, \dots,\pi_G \in (a,b)$ and assign group sizes $n_i = \pi_i \cdot N,$ which clearly meet the minimum and maximum size limits. (I'm cheating here, but I'll fix it in a minute.) We want $\sum_{i=1}^G n_i = N,$ which means we need $\sum_i \pi_i = 1.$

To get the $\pi_i,$ we take a rather roundabout route. Assume that the first $G$ entries in the chromosome are $x_1,\dots,x_G \in (0,1).$ Set $$y_i = 1 - \frac{x_i}{\sum_{j=1}^G x_j} \in (0,1)$$ and observe that $\sum_{i=1}^G y_i = G-1,$ since the normalized $x_i$ sum to 1. Finally, set $\pi_i = a + (b-a) y_i \in (a, b).$ We have $$\sum_i \pi_i = G\cdot a + (b-a) \sum_i y_i = G\cdot a + (b-a)(G-1) = (G-1)b + a = 1.$$
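As a numerical check of the derivation, here is a minimal Python sketch of the mapping from the first $G$ keys to the fractions $\pi_i.$ The function name and the sample key values are made up for illustration; the dimensions are the ones from the question.

```python
def group_fractions(x, n_points, max_size):
    """Map keys x_1..x_G in (0, 1) to fractions pi_i in (a, b) that sum to 1."""
    g = len(x)
    b = max_size / n_points      # largest allowed fraction, M / N
    a = 1 - (g - 1) * b          # smallest allowed fraction, m / N
    total = sum(x)
    y = [1 - xi / total for xi in x]          # each in (0, 1), summing to G - 1
    return [a + (b - a) * yi for yi in y]


# N = 500, G = 8, M = 70 as in the question; the keys are arbitrary.
size_keys = [0.12, 0.85, 0.33, 0.57, 0.91, 0.04, 0.68, 0.40]
pi = group_fractions(size_keys, n_points=500, max_size=70)
```

Note that a larger key $x_i$ produces a smaller $y_i$ and therefore a smaller group, which is harmless since the GA only needs the mapping to be consistent.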

Now to confess to my cheating. I said the $i$-th group size would be $n_i=\pi_i \cdot N,$ ignoring the minor problem that the left side is an integer and the right side almost surely is not. So of course we will round $\pi_i \cdot N$ to get $n_i.$ After rounding, though, we can no longer be sure that $\sum_i n_i = N.$ So we iteratively fix it. If $\sum_i n_i < N,$ we add 1 to the smallest $n_i$ and recheck. If $\sum_i n_i > N,$ we subtract 1 from the largest $n_i$ and recheck. Eventually, we end up with a feasible set of group sizes.
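The round-and-repair step might look like the following Python sketch. It is an illustration under assumptions, not the notebook's code: the function name is hypothetical, ties for smallest or largest are broken by taking the first such group, and Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding used elsewhere.

```python
def repair_sizes(fractions, n_points):
    """Round fractions * N to integers, then adjust until the sizes sum to N."""
    # Python's round() sends exact halves to the nearest even integer.
    sizes = [round(p * n_points) for p in fractions]
    while sum(sizes) < n_points:
        # Too few points assigned: grow the (first) smallest group by one.
        sizes[sizes.index(min(sizes))] += 1
    while sum(sizes) > n_points:
        # Too many points assigned: shrink the (first) largest group by one.
        sizes[sizes.index(max(sizes))] -= 1
    return sizes


# Rounding undershoots here (2.5 rounds to 2 four times), so groups grow.
sizes_low = repair_sizes([0.25, 0.25, 0.25, 0.25], n_points=10)

# Rounding overshoots here (1.5 rounds to 2 twice), so a group shrinks.
sizes_high = repair_sizes([0.375, 0.375, 0.25], n_points=4)
```

Only one of the two loops ever runs for a given input, since each adjustment moves the total one step toward $N.$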

All this is encoded in an R notebook I wrote, including a sample problem. Whether the GA gets a "good" (near optimal) partition or not I cannot say, since I did not write the equivalent MIP model. I can say, though, that it gets a sequence of progressively improving partitions.