## Thursday, April 8, 2021

### A GA Model for a Joint Clustering Problem

A problem in grouping users and servers was posted on Mathematics Stack Exchange and OR Stack Exchange. (Someone remind me to rant about cross-posting in a future blog post. Just don't cross-post the reminder.) The gist of the problem is as follows. We have $S$ servers of some sort, and $U$ users. For each combination of user $u$ and server $s$, we have a parameter $h_{u,s}$ which pertains to the quality / strength / something of service user $u$ would get from server $s$. We are told to group users and servers into a predefined number $G$ of groups or clusters. Every user in cluster $g$ will be served by every server in cluster $g$, but servers in other clusters will interfere with the service to user $u$. (A possible application might be cellular phone service, where signals from towers to which you are not connected might interfere with your signal. Just guessing.)

There is one more parameter, the maximum number ($M$) of servers that can be assigned to a group. It is explicitly stated that there is no limit to the number of users that can be assigned to a group. I'm going to go a step further and assume that every group must contain at least one server but that there is no lower limit to the number of users assigned to a group. (If a group has no users, presumably the servers in that group get to relax and play video games or whatever.)

The objective is to maximize $\sum_{u=1}^U q_u$, the total quality of service, where $q_u$ is the quality of service for user $u$. What makes the problem a bit interesting is that $q_u$ is a nonlinear function of the allocation decisions. Specifically, if we let $\mathcal{S}_1, \dots, \mathcal{S}_G$ be the partition of the set $\mathcal{S} = \lbrace 1,\dots, S\rbrace$ of all servers, and if user $u$ is assigned to group $g$, then $$q_{u}=\frac{\sum_{s\in\mathcal{S}_{g}}h_{us}}{\sum_{s\notin\mathcal{S}_{g}}h_{us}}.$$Note that the service quality for user $u$ depends only on which servers are/are not in the same group with it; the assignments of other users do not influence the value of $q_u$.

An answer to the OR SE post explains how to model this as a mixed-integer linear program, including how to linearize the objective. That is the approach I would recommend. The original poster, however, specifically asked for a heuristic approach. I got curious and wrote a genetic algorithm for it, in R, using the GA library. Since this is a constrained problem, I used a random key GA formulation. I won't go into excessive detail here, but the gist is as follows. We focus on assigning servers to groups. Once we have a server assignment, we simply compute the $q_u$ value for each user and each possible group, and assign the user to the group that gives the highest $q_u$ value.

To assign servers to groups, we start with an "alphabet" consisting of the indices $1,\dots,S$ for the servers and $G$ dividers (which I will denote here as "|"). In the R code, I use NA for the dividers. A "chromosome" is an index vector that permutes the alphabet. Without loss of generality, we can assume that the first server goes in the first group, and the last divider must come after the last group, and thus we permute only the intervening elements of the alphabet. For instance, if $S=5$ and $G=3$, the alphabet is $1,2,3,4,5,|,|,|$ and a chromosome $(2, 7, 4, 6, 5, 3)$ would translate to the list $1, 2, |, 4, |, 5, 3, |$. (Each element of the chromosome is the index of an element in the alphabet.) I would interpret that as group 1 containing servers 1 and 2, group 2 containing server 4, and group 3 containing servers 3 and 5.

There is no guarantee that a random chromosome produces a server grouping that contains at least 1 and at most $M$ servers in every group, so we post-process it by going group by group and adjusting the dividers by the minimum amount necessary to make the current group legal. Once we have the servers grouped, we assign users by brute force and compute an overall fitness of the solution.

I am deliberately leaving out some (OK, many) gory details here. My R code and a test problem are contained in an R notebook that you are welcome to peruse. When I rerun the GA on the same test problem, I get different results, which is not surprising since the GA is a heuristic and is most definitely not guaranteed to produce an optimal solution. Whether the solution is "good enough", and whether it scales to the size problem the original poster has in mind, are open questions.