## Thursday, December 9, 2021

### Normalization Constraints

There are occasions where one might consider adding a constraint to an optimization problem specifying that $\|x\| \sim \nu,$ where $x$ is the solution vector, $\nu$ is a positive constant (usually 1, but sometimes something else), the norm is whichever norm (1, 2, $\infty$) suits the modeler's fancy, and the relation $\sim$ is whichever of $\le,$ $=$ or $\ge$ fits the modeler's needs. I can think of two somewhat common reasons for a constraint like this.

The first is a situation where $x$ is free to scale up or down and the modeler is trying to avoid having the solver choose ridiculously large or ridiculously small values for $x.$ Commonly, this is to prevent the solver from exploiting rounding error. As an example, consider the two-group discriminant problem with a linear discriminant function, in which your observations have dimension $k$ and you have two samples, one of observations that should receive a positive discriminant score and one of observations that should receive a negative discriminant score. Let $P$ be an $m_P \times k$ matrix containing the first sample and $N$ an $m_N \times k$ matrix containing the second sample. Your goal is to find a coefficient vector $x\in \mathbb{R}^k$ and scalar $x_0\in \mathbb{R}$ that maximize the number of indices $i$ for which $\sum_j P_{i,j} x_j + x_0 \ge \delta$ plus the number of indices $i$ for which $\sum_j N_{i,j} x_j + x_0 \le -\delta,$ where $\delta > 0$ is a cutoff for how small a discriminant score can be and still be counted as nonzero. There is an obvious formulation as a mixed integer program using binary indicator variables for correct vs. incorrect scores.
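One way to write that formulation down, offered as a sketch rather than the only option, is a big-M model. The binary variables $z^P_i$ and $z^N_i$ (equal to 1 when observation $i$ of the respective sample is scored correctly) and the constant $M$ are notation assumed here for illustration, not part of the problem statement. $M$ must be large enough to relax a constraint whenever its indicator is 0, which in practice presumes some bounds on $x$ and $x_0.$

$$
\begin{aligned}
\max\quad & \sum_{i=1}^{m_P} z^P_i + \sum_{i=1}^{m_N} z^N_i \\
\text{s.t.}\quad & \sum_{j=1}^{k} P_{i,j} x_j + x_0 \ge \delta - M\,(1 - z^P_i), && i = 1,\dots,m_P \\
& \sum_{j=1}^{k} N_{i,j} x_j + x_0 \le -\delta + M\,(1 - z^N_i), && i = 1,\dots,m_N \\
& z^P_i \in \{0,1\},\ z^N_i \in \{0,1\}, \quad x \in \mathbb{R}^k,\ x_0 \in \mathbb{R}.
\end{aligned}
$$

Solvers that support indicator constraints can express the same logic without an explicit $M.$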

From a "pure mathematics" perspective, scaling the coefficients of the discriminant function (including $x_0$) up or down by a constant positive factor does not change whether an observation gets a positive, negative or zero score, but unfortunately the computer's finite precision arithmetic drags us out of the realm of "pure mathematics". Absent some sort of normalization, the solver might scale the coefficients way up to exploit some lucky rounding error (where an observation in the positive sample whose score should be zero or even negative gets a tiny positive score courtesy of rounding error, which becomes big enough to clear the $\delta$ threshold after the solution is scaled up). A normalization constraint is one way to avoid this, but I suspect a better way is just to impose bounds on the variables, which avoids any nonlinearities introduced by the norm function. Note, though, that when the domain of the variables contains both positive and negative values (true in the discriminant example, where coefficients can have either sign), bounds will not stop the solver from scaling the solution downward. If the solver wants to make the solution artificially small (for instance, if there is a penalty for scores that are incorrect by more than a particular amount), bounds will not prevent it; you will need a norm constraint.

The other situation I have encountered is one where zero is a feasible solution that is perhaps attractive from the perspective of the objective function but needs to be eliminated. An example of this recently popped up on OR Stack Exchange. Given a matrix $W,$ the author wants to find a vector $x\in\mathbb{R}^n$ that maximizes the number of nonnegative components of $Wx.$ The author turns this into a MIP model using binary indicator variables for whether a component is nonnegative or not. The catch here is that $x=0\implies Wx=0,$ so the zero solution trivially maximizes the objective (all components nonnegative). The author rules this out by adding the normalization constraint $\| x \|_1=n.$ The one-norm will introduce binary variables, either directly or indirectly (meaning they might be added internally by the solver).
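To see where the binary variables come from, here is one common linearization of $\| x \|_1 = n,$ given as an illustration and not necessarily what the author of the question or any particular solver does. The variables $x_j^+,$ $x_j^-$ and $y_j$ are assumed names. Each $x_j$ is split into positive and negative parts, and the binary $y_j$ forces at most one part to be nonzero, so that their sum really is $|x_j|.$ Since the parts must sum to $n,$ no single part can exceed $n,$ which is why $n$ serves as the big-M coefficient.

$$
\begin{aligned}
& x_j = x_j^+ - x_j^-, && j = 1,\dots,n \\
& 0 \le x_j^+ \le n\, y_j, && j = 1,\dots,n \\
& 0 \le x_j^- \le n\,(1 - y_j), && j = 1,\dots,n \\
& \sum_{j=1}^{n} \left( x_j^+ + x_j^- \right) = n, \\
& y_j \in \{0,1\}, && j = 1,\dots,n.
\end{aligned}
$$

Without the $y_j,$ the solver could make both parts of some $x_j$ positive and satisfy the sum constraint with a vector whose true norm is smaller than $n,$ possibly even $x=0.$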

For this situation, there is an alternative way to rule out zero. Suppose that we draw an observation $r\in\mathbb{R}^n$ from a continuous distribution over a set of full dimension. For simplicity, I like the uniform distribution, so I might take $r\sim U([-1, 1]^n).$ I would normalize the vector (changing $r$ to $r/\|r\|_2$) just to avoid the possibility that it is too close to a zero vector (or, if using a larger domain, too big). In lieu of the norm constraint, I would add the linear constraint $r^\prime x = 1.$ Clearly this eliminates $x=0,$ but does it eliminate other values of $x?$ Most likely not. If $r^\prime x \neq 0$ and if $x$ is free to scale, then the solver can scale an otherwise optimal $x$ to satisfy $r^\prime x = 1.$ Because $r$ is drawn from a continuous distribution over a full-dimensional domain, $\Pr(r^\prime x = 0)=0$ for any nonzero $x.$ So the only real concern would be if the optimal $x$ had a (nonzero) inner product with $r$ small enough that attempting to scale it to meet the normalization constraint would violate some bounds on the components of $x.$ That is a risk I'm usually willing to take.
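The recipe can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code used for the post: the function names, the seed and the test vector are all hypothetical. It also assumes that "free to scale" means scaling by a positive factor, so it reports failure (returns `None`) when $r^\prime x \le 0$ rather than flipping the sign of $x$; whether a negative factor would be acceptable depends on the model.

```python
import math
import random


def inner(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


def random_normalizer(n, rng):
    """Draw r ~ U([-1, 1]^n) and rescale it to unit Euclidean length."""
    r = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    norm = math.sqrt(inner(r, r))
    return [v / norm for v in r]


def scale_to_constraint(x, r):
    """Return a positive multiple of x satisfying r'x = 1.

    Returns None when r'x <= 0, since no positive scaling can work
    (assumption: only positive scale factors are allowed).
    """
    ip = inner(r, x)
    if ip <= 0.0:
        return None
    return [v / ip for v in x]


rng = random.Random(2021)  # arbitrary seed, for reproducibility only
n = 5
r = random_normalizer(n, rng)

# Hypothetical candidate solution: the sign pattern of r, so r'x = ||r||_1 > 0.
x = [1.0 if v >= 0 else -1.0 for v in r]
x_scaled = scale_to_constraint(x, r)

print("r        =", r)
print("x_scaled =", x_scaled)
print("r'x      =", inner(r, x_scaled))
```

In a MIP model, `r` would simply supply the coefficients of one extra linear equality constraint; the scaling function is here only to show that a nonzero candidate with positive inner product can be rescaled to satisfy it.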

Here is the punch line (and the motivation for this post). I suggested using the random product in lieu of the norm constraint in an answer I posted on OR SE. The author of the question commented that he had tried it, and it "improves performance by at least an order of magnitude". Presumably this is by eliminating the (explicit or implicit) binary variables in the computation of the 1-norm, along with any auxiliary variables and constraints used to implement the norm restriction.

I tested a genetic algorithm versus the author's MIP model (modified to replace the norm constraint with a random normalization) using CPLEX. If anyone is curious how the GA does, you can play with my R code (which, as usual, requires one library for the GA and a gaggle of libraries to use CPLEX).