## Thursday, September 5, 2013

### A Minimum Variance Convex Combination

Someone asked a question on Quora that reduces to the following: how to solve the quadratic program (QP)\begin{gather*}\mathrm{minimize}\,\,f(x)=\frac{1}{2}x'Qx\\\mathrm{s.t.}\,\,\sum_{i=1}^n x_i=1\\x\in \Re^n_+.\end{gather*}The constraints make $x$ the weights in a convex combination of something. My impression of the original question was that $Q$ would be the covariance or correlation matrix of a set of $n$ random variables, in which case (a) $Q$ would be symmetric and at least positive semidefinite (positive definite if the variables are not multicollinear) and (b) the objective function would represent the variance (scaled in the case of that $Q$ is a correlation matrix) of the convex combination of those variables using weights $x$.

Assuming (as I will below) that $Q$ is symmetric and at least positive semidefinite, this problem is easily solved by a variety of QP solvers, and my first recommendation was to link to a suitable library. (The questioner intends to solve the problem within a C++ program.) The questioner, however, prefers to "roll his own" code: he will be solving a large number of small problems, and apparently believes the overhead getting in and out of libraries might be excessive. Personally, I'm a big fan of code libraries. I make every attempt not to reinvent the wheel, partly to save time and partly because my wheels somehow never come out completely round. To each his own, however. I suggested a few standard algorithms, including Rosen's gradient project method (for which, curiously, I cannot find a suitable link) and the generalized reduced gradient (GRG) algorithm.

While toddling over to a local coffee shop, though, a very simple (naive?) algorithm occurred to me. I do not have a formal proof of convergence, but it is easy to show that it cannot terminate at a suboptimal solution, so the only questions are (a) how slow it is compared to more sophisticated algorithms and (b) whether it is susceptible to jamming (zigzagging) (another term for which I cannot find a suitable link). The algorithm is predicated on the following observations:
• if $x$ is feasible and I shift weight from one component of $x$ to another, moving from $x$ to $\hat{x}=x+\lambda d$ where$d_k=\begin{cases}+1 & k=i\\-1 & k=j\\0 & i\neq k\neq j\end{cases},$then$\sum_k \hat{x}_k=1$ and feasibility boils down to $\hat{x}_j\ge 0$, or equivalently $\lambda \le x_j$.
• For $d$ defined as above, if we let $h(\lambda)=f(x+\lambda d)$, then \begin{gather*}h'(\lambda)=\nabla f(x+\lambda d)d \\ = (x+\lambda d)'Qd \\ = x'Qd + \left(d'Qd\right)\lambda \\ = \nabla f(x)d + \left(d'Qd\right)\lambda \\ = (g_i - g_j) + \left(Q_{ii} + Q_{jj} - 2Q_{ij}\right)\lambda,\end{gather*}where $g=Qx$ is the (transposed) gradient of $f$ at $x$.
Armed with that, the naive algorithm can be expressed in words as follows. Start with an interior feasible point (such as the center of the unit simplex). At each point, compute the gradient ($g$) and locate the smallest component among all variables (index $i$) and largest component among all variables that are not zero (index $j$). Clearly $g_i \le g_j$; if $g_i = g_j$, stop (you are unable to find an improving direction). Assuming $g_i < g_j$, set up the direction vector $d$ to shift weight from $x_j$ to $x_i$ in equal measure. The largest feasible step length is $x_j$; the step length at which $h$ turns from decreasing to increasing is $$\frac{(g_j - g_i)}{Q_{ii} + Q_{jj} - 2Q_{ij}}.$$The denominator of that fraction is strictly positive if $Q$ is positive definite. (If $Q$ is only positive semidefinite, there is a danger that it might be 0.) Pick whichever step length is smaller, take that size step in direction $d$, and iterate. Written in a mathematics/pseudocode mishmash:

$t\leftarrow 0$, $x^{(t)}\leftarrow\left(\frac{1}{n},\dots,\frac{1}{n}\right)$ // initialization
while (not converged) {
$g^{(t)}\leftarrow Qx^{(t)}$ // gradient at $x^{(t)}$
$i\leftarrow \mathrm{arg\,min}_k\left\{ g^{(t)}_k\right\}$ // index of smallest (most negative) gradient component
$j\leftarrow \mathrm{arg\,max}_k\left\{ g^{(t)}_k | x^{(t)}_k > 0\right\}$ // index of largest gradient component with positive $x$
$d^{(t)}_k\leftarrow \begin{cases} +1 & k=i\\ -1 & k=j\\ 0 & i\neq k\neq j \end{cases}$ // search direction
$a^{(t)}\leftarrow g^{(t)}_i - g^{(t)}_j$
$b^{(t)}\leftarrow Q_{ii} + Q_{jj} - 2Q_{ij}$ // derivative along search path is a + b*step
if ($a^{(t)} = 0$) {
STOP // solution is optimal
} else if ($a^{(t)} + b^{(t)}x^{(t)}_j \le 0$) {
$\lambda^{(t)}\leftarrow x^{(t)}_j$ // the objective decays along this direction all the way to the boundary
} else {
$\lambda^{(t)}\leftarrow -a^{(t)}/b^{(t)}$ // move until the directional derivative becomes zero
}
$t\leftarrow t+1$
$x^{(t)}\leftarrow x^{(t)} + \lambda d^{(t)}$ // take a step
}

I coded a couple of small tests in Java, using both CPLEX and the algorithm above. The CPLEX solver executed considerably faster than my scraggly, non-optimized Java code. No shock there. Interestingly, though, overall execution time (including printing and overhead for function calls) was very close, with my code about 1 ms. slower as measured by the system clock. Note that, at each iteration, I only have to do one matrix-vector multiplication (to get the gradient $g$, after which the remaining computations for that iteration are pretty trivial. Still, I'm surprised my algorithm came that close in terms of time. As far as accuracy goes, the solutions on test problems matched to within convergence limits.

Update: A couple of additional remarks are in order.
• If $Q$ is only positive semidefinite and, at some iteration $t$, $b^{(t)}=0$, then one of two things happens. Either $a^{(t)}=0$ and we stop, or else $h'<0$ all the way to the boundary, and we take $\lambda^{(t)}=x^{(t)}_j$.
• Other than at the outset, a full matrix multiplication is not required to update the gradient vector. It is easy to verify that $g^{(t+1)} = g^{(t)} + \lambda^{(t)}\left(Q_i - Q_j\right)$ where $i$ and $j$ are the indices selected at iteration $t$. (Sorry, I absolutely refuse to put a $(t)$ superscript on a subscript -- this is math, not particle physics.) So gradient updates amount to something like $n$ multiplications and $2n$ additions or subtractions. (If the iteration count gets high enough, you might want to do an occasional full $g=Qx$ multiplication just to clear out some of the rounding error, akin to occasionally refactoring an LP basis.)
I got curious enough to tweak my Java code and run a set of simulations. I generated covariance matrices randomly. In particular, the matrices were dense and scaled so that variances of the underlying random variables were typically around 2.5. My results might not apply if variances are larger or more spread out, or if there are lots of zero covariances. (Then again, those things might make no difference.) At any rate, I ran 100 replications with $n=10$, contrasting my results with those of CPLEX 12.5.1. I terminated iteration of my method when the step size fell below $10^{-9}$. What I got was the following:
• mean execution times (including setup and extracting solution) of 0.13 ms. for my method v. 14.78 ms. for CPLEX;
• mean objective ratio (my result/CPLEX result) of 0.9999999971843175 (so we're pretty much tied for optimality);
• mean $L_2$ norm of the difference in weight vectors = $2.8\times 10^{-7}$.

1. FWIW, (euclidian) projection on the feasible set is an easy problem [1], with a nifty combinatorial approach for solving the dual obtained by applying Lagrangian relaxation to the convexity constraint. If the Hessian is small enough, some accelerated constrained gradient method should provably converge nicely (in O(log(epsilon)) iterations) and each iteration is just a gradient evaluation and one (or two) projection step(s).

[1]: Efﬁcient Projections onto the ℓ1-Ball for Learning in High Dimensions, Duchi, Shalev-Shwartz, Singer, and Chandra, 2008.

1. Paul: Thanks for the comment and reference. I'll confess that I was motivated in picking my method by intellectual laziness (not having to worry about tracking changes to the active set, or equivalently worrying about onto which face I would be projecting the gradient). Turns out that the work per iteration is pretty minimal. The key question is whether there are problem instances for which my method would require too many iterations; if so, gradient projection using the method you cite might pay for its extra cost per iteration by cutting down the iteration count sufficiently.

2. (I'm the original poster on Quora)

Just wanted to let you know, we've implemented your algorithm in Python and C++ with great success!

For our initial tests, we had 6 by 6 matrices. The algorithm usually converges in around 10 iterations (tolerance is a<1e-4, precision is not very important for us) and gives very good solutions. Our first run on a full image (the application is medical image segmentation), with around 1.5 million voxels and thus 1.5 million problems took less than five minutes, which is very reasonable from our point of view.

Thanks again for taking the time to think about my problem and coming up with such a nice solution!

3. Thanks for letting me know how it turned out. (Also, I'm gratified to know that the application is medical imaging and not, say, pricing of bizarre financial instruments. :-)) Given that I thought you were dealing with sample covariance or correlation matrices (somewhat noisy buggers), I had thought that using a really tight convergence tolerance would probably be wasteful, but I also figured you were the one to decide that, and I see you did.