Assuming (as I will below) that $Q$ is symmetric and at least positive semidefinite, this problem is easily solved by a variety of QP solvers, and my first recommendation was to link to a suitable library. (The questioner intends to solve the problem within a C++ program.) The questioner, however, prefers to "roll his own" code: he will be solving a large number of small problems, and apparently believes the overhead of getting in and out of libraries might be excessive. Personally, I'm a big fan of code libraries. I make every attempt not to reinvent the wheel, partly to save time and partly because my wheels somehow never come out completely round. To each his own, however. I suggested a few standard algorithms, including Rosen's gradient projection method (for which, curiously, I cannot find a suitable link) and the generalized reduced gradient (GRG) algorithm.
While toddling over to a local coffee shop, though, a very simple (naive?) algorithm occurred to me. I do not have a formal proof of convergence, but it is easy to show that it cannot terminate at a suboptimal solution, so the only questions are (a) how slow it is compared to more sophisticated algorithms and (b) whether it is susceptible to jamming (zigzagging), another term for which I cannot find a suitable link. The algorithm is predicated on the following observations:
- If $x$ is feasible and I shift weight from one component of $x$ to another, moving from $x$ to $\hat{x}=x+\lambda d$ where \[d_k=\begin{cases}+1 & k=i\\-1 & k=j\\0 & i\neq k\neq j\end{cases},\] then $\sum_k \hat{x}_k=1$ and feasibility boils down to $\hat{x}_j\ge 0$, or equivalently $\lambda \le x_j$.
- For $d$ defined as above, if we let $h(\lambda)=f(x+\lambda d)$, then \begin{align*}h'(\lambda) &= \nabla f(x+\lambda d)\,d \\ &= (x+\lambda d)'Qd \\ &= x'Qd + \left(d'Qd\right)\lambda \\ &= \nabla f(x)\,d + \left(d'Qd\right)\lambda \\ &= (g_i - g_j) + \left(Q_{ii} + Q_{jj} - 2Q_{ij}\right)\lambda,\end{align*} where $g=Qx$ is the (transposed) gradient of $f$ at $x$.
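When $Q_{ii}+Q_{jj}-2Q_{ij}>0$, setting $h'(\lambda)=0$ gives the unconstrained minimizer of $h$ along the direction $d$,\[\lambda^*=\frac{-(g_i-g_j)}{Q_{ii}+Q_{jj}-2Q_{ij}},\]and feasibility caps the step at $x_j$, so the step actually taken is $\min\left\{\lambda^*,x_j\right\}$. That leads to the following algorithm: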
$t\leftarrow 0$, $x^{(t)}\leftarrow\left(\frac{1}{n},\dots,\frac{1}{n}\right)$ // initialization
while (not converged) {
$g^{(t)}\leftarrow Qx^{(t)}$ // gradient at $x^{(t)}$
$i\leftarrow \mathrm{arg\,min}_k\left\{ g^{(t)}_k\right\}$ // index of smallest (most negative) gradient component
$j\leftarrow \mathrm{arg\,max}_k\left\{ g^{(t)}_k | x^{(t)}_k > 0\right\}$ // index of largest gradient component with positive $x$
$d^{(t)}_k\leftarrow \begin{cases} +1 & k=i\\ -1 & k=j\\ 0 & i\neq k\neq j \end{cases}$ // search direction
$a^{(t)}\leftarrow g^{(t)}_i - g^{(t)}_j$
$b^{(t)}\leftarrow Q_{ii} + Q_{jj} - 2Q_{ij}$ // derivative along search path is a + b*step
if ($a^{(t)} = 0$) {
STOP // solution is optimal
} else if ($a^{(t)} + b^{(t)}x^{(t)}_j \le 0$) {
$\lambda^{(t)}\leftarrow x^{(t)}_j$ // the objective decays along this direction all the way to the boundary
} else {
$\lambda^{(t)}\leftarrow -a^{(t)}/b^{(t)}$ // move until the directional derivative becomes zero
}
$x^{(t+1)}\leftarrow x^{(t)} + \lambda^{(t)} d^{(t)}$ // take a step
$t\leftarrow t+1$
}
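For concreteness, here is a minimal Java sketch of the algorithm (my own transcription, not the code used in the tests described below; the stopping tolerance `tol` and iteration cap `maxIter` are my additions, and I take $f(x)=\frac{1}{2}x'Qx$ so that the gradient is $Qx$):

```java
import java.util.Arrays;

/** A sketch of the pairwise weight-shifting method above (class and method names are mine). */
public class SimplexQP {

    /** Minimizes 0.5 * x'Qx over the unit simplex; Q is assumed symmetric positive semidefinite. */
    public static double[] solve(double[][] Q, double tol, int maxIter) {
        int n = Q.length;
        double[] x = new double[n];
        Arrays.fill(x, 1.0 / n);                          // start at the center of the simplex
        for (int iter = 0; iter < maxIter; iter++) {
            double[] g = multiply(Q, x);                  // gradient g = Qx
            int i = 0, j = -1;
            for (int k = 0; k < n; k++) {
                if (g[k] < g[i]) i = k;                   // smallest (most negative) gradient component
                if (x[k] > 0 && (j < 0 || g[k] > g[j])) j = k;  // largest gradient component with x_k > 0
            }
            double a = g[i] - g[j];                       // directional derivative at lambda = 0 (always <= 0)
            double b = Q[i][i] + Q[j][j] - 2 * Q[i][j];   // slope of the directional derivative
            if (a > -tol) break;                          // (approximately) optimal
            double lambda = (a + b * x[j] <= 0) ? x[j]    // objective decays all the way to the boundary
                                                : -a / b; // or stop where the directional derivative hits zero
            x[i] += lambda;                               // shift weight lambda from component j to component i
            x[j] -= lambda;
        }
        return x;
    }

    // Plain matrix-vector product y = Qx.
    private static double[] multiply(double[][] Q, double[] x) {
        double[] y = new double[x.length];
        for (int r = 0; r < x.length; r++) {
            for (int c = 0; c < x.length; c++) {
                y[r] += Q[r][c] * x[c];
            }
        }
        return y;
    }
}
```

A call such as `double[] w = SimplexQP.solve(Q, 1e-9, 1000);` returns a nonnegative weight vector summing to one; per iteration the only real work is the matrix-vector product, and even that can be avoided, as noted in the update below.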
I coded a couple of small tests in Java, using both CPLEX and the algorithm above. The CPLEX solver executed considerably faster than my scraggly, non-optimized Java code. No shock there. Interestingly, though, overall execution time (including printing and overhead for function calls) was very close, with my code about 1 ms. slower as measured by the system clock. Note that, at each iteration, I only have to do one matrix-vector multiplication (to get the gradient) plus a handful of scalar operations.
Update: A couple of additional remarks are in order.
- If $Q$ is only positive semidefinite and, at some iteration $t$, $b^{(t)}=0$, then one of two things happens. Either $a^{(t)}=0$ and we stop, or else $h'<0$ all the way to the boundary, and we take $\lambda^{(t)}=x^{(t)}_j$.
- Other than at the outset, a full matrix multiplication is not required to update the gradient vector. It is easy to verify that $g^{(t+1)} = g^{(t)} + \lambda^{(t)}\left(Q_i - Q_j\right)$, where $i$ and $j$ are the indices selected at iteration $t$ and $Q_i$, $Q_j$ are the corresponding columns of $Q$. (Sorry, I absolutely refuse to put a $(t)$ superscript on a subscript -- this is math, not particle physics.) So gradient updates amount to something like $n$ multiplications and $2n$ additions or subtractions; a small code sketch of this update appears after the results below. (If the iteration count gets high enough, you might want to do an occasional full $g=Qx$ multiplication just to clear out some of the rounding error, akin to occasionally refactoring an LP basis.)
In further testing, I observed:
- mean execution times (including setup and extracting solution) of 0.13 ms. for my method vs. 14.78 ms. for CPLEX;
- mean objective ratio (my result/CPLEX result) of 0.9999999971843175 (so we're pretty much tied for optimality);
- mean $L_2$ norm of the difference in weight vectors = $2.8\times 10^{-7}$.
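To illustrate the incremental gradient update from the second remark above, here is a matching sketch (continuing the hypothetical `SimplexQP` class from earlier; by symmetry of $Q$, column $i$ equals row $i$):

```java
// g <- g + lambda * (Q_i - Q_j): updates the gradient after shifting weight lambda
// from component j to component i. Roughly n multiplications and 2n
// additions/subtractions, versus O(n^2) for recomputing g = Qx from scratch.
static void updateGradient(double[] g, double[][] Q, int i, int j, double lambda) {
    for (int k = 0; k < g.length; k++) {
        g[k] += lambda * (Q[k][i] - Q[k][j]);
    }
}
```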
FWIW, (Euclidean) projection onto the feasible set is an easy problem [1], with a nifty combinatorial approach for solving the dual obtained by applying Lagrangian relaxation to the convexity constraint. If the Hessian is small enough, some accelerated constrained gradient method should provably converge nicely (in O(log(1/epsilon)) iterations) and each iteration is just a gradient evaluation and one (or two) projection step(s).
[1]: Duchi, Shalev-Shwartz, Singer, and Chandra, "Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions," 2008.
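For reference, a minimal Java sketch of the sort-based Euclidean projection onto the unit simplex that the comment alludes to (my own transcription of the standard combinatorial method, not code from [1]):

```java
import java.util.Arrays;

class SimplexProjection {
    // Projects v onto {x : x >= 0, sum(x) = 1} by sorting and thresholding.
    static double[] project(double[] v) {
        int n = v.length;
        double[] u = v.clone();
        Arrays.sort(u);                          // ascending, so u[n - k] is the k-th largest entry
        double cumSum = 0.0, theta = 0.0;
        for (int k = 1; k <= n; k++) {
            cumSum += u[n - k];                  // sum of the k largest entries
            double t = (cumSum - 1.0) / k;
            if (u[n - k] - t > 0) {              // the k-th largest entry survives this shift...
                theta = t;                       // ...so keep the threshold for the largest such k
            }
        }
        double[] x = new double[n];
        for (int k = 0; k < n; k++) {
            x[k] = Math.max(v[k] - theta, 0.0);  // shift and clip
        }
        return x;
    }
}
```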
Paul: Thanks for the comment and reference. I'll confess that I was motivated in picking my method by intellectual laziness (not having to worry about tracking changes to the active set, or equivalently worrying about onto which face I would be projecting the gradient). Turns out that the work per iteration is pretty minimal. The key question is whether there are problem instances for which my method would require too many iterations; if so, gradient projection using the method you cite might pay for its extra cost per iteration by cutting down the iteration count sufficiently.
(I'm the original poster on Quora)
Just wanted to let you know, we've implemented your algorithm in Python and C++ with great success!
For our initial tests, we had 6 by 6 matrices. The algorithm usually converges in around 10 iterations (the tolerance is |a| < 1e-4; precision is not very important for us) and gives very good solutions. Our first run on a full image (the application is medical image segmentation), with around 1.5 million voxels and thus 1.5 million problems, took less than five minutes, which is very reasonable from our point of view.
Thanks again for taking the time to think about my problem and coming up with such a nice solution!
Thanks for letting me know how it turned out. (Also, I'm gratified to know that the application is medical imaging and not, say, pricing of bizarre financial instruments. :-)) Since I thought you were dealing with sample covariance or correlation matrices (somewhat noisy buggers), I suspected that a really tight convergence tolerance would probably be wasteful, but I also figured you were the one to decide that, and I see you did.
Glad it's working for you.