## Thursday, November 10, 2016

### MIP Models in R with OMPR

A number of R libraries now exist for formulating and solving various types of mathematical programs in R (or formulating them in R and solving them with external solvers). For a comprehensive listing, see the Optimization and Mathematical Programming task view on CRAN. I decided to experiment with Dirk Schumacher’s OMPR package for R. OMPR is a domain specific language for linear and mixed-integer linear programs. OMPR currently relies on the R Optimization Infrastructure package (ROI), which uses a plugin architecture to act as a bridge between R code and third party solvers (including CPLEX). The CPLEX plugin for ROI, in turn, has a dependency on the RCplex package to connect to CPLEX.

Of course, to try out an optimization modeling package you need to have something to optimize. I went back to some ancient research I did, specifically a paper in 1990 on using MIP models to choose linear (or polynomial) discriminant functions for classifying observations into one of two groups. (For the sleep deprived, the full citation is:
Rubin, P. A. Heuristic solution procedures for a mixed-integer programming discriminant model. Managerial and Decision Economics, Vol. 11, No. 4, October 1990.
I used Anderson's iris data for the test case (since it's conveniently available in R, through the datasets package), and just for the halibut I also threw in a support vector machine (SVM) model using the kernlab package for R. Comparing the two is a bit misleading, since SVM models are inherently nonlinear, but I just felt like doing it. In any case, the purpose is to see how OMPR works with CPLEX, not to compare MIP discriminant models and SVMs.

The details are too lengthy for a blog post, so I posted them separately in an R notebook. If you're not familiar with R notebooks, you can find details here. The web page generated by my notebook contains the source code, and there's a control in the upper right of the web page that will let you download it as a notebook (R Markdown) file. You can also grab it from my GitLab repository for the project. As with other content of this blog, it's yours to play with under a Creative Commons license.

The MIP model is as follows. We start with samples of two classes ( "positive" and "negative") containing $p$ features. Let $X_1$ be the $N_1\times p$ matrix of data for the negative sample and let $X_2$ be the $N_2\times p$ matrix of data for the positive sample. The discriminant function we wish to train is $f(x)=w'x+c$; we will classify an observation $x$ as belonging to the positive or negative class according to the sign of $f(x)$. To allow for both limited measurement accuracy of the data and the inevitable adventures with rounding error, we arbitrarily choose some constant $\delta > 0$ and declare a positive observation correctly classified if $f(x)\ge \delta$ and a negative observation correctly classified if $f(x)\le -\delta$.

To avoid the pitfall of having the solver scale $w$ up to some ridiculous magnitude in order to force borderline observations to "look" correctly classified (i.e., to get around the use of $\delta$ as a minimum magnitude for non-zero classification scores), we bound $w$ via its sup norm: $-1 \le w_j \le 1 \, \forall j\in \{1,\dots,p\}$. The constant term $c$ is unbounded and unrestricted in sign.

To detect and count misclassifications (including ambiguous classifications, $-\delta < f(x) < \delta$, we introduce binary variables $y_i,\, i\in\{1,\dots,N_1\}$ for the negative sample and $z_i,\, i\in\{1,\dots,N_2\}$ for the positive sample. Each will take value 0 if the corresponding observation is correctly classified and 1 if not. In the OMPR demo, I just count total misclassifications ($\sum_i y_i + \sum_i z_i$); in general, misclassifications can be weighted to account for prior probabilities, oversampling of one group relative to the other, or relative importance of different types of errors (e.g., false positives looking for cancer tumors are bad, but false negatives can be fatal). There is also a variable $d$ that captures the minimum absolute value of any correct classification score (i.e., how far correct scores are from the boundary value of 0). Larger values are rewarded in the objective function (using a coefficient $\epsilon > 0$).

The model also contains "big M" type constraints to define the binary variables. Formulas for selecting the values of the parameters $M_1$, $M_2$ and $\epsilon$ are given in the paper. So, finally, we get to the MIP model:

\begin{alignat*}{2} & \textrm{minimize} & \mathbf{1}^{\prime}y+\mathbf{1}^{\prime}z-\epsilon d\\ & \textrm{subject to}\quad & X_{1}w+c \cdot \mathbf{1}+d \cdot \mathbf{1}-M_{1} \cdot y & \le 0\\ & & X_{2}w+c \cdot \mathbf{1}-d \cdot \mathbf{1}+M_{2} \cdot z & \ge 0\\ & & w & \le \mathbf{1}\\ & & w & \ge -\mathbf{1}\\ & & c \quad & \textrm{free}\\ & & d \quad & \ge \delta\\ & & y, \: z \quad & \textrm{binary} \end{alignat*}