Thursday, October 17, 2019

A Value-ordered Java Map

For a project I'm working on (in Java), I wanted to store pairs of items in a key-value map. The catch is that I need the items sorted according to the values, not the keys. Java provides an efficient, presorted mapping in the TreeMap class, but it sorts on the keys, not the values. Revering the roles of keys and values is not an option for me, as I might have several keys mapped to the same value.

A Google search wasn't particularly helpful, other than to confirm that I was screwed. The most practical suggestions I saw revolved around the theme of using a conventional mapping (HashMap, TreeMap, whatever) and then, as needed, dumping the entries into a list or stream and sorting that using a comparator based on the values. For my application, though, this would mean modifying the map contents, exporting it and sorting it thousands of times, which would put a big drag on the application. So I really wanted something moderately efficient that would maintain the map contents in sorted order. Eventually, I gave up looking for a solution and wrote one myself. I gave my class the not entirely compact name ValueOrderedMap. It's generic, meaning you can specify any types for keys and values (as long as both types are comparable).

I've implemented most of the commonly used methods associated with map classes (but not all known map methods), and I've tried to make it thread-safe (but as yet have not tested that). Anyone who wants to take it for a spin is welcome to. It's released under a Creative Commons license. There are only two classes, and only one (ValueOrderedMap) is essential; the other is just a small demo program. You can find the source code in my campus GitLab repository. There's an issue tracker (requires registration) where you can report bugs or ask for new features. If you just want a binary jar file (along with the Javadoc, README and license files), I've parked a zip archive on my faculty website.

Tuesday, October 8, 2019

Numerical Inaccuracy (Linear Algebra Edition)

"There are three kinds of lies: lies, damned lies, and statistics." The origin of the statement is lost in time, but the sentiment lives on. I'm tempted to give it a 21st century update. "There are three kinds of lies: lies, damned lies, and floating point arithmetic."

I am currently working on a research project in which the following problem arises. We start with an $m \times n$ matrix $A$ with $m < n$, and a set of indices $I\subseteq \{1,\dots,n\}$. I need to confirm that $x_i = 0$ for all $i\in I$ and for all $x\in \mathcal{N}$, where $\mathcal{N} = \{x\in \mathbb{R}^n : Ax = 0\}$. There are various names for $\mathcal{N}$; I'll go with the null space of $A$.

How do you test whether a particular component of every vector in an uncountably infinite set is zero? Since $\mathcal{N}$ is a vector space, we can compute a basis for it and just check the basis vectors. If $x_i =0$ for every basis vector $x$ of $\mathcal{N}$, it will be true for every vector in $\mathcal{N}$.

This is going on in a Java program, and I have tried multiple linear algebra packages. Initially I had the idea to use a QR decomposition of $A$ to get a basis for the null space, but that requires (at least as far as I know) an upper triangular $R$ component, and neither of the two most promising packages I tried delivered one. (I'm pretty sure the Javadocs for one or both said $R$ would be upper triangular ... but it wasn't.) So I switched to singular value decomposition (SVD) using the EJML library. SVD is generally slower than QR decomposition, but tends to be more accurate.

I got an answer, but a check performed by the Java program warned me that some of the $x_i$ values that were supposed to be zero were in the vicinity of $10^{-6}$, which is a bit dicey to shrug off as rounding error. (The basis vectors are all normalized, by the way.) So I decided to port $A$ and some other stuff over to R and do some testing there ... which is when the adventure began.

The dimensions of this $A$ matrix were $m=401$ and $n=411$. EJML gave me a basis of 17 vectors, indicating that $A$ had nullity 17 and thus rank 394. (The nullity of matrix $A$ is the dimension of $\mathcal{N}$, and is in turn the difference between $n$ and the rank of $A$.) I tested this by doing QR and SVD decompositions of $A$ (using the qr() and svd() functions from the R base package) and the Rank() function from the R pracma package. The QR and SVD results indicated that $A$ had rank 393 and nullity 18. Only pracma::Rank() agreed with EJML ... and that came with the warning "Rank calculation may be problematic". (As if the rest of this stuff isn't.)

So EJML might have shorted me one dimension in the null space ... or maybe not. The smallest eigenvalues of $A$ from the SVD were as follows:  9.16e-05, 5.37e-06, 1.09e-10, 1.13e-15, 7.56e-16, 6.20e-16, 4.87e-16, 3.79e-16, 3.37e-16, 3.16e-16. So the last seven (italicized) were pretty clearly zero and the preceding one (bold) is maybe zero, maybe not. The SVD does not produce the ten zero eigenvalues resulting from the difference between $m$ and $n$, so basically we're looking at a nullity of 18 if you treat the bold one as zero and 17 if you don't.

The 17 basis vectors EJML gave me were all at least close to zero in all components $i\in I$. R had almost all the designated components $10^{-7}$ or smaller. The pracma::nullspace() function gave me 18 basis vectors (despite pracma::Rank() coming up with a nullity of 17), and some of the corresponding $x_i$ were pretty big (0.29, for example) -- way too big to be zero. So my condition is (probably) satisfied if I believe EJML and (definitely) not satisfied if I believe the R computations.

Hoping for some clarity, I calculated (in R) $Ax$ for each basis vector $x$ from either EJML or pracma::nullspace(). The products with the EJML basis vectors were all zero vectors (meaning no components larger than about $10^{-15}$ in magnitude). For the pracma basis vectors, the products had components around $10^{-9}$ in many cases. So much for clarity. Do we treat $10^{-9}$ as approximately zero and believe pracma, or do we call it nonzero and believe EJML?

I'm not new to the limitations of double-precision arithmetic. My first course in grad school was numerical analysis, and I've had a fair share of adventures with MIP models in CPLEX where the numerics were, shall we say, dicey. Still, it's rather frustrating to have to try multiple packages and then have to judge (meaning guess) which one is lying to me the least.