Monday, May 10, 2021

Symmetric Difference of Sets in Java

The symmetric difference of two sets $A$ and $B$ is defined as $$C=(A \cup B)\backslash (A \cap B).$$It is the set of objects that belong to either $A$ or $B$ but not both. I've been working on some Java code for a research project in which I will need to compute the sizes of the symmetric differences of a large number of pairs of sets. The contents of the sets are nonnegative integers (Java type Integer), which makes life a bit simpler because integers are nicely ordered. (In Java-speak, Integer implements the Comparable<Integer> interface.) Since I will be doing a large number of differences, and since the sets involved will be moderately large (by my standards), I wanted to find the fastest way to compute a difference. So I wrote a Java program to generate random pairs of integer sets and compute the size of their symmetric difference four different ways.

Since the introduction of streams in Java (I believe in version 8), what I think is the most obvious/intuitive way to do it is to convert each set to a stream, filter out members of the other set, and count up the survivors. On the other hand, I am a happy user of the Apache Commons Collections library, whose CollectionsUtils class provides not one but two ways to do this. One is to use the subtract() method twice, switching the order of the arguments. This does the same thing that my first (stream-based) method does. The other is to use the disjunction() method, which calculates the symmetric difference in one invocation.

Equally obvious to me from a mathematical perspective, but not from a Java perspective, is to take the bitwise exclusive OR of the characteristic vectors of the two sets. Java contains a BitSet class, with set() and flip() methods for individual bits, which makes this easy to do.

My program runs the experiments on instances of both HashSet<Integer> and TreeSet<Integer>. Hash sets are generally faster, but tree sets have more features (and impose an internal ordering on their contents). My somewhat naive intuition was that having the contents accessed in ascending order would make differencing tree sets faster than differencing hash sets.

There are lots of moving parts here (how large the integers get, how big the sets involved in the comparisons are, ...), so I hesitate to read too much into the results. That said, here are some timing results using sets of cardinality between 100 and 200 with integers from 0 to 100,000. Times are in microseconds and are averages of 10,000 replications. (You can afford big samples when things go this quickly.)

Method Streams subtract()
disjunction() BitSet
HashSet 10.07 36.40 62.15 9.68
TreeSet 23.79 35.37 60.94 7.21


I was surprised (OK, shocked) to find that the disjunction() method, which directly computes the symmetric difference, was almost twice as slow as two invocations of the subtract() method. Less surprising was that turning both sets into streams and filtering each other out was faster than calling subtract() twice. (Note that I am computing the size of the symmetric difference, not the symmetric difference itself. Uniting the two filtered streams or the two sets resulting from calls to subtract() into one combined set might move their times closer to those of disjunction()).

Another surprise to me was that the approach using streams was consistently slower with tree sets than it was with hash sets, despite the tree sets being inherently ordered. The characteristic vector (BitSet) approach seemed to have a slight preference for tree sets, but I'm not entirely sure that was consistent.

In this particular run, the characteristic vector approach (using BitSet) beat all comers. In other runs with similar parameters, streams sometimes beat BitSet and sometimes did not when the sets were instances of HashSet. With TreeSet, using BitSet appeared to be consistently faster. Let's contrast that with a second experiment, in which the integers are still between 0 and 100,000 but the sets are smaller (cardinality between 10 and 20).

Method Streams subtract()
disjunction() BitSet
HashSet 3.18 5.96 9.75
TreeSet 3.11


Note that with the smaller sets the method using streams beats the characteristic vector (BitSet) method, and even the Commons Collections subtract() method may be a bit better than using the characteristic vector.

For my project, I am using instances of TreeSet, and I'm fairly certain the sets will (mostly) have cardinality in the low hundreds, so I will probably go with the BitSet approach. If anyone would like to run their own tests, my code is available in a GitLab repository.

Update: It belatedly occurred to me that in my research project, where I only need to get the size of the symmetric difference and not the symmetric difference itself, it might make sense to exploit the fact that the size of the symmetric difference between $A$ and $B$ is$$\vert A\vert + \vert B \vert - 2 \cdot \vert A \cap B\vert.$$So I can compute the size of the intersection, do a little arithmetic, and have the size of the symmetric difference.

I updated my code to include two versions of this approach. One version computes the intersection by streaming one set and filtering it based on inclusion in the other set. The other version uses the CollectionUtils.intersection() method. Once again, the CollectionUtils method was not competitive. Comparing the stream intersection method to the original stream approach and the characteristic vector approach using the original specifications for sets (cardinality 100 to 200), it seems the streamed intersection method is about twice as fast as the original stream method on both hash sets and tree sets (which makes sense, since it does with one stream what the original method did with two). It is also about twice as fast as the characteristic vector method on hash sets, but roughly half as fast on tree sets, as seen in the table below. (Times for the first two methods differ from previous tables because exact timings are unfortunately not reproducible.)

Method Streams BitSet Intersection
HashSet 11.33 10.33 5.59
TreeSet 23.32 7.47


So the single stream intersection approach looks to be the fastest for computing the size (but, again, not contents) of the symmetric difference if I'm using hash sets, while the characteristic vector (BitSet) approach is fastest if I'm using tree sets.