
Thursday, April 20, 2017

Sharing "Implicit" Test Problems

The topic of reproducible research is garnering a lot of attention these days. I'm not sure there is a 100% agreed upon, specific, detailed definition of it, and I do think it's likely to be somewhat dependent on the type of research, but for purposes of this post the Wikipedia definition (previous link) works for me. In particular, I want to call out one part of the Wikipedia entry:
The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the laboratory notebooks and full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.
I've seen a lot of press related to reproducibility in the biological sciences, although I assume it's an issue in other sciences as well. (Good luck with that, physicists!) It's also increasingly an issue in the social sciences, where one study asserted that over half of the published results it tested failed to replicate. (That study itself was criticized as allegedly failing to replicate.) All of this has been termed by some a "replication crisis". In short, reproducibility is a big deal. There's at least one blog devoted to it, and a considerable amount of work in the statistics community is going into tools to facilitate reproducibility (such as R and Python "notebooks"). R users in particular should have a look at the rOpenSci Reproducibility Guide.

My research tends almost exclusively to fall into the category of "stupid math programming tricks". Either I'm trying to find some clever formulation of a (deterministic) problem, I'm trying to find an efficient way to generate an exact solution, I'm trying to find a decent heuristic for getting "good" solutions "quickly" (preferably without stretching analogies to nature too far: there's no way I'm beating slime mold optimization in the naming contest), or some combination of the above. Since I mostly avoid statistics (other than the occasional comparison of run times with some selected benchmark alternative), I've been largely unaffected by the debate on reproducibility ... until now.

Recently, the journals published by INFORMS have moved in the direction of reproducible research, and I suspect others are doing so (or will do so in the near future) as well. As an example relevant to my interests, the INFORMS Journal on Computing (JOC) has introduced policies on the sharing of both data and code. Personally, I think this is a good idea. I've shared data and code for one particular paper (machine scheduling) on my web site since the paper came out (20+ years ago), and I share data for a more recent paper as well (having released the code as an open source project).

At the same time, I recognize that there are various difficulties (licensing/copyright, for instance) in doing so, and also there are costs for the researcher. I'd be willing to share the code for one or two other projects, but it would take a major effort on my part to untangle the spaghetti and figure out which libraries are/are not required, and which code should be included or should be excluded as irrelevant to the ultimate experiments. I'm reluctant to commit that time until someone actually requests it.

There's another twist that I would not have anticipated until quite recently. I'm working on a project that involves "implicit hitting set" (IHS) problems. In an IHS problem, you have a master problem that formulates as a pure integer (in fact, 0-1) program. Candidate solutions to that model are fed to one or more "oracles", which either bless the solutions as feasible or generate additional constraints (violated by the candidates) for the master problem. Note that I have not said anything about the nature of the oracles. Oracles might be solving linear or integer programming models, which would be relatively easy to share as "data", but they might also be doing something decidedly funky that is encapsulated in their coding. In the latter case, the test "data" is actually code: I would have to provide the code for the oracles in order for someone to reproduce my results.
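
For concreteness, here is a minimal Java sketch of the master/oracle loop. The names (Oracle, MasterProblem, Constraint, IhsSolver) are inventions for illustration only, not the code from the actual project.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of an implicit hitting set (IHS) loop.
    interface Constraint { }

    interface Oracle {
        // Inspect a candidate solution; return violated constraints
        // (an empty list means the candidate is accepted).
        List<Constraint> check(boolean[] candidate);
    }

    interface MasterProblem {
        boolean[] solve();        // solve the current 0-1 master problem
        void add(Constraint cut); // add a constraint produced by an oracle
    }

    class IhsSolver {
        boolean[] run(MasterProblem master, List<Oracle> oracles) {
            while (true) {
                boolean[] candidate = master.solve();
                List<Constraint> cuts = new ArrayList<>();
                for (Oracle oracle : oracles) {
                    cuts.addAll(oracle.check(candidate));
                }
                if (cuts.isEmpty()) {
                    return candidate;      // every oracle blesses the candidate
                }
                cuts.forEach(master::add); // otherwise tighten the master and repeat
            }
        }
    }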

Well, okay, if sharing code is on the table now, isn't that just more code? Not quite. Let's say that some unfortunate doctoral student has been tasked by her advisor to code their shiny new IHS algorithm and test it against my published problems. The doctoral student uses Python or Julia (or some other trendy language), whereas I used stodgy old Java, so there's a good chance the luckless doctoral student will have to recode my stuff (which, among other things, requires making sense of it). Moreover, I will have created an API to my oracles that may or may not fit with what she is doing ... and that's if I used an API at all. If I wove structures external to the oracle functions directly into the oracles (passing in variables, constraints or other elements of my master problem, for instance), our doctoral student will need to figure out how to isolate those elements and replace them with corresponding elements from her code ... if there is even a correspondence.

There's at least one implication in this for me. If I actually subscribe to the push for reproducibility (or if I take pity on other people's doctoral students), I need to code my oracles not as an integral part of my master code but as a separate software module or library with a well-documented and reasonably flexible interface to the outside world (API). <sigh>More work for me.</sigh>
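
To make that concrete, here is a small illustration (all names hypothetical, and purely a sketch) of the difference between an oracle API that leaks master-model internals and one that exchanges only problem-level data.

    import java.util.Collection;
    import java.util.Set;

    // Tightly coupled (shown only as a comment): the oracle receives solver
    // objects, so anyone reimplementing the master must reproduce my modeling
    // structures.
    //   List<Constraint> check(IloCplex masterModel, IloNumVar[] selectionVars);

    // Decoupled: the oracle sees only which elements the candidate selects and
    // answers with subsets of elements from which the master must still pick one.
    interface HittingSetOracle {
        Collection<Set<Integer>> violatedSubsets(Set<Integer> selectedElements);
    }

The decoupled version is the kind of "well-documented and reasonably flexible interface" the paragraph above has in mind: our hypothetical doctoral student only needs to translate sets of integers, not my solver objects.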

Friday, June 6, 2014

Reproducibility and Java Collections

I've been immersed in Java coding for a research project, and I keep tripping over unintentional randomness in the execution of my code. Coincidentally, I happened to read a blog post today titled "Some myths of reproducible computational research", by C. Titus Brown, a faculty member (one of those hybrid species: Computer Science and Biology) at Michigan State University, my former employer. The post is worth reading. I've seen quite an up-tick in online discussions of the importance of reproducible research, sharing code, sharing data etc., to which I will add one more virtue (specific to computational research): it's a darn sight easier to debug a program whose behavior is deterministic than to debug an inherently stochastic program.

That brings me to a couple of tripwires that have been snagging me on the recent project (and some other projects, for that matter). First, any code that uses parallel threads is capable of injecting some unwanted randomness. A given thread will consume different amounts of time in different runs, due to uncontrollable (and hence, for our purposes, "random") interference by external processes that from time to time are swapped into the CPU core running the given thread. If your program has multiple interdependent threads, the timing of when one thread gets a message from another thread will be different on each run unless you do some serious (and IMHO seriously tedious) magic to synchronize them. I generally shy away from coding multiple threads myself, the exception being programs with a graphical user interface, where the GUI has its own set of scheduling/event threads, and really lengthy computations need to be done in "worker" threads to avoid paralyzing the GUI. Even then, when I'm using a multithreaded library such as CPLEX, execution times vary in unpredictable ways.
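
As a tiny illustration (not code from the project), here is a self-contained Java snippet in which the finishing order of two worker threads can differ from one run to the next, even though the program itself never changes:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ThreadOrderDemo {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            for (String name : new String[] {"A", "B"}) {
                pool.submit(() -> {
                    double sum = 0;
                    for (int i = 1; i <= 5_000_000; i++) {
                        sum += Math.sqrt(i); // busy work of unpredictable duration
                    }
                    System.out.println("worker " + name + " finished (checksum " + sum + ")");
                });
            }
            pool.shutdown();                            // accept no new tasks
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for both workers
        }
    }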

The second tripwire has to do with Java collections. As a mathematician, I tend to think in terms of matrices, vectors and sets, and not so much in terms of lists and maps. I'll often code a collection of objects as a HashSet, both because I think of them as a (mathematical) set rather than a list, and because the Set interface in Java has one key property not shared by the List interface: adding an object to a Set that already contains it does not create a second instance of the object. Unfortunately, a key shortcoming (IMHO) of HashSet is clearly spelled out in its documentation:
It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.
To give one example of what this implies for reproducibility (and debugging), I coded a heuristic for constructing a spanning tree that takes each node from a set (as in HashSet) of unattached nodes and links it to a randomly selected node from the tree under construction. The selection of the node already in the tree to which the new node will be linked (by an edge) is done using a pseudorandom number stream with an explicit, user-specified seed. By repeating the seed, I could reproduce the choices of those nodes, if only the rest of the heuristic were deterministic. Unfortunately, because iterating over a HashSet is random in irreproducible ways, I get a different tree each time I run the code, even with the same seed value. So I need to change that HashSet to a list of some sort ... and I need to remember this lesson the next time I'm tempted to use a HashSet.
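
Here is a stripped-down illustration of the problem and the fix (the Node class and all names are hypothetical, not my actual heuristic). Because Node keeps the default identity-based hashCode, the iteration order of the HashSet can change between runs; copying the nodes into a list with a well-defined order restores determinism.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    public class SpanningTreeSeedDemo {
        static class Node {
            final int id;
            Node(int id) { this.id = id; }
            @Override public String toString() { return "node" + id; }
            // no equals()/hashCode() override: identity hashing => unstable set order
        }

        public static void main(String[] args) {
            Set<Node> unattached = new HashSet<>();
            for (int i = 0; i < 10; i++) {
                unattached.add(new Node(i));
            }

            Random rng = new Random(12345L);  // explicit, user-specified seed

            // Irreproducible: the pairing depends on HashSet iteration order.
            for (Node n : unattached) {
                System.out.println(n + " -> attach to tree node " + rng.nextInt(10));
            }

            // Reproducible: impose a well-defined order before iterating.
            List<Node> ordered = new ArrayList<>(unattached);
            ordered.sort((a, b) -> Integer.compare(a.id, b.id));
            rng = new Random(12345L);
            for (Node n : ordered) {
                System.out.println(n + " -> attach to tree node " + rng.nextInt(10));
            }
        }
    }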

Monday, September 3, 2012

Microsoft Academic Search: Randomization in Action?

Move over, Google Scholar! Microsoft has apparently decided to join the scholarly search fray with Microsoft Academic Search (MAS). The page (whose logo indicates it is in beta test) allows you to search for papers and authors and find all sorts of fascinating (to the author's mother) information ... about someone who may or may not be her offspring.

I discovered this when someone sent me an email message expressing confusion about my relationship to "Paul B. Callahan". Trust me, his confusion was dwarfed by mine, as I've never heard of Mr. (Dr.?) Callahan. The author of the email included a link to this page on MAS. Since the page is likely to change in the future, I'll embed a snapshot below.

[Screenshot: top of the MS Academic Search page for "Paul B. Callahan"]
That is in fact my smiling face next to Dr. Callahan's name, and the "Homepage" link does take you to my home page. In addition to the difference in names, I have never been at Johns Hopkins University (which I'm sure comes as a relief to them), I know next to nothing about neuroscience, and I skipped the pharmacological portions of the '60s and '70s.

So I tried searching MAS for "Paul A. Rubin" and ended up here. Again, I'll include a screen shot.

The picture might be me if taken in very low light, and once again the homepage link is to my home page. My association with George Mason University is much like my association with Johns Hopkins University: nonexistent. (For future reference, I have no association with any university whose name is or sounds like the first and last names of a person.) "Business Administration and Economics" might fit me, to the extent that some of my publications are in journals that fall in that category (and I did teach in a business school), but nuclear physics and nuclear energy are a major mismatch. My parole officer won't let me have a slingshot, much less get near anything fissile. For that matter, I'm wondering if the PAR at GMU (if one really exists) actually has one foot in business and another in nuclear anything. (I say "another" and not "the other" because playing with nuclear stuff has been known to cause mutations.)

If you examine the list of publications on that second page, some are mine and some are very much not.

I searched the site for instructions on how to report errors. The answer is that you do not report them; you edit them out yourself. I have no idea if this means that anyone with an account can edit these pages (what could possibly go wrong with that?), or if the "owner" has to edit the page (in which case who owns the second one, me or my doppelganger at GMU?). In any event, I have no intention of creating an account with Microsoft just so that I can clean up their errors.

If I could edit the pages, I'd be tempted to change my photos into links to Google Scholar. So probably MS is just as happy that I'm not setting up an account.