Sunday, March 27, 2011

Mixed Case Titles

Writing pointless academic papers requires citing other pointless academic papers.  Thus do academicians, in a symbiotic process eerily similar to that of lawyers and nuclear weapons manufacturers, keep each other in business.  Since I write in LyX and generate the documents with LaTeX, it is unsurprising that I process bibliographies using BibTeX.  My preferred tool for maintaining BibTeX databases is JabRef, which is cross-platform, has a decent feature set, and just plain works.

When I add references to my BibTeX databases, I'm careful to capitalize most of the words in the title.  Anyone who has had the misfortune to deal with journals knows that every journal publisher employs at least one or two people with obsessive/compulsive disorder, who are tasked with ensuring that their journal formats various things (possibly tables or figure captions, but always, always references) in a way that visually distinguishes it from every other journal in the galaxy. Thus I have to find (or cobble together) a BibTeX style file specific to whatever journal makes the mistake of accepting my paper.  BibTeX style files can "downshift" titles to lower case, but as far as I know they do not "upshift" to what I'll call "title" case (nouns, verbs, adjectives, adverbs and initial articles capitalized, other articles and conjunctions not).  So it behooves me to insert the reference into the database in "title" case.

On the other hand, I'm a firm believer in copying and pasting to avoid retyping.  So I frequently copy article information from a web page or PDF file and paste it into JabRef.  Small problem: as far as I know, JabRef does not have an editing feature to adjust cases.  That led me to some searching, which in turn led me to discover that gedit (my primary editor on Linux) comes with a preinstalled plug-in to shift cases.  Said plug-in can shift to lower, upper or (drumroll) "title" case. On Windows, NoteTab (including the free "Light" version) has a similar feature.  If I recall correctly, NoteTab's converter is available with no configuration.  In gedit, though, it needs to be turned on (Edit > Preferences > Plugins > Change Case), a revelation to which I have recently (and belatedly) come.

Sunday, March 20, 2011

Semicontinuous Variables

This question pops up from time to time on various forums.  In a math programming model, how does one handle a variable that can either be zero or in some range not containing zero?  The mathematical description of the variable would be something like $$x\in \{0\} \cup [a,b]$$ where $a>0$.  A couple of common applications in supply chain models include quantity discounts (a vendor offers a discounted price if you buy in bulk; $x$ is the amount purchased at the discounted price and $a$ is the minimum amount to qualify for the discount) and minimum lot sizes (where $x$ is the amount of a custom-made product obtained from a vendor who requires a minimum order of $a$ to justify the setup cost of the order).

The variable $x$ is known in math programming jargon as semicontinuous.  For whatever reason, the term is not well publicized; I taught a graduate course on integer programming for years without coming across it.  Some solvers can model these variables directly.  For instance, the C++ and Java APIs for CPLEX contain a class named IloSemiContVar.  I'm not sure which other solvers directly support semicontinuous variables. For solvers that do not recognize semicontinuous variables, there is a generic formulation.  You introduce a binary variable $y$ that will be 1 if $x$ is nonzero and 0 if $x$ is zero, and you add the constraints $x\ge ay$ and $x\le by$.
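To see what the linking constraints accomplish, here is a tiny Python sketch (illustrative only, not solver code; the bound values are made up): for each value of the binary $y$, the constraints $x\ge ay$ and $x\le by$ collapse to exactly the right interval for $x$.

```python
def semicontinuous_bounds(y, a, b):
    """Interval of x values allowed by the linking constraints x >= a*y, x <= b*y."""
    return (a * y, b * y)

# hypothetical bounds: minimum order a = 10, upper limit b = 50
a, b = 10.0, 50.0
print(semicontinuous_bounds(0, a, b))  # y = 0 forces x = 0
print(semicontinuous_bounds(1, a, b))  # y = 1 forces 10 <= x <= 50
```

So the binary variable does the work of choosing between the two pieces of $\{0\} \cup [a,b]$, at the cost of one extra variable and two extra constraints.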

AMPL allows you to define a semicontinuous variable without explicitly adding constraints. Our example could be written in AMPL as follows:
param a > 0;
param b > a;
var x in {0} union interval[a, b];
Currently, AMPL converts that into the pair of constraints mentioned above, even if the chosen solver is CPLEX.  Whether that will change in the future I do not know.

One potential virtue of directly modelling semicontinuous variables is that it may lead to a more parsimonious formulation.  If the solver understands semicontinuous variables, it can modify the branching logic to separate nodes based on $x=0$ versus $x\ge a$ without having to add the binary variable $y$ and the extra constraints.

Wednesday, March 16, 2011

Forming Competitive Teams

This post is motivated by the March INFORMS blog challenge: O.R. and Sports.

A couple or so years ago, someone contacted me about a model to assign children to sports teams.  My correspondent volunteered for a youth recreation league in his home town.  Two things are significant in that sentence.  The first is "volunteered", which comes from the Latin phrase meaning "budget = zero".  So whatever solution we came up with had to be cost-free.  The second is "youth".  In recreational sports leagues (at least for the youngest competitors), the tendency is for the parents to enroll their children in a league, rather than a specific team.  The league then has to form rosters, and the goal is not to produce winners (which would imply losers as well), but to form teams that are as close to competitive parity as possible.

The approach taken by this league was to try to get the average (mean) value of various attributes as close to constant across teams as possible, subject to some constraints (minimum and maximum team sizes, for instance).  For qualitative attributes, such as gender, we can treat this as averaging indicator variables (1=girl, 0=boy or 1=experienced, 0=inexperienced).  So far, this looks like a modified version of the Hitchcock (transportation) problem, with kids as unit-supply "sources" and teams as "sinks".  The modifications include some additional constraints (a utilized "sink" has to contain at least enough players to field a full team) and a rather funky (and, as it turns out, nonlinear) objective function.  The nonlinearity occurs because attribute averages involve dividing by team size, which is not constant.
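To make that nonlinearity concrete, here is a small Python sketch (with made-up names and attribute values, not the league's actual data) that computes the per-team mean of one attribute.  The division by a variable team size is precisely what keeps the balance objective from being linear in the assignment decisions.

```python
def team_means(assignment, attribute):
    """Mean attribute value per team.  Note the division by team size,
    which varies with the assignment -- the source of the nonlinearity."""
    totals, sizes = {}, {}
    for kid, team in assignment.items():
        totals[team] = totals.get(team, 0.0) + attribute[kid]
        sizes[team] = sizes.get(team, 0) + 1
    return {t: totals[t] / sizes[t] for t in totals}

# hypothetical data: 1 = experienced, 0 = inexperienced
attr = {"Ann": 1, "Ben": 0, "Cal": 1, "Dee": 0, "Eve": 1}
assign = {"Ann": "Red", "Ben": "Red", "Cal": "Blue", "Dee": "Blue", "Eve": "Blue"}
print(team_means(assign, attr))
```

Moving one child between teams changes both a numerator and a denominator, so the objective cannot be expressed as a linear function of assignment variables.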

Anyone who has designed a model for a real-world (not textbook) problem is probably painfully aware that the problem you are initially asked to model winds up not being the problem you actually have to model.  Sure enough, additional wrinkles creep in.  Parents with twins want/need to have the kids on the same team (to avoid car-pool nightmares).  Some kids may need to be separated (to avoid fights).  Some kids have parents who volunteer (that word again) to be assistant coaches, but only for the team their kid is on; there are not enough assistant coaches to go around, so we really don't want two on the same team, which means separating the kids with volunteer parents.  And so on.

I won't go into mind-numbing details (I'll save that for students that can't outrun me :-)), but the code I ended up writing (called Parity Builder) is available open-source.  All it requires for infrastructure is Sun Java (1.6 or higher), and it comes with a tutorial.

One last point, which I find interesting.  One of my colleagues in organizational behaviour (the "OB world" in the blog title) pointed me to some literature alleging that competitive performance may depend more on the within-team variance of certain attributes than the mean.  If true (and if it applies to any of the attributes used by the recreational league), perhaps the approach we took does not really optimize balance.  On the other hand, it almost surely optimizes the appearance of balance, and the true objective function was probably "minimize parental bitching".  So maybe we're good after all.

Sunday, March 13, 2011

Random Sampling in AMPL

This came up on the AMPL user group; I figured I'd "double dip" and post an edited version of my solution here.  The question was how to randomly sample without replacement in AMPL.  The code below shows one way.

# randomly sample items from a set, with or without replacement
set UNIVERSE ordered;  # set of items to sample -- must be ordered
param ssize integer > 0;  # sample size desired
param sample {1..ssize} symbolic;  # will hold generated sample

data; # data below is just for demonstration purposes
set UNIVERSE := 'cat', 'dog', 'bird', 'fish', 'turtle', 'snake', 'frog', 'rock';
param ssize := 4;
model;  # back to the code

let ssize := min(ssize, card(UNIVERSE));  # omit if sampling with replacement
param item symbolic;  # dummy parameter to hold each selection
for {i in 1..ssize} {
  let item := member(ceil(Uniform(0, card(UNIVERSE))), UNIVERSE);
  let sample[i] := item;
  let UNIVERSE := UNIVERSE diff {item};  # omit if sampling with replacement
}

display sample;

I might mention a few things about the code:
  • I used symbolic data in the example. If you are sampling from a set of numbers, you can omit the two instances of the keyword symbolic.
  • If you want to sample with replacement, there are two lines (marked by comments) that you need to omit.
  • The code puts the sample in a (vector) parameter.  If you are sampling without replacement, an alternative is to put the code into a set.  Define the set with a default value of {}, and use the union operator with second argument {item} to add the item to the sample set inside the loop.
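For comparison (a sketch, not part of the AMPL solution), the same sampling logic is essentially a one-liner in Python's standard library:

```python
import random

universe = ['cat', 'dog', 'bird', 'fish', 'turtle', 'snake', 'frog', 'rock']
ssize = 4

# without replacement: random.sample never repeats an item
sample = random.sample(universe, min(ssize, len(universe)))

# with replacement: random.choices may repeat items
sample_wr = random.choices(universe, k=ssize)
print(sample, sample_wr)
```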

Thursday, March 10, 2011

Configuring R for Java

I have two R packages that have consistently refused to be updated: rgl (which I believe is used for 3D visualization) and rJava (an API for accessing Java from R).  I probably don't really need either one, but at least one R GUI (JGR) needs rJava.  I must have figured out (and then forgotten) the configuration magic that allows rJava to update, because the update.packages() command on my office PC (Linux Mint) has no problem with it ... but update.packages() on my laptop (also Linux Mint) kept running into an error.

So here is the solution (which I found here), just in case I need it again.
  • Run sudo update-alternatives --config java (in a terminal) and pick Sun Java as the default. (I thought I'd already done that, but apparently not.)
  • Add export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/jre to .bashrc (again, thought I'd done that but hadn't).
  • Run sudo R CMD javareconf in a terminal.  (That I had done -- multiple times, with no error messages -- but apparently with the wrong configuration in place.)
  • Run R as root and update the package.
I hope that, the next time this comes up, I at least remember I put the necessary steps here.

Wednesday, March 9, 2011

Honey, I Shrank the Registry!

I have one rather superannuated PC running Windows XP, which I refuse to upgrade to Win 7 for various reasons (not least being that I suspect it would run at the speed of glacial drift, if at all). Lately, it's been complaining at boot that the system registry is too large, and that future requests for registry space will be denied (and the offending programs no doubt will be executed).

So I did a little searching and found a pair of free programs by a programmer in Germany named Lars Hederer: ERUNT and NTREGOPT.  The former backs up and restores the registry (not a bad idea when you plan to perform surgery on it), while the latter compresses it.  In my case, NTREGOPT reduced the registry size by 75% (!!), from around a quarter GB to something in the 70 MB range.

Hmm ... wonder if it could reduce my size (weight please; advancing age is already attacking my height) (and preferably not 75%).

Tuesday, March 8, 2011

A Vote for VIFs

Shown below is output from a linear multiple regression model run in Minitab:

Predictor   Coef SE Coef     T     P
Constant  2.4765  0.7055  3.51 0.001
DIAMETER -0.7880  0.1826 -4.31 0.000
WEIGHT   0.44314 0.03633 12.20 0.000 

Other statistics programs I've used, or whose output I've seen, may format things differently, but they generally stick to the same fundamental ingredients: the estimated coefficient value; the standard error of that coefficient; the T-ratio; and the p-value of that ratio.  Note that there is a certain redundancy here.  If I know any two of the first three items, I can figure out the third rather easily.  If I know the T-ratio (and the degrees of freedom), I can get the p-value.  (Going the other direction is subject to accuracy problems, since small p-values tend to be truncated).
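To see the redundancy concretely, here is a quick Python sketch (mine, not Minitab's) that recomputes the T-ratios from the coefficients and standard errors in the output above.  Small discrepancies against the printed T column are just rounding in the displayed values.

```python
# coefficients and standard errors copied from the Minitab output above
rows = {
    "Constant": (2.4765, 0.7055),
    "DIAMETER": (-0.7880, 0.1826),
    "WEIGHT":   (0.44314, 0.03633),
}
for name, (coef, se) in rows.items():
    print(f"{name:8s} T = {coef / se:6.2f}")  # T-ratio = coefficient / std. error
```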

Redundancy is not really the issue, though.  What do we care about in that output?  Certainly we want the coefficient, and quite possibly we want the standard error (if we are going to hand compute confidence intervals, for instance).  If we care about statistical significance (and we generally should), we want to see the p-value.  Since the sole reason for computing the T-ratio is to get to the p-value, I consider it a waste of space.  I suspect it is a holdover from bygone days when software did not compute p-values, and users had to look up the T-ratio in tables.

Meanwhile, there is something missing from the standard output that I think really should be there:  the variance inflation factors (VIFs).  VIFs give you a quick diagnostic check of whether you need be concerned about multicollinearity.  In a doctoral seminar on regression that I've taught in recent years, we read papers from the business and social science literature in which regression models are presented (and used to test hypotheses).  The vast majority of these papers contain no indication that the authors checked for multicollinearity.  In a few cases, the authors state that because the correlation matrix of the predictors exhibits no large pairwise correlations, multicollinearity is not a concern.  This sense of security is entirely unfounded when there are more than two predictors, as multicollinearity can occur without any large pairwise correlations.

Perhaps if we saw VIFs every time we ran a multiple regression, we would get in the habit of spot-checking for multicollinearity.  Most software will produce VIFs, but you may have to dig to find out how to get them.  (In R, for instance, I have to load the "car" library to find a VIF function.)  If software vendors are concerned about real estate on the screen/page, I nominate the T-ratio column as something we can sacrifice to make room for VIFs.
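As an illustration of what the computation involves (a sketch, not how any particular package implements it), here is a Python/NumPy version using synthetic data: each predictor is regressed on the remaining ones, and $VIF_j = 1/(1-R_j^2)$.  The third predictor is built as a near-linear combination of the first two, so all three VIFs come out large even though no pairwise correlation is extreme.

```python
import numpy as np

def vifs(X):
    """VIF for each column of X: regress it on the other columns (with
    intercept) and compute 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)  # nearly a linear combination
print(vifs(np.column_stack([x1, x2, x3])))
```

A common rule of thumb treats VIFs above 10 (some say 5) as a red flag; the synthetic example above blows well past that.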

Sunday, March 6, 2011

Syntax Highlighting

I just burned about an hour of my life going back and retrofitting syntax highlighting to some of my old posts. The fault lies entirely with Bo Jensen, who first suggested it.  Some of the code to be highlighted (including what triggered the suggestion) is in R, some in Java, a bit in Bash.  (Thank goodness I got over APL decades ago; that would be a font nightmare.)  Obviously, for blogging purposes, I need a highlighter that generates HTML, not just a syntax highlighting code editor.  Given my rather modest output rate, an online highlighter would be just fine (no need to install it locally).  Finally, and this turns out to disqualify several of the available R highlighters, I like having function names highlighted, not just keywords.

So I scrounged around the 'Net a bit and found two very useful sites:
Thanks to both sites for making my life easier.

One side note:  This morning, by sheer coincidence, I received a couple of tweets from Hakan Kjellerstrand indicating that he was experimenting with a GPL version of the J programming language.  Curious, I took a look at some code samples on the Jsoftware site and started having flashbacks to APL.  While perusing the Wikipedia page for APL (linked above), I discovered that the flashback was not just random:  Kenneth Iverson, the designer of APL, was also a designer of J.

I need to stare at some FORTRAN for a while to clear my head.

Update (9 March 2011):  It's official -- I'm stupid. On my Windows box, I've been using Notepad++ for some time now (not so much for programming as for general editing of text files).  As it turns out, Notepad++ does syntax highlighting for a variety of languages, including both R and Java, and can export to an HTML file.

Fine, but I do most of my work on Linux Mint these days, and Notepad++ is a Windows-only program.  It's based on Scintilla, though, as is SciTE, which I use for similar purposes on my Mint PC and laptop.  (SciTE is also available for Windows, but I'm already using NP++ and, as we say here, if it ain't broke, don't fix it.)  SciTE does Java highlighting out of the box, and with a small tweak, it does R syntax highlighting.  It also exports to HTML.  (The tweak: run SciTE via sudo, open the global options file, scroll down near the bottom and uncomment "import r", then save.)

So I'm set for highlighting with tools I'm already using.

Update (13 March 2011):  I discovered a Linux command line utility named (shockingly) highlight.  It converts code files in a variety of languages (including AMPL, which I needed today, but sadly not including R) to a variety of output formats (notably HTML, but also LaTeX). The utility is available from the Ubuntu universe repository, so you can load it via Synaptic without having to add a new source.

Of course, life can't be quite that simple.  The executable is installed as /usr/bin/highlight.  I already have a program of the same name at /usr/local/bin/highlight.  I don't know where it came from or what it does, but it seems to expect input from stdin regardless of any command line switches.  Since it's in the local bin directory, it loads ahead of the one I want (grrr).  Not knowing whether it's part of a larger package, I'm reluctant to nuke it.  So I've added alias highlight=/usr/bin/highlight to my .bashrc file, which gives me a safe (I think) workaround.

Update (10 July 2012): A reader tipped me off that Java code in one of my posts did not show up in Internet Explorer.  When I looked at the blog in IE, I discovered one post where nothing appeared except the title and the footers!  It turns out that in some cases I had inline CSS styles, but when I switched to highlight I was using a <style> tag to provide the style details.  Although this worked fine in Firefox, Chrome and (for all I know) every other browser, it was enough to confuse IE. I couldn't find a way to generate inline styles with highlight, so I switched to Pygments, also available via Synaptic (and recommended by my namesake in the comments below). It provides both a command line program (which I use) and a Python library. The syntax I use looks like

pygmentize -f html -l java -o myfile.html -O noclasses,nobackground,cssstyles="background: #CCFFFF;" MyCode.java

where the first option specifies HTML output, the second specifies Java input, the third (lower case "o") specifies the output file, the fourth (upper case "O") specifies options, and the last argument is the source code file to highlight. The noclasses option is the key: it forces inline CSS. The other two options suppress the usual background color in the <div> tag that surrounds the code and replace it with a color of my choosing.

Tuesday, March 1, 2011

Math and Science Can Be Sexy

I just tripped over the following facts (one of which I already knew):
  • Hedy Lamarr, vintage Hollywood hottie, co-invented frequency hopping, used by cell phones today.
  • Danica McKellar, teenage TV star (and successful actress since), is co-author of the Chayes–McKellar–Winn theorem (as an undergraduate math major).  (This one I knew.)
  • Mayim Bialik, who as a teen had the title role in the TV series "Blossom" and now plays a neurobiologist on "The Big Bang Theory", actually has a Ph.D. in neurobiology.
Now we just need some sexy male mathematicians and scientists (besides me, that is) to balance the list.