OR in an OB World: January 2014

Tuesday, January 28, 2014

Automated Volume Reset in Linux Mint/Ubuntu

I've been watching course videos lately, which requires me to crank up the volume on the speakers of my Linux Mint PC (using the volume applet in the bottom panel of the display). Being absent-minded (a job requirement when I was a professor), I often forget to reset the volume after watching a video. I'll suspend or hibernate the machine, go off to do something else, come back and resume/thaw, and at some point get blasted out of my chair by an unexpectedly loud system sound. Today I decided it was time to fix this by automatically resetting the volume any time I resume from suspend or "thaw" from hibernation.

I've previously described how to create a script to reset the volume. It turns out that running it at resume/thaw is a bit tricky. There are a number of forum posts on how to run a script at resume/thaw, but the standard solution did not work in this case. I believe the problem was that PulseAudio (the audio system) was running in my user session, whereas the script (being run as the root user) was looking for it in the root user's session and not finding it. The solution involved both baking in a short delay (to ensure that PulseAudio had restarted before trying to control it) and running the volume reset script as me, rather than as root. Adding to the confusion: the usual way to run as someone else, in this case sudo -u paul ..., did not work (and I'm not sure why, but I'm pretty sure it's tied to whose environment is in use).

Cutting to the chase scene, here is the solution. The script to reset the volume (see the previous post) is /home/paul/Scripts/resetVolume.sh. I created a wakeup script by doing the following in a terminal:

  cd /etc/pm/sleep.d
  sudo touch 00_reset-volume.sh
  sudo chmod +x 00_reset-volume.sh
  sudo gedit 00_reset-volume.sh

There is nothing sacred about the name (00_reset-volume.sh) I used, but the "00" prefix should help ensure that it runs late in the resume sequence (giving more time for my user session to be reloaded, including the PulseAudio daemon).
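
The reason a low number runs late: pm-utils runs the hooks in sorted order when going to sleep and in reverse sorted order when waking up. A minimal sketch of that ordering follows; the two hook names other than 00_reset-volume.sh are made up for illustration and are not files on my system.

```shell
# Hypothetical hook list; only 00_reset-volume.sh comes from this post.
hooks="50_example-hook
00_reset-volume.sh
99_example-hook"

# Order used for suspend/hibernate: plain lexical sort.
suspend_order=$(printf '%s\n' "$hooks" | sort)

# Order used for resume/thaw: the same list, reversed.
resume_order=$(printf '%s\n' "$hooks" | sort -r)

echo "suspend/hibernate order:"
echo "$suspend_order"
echo "resume/thaw order:"
echo "$resume_order"
```

In the resume/thaw listing, 00_reset-volume.sh comes out last.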

The contents of 00_reset-volume.sh are as follows:

#!/bin/bash
case "$1" in
  thaw|resume)
    # wait 10 sec. for PulseAudio to restart
    sleep 10 && su -c - paul /home/paul/Scripts/resetVolume.sh
    ;;
  *)
    ;;
esac
exit $?

The 10 second delay before running my volume reset script may not be necessary, but it works, and if it ain't broke, don't fix it. su -c - paul ... runs the script as me rather than as root, apparently using my environment.
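
For anyone who would rather not hard-code the delay, one alternative is to poll until some check succeeds and give up after a fixed number of tries. The sketch below is only that: wait_for is a hypothetical helper name, it is not part of the hook above, and it is demonstrated here with a trivial command rather than a real PulseAudio check.

```shell
# wait_for MAX_TRIES COMMAND [ARGS...]
# Run COMMAND up to MAX_TRIES times, one second apart, until it succeeds.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  wf_tries=$1
  shift
  wf_i=0
  while [ "$wf_i" -lt "$wf_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wf_i=$((wf_i + 1))
    sleep 1
  done
  return 1
}

# Trivial demonstration: 'true' succeeds on the first attempt.
wait_for 3 true && echo "ready"
```

In the hook, the fixed sleep 10 could in principle give way to wait_for 10 followed by a command that reports whether PulseAudio is answering in the user's session. Which command does that reliably will depend on the setup, and none is tested here.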

Thursday, January 23, 2014

Histogram Abuse

Consider the following Trellis display of histograms of a response variable ($Z$), conditioned on two factors $(X, Y)$ that can each be low, medium or high:

The combined sample size for the nine plots is 10,000 observations. Would you agree with the following assessments?

* $Z$ seems to be normally distributed for medium values of either $X$ or $Y$ (or both).
* $Z$ appears to be negatively skewed when both $X$ and $Y$ are low (and perhaps when $X$ is low and $Y$ is high), and perhaps positively skewed when both $X$ and $Y$ are high.
* $Z$ appears to be leptokurtic when both $X$ and $Y$ are low.

What motivated this post is a video I recently watched, in which the presenter generated a matrix of conditional histograms similar to this one. In fairness to the presenter (whom I shall not name), the purpose of the video was to show how to code plots like this one, not to do actual statistical analysis. In presenting the end result, he made some off-hand remarks about distributional differences of the response variable across various values of the covariates. Those remarks caused my brain to itch.

So, are my bulleted statements correct? Only the first one, and that only partially. (The reference to $Y$ is spurious.) In actuality, $Z$ is conditionally independent of $Y$ given $X$ (any association between $Z$ and $Y$ comes only through $X$), and its conditional distribution given any value of $X$ is normal (mesokurtic, no skew). More precisely, $Z=2X+U$ where $U\sim N(0,1)$ is independent of both $X$ and $Y$. The data is simulated, so I can say this with some assurance.

If you bought into any of the bulleted statements (including the "either-or" in the first one), where did you go wrong? It's tempting to blame the software's choice of endpoints for the bars in the histograms, since it's well known that changing the endpoints in a histogram can significantly alter its appearance, but that may be a bit facile. I used the lattice package in R to generate the plot (and those below), and I imagine some effort went into the coding of the default endpoint choices in the histogram() function. Here are other things to ponder.

Drawing something does not make it real.

The presence of $Y$ in the plots tempts one to assume that it has an effect of its own on $Z$, even without sufficient context to justify that assumption.

Sample size matters.

The first thing that made my brain itch, when I watched the aforementioned video, was that I did not know how the covariates related to each other. That may affect the histograms, particularly in terms of sample size.

It is common knowledge that the accuracy of histograms (and pretty much everything else statistical) improves when sample size increases. The kicker here is that we should ask about the sample sizes of the individual plots. It turns out that $X$ and $Y$ are fairly strongly correlated (because I made them so). We have 10,000 observations stretched across nine histograms. An average of over 1,000 observations per histogram should be enough to get reasonable accuracy, but we do not actually have 1,000 data points in every histogram. The scatter plot of the original covariates (before turning them into low/medium/high factors) is instructive.

The horizontal and vertical lines are the cuts for classifying $X$ and $Y$ as low, medium or high. You can see the strong negative correlation between $X$ and $Y$ in the plot. If you look at the lower left and upper right rectangles, you can also see that the samples for the corresponding two histograms of $Z$ ($X$ and $Y$ both low or both high) are pretty thin, which suggests we should not trust those histograms so much.

Factors often hide variation (in arbitrary ways).

In some cases, factors represent something inherently discrete (such as gender), but in other cases they are used to categorize (pool) values of a more granular variable. In this case, I cut the original numeric covariates into factors using the intervals $(-\infty, -1]$, $(-1, +1]$ and $(+1, +\infty)$. Since both covariates are standard normal, that should put approximately 2/3 of the observations of each in the "medium" category, with approximately 1/6 each in "low" and "high".
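
As a rough check on those fractions, and on what they imply for the nine panels, here is the arithmetic, with $\Phi$ denoting the standard normal distribution function (notation not used elsewhere in this post). The second line assumes, purely for illustration, that $X$ and $Y$ are independent, which they are not in this sample.

$$P(-1 < X \le 1) = \Phi(1) - \Phi(-1) \approx 0.6827, \qquad P(X \le -1) = P(X > 1) = \Phi(-1) \approx 0.1587$$

$$10{,}000 \times 0.6827^2 \approx 4{,}661 \ \text{(medium, medium)}, \qquad 10{,}000 \times 0.1587^2 \approx 252 \ \text{(low, low) or (high, high)}$$

So even with no correlation at all, the corner panels would hold only about 250 observations each, and a negative correlation between $X$ and $Y$ thins the (low, low) and (high, high) corners further.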

To generate the plot matrix, I must convert $X$ and $Y$ to factors; I cannot produce a histogram for each individual value of $X$ (let alone each combination of $X$ and $Y$). Even if my sample were large enough that each histogram had sufficiently many points, the reader would drown in that many plots. Using factors, however, means implicitly treating $X=-0.7$ and $X=+0.3$ as the same ("medium") when it comes to $Z$. This leads to the next issue (my second "brain itch" when viewing the video).

Mixing distributions blurs things.

As noted above, in the original lattice of histograms I am treating the conditional distributions of $Z$ for all $(X, Y)$ values in a particular category as identical, which they are not. When you mix observations from different distributions, even if they come from the same family (normal in this case) and have only modestly different parameterizations, you end up with a histogram of a steaming mess. Here the conditional distribution of $Z$ given $X=x$ is $N(2x, 1)$, so means are close if $X$ values are close and standard deviations are constant. Nonetheless, I have no particular reason to expect a sample of $Z$ when $X$ varies between $-1$ and $+1$ to be normal, let alone a sample where $X$ varies between $-\infty$ and $-1$.
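
To put a number on the blurring, consider only the "medium" band of $X$ and ignore $Y$ (a simplification, since each panel also conditions on $Y$). With $\phi$ and $\Phi$ denoting the standard normal density and distribution function, the variance of a standard normal truncated to $(-1, +1]$, and the resulting variance of $Z$, work out to

$$\operatorname{Var}(X \mid -1 < X \le 1) = 1 - \frac{2\phi(1)}{\Phi(1) - \Phi(-1)} \approx 1 - \frac{0.4839}{0.6827} \approx 0.291$$

$$\operatorname{Var}(Z \mid -1 < X \le 1) = 4\operatorname{Var}(X \mid -1 < X \le 1) + \operatorname{Var}(U) \approx 4(0.291) + 1 \approx 2.16$$

The pooled "medium" sample therefore has roughly twice the variance of any single conditional distribution $N(2x, 1)$, and it is the sum of a truncated normal and a normal rather than a normal itself.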

To illustrate this point, let me shift gears a bit. Here is a matrix of histograms from samples of nine variables $X_1,\dots,X_9$ where $X_k \sim N(k, 1)$.

Each sample has size 100. I have superimposed the normal density function on each plot. I would say that every one of the subsamples looks plausibly normal.

Now suppose that there is actually a single variable $X$ whose distribution depends on which of the nine values of $k$ is chosen. In order to discuss the marginal distribution of $X$, I'll assume that an arbitrary observation comes from a random choice of $k$. (Trying to define "marginal distribution" when the distribution is conditioned on something deterministic is a misadventure for another day.) As an example, $k$ might represent one of nine possible categories for undergraduate major, and $X$ might be average annual salary over an individual's working career.

Let's see what happens when I dump the 900 observations into a single sample and produce a histogram.

The wobbly black curve is a sample estimate of the density function, and the red curve is the normal density with mean and standard deviation equaling those of the sample (i.e., $\mu=\bar{X}, \sigma=S_X$). To me, the sample distribution looks reasonably symmetric but too platykurtic (wide, flat) to be normal ... which is fine, because I don't think the marginal distribution of $X$ is likely to be normal. It will be some convolution of the conditional distribution of $X$ given $k$ (normal, with mean $k$) with the probability distribution for $k$ ... in other words, a steaming mess.
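
The platykurtic look is what the arithmetic predicts. Assume $k$ is equally likely to be any of $1,\dots,9$ (which matches pooling nine samples of equal size), and write $X = k + \varepsilon$ with $\varepsilon \sim N(0,1)$ independent of $k$; the symbol $\varepsilon$ is introduced here only for this calculation. Then

$$E[X] = 5, \qquad \operatorname{Var}(X) = \operatorname{Var}(k) + \operatorname{Var}(\varepsilon) = \tfrac{20}{3} + 1 = \tfrac{23}{3} \approx 7.67$$

$$E[(X-5)^4] = E[(k-5)^4] + 6\,E[(k-5)^2]\,E[\varepsilon^2] + E[\varepsilon^4] = \tfrac{236}{3} + 40 + 3 = \tfrac{365}{3} \approx 121.67$$

$$\text{kurtosis} = \frac{E[(X-5)^4]}{\operatorname{Var}(X)^2} = \frac{365/3}{(23/3)^2} = \frac{1095}{529} \approx 2.07 < 3$$

A normal distribution has kurtosis 3, so under these assumptions the pooled distribution really is flatter than any normal, whatever the sample size.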

When we pool observations of $X$ from different values of $k$, we are in effect mixing apples and oranges. This is what I meant by "blurring" things. The same thing went on in the original plot matrix, where we treated values of $Z$ coming from different distributions (different values of $X$ in a single category, such as "medium") as being from a common distribution.

So what is the takeaway here?

In Empirical Model Building and Response Surfaces (Box and Draper, 1987), the following quote appears, usually attributed to George E. P. Box: "Essentially, all models are wrong, but some are useful." I think that the same can be said about statistical graphs, but perhaps should be worded more strongly. Besides visual appeal, graphics often make the viewer feel that what they are seeing is credible (particularly if the images are well-organized and of good quality) and easy to understand. The seductive feeling that you are seeing something "obvious" makes them dangerous.

Anyone who is curious (or suspicious) is welcome to download my R code.

Sunday, January 5, 2014

NOOK for Android: Bugs or "Features"?

I bought a NOOK Color when it was new on the market, and use it to read both books and magazines. Lately, though, I'm more likely to read magazines on a 10" Android tablet using the NOOK reader application for Android. It works well enough, with one glaring exception: periodically it refuses to download new issues of magazines, claiming the disk is full (which it is not).

I'm not sure exactly what causes NOOK to think it has a storage limit, but given the existence of such a limit I can see why NOOK believes it is reaching said limit. When you are done with content on a NOOK device, you have the option to either delete or archive the content. According to the FAQ for the NOOK app, archiving content theoretically removes it from your device but keeps it available in the cloud, from which you can download it again if you want another look. Deleting content deletes it permanently; it does not remain available from the cloud. So archiving is generally the preferred route ... except that the part about removing it from your device is not entirely true.

After once again running out of space, I decided to go spelunking with a file manager. I found a folder (/Nook/Content/.Temp/Drp/) containing .epub folders for various archived issues. All told, a few hundred megabytes of storage were being consumed by issues that theoretically had been deleted from my tablet.

I did a web chat with a customer support person from Barnes and Noble, and at one point (s)he had me sign out of the NOOK application and then sign back in (which syncs the library). I pretty much never sign out of the app, having no particular reason to do so. Woo-hoo, all the archived issues were gone! Sadly, so was an issue I had downloaded but not yet read. There was no problem downloading it again, other than consumption of time and bandwidth.

I asked the support person to file a couple of bug reports. As far as I'm concerned, having to sign out and back in to free up space is dopey. I might add that I had previously synced the library (without signing out), and that did not delete the .epub folders for archived content (despite the NOOK app correctly listing the content as having been archived). I also think that losing active (not deleted, not archived) content when you sign out is a (separate) bug.

We'll see if those things get fixed, but in the meantime my workaround is to use the file manager to delete .epub folders with old dates whenever the NOOK app tells me I'm out of space.
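
For anyone with shell access to the tablet (a terminal app or similar, which is an assumption; a file manager works just as well), the same cleanup can be sketched with find. The function name list_old_epubs and the 30-day cutoff are hypothetical choices, and the folder path is the one given above, which may need a device-specific prefix. The sketch only lists candidates; it deletes nothing.

```shell
# list_old_epubs DIR DAYS
# Print .epub folders directly under DIR whose modification time is more
# than DAYS days old. Listing only; nothing is deleted.
list_old_epubs() {
  find "$1" -maxdepth 1 -type d -name '*.epub' -mtime +"$2" -print
}

# Path from the post; the 30-day cutoff is an arbitrary, hypothetical choice.
nook_dir=/Nook/Content/.Temp/Drp
if [ -d "$nook_dir" ]; then
  list_old_epubs "$nook_dir" 30
fi
```

Once the list looks right, the folders can be removed by hand, which keeps an unread issue from being swept up by accident.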