Saturday, October 28, 2023

The Trouble with Tibbles

Apologies to Star Trek fans for the title pun, but I couldn't resist. (If you're not a Trekkie, the reference is to the classic episode "The Trouble with Tribbles.")

I've been working on an interactive R application (using Shiny, although that's probably irrelevant here). There are places where the code needs to loop through a data frame, looking for consecutive rows where either three text fields match or two match and one does not. Matching rows are copied into new data frames. The data I'm testing the code on has a bit over 9,000 rows, and this process can take upwards of nine seconds -- not an eternity, but a bit annoying when you are sitting there waiting for it to hatch.
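For the curious, here is a minimal sketch of the kind of scan I mean. The column names (f1, f2, f3) and the toy data are invented for illustration; the real app uses its own fields and a bit more logic.

    # Toy data standing in for the real 9,000-row data frame.
    temp <- data.frame(f1 = c("a", "A", "b"),
                       f2 = c("x", "x", "x"),
                       f3 = c("q", "q", "r"))
    results <- list()
    for (n in seq_len(nrow(temp) - 1)) {
      df  <- temp[n, ]      # the single-row extraction discussed below
      nxt <- temp[n + 1, ]
      same <- c(tolower(df$f1) == tolower(nxt$f1),
                tolower(df$f2) == tolower(nxt$f2),
                tolower(df$f3) == tolower(nxt$f3))
      if (sum(same) >= 2) { # all three fields match, or two match and one doesn't
        results[[length(results) + 1]] <- rbind(df, nxt)
      }
    }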

I decided to use the profiler in RStudio to see where time was being eaten up. Almost all of the nine seconds was blamed on two steps. The biggest time suck was doing case-insensitive string comparisons on the three fields, which did not come as a big surprise. I went into the profiling process thinking the other big time suck would be adding a data frame to a growing list of data frames, but that was actually quite fast. To my surprise, the number two consumer of time was "df <- temp[n, ]", which grabs row n from data frame "temp" and turns it into a temporary data frame named "df". How could such a simple operation take so long?
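If you want to try this sort of diagnosis outside the IDE, RStudio's profiler is a front end to the profvis package, so something along these lines (toy data again, not my actual code) produces the same flame graph:

    library(profvis)  # the package behind RStudio's profiler
    # A larger toy data frame makes the cost of temp[n, ] visible.
    temp <- data.frame(f1 = sample(letters, 10000, replace = TRUE),
                       f2 = sample(letters, 10000, replace = TRUE),
                       f3 = sample(letters, 10000, replace = TRUE))
    profvis({
      for (n in seq_len(nrow(temp))) {
        df <- temp[n, ]  # shows up as a hot spot in the flame graph
      }
    })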

I had a hunch that turned out to be correct. Somewhere earlier in the code, my main data frame (from which "temp" was extracted) became a tibble. Tibbles are modernized, souped-up versions of data frames, with extra features but also occasional extra overhead/annoyances. One might call them the backbone of the Tidyverse. The Tidyverse has its adherents and its detractors, and I don't want to get into the middle of that. I'm mostly happy to work with the Tidyverse, but in this case using tibbles became a bit of a problem.
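If you want to see the overhead for yourself, a quick benchmark along these lines (my own illustration; exact timings will vary with machine and package versions) makes the point:

    library(tibble)
    library(microbenchmark)
    df <- data.frame(x = runif(9000), y = sample(letters, 9000, replace = TRUE))
    tb <- as_tibble(df)
    # Extract a single row from each; the tibble version is noticeably slower.
    microbenchmark(data.frame = df[4500, ],
                   tibble     = tb[4500, ],
                   times      = 1000)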

So I tweaked the line of code that creates "temp" to "temp <- ... %>% as.data.frame()", where ... was the existing code. Lo and behold, the time spent on "df <- temp[n, ]" dropped to a little over half a second. Somewhat surprisingly, the time spent on the string comparisons dropped even more. So the overall processing time fell from around 9 seconds to under 1.3 seconds, speeding up the code by a factor of almost 7.
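In case the shape of the fix is unclear, here it is with dplyr's built-in "starwars" tibble standing in for my actual pipeline:

    library(dplyr)
    # Appending as.data.frame() strips the tibble class, so the
    # downstream row-by-row loop works on a plain data frame.
    temp <- starwars %>%
      filter(!is.na(height)) %>%
      as.data.frame()
    class(temp)  # "data.frame" rather than c("tbl_df", "tbl", "data.frame")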

I'll need to keep this in mind if I bump into any other places where my code seems unusually slow.

