A couple of weeks ago, I wrote a follow-up article to some work I did on NumPy versus PyArrow performance a year ago. The question interests me because I help manage a fantasy baseball league and also have a team in that league. I spend so much time dissecting data and preparing accessible reports for my league mates that speeding things up even slightly is a major quality of life enhancement for me. If you play fantasy baseball in a private league, you may be familiar with the long, long waits for the end-of-season data you'll need to make next year's draft picks, so you'll understand why my performance enhancements would be a competitive advantage for our league.
It is... but being fully accessible is a bigger competitive advantage, and our waiting list is now so long that we're no longer adding managers to it. In other words, this is not a promotion! :)
While I was doing the Lahman's Baseball Database stage of the NumPy versus PyArrow performance testing, I didn't find much of a difference between the two tools when I measured it in terms of absolute time. But the files and dataset were relatively small (which is a really amazing feat considering how darned much data the Lahman's Baseball Database contains). Those small timings showed me that I had to design a better test with bigger files.
And so I started looking for data. It's weird because while there is a metric crap load of data available online, when you start looking for 15 to 20 GB of CSV data the choices get a little more limited. Even more so when you want a dataset with small, medium-sized and large individual files.
NYC Yellow Taxi Data to the Rescue
Where do you go when you're looking to construct a dataset with a very specific shape? The New York City Taxi and Limousine Commission is an amazing place. Its Yellow Taxi data rescued me in this case.
It also inspired me to start thinking about taxi data from a (self-serving) baseball perspective. And so I'm going to start working on another data report inspired by the two datasets that I've used for NumPy versus PyArrow performance testing. This one is going to look at cab data in relation to Yankees and Mets performance. In the words of fancy data scientists, it would be neataroo to find one spinoff and maybe see one small part of the economy trickle down.
And I have one Taxi Driver joke I've been dying to make, so it's a win-win for everyone.
PyArrow wins
To learn about the three tests and the methodologies, please refer to the main page on performance testing the pandas default versus the PyArrow CSV engine on NYC TLC Yellow Taxi data. The data was convincing: on this large 18.2 GB dataset, PyArrow was faster on all 48 files that we opened.
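If you want to try a comparison like this on your own files, here is a minimal sketch of a repeated-read timing harness. It is not the script behind the results above. The function names, the repeat count of 5 and the file name in the usage note are all assumptions made for illustration. The only pandas feature it relies on is `read_csv` with `engine="pyarrow"`, which requires the pyarrow package to be installed.

```python
import statistics
import time


def median_read_time(read_fn, path, repeats=5):
    """Call read_fn(path) `repeats` times; return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_engines(path, repeats=5):
    """Time the pandas default CSV engine against the PyArrow engine on one file."""
    # Imported here so the timing helper above works even without pandas.
    import pandas as pd

    default_s = median_read_time(lambda p: pd.read_csv(p), path, repeats)
    pyarrow_s = median_read_time(
        lambda p: pd.read_csv(p, engine="pyarrow"), path, repeats
    )
    return {"default": default_s, "pyarrow": pyarrow_s}
```

Calling `compare_engines("some_taxi_file.csv")` (a hypothetical file name) returns the two medians in seconds. Taking the median of several reads rather than a single read keeps one slow outlier, such as a cold disk cache, from skewing the comparison.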
In this case, the actual time involved was significant. Across the completed repeated-read benchmark, the summed median read times showed an overall PyArrow speedup of more than 8x. Cigarette smokers and/or people who vape will certainly find reason to stick with the pandas default because on my equipment, the difference was really a smoke break versus sending an email.
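For anyone wondering how a "summed median" speedup is worked out, here is a small sketch of the arithmetic as I would reconstruct it: add up the per-file median times for each engine, then divide one total by the other. The function name is an assumption, and the numbers in the example are placeholders, not measurements from the benchmark.

```python
def overall_speedup(medians):
    """medians maps file name -> (default_seconds, pyarrow_seconds).

    Returns the summed default time divided by the summed PyArrow time.
    """
    default_total = sum(default_s for default_s, _ in medians.values())
    pyarrow_total = sum(pyarrow_s for _, pyarrow_s in medians.values())
    return default_total / pyarrow_total


# Placeholder values for illustration only, not real benchmark results.
example = {
    "small.csv": (1.0, 0.5),
    "medium.csv": (6.0, 1.5),
    "large.csv": (9.0, 2.0),
}
print(overall_speedup(example))  # 16.0 / 4.0 = 4.0
```

Summing before dividing means the largest files dominate the result, which is different from averaging the per-file ratios. For a "how long will I actually wait" question, the summed version is the one that matches the wall clock.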
The NumPy vs PyArrow choice: Size Matters
However, while PyArrow was comfortably and noticeably faster on this dataset, it doesn't mean that PyArrow is always the best choice. As I concluded in the last set of research, the pandas default versus PyArrow is still a choice. For the smaller Lahman-style files I use most often in league work, the pandas default was frequently as fast as or faster than PyArrow in absolute terms, and the differences were too small to justify optimizing around.
That was always to be expected, and while I knew there was a cutoff somewhere on my machine, I really wasn't sure where that was. Today, while I'm still not exactly sure, I can look back at one of the really interesting data points from the very first test that I ran on raw CSV read speeds.
If you sort that first dataset by size, you will see that PyArrow starts to show improvements at 66.7 KB. All 8 files smaller than that were faster with the pandas default. The largest file where the pandas default won was only 29.1 KB, leaving a large unexplored gap between that result and the first PyArrow win at 66.7 KB. With the baseball dataset that I use the most, the biggest single-file speedup was a whopping 140 milliseconds. :)
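The sort-by-size check described above can be sketched in a few lines. This is an illustration under assumptions, not my original analysis code: the function name and tuple layout are made up for the example. The two file sizes are the ones quoted above, but the timings are placeholders chosen only to show which engine wins, not real measurements.

```python
def size_crossover(results):
    """results: list of (size_kb, default_seconds, pyarrow_seconds).

    Returns (largest size where the default won, smallest size where PyArrow won).
    A tie counts as a default win. Either value is None if that engine never won.
    """
    default_wins = [size for size, d, p in results if d <= p]
    pyarrow_wins = [size for size, d, p in results if p < d]
    largest_default_win = max(default_wins) if default_wins else None
    smallest_pyarrow_win = min(pyarrow_wins) if pyarrow_wins else None
    return largest_default_win, smallest_pyarrow_win


# Sizes in KB; the timings are placeholders for illustration only.
example_results = [
    (12.0, 0.004, 0.009),
    (29.1, 0.006, 0.010),
    (66.7, 0.012, 0.011),
    (900.0, 0.150, 0.040),
]
print(size_crossover(example_results))  # (29.1, 66.7)
```

The gap between the two returned sizes is the unexplored range: the real cutoff sits somewhere inside it, and only files in that size range could narrow it down.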
At that point, 140 milliseconds doesn't exist when I look at total timing for the types of scripts I run to prepare the data for our managers. It's not even a fraction worth considering and makes such a small blip in overall runtimes that it's the last thing I would optimize. Hence, as I said earlier, this isn't really a choice I would make for fantasy baseball data because it really doesn't matter.
With massive datasets, however, it matters a lot.
Concluding remarks and future research
Right now, I feel like the NumPy versus PyArrow question has enough data to mostly be left behind. I have a few more questions that this data uncovered that I will start looking into separately. For example, I am very interested in Parquet now and plan to run some benchmarking on that format. And as mentioned earlier, I want to look at how the Yankees' and Mets' performance trickles down into Yellow Taxi trip data.
There is no possible way that a Yankees fan armed with data indicating that his favourite team's performance has an economic impact could go wrong.