A few days ago, I launched a new data analysis section, mostly as a place to put interesting things that I find with data. Since it had been exactly a year since I first looked at PyArrow versus NumPy performance, I decided to run four different tests against the Lahman Baseball Database. The Lahman Baseball Database is an interesting data source for me because my fantasy league uses it extensively, so every November I'm slicing and dicing it. And since I know it so well, it seemed like a good dataset to run through some benchmarking.
In reality, it is a great dataset, though one test exposed a shortcoming in the shape of the CSVs. When you assess PyArrow versus pandas default CSV loading, the key question always comes down to throughput. And so the size of the files that you open, and the shape of the data, truly matter.
The second test - Small, Medium, and Large CSV File Performance
The Lahman Baseball Database is composed of 27 CSV files. These files contain baseball data dating back to 1871 and have been meticulously compiled by volunteers over many years. The Lahman Baseball Database is an absolutely amazing achievement and I am grateful to all the great minds that have worked on it. As a small baseball-mad child growing up in the 1980s, I dreamed of a resource like this. And now, thanks to others, I get to work with it whenever and however I want.
In this case, I used baseball data to add more evidence to a different question - PyArrow versus NumPy performance. The first test I did in 2025 used what I term 'fantasy optimized data'. While it showed that PyArrow was faster across the workload, that dataset was strongly biased against PyArrow because of its relatively small size. For the second round of testing, I used the entire Lahman Baseball Database.
And to look at the question differently, I wrote four different sub-tests on the data. The first sub-test looked at raw CSV read speed across the Lahman tables. The second sub-test, which I am introducing today, looked at how the size of the files you read impacts the performance of pandas default CSV loading and PyArrow.
To perform the test, I created three buckets:
- small-file bucket: files under 1 MB. The Lahman Baseball Database has 20.
- medium-file bucket: files from 1 MB up to 10 MB. The Lahman Baseball Database has 7.
- large-file bucket: files 10 MB or larger. The Lahman Baseball Database has 0.
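The bucketing above can be sketched in a few lines. This is a minimal illustration, not the script used for the test: the function names are hypothetical, and it assumes 1 MB means 1,048,576 bytes and that the CSVs all sit in one directory.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes; the post does not say which definition it used


def bucket_for(size_bytes: int) -> str:
    """Map a file size to the small / medium / large bucket described above."""
    if size_bytes < 1 * MB:
        return "small"   # under 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # 1 MB up to (but not including) 10 MB
    return "large"       # 10 MB or larger


def bucket_csv_files(directory):
    """Group every *.csv file in `directory` by size bucket."""
    buckets = {"small": [], "medium": [], "large": []}
    for path in sorted(Path(directory).glob("*.csv")):
        buckets[bucket_for(path.stat().st_size)].append(path)
    return buckets
```

Calling `bucket_csv_files` on the folder holding the Lahman CSVs and printing the length of each list is a quick way to spot an empty bucket before running any timings.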
After two warmup runs, the script loaded each file 100 times. I benchmarked read time only.
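A timing loop along those lines might look like the sketch below. It is an approximation under assumptions, not the original script: `time_reader`, `summarize`, and `pandas_readers` are hypothetical names, and it assumes "PyArrow" means pandas' `read_csv` with `engine="pyarrow"` (calling `pyarrow.csv.read_csv` directly would be another reasonable reading).

```python
import statistics
import time


def time_reader(read_fn, path, warmups=2, repeats=100):
    """Return per-run read times in seconds; warmup runs are executed but not recorded."""
    for _ in range(warmups):
        read_fn(path)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)  # only the read itself sits inside the timed region
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Collapse raw timings (seconds) into a few millisecond statistics."""
    return {
        "median_ms": statistics.median(timings) * 1000,
        "min_ms": min(timings) * 1000,
        "max_ms": max(timings) * 1000,
    }


def pandas_readers():
    """Build the two loaders being compared (needs pandas and pyarrow installed)."""
    import pandas as pd  # imported lazily so the timing helpers have no hard dependency

    return {
        "pandas default": lambda p: pd.read_csv(p),
        "pyarrow": lambda p: pd.read_csv(p, engine="pyarrow"),  # assumption: pandas' pyarrow engine
    }
```

Comparing medians rather than means is one design choice among several; with 100 repeats per file, the median is less sensitive to a stray slow read caused by something else on the machine.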
The results - PyArrow wins but with an asterisk
One of the most famous asterisks in the world was applied to Roger Maris' 61 home run season in 1961. Baseball purists argued that he played more games than Babe Ruth so the record wasn't pure. Those same baseball purists would have lit themselves on fire and danced the hokey pokey had they seen the steroid wars of the late 1990s, but I digress.
In this case, PyArrow did win overall. But there are some issues, so while I can claim there is a useful signal here, I can't claim there is a universal rule. The issues include:
- the medium-file bucket was 100% in PyArrow's favour.
- the small-file bucket was mixed - 7 of the 20 files were actually faster with pandas default.
- we had zero files in the large-file bucket.
When you sort the data by file size, you see that the default wins on all the smallest files. But the file sizes in this dataset have a major gap in them. So while the default 'wins' on a 29.1 KB read, PyArrow was already winning by the time the files reached 66.7 KB, and the data can't show where in between the crossover happens.
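Sorting by size and looking for the crossover can be sketched as follows. The `FileResult` record and both helper names are hypothetical, and the inputs are whatever per-file read times your own benchmark produced; no results from this post are encoded here.

```python
from typing import NamedTuple


class FileResult(NamedTuple):
    name: str
    size_bytes: int
    default_ms: float  # read time with pandas default
    pyarrow_ms: float  # read time with PyArrow


def winner(result):
    """Name the faster loader for one file; a tie is credited to pandas default."""
    return "pyarrow" if result.pyarrow_ms < result.default_ms else "pandas default"


def rank_by_size(results):
    """List (name, size, winner) tuples from smallest file to largest."""
    ordered = sorted(results, key=lambda r: r.size_bytes)
    return [(r.name, r.size_bytes, winner(r)) for r in ordered]


def crossover_bounds(results):
    """Return (largest size where default won, smallest size where PyArrow won).

    Either value is None when that loader never won. A wide distance between
    the two numbers means the dataset has no files in the range where the
    crossover would actually occur.
    """
    default_sizes = [r.size_bytes for r in results if winner(r) == "pandas default"]
    pyarrow_sizes = [r.size_bytes for r in results if winner(r) == "pyarrow"]
    largest_default = max(default_sizes) if default_sizes else None
    smallest_pyarrow = min(pyarrow_sizes) if pyarrow_sizes else None
    return largest_default, smallest_pyarrow
```

If the two bounds overlap (PyArrow wins on a file smaller than one where the default wins), the picture is mixed rather than a clean crossover, which is worth reporting as such.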
Conclusions and future tests
Asterisks aren't uncommon in software development. In fact, the older and more experienced I get, the more asterisks I see. The only real conclusion I can draw is that PyArrow was faster on this dataset, but that the shape of the data matters a lot. Architecturally, when you dig into the differences between PyArrow and NumPy, that makes a lot of sense - it's what you expect. And so I'm glad to have replicated what should logically happen.
One other important conclusion is that while PyArrow 'wins', the absolute time difference is relatively minor. In some cases, saving 10 to 20 ms a file really matters; in other cases, that's an optimization that didn't need to be made. And while premature optimization is a big problem, unneeded optimizations turn into the kind of technical debt that can make future enhancements tricky.
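To put that 10 to 20 ms in perspective, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that every one of the 27 files saves the same amount, which the mixed small-file results say is not actually the case; `total_saving_ms` is a hypothetical helper.

```python
def total_saving_ms(per_file_saving_ms, n_files, n_loads=1):
    """Total time saved, assuming a uniform per-file saving (a simplification)."""
    return per_file_saving_ms * n_files * n_loads


# 27 CSV files in the database, each loaded once
for saving in (10, 20):
    print(f"{saving} ms/file x 27 files = {total_saving_ms(saving, 27)} ms per full load")
```

Under that assumption, a single pass over the database saves roughly a quarter to half a second. That only adds up to something meaningful if the load is repeated often.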
The most important conclusion is that Lahman gets me closer, but it still does not answer the whole data-shape question. And so, I am already planning my third PyArrow versus NumPy benchmarking project. This time, I'm building datasets from New York City taxi data, which will let me build larger datasets with more files, a greater range of file sizes, and the kind of chaotic data that I find fun.