A year ago, I saw a talk by Reuven Lerner at PyCon 2025 in which he ran some simple performance tests comparing two options in pandas: PyArrow and NumPy. Eager to dig into the results, I ran my own PyArrow vs NumPy benchmarks and wrote a blog article, published one year ago today, that documented the findings.

The dataset that I used for that analysis was interesting for fantasy baseball purposes but wasn't well suited to measuring the differences between the default CSV reader and PyArrow. And so one year later, I decided to fix some of my earlier errors and write some proper benchmarks.

Introducing Data Reports

About a month ago, I started working on some data-driven baseball reports, and today I am taking that idea a little further with a new hub for miscellaneous data analysis and reports. In this experiment, I am still going to use baseball data, but instead of trying to answer a question about baseball, I am going to dive into some performance benchmarking.

This time my goals are a little more complicated but I hope the results are more useful. Instead of scraping and running some Python magic to create a test dataset, I am going to start with a dataset I use fairly regularly - the Lahman Baseball Database. And this time, instead of conducting analysis on the baseball stats, I am going to treat the data in a few different ways and compare performance in each treatment across the pandas default CSV reader and PyArrow.

The First Test - Raw CSV Read Speed

Today, since it is the one-year anniversary of my last article on benchmarking NumPy vs PyArrow, I am going to introduce the first of four tests I performed to compare how PyArrow and the pandas default CSV reader perform in various situations. The first test is a brute-force one, but it gets the point across.

Basically, I took each CSV file in the full Lahman baseball database and opened it 100 times in pandas, using both the default reader and the PyArrow option. There are 27 files in the full database, containing more than 700,000 rows and more than 13,000,000 cells. And somehow with all that data, I still manage to finish at the very bottom of my league every single season.

With this amount of data and files of such different sizes, I felt I could build a benchmark that would help me choose when to stick with the default (which is practically muscle memory for me) and when to reach for PyArrow.
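
For readers who want to try something similar, the brute-force loop described above might look roughly like the sketch below. It is not the exact script behind the report: the function names time_reads and benchmark_folder are made up for illustration, and it assumes that "the PyArrow option" means passing engine="pyarrow" to pd.read_csv, that each file is read 100 times per engine, and that all the CSV files sit together in one folder.

```python
import time
from pathlib import Path

import pandas as pd


def time_reads(path, engine, repeats=100):
    """Return the total seconds spent reading `path` `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        # Assumption: the "PyArrow option" is the read_csv engine switch.
        pd.read_csv(path, engine=engine)
    return time.perf_counter() - start


def benchmark_folder(folder, engines=("c", "pyarrow"), repeats=100):
    """Time every CSV in `folder` once per engine.

    Returns a dict shaped like {file name: {engine: total seconds}}.
    """
    results = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        results[csv_path.name] = {
            engine: time_reads(csv_path, engine, repeats) for engine in engines
        }
    return results
```

Calling benchmark_folder on the folder holding the CSV files would return total seconds per engine for each file. Here "c" is the name pandas uses for its default C parser, and the "pyarrow" engine needs the pyarrow package installed.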

The Results - PyArrow mostly wins (with reservations)

When I ran the benchmark last year, I saw a small improvement when I used PyArrow, but that dataset was a little more analysis-focused. This time, I started with a real-world dataset that I work with often and found significantly more useful results.

PyArrow was faster for the full Lahman CSV workload, especially on larger and wider files. However, when you dig into individual files, you start to see that the question is a lot deeper than it seems. PyArrow was faster on 20 of 27 files, but the pandas default CSV reader still shines when files are smaller. This is to be expected when you consider how PyArrow works, and it brings us back to an important lesson in software development.

Fundamentally, you need to use the right tool for the job. In many jobs, PyArrow would deliver appreciable performance gains. However, when you are dealing with smaller files, the startup overhead inherent in PyArrow matters more than its bulk throughput, so the pandas default CSV reader actually comes out ahead.
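
One way to see this file-by-file split in your own numbers is to turn the timings into a speedup ratio per file. The sketch below assumes the timings are stored as a dictionary of total seconds per engine for each file name. That layout, the compare_engines name, and the engine labels are assumptions for illustration, and the numbers in the example are placeholders rather than results from the report.

```python
def compare_engines(timings, baseline="c", candidate="pyarrow"):
    """Compute per-file speedups of `candidate` over `baseline`.

    `timings` maps file name -> {engine label: total seconds}.
    Returns (rows, wins): rows is a list of (file name, speedup) sorted
    from slowest to fastest candidate, and wins counts the files where
    the candidate was strictly faster (speedup > 1).
    """
    rows = []
    for name, seconds in timings.items():
        speedup = seconds[baseline] / seconds[candidate]
        rows.append((name, speedup))
    rows.sort(key=lambda row: row[1])
    wins = sum(1 for _, speedup in rows if speedup > 1)
    return rows, wins


# Placeholder timings (NOT measured values), just to show the shape.
example = {
    "small.csv": {"c": 0.5, "pyarrow": 1.0},
    "large.csv": {"c": 4.0, "pyarrow": 1.0},
}
rows, wins = compare_engines(example)
for name, speedup in rows:
    print(f"{name}: {speedup:.2f}x")
print(f"candidate faster on {wins} of {len(rows)} files")
```

A ratio below 1 means the default reader won for that file, so sorting by the ratio puts those files at the top of the list.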

You can see the full data (and even download the data.json file the report is based on) at the first of four reports on NumPy versus PyArrow performance. The other three reports are available but I will be introducing them with individual articles throughout the week.

Conclusion

All in all, I am glad that I (finally) got around to digging into this again. PyArrow vs NumPy is an interesting area to dive into, and when you work with really massive datasets, knowing the limitations of each is a great way to get the most out of pandas. But the important lesson is that I need to do a lot more digging and run this same test on more datasets and across more operating systems. While my conclusion is that you really need to use the best tool for the job, I don't have enough data yet to put together a full flow chart of which tool is best for which job.

And so over the next few days I will be writing more analysis on the NumPy vs PyArrow benchmarking report, and after that I will be lining up some different datasets so that I can learn more about the full performance capabilities of both tools.