About this test
This report compares two ways of loading Lahman baseball CSV files with pandas:
- pandas.read_csv(path)
- pandas.read_csv(path, engine="pyarrow")
The title keeps the original “NumPy versus PyArrow” framing because this test is a follow-up to an earlier article. The method labels are more precise. This page benchmarks pandas' default CSV reader against pandas' PyArrow CSV engine. It does not benchmark NumPy arrays directly.
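As a minimal sketch of the two calls being compared, the snippet below reads one small file with each method. The file contents are a made-up stand-in (not real Lahman rows), and the try/except is only there because pyarrow is an optional dependency of pandas; this is an illustration, not the benchmark script itself.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in data, NOT real Lahman rows; only the two calls matter here.
csv_text = "id,year,value\nexample01,2000,1\nexample02,2001,2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(csv_text)

    # Method 1: pandas' default CSV reader.
    default_df = pd.read_csv(path)

    # Method 2: pandas' PyArrow CSV engine (needs the optional pyarrow package).
    try:
        arrow_df = pd.read_csv(path, engine="pyarrow")
    except ImportError:
        arrow_df = None  # pyarrow is not installed in this environment

print(default_df.shape)
if arrow_df is not None:
    print(arrow_df.shape)
```

Both calls return a pandas DataFrame, which is why the comparison is about the reader and not about the object you work with afterwards.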
The test uses 27 Lahman CSV files. Together, those files contain more than 700,000 rows and more than 13 million cells.
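One way totals like these could be produced is sketched below. It assumes that "cells" means rows times columns for each file and that the header row is not counted as data; the report does not spell out either detail, so treat both as assumptions. The helper names `profile_csv` and `profile_folder` are hypothetical, and the demo files are made-up stand-ins.

```python
import csv
import os
import tempfile


def profile_csv(path):
    """Return (rows, columns, cells) for one CSV file, excluding the header row."""
    # errors="replace" keeps counting robust if a file is not valid UTF-8.
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    columns = len(header)
    # Assumption: cells = rows * columns.
    return rows, columns, rows * columns


def profile_folder(folder):
    """Sum rows and cells over every .csv file in a folder."""
    total_rows = 0
    total_cells = 0
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".csv"):
            rows, _, cells = profile_csv(os.path.join(folder, name))
            total_rows += rows
            total_cells += cells
    return total_rows, total_cells


# Demo with two made-up files: 2 rows x 3 columns and 3 rows x 2 columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "wide.csv"), "w", encoding="utf-8") as f:
        f.write("a,b,c\n1,2,3\n4,5,6\n")
    with open(os.path.join(tmp_dir, "long.csv"), "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n3,4\n5,6\n")
    demo_rows, demo_cells = profile_folder(tmp_dir)

print(demo_rows, demo_cells)
```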
Why this is interesting
The earlier version of this test used scraped and simulated fantasy baseball data. That was useful and gave me some numbers to work with, but it was not the same as working with an existing public data collection.
The Lahman database is a better follow-up because it contains many differently shaped CSV files. Some are tiny metadata files. Some are larger statistical tables. Some are wider. Some are longer. That makes the test more realistic than one synthetic file repeated at different sizes to approximate real-world data. But real data always beats an approximation of it, so a year after my original benchmarking, I'm publishing results from a real dataset that I work with all the time.
The result is not just “which reader is faster?” It is also:
Where does PyArrow help, where does pandas default still hold up, and how much does file shape matter?
How to read the table
The table contains one row per CSV file.
The main fields are:
- File: the Lahman CSV file name.
- Size: the CSV file size.
- Rows: the number of rows loaded during profiling.
- Columns: the number of columns loaded during profiling.
- pandas avg: average read time using pandas.read_csv(path).
- PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that file.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
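The Speedup and Winner columns follow directly from the two averages. A small sketch, using made-up timings instead of values from the table; the function name `speedup_and_winner` and the "tie" label for an exact 1.00x are assumptions for illustration:

```python
def speedup_and_winner(pandas_avg, pyarrow_avg):
    """Return (speedup, winner) from two average read times in the same unit."""
    # Speedup: pandas default average time divided by PyArrow average time.
    speedup = pandas_avg / pyarrow_avg
    if speedup > 1.0:
        winner = "PyArrow"
    elif speedup < 1.0:
        winner = "pandas default"
    else:
        winner = "tie"  # assumption: the report does not say how ties are labeled
    return speedup, winner


# Hypothetical timings in milliseconds, not taken from the results table.
larger_file = speedup_and_winner(4.0, 2.0)
tiny_file = speedup_and_winner(1.0, 2.0)

print(f"{larger_file[0]:.2f}x {larger_file[1]}")
print(f"{tiny_file[0]:.2f}x {tiny_file[1]}")
```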
Methodology
Each CSV file was read 100 times with each reader after two warmup runs.
The benchmark records CSV read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
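The procedure above can be sketched as a small timing loop. This is an illustration of the described method (two untimed warmups, then 100 timed reads, averaged), not the actual benchmark script; the function name `average_read_time` and the stand-in reader are hypothetical.

```python
import statistics
import time


def average_read_time(read_fn, path, repeats=100, warmups=2):
    """Average wall-clock time of read_fn(path) over `repeats` timed runs."""
    # Warmup runs are executed but not timed.
    for _ in range(warmups):
        read_fn(path)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)  # only the read itself sits inside the timed region
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-in reader so the sketch runs without any CSV file on disk.
call_count = 0


def fake_reader(path):
    global call_count
    call_count += 1
    return path


avg_seconds = average_read_time(fake_reader, "example.csv")
print(call_count, avg_seconds >= 0.0)
```

With pandas installed, the two readers could be passed in as `pd.read_csv` and `lambda p: pd.read_csv(p, engine="pyarrow")`.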
The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants. In the future, I plan to test other operating systems and setups, with this and other datasets, so I can add more to the PyArrow versus NumPy discussion.
What this test does not prove
This test does not prove that PyArrow is always faster.
In this run, PyArrow won most of the Lahman files and won the full workload clearly. But pandas default still won several very small CSV files. That matters because parser startup overhead can be more important than throughput when a file is tiny. This test also does not prove that CSV is the best format for repeated analysis. The later CSV versus Parquet test looks at that question directly.
What it does do (and most importantly) is give me the opportunity to make a comma-separated joke. Come for the data and stay for the hilarity?
Accessibility notes
This report is built around text, summary cards, and tables. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it. Accessibility is important enough to me that I not only have an accessibility policy but am also committed to learning more about it, so your feedback will help me improve everything I do for everyone who uses it.
What did we learn?
PyArrow was faster for the full Lahman CSV workload, especially on larger and wider files.
But the file-by-file results matter. The smaller Lahman files show that pandas default can still be faster when the data is tiny enough that startup overhead matters more than bulk throughput. So in this case (as in all software development) the real answer about PyArrow versus NumPy is a big old "it depends". PyArrow is very strong for this particular workload. But the actual answer depends on the shape of your data and the work being done.
It's also important to note that while the percentage differences are large in some cases, the absolute times I am starting with are quite low. Neither reader is unreasonably slow, and I'm really talking about differences of less than 10 milliseconds per file. So this benchmark shows pronounced relative differences but quite small absolute ones.