About this test
This report compares two ways of loading Lahman baseball CSV files with pandas:
- pandas.read_csv(path)
- pandas.read_csv(path, engine="pyarrow")
The title keeps the original “NumPy versus PyArrow” framing because this test is a follow-up to an earlier article. The method labels are more precise: this page benchmarks pandas' default CSV reader against pandas' PyArrow CSV engine. It does not benchmark NumPy arrays directly.
The test uses 27 Lahman CSV files. Together, those files contain more than 700,000 rows and more than 13 million cells.
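As a rough illustration of how file, row, and cell totals like these could be gathered, here is a minimal sketch. The function name profile_csv_files and the folder argument are hypothetical, and it assumes that a cell means one row-by-column position (rows times columns) and that every file loads with pandas' default settings. The report does not say how its own totals were produced.

```python
from pathlib import Path

import pandas as pd


def profile_csv_files(folder):
    """Total the files, rows, and cells of every CSV file in a folder.

    Assumption: a "cell" is one row-by-column position, so a file
    contributes rows * columns cells (DataFrame.size).
    """
    totals = {"files": 0, "rows": 0, "cells": 0}
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)  # default reader, default settings
        totals["files"] += 1
        totals["rows"] += len(df)
        totals["cells"] += int(df.size)
    return totals
```

Calling profile_csv_files on a folder of CSV files returns a dictionary with the three totals. The same loop could record per-file row and column counts instead.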
Why this is interesting
The earlier version of this test used scraped and simulated fantasy baseball data. That was useful, but it was not the same as working with an existing public data collection.
The Lahman database is a better follow-up because it contains many differently shaped CSV files. Some are tiny metadata files. Some are larger statistical tables. Some are wider. Some are longer. That makes the test more realistic than one synthetic file repeated at different sizes.
The question is not just “which reader is faster?” It is also: where does PyArrow help, where does the pandas default still hold up, and how much does file shape matter?
How to read the table
The table contains one row per CSV file.
The main fields are:
- File: the Lahman CSV file name.
- Size: the CSV file size.
- Rows: the number of rows loaded during profiling.
- Columns: the number of columns loaded during profiling.
- pandas avg: average read time using pandas.read_csv(path).
- PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that file.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
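The Speedup and Winner fields follow directly from the two averages. A minimal sketch of that arithmetic is below. The function names are hypothetical, and the "tie" label for an exact 1.00 ratio is an assumption, since the report does not say how ties are shown.

```python
def speedup(pandas_avg, pyarrow_avg):
    """pandas default average time divided by PyArrow average time."""
    return pandas_avg / pyarrow_avg


def winner(pandas_avg, pyarrow_avg):
    """Name the faster method for one file.

    Assumption: an exact ratio of 1.0 is reported as a tie.
    """
    ratio = speedup(pandas_avg, pyarrow_avg)
    if ratio > 1.0:
        return "PyArrow"
    if ratio < 1.0:
        return "pandas default"
    return "tie"


def format_speedup(ratio):
    """Format a ratio in the table's style, for example 2.00x."""
    return f"{ratio:.2f}x"
```

For example, if the pandas default averaged 0.5 seconds and PyArrow averaged 0.25 seconds on some file (made-up numbers, not taken from the table), the speedup would be 2.00x and the winner would be PyArrow.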
Methodology
Each CSV file was read 100 times with each reader after two warmup runs.
The benchmark records CSV read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants.
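The procedure described above can be sketched as a small timing loop. This is an illustration under stated assumptions, not the script used for this report: the names time_reader and average_seconds are hypothetical, and time.perf_counter is assumed as the clock.

```python
import statistics
import time


def time_reader(read_fn, path, warmups=2, repeats=100):
    """Return the wall-clock time in seconds of each timed read_fn(path) call.

    Warmup calls run first and are not recorded. Only the read call
    itself sits between the two clock readings.
    """
    for _ in range(warmups):
        read_fn(path)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)
        timings.append(time.perf_counter() - start)
    return timings


def average_seconds(timings):
    """Average read time in seconds across the timed runs."""
    return statistics.mean(timings)
```

With pandas imported as pd, the two readers could be passed in as lambda p: pd.read_csv(p) and lambda p: pd.read_csv(p, engine="pyarrow"). Because the operating system may serve repeated reads from its file cache, the timed runs mostly measure parsing cost rather than cold disk access.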
What this test does not prove
This test does not prove that PyArrow is always faster.
In this run, PyArrow won most of the Lahman files and won the full workload clearly. But pandas default still won several very small CSV files. That matters because parser startup overhead can be more important than throughput when a file is tiny.
This test also does not prove that CSV is the best format for repeated analysis. The later CSV versus Parquet test looks at that question more directly.
Accessibility notes
This report is built around text, summary cards, and tables. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it.
What did we learn?
PyArrow was faster for the full Lahman CSV workload, especially on larger and wider files.
But the file-by-file results matter. The smaller Lahman files show that pandas default can still be faster when the data is tiny enough that startup overhead matters more than bulk throughput.
That is a more useful answer than a simple yes or no. PyArrow is very strong for this workload, but the best reader still depends on the shape of the data and the work being done.