Raw CSV Read Speed Across Lahman Tables

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark compares pandas' default CSV reader with pandas.read_csv using the PyArrow engine across the CSV files of the Lahman baseball database.

The goal is practical: test real CSV files with different sizes, shapes, and column counts instead of relying only on scraped or simulated data.

Raw CSV read benchmark for Lahman baseball files. Times are average milliseconds over 100 timed runs after two warmup runs.

About this test

This report compares two ways of loading Lahman baseball CSV files with pandas:

pandas.read_csv(path)
pandas.read_csv(path, engine="pyarrow")

The title keeps the original “NumPy versus PyArrow” framing because this test is a follow-up to an earlier article. The method labels are more precise. This page benchmarks pandas' default CSV reader against pandas' PyArrow CSV engine. It does not benchmark NumPy arrays directly.

The test uses 27 Lahman CSV files. Together, those files contain more than 700,000 rows and more than 13 million cells.
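Row and cell totals like these can be recomputed from the files themselves. The sketch below is one possible way to do it, not the script used for this report. It uses Python's standard csv module rather than pandas, counts a cell as one row times one column, and excludes the header row. The function names and the folder argument are hypothetical.

```python
import csv
from pathlib import Path


def profile_csv(path):
    """Return (rows, columns, cells) for one CSV file, excluding the header row."""
    # errors="replace" is an assumption: we only count fields here,
    # so undecodable bytes are replaced instead of raising an error.
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first line is treated as the header
        rows = sum(1 for _ in reader)      # every remaining line is one data row
    columns = len(header)
    return rows, columns, rows * columns


def profile_folder(folder):
    """Return (total_rows, total_cells) across every *.csv file in a folder."""
    total_rows = 0
    total_cells = 0
    for path in sorted(Path(folder).glob("*.csv")):
        rows, _columns, cells = profile_csv(path)
        total_rows += rows
        total_cells += cells
    return total_rows, total_cells
```

Calling profile_folder on a local copy of the Lahman CSV folder should give totals comparable to the ones quoted above, although small differences are possible if the profiling step in this report counted rows differently.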

Why this is interesting

The earlier version of this test used scraped and simulated fantasy baseball data. That was useful, but it was not the same as working with an existing public data collection.

The Lahman database is a better follow-up because it contains many differently shaped CSV files. Some are tiny metadata files. Some are larger statistical tables. Some are wider. Some are longer. That makes the test more realistic than one synthetic file repeated at different sizes.

The result is not just “which reader is faster?” It is also:

Where does PyArrow help, where does pandas default still hold up, and how much does file shape matter?

How to read the table

The table contains one row per CSV file.

The main fields are:

File: the Lahman CSV file name.
Size: the CSV file size.
Rows: the number of rows loaded during profiling.
Columns: the number of columns loaded during profiling.
pandas avg: average read time using pandas.read_csv(path).
PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
Speedup: pandas default average time divided by PyArrow average time.
Winner: the faster method for that file.

A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
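The Speedup and Winner fields follow directly from the two averages. A minimal sketch of that arithmetic is below. The function name and the "tie" label are assumptions, and the numbers in the comment are made up for illustration, not taken from the report.

```python
def speedup_and_winner(pandas_avg_ms, pyarrow_avg_ms):
    """Return (speedup, winner) using the definitions from the field list."""
    # Speedup: pandas default average time divided by PyArrow average time.
    speedup = pandas_avg_ms / pyarrow_avg_ms
    if speedup > 1.0:
        winner = "PyArrow"
    elif speedup < 1.0:
        winner = "pandas default"
    else:
        winner = "tie"  # assumed label; the report does not describe exact ties
    return speedup, winner


# Illustrative values only: 12.0 ms for pandas default, 4.0 ms for PyArrow.
# speedup_and_winner(12.0, 4.0) returns (3.0, "PyArrow")
```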

Methodology

Each CSV file was read 100 times with each reader after two warmup runs.

The benchmark records CSV read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.

The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants.
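The timing loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script behind this report: the names time_reader and lahman_readers are hypothetical, time.perf_counter is an assumed choice of clock, and the warmup runs are simply executed and discarded.

```python
import time
from statistics import mean


def time_reader(read_fn, path, timed_runs=100, warmup_runs=2):
    """Return the average time in milliseconds of read_fn(path)."""
    for _ in range(warmup_runs):
        read_fn(path)  # warmup runs are executed but not timed
    samples_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        read_fn(path)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(samples_ms)


def lahman_readers():
    """Return the two readers compared in this report, keyed by label."""
    # pandas is imported lazily so time_reader needs only the standard library.
    # engine="pyarrow" also requires the pyarrow package to be installed.
    import pandas as pd

    return {
        "pandas default": pd.read_csv,
        "PyArrow engine": lambda path: pd.read_csv(path, engine="pyarrow"),
    }
```

With pandas and pyarrow installed, each reader returned by lahman_readers() can be passed to time_reader together with the path of one Lahman CSV file. Because the warmup runs touch the file first, the timed runs mostly measure reads from the operating system cache, which is the caveat noted above.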

What this test does not prove

This test does not prove that PyArrow is always faster.

In this run, PyArrow was faster on most of the Lahman files and clearly faster across the full workload. But pandas default was still faster on several very small CSV files. That matters because parser startup overhead can outweigh throughput when a file is tiny.
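One way to picture the startup overhead point is a simple linear model: read time equals a fixed startup cost plus file size divided by throughput. The sketch below is only that model, with made-up numbers. It is not fitted to the Lahman results, and real parsers do not behave this neatly.

```python
def modeled_read_ms(file_kb, startup_ms, kb_per_ms):
    """Modeled read time: fixed startup cost plus size divided by throughput."""
    return startup_ms + file_kb / kb_per_ms


def break_even_kb(light_startup_ms, light_kb_per_ms, heavy_startup_ms, heavy_kb_per_ms):
    """File size at which a low-startup reader and a high-throughput reader tie.

    Assumes the "heavy" reader has both the higher startup cost and the
    higher throughput; otherwise there is no meaningful crossover.
    """
    extra_startup = heavy_startup_ms - light_startup_ms
    return (extra_startup * light_kb_per_ms * heavy_kb_per_ms) / (
        heavy_kb_per_ms - light_kb_per_ms
    )


# Made-up illustration: a reader with 1 ms startup at 10 KB/ms against a
# reader with 3 ms startup at 50 KB/ms ties at 25 KB in this model.
# Below that size the low-startup reader is faster; above it, the other one.
```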

This test also does not prove that CSV is the best format for repeated analysis. The later CSV versus Parquet test looks at that question more directly.

Accessibility notes

This report is built around text, summary cards, and tables. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it.

What did we learn?

PyArrow was faster for the full Lahman CSV workload, especially on larger and wider files.

But the file-by-file results matter. The smaller Lahman files show that pandas default can still be faster when the data is tiny enough that startup overhead matters more than bulk throughput.

That is a more useful answer than a simple yes or no. PyArrow is very strong for this workload, but the best reader still depends on the shape of the data and the work being done.

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page compares CSV read time only. It does not measure joins, data cleaning, analysis, or visualization work.
