A one-year follow-up with better data

This project is a follow-up to my earlier NumPy versus PyArrow CSV performance test. The original version used scraped and simulated fantasy baseball data. This version uses the Lahman baseball database, which gives the benchmark a better mix of real public CSV files: tiny metadata tables, medium-sized player-stat tables, wide team tables, and practical file groups that resemble actual baseball analysis work.

The title keeps the original “NumPy versus PyArrow” framing because that is where the project started. The reports use more precise labels internally: pandas.read_csv(path) and pandas.read_csv(path, engine="pyarrow"). The benchmark measures pandas CSV loading behaviour, not raw NumPy array operations.

The goal is not to crown one tool forever. The useful question is smaller and more practical: when working with Lahman CSV files, when does PyArrow help, when does the pandas default still hold up, and when does converting to Parquet make repeated analysis noticeably faster?

Overall findings

Raw CSV reads

Across 27 Lahman CSV files, PyArrow was faster on 20 files and pandas default was faster on 7 files. The full raw CSV workload showed a 5.50x overall speedup for PyArrow.
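
For readers who want the arithmetic spelled out, here is one plausible way to define the per-file speedup, the winner, and the overall speedup. The definitions are assumptions on my part (speedup as default time divided by PyArrow time, overall speedup as a ratio of summed times), and the timings in the example are made up, not taken from the reports.

```python
def speedup(default_seconds, pyarrow_seconds):
    """Assumed definition: how many times faster PyArrow was than the default."""
    return default_seconds / pyarrow_seconds


def winner(default_seconds, pyarrow_seconds):
    """Name the faster reader; a tie goes to the pandas default in this sketch."""
    return "pyarrow" if pyarrow_seconds < default_seconds else "pandas default"


def overall_speedup(timings):
    """Assumed definition: total default time divided by total PyArrow time."""
    total_default = sum(default for default, _ in timings)
    total_pyarrow = sum(pyarrow for _, pyarrow in timings)
    return total_default / total_pyarrow


# Made-up (default_seconds, pyarrow_seconds) pairs, for illustration only.
example = [(0.0010, 0.0020), (0.0500, 0.0100), (0.2000, 0.0400)]

for default_s, pyarrow_s in example:
    print(winner(default_s, pyarrow_s), round(speedup(default_s, pyarrow_s), 2))
print(round(overall_speedup(example), 2))
```

Under a ratio-of-totals definition, the larger files dominate the overall figure, so a handful of tiny files won by the pandas default would barely move it.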

The dataset contained 706,466 rows and 13,518,724 cells. That makes the result more useful than a single-file benchmark because the file shapes vary.
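
Row and cell totals like these can be reproduced with a few lines of standard-library Python. This sketch assumes that a row means a data row (header excluded) and that cells means rows multiplied by columns; the reports may count differently, and the sample text is made up.

```python
import csv
import io


def count_rows_and_cells(text_stream):
    """Return (data_rows, cells), assuming cells = data rows x header columns."""
    reader = csv.reader(text_stream)
    header = next(reader)           # first line is treated as the header
    rows = sum(1 for _ in reader)   # remaining lines are data rows
    return rows, rows * len(header)


# Made-up sample standing in for one Lahman table.
sample = io.StringIO("teamID,yearID,W,L\nAAA,2001,90,72\nBBB,2001,81,81\n")
rows, cells = count_rows_and_cells(sample)
print(rows, cells)
```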

File size matters

The small-file bucket was mixed: PyArrow won 13 small files and pandas default won 7. The medium-file bucket was much clearer: PyArrow won all 7 medium files.

No Lahman CSV files in this run were 10 MB or larger. That absence is useful context because this is a real public-data benchmark, not an artificial giant-file test.

Many-file workloads

Six practical Lahman workloads were tested: people and teams, regular-season player stats, postseason stats, awards, salary context, and the full CSV set.

PyArrow won all 6 workloads. The largest win was the regular-season player stats workload, which loaded Batting, Pitching, Fielding, and Appearances with a 5.90x speedup.

CSV versus Parquet

All 27 Lahman CSV files converted successfully to Parquet. The full CSV set was 42.15 MB. The generated Parquet files were 10.79 MB, about 25.6% of the CSV size.
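
The percentage follows directly from the two totals quoted above, as this short check shows. The variable names are mine.

```python
csv_mb = 42.15      # total size of the 27 CSV files, from the report
parquet_mb = 10.79  # total size of the generated Parquet files, from the report

ratio = parquet_mb / csv_mb   # Parquet size as a fraction of CSV size
saved = 1 - ratio             # fraction of space saved by converting

print(f"Parquet is {ratio:.1%} of the CSV size, a {saved:.1%} reduction")
```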

Parquet repeated reads were 7.07x faster than pandas default CSV reads overall and 1.29x faster than PyArrow CSV reads overall.

The practical lesson

PyArrow was clearly stronger for this Lahman CSV workload, especially once the files became large enough for throughput to matter.

But the smallest files still matter. Several tiny CSV files were faster with pandas default, and some tiny Parquet files were larger than their CSV sources. Tool choice still depends on data shape.

Why Parquet changes the question

CSV is still useful for publishing, sharing, and inspecting data. This project does not argue that CSV is bad.

The Parquet test asks a different question: after a public CSV dataset is converted once, does a local repeated-read format make future reads faster and the files smaller? In this run, the answer was yes.

The four reports

Raw CSV Read Speed Across Lahman Tables

The first report compares pandas default CSV reads with pandas read_csv using the PyArrow engine across all 27 Lahman CSV files.

It includes summary cards, featured examples, filters, sorting, and a full accessible table with file size, rows, columns, average read times, median read times, speedup, and winner.

Small, Medium, and Large CSV File Performance

The second report reuses the raw CSV results and groups the Lahman files by size: small files under 1 MB, medium files from 1 MB to 10 MB, and large files at 10 MB or more.
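
The bucketing rule is simple enough to sketch in a few lines. The function name size_bucket is hypothetical, and I am assuming decimal megabytes (1 MB = 1,000,000 bytes); a script that uses 1,048,576 bytes per MB would place borderline files differently.

```python
MB = 1_000_000  # assumption: decimal megabytes


def size_bucket(size_bytes):
    """Classify a file by size: small < 1 MB, medium 1 to < 10 MB, large >= 10 MB."""
    if size_bytes < 1 * MB:
        return "small"
    if size_bytes < 10 * MB:
        return "medium"
    return "large"


print(size_bucket(250_000), size_bucket(4_500_000), size_bucket(12_000_000))
```

In practice the byte count would come from something like os.path.getsize(path) for each CSV file.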

It shows that the small-file bucket was mixed while the medium-file bucket was entirely PyArrow wins.

Many-File Baseball Workload

The third report loads practical groups of Lahman files together. These workloads are closer to the first step of a real baseball data project than one-file-at-a-time benchmarking.

The test measures loading only. It deliberately excludes joins, group-by operations, modeling, and visualization so the loading step stays isolated.
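
A workload timing of this kind might look roughly like the sketch below: read every file in the group, time only the reads, and do nothing else with the tables. The function load_workload, the .csv file names, and the stand-in reader are all assumptions for illustration; in the real benchmark the reader would be pandas.read_csv or its PyArrow-engine variant.

```python
import time


def load_workload(paths, reader):
    """Load every file in a workload and return (tables, elapsed_seconds).

    Only the reads are timed: no joins, group-bys, or other processing.
    """
    start = time.perf_counter()
    tables = {path: reader(path) for path in paths}
    elapsed = time.perf_counter() - start
    return tables, elapsed


# Assumed file names for the regular-season player stats workload.
REGULAR_SEASON = ["Batting.csv", "Pitching.csv", "Fielding.csv", "Appearances.csv"]


def fake_reader(path):
    """Stand-in reader so the sketch runs without the Lahman files."""
    return f"table for {path}"


tables, elapsed = load_workload(REGULAR_SEASON, fake_reader)
print(len(tables), elapsed >= 0)
```

Passing the reader in as an argument keeps the workload definition identical for both engines, so only the parser changes between runs.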

CSV versus Parquet for Repeated Lahman Reads

The fourth report converts each Lahman CSV file to Parquet once, then compares repeated reads against the CSV timings from the raw benchmark.

It is the most forward-looking test in the project, because repeated analysis often benefits from a faster local working format.

Methodology

The benchmark uses a local Python script, Lahman CSV files, pandas, and PyArrow. Each reader was timed 100 times per file or workload, after two untimed warmup runs. The generated JSON files are copied beside each static report page and loaded in the browser.
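
The timing loop described above can be sketched with the standard library. The function name benchmark and the returned keys are hypothetical; the structure (two untimed warmups, then 100 timed runs, summarised as mean and median) follows the description, but the real script may differ in its details.

```python
import statistics
import time


def benchmark(func, warmups=2, runs=100):
    """Call func with untimed warmups, then time each of the measured runs."""
    for _ in range(warmups):
        func()  # warmup: result and duration are discarded

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)

    return {
        "runs": len(samples),
        "mean_seconds": statistics.mean(samples),
        "median_seconds": statistics.median(samples),
    }


# Trivial stand-in workload; the benchmark would pass a CSV or Parquet read here.
calls = []
result = benchmark(lambda: calls.append(1))
print(result["runs"], len(calls))
```

Reporting both the mean and the median helps show whether a few slow runs are skewing the average.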

The reports measure loading and repeated reads only. They do not include joins, group-by operations, data cleaning, modeling, visualization, or page rendering. That narrow scope is intentional because it keeps each result easier to interpret.

Operating-system file caching can affect repeated read benchmarks. These results should be treated as a local benchmark for this environment, not as universal law handed down from the mountain by a compiler wearing sandals.

Accessibility and data access

These reports are built around text, summary cards, filters, and accessible tables. The tables are the primary source of the report data. The pages do not require chart interpretation, colour-only meaning, or mouse-only interaction.

Each report is generated from a local data.json file. The JSON can be opened directly from the report page, saved, and inspected without scraping the page.
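
Once saved, a data.json file can be inspected with any JSON tool. The sketch below uses Python's json module on an inline string. The structure and key names ("files", "file", "winner", "speedup") are purely hypothetical, as are the values; check the real file for its actual schema.

```python
import json

# Hypothetical structure and made-up values; the real data.json may differ.
raw = """
{
  "files": [
    {"file": "A.csv", "winner": "pyarrow", "speedup": 5.0},
    {"file": "B.csv", "winner": "pandas default", "speedup": 0.5}
  ]
}
"""

data = json.loads(raw)

# Count wins per reader across all entries.
wins = {}
for entry in data["files"]:
    wins[entry["winner"]] = wins.get(entry["winner"], 0) + 1

print(wins)
```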

All reports in this project