NumPy versus PyArrow with Lahman Baseball Data

A practical benchmark using baseball data

This project is a follow-up to my earlier NumPy versus PyArrow CSV performance test. The original version used scraped and simulated fantasy baseball data. This version uses the Lahman Baseball Database, which provides a better mix of real public CSV files: metadata tables, player-stat tables, team tables, salary data, and practical groups of related files.

The project keeps the original “NumPy versus PyArrow” framing because that is where the research began. Internally, the reports use more exact labels: pandas.read_csv(path) and pandas.read_csv(path, engine="pyarrow"). The tests therefore measure pandas CSV loading behaviour under two parser engines, not raw NumPy array operations.
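As a rough illustration of the two calls being compared, the sketch below times each one on a single file. The helper names best_read_seconds and compare_engines, the repeat count, and the best-of-N timing rule are assumptions made for this example, not the project's actual benchmark script.

```python
import time
from typing import Callable, Dict


def best_read_seconds(read_fn: Callable[[str], object], path: str, repeats: int = 5) -> float:
    """Call read_fn(path) `repeats` times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)  # the loaded table is discarded; only the load time matters here
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_engines(path: str, repeats: int = 5) -> Dict[str, float]:
    """Time the two labelled calls on one CSV file (hypothetical helper)."""
    # Imported here so the timer above has no third-party dependency.
    # The engine="pyarrow" call also needs the pyarrow package installed.
    import pandas as pd

    return {
        "pandas.read_csv(path)": best_read_seconds(
            lambda p: pd.read_csv(p), path, repeats
        ),
        'pandas.read_csv(path, engine="pyarrow")': best_read_seconds(
            lambda p: pd.read_csv(p, engine="pyarrow"), path, repeats
        ),
    }
```

Taking the minimum of several runs is one common way to reduce noise from caching and background activity. A median or mean would be an equally reasonable choice.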

The purpose is not to declare a universal winner. The purpose is to examine how reader choice, file size, practical workload shape, and a native columnar format affect a familiar public dataset.

What this project found

Raw CSV reads

Across 27 Lahman CSV files, PyArrow was faster on 20 files while pandas default was faster on 7 files.

The full raw CSV workload showed a 5.50x overall speedup for PyArrow. Because the files vary substantially in row count, width, and size, that figure reflects a mix of table shapes and not a single favourable case.
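A per-file win count and an overall speedup answer different questions, and the sketch below shows how the two can be derived from the same timings. It assumes the overall speedup is the ratio of summed read times, which the report does not state explicitly. The function name, field names, file names, and timings are all made up for illustration.

```python
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict[str, float]:
    """Count per-file wins and compute an overall speedup from summed times."""
    pyarrow_wins = sum(1 for r in results if r["pyarrow_s"] < r["default_s"])
    default_wins = sum(1 for r in results if r["default_s"] < r["pyarrow_s"])
    total_default = sum(r["default_s"] for r in results)
    total_pyarrow = sum(r["pyarrow_s"] for r in results)
    return {
        "pyarrow_wins": pyarrow_wins,
        "default_wins": default_wins,
        # Assumption: overall speedup = total default time / total PyArrow time.
        "overall_speedup": total_default / total_pyarrow,
    }


# Made-up timings in seconds: two tiny files and one larger file.
example = [
    {"file": "tiny_a.csv", "default_s": 0.010, "pyarrow_s": 0.020},
    {"file": "tiny_b.csv", "default_s": 0.010, "pyarrow_s": 0.020},
    {"file": "larger.csv", "default_s": 1.000, "pyarrow_s": 0.100},
]
summary = summarize(example)
print(summary)  # pandas default wins 2 of 3 files, yet the overall ratio is about 7.3x
```

Under this definition the larger files dominate the total, which is how a reader can lose on several small files and still show a large overall speedup.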

Open Test 1

File size matters

The small-file bucket was mixed: PyArrow won 13 small files while pandas default won 7.

The medium-file bucket was clearer: PyArrow won all 7 medium files. The Lahman dataset contains no files of 10 MB or larger, so the large-file bucket is empty. The report states that limitation openly instead of hiding it.

Open Test 2

Many-file workloads

Six practical Lahman workloads were tested: people and teams, regular-season player statistics, postseason data, awards, salary context, and the full CSV set.

PyArrow won all six workloads. The largest measured win was the regular-season player-stat workload, which loaded Batting, Pitching, Fielding, and Appearances with a 5.90x speedup.

Open Test 3

CSV versus Parquet

All 27 Lahman CSV files converted successfully to Parquet. The full CSV set was 42.15 MB, while the generated Parquet files totalled 10.79 MB, roughly a quarter of the original size.

Repeated Parquet reads were 7.07x faster overall than pandas default CSV reads and 1.29x faster overall than PyArrow CSV reads.

Open Test 4

The four reports

Raw CSV Read Speed Across Lahman Tables

Compare pandas default CSV reads with pandas read_csv using the PyArrow engine across all 27 Lahman CSV files.

Includes summary cards, examples, filters, sorting, and an accessible file-level table.

Open Test 1

Small, Medium, and Large CSV File Performance

Reuse the raw CSV benchmark results and group Lahman files by size: under 1 MB, 1 MB to under 10 MB, and 10 MB or larger.

This report shows where the original benchmark design was useful and where the dataset could not answer the larger-file question.
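The three size groups can be expressed as a small classification function. This is a minimal sketch: the bucket names, the helper names, and the choice of 1 MB = 1,000,000 bytes are assumptions, since the report does not say whether it uses decimal or binary megabytes.

```python
from collections import Counter
from typing import Dict

MB = 1_000_000  # assumption: decimal megabytes


def size_bucket(size_bytes: int) -> str:
    """Map a file size to the report's three groups."""
    if size_bytes < 1 * MB:
        return "small"   # under 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # 1 MB to under 10 MB
    return "large"       # 10 MB or larger


def bucket_counts(sizes: Dict[str, int]) -> Counter:
    """Count how many files fall into each bucket, given name -> size in bytes."""
    return Counter(size_bucket(size) for size in sizes.values())
```

For files on disk, the sizes could be gathered with pathlib, for example `{p.name: p.stat().st_size for p in data_dir.glob("*.csv")}`, where data_dir is a hypothetical folder holding the Lahman CSV files.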

Open Test 2

Many-File Baseball Workload

Load practical groups of Lahman files together. These workloads are closer to the beginning of real baseball-data work than a one-file-at-a-time benchmark.

The test isolates loading and deliberately excludes joins, grouping, modelling, and visualization.
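A load-only workload timing might look like the sketch below. The four table names come from the regular-season player-stat workload described in the findings. The .csv file names, the WORKLOADS dictionary, and the time_workload helper are assumptions for illustration. The reader function is passed in, so the same timer works for either pandas engine.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical workload definition; the file names are assumed, not confirmed.
WORKLOADS = {
    "regular_season_player_stats": [
        "Batting.csv",
        "Pitching.csv",
        "Fielding.csv",
        "Appearances.csv",
    ],
}


def time_workload(
    read_fn: Callable[[Path], object],
    data_dir: Path,
    filenames: Iterable[str],
) -> Tuple[Dict[str, object], float]:
    """Load every file in a workload and return (tables, elapsed seconds).

    Only loading is timed: no joins, grouping, or other processing happens here.
    """
    start = time.perf_counter()
    tables = {name: read_fn(Path(data_dir) / name) for name in filenames}
    elapsed = time.perf_counter() - start
    return tables, elapsed
```

With pandas installed, a call such as `time_workload(lambda p: pd.read_csv(p, engine="pyarrow"), data_dir, WORKLOADS["regular_season_player_stats"])` would time the PyArrow-engine version of the workload, where data_dir is a hypothetical local folder.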

Open Test 3

CSV versus Parquet for Repeated Lahman Reads

Convert each Lahman CSV file to Parquet once, then compare repeated reads against the completed CSV benchmark.

This test asks a more workflow-oriented question: after a one-time conversion, does a local columnar working format make repeated reads meaningfully faster?
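The convert-once step can be sketched as follows. The helper names and the output folder layout are assumptions, and the conversion shown here goes through pandas, which may differ from how the project generated its Parquet files.

```python
from pathlib import Path


def parquet_path_for(csv_path: Path, out_dir: Path) -> Path:
    """Return the Parquet path that corresponds to a CSV file."""
    return Path(out_dir) / (Path(csv_path).stem + ".parquet")


def convert_once(csv_path: Path, out_dir: Path) -> Path:
    """Convert a CSV file to Parquet unless the Parquet file already exists."""
    target = parquet_path_for(csv_path, out_dir)
    if target.exists():
        return target  # already converted; repeated runs skip the work

    # Imported lazily; writing Parquet needs pandas plus a Parquet engine
    # such as pyarrow.
    import pandas as pd

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pd.read_csv(csv_path).to_parquet(target, index=False)
    return target


def read_parquet_table(parquet_path: Path):
    """Repeated-read step: load the converted file back into a DataFrame."""
    import pandas as pd

    return pd.read_parquet(parquet_path)
```

Because the conversion cost is paid only on the first call, it stays outside the repeated-read timing that the comparison is about.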

Open Test 4

Methodology and access

The benchmark uses local Python scripts, Lahman CSV files, pandas, and PyArrow. Generated JSON is stored beside each static report page and loaded directly in the browser.
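Writing benchmark output as JSON beside a report page could be as simple as the sketch below. The function name, the default file name results.json, and the record fields are hypothetical. The actual JSON layout used by the reports is not described here.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_report_json(
    report_dir: Path,
    results: List[Dict],
    filename: str = "results.json",
) -> Path:
    """Write benchmark results as JSON into the report's own folder."""
    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)
    out_path = report_dir / filename
    # indent=2 keeps the file readable when opened directly in a browser.
    out_path.write_text(json.dumps({"results": results}, indent=2), encoding="utf-8")
    return out_path
```

A static page in the same folder can then fetch the file by relative path, which fits the no-server setup described above.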

The reports measure loading and repeated reads only. They do not include joins, group-by operations, cleaning, modelling, visualization, or page rendering. That narrow scope is intentional because it keeps each result easier to interpret.

Every report is built around readable text, summary cards, filters, and accessible tables. The generated data can be opened directly from each report page without scraping the site.