A larger public-data benchmark

The Lahman Baseball Database project raised a useful question but had an important limitation: the dataset did not include genuinely large CSV files.

This project uses monthly NYC Yellow Taxi trip-record data from January 2019 through December 2022. The 48-file corpus contains 179,807,942 rows and derived CSV files ranging from roughly 24 MB to 813 MB.

The goal is to examine the same pandas CSV reader question under a much more demanding workload: how pandas.read_csv(path) compares with pandas.read_csv(path, engine="pyarrow").
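As a rough illustration of the comparison, the sketch below times repeated reads of one CSV file with either engine. The helper name time_read and the repeats parameter are assumptions made for this example, not the project's actual harness; only the two pandas.read_csv call forms come from the text above.

```python
import statistics
import time

import pandas as pd


def time_read(path, engine=None, repeats=3):
    """Return the median wall-clock seconds over `repeats` reads of one CSV.

    engine=None uses pandas default parsing; engine="pyarrow" requires the
    pyarrow package to be installed.
    """
    kwargs = {} if engine is None else {"engine": engine}
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        frame = pd.read_csv(path, **kwargs)
        timings.append(time.perf_counter() - start)
        del frame  # drop the DataFrame before the next timed read
    return statistics.median(timings)
```

Calling time_read(path) and time_read(path, engine="pyarrow") on the same file gives the two numbers a per-file speedup would be built from.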

What this project found

Single-file CSV reads

PyArrow won all 48 individual file comparisons. The median per-file speedup was 10.44x.

Summed across the full corpus, the per-file median read times came to about 32.5 seconds for PyArrow and about 288.6 seconds for pandas default parsing: an 8.89x overall speedup.
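The median per-file speedup and the overall speedup answer different questions, which is why the two figures differ. The sketch below uses made-up timings (not project data) to show how a median of per-file ratios and a ratio of summed times can diverge when the largest file has a smaller speedup. The function name summarize_speedups is hypothetical.

```python
import statistics


def summarize_speedups(pandas_seconds, pyarrow_seconds):
    """Return (median per-file speedup, overall speedup from summed times)."""
    ratios = [p / a for p, a in zip(pandas_seconds, pyarrow_seconds)]
    overall = sum(pandas_seconds) / sum(pyarrow_seconds)
    return statistics.median(ratios), overall


# Illustrative timings only: two small files and one large file.
pandas_seconds = [1.0, 2.0, 30.0]
pyarrow_seconds = [0.1, 0.2, 5.0]
median_ratio, overall_ratio = summarize_speedups(pandas_seconds, pyarrow_seconds)
print(f"median per-file: {median_ratio:.2f}x, overall: {overall_ratio:.2f}x")
```

In this toy case the median per-file speedup is 10x while the overall figure is lower, because the large file contributes most of the summed time.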

File-size analysis

The 48 completed file-level results were grouped into tiny, small, medium, large, and giant CSV buckets without reopening or rerunning the source files.

PyArrow won every file in every bucket. The strongest aggregate result came from the medium bucket at 11.47x, while giant files remained strong wins at 7.04x.
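A derived grouping of this kind can be sketched with pandas.cut over the stored per-file results. Only the outer thresholds (tiny under 100 MB, giant at 750 MB or larger) come from the reports; the 250 MB and 500 MB edges, the column names, and the function name below are assumptions for illustration.

```python
import pandas as pd

# Bucket edges in MB. The 100 and 750 edges follow the report text;
# the 250 and 500 edges are hypothetical placeholders.
EDGES_MB = [0, 100, 250, 500, 750, float("inf")]
LABELS = ["tiny", "small", "medium", "large", "giant"]


def bucket_speedups(results):
    """Aggregate speedup per size bucket from completed file-level results.

    `results` needs columns size_mb, pandas_s, pyarrow_s (assumed names).
    No source file is reopened: only stored timings are used.
    """
    # right=False makes each bucket [low, high), so 750 MB lands in "giant".
    buckets = pd.cut(results["size_mb"], bins=EDGES_MB, labels=LABELS, right=False)
    totals = results.groupby(buckets, observed=True)[["pandas_s", "pyarrow_s"]].sum()
    return totals["pandas_s"] / totals["pyarrow_s"]
```

Dividing summed times within each bucket, rather than averaging per-file ratios, keeps the bucket figure consistent with how an overall corpus speedup is computed.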

Many-file workloads

Seven deliberate read-plus-concatenate workloads tested small imports, recovery-era periods, mixed file shapes, cross-year comparisons, and complete annual datasets.

PyArrow won all seven workloads. The annual 2019 workload covered 12 files and 84.6 million rows, and PyArrow still won there by 2.52x.
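A read-plus-concatenate workload can be sketched as follows. The function name run_workload and its return shape are assumptions for this example; the project's own timing code may differ.

```python
import time

import pandas as pd


def run_workload(paths, engine=None):
    """Read every CSV in `paths`, concatenate them, and time the whole step.

    Returns (combined DataFrame, list of source DataFrames, elapsed seconds).
    """
    kwargs = {} if engine is None else {"engine": engine}
    start = time.perf_counter()
    # The source frames stay referenced, so peak memory includes
    # both the individual frames and the concatenated copy.
    frames = [pd.read_csv(path, **kwargs) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    elapsed = time.perf_counter() - start
    return combined, frames, elapsed
```

Because the timed region covers concatenation as well as parsing, the parser accounts for only part of the total, which is one plausible reason a workload-level speedup can be smaller than a single-file one.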

The three Round 2 reports

NYC TLC CSV Read Speed

The complete 48-file single-file benchmark. It compares pandas default CSV parsing and the PyArrow engine across four years of monthly Yellow Taxi trip-record files.

Includes timing, file size, row count, filtering, sorting, and an accessible file-level table.

CSV File-Size Performance

A derived analysis that groups the completed single-file results by file size, from tiny files under 100 MB through giant files at 750 MB or larger.

It tests the useful but incomplete idea that parser advantage always grows with file size.

Many-File CSV Workloads

Seven workload-level tests that load several monthly files, retain the source DataFrames, and concatenate them into one analysis-ready table.

This is the bridge between an isolated parser benchmark and the first stage of a real analysis workflow.

Why this project matters

A single-file benchmark can make parser choice look simple. The larger NYC TLC corpus shows the fuller picture: PyArrow remains faster across a broad range of file sizes, but multi-file workflows also bring memory allocation, concatenation, and system pressure into the result.

The next stage is Round 3: native Parquet reads. That work will compare the completed CSV benchmark against the original Parquet corpus and ask the larger workflow question: what changes when the working format is not CSV in the first place?

Methodology and access

The benchmark was run locally on Windows 11 using Python, pandas, PyArrow, and isolated subprocesses. The reports separate first reads from repeated reads, validate expected row and column counts, and keep the generated JSON beside each static report page.
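One way to run a timed read in an isolated subprocess and validate its shape is sketched below. The child script, the function name, the "default" engine label, and the JSON field names are all assumptions for illustration, not the project's actual runner or report schema.

```python
import json
import subprocess
import sys

# Small script executed in a fresh interpreter, so each read starts
# without DataFrames or caches left over from a previous run.
CHILD = """
import json, sys, time
import pandas as pd
path, engine = sys.argv[1], sys.argv[2]
kwargs = {} if engine == "default" else {"engine": engine}
start = time.perf_counter()
frame = pd.read_csv(path, **kwargs)
elapsed = time.perf_counter() - start
print(json.dumps({"seconds": elapsed, "rows": len(frame), "columns": frame.shape[1]}))
"""


def timed_read_in_subprocess(path, engine="default",
                             expected_rows=None, expected_columns=None):
    """Time one read in a child process and check row and column counts."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD, str(path), engine],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(completed.stdout)
    if expected_rows is not None and result["rows"] != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {result['rows']}")
    if expected_columns is not None and result["columns"] != expected_columns:
        raise ValueError(f"expected {expected_columns} columns, got {result['columns']}")
    return result
```

The returned dictionary can be written out with json.dump, which fits the idea of keeping generated JSON beside each report page.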

The pages are built around readable text, summary cards, filtering, sorting, captions, and accessible tables. Charts are not required to understand the result, and every report exposes its source JSON for inspection.