About this test
This report reuses the completed Round 2 single-file NYC TLC CSV benchmark. It does not reopen the files or rerun the readers.
Instead, it groups the 48 completed file-level results into five derived CSV size buckets:
- Tiny: under 100 MB
- Small: 100 MB to under 250 MB
- Medium: 250 MB to under 500 MB
- Large: 500 MB to under 750 MB
- Giant: 750 MB and larger
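The bucket boundaries above can be sketched as a small lookup. This is a minimal illustration, not the report's own code: the function name `size_bucket` is hypothetical, and it assumes decimal megabytes (1 MB = 1,000,000 bytes), which the source does not specify.

```python
from bisect import bisect_right

# Assumption: decimal megabytes. The report does not say whether MB means 10^6 or 2^20 bytes.
MB = 1_000_000

# Lower bounds (in MB) at which the next bucket starts, taken from the list above.
BOUNDS_MB = [100, 250, 500, 750]
NAMES = ["Tiny", "Small", "Medium", "Large", "Giant"]


def size_bucket(size_bytes: int) -> str:
    """Return the bucket name for a CSV file of the given size in bytes."""
    # bisect_right places a size equal to a boundary into the larger bucket,
    # which matches "100 MB to under 250 MB" style ranges.
    return NAMES[bisect_right(BOUNDS_MB, size_bytes / MB)]
```

Under these assumptions, a file of exactly 250 MB falls in the Medium bucket, and anything at or above 750 MB is Giant.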
The question is practical:
Does the advantage of pandas.read_csv(path, engine="pyarrow") become larger as CSV files become larger?
What the benchmark found
PyArrow won every file in every size bucket.
The relationship was not a simple staircase where each larger bucket produced a larger speedup. The strongest aggregate result came from the medium bucket, where 21 files totaling about 6.8 GB of CSV data were read about 11.47x faster in aggregate with the PyArrow engine.
The giant files were still decisive PyArrow wins, with an aggregate speedup of about 7.04x. Their smaller relative advantage does not mean the PyArrow engine stopped helping. A more likely explanation is that larger files introduce more variation from allocation, memory pressure, operating-system behavior, and other parts of the end-to-end load.
The result remains clear across the full range: from the 24 MB April 2020 file to the 813 MB March 2019 file, PyArrow was faster in every completed comparison.
Why this is useful
The first report established that PyArrow won all 48 individual CSV reads. This report asks whether those wins have a meaningful relationship to file size.
That matters because parser startup costs can be more visible for small files, while throughput usually matters more once files become large enough.
The findings support a careful conclusion:
File size matters, but it is not the only factor. In this corpus, PyArrow was consistently faster at every tested size, while the size of the advantage varied across the buckets.
How to read the summary table
The bucket table contains one row per size bucket.
The main fields are:
- Bucket: the file-size range.
- Files: number of completed file comparisons in the bucket.
- CSV size: total derived CSV storage across the bucket.
- Rows: total rows across the bucket.
- PyArrow wins: files where the PyArrow engine was faster.
- Median speedup: the median of the per-file speedups, where each speedup is the pandas-default median time divided by the PyArrow median time.
- Aggregate speedup: summed pandas-default median times divided by summed PyArrow median times.
The file table contains one row per monthly Yellow Taxi CSV file and can be searched, filtered by size bucket, and sorted.
A speedup above 1.00x means the PyArrow engine was faster.
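The difference between the median speedup and the aggregate speedup is easier to see in code. The sketch below is an illustration only: the function name `bucket_speedups` and the sample timings are invented for the example and are not values from the benchmark.

```python
import statistics


def bucket_speedups(timings):
    """Summarize one bucket.

    `timings` is a list of (pandas_default_median_s, pyarrow_median_s) pairs,
    one pair per file.
    """
    per_file = [default_s / pyarrow_s for default_s, pyarrow_s in timings]
    total_default = sum(default_s for default_s, _ in timings)
    total_pyarrow = sum(pyarrow_s for _, pyarrow_s in timings)
    return {
        "files": len(timings),
        # A file counts as a PyArrow win when its speedup is above 1.00x.
        "pyarrow_wins": sum(1 for ratio in per_file if ratio > 1.0),
        # Median of the per-file ratios: every file counts equally.
        "median_speedup": statistics.median(per_file),
        # Ratio of summed times: slower files carry more weight.
        "aggregate_speedup": total_default / total_pyarrow,
    }


# Invented timings in seconds, for illustration only.
example = bucket_speedups([(10.0, 1.0), (4.0, 1.0), (30.0, 6.0)])
print(example)
```

With these made-up numbers the per-file speedups are 10x, 4x, and 5x, so the median is 5x, while the aggregate is 44 / 8 = 5.5x. The two figures diverge because the aggregate weights each file by how long it takes to read.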
Methodology
This report analyzes the completed results from:
output/round2-read-csv/data.json
The source benchmark compared:
pd.read_csv(path)
and:
pd.read_csv(path, engine="pyarrow")
Each original file-level result includes three repeated reads per reader after a separately reported first read. This report uses the original repeated-read median values.
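The timing scheme described above (one separately reported first read, then three repeated reads summarized by their median) might look roughly like the following. This is a sketch under stated assumptions, not the source benchmark's harness: `time_reader` and the result keys are hypothetical names, and the reader is passed in as a callable so the timing logic stays independent of pandas.

```python
import statistics
import time
from typing import Callable


def time_reader(read: Callable[[], object], repeats: int = 3) -> dict:
    """Time one first read, then `repeats` repeated reads of the same file."""
    # First read, reported separately from the repeated reads.
    start = time.perf_counter()
    read()
    first_read_s = time.perf_counter() - start

    # Repeated reads; only their median feeds the speedup figures.
    repeated = []
    for _ in range(repeats):
        start = time.perf_counter()
        read()
        repeated.append(time.perf_counter() - start)

    return {
        "first_read_s": first_read_s,
        "repeat_median_s": statistics.median(repeated),
    }
```

With pandas and PyArrow installed, the two readers could be passed as `lambda: pd.read_csv(path)` and `lambda: pd.read_csv(path, engine="pyarrow")`, where `path` points at one monthly CSV file.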
No CSV files were reopened. No new timing was collected. No values were estimated.
The source benchmark environment was:
- Windows 11
- Python 3.14.6
- pandas 3.0.3
- PyArrow 24.0.0
- psutil 7.2.2
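Anyone comparing their own results against this environment may want to record the same details. The sketch below shows one way to collect them with the standard library. The function name `environment_report` is hypothetical, and this is not necessarily how the source benchmark captured its environment.

```python
import platform
from importlib import metadata


def environment_report(packages=("pandas", "pyarrow", "psutil")) -> dict:
    """Collect OS, Python, and package versions for a benchmark report."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report[name] = None
    return report


print(environment_report())
```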
Important caveats
This is a size-grouped analysis of one local benchmark environment.
It does not establish a universal file-size threshold where PyArrow begins to win. It does not prove that every CSV dataset, operating system, pandas version, or disk configuration will have the same pattern.
Repeated local reads can benefit from operating-system caching. The source benchmark does not claim cold-cache timing.
The size buckets are useful analytical categories, not natural laws. A 250 MB CSV and a 500 MB CSV can have very different column types, value distributions, and parser behavior.
What this test does not prove
This report does not prove that bigger files always create bigger PyArrow wins.
The giant bucket shows exactly why that statement would be too simple. PyArrow remained faster, but its aggregate advantage was smaller than in the medium bucket.
It also does not compare CSV with native Parquet. That is a separate Round 3 question.
It does show that, in this specific 48-file NYC TLC CSV corpus, the PyArrow engine was faster across every tested file-size range.
Accessibility notes
This report is built around text, bucket summaries, and tables.
The file table is the primary source of the report data. It can be searched, filtered by bucket, and sorted with standard controls. Result counts update after filtering. All outcomes are available in text and raw values, so the report does not depend on colour, charts, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.
What did we learn?
The early Lahman baseball version of this question was constrained because the Lahman data did not include genuinely large CSV files.
The NYC TLC corpus fixes that problem. It spans five usable file-size ranges and more than 18 GB of derived CSV data. PyArrow won every bucket and every file.
The useful lesson is not “larger always means faster.” The useful lesson is that once CSV parsing is a meaningful part of a repeated analytical workflow, parser choice is not a small implementation detail. It is a recurring performance decision.