This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This report groups the completed 48-file NYC TLC CSV benchmark into five size buckets, from tiny files under 100 MB to giant files of 750 MB or more.

It reuses completed results only. No CSV files are reopened and no benchmark timing is rerun here.

- NYC TLC CSV File-Size Performance -

How pandas default CSV reads and the PyArrow CSV engine behave across tiny, small, medium, large, and giant CSV files.


PyArrow won at every size. The size of the advantage varied by bucket, so file size alone does not determine the speedup.

- Size-Bucket Summary -

Each card summarizes completed single-file benchmark results for one CSV size range.


- Full File-Size Table -

Search files, filter by size bucket, and sort by file size, rows, timing, or speedup.

NYC TLC Yellow Taxi CSV file-size analysis. Times are completed repeated-read medians from the original Round 2 single-file benchmark.

Download the data

This page is generated from data.json. The bucket summaries and full file-level data are available without scraping the page.

Download this report's data.json

About this test

This report reuses the completed Round 2 single-file NYC TLC CSV benchmark. It does not reopen the files or rerun the readers.

Instead, it groups the 48 completed file-level results into five derived CSV size buckets:

  • Tiny: under 100 MB
  • Small: 100 MB to under 250 MB
  • Medium: 250 MB to under 500 MB
  • Large: 500 MB to under 750 MB
  • Giant: 750 MB and larger
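
The bucket boundaries above can be sketched as a small lookup. This is an illustration only, not the script that generated the report: the function name size_bucket and the constant names are hypothetical, and the sketch assumes decimal megabytes (1 MB = 1,000,000 bytes), which the report does not state.

```python
MB = 1_000_000  # assumption: decimal megabytes; the report does not say which unit it uses

# (exclusive upper bound in MB, label), in ascending order
BUCKET_BOUNDS = [
    (100, "Tiny"),
    (250, "Small"),
    (500, "Medium"),
    (750, "Large"),
]


def size_bucket(size_bytes: int) -> str:
    """Return the size-bucket label for a CSV file of the given size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    for upper_mb, label in BUCKET_BOUNDS:
        # Compare in whole bytes so boundary values are not affected by float rounding.
        if size_bytes < upper_mb * MB:
            return label
    return "Giant"  # 750 MB and larger
```

Each upper bound is exclusive, so a file of exactly 250 MB lands in Medium rather than Small, matching the "to under" wording of the list.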

The question is practical:

Does the advantage of pandas.read_csv(path, engine="pyarrow") become larger as CSV files become larger?

What the benchmark found

PyArrow won every file in every size bucket.

The relationship was not a simple staircase where each larger bucket produced a larger speedup. The strongest aggregate result came from the medium bucket, where 21 files totaling about 6.8 GB of CSV data were read about 11.47x faster in aggregate with the PyArrow engine.

The giant files were still decisive PyArrow wins, with an aggregate speedup of about 7.04x. Their smaller relative advantage does not mean the PyArrow engine stopped helping. It suggests that larger files can introduce more variation from allocation, memory pressure, operating-system behavior, and other parts of the end-to-end load.

The result remains clear across the full range: from the 24 MB April 2020 file to the 813 MB March 2019 file, PyArrow was faster in every completed comparison.

Why this is useful

The first report established that PyArrow won all 48 individual CSV reads. This report asks whether those wins have a meaningful relationship to file size.

That matters because parser startup costs can be more visible for small files, while throughput usually matters more once files become large enough.

The findings support a careful conclusion:

File size matters, but it is not the only factor. In this corpus, PyArrow was consistently faster at every tested size, while the size of the advantage varied across the buckets.

How to read the summary table

The bucket table contains one row per size bucket.

The main fields are:

  • Bucket: the file-size range.
  • Files: number of completed file comparisons in the bucket.
  • CSV size: total derived CSV storage across the bucket.
  • Rows: total rows across the bucket.
  • PyArrow wins: files where the PyArrow engine was faster.
  • Median speedup: median of the per-file ratios of pandas-default time to PyArrow time.
  • Aggregate speedup: summed pandas-default median times divided by summed PyArrow median times.
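
The two speedup fields can give different numbers for the same bucket, because the aggregate figure is weighted toward the slowest files. A minimal sketch of both calculations follows. The function names are hypothetical, and the timings are made-up values chosen for easy arithmetic, not results from this benchmark.

```python
from statistics import median


def median_speedup(pairs):
    """Median of per-file ratios: pandas-default time / PyArrow time."""
    return median(default_s / pyarrow_s for default_s, pyarrow_s in pairs)


def aggregate_speedup(pairs):
    """Summed pandas-default times divided by summed PyArrow times."""
    total_default = sum(default_s for default_s, _ in pairs)
    total_pyarrow = sum(pyarrow_s for _, pyarrow_s in pairs)
    return total_default / total_pyarrow


# Made-up timings in seconds (pandas default, PyArrow engine), NOT benchmark results.
example = [(1.0, 0.5), (4.0, 1.0), (30.0, 3.0)]
print(median_speedup(example))     # per-file ratios are 2.0, 4.0, 10.0
print(aggregate_speedup(example))  # 35.0 / 4.5
```

In this toy case the median ratio is 4.0 while the aggregate ratio is close to 7.8, because the single slow file dominates both sums.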

The file table contains one row per monthly Yellow Taxi CSV file and can be searched, filtered by size bucket, and sorted.

A speedup above 1.00x means the PyArrow engine was faster.

Methodology

This report analyzes the completed results from:

output/round2-read-csv/data.json

The source benchmark compared:

pd.read_csv(path)

and:

pd.read_csv(path, engine="pyarrow")

Each original file-level result includes three repeated reads per reader after a separately reported first read. This report uses the original repeated-read median values.
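
The read pattern described above (one separately reported first read, then the median of three repeated reads) can be sketched with a small timing helper. This is not the original benchmark harness, which this report does not show. The name time_reader is hypothetical, and the pandas calls appear only as commented usage so the helper itself runs without extra packages.

```python
import time
from statistics import median


def time_reader(read, repeats=3):
    """Time one first call, then `repeats` further calls.

    Returns (first_read_seconds, median_repeated_seconds).
    """
    def timed_call():
        start = time.perf_counter()
        read()
        return time.perf_counter() - start

    first = timed_call()
    repeated = [timed_call() for _ in range(repeats)]
    return first, median(repeated)


# Hypothetical usage with the two readers compared in this report
# (requires pandas and pyarrow; `path` is a placeholder):
#
#   import pandas as pd
#   default_first, default_median = time_reader(lambda: pd.read_csv(path))
#   arrow_first, arrow_median = time_reader(
#       lambda: pd.read_csv(path, engine="pyarrow"))
#   speedup = default_median / arrow_median
```

Keeping the first read out of the median separates a possibly cold first access from the repeated reads, which may be served from the operating-system cache.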

No CSV files were reopened. No new timing was collected. No values were estimated.

The source benchmark environment was:

  • Windows 11
  • Python 3.14.6
  • pandas 3.0.3
  • PyArrow 24.0.0
  • psutil 7.2.2

Important caveats

This is a size-grouped analysis of one local benchmark environment.

It does not establish a universal file-size threshold where PyArrow begins to win. It does not prove that every CSV dataset, operating system, pandas version, or disk configuration will have the same pattern.

Repeated local reads can benefit from operating-system caching. The source benchmark does not claim cold-cache timing.

The size buckets are useful analytical categories, not natural laws. A 250 MB CSV and a 500 MB CSV can have very different column types, value distributions, and parser behavior.

What this test does not prove

This report does not prove that bigger files always create bigger PyArrow wins.

The giant bucket shows exactly why that statement would be too simple. PyArrow remained faster, but the aggregate advantage was smaller than in the medium bucket.

It also does not compare CSV with native Parquet. That is a separate Round 3 question.

It does show that, in this specific 48-file NYC TLC CSV corpus, the PyArrow engine was faster across every tested file-size range.

Accessibility notes

This report is built around text, bucket summaries, and tables.

The table is the primary source of the report data. It can be searched, filtered by bucket, and sorted with standard controls. Result counts update after filtering. All outcomes are available in text and raw values, so the report does not depend on colour, charts, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.

What did we learn?

The early Lahman baseball version of this question was constrained because that dataset did not contain genuinely large CSV files.

The NYC TLC corpus fixes that problem. It spans five usable file-size ranges and more than 18 GB of derived CSV data. PyArrow won every bucket and every file.

The useful lesson is not “larger always means faster.” The useful lesson is that once CSV parsing is a meaningful part of a repeated analytical workflow, parser choice is not a small implementation detail. It is a recurring performance decision.

Related Links

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files.

About This Benchmark

This report reanalyzes completed CSV read results by file size. It does not reopen files or create new benchmark timing.