NYC TLC CSV File-Size Performance: pandas Default versus PyArrow

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This report groups the completed 48-file NYC TLC CSV benchmark into five size buckets, from tiny files under 100 MB to giant files of 750 MB or more.

It reuses completed results only. No CSV files are reopened and no benchmark timing is rerun here.

NYC TLC Yellow Taxi CSV file-size analysis. Times are completed repeated-read medians from the original Round 2 single-file benchmark.

About this test

This report reuses the completed Round 2 single-file NYC TLC CSV benchmark. It does not reopen the files or rerun the readers. If you've been paying attention, you'll notice this is the same methodology I used in the small-medium-large benchmarking project on Lahman's baseball data.

And of course you've been paying attention... after all, it's summer in Canada and so clearly you have nothing more interesting to do than read weirdly specific benchmarking data.

Like my test on Lahman's Baseball Database, this report groups the 48 completed file-level results into five derived CSV size buckets:

Tiny: under 100 MB
Small: 100 MB to under 250 MB
Medium: 250 MB to under 500 MB
Large: 500 MB to under 750 MB
Giant: 750 MB and larger
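
As a minimal sketch, the bucket boundaries above can be written as a small lookup. The function name and the decimal definition of a megabyte (1,000,000 bytes) are my assumptions for illustration; the benchmark output does not say which unit it used.

```python
# Assumption: decimal megabytes. Swap in 1024 * 1024 if the sizes are binary.
MB = 1_000_000

# (label, exclusive upper bound in MB); None means the bucket has no upper bound.
BUCKETS = [
    ("Tiny", 100),
    ("Small", 250),
    ("Medium", 500),
    ("Large", 750),
    ("Giant", None),
]


def size_bucket(size_bytes: int) -> str:
    """Return the bucket label for a CSV file of the given size in bytes."""
    size_mb = size_bytes / MB
    for label, upper_mb in BUCKETS:
        if upper_mb is None or size_mb < upper_mb:
            return label
    raise ValueError("unreachable: the last bucket has no upper bound")


print(size_bucket(24 * MB))   # Tiny
print(size_bucket(813 * MB))  # Giant
```

Each lower bound is inclusive and each upper bound is exclusive, so a file of exactly 250 MB lands in Medium, not Small.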

The question is practical:

Does the advantage of pandas.read_csv(path, engine="pyarrow") become larger as CSV files become larger?

A secondary question emerges from the original test. The shape of that data suggested there were places where the pandas default was better, but the Lahman's dataset didn't have enough small files (and had zero large files) to support any real conclusions.

What the benchmark found

PyArrow won every file in every size bucket.

The relationship was not a simple staircase where each larger bucket produced a larger speedup. The strongest aggregate result came from the medium bucket, where 21 files totaling about 6.8 GB of CSV data were read about 11.47x faster in aggregate with the PyArrow engine.

The giant files were still decisive PyArrow wins, with an aggregate speedup of about 7.04x. Their smaller relative advantage does not mean the PyArrow engine stopped helping. It means that larger files can introduce more variation from allocation, memory pressure, operating-system behavior, and other parts of the end-to-end load.

The result remains clear across the full range: from the 24 MB April 2020 file to the 813 MB March 2019 file, PyArrow was faster in every completed comparison.

Why this is useful

The first report established that PyArrow won all 48 individual CSV reads. This report asks whether those wins have a meaningful relationship to file size.

That matters because parser startup costs can be more visible for small files, while throughput usually matters more once files become large enough.

The findings support a careful conclusion:

File size matters, but it is not the only factor. In this corpus, PyArrow was consistently faster at every tested size, while the size of the advantage varied across the buckets.

How to read the summary table

The bucket table contains one row per size bucket.

The main fields are:

Bucket: the file-size range.
Files: number of completed file comparisons in the bucket.
CSV size: total derived CSV storage across the bucket.
Rows: total rows across the bucket.
PyArrow wins: files where the PyArrow engine was faster.
Median speedup: median per-file pandas-default time divided by PyArrow time.
Aggregate speedup: summed pandas-default median times divided by summed PyArrow median times.

The file table contains one row per monthly Yellow Taxi CSV file and can be searched, filtered by size bucket, and sorted.

A speedup above 1.00x means the PyArrow engine was faster.
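
The difference between the median speedup and the aggregate speedup is easier to see with numbers. Here is a small sketch using the definitions above. The function name and the timings are made up for illustration and are not taken from the benchmark.

```python
from statistics import median


def bucket_speedups(timings):
    """Summarize one bucket.

    timings: list of (default_seconds, pyarrow_seconds) pairs, one per file,
    where each value is that reader's repeated-read median for the file.
    """
    per_file = [default / pyarrow for default, pyarrow in timings]
    median_speedup = median(per_file)
    # Aggregate: total default time divided by total PyArrow time.
    aggregate_speedup = sum(d for d, _ in timings) / sum(p for _, p in timings)
    pyarrow_wins = sum(1 for s in per_file if s > 1.0)
    return median_speedup, aggregate_speedup, pyarrow_wins


# Made-up timings in seconds, not benchmark results.
example = [(2.0, 1.0), (4.0, 1.0), (30.0, 5.0)]
med, agg, wins = bucket_speedups(example)
print(f"median {med:.2f}x, aggregate {agg:.2f}x, wins {wins}")
# median 4.00x, aggregate 5.14x, wins 3
```

The aggregate figure is weighted by how long each file takes, so slow files pull it toward their own ratio. The median treats every file equally.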

Methodology

This report analyzes the completed results from:

output/round2-read-csv/data.json

The source benchmark compared:

pd.read_csv(path)

and:

pd.read_csv(path, engine="pyarrow")

Each original file-level result includes three repeated reads per reader after a separately reported first read. This report uses the original repeated-read median values.
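
For readers who want to try the same shape of measurement, here is a rough sketch of one first read followed by three repeated reads. It is not the original benchmark script. The function name, the result keys, and the use of time.perf_counter are my assumptions.

```python
import time
from statistics import median


def time_reader(read, path, repeats=3):
    """Time one first read of `path`, then `repeats` further reads, using `read`."""

    def timed():
        start = time.perf_counter()
        read(path)
        return time.perf_counter() - start

    first_read = timed()  # reported separately, not part of the median
    repeated = [timed() for _ in range(repeats)]
    return {
        "first_read_s": first_read,
        "repeated_s": repeated,
        "repeated_median_s": median(repeated),
    }


# Hypothetical usage with pandas (the file name is a placeholder):
#   import pandas as pd
#   default = time_reader(pd.read_csv, "some_file.csv")
#   arrow = time_reader(lambda p: pd.read_csv(p, engine="pyarrow"), "some_file.csv")
#   speedup = default["repeated_median_s"] / arrow["repeated_median_s"]
```

Passing the reader in as a callable keeps the timing loop identical for both engines, so the only thing that changes between runs is the parser.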

No CSV files were reopened. No new timing was collected. No values were estimated.

The source benchmark environment was:

Windows 11
Python 3.14.6
pandas 3.0.3
PyArrow 24.0.0
psutil 7.2.2

Important caveats

This is a size-grouped analysis of one local benchmark environment.

It does not establish a universal file-size threshold where PyArrow begins to win. It does not prove that every CSV dataset, operating system, pandas version, or disk configuration will have the same pattern.

Repeated local reads can benefit from operating-system caching. The source benchmark does not claim cold-cache timing.

The size buckets are useful analytical categories, not natural laws. A 250 MB CSV and a 500 MB CSV can have very different column types, value distributions, and parser behavior.

What this test does not prove

This report does not prove that bigger files always create bigger PyArrow wins.

The giant bucket shows exactly why that statement would be too simple. PyArrow remained faster, but its aggregate advantage was smaller than in the medium bucket.

It also does not compare CSV with native Parquet. That is a separate Round 3 question. It only shows that, in this specific 48-file NYC TLC CSV corpus, the PyArrow engine was faster across every tested file-size range.

Strangely, it also shows that I still need more data to find where the pandas default is the best choice. In the first set of CSV read benchmarks, the default was fastest for every file smaller than 30 KB, whereas PyArrow was faster for files larger than 66.7 KB. On my machine, there's a point somewhere in between where PyArrow's overhead becomes worth it. I'd like to do more testing to figure out where.
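
One way to narrow that gap in a future test would be to bracket the crossover from per-file results. This is only a sketch of the idea. The function name is hypothetical and the numbers are invented, not measurements.

```python
def crossover_bracket(results):
    """Bracket the file size where PyArrow starts to win.

    results: iterable of (size_bytes, speedup) pairs, where speedup is the
    pandas-default time divided by the PyArrow time for that file.
    Returns (largest size where the default won, smallest size where PyArrow won).
    Either value is None if that side never won. If wins and losses interleave,
    the first value can exceed the second, which means there is no clean crossover.
    """
    results = list(results)
    default_wins = [size for size, speedup in results if speedup < 1.0]
    pyarrow_wins = [size for size, speedup in results if speedup > 1.0]
    lower = max(default_wins) if default_wins else None
    upper = min(pyarrow_wins) if pyarrow_wins else None
    return lower, upper


# Invented (size_bytes, speedup) pairs, for illustration only.
made_up = [(20_000, 0.7), (40_000, 0.9), (60_000, 1.3), (80_000, 2.1)]
print(crossover_bracket(made_up))  # (40000, 60000)
```

Adding more files with sizes inside the returned range would tighten the bracket on each pass.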

Accessibility notes

This report is built around text, bucket summaries, and tables.

The table is the primary source of the report data. It can be searched, filtered by bucket, and sorted with standard controls. Result counts update after filtering. All outcomes are available in text and raw values, so the report does not depend on colour, charts, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.

What did we learn?

The early Lahman baseball version of this question was constrained because it did not contain genuinely large CSV files.

The NYC TLC corpus fixes that problem. It spans five usable file-size ranges and more than 18 GB of derived CSV data. PyArrow won every bucket and every file.

The useful lesson is not “larger always means faster.” The useful lesson is that once CSV parsing is a meaningful part of a repeated analytical workflow, parser choice is not a small implementation detail. It is a recurring performance decision.

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files.

About This Benchmark

This report reanalyzes completed CSV read results by file size. It does not reopen files or create new benchmark timing.
