NYC TLC Many-File CSV Workloads: pandas Default versus PyArrow

This report is generated from a local JSON file. The table, search, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This report compares full multi-file CSV workflows using pandas default parsing and pandas with the PyArrow CSV engine.

Each workload reads a defined group of NYC TLC Yellow Taxi CSV files, retains the resulting DataFrames, and concatenates them into one table.

NYC TLC multi-file CSV workloads. Times are median total milliseconds from three full read-plus-concat runs.

About this test

This is the many-file workload report for the NYC TLC Round 2 CSV benchmark.

The earlier read-speed report measured one monthly CSV file at a time. That is a useful baseline, but it is not the whole shape of real analytical work.

Many practical tasks require several files to be:

read into memory;
retained as DataFrames;
concatenated into one analysis-ready table.

This test benchmarks that complete workflow with:

# workload_paths is the list of CSV file paths for one workload
import pandas as pd
frames = [pd.read_csv(path) for path in workload_paths]
combined = pd.concat(frames, ignore_index=True)

and:

frames = [
    pd.read_csv(path, engine="pyarrow")
    for path in workload_paths
]
combined = pd.concat(frames, ignore_index=True)

The seven workloads were deliberately chosen to represent different kinds of work:

a three-file low-volume pandemic import;
a six-file recovery-period import;
a full recovery-era year;
a high-volume pre-pandemic quarter;
a complete high-volume 2019 year;
the same calendar month across four years;
and a deliberately uneven mixed-shape folder.

What the benchmark found

PyArrow won all seven many-file workloads.

The median workload speedup was 5.87x and the average was 6.05x.

The smallest workload, covering April through June 2020, loaded and concatenated 1.14 million rows 10.57x faster with the PyArrow engine. The high-volume 2019 annual workload sat at the opposite end of the spectrum: 84.6 million rows across 12 giant CSV files, still a PyArrow win, but only by 2.52x.

That annual result is the important reality check. PyArrow's CSV parsing advantage remains substantial, but as workloads become large, concatenation, allocation, and memory pressure become increasingly important parts of the total elapsed time.

Why this is useful

Single-file benchmarks can make a parser look like the whole story. It is not.

A multi-file workflow adds several costs:

retaining multiple DataFrames at once;
reconciling columns during concatenation;
allocating the final combined DataFrame;
copying or reusing data blocks depending on pandas internals;
and operating close enough to memory limits that variation becomes visible.

This report therefore tests a more realistic question:

Does the PyArrow engine still help after the workflow grows beyond opening one CSV file?

For these seven workloads, the answer was yes. Every workload completed faster with PyArrow, even when the final combine step became a major cost.

How to read the table

The table contains one row per logical workload.

The main fields are:

Workload: the designed scenario and a plain-language description.
Files: number of monthly CSV files loaded.
CSV size: total derived CSV size across the workload.
Rows: expected final row count after concatenation.
pandas default median: median total time for reading all files and concatenating them with pandas default CSV parsing.
PyArrow engine median: median total time for the same workflow with engine="pyarrow".
Speedup: pandas default median divided by PyArrow-engine median.
Winner: the faster reader for that workload.

A speedup above 1.00x means the PyArrow engine was faster.
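The Speedup column is plain arithmetic, and a small worked example may make it easier to check. The sketch below assumes three total-time measurements per reader per workload, as described in this report. The workload names and millisecond values are hypothetical placeholders, not figures from the table.

```python
from statistics import mean, median


def speedup(pandas_runs_ms, pyarrow_runs_ms):
    """Median pandas-default time divided by median PyArrow-engine time.

    A result above 1.0 means the PyArrow engine was faster.
    """
    return median(pandas_runs_ms) / median(pyarrow_runs_ms)


# Hypothetical total times in milliseconds, three runs per reader.
# These are illustrative values, NOT results from this report.
example_runs = {
    "small": {"pandas": [1000.0, 1100.0, 1050.0], "pyarrow": [100.0, 105.0, 110.0]},
    "medium": {"pandas": [4000.0, 4200.0, 4100.0], "pyarrow": [800.0, 820.0, 840.0]},
    "large": {"pandas": [9000.0, 9300.0, 9150.0], "pyarrow": [3000.0, 3100.0, 3050.0]},
}

speedups = {
    name: speedup(runs["pandas"], runs["pyarrow"])
    for name, runs in example_runs.items()
}

# Summaries across workloads, in the same spirit as the report's
# "median workload speedup" and "average" figures.
median_speedup = median(speedups.values())
average_speedup = mean(speedups.values())

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
print(f"median {median_speedup:.2f}x, average {average_speedup:.2f}x")
```

Taking the median of each reader's runs before dividing keeps a single slow run from skewing the ratio, which matters when only three runs are available.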

Methodology

Each workload was run three times with each reader.

Each reader and workload combination ran in its own subprocess, and the order of the two readers alternated between workloads. This reduces the chance that retained DataFrames or one reader's allocator state becomes the next reader's luggage.
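A minimal sketch of that isolation pattern follows. It is not the benchmark's actual harness: the reader labels, workload names, and the stand-in child script are all hypothetical, and "alternating" is assumed here to mean that the reader that goes first flips from one workload to the next.

```python
import subprocess
import sys

# Hypothetical labels for the two readers being compared.
READERS = ("pandas_default", "pyarrow_engine")


def reader_order(workload_index):
    """Flip which reader runs first from one workload to the next."""
    return READERS if workload_index % 2 == 0 else READERS[::-1]


def run_isolated(script, args):
    """Run one reader/workload combination in a fresh Python process."""
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in child script. A real one would read and concatenate the files,
# then print its measurements; this one only echoes its arguments and PID.
CHILD = "import os, sys; print(sys.argv[1], sys.argv[2], os.getpid())"

for index, workload in enumerate(["workload_a", "workload_b"]):
    for reader in reader_order(index):
        print(run_isolated(CHILD, [reader, workload]))
```

Because every combination starts in a new interpreter, nothing allocated by one run is still resident when the next run begins.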

For every timed workload run, the benchmark measured:

read phase time;
concat phase time;
total time;
observed rows and columns;
final DataFrame deep memory usage;
RSS before reading;
RSS after reading;
RSS after concatenation;
approximate sampled peak RSS.
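The timing and shape measurements in that list could be collected roughly as follows. This is a sketch under assumptions, not the benchmark script itself: the function name, the returned keys, and the tiny temporary CSV files are illustrative, and the RSS readings are left out to keep the example short.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def run_workload(paths, engine=None):
    """Time the read and concat phases of one full workload run."""
    kwargs = {} if engine is None else {"engine": engine}

    start = time.perf_counter()
    frames = [pd.read_csv(path, **kwargs) for path in paths]
    read_done = time.perf_counter()
    combined = pd.concat(frames, ignore_index=True)
    concat_done = time.perf_counter()

    return {
        "read_ms": (read_done - start) * 1000,
        "concat_ms": (concat_done - read_done) * 1000,
        "total_ms": (concat_done - start) * 1000,
        "rows": combined.shape[0],
        "columns": combined.shape[1],
        # Deep memory of the final DataFrame only, not of the whole process.
        "deep_memory_bytes": int(combined.memory_usage(deep=True).sum()),
    }


# Tiny stand-in files so the sketch runs anywhere; real workloads would
# point at the monthly NYC TLC CSV files instead.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(2):
        path = Path(folder) / f"part_{i}.csv"
        path.write_text("a,b\n1,2\n3,4\n")
        paths.append(path)

    result = run_workload(paths)  # pandas default parser
    # run_workload(paths, engine="pyarrow") would use the PyArrow engine,
    # assuming the pyarrow package is installed.

print(result)
```

Comparing the observed rows and columns against the expected values for the workload is then a matter of two equality checks on the returned dictionary.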

Each workload's expected row and column counts were checked against the combined result. Every completed workload returned the expected row count and 19 columns.

The public table uses median total time across the three full workflow runs.

The benchmark environment was:

Windows 11
Python 3.14.6
pandas 3.0.3
psutil 7.2.2

Important caveats

These are local results, not universal constants.

The operating system can cache local files, and repeated workload timing can be affected by memory pressure, allocator behavior, and background activity. The annual 2019 workload in particular is a serious resource test, not a tiny toy example wearing a lab coat.

Approximate peak RSS was sampled every 50 milliseconds. A sampled peak can miss brief spikes, so it should not be treated as an exact maximum-memory claim.
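One way to sample a peak like this is a background thread that polls a probe and keeps the largest value it sees. The sketch below is an assumption about how such a sampler could look, not the benchmark's own code. The class and function names are hypothetical, and the default probe assumes psutil is installed.

```python
import threading


def process_rss_bytes():
    """Current resident set size of this process; assumes psutil is installed."""
    import psutil

    return psutil.Process().memory_info().rss


class PeakSampler:
    """Poll a probe on a background thread and keep the largest value seen."""

    def __init__(self, probe=process_rss_bytes, interval_s=0.05):
        self.probe = probe
        self.interval_s = interval_s  # 0.05 s matches the 50 ms interval above
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak = max(self.peak, self.probe())
            # Sleep until the next sample, or wake early if asked to stop.
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self.peak = self.probe()
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        # Take one final reading after the sampling thread has stopped.
        self.peak = max(self.peak, self.probe())
        return False
```

Wrapping the read-plus-concat work in "with PeakSampler() as sampler:" and reading sampler.peak afterwards gives the approximate figure. Anything allocated and released between two samples never reaches the probe, which is why the result is a lower bound rather than a true maximum.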

The benchmark also records DataFrame deep-memory usage. That is useful for describing the final DataFrame, but it is not a substitute for total process memory.

This report does not benchmark:

native Parquet reads;
CSV-to-Parquet conversion;
joins, grouping, cleaning, or visualization;
cloud or network storage;
distributed processing;
or alternative dataframe systems.

What this test does not prove

This does not prove that PyArrow will always be faster for every many-file CSV workflow.

It does not test every schema, every CSV dialect, or every machine. It also does not establish a universal point where concatenation becomes more important than parsing.

It does show something narrow and practical:

In this Windows 11 environment, for seven deliberately designed NYC TLC CSV workloads, pandas using the PyArrow engine completed the full read-plus-concat workflow faster than pandas default CSV parsing every time.

Accessibility notes

This report is built around text, summary cards, examples, and a data table.

The table can be searched and sorted with standard controls. The result count updates after filtering. Every benchmark outcome is available as text and raw values, so the report does not depend on colour, charts, or mouse-only interaction.

Headings, form labels, captions, table headers, and live status messages are provided for screen-reader navigation.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.

What did we learn?

PyArrow's advantage survives contact with a more realistic workflow.

The speedup is smaller than in the individual-file benchmark because the parser is no longer the only major cost. Once several large DataFrames exist at once, combining them can take seconds or tens of seconds. That work affects both readers.

But the result did not flip. The PyArrow engine won the tiny workload, the mixed-shape workload, the cross-year workload, the complete 2021 workload, and the 84.6-million-row 2019 annual workload.

That is the practical takeaway: choosing the faster CSV reader remains valuable even after the task becomes "open a folder of files and make one table." The next Round 2 report examines how those results behave across file-size buckets. Round 3 will ask the larger format question: what changes when the data is read as native Parquet instead of CSV?

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files.

About This Benchmark

This report measures CSV loading and concatenation. It does not measure native Parquet reads, joins, grouping, or visualization.
