NYC TLC CSV Read Speed: pandas Default versus PyArrow

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark compares pandas read_csv(path) against pandas read_csv(path, engine="pyarrow") across 48 derived CSV files from the NYC TLC Yellow Taxi trip-record corpus.

Every file represents one month between January 2019 and December 2022. The goal is simple: measure how parser choice changes the time required to open the same local CSV data.

NYC TLC Yellow Taxi derived CSV benchmark. Times are median milliseconds from three timed reads. The first read was measured separately and is not shown in this table.

About this test

This is the first NYC TLC report in the Round 2 CSV benchmark series.

It compares the same 48 derived CSV files using two pandas calls:

pd.read_csv(path)

and:

pd.read_csv(path, engine="pyarrow")

The corpus contains monthly NYC Yellow Taxi trip-record files from January 2019 through December 2022.

Across those 48 files, the benchmark covers:

179,807,942 rows
18.2 GB of derived CSV data
2.5 GB of original Parquet source data
CSV files ranging from roughly 24 MB to 813 MB

The central question is practical:

When the CSV corpus is large and repeatedly opened for local analysis, how much does the parser matter?

What the benchmark found

PyArrow won every one of the 48 file-level comparisons.

The median individual-file speedup was 10.44x. When the per-file median read times are summed across the full corpus, pandas using the PyArrow engine took about 32.5 seconds, compared with about 288.6 seconds for pandas' default CSV reader. That is an overall summed-median speedup of 8.89x.
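The two headline numbers answer different questions, and a small sketch may make the difference concrete. The timings below are invented for illustration and are not benchmark data; only the two formulas follow the description above.

```python
from statistics import median

# Hypothetical per-file median read times in milliseconds (NOT benchmark data).
pandas_default_ms = [1200.0, 4800.0, 9000.0]
pyarrow_engine_ms = [100.0, 480.0, 1500.0]

# Median individual-file speedup: one ratio per file, then the median of those ratios.
per_file_speedups = [d / p for d, p in zip(pandas_default_ms, pyarrow_engine_ms)]
median_file_speedup = median(per_file_speedups)

# Summed-median speedup: add up each reader's per-file medians, then divide once.
summed_speedup = sum(pandas_default_ms) / sum(pyarrow_engine_ms)

print(f"median per-file speedup: {median_file_speedup:.2f}x")
print(f"summed-median speedup:   {summed_speedup:.2f}x")
```

In this made-up example the slowest file has the smallest ratio, so it pulls the summed figure below the median of the per-file ratios. Something similar is a plausible reason the corpus-wide 8.89x sits below the 10.44x per-file median here.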

That does not mean every file showed the same size of advantage. The largest measured PyArrow win was November 2021, where the PyArrow engine read the file about 13.48x faster. The largest CSV file, March 2019, was still a clear PyArrow win at about 7.17x faster.

The result is unusually clean for me. I didn't have any reader failures throughout the testing, though I likely use pandas more than any other tool, so that part may just be muscle memory. But the data was also unusually clean: no ties within one percent, and no pandas-default wins in the completed 48-file run. The pandas default reader and the Toronto Maple Leafs have a lot in common in this dataset.

Why this is useful

CSV remains useful because it is easy to inspect, exchange, version, and open with almost any tool. But CSV has to be parsed as text every time it is read.

That creates recurring work:

reading the bytes from disk,
splitting text into fields,
interpreting values,
inferring types,
allocating a DataFrame,
and carrying any parser-specific overhead.

This test does not argue that CSV should disappear. It shows that, where CSV is already required or already present, the pandas reader selection can materially change the time needed to get from file to usable DataFrame. That matters in any environment where load time is a real constraint.

It's also important to put this data into perspective. During the Lahman's Baseball Database round, none of the files were large enough for the choice of reader to make a truly noticeable difference. This time the gap is large enough to feel: about 288.6 seconds versus about 32.5 seconds when the per-file medians are summed across the corpus.

How to read the table

The table contains one row per monthly Yellow Taxi CSV file.

The main fields are:

File: benchmark input identifier, plus the monthly period and taxi type.
CSV size: size of the derived CSV file.
Rows: expected source rows for the file.
Columns: expected source columns.
pandas default median: median time for repeated pandas.read_csv(path) reads.
PyArrow engine median: median time for repeated pandas.read_csv(path, engine="pyarrow") reads.
Speedup: pandas default median divided by PyArrow-engine median.
Winner: the faster reader for that file.

A speedup above 1.00x means the PyArrow engine was faster. A speedup below 1.00x would mean pandas default was faster.
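As a minimal sketch, the Speedup and Winner columns could be derived like this. The function name, the winner labels, and the reading of "tie" as a speedup within one percent of 1.00x are my assumptions for illustration, not the report's actual generator code.

```python
def classify(default_median_ms, pyarrow_median_ms, tie_band=0.01):
    """Return (speedup, winner) for one file.

    speedup = pandas default median / PyArrow-engine median.
    tie_band is an assumed tolerance: ratios within 1% of 1.00x count as a tie.
    """
    speedup = default_median_ms / pyarrow_median_ms
    if abs(speedup - 1.0) <= tie_band:
        winner = "tie"
    elif speedup > 1.0:
        winner = "PyArrow engine"
    else:
        winner = "pandas default"
    return speedup, winner


# Illustrative inputs only, not benchmark data.
print(classify(1000.0, 100.0))
print(classify(100.0, 200.0))
```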

The table can be searched, filtered by year and winner, and sorted by every major field.

Methodology

The benchmark used the full 48-file derived CSV corpus, covering monthly NYC Yellow Taxi trip-record files from January 2019 through December 2022.

Each file was benchmarked with both readers in isolated subprocesses:

pd.read_csv(path)

pd.read_csv(path, engine="pyarrow")

For each reader and file:

the first read was measured separately;
three repeated reads were measured;
median, average, minimum, maximum, and standard deviation were recorded;
expected rows and columns were checked against observed rows and columns;
raw byte-read timing was recorded for I/O context only;
post-read process Resident Set Size (RSS) and pandas deep DataFrame memory were recorded.
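The timing steps above can be sketched roughly as follows. This is not the benchmark's actual script: the function name, the dictionary keys, and the use of the sample standard deviation are assumptions, and the sketch leaves out the subprocess isolation, so a real run would call it from a fresh Python process per reader and file.

```python
import statistics
import time


def time_reads(read_fn, path, repeats=3):
    """Time one first read, then `repeats` further reads of the same path."""

    def timed_ms():
        start = time.perf_counter()
        result = read_fn(path)
        return (time.perf_counter() - start) * 1000.0, result

    # The first read is kept apart from the repeated reads.
    first_ms, frame = timed_ms()

    repeat_ms = []
    for _ in range(repeats):
        elapsed, frame = timed_ms()
        repeat_ms.append(elapsed)

    return {
        "first_ms": first_ms,
        "median_ms": statistics.median(repeat_ms),
        "mean_ms": statistics.mean(repeat_ms),
        "min_ms": min(repeat_ms),
        "max_ms": max(repeat_ms),
        "stdev_ms": statistics.stdev(repeat_ms),  # sample stdev; needs repeats >= 2
        # (rows, columns) if the reader returned a DataFrame, else None.
        "shape": getattr(frame, "shape", None),
    }


# Possible use with pandas and pyarrow installed (the file name is a placeholder):
#   import pandas as pd
#   default = time_reads(pd.read_csv, "some_month.csv")
#   arrow = time_reads(lambda p: pd.read_csv(p, engine="pyarrow"), "some_month.csv")
#   assert default["shape"] == arrow["shape"]  # compare against expected rows/columns
```

Passing the reader in as a function keeps the timing loop identical for both engines, so any difference in the numbers comes from the reader rather than the harness.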

The public table uses the repeated-read median. It leaves out the first read, which is reported separately in the source data.

There is one significant change in methodology between this report and the first report, which used Lahman's Baseball Database. In the Lahman's testing, I read each file 100 times. Here, I read each New York City taxi file only three times after the first read.

I made that decision because of the statistical concept of Too Fricking Long (TFL), in which opening 18.2 GB of data 100 times would take too fricking long.

The benchmark environment was:

Windows 11
Python 3.14.6
pandas 3.0.3
PyArrow 24.0.0
psutil 7.2.2

Important caveats

These results are local benchmark results, not universal constants.

Operating-system file caching can affect repeated local reads. The benchmark does not claim to measure cold-cache performance, network storage, cloud object storage, or every possible machine configuration.

Raw byte-read timing is included in the generated data as context. It is not subtracted from parser timing. Reading bytes and parsing CSV are related but different parts of the workload.
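A raw byte-read timing of the kind described here might look like the sketch below. It only reads the file into memory and does no parsing; the function name is hypothetical, and the demo runs on a tiny throwaway file rather than a taxi CSV.

```python
import os
import tempfile
import time


def time_raw_read(path):
    """Return (milliseconds, byte count) for reading the whole file as bytes."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, len(data)


# Self-contained demo on a throwaway two-line CSV.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    demo_path = tmp.name

demo_ms, demo_bytes = time_raw_read(demo_path)
os.remove(demo_path)
print(f"read {demo_bytes} bytes in {demo_ms:.3f} ms")
```

Keeping this number next to the parser timing, rather than subtracting it, avoids pretending that the two readers spend exactly that long on I/O.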

The reported Resident Set Size (RSS) values are post-read snapshots, not true peak-memory measurements. Pandas deep-memory reporting also describes the resulting DataFrame, not the entire Python process.
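For context, the two memory figures can be collected along these lines. This is a sketch that assumes pandas and psutil are installed; the small DataFrame stands in for the result of a real read, and the variable names are mine.

```python
import pandas as pd
import psutil

# Tiny stand-in frame; in the benchmark this would be the result of pd.read_csv.
frame = pd.DataFrame({"vendor": ["a", "b", "c"], "fare": [7.5, 12.0, 3.25]})

# Snapshot of the whole Python process at this moment, in bytes.
# It is taken after the read, so it can miss a higher peak during parsing.
rss_bytes = psutil.Process().memory_info().rss

# Bytes attributed to the DataFrame itself. deep=True asks pandas to include
# the contents of object-typed columns, such as Python strings.
deep_bytes = int(frame.memory_usage(deep=True).sum())

print(f"process RSS:      {rss_bytes:,} bytes")
print(f"DataFrame (deep): {deep_bytes:,} bytes")
```

The two numbers are not expected to match: RSS also covers the interpreter, imported libraries, and any memory the process has not yet returned to the operating system.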

This report measures file loading only. It does not benchmark:

joins,
group-by operations,
cleaning,
visualization,
writing output files,
native Parquet reads,
or distributed workloads.

Those are separate questions and should remain separate tests. They are also the messiest parts of data work, so there is a very high chance that once that work is part of a real workflow, the difference between these readers will again seem trivial.

What this test does not prove

This benchmark does not prove that PyArrow always wins in every CSV workload. The first test I performed showed the opposite for very small files: pandas default was significantly faster there.

It does not test tiny configuration files, every CSV dialect, custom type settings, every operating system, or every pandas release. It also does not show whether native Parquet will be faster, because that is a different format comparison planned for Round 3.

It shows something narrower and more useful to me:

In this Windows 11 environment, with this four-year NYC TLC Yellow Taxi CSV corpus, pandas using the PyArrow CSV engine was faster than pandas default parsing for every completed file-level comparison.

That is a very strong result for this specific workload.

Accessibility notes

This report is built around text, summary cards, examples, and a fully inspectable table.

The table is the primary source of report data. It can be searched, filtered, and sorted with standard form controls and buttons. The result count updates after filtering. Headings, captions, labels, table headers, and status messages are provided so screen-reader users can navigate the report without depending on charts, color, or mouse-only interaction.

No benchmark result is communicated by color alone. The winner is always named in text and the raw timing values remain available in the table.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps me get better.

What did we learn?

The obvious conclusion is that the PyArrow CSV engine was faster. The more interesting conclusion is how consistently it was faster. The first report on Lahman's Baseball Database was nowhere near this clean and, again, file size looks like the main reason.

The corpus includes pandemic-collapse months, recovery months, medium-sized files, large files, and giant files. It includes files ranging from hundreds of thousands of rows to almost eight million rows. Yet the completed result stayed one-sided: PyArrow won every file.

That makes parser selection a serious practical concern for repeated local CSV analysis. Saving a few seconds on one file can feel modest. Saving minutes over a large corpus, or over repeated investigative work, changes the rhythm of the entire workflow.

Round 2 continues with a file-size analysis and deliberate many-file workloads. Round 3 will compare the completed CSV results with native Parquet reads, where the question becomes larger than parser choice: what changes when the working format is not CSV in the first place?

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page measures CSV read time. It does not measure joins, aggregation, data cleaning, visualization, or native Parquet reads.