About this test
This is the first NYC TLC report in the Round 2 CSV benchmark series.
It compares the same 48 derived CSV files using two pandas calls:
pd.read_csv(path)
and:
pd.read_csv(path, engine="pyarrow")
The corpus contains monthly NYC Yellow Taxi trip-record files from January 2019 through December 2022.
Across those 48 files, the benchmark covers:
- 179,807,942 rows
- 18.2 GB of derived CSV data
- 2.5 GB of original Parquet source data
- CSV files ranging from roughly 24 MB to 813 MB
The central question is practical:
When the CSV corpus is large and repeatedly opened for local analysis, how much does the parser matter?
What the benchmark found
PyArrow won every one of the 48 file-level comparisons.
The median individual-file speedup was 10.44x. When the per-file median read times are summed across the full corpus, pandas using the PyArrow engine took about 32.5 seconds, compared with about 288.6 seconds for pandas' default CSV reader. That is an overall summed-median speedup of 8.89x.
The size of the advantage varied from file to file. The largest measured PyArrow win was November 2021, where the PyArrow engine read the file about 13.48x faster. The largest CSV file, March 2019, was still a clear PyArrow win at about 7.17x faster.
The result is unusually clean: no reader failures, no ties within one percent, and no pandas-default wins in the completed 48-file run.
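The median per-file speedup and the summed-median speedup reported above are different statistics, which is why they do not match. A minimal sketch, using made-up timings rather than benchmark data (every number and variable name below is hypothetical), shows how each one is computed:

```python
from statistics import median

# Hypothetical per-file median read times in seconds, for illustration only.
default_medians = [2.0, 6.0, 12.0]
pyarrow_medians = [0.2, 0.5, 1.5]

# Statistic 1: compute a speedup per file, then take the median of those ratios.
per_file_speedups = [d / p for d, p in zip(default_medians, pyarrow_medians)]
median_speedup = median(per_file_speedups)

# Statistic 2: sum each reader's medians first, then divide the two totals.
# Larger files carry more weight here because they contribute more seconds.
summed_speedup = sum(default_medians) / sum(pyarrow_medians)

print(f"median per-file speedup: {median_speedup:.2f}x")
print(f"summed-median speedup:   {summed_speedup:.2f}x")
```

Dividing the rounded totals quoted above, 288.6 by 32.5, gives about 8.88, so the reported 8.89x presumably comes from the unrounded totals.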
Why this is useful
CSV remains useful because it is easy to inspect, exchange, version, and open with almost any tool. But CSV has to be parsed as text every time it is read.
That creates recurring work:
- reading the bytes from disk,
- splitting text into fields,
- interpreting values,
- inferring types,
- allocating a DataFrame,
- and carrying any parser-specific overhead.
This test does not argue that CSV should disappear. It shows that, where CSV is already required or already present, the choice of pandas reader can materially change the time needed to get from file to usable DataFrame.
How to read the table
The table contains one row per monthly Yellow Taxi CSV file.
The main fields are:
- File: benchmark input identifier, plus the monthly period and taxi type.
- CSV size: size of the derived CSV file.
- Rows: expected source rows for the file.
- Columns: expected source columns.
- pandas default median: median time for repeated pandas.read_csv(path) reads.
- PyArrow engine median: median time for repeated pandas.read_csv(path, engine="pyarrow") reads.
- Speedup: pandas default median divided by PyArrow-engine median.
- Winner: the faster reader for that file.
A speedup above 1.00x means the PyArrow engine was faster. A speedup below 1.00x would mean pandas default was faster.
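The Speedup and Winner fields can be derived from the two medians as in the sketch below. The function name, the winner labels, and the one-percent tie rule (a speedup within 0.01 of 1.00) are assumptions made for illustration; the report does not state exactly how ties are defined.

```python
def compare_readers(default_median, pyarrow_median, tie_tolerance=0.01):
    """Return (speedup, winner) for one file's two median read times in seconds."""
    if default_median <= 0 or pyarrow_median <= 0:
        raise ValueError("median read times must be positive")

    # Speedup: pandas default median divided by PyArrow-engine median.
    speedup = default_median / pyarrow_median

    # Assumed tie rule: treat speedups within the tolerance of 1.00x as a tie.
    if abs(speedup - 1.0) <= tie_tolerance:
        winner = "tie"
    elif speedup > 1.0:
        winner = "PyArrow engine"
    else:
        winner = "pandas default"
    return speedup, winner
```

For example, compare_readers(10.0, 1.0) reports a 10.0x speedup with the PyArrow engine as the winner, while swapping the arguments gives a speedup of 0.1x and names pandas default.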
The table can be searched, filtered by year and winner, and sorted by every major field.
Methodology
The benchmark used the full 48-file derived CSV corpus.
Each file was benchmarked with both readers in isolated subprocesses:
pd.read_csv(path)
pd.read_csv(path, engine="pyarrow")
For each reader and file:
- the first read was measured separately;
- three repeated reads were measured;
- median, average, minimum, maximum, and standard deviation were recorded;
- expected rows and columns were checked against observed rows and columns;
- raw byte-read timing was recorded for I/O context only;
- post-read process RSS and pandas deep DataFrame memory were recorded.
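The timing steps in the list above can be sketched roughly as follows. This is not the benchmark's actual harness: the function names are hypothetical, subprocess isolation and memory recording are left out, and the use of the sample standard deviation is an assumption.

```python
import statistics
import time


def time_reads(read_fn, path, repeats=3):
    """Time one first read, then `repeats` further reads, all in seconds."""
    def timed_read():
        start = time.perf_counter()
        frame = read_fn(path)
        return time.perf_counter() - start, frame

    # The first read is kept apart from the repeated reads.
    first_seconds, frame = timed_read()
    repeated_seconds = []
    for _ in range(repeats):
        seconds, frame = timed_read()
        repeated_seconds.append(seconds)
    return first_seconds, repeated_seconds, frame


def summarize(times):
    """Summary statistics over the repeated reads."""
    return {
        "median": statistics.median(times),
        "average": statistics.mean(times),
        "minimum": min(times),
        "maximum": max(times),
        # Assumption: sample standard deviation; undefined for a single value.
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


def shape_matches(frame, expected_rows, expected_columns):
    """Check the observed (rows, columns) shape against the expected one."""
    return tuple(frame.shape) == (expected_rows, expected_columns)
```

In this sketch, read_fn would be pd.read_csv for the default reader, or functools.partial(pd.read_csv, engine="pyarrow") for the PyArrow engine, with each reader and file run in its own subprocess.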
The public table uses the repeated-read median. It does not include the first read because the first read is intentionally reported separately in the source data.
The benchmark environment was:
- Windows 11
- Python 3.14.6
- pandas 3.0.3
- PyArrow 24.0.0
- psutil 7.2.2
Important caveats
These results are local benchmark results, not universal constants.
Operating-system file caching can affect repeated local reads. The benchmark does not claim to measure cold-cache performance, network storage, cloud object storage, or every possible machine configuration.
Raw byte-read timing is included in the generated data as context. It is not subtracted from parser timing. Reading bytes and parsing CSV are related but different parts of the workload.
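A raw byte-read timing of this kind might look like the sketch below, which reads the whole file into memory without parsing it. The function name is hypothetical, and the benchmark's own measurement may differ, for example by reading in chunks.

```python
import time
from pathlib import Path


def time_raw_read(path):
    """Read a file's bytes without parsing; return (seconds, byte_count)."""
    start = time.perf_counter()
    data = Path(path).read_bytes()
    seconds = time.perf_counter() - start
    return seconds, len(data)
```

The result gives a rough floor for the I/O part of a read. Because the parser may overlap reading and parsing, subtracting it from the parser timing would not isolate the parsing cost.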
The reported RSS values are post-read snapshots, not true peak-memory measurements. Pandas deep-memory reporting also describes the resulting DataFrame, not the entire Python process.
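The gap between a post-read snapshot and a true peak can be illustrated with the standard library's tracemalloc module. This is a stand-in for the idea, not the benchmark's measurement: tracemalloc tracks Python-level allocations rather than process RSS, and the function names and buffer sizes below are hypothetical.

```python
import tracemalloc


def snapshot_versus_peak(work):
    """Run `work` and return (result, bytes held afterwards, peak bytes)."""
    tracemalloc.start()
    try:
        result = work()
        # `current` is what is still allocated now; `peak` is the high-water mark.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def load_with_temporary_buffer():
    """Simulate a loader that needs scratch space it later releases."""
    scratch = bytearray(8_000_000)      # temporary working memory
    kept = bytes(scratch[:1_000_000])   # the part that survives the load
    return kept


result, current, peak = snapshot_versus_peak(load_with_temporary_buffer)
print(f"held after load: {current:,} bytes; peak during load: {peak:,} bytes")
```

The snapshot taken after the load misses the scratch buffer entirely, which is the same reason a post-read RSS value can understate what a reader needed while it was running.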
This report measures file loading only. It does not benchmark:
- joins,
- group-by operations,
- cleaning,
- visualization,
- writing output files,
- native Parquet reads,
- or distributed workloads.
Those are separate questions and should remain separate tests.
What this test does not prove
This benchmark does not prove that PyArrow always wins in every CSV workload.
It does not test tiny configuration files, every CSV dialect, custom type settings, every operating system, or every pandas release. It also does not show whether native Parquet will be faster, because that is a different format comparison planned for Round 3.
It does show something narrower and useful:
In this Windows 11 environment, with this four-year NYC TLC Yellow Taxi CSV corpus, pandas using the PyArrow CSV engine was faster than pandas default parsing for every completed file-level comparison.
That is a very strong result for this specific workload.
Accessibility notes
This report is built around text, summary cards, examples, and a fully inspectable table.
The table is the primary source of report data. It can be searched, filtered, and sorted with standard form controls and buttons. The result count updates after filtering. Headings, captions, labels, table headers, and status messages are provided so screen-reader users can navigate the report without depending on charts, color, or mouse-only interaction.
No benchmark result is communicated by color alone. The winner is always named in text and the raw timing values remain available in the table.
If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.
What did we learn?
The obvious conclusion is that the PyArrow CSV engine was faster. The more interesting conclusion is how consistently it was faster.
The corpus includes pandemic-collapse months, recovery months, medium-sized files, large files, and giant files. It includes files ranging from hundreds of thousands of rows to almost eight million rows. Yet the completed result stayed one-sided: PyArrow won every file.
That makes parser selection a serious practical concern for repeated local CSV analysis. Saving a few seconds on one file can feel modest. Saving minutes over a large corpus, or over repeated investigative work, changes the rhythm of the entire workflow.
Round 2 continues with a file-size analysis and deliberate many-file workloads. Round 3 will compare the completed CSV results with native Parquet reads, where the question becomes larger than parser choice: what changes when the working format is not CSV in the first place?