This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark compares pandas read_csv(path) against pandas read_csv(path, engine="pyarrow") across 48 derived CSV files from the NYC TLC Yellow Taxi trip-record corpus.

Every file represents one month between January 2019 and December 2022. The goal is simple: measure how parser choice changes the time required to open the same local CSV data.

- NYC TLC CSV Read Benchmark -

The same 48 CSV files were read with pandas default parsing and pandas using the PyArrow CSV engine.

  • CSV files compared: 48
  • Total rows: 179,807,942
  • Derived CSV corpus: 18.2 GB
  • PyArrow wins: 48 of 48
  • Median file speedup: 10.44x
  • Overall summed-median speedup: 8.89x

PyArrow won every file in this benchmark. The useful practical lesson is not merely that one parser is faster, but that parser choice becomes a recurring cost when a large CSV corpus is read repeatedly.

- Two Useful Reference Points -

These examples are generated from the same data.json file as the full table.

- Full CSV Read Table -

Search by file, period, taxi type, or winner. Filter by year and winning reader. Sort the table to compare file size, row count, median read time, and speedup.

NYC TLC Yellow Taxi derived CSV benchmark. Times are median milliseconds from three timed reads. The first read was measured separately and is not shown in this table.

Download the data

This page is generated from data.json. You can inspect the full benchmark output without scraping the page.

Download this report's data.json
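
For readers who want to work with the download offline, here is a minimal sketch of loading and slicing it in Python. The schema of data.json is not documented on this page, so the key names used below ("files", "period", "speedup") are assumptions for illustration only. Check the real file and adjust them.

```python
import json
from pathlib import Path


def load_records(path):
    """Load the report JSON and return its list of per-file records.

    ASSUMPTION: the records live under a top-level "files" key.
    """
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    return data["files"]


def filter_by_year(records, year):
    """Keep records whose "period" starts with the given year.

    ASSUMPTION: "period" is a string such as "2019-03".
    """
    prefix = f"{year}-"
    return [r for r in records if str(r["period"]).startswith(prefix)]


def sort_by_speedup(records, descending=True):
    """Sort records by the assumed numeric "speedup" field."""
    return sorted(records, key=lambda r: r["speedup"], reverse=descending)
```

With those assumptions, sort_by_speedup(filter_by_year(load_records("data.json"), 2021)) would list the 2021 files from largest to smallest speedup, mirroring the year filter and sort controls on the table.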

About this test

This is the first NYC TLC report in the Round 2 CSV benchmark series.

It compares the same 48 derived CSV files using two pandas calls:

pd.read_csv(path)

and:

pd.read_csv(path, engine="pyarrow")

The corpus contains monthly NYC Yellow Taxi trip-record files from January 2019 through December 2022.

Across those 48 files, the benchmark covers:

  • 179,807,942 rows
  • 18.2 GB of derived CSV data
  • 2.5 GB of original Parquet source data
  • CSV files ranging from roughly 24 MB to 813 MB

The central question is practical:

When the CSV corpus is large and repeatedly opened for local analysis, how much does the parser matter?

What the benchmark found

PyArrow won every one of the 48 file-level comparisons.

The median individual-file speedup was 10.44x. When the per-file median read times are summed across the full corpus, pandas using the PyArrow engine took about 32.5 seconds, compared with about 288.6 seconds for pandas' default CSV reader. That is an overall summed-median speedup of 8.89x.
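
The two headline numbers answer different questions, and a short sketch may make the difference concrete. The median file speedup treats every file equally, while the summed-median speedup divides corpus totals and therefore weights large files more heavily. The function names are illustrative, not taken from the benchmark scripts, and the two-file input in the comments is made up purely to show the effect.

```python
from statistics import median


def median_file_speedup(default_ms, pyarrow_ms):
    """Median of the per-file ratios (default time / PyArrow time)."""
    ratios = [d / p for d, p in zip(default_ms, pyarrow_ms)]
    return median(ratios)


def summed_median_speedup(default_ms, pyarrow_ms):
    """Ratio of corpus totals: summed default times / summed PyArrow times."""
    return sum(default_ms) / sum(pyarrow_ms)


# Made-up illustration: a small file with a 10x ratio and a large file with
# a 5x ratio. The per-file median is 7.5, but the summed ratio is pulled
# toward the large file (1100 / 210, roughly 5.24).
toy_default = [100.0, 1000.0]
toy_pyarrow = [10.0, 200.0]
print(median_file_speedup(toy_default, toy_pyarrow))
print(summed_median_speedup(toy_default, toy_pyarrow))

# Rounded corpus totals quoted in this report, in seconds.
print(round(288.6 / 32.5, 2))  # 8.88 from the rounded totals
```

Dividing the rounded totals gives about 8.88x. The reported 8.89x presumably comes from the unrounded timings in data.json, so a small difference in the last digit is expected.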

The size of the advantage varied from file to file. The largest measured PyArrow win was November 2021, where the PyArrow engine read the file about 13.48x faster. The largest CSV file, March 2019, was still a clear PyArrow win at about 7.17x faster.

The result is unusually clean: no reader failures, no ties within one percent, and no pandas-default wins in the completed 48-file run.

Why this is useful

CSV remains useful because it is easy to inspect, exchange, version, and open with almost any tool. But CSV has to be parsed as text every time it is read.

That creates recurring work:

  • reading the bytes from disk,
  • splitting text into fields,
  • interpreting values,
  • inferring types,
  • allocating a DataFrame,
  • and carrying any parser-specific overhead.

This test does not argue that CSV should disappear. It shows that, where CSV is already required or already present, the pandas reader selection can materially change the time needed to get from file to usable DataFrame.

How to read the table

The table contains one row per monthly Yellow Taxi CSV file.

The main fields are:

  • File: benchmark input identifier, plus the monthly period and taxi type.
  • CSV size: size of the derived CSV file.
  • Rows: expected source rows for the file.
  • Columns: expected source columns.
  • pandas default median: median time for repeated pandas.read_csv(path) reads.
  • PyArrow engine median: median time for repeated pandas.read_csv(path, engine="pyarrow") reads.
  • Speedup: pandas default median divided by PyArrow-engine median.
  • Winner: the faster reader for that file.

A speedup above 1.00x means the PyArrow engine was faster. A speedup below 1.00x would mean pandas default was faster.
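
As a rough sketch, the Speedup and Winner columns can be derived from the two medians like this. The report mentions ties within one percent, but the exact tie rule is not stated on this page, so the tolerance handling below is an assumption, as are the function names and the label strings.

```python
def speedup(default_median_ms, pyarrow_median_ms):
    """pandas default median divided by PyArrow-engine median."""
    if pyarrow_median_ms <= 0:
        raise ValueError("PyArrow median must be positive")
    return default_median_ms / pyarrow_median_ms


def winner(default_median_ms, pyarrow_median_ms, tie_tolerance=0.01):
    """Name the faster reader for one file.

    ASSUMPTION: a ratio within 1% of 1.00x counts as a tie.
    """
    ratio = speedup(default_median_ms, pyarrow_median_ms)
    if abs(ratio - 1.0) <= tie_tolerance:
        return "tie"
    return "pyarrow" if ratio > 1.0 else "pandas default"
```

For example, a file read in 7,170 ms by the default reader and 1,000 ms by the PyArrow engine has a speedup of 7.17x and a "pyarrow" winner.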

The table can be searched, filtered by year and winner, and sorted by every major field.

Methodology

The benchmark used the full 48-file derived CSV corpus.

Each file was benchmarked with both readers in isolated subprocesses:

pd.read_csv(path)
pd.read_csv(path, engine="pyarrow")

For each reader and file:

  • the first read was measured separately;
  • three repeated reads were measured;
  • median, average, minimum, maximum, and standard deviation were recorded;
  • expected rows and columns were checked against observed rows and columns;
  • raw byte-read timing was recorded for I/O context only;
  • post-read process RSS and pandas deep DataFrame memory were recorded.

The public table uses the repeated-read median. It does not include the first read because the first read is intentionally reported separately in the source data.
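
The per-reader procedure above can be sketched roughly as follows. This is not the benchmark's actual script: the function names and result keys are hypothetical, the subprocess isolation is omitted for brevity, and the use of the sample standard deviation is an assumption because the page does not say which variant was recorded.

```python
import statistics
import time


def time_once_ms(read_fn):
    """Run read_fn once and return (elapsed milliseconds, result)."""
    start = time.perf_counter()
    result = read_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, result


def benchmark_reader(read_fn, repeats=3, expected_shape=None):
    """Time one first read, then `repeats` further reads, and summarize."""
    first_ms, frame = time_once_ms(read_fn)

    # Check observed rows and columns against the expected values.
    if expected_shape is not None and tuple(frame.shape) != tuple(expected_shape):
        raise ValueError(f"expected {expected_shape}, observed {frame.shape}")
    del frame  # release the result before the repeated reads

    repeat_ms = []
    for _ in range(repeats):
        elapsed_ms, frame = time_once_ms(read_fn)
        repeat_ms.append(elapsed_ms)
        del frame

    return {
        "first_ms": first_ms,  # reported separately, not part of the median
        "median_ms": statistics.median(repeat_ms),
        "mean_ms": statistics.fmean(repeat_ms),
        "min_ms": min(repeat_ms),
        "max_ms": max(repeat_ms),
        # ASSUMPTION: sample standard deviation; 0.0 if only one repeat.
        "stdev_ms": statistics.stdev(repeat_ms) if len(repeat_ms) > 1 else 0.0,
    }
```

With pandas imported as pd, the PyArrow case would be called as benchmark_reader(lambda: pd.read_csv(path, engine="pyarrow")). Taking a callable keeps the timing logic identical for both readers, so the only thing that changes between runs is the reader itself.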

The benchmark environment was:

  • Windows 11
  • Python 3.14.6
  • pandas 3.0.3
  • PyArrow 24.0.0
  • psutil 7.2.2

Important caveats

These results are local benchmark results, not universal constants.

Operating-system file caching can affect repeated local reads. The benchmark does not claim to measure cold-cache performance, network storage, cloud object storage, or every possible machine configuration.

Raw byte-read timing is included in the generated data as context. It is not subtracted from parser timing. Reading bytes and parsing CSV are related but different parts of the workload.
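
For context, a raw byte-read measurement can be as simple as the sketch below, which reads the file in binary chunks and does no parsing at all. The function name and the 1 MiB chunk size are assumptions, not details taken from the benchmark scripts.

```python
import time
from pathlib import Path


def time_raw_read_ms(path, chunk_size=1024 * 1024):
    """Read a file as raw bytes and return (elapsed milliseconds, byte count).

    No text decoding or CSV parsing happens here, so the result is only an
    I/O reference point, not a parser timing.
    """
    total_bytes = 0
    start = time.perf_counter()
    with Path(path).open("rb") as fh:
        while chunk := fh.read(chunk_size):
            total_bytes += len(chunk)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, total_bytes
```

On repeated runs this number mostly reflects the operating-system file cache, which is one reason it is reported alongside the parser timings rather than subtracted from them.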

The reported RSS values are post-read snapshots, not true peak-memory measurements. Pandas deep-memory reporting also describes the resulting DataFrame, not the entire Python process.

This report measures file loading only. It does not benchmark:

  • joins,
  • group-by operations,
  • cleaning,
  • visualization,
  • writing output files,
  • native Parquet reads,
  • or distributed workloads.

Those are separate questions and should remain separate tests.

What this test does not prove

This benchmark does not prove that PyArrow always wins in every CSV workload.

It does not test tiny configuration files, every CSV dialect, custom type settings, every operating system, or every pandas release. It also does not show whether native Parquet will be faster, because that is a different format comparison planned for Round 3.

It does show something narrower and useful:

In this Windows 11 environment, with this four-year NYC TLC Yellow Taxi CSV corpus, pandas using the PyArrow CSV engine was faster than pandas default parsing for every completed file-level comparison.

That is a very strong result for this specific workload.

Accessibility notes

This report is built around text, summary cards, examples, and a fully inspectable table.

The table is the primary source of report data. It can be searched, filtered, and sorted with standard form controls and buttons. The result count updates after filtering. Headings, captions, labels, table headers, and status messages are provided so screen-reader users can navigate the report without depending on charts, color, or mouse-only interaction.

No benchmark result is communicated by color alone. The winner is always named in text and the raw timing values remain available in the table.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and feedback helps improve this site.

What did we learn?

The obvious conclusion is that the PyArrow CSV engine was faster. The more interesting conclusion is how consistently it was faster.

The corpus includes pandemic-collapse months, recovery months, medium-sized files, large files, and giant files. It includes files ranging from hundreds of thousands of rows to almost eight million rows. Yet the completed result stayed one-sided: PyArrow won every file.

That makes parser selection a serious practical concern for repeated local CSV analysis. Saving a few seconds on one file can feel modest. Saving minutes over a large corpus, or over repeated investigative work, changes the rhythm of the entire workflow.

Round 2 continues with a file-size analysis and deliberate many-file workloads. Round 3 will compare the completed CSV results with native Parquet reads, where the question becomes larger than parser choice: what changes when the working format is not CSV in the first place?

Related Links

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page measures CSV read time. It does not measure joins, aggregation, data cleaning, visualization, or native Parquet reads.