NumPy versus PyArrow Performance Data Reports

One question, different datasets

This project began with a narrow practical question: when does pandas.read_csv(path, engine="pyarrow") materially change CSV loading performance compared with pandas.read_csv(path)?

The answer depends on the data. A small baseball database and a four-year taxi-trip corpus create very different workloads, file sizes, memory profiles, and opportunities for parser throughput to matter.

The goal is not to make a universal claim from one machine. The goal is to build a growing collection of inspectable, repeatable reports that show where results hold, where they vary, and what the data does not prove. My personal goal is to squeeze as much performance as possible out of my own scripts so that I can better run my fantasy team and help run my league. If the data is useful to you, that's great, but remember that I ran these reports on my own machine and yours may show different results. If the raw numbers are important to you, I recommend downloading the raw data (it maps out all the sources) and running the report on your own hardware.

Current dataset projects

Lahman Baseball Database

A four-part benchmark using 27 Lahman baseball CSV files. It compares raw CSV read speed, size-grouped results, practical many-file baseball workloads, and repeated CSV versus Parquet reads.

The project is useful because the files have varied shapes and real baseball-analysis relevance, while also showing the limitations of a dataset without genuinely giant files. This was likely the most practical report for me, but it left enough questions unanswered that I knew I needed a stronger dataset.

Open the Lahman Baseball Database project

NYC TLC Yellow Taxi Data

A larger three-report Round 2 benchmark using 48 monthly Yellow Taxi trip-record files from 2019 through 2022.

The corpus contains 179.8 million rows and derived CSV files ranging from roughly 24 MB to 813 MB, making it a much stronger test of sustained CSV parsing and multi-file workload behaviour. As a bonus, it made my computer work about 70% as hard as Baldur's Gate 3, so the sound of it brought back some great memories.

Open the NYC TLC Taxi Data project

Round 3: Native Parquet

The next project stage will compare the completed NYC TLC CSV results with native reads of the original Parquet corpus.

That changes the question from parser selection to format selection: what happens when the data is not CSV in the first place? This report will hopefully have applications for small enterprises that are considering bringing their data work in-house.

Review the CSV baseline

What these reports measure

Single-file reads

Load one CSV file at a time and compare pandas default parsing with pandas using the PyArrow CSV engine.

Results include repeated-read timing, row and column validation, file size, and generated data tables.
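
For readers who want to try a single-file comparison on their own hardware, here is a minimal sketch of the idea. It is not the project's actual benchmark script: the function names, the default of five repeats, and the choice of median and best time as summaries are all assumptions made for illustration. The only calls it relies on are pandas.read_csv and Python's time.perf_counter.

```python
import statistics
import time

import pandas as pd


def time_read(path, engine=None, repeats=5):
    """Read one CSV `repeats` times and return timings plus the frame shape."""
    # engine=None lets pandas choose its default parser.
    kwargs = {} if engine is None else {"engine": engine}
    seconds = []
    shape = None
    for _ in range(repeats):
        start = time.perf_counter()
        frame = pd.read_csv(path, **kwargs)
        seconds.append(time.perf_counter() - start)
        shape = frame.shape
    return {
        "engine": engine or "default",
        "median_seconds": statistics.median(seconds),
        "best_seconds": min(seconds),
        "rows": shape[0],
        "columns": shape[1],
    }


def compare_engines(path, repeats=5):
    """Time both parsers on one file and check that they agree on shape."""
    default = time_read(path, None, repeats)
    arrow = time_read(path, "pyarrow", repeats)  # needs the pyarrow package
    return {
        "default": default,
        "pyarrow": arrow,
        # Row and column validation: both parsers should see the same table.
        "shape_matches": (default["rows"], default["columns"])
        == (arrow["rows"], arrow["columns"]),
        # Values above 1.0 mean the PyArrow engine was faster on this file.
        "speedup": default["median_seconds"] / arrow["median_seconds"],
    }
```

Calling compare_engines("some_file.csv") returns one dictionary per file. Repeated reads of the same file tend to benefit from the operating system's file cache, so a sketch like this says more about parser throughput than about cold disk reads.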

File-size analysis

Reuse completed benchmark results and group them by file size to see whether speedup patterns change across different ranges.

These reports describe the observed relationship without pretending that size alone explains every result.
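
The grouping step can be sketched with nothing beyond the standard library. The bucket edges, the field names (size_bytes, default_seconds, pyarrow_seconds), and the use of the median are assumptions for illustration; the reports themselves may use different ranges and summaries.

```python
from statistics import median

# Hypothetical bucket edges in megabytes (lower bound inclusive, upper exclusive).
BUCKETS = [
    (0, 1, "under 1 MB"),
    (1, 100, "1 to 100 MB"),
    (100, float("inf"), "100 MB and up"),
]


def bucket_label(size_bytes):
    """Return the label of the size bucket that contains `size_bytes`."""
    size_mb = size_bytes / 1_000_000
    for low, high, label in BUCKETS:
        if low <= size_mb < high:
            return label
    raise ValueError(f"no bucket for size {size_bytes}")


def speedup_by_size(results):
    """Group per-file results by size bucket and summarise the speedup."""
    groups = {}
    for row in results:
        # Speedup above 1.0 means the PyArrow engine read the file faster.
        speedup = row["default_seconds"] / row["pyarrow_seconds"]
        groups.setdefault(bucket_label(row["size_bytes"]), []).append(speedup)
    return {
        label: {"files": len(values), "median_speedup": median(values)}
        for label, values in groups.items()
    }
```

Reporting the file count next to each median matters here: a bucket holding one or two files describes those files, not the size range as a whole.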

Many-file workflows

Load several files, retain the DataFrames, and concatenate them into one analysis-ready table.

These workloads show where parsing, allocation, concatenation, and memory pressure begin to share the stage.
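
A many-file workload of this kind might look roughly like the sketch below. The function name and the returned fields are hypothetical, and the real scripts may separate or measure the stages differently. Timing the read phase and the concatenation phase separately is one way to see how much of the total is parsing at all.

```python
import time

import pandas as pd


def load_and_concat(paths, engine=None):
    """Read every CSV in `paths`, keep the frames, then concatenate them."""
    kwargs = {} if engine is None else {"engine": engine}

    start = time.perf_counter()
    # All frames stay in memory at once, which is where memory pressure shows up.
    frames = [pd.read_csv(path, **kwargs) for path in paths]
    read_done = time.perf_counter()

    combined = pd.concat(frames, ignore_index=True)
    concat_done = time.perf_counter()

    stats = {
        "files": len(frames),
        "rows": len(combined),
        "read_seconds": read_done - start,
        "concat_seconds": concat_done - read_done,
        # deep=True also counts the contents of object (string) columns.
        "memory_bytes": int(combined.memory_usage(deep=True).sum()),
    }
    return combined, stats
```

Passing engine="pyarrow" switches the parser for every file in the batch. The concatenation step is the same either way, so any engine speedup applies only to the read_seconds portion.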

Methodology and access

Each report starts with a local script, a defined dataset, and generated JSON. The report page loads that JSON directly in the browser and turns it into text, summary cards, filters, result counts, and accessible tables.
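
As a rough illustration of the script-to-JSON step, a benchmark script could write its results along these lines. The payload layout (dataset, generated_at, result_count, results) is an assumed schema for this sketch, not the format the report pages actually consume.

```python
import json
from datetime import datetime, timezone


def write_report_json(results, path, dataset):
    """Write benchmark results to a JSON file a report page could load."""
    payload = {
        "dataset": dataset,
        # UTC timestamp so readers can tell which run produced the file.
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "result_count": len(results),
        "results": results,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    return payload
```

Keeping the file a plain list of per-file records, with a count alongside, makes it easy to check that the page is showing every result the script produced.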

The reports are intentionally narrow. They do not attempt to benchmark every operation involved in data work. They document what was measured, what was not measured, and the practical limits of each conclusion.

The tables are the primary data source. Each report makes its generated JSON available directly from the page so the result can be checked without scraping the site.