This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark groups the Lahman CSV files by size and compares the default pandas.read_csv reader with pandas.read_csv using engine="pyarrow".

The goal is to see where PyArrow begins to matter. Small files can behave differently from medium files because fixed parser startup overhead can outweigh PyArrow's throughput advantage.

- File Size Benchmark Summary -

This test reuses the raw CSV benchmark data and groups files into small, medium, and large buckets.

The small-file bucket was mixed. The medium-file bucket was not: every medium Lahman CSV file was faster with PyArrow in this run.

- Bucket Results -

These bucket summaries are generated from the same data.json file as the full table.

- Full Bucketed CSV Table -

Search by file name, bucket, or winner. Filter by size bucket and winning reader. Sort the table to compare file size, rows, columns, read times, and speedup.

Lahman CSV read benchmark grouped by file size. Times are average milliseconds over 100 timed runs after two warmup runs.

Download the data

The page is generated from data.json. You can open the JSON file directly, save it locally, and inspect the benchmark output without scraping the page.

Download this report's data.json
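
As a minimal sketch of inspecting the downloaded file, the helper below loads a JSON file and reports only its top-level shape. The function name describe_json and the local file name are assumptions for illustration. The sketch deliberately assumes nothing about the field names inside data.json, because the report does not document its schema.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return a small summary of the top-level structure of a JSON file.

    Hypothetical helper: it makes no assumption about the report's schema.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Usage, assuming the file was saved next to this script as data.json:
# print(describe_json("data.json"))
```

Starting from the top-level keys or array length is a safe first step before writing code that depends on specific fields.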

About this test

This report takes the raw CSV benchmark results and groups the Lahman files by file size.

The buckets are:

  • Small CSV files: files smaller than 1 MB.
  • Medium CSV files: files of at least 1 MB and smaller than 10 MB.
  • Large CSV files: files 10 MB or larger.
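
The bucket rules above can be sketched as a small function. This is an illustration, not the report's actual script: the name size_bucket is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the report does not say whether it uses binary or decimal megabytes.

```python
# Assumption: 1 MB = 1024 * 1024 bytes. The report does not state the unit.
MB = 1024 * 1024


def size_bucket(size_bytes):
    """Map a file size in bytes to 'small', 'medium', or 'large'."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 1 * MB:
        return "small"   # smaller than 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # from 1 MB up to, but not including, 10 MB
    return "large"       # 10 MB or larger


# Usage with a real file (the path is a placeholder):
# import os
# print(size_bucket(os.path.getsize("some_file.csv")))
```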

The goal is to test a practical question:

Does PyArrow become more useful as the CSV files get larger?

Why this is interesting

The raw benchmark showed that PyArrow was faster across the full Lahman CSV workload. But that headline hides an important detail: not every file behaves the same way.

Small files can be dominated by parser startup cost. Medium files give the reader more room to show throughput advantages. Large files would be the obvious stress test, but this Lahman CSV set does not include any files of 10 MB or larger.

That absence is useful too. It keeps the conclusion honest. This is not a giant-data benchmark. It is a real public-data benchmark.

How to read the table

The table contains one row per CSV file, with a size bucket added.

The main fields are:

  • Bucket: small, medium, or large.
  • File: the Lahman CSV file name.
  • Size: the CSV file size.
  • Rows: the number of rows loaded during profiling.
  • Columns: the number of columns loaded during profiling.
  • pandas avg: average read time using pandas.read_csv(path).
  • PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
  • Speedup: pandas default average time divided by PyArrow average time.
  • Winner: the faster method for that file.

A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
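
The Speedup and Winner fields follow directly from the two averages. A minimal sketch, with hypothetical function names and made-up example times that are not taken from the report:

```python
def speedup(pandas_avg_ms, pyarrow_avg_ms):
    """pandas default average time divided by PyArrow average time."""
    if pyarrow_avg_ms <= 0:
        raise ValueError("pyarrow_avg_ms must be positive")
    return pandas_avg_ms / pyarrow_avg_ms


def winner(pandas_avg_ms, pyarrow_avg_ms):
    """Name the faster reader. The 'tie' label is an assumption of this sketch."""
    ratio = speedup(pandas_avg_ms, pyarrow_avg_ms)
    if ratio > 1.0:
        return "PyArrow"
    if ratio < 1.0:
        return "pandas default"
    return "tie"


# Illustrative numbers only, not results from this benchmark:
print(f"{speedup(12.0, 4.0):.2f}x", winner(12.0, 4.0))
```

With these illustrative inputs, pandas default takes three times as long as PyArrow, so the speedup is above 1.00x and PyArrow is the winner.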

Methodology

This test reuses the raw CSV benchmark results. Each file was read 100 times with each reader after two warmup runs.

The benchmark records read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.

The operating system can cache files between repeated reads, so most timed runs likely measure parsing from a warm cache rather than disk I/O. Treat these numbers as local benchmark results for this environment, not as universal constants.
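
The timing procedure described above could look roughly like the sketch below. It is not the report's actual script: average_read_ms and compare_readers are hypothetical names, and using time.perf_counter with a plain mean is an assumption about how the averages were taken. The two pandas calls match the ones named in the table description.

```python
import time
from statistics import mean


def average_read_ms(read, path, timed_runs=100, warmup_runs=2):
    """Average wall-clock time in milliseconds of read(path).

    Warmup runs are executed first and excluded from the average.
    """
    for _ in range(warmup_runs):
        read(path)
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        read(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def compare_readers(path):
    """Time both readers on one CSV file. Requires pandas and pyarrow."""
    import pandas as pd  # imported here so the timer itself has no dependencies

    default_ms = average_read_ms(pd.read_csv, path)
    pyarrow_ms = average_read_ms(
        lambda p: pd.read_csv(p, engine="pyarrow"), path
    )
    return default_ms, pyarrow_ms
```

Only the read call sits between the two clock readings, which matches the statement that the benchmark records read time only.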

What this test does not prove

This test does not prove that PyArrow always wins once a file crosses a universal size threshold.

It does show that in this Lahman run, the medium-sized files were all PyArrow wins. The small files were mixed, which supports the idea that PyArrow's throughput advantage matters more once a file is large enough to overcome parser startup overhead.

Accessibility notes

This report is built around text, summary cards, bucket summaries, and tables. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it.

What did we learn?

For this Lahman dataset, PyArrow became much more convincing in the medium-size bucket.

Small files were mixed. PyArrow still won more small files than pandas default, but pandas default won several tiny files. Medium files were different: PyArrow won every medium file in this run and did so by a much larger overall margin.

That makes the practical lesson clear: file size is not the only factor, but it matters.

Related Links

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page groups the raw CSV benchmark by file size. It does not measure joins, data cleaning, analysis, or visualization work.