Small, Medium, and Large CSV File Performance

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark groups Lahman CSV files by size and compares pandas default CSV reads with pandas read_csv using the PyArrow engine.

The goal is to see where PyArrow begins to matter. Small files can behave differently than medium files because parser overhead can outweigh throughput advantages.

Lahman CSV read benchmark grouped by file size. Times are average milliseconds over 100 timed runs after two warmup runs.

About this test

This report takes the raw CSV benchmark results and groups the Lahman files by file size.

The buckets are:

Small CSV files: files smaller than 1 MB.
Medium CSV files: files from 1 MB up to 10 MB.
Large CSV files: files 10 MB or larger.
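As a rough sketch, the bucket rules above could be expressed as a small function. The name size_bucket is hypothetical, and the sketch assumes 1 MB means 1,000,000 bytes, since the report does not say whether it uses decimal or binary megabytes.

```python
# Assumption: decimal megabytes. Swap in 1024 * 1024 if binary megabytes are intended.
MB = 1_000_000


def size_bucket(size_bytes: int) -> str:
    """Return 'small', 'medium', or 'large' for a file size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 1 * MB:
        return "small"   # smaller than 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # from 1 MB up to, but not including, 10 MB
    return "large"       # 10 MB or larger
```

Note that a file of exactly 1 MB lands in the medium bucket and a file of exactly 10 MB lands in the large bucket, which matches the wording of the bucket list.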

The goal is to test a simple and practical question:

Does PyArrow become more useful as the CSV files get larger?

Why this is interesting

The raw benchmark showed that PyArrow was faster across the full Lahman CSV workload. But that headline hides an important detail: not every file behaves the same way.

Small files can be dominated by parser startup cost. Medium files give the reader more room to show throughput advantages. Large files would be the obvious stress test, but this Lahman CSV set does not include any files of 10 MB or larger, so the large bucket is empty and the three-bucket methodology is not a perfect fit for this dataset. I will be running these same benchmarks against a variety of other datasets in the future, so I will be able to add reports that show more varied and interesting data shapes.

Looked at another way, though, the absence of large files is a strength. This is real-world data, a dataset that I use constantly in fantasy baseball. So while the methodology wasn't a perfect fit for the dataset, it taught me some things that I will use when I'm in the middle of a draft and my clock is ticking down.

How to read the table

The table contains one row per CSV file, with a size bucket added.

The main fields are:

Bucket: small, medium, or large.
File: the Lahman CSV file name.
Size: the CSV file size.
Rows: the number of rows loaded during profiling.
Columns: the number of columns loaded during profiling.
pandas avg: average read time using pandas.read_csv(path).
PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
Speedup: pandas default average time divided by PyArrow average time.
Winner: the faster method for that file.

A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
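The Speedup and Winner fields can be sketched like this. The function names and the "tie" label are assumptions for illustration; the report does not say how an exact tie is labelled.

```python
def speedup(pandas_avg_ms: float, pyarrow_avg_ms: float) -> float:
    """pandas default average time divided by PyArrow average time."""
    if pyarrow_avg_ms <= 0:
        raise ValueError("pyarrow_avg_ms must be positive")
    return pandas_avg_ms / pyarrow_avg_ms


def winner(pandas_avg_ms: float, pyarrow_avg_ms: float) -> str:
    """Name the faster reader for one file."""
    ratio = speedup(pandas_avg_ms, pyarrow_avg_ms)
    if ratio > 1.0:
        return "PyArrow"         # above 1.00x: PyArrow was faster
    if ratio < 1.0:
        return "pandas default"  # below 1.00x: pandas default was faster
    return "tie"                 # assumption: label for an exact tie
```

For example, with made-up timings of 12 ms for pandas default and 4 ms for PyArrow, the speedup is 3.00x and the winner is PyArrow.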

Methodology

This test reuses the raw CSV benchmark results. Each file was read 100 times with each reader after two warmup runs.

The benchmark records read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
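A minimal sketch of a timing loop in this style is shown below. It is not the script used for this report. The helper name average_read_ms and the choice of time.perf_counter are assumptions; only the two warmup runs and 100 timed runs come from the methodology above.

```python
import time
from statistics import mean
from typing import Callable


def average_read_ms(read_once: Callable[[], object],
                    warmups: int = 2,
                    runs: int = 100) -> float:
    """Average wall-clock time of read_once() in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warmup runs are executed but not timed.
    for _ in range(warmups):
        read_once()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        read_once()  # only the read itself sits inside the timed region
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms)
```

With pandas and pyarrow installed, the two readers could be compared by passing lambda: pd.read_csv(path) and lambda: pd.read_csv(path, engine="pyarrow") as read_once, where path is a placeholder for one of the CSV files.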

The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants. In future tests, I am going to explore other operating systems and different setups.

What this test does not prove

This test does not prove that PyArrow always wins once a file crosses a universal size threshold.

It does show that in this Lahman run, the medium-sized files were all PyArrow wins. The small files were mixed, which supports the idea that PyArrow's throughput advantage matters more once a file is large enough to overcome parser startup overhead.

Getting closer to 'proving' (whatever that means) that PyArrow always wins once a file crosses a universal size threshold would take a lot more work. I need to run this benchmark against different datasets and, most importantly, against some genuinely big files. Once I have more data, more runs, different operating systems, and more varied data shapes, maybe I can get closer to proving something, or at least to offering something more like guidance.

Accessibility notes

This report is built around text, summary cards, bucket summaries, and tables. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and your feedback helps me grow. If you're interested, you can also read my accessibility policy for further details on how I test and build accessible websites and applications.

What did we learn?

For this Lahman dataset, PyArrow became much more convincing in the medium-size bucket. Unfortunately, the dataset did not have any large files, and I'll have to address that in a future benchmark.

Small files were mixed. PyArrow still won more small files than pandas default, but pandas default won several tiny files. Medium files were far more heavily skewed in favour of PyArrow: it won every medium file in this run, and by a much larger overall margin.

That makes the practical lesson clear: file size is not the only factor, but it matters.
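A per-bucket win count like the one summarised above could be tallied from the table rows roughly as follows. The row keys "bucket" and "winner" are assumed names, not necessarily the keys in the report's JSON file, and the example rows are made up for illustration. They are not results from this benchmark.

```python
from collections import Counter, defaultdict


def wins_by_bucket(rows):
    """Count how many files each reader won within each size bucket."""
    tally = defaultdict(Counter)
    for row in rows:
        tally[row["bucket"]][row["winner"]] += 1
    # Convert to plain dicts so the result is easy to print or serialise.
    return {bucket: dict(counts) for bucket, counts in tally.items()}


# Made-up rows for illustration only.
example_rows = [
    {"bucket": "small", "winner": "PyArrow"},
    {"bucket": "small", "winner": "pandas default"},
    {"bucket": "medium", "winner": "PyArrow"},
]

print(wins_by_bucket(example_rows))
```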

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page groups the raw CSV benchmark by file size. It does not measure joins, data cleaning, analysis, or visualization work.
