About this test
This report takes the raw CSV benchmark results and groups the Lahman files by file size.
The buckets are:
- Small CSV files: files smaller than 1 MB.
- Medium CSV files: files from 1 MB up to 10 MB.
- Large CSV files: files 10 MB or larger.
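The bucket rules above can be sketched as a small function. This is a minimal illustration, not the report's own code: the name `size_bucket` is hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption (the report does not say whether it uses decimal or binary megabytes).

```python
# Assumption: 1 MB = 1,000,000 bytes. Switch to 1024 * 1024 if the
# benchmark measured sizes in binary megabytes instead.
MB = 1_000_000


def size_bucket(size_bytes: int) -> str:
    """Return 'small', 'medium', or 'large' for a file size in bytes."""
    if size_bytes < 1 * MB:
        return "small"   # smaller than 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # from 1 MB up to, but not including, 10 MB
    return "large"       # 10 MB or larger
```

In practice the size would probably come from something like `os.path.getsize(path)`, so each file lands in exactly one bucket.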
The goal is to test a practical question:
Does PyArrow become more useful as the CSV files get larger?
Why this is interesting
The raw benchmark showed that PyArrow was faster across the full Lahman CSV workload. But that headline hides an important detail: not every file behaves the same way.
Small files can be dominated by parser startup cost. Medium files give the reader more room to show throughput advantages. Large files would be the obvious stress test, but this Lahman CSV set does not include any files of 10 MB or larger.
That absence is useful too. It keeps the conclusion honest. This is not a giant-data benchmark. It is a real public-data benchmark.
How to read the table
The table contains one row per CSV file, with a size bucket added.
The main fields are:
- Bucket: small, medium, or large.
- File: the Lahman CSV file name.
- Size: the CSV file size.
- Rows: the number of rows loaded during profiling.
- Columns: the number of columns loaded during profiling.
- pandas avg: average read time using pandas.read_csv(path).
- PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that file.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
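The Speedup and Winner columns follow directly from the two averages. The sketch below shows the arithmetic. The function names are hypothetical, and reporting an exact tie as "tie" is an assumption, since the report does not say how ties are handled.

```python
def speedup(pandas_avg: float, pyarrow_avg: float) -> float:
    """pandas default average time divided by PyArrow average time."""
    if pyarrow_avg <= 0:
        raise ValueError("PyArrow average time must be positive")
    return pandas_avg / pyarrow_avg


def winner(pandas_avg: float, pyarrow_avg: float) -> str:
    """Name the faster method; the lower average read time wins."""
    if pyarrow_avg < pandas_avg:
        return "PyArrow"          # speedup above 1.00x
    if pandas_avg < pyarrow_avg:
        return "pandas default"   # speedup below 1.00x
    return "tie"                  # assumption: not defined in the report
```

For example, a file that takes twice as long with the default reader as with PyArrow would show a speedup of 2.00x with PyArrow as the winner.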
Methodology
This test reuses the raw CSV benchmark results. Each file was read 100 times with each reader after two warmup runs.
The benchmark records read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants.
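The procedure described above (two warmup runs, then 100 timed reads per reader) could be sketched roughly as follows. This is not the benchmark's actual harness: `average_read_time`, `read_pandas_default`, and `read_pandas_pyarrow` are hypothetical names, and using `time.perf_counter` as the timer is an assumption. The two reader functions need pandas, and the second one needs the pyarrow package as well.

```python
import time
from statistics import mean


def average_read_time(read_fn, path, runs=100, warmups=2):
    """Average wall-clock seconds for read_fn(path) over `runs` timed calls.

    Warmup calls are executed first and are not timed.
    """
    for _ in range(warmups):
        read_fn(path)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn(path)  # read time only; the result is discarded
        timings.append(time.perf_counter() - start)
    return mean(timings)


def read_pandas_default(path):
    import pandas as pd  # imported lazily so the timer itself has no pandas dependency
    return pd.read_csv(path)


def read_pandas_pyarrow(path):
    import pandas as pd
    return pd.read_csv(path, engine="pyarrow")
```

Calling `average_read_time(read_pandas_default, path)` and `average_read_time(read_pandas_pyarrow, path)` on the same file would give the two averages for one table row. Because the operating system may serve repeated reads from its file cache, numbers from a loop like this describe the local environment only.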
What this test does not prove
This test does not prove that PyArrow always wins once a file crosses a universal size threshold.
It does show that in this Lahman run, the medium-sized files were all PyArrow wins. The small files were mixed, which supports the idea that PyArrow's throughput advantage matters more once a file is large enough to overcome parser startup overhead.
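Statements like these come from tallying results per bucket. A rough sketch of that tally is below. The function name and the row layout (dictionaries with `bucket`, `pandas_avg`, and `pyarrow_avg` keys) are assumptions made for illustration, and defining a bucket's overall speedup as total pandas time divided by total PyArrow time is one possible definition, not necessarily the one this report uses.

```python
from collections import defaultdict


def summarize_by_bucket(rows):
    """Count files and PyArrow wins per bucket and compute an overall speedup.

    Each row is a dict with 'bucket', 'pandas_avg', and 'pyarrow_avg' keys
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(
        lambda: {"files": 0, "pyarrow_wins": 0, "pandas_total": 0.0, "pyarrow_total": 0.0}
    )
    for row in rows:
        t = totals[row["bucket"]]
        t["files"] += 1
        t["pandas_total"] += row["pandas_avg"]
        t["pyarrow_total"] += row["pyarrow_avg"]
        if row["pyarrow_avg"] < row["pandas_avg"]:
            t["pyarrow_wins"] += 1

    return {
        bucket: {
            "files": t["files"],
            "pyarrow_wins": t["pyarrow_wins"],
            # Assumption: overall speedup = summed pandas time / summed PyArrow time.
            "overall_speedup": t["pandas_total"] / t["pyarrow_total"],
        }
        for bucket, t in totals.items()
    }
```

A bucket where `pyarrow_wins` equals `files` is one where PyArrow won every file, which is the pattern described for the medium bucket in this run.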
Accessibility notes
This report is built around text, summary cards, bucket summaries, and tables. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it.
What did we learn?
For this Lahman dataset, PyArrow became much more convincing in the medium-size bucket.
Small files were mixed. PyArrow still won more small files than pandas default, but pandas default won several tiny files. Medium files were different: PyArrow won every medium file in this run and did so by a much larger overall margin.
That makes the practical lesson clear: file size is not the only factor, but it matters.