About this test
This report takes the raw CSV benchmark results and groups the Lahman files by file size.
The buckets are:
- Small CSV files: files smaller than 1 MB.
- Medium CSV files: files from 1 MB up to 10 MB.
- Large CSV files: files 10 MB or larger.
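The bucket rules above can be sketched as a small function. This is only an illustration: the name size_bucket and the MB constant are hypothetical, and the choice of decimal megabytes (1,000,000 bytes) is an assumption, since the report does not say whether decimal or binary megabytes were used.

```python
# Assumption: 1 MB = 1,000,000 bytes. Swap in 1024 * 1024 for binary megabytes.
MB = 1_000_000


def size_bucket(size_bytes: int) -> str:
    """Return the size bucket for a CSV file of the given size in bytes."""
    if size_bytes < 1 * MB:
        return "small"   # smaller than 1 MB
    if size_bytes < 10 * MB:
        return "medium"  # from 1 MB up to, but not including, 10 MB
    return "large"       # 10 MB or larger
```

Note that a file of exactly 1 MB lands in the medium bucket and a file of exactly 10 MB lands in the large bucket, matching the wording of the list.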
The goal is to test a simple and practical question:
Does PyArrow become more useful as the CSV files get larger?
Why this is interesting
The raw benchmark showed that PyArrow was faster across the full Lahman CSV workload. But that headline hides an important detail: not every file behaves the same way.
Small files can be dominated by parser startup cost. Medium files give the reader more room to show throughput advantages. Large files would be the obvious stress test, but this Lahman CSV set does not include any files of 10 MB or more, so the large bucket is empty and the three-bucket methodology is not a perfect fit for this dataset. I plan to run these same benchmarks against a variety of other datasets and add reports that show more varied and interesting data shapes.
Seen another way, though, the absence of large files is a strength. This is real-world data: a dataset that I use constantly in fantasy baseball. So while the methodology was not a perfect fit for the dataset, it taught me some things that I will use when I'm in the middle of a draft and my clock is ticking down.
How to read the table
The table contains one row per CSV file, with a size bucket added.
The main fields are:
- Bucket: small, medium, or large.
- File: the Lahman CSV file name.
- Size: the CSV file size.
- Rows: the number of rows loaded during profiling.
- Columns: the number of columns loaded during profiling.
- pandas avg: average read time using pandas.read_csv(path).
- PyArrow avg: average read time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that file.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
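The Speedup and Winner fields can be sketched as follows. The function names are hypothetical, and the "tie" label for a speedup of exactly 1.00x is an assumption; the report does not say how an exact tie would be labelled.

```python
def speedup(pandas_avg: float, pyarrow_avg: float) -> float:
    """pandas default average time divided by PyArrow average time."""
    return pandas_avg / pyarrow_avg


def winner(pandas_avg: float, pyarrow_avg: float) -> str:
    """Name the faster reader for one file, based on the speedup ratio."""
    ratio = speedup(pandas_avg, pyarrow_avg)
    if ratio > 1.0:
        return "PyArrow"         # PyArrow took less time
    if ratio < 1.0:
        return "pandas default"  # pandas default took less time
    return "tie"                 # assumption: label for an exact tie
```

Because the ratio puts the pandas default time on top, a larger number always means a bigger PyArrow advantage.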
Methodology
This test reuses the raw CSV benchmark results. Each file was read 100 times with each reader after two warmup runs.
The benchmark records read time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
The operating system can cache files between repeated reads, so these results should be read as local benchmark results for this environment, not as universal constants. In future tests, I plan to explore other operating systems and different setups.
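The timing procedure described in this section might look roughly like the sketch below. It is not the actual benchmark script: the function name average_read_time is hypothetical, and the use of time.perf_counter as the timer is an assumption. The reader is passed in as a callable so that the same loop serves both methods.

```python
import time
from statistics import mean
from typing import Callable


def average_read_time(
    read_fn: Callable[[str], object],
    path: str,
    runs: int = 100,
    warmups: int = 2,
) -> float:
    """Average wall-clock seconds for read_fn(path) over `runs` timed reads."""
    # Warmup reads are executed but not timed.
    for _ in range(warmups):
        read_fn(path)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn(path)  # only the read itself sits inside the timed region
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

With pandas installed, the two readers compared in this report could be passed as pandas.read_csv and as lambda p: pandas.read_csv(p, engine="pyarrow"). Nothing in this loop clears the operating system's file cache, so repeated reads of the same file may be served from memory.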
What this test does not prove
This test does not prove that PyArrow always wins once a file crosses a universal size threshold.
It does show that in this Lahman run, the medium-sized files were all PyArrow wins. The small files were mixed, which supports the idea that PyArrow's throughput advantage matters more once a file is large enough to overcome parser startup overhead.
To really 'prove' (whatever that means) that PyArrow always wins once a file crosses a universal size threshold, I need to do a lot more work: run this benchmark against different datasets and, most importantly, against some genuinely big files. Once I have more data, more runs, different operating systems, and more varied data shapes, maybe I can get closer to proving something, or at least to offering something more like guidance.
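The per-bucket comparison discussed in this section amounts to counting winners within each bucket. A minimal sketch, assuming each table row has been reduced to a (bucket, winner) pair; the function name wins_by_bucket is hypothetical, and the example rows are placeholders, not results from this benchmark.

```python
from collections import defaultdict


def wins_by_bucket(rows):
    """Count winners per bucket from an iterable of (bucket, winner) pairs."""
    summary = defaultdict(lambda: defaultdict(int))
    for bucket, win in rows:
        summary[bucket][win] += 1
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {bucket: dict(counts) for bucket, counts in summary.items()}


# Placeholder rows for illustration only; these are NOT benchmark results.
example_rows = [
    ("small", "PyArrow"),
    ("small", "pandas default"),
    ("medium", "PyArrow"),
]
print(wins_by_bucket(example_rows))
```

A bucket with no files, such as the large bucket in this Lahman run, simply does not appear in the result.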
Accessibility notes
This report is built around text, summary cards, bucket summaries, and tables. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it. Accessibility is a lifelong learning path and your feedback helps me grow. If you're interested, you can also read my accessibility policy for further details on how I test and build accessible websites and applications.
What did we learn?
For this Lahman dataset, PyArrow became much more convincing in the medium-size bucket. Unfortunately, the dataset did not have any large files, a gap I will have to fill in a future benchmark run.
Small files were mixed. PyArrow still won more small files than pandas default, but pandas default won several tiny files. Medium files were different, skewing much more heavily in favour of PyArrow: PyArrow won every medium file in this run, and did so by a much larger overall margin.
That makes the practical lesson clear: file size is not the only factor, but it matters.