About this test
This report compares many-file Lahman loading workloads.
Instead of reading one CSV at a time, each workload loads a practical group of files that could be used together in a baseball analysis project.
The tested workloads are:
- Core people and teams
- Regular season player stats
- Postseason stats
- Awards and Hall of Fame
- Salary context
- All Lahman CSV tables
The test compares:
- loading the workload with pandas.read_csv(path)
- loading the workload with pandas.read_csv(path, engine="pyarrow")
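As a rough illustration of the two calls being compared, the sketch below loads a group of CSV files with either reader. The helper name load_workload and the file paths in the comments are hypothetical and not taken from the report's own script. The engine="pyarrow" option also requires the pyarrow package to be installed.

```python
from pathlib import Path

import pandas as pd


def load_workload(paths, engine=None):
    """Load every CSV in `paths` into a dict of DataFrames keyed by file stem.

    `load_workload` is a hypothetical helper used for illustration only.
    """
    # Pass `engine` only when one is requested, so engine=None is exactly
    # the plain pandas.read_csv(path) call.
    kwargs = {} if engine is None else {"engine": engine}
    return {Path(p).stem: pd.read_csv(p, **kwargs) for p in paths}


# Example usage (paths are assumptions about where the Lahman CSVs live):
# files = ["lahman/Batting.csv", "lahman/Pitching.csv"]
# default_frames = load_workload(files)
# pyarrow_frames = load_workload(files, engine="pyarrow")
```

Returning a dict keyed by file name keeps each table separate, which matches the report's scope: the files are loaded, but nothing is joined.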
Why this is interesting
Single-file benchmarks are useful, but they can hide the shape of real work. In most real data work, an analyst loads many files of different sizes, and baseball is an excellent example.
A baseball project rarely opens only one file. It may load people, teams, appearances, batting, pitching, fielding, salaries, awards, or postseason tables together before doing joins or analysis.
This benchmark keeps the scope narrow. It measures loading only. It does not include joins, group-by work, modeling, or visualization. That makes the result easier to interpret:
When a project needs several Lahman CSV files at once, which CSV reader gets the data into pandas faster?
That narrow scope makes the result easier to interpret, but it also limits how much the result matters in practice. If you came here hoping to optimize the scripts you run for fantasy baseball (or any other purpose), you are likely better off looking elsewhere. The percentage differences look large, but neither reader is unreasonably slow, and in most cases the average difference works out to less than 10 milliseconds per file.
If you want to dramatically speed up your scripts, CSV loading is probably not the place to start.
How to read the table
The table contains one row per workload.
The main fields are:
- Workload: the practical Lahman file group.
- Files: the number of CSV files loaded.
- Total size: the combined CSV size for the workload.
- pandas avg: average workload load time using pandas.read_csv(path).
- PyArrow avg: average workload load time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that workload.
- Files loaded: the CSV files included in the workload.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
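The Speedup and Winner columns can be reproduced from the two averages with a few lines of arithmetic. The sketch below uses made-up timings purely for illustration; they are not values from the table. The function names and the "tie" case are assumptions added for completeness.

```python
def speedup(pandas_avg, pyarrow_avg):
    """Speedup as defined above: pandas default average / PyArrow average."""
    return pandas_avg / pyarrow_avg


def winner(pandas_avg, pyarrow_avg):
    """Name the faster reader; the 'tie' label is an assumption, not from the report."""
    ratio = speedup(pandas_avg, pyarrow_avg)
    if ratio > 1.0:
        return "PyArrow"
    if ratio < 1.0:
        return "pandas default"
    return "tie"


# Illustrative timings in milliseconds (NOT taken from the report's table).
example_pandas_ms = 200.0
example_pyarrow_ms = 50.0

print(f"{speedup(example_pandas_ms, example_pyarrow_ms):.2f}x")  # prints 4.00x
print(winner(example_pandas_ms, example_pyarrow_ms))             # prints PyArrow
```

Because the speedup is a ratio, it says nothing about the absolute time saved. A large ratio on a small workload can still amount to only a few milliseconds.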
Methodology
Each workload was loaded 100 times with each reader after two warmup runs.
The benchmark records CSV loading time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
Keeping joins out of the benchmark is intentional. The point of this test is to isolate the many-file loading step before analysis begins.
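A minimal sketch of a timing loop with this shape is shown below: two untimed warmup runs, then 100 timed runs averaged together. The function name benchmark, the use of time.perf_counter, and the plain arithmetic mean are assumptions; the report does not say which timer or averaging method the original script used.

```python
import time
from statistics import mean


def benchmark(load_fn, warmups=2, runs=100):
    """Return the average wall-clock time in seconds of `load_fn` over `runs` calls.

    `load_fn` is any zero-argument callable that loads one workload.
    """
    # Warmup runs are executed but not timed.
    for _ in range(warmups):
        load_fn()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        load_fn()  # only the load itself sits inside the timed region
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Example usage (hypothetical paths, requires pandas):
# import pandas as pd
# files = ["lahman/Batting.csv", "lahman/Pitching.csv"]
# avg_default = benchmark(lambda: [pd.read_csv(p) for p in files])
# avg_pyarrow = benchmark(lambda: [pd.read_csv(p, engine="pyarrow") for p in files])
```

Passing the loader in as a callable keeps the harness identical for both readers, so any difference in the averages comes from the reader and not from the timing code.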
What this test does not prove
This test does not prove that PyArrow will make every full baseball analysis faster. Once data is loaded, performance depends on the operations performed next. Joins, filters, group-by operations, data cleaning, type conversion, and modeling can all change the shape of the total runtime.
Instead, it answers a very narrow question: when loading practical groups of Lahman CSV files, which reader is faster? In this run, PyArrow was faster for every workload.
Accessibility notes
This report is built around text, summary cards, workload cards, and a table. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it.
What did we learn?
PyArrow won every many-file workload in this run.
The largest PyArrow win came from the regular-season player stats workload, which loaded Batting, Pitching, Fielding, and Appearances. That workload was the largest practical file group in the test, and it was about 5.90x faster with PyArrow.
The smaller workloads still showed PyArrow advantages, but the margins were smaller. That matches the pattern from the earlier tests: PyArrow's lead becomes more convincing as the amount of data increases.