About this test
This report compares two pandas CSV readers on many-file Lahman loading workloads.
Instead of reading one CSV at a time, each workload loads a practical group of files that could be used together in a baseball analysis project.
The tested workloads are:
- Core people and teams
- Regular season player stats
- Postseason stats
- Awards and Hall of Fame
- Salary context
- All Lahman CSV tables
The test compares:
- loading the workload with pandas.read_csv(path)
- loading the workload with pandas.read_csv(path, engine="pyarrow")
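As a rough illustration of the two approaches, loading one workload might look like the sketch below. It is not the benchmark's actual code. The helper names (reader_kwargs, load_workload), the csv_dir argument, and the ".csv" file names are assumptions made for this example; only the two pandas.read_csv calls come from the report.

```python
from pathlib import Path

# Assumed file names for the regular season player stats workload.
# The report names the tables; the ".csv" suffix is an assumption.
REGULAR_SEASON_FILES = [
    "Batting.csv",
    "Pitching.csv",
    "Fielding.csv",
    "Appearances.csv",
]


def reader_kwargs(use_pyarrow):
    """Return the extra keyword arguments for pandas.read_csv."""
    return {"engine": "pyarrow"} if use_pyarrow else {}


def load_workload(csv_dir, file_names, use_pyarrow=False):
    """Load every CSV in a workload and return a dict of DataFrames."""
    # Imported here so the sketch can be read and imported
    # on a machine where pandas is not installed.
    import pandas as pd

    kwargs = reader_kwargs(use_pyarrow)
    return {
        name: pd.read_csv(Path(csv_dir) / name, **kwargs)
        for name in file_names
    }
```

Calling load_workload(csv_dir, REGULAR_SEASON_FILES) would use the pandas default reader, and passing use_pyarrow=True would switch every file in the group to engine="pyarrow", which requires the pyarrow package to be installed.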
Why this is interesting
Single-file benchmarks are useful, but they can hide the shape of real work.
A baseball project rarely opens only one file. It may load people, teams, appearances, batting, pitching, fielding, salaries, awards, or postseason tables together before doing joins or analysis.
This benchmark keeps the scope narrow. It measures loading only. It does not include joins, group-by work, modeling, or visualization. That makes the result easier to interpret:
When a project needs several Lahman CSV files at once, which CSV reader gets the data into pandas faster?
How to read the table
The table contains one row per workload.
The main fields are:
- Workload: the practical Lahman file group.
- Files: the number of CSV files loaded.
- Total size: the combined CSV size for the workload.
- pandas avg: average workload load time using pandas.read_csv(path).
- PyArrow avg: average workload load time using pandas.read_csv(path, engine="pyarrow").
- Speedup: pandas default average time divided by PyArrow average time.
- Winner: the faster method for that workload.
- Files loaded: the CSV files included in the workload.
A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
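The Speedup and Winner fields follow directly from the two averages. A minimal sketch of that arithmetic is shown here; the function names are hypothetical, and the numbers in the usage note are illustrative values, not measurements from this report.

```python
def speedup(pandas_avg_s, pyarrow_avg_s):
    """pandas default average time divided by PyArrow average time."""
    if pyarrow_avg_s <= 0:
        raise ValueError("PyArrow average time must be positive")
    return pandas_avg_s / pyarrow_avg_s


def winner(pandas_avg_s, pyarrow_avg_s):
    """Name the faster method based on the speedup ratio."""
    ratio = speedup(pandas_avg_s, pyarrow_avg_s)
    if ratio > 1.0:
        return "PyArrow"
    if ratio < 1.0:
        return "pandas default"
    return "tie"
```

For example, with made-up averages of 2.0 seconds for pandas default and 0.5 seconds for PyArrow, speedup(2.0, 0.5) gives 4.0 and winner(2.0, 0.5) gives "PyArrow".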
Methodology
Each workload was loaded 100 times with each reader after two warmup runs.
The benchmark records CSV loading time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.
Keeping joins out of the benchmark is intentional. The point of this test is to isolate the many-file loading step before analysis begins.
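The timing procedure described above (two warmup runs, then 100 timed runs, averaged) can be sketched as follows. This is an assumed reconstruction, not the benchmark's real harness: the function name average_load_time and the use of time.perf_counter are choices made for this example.

```python
import time
from statistics import mean


def average_load_time(load, runs=100, warmups=2):
    """Return the mean wall-clock time in seconds of calling load().

    load is any zero-argument callable that loads one workload.
    """
    # Warmup runs are executed but not timed.
    for _ in range(warmups):
        load()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        load()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

Because only the call to load() sits between the two clock reads, anything done outside that callable, such as joins or writing files, stays out of the measurement.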
What this test does not prove
This test does not prove that PyArrow will make every full baseball analysis faster.
Once data is loaded, performance depends on the operations performed next. Joins, filters, group-by operations, data cleaning, type conversion, and modeling can all change the shape of the total runtime.
This test answers a narrower question: which reader loads practical groups of Lahman CSV files faster? In this run, PyArrow was faster for every workload.
Accessibility notes
This report is built around text, summary cards, workload cards, and a table. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it.
What did we learn?
PyArrow won every many-file workload in this run.
The largest PyArrow win came from the regular-season player stats workload, which loaded Batting, Pitching, Fielding, and Appearances. That workload was the largest practical file group in the test, and it was about 5.90x faster with PyArrow.
The smaller workloads still showed PyArrow advantages, but the margins were narrower. That matches the pattern from the earlier tests: PyArrow's advantage grows as the amount of data increases.