Many-File Baseball Workload

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark loads practical groups of Lahman CSV files and compares pandas default CSV loading with pandas read_csv using the PyArrow engine.

The test focuses on loading only. It does not include joins, group-by operations, modeling, chart generation, or downstream baseball analysis.

Loading workloads

Many-file Lahman workload benchmark. Times are average milliseconds over 100 timed runs after two warmup runs.

About this test

This report compares many-file Lahman loading workloads.

Instead of reading one CSV at a time, each workload loads a practical group of files that could be used together in a baseball analysis project.

The tested workloads are:

Core people and teams
Regular season player stats
Postseason stats
Awards and Hall of Fame
Salary context
All Lahman CSV tables

The test compares:

loading the workload with pandas.read_csv(path)
loading the workload with pandas.read_csv(path, engine="pyarrow")
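As a rough sketch of the two approaches above, a workload loader might look something like this. The function name load_workload and the way the engine is passed through are illustrative assumptions, not the benchmark's actual script. The "pyarrow" engine also requires the pyarrow package to be installed.

```python
from pathlib import Path

import pandas as pd


def load_workload(paths, engine=None):
    """Load every CSV in a workload and return {file stem: DataFrame}.

    engine=None uses pandas' default CSV reader;
    engine="pyarrow" asks pandas.read_csv to use the PyArrow engine.
    (Hypothetical helper for illustration only.)
    """
    frames = {}
    for path in paths:
        path = Path(path)
        if engine is None:
            # Same as pandas.read_csv(path)
            frames[path.stem] = pd.read_csv(path)
        else:
            # Same as pandas.read_csv(path, engine="pyarrow") when engine="pyarrow"
            frames[path.stem] = pd.read_csv(path, engine=engine)
    return frames
```

Calling load_workload(paths) and load_workload(paths, engine="pyarrow") on the same list of files gives the two sides of the comparison.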

Why this is interesting

Single-file benchmarks are useful, but they can hide the shape of real work. In most real data work, the analyst loads many files of many different sizes, and baseball is an excellent example of this.

A baseball project rarely opens only one file. It may load people, teams, appearances, batting, pitching, fielding, salaries, awards, or postseason tables together before doing joins or analysis.

This benchmark keeps the scope narrow. It measures loading only. It does not include joins, group-by work, modeling, or visualization. That makes the result easier to interpret:

When a project needs several Lahman CSV files at once, which CSV reader gets the data into pandas faster?

A narrow scope also limits what the result can tell you. If you came here hoping to speed up the scripts you run for fantasy baseball (or anything else), you are likely better off looking elsewhere. The differences look large as percentages, but neither reader is unreasonably slow, and in most cases the gap comes down to less than 10 milliseconds per file on average.

If you want to dramatically speed up your scripts, start somewhere other than CSV loading.

How to read the table

The table contains one row per workload.

The main fields are:

Workload: the practical Lahman file group.
Files: the number of CSV files loaded.
Total size: the combined CSV size for the workload.
pandas avg: average workload load time using pandas.read_csv(path).
PyArrow avg: average workload load time using pandas.read_csv(path, engine="pyarrow").
Speedup: pandas default average time divided by PyArrow average time.
Winner: the faster method for that workload.
Files loaded: the CSV files included in the workload.

A speedup above 1.00x means PyArrow was faster. A speedup below 1.00x means pandas default was faster.
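The Speedup and Winner fields can be sketched in a few lines. The function names and the sample timings below are made up for illustration and are not taken from the report data. The report also does not say how an exact tie is labeled, so the "tie" branch is an assumption.

```python
def speedup(pandas_avg_ms, pyarrow_avg_ms):
    """pandas default average time divided by PyArrow average time."""
    return pandas_avg_ms / pyarrow_avg_ms


def winner(pandas_avg_ms, pyarrow_avg_ms):
    """Name the faster method for a workload."""
    ratio = speedup(pandas_avg_ms, pyarrow_avg_ms)
    if ratio > 1.0:
        return "PyArrow"
    if ratio < 1.0:
        return "pandas default"
    return "tie"  # assumption: the report does not define this case


# Hypothetical timings, not real benchmark results.
example = speedup(30.0, 12.0)
print(f"{example:.2f}x", winner(30.0, 12.0))  # prints "2.50x PyArrow"
```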

Methodology

Each workload was loaded 100 times with each reader after two warmup runs.
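A timing loop along these lines would match the description above. This is a minimal sketch, not the script used for the report: the function name benchmark, the use of time.perf_counter, and the plain arithmetic mean are all assumptions.

```python
import statistics
import time


def benchmark(load, warmup_runs=2, timed_runs=100):
    """Return the average wall-clock time of load() in milliseconds.

    load is any zero-argument callable, for example a function that
    reads every CSV file in one workload with one reader.
    """
    # Warmup runs are executed but their timings are discarded.
    for _ in range(warmup_runs):
        load()

    samples_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        load()
        elapsed_s = time.perf_counter() - start
        samples_ms.append(elapsed_s * 1000.0)  # seconds to milliseconds

    return statistics.mean(samples_ms)
```

Only the call to load() sits between the two timer reads, so anything done with the data afterwards stays outside the measurement.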

The benchmark records CSV loading time only. It does not include joins, group-by operations, chart rendering, downstream calculations, or writing new files.

Keeping joins out of the benchmark is intentional. The point of this test is to isolate the many-file loading step before analysis begins.

What this test does not prove

This test does not prove that PyArrow will make every full baseball analysis faster. Once data is loaded, performance depends on the operations performed next. Joins, filters, group-by operations, data cleaning, type conversion, and modeling can all change the shape of the total runtime.

Instead, it answers a very narrow question: when loading practical groups of Lahman CSV files, which reader is faster? In this run, PyArrow was faster for every workload.

Accessibility notes

This report is built around text, summary cards, workload cards, and a table. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it.

What did we learn?

PyArrow won every many-file workload in this run.

The largest PyArrow win came from the regular-season player stats workload, which loaded Batting, Pitching, Fielding, and Appearances. That workload was the largest practical file group in the test, and it was about 5.90x faster with PyArrow.

The smaller workloads still showed PyArrow advantages, but the margins were smaller. That matches the pattern from the earlier tests: PyArrow's lead becomes more convincing as the amount of data increases.

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page measures many-file loading only. It does not include joins, group-by operations, modeling, visualization, or downstream analysis.
