Today, I would like to kick off a series on fantasy baseball. This series will show some ways to collect and analyze baseball data. The analysis part is up to you, but I can help you collect a lot of different data in a lot of different ways.
Last month, however, Reuven Lerner delivered a very interesting talk at PyCon US 2025. As part of that presentation, Lerner ran a performance test:
import time
import pandas as pd

def load_with_time(csv_path):
    start_time = time.perf_counter()
    df = pd.read_csv(csv_path, low_memory=False)
    end_time = time.perf_counter()
    total_time = end_time - start_time
    print(f"{total_time:.02f}")

def pyarrow_load_with_time(csv_path):
    start_time = time.perf_counter()
    table = pd.read_csv(csv_path, engine='pyarrow')
    end_time = time.perf_counter()
    total_time = end_time - start_time
    print(f"{total_time:.02f}")
If you’re not familiar with Python, that still likely makes a lot of sense - Lerner took the same CSV, loaded it with the PyArrow engine in one test case and with pandas' default NumPy-backed engine in the other, and compared how long each took to load. The results were staggering:
load_with_time -> 103.71
pyarrow_load_with_time -> 8.66
To conduct a series on fantasy baseball, I will need a data set. For the sake of demonstration, I have decided to scrape websites for the data. And this creates a really interesting opportunity - I can collect a full year’s worth of data, replicate Lerner’s experiment, and make a data-informed decision on whether to use PyArrow or NumPy for the rest of my analysis of baseball data.
Note - some of my fantasy baseball friends know that I am starting a series on analyzing baseball data. They have asked me to please add a disclaimer noting that I always finish in the basement, so my analysis should be considered purely instructive. The actual analysis should be treated as a cautionary tale on how not to run a successful fantasy baseball team.
Scraping and Accessibility
Scraping has gotten quite a bit more difficult since generative AI went public a few years ago. It used to be relatively easy to just trust the semantics expressed through HTML. Now, websites are using some sneaky methods to make their markup less semantically relevant.
This is kind of funny because I tested several generative AI services and was able to get the exact same data, so it’s not like any of these tricks are actually shielding the data. But I use a screen reader as part of my development flow, and these sites are one heck of a lot harder to use when I turn off my monitor and rely only on my ears and the keyboard.
This site is hard to use too, and I’m going to fix that soon, so I can’t really lecture anyone about accessibility while I’m failing at it myself. But there’s something really special about making sites more difficult to use while also failing to keep generative AI out.
I’m not going to mention any sites because I live in a glass house too. But I can still get the data. And now I’m going to show you parts of how I did it.
The Scraper
There are quite a few ways to get baseball data, from unofficial and undocumented APIs to downloading already-packaged CSV files to scraping. I’m going to focus on scraping because, armed with a half-decent web scraper, a baseball fan with a data problem can enhance their experience watching games. If you’ve ever watched a game and really wanted to see a visualization, you can build it yourself.
The scraper I implemented uses Puppeteer. It starts by collecting team abbreviations (this is useful if you want to go into the past and analyze years when there were different/fewer teams). Then it feeds them all into a structure that starts like this:
for (const team of teamAbbrs) {
  // Build the team/season page URL (parts redacted) and navigate to it
  const teamUrl = `${BASE_URL}/REDACTED/${team}/${YEAR} REDACTED`;
  console.log('Scraping:', teamUrl);
  await page.goto(teamUrl, { waitUntil: 'networkidle2' });
And then it ends like this:
  // Pause 4.5-5.5 seconds between page loads to keep the traffic polite
  await new Promise(resolve => setTimeout(resolve, 4500 + Math.random() * 1000));
}
I have redacted parts of the teamUrl so I don’t drive a lot of scraper traffic to a particular site. But you can adapt that structure to any site, and the console.log call in the snippet above will show you which URL you’re scraping to help you in your testing.
The ending closes the for loop but, most importantly, it sets a timer. I don’t want to flood a server with too many requests, but I also want to look somewhat human, so I add a random delay to space out the requests. This scraper loads a page every 4.5 to 5.5 seconds.
The Data
This part is a little bit complicated. In fantasy baseball, it doesn’t make a lot of sense to collect batters and pitchers together and analyze them all as a whole. Not only are the stats very different, but they’re also analyzed differently. Wins above replacement is one way to compare batters to pitchers, but other sorts of scoring systems that try to compare things like hits to strikeouts, or batting average to ERA, end up undervaluing dominant closers like Mariano Rivera and Dennis Eckersley.
For purposes of this experiment, I have treated the two data stores separately, but I also generated 100,000- and 200,000-row data sets based on batters_2024.csv so that I had a test where the performance benefits of PyArrow could really shine. And finally, to make the results a little more representative, I repeated the analysis 100 times for each file and collected average values.
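If you want to try this at home, here is a minimal sketch of how the scaled-up files and the averaged timings could be produced. The helper names are mine, not the exact script I ran, and the big files are built by simply repeating the 2024 rows, which is good enough for a load-time test:

import time
import pandas as pd

RUNS = 100

def build_scaled_csv(source_csv, target_csv, target_rows):
    # Repeat the source rows until we reach the requested row count
    base = pd.read_csv(source_csv)
    repeats = (target_rows // len(base)) + 1
    scaled = pd.concat([base] * repeats, ignore_index=True).head(target_rows)
    scaled.to_csv(target_csv, index=False)

def average_load_ms(csv_path, engine=None, runs=RUNS):
    # Average wall-clock load time in milliseconds over `runs` loads
    total = 0.0
    rows = 0
    for _ in range(runs):
        start = time.perf_counter()
        if engine:
            df = pd.read_csv(csv_path, engine=engine)
        else:
            df = pd.read_csv(csv_path, low_memory=False)
        total += time.perf_counter() - start
        rows = len(df)
    return rows, (total / runs) * 1000

if __name__ == "__main__":
    build_scaled_csv("batters_2024.csv", "batters_100k.csv", 100_000)
    build_scaled_csv("batters_2024.csv", "batters_200k.csv", 200_000)

    files = ["batters_2024.csv", "pitchers_2024.csv", "batters_100k.csv", "batters_200k.csv"]
    for label, engine in [("NumPy", None), ("PyArrow", "pyarrow")]:
        print(f"Running {label} benchmark on multiple files...")
        for f in files:
            rows, ms = average_load_ms(f, engine=engine)
            print(f"[{label}] {f} - loaded {rows} rows in {ms:.2f} ms (avg over {RUNS} runs)")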
The Results
Running NumPy benchmark on multiple files...
[NumPy] batters_2024.csv - loaded 734 rows in 2.39 ms (avg over 100 runs)
[NumPy] pitchers_2024.csv - loaded 861 rows in 2.28 ms (avg over 100 runs)
[NumPy] batters_100k.csv - loaded 100000 rows in 186.91 ms (avg over 100 runs)
[NumPy] batters_200k.csv - loaded 200000 rows in 592.75 ms (avg over 100 runs)
Running PyArrow benchmark on multiple files...
[PyArrow] batters_2024.csv - loaded 734 rows in 2.23 ms (avg over 100 runs)
[PyArrow] pitchers_2024.csv - loaded 861 rows in 2.07 ms (avg over 100 runs)
[PyArrow] batters_100k.csv - loaded 100000 rows in 41.07 ms (avg over 100 runs)
[PyArrow] batters_200k.csv - loaded 200000 rows in 192.68 ms (avg over 100 runs)
It’s not surprising that PyArrow outperformed NumPy, nor is the margin surprising. These results don’t quite replicate the 10x improvement Lerner demonstrated, but there are some important notes about this particular data set and my environment:
- I ran these tests on Windows 11. The magnitude of the difference is more pronounced when I run them on Linux, but if I presented those results I wouldn’t be able to get two articles out of this code.
- Data complexity matters, and batter data contains a mix of strings, numbers, and symbols. It also contains some oddly shaped columns, like this gem from Aaron Judge’s 2024 stat line - *8D/97. Fantasy baseball fans will recognize that right away - Judge started over ten games in CF and as DH, but fewer than ten games in right or left field. (There’s a small sketch of how to pull that column apart after this list.)
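To give a flavor of what cleaning a column like that involves, here is a hypothetical helper - the function name and dictionary are mine, it ignores the leading asterisk, and it only covers the basic position codes. It splits a position summary into the positions before the slash (the ones started regularly) and the ones after it:

POSITION_CODES = {
    "1": "P", "2": "C", "3": "1B", "4": "2B", "5": "3B",
    "6": "SS", "7": "LF", "8": "CF", "9": "RF", "D": "DH",
}

def parse_position_summary(summary):
    # Drop the leading flag, then split on the slash separating regular from occasional positions
    primary, _, secondary = summary.lstrip("*").partition("/")
    return {
        "regular": [POSITION_CODES[c] for c in primary if c in POSITION_CODES],
        "occasional": [POSITION_CODES[c] for c in secondary if c in POSITION_CODES],
    }

print(parse_position_summary("*8D/97"))
# {'regular': ['CF', 'DH'], 'occasional': ['RF', 'LF']}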
Even though I largely set PyArrow up to fail, the results were still impressive. The gains were not as pronounced on the small data sets, but PyArrow was designed for bulk throughput; it isn’t really built to shine until you hit 100K+ rows or large, complicated columnar data sets. Even so, it still shows a minor speed improvement on the smaller files.
That improvement isn’t massive, so I will likely use the Pandas default (NumPy) for most single-season work - it’s just muscle memory for me at this point. PyArrow will really start to shine once we move on to more complicated data sets that include things like fantasy-league and actual salary, contract status, and depth chart simulations. This was a very interesting study, and I am glad I started the fantasy baseball series off like this.