Fantasy Baseball

More Performance Testing - NumPy versus PyArrow

A follow-up to the Lahman baseball benchmark using 18.2 GB of NYC Yellow Taxi CSV data. PyArrow was faster in all 48 file-level comparisons, but the real lesson is still about data shape, absolute time, and choosing the right tool for the work.

July 5, 2026

A short guide to using generative AI in fantasy baseball

Practical guidance for using generative AI in fantasy baseball, including avoiding overengineered draft tools, handling data scraping responsibly, and using chat to challenge research assumptions before writing code.

June 24, 2026

NumPy versus PyArrow - CSV versus Parquet

A CSV versus Parquet benchmark using the Lahman Baseball Database. Parquet reduced the dataset to 25.6% of its CSV size and was faster overall, but several tiny files were larger or slower after conversion.

June 20, 2026

NumPy versus PyArrow: File Size Matters

The second NumPy versus PyArrow benchmark groups Lahman Baseball Database CSV files into small, medium, and large buckets. PyArrow wins the medium files, but the default pandas engine still wins several tiny files.

June 18, 2026

Performance Testing - NumPy versus PyArrow, One Year Later

A one-year follow-up to my NumPy versus PyArrow performance testing, using the Lahman Baseball Database to compare pandas default CSV reads with pandas read_csv using the PyArrow engine.

June 16, 2026

Baseball Reports - MLB Attendance and the 1994 Strike

A quick blog post introducing a new baseball report on MLB attendance before and after the 1994-1995 strike. The post explains why attendance per opening is a useful way to compare strike-era attendance, why 1995 shows the clearest damage, and why the 1998 recovery needs expansion context.

June 6, 2026

Baseball Reports - Simulating MLB’s salary cap and floor proposal

I launched Baseball Reports with a salary cap and floor simulation because baseball data should be readable, accessible, and useful. The first report tests MLB’s proposed cap and floor against current standings and payroll estimates.

May 29, 2026

Performance Testing - NumPy versus PyArrow

This post, the first in a series on Fantasy Baseball, does two things. First, it implements a scraper that collects stats for an entire year of Major League Baseball. Then it runs performance tests to see whether NumPy or PyArrow is faster at reading the generated CSV files. PyArrow was faster every time, but it particularly shines as data sets get larger.

June 16, 2025

Baseball. Math. Obsession. Strategy. Welcome home.