Data Reports | hluska.ca

Data reports for practical questions

Data Reports are structured projects that sit somewhere between a blog post and a full research project. They are for questions where the answer needs more than an opinion, but where the work is still small enough to explain clearly on a static page.

The pattern is very simple: use public or local data, process it with a repeatable script, save generated JSON beside the report page, and turn that JSON into summary cards, filters, and accessible tables in the browser.
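As a rough illustration of the "repeatable script" step, here is a minimal Python sketch that summarises a list of result rows and writes them to data.json in a report directory. The field names (`seconds`, `summary`, `rows`) and the helper names `build_report` and `write_report` are assumptions made for this example, not the schema or code the reports actually use.

```python
import json
from pathlib import Path
from statistics import mean


def build_report(rows):
    """Summarise result rows into one JSON-friendly structure.

    Each row is assumed (hypothetically) to carry a numeric "seconds" field.
    """
    seconds = [row["seconds"] for row in rows]
    return {
        "summary": {
            "row_count": len(rows),
            "fastest_seconds": min(seconds),
            "mean_seconds": round(mean(seconds), 4),
        },
        "rows": rows,
    }


def write_report(rows, report_dir):
    """Write data.json into the report's directory and return its path."""
    out_path = Path(report_dir) / "data.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(build_report(rows), indent=2), encoding="utf-8")
    return out_path
```

Keeping the summary and the raw rows in one file means the page only has to fetch a single static asset, and a reader can open that same file to check the numbers.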

So far, every report on this site has been either baseball-related or baseball-adjacent. The NumPy versus PyArrow question is a good example of a baseball-adjacent line of enquiry. I have played fantasy baseball for over twenty years and have co-managed a fantasy baseball league for the last few seasons. Getting data out quickly is a feature of our league, and optimizing my own scripts is a major part of my draft strategy. Our drafts have a per-manager time limit (similar to a chess clock), so a 5x to 7x speedup is a strategic advantage.

Featured data report project

NumPy versus PyArrow

A growing benchmark project examining pandas default CSV reads, pandas read_csv with the PyArrow engine, file-size effects, many-file workloads, and native Parquet workflows.

The project now includes two public datasets: the Lahman Baseball Database and NYC TLC Yellow Taxi trip-record data.
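For readers who want a feel for the core comparison, the following is a minimal sketch, not the project's actual benchmark harness. It times pandas' default CSV parser against `read_csv` with `engine="pyarrow"` on a single file. The helper names `time_call` and `compare_csv_engines`, the use of the median, and the repeat count are all assumptions chosen for illustration.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_csv_engines(csv_path, repeats=5):
    """Time the default pandas CSV parser against the PyArrow engine."""
    # Imported lazily so time_call stays usable without pandas installed.
    # engine="pyarrow" additionally requires the pyarrow package.
    import pandas as pd

    default_s = time_call(lambda: pd.read_csv(csv_path), repeats)
    pyarrow_s = time_call(lambda: pd.read_csv(csv_path, engine="pyarrow"), repeats)
    return {
        "default_seconds": default_s,
        "pyarrow_seconds": pyarrow_s,
        "speedup": default_s / pyarrow_s,
    }
```

Calling `compare_csv_engines("path/to/file.csv")` with any local CSV returns both timings and their ratio. Results will vary with file size, column types, and hardware, which is why the project looks at file-size effects and many-file workloads separately.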

View the NumPy versus PyArrow project

Built from generated JSON

Each report page can load a local data.json file from the same directory as the page. That keeps the page static, inspectable, and straightforward to host.

The page explains the method, the JSON stores the generated result, and the browser handles the searchable tables.

Open a generated report

Accessible by default

Data reports should not depend on chart-only meaning or visual-only scanning. These pages use explanations, summary cards, searchable controls, result counts, captions, headings, and tables.

A chart can be useful later, but the table still needs to carry the meaning. The data should not vanish behind a graphic. And thus charts are considered an enhancement to the report, never the full meal deal (as the young people say).

Read the accessibility statement

Current data report projects

NumPy versus PyArrow

A practical benchmark project using public datasets to compare pandas default CSV loading, the PyArrow CSV engine, many-file workflows, file-size behaviour, and Parquet.

The project currently contains a Lahman baseball dataset branch and a larger NYC TLC Yellow Taxi data branch.

Open the project

Lahman Baseball Database

A four-report project using the Lahman Baseball Database to compare raw CSV reads, file-size groups, practical baseball workloads, and repeated CSV versus Parquet reads.

Open the Lahman reports

NYC TLC Yellow Taxi Data

A larger Round 2 benchmark covering 48 monthly files, 179.8 million rows, five CSV file-size buckets, and seven practical multi-file workloads.

Open the NYC TLC reports

How these reports are built

The reports are static pages on hluska.ca. When a report needs generated data, that data is saved as JSON beside the report and loaded in the browser. This keeps the reports simple to host and easy to inspect.

Most reports follow the same workflow: a local script prepares data.json, the report page loads that file, and the browser turns it into summary cards, filters, result counts, and tables.

This structure keeps the data available and the reports reproducible on your own hardware. The page can explain its sources, the generated data can be downloaded, and the final result can focus on what the numbers actually say.
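The filter and result-count step runs in the browser on the real pages. Purely to show the logic, here is the same idea sketched in Python. The function names `filter_rows` and `result_count_label`, and the wording of the count label, are hypothetical and are not taken from the site's code.

```python
def filter_rows(rows, query):
    """Return rows where any field contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        # An empty search box shows every row.
        return list(rows)
    return [
        row
        for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


def result_count_label(shown, total):
    """Build the text a result-count element could display."""
    return f"Showing {shown} of {total} rows"
```

A text result count like this gives readers who cannot scan the table visually a direct way to know what a filter did.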

All data report projects

NumPy versus PyArrow - practical CSV and Parquet benchmarks using public datasets, currently including Lahman baseball data and NYC TLC Yellow Taxi trip-record data.
Lahman Baseball Database - a four-report dataset branch covering CSV reads, file size, practical baseball workloads, and repeated Parquet reads.
NYC TLC Yellow Taxi Data - a large CSV benchmark branch covering 48 files, file-size grouping, and multi-file analysis workloads.