CSV versus Parquet for Repeated Lahman Reads

This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark converts Lahman CSV files to Parquet once, then compares repeated CSV reads with repeated Parquet reads.

CSV is useful for sharing and inspection. Parquet is tested here as a local repeated-read format for data work.

CSV versus Parquet repeated read benchmark. CSV timings are reused from the raw CSV benchmark. Parquet timings are average milliseconds over 100 timed runs after two warmup runs.

About this test

This report compares repeated CSV reads with repeated Parquet reads.

Each Lahman CSV file was converted once to Parquet using PyArrow. Then the benchmark compared repeated reads using:

pandas default CSV timing from the raw CSV benchmark
pandas PyArrow CSV timing from the raw CSV benchmark
pandas.read_parquet(path, engine="pyarrow")

The CSV read timings are reused from the raw CSV benchmark so the four reports stay internally consistent.
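As a rough illustration of the steps above, the one-time conversion and the three read calls could look like the sketch below. This is not the benchmark's actual script. The helper names (parquet_path_for, convert_csv_to_parquet, read_three_ways), the sibling-file naming scheme, and the default Parquet writer settings are all assumptions made for the example.

```python
from pathlib import Path


def parquet_path_for(csv_path: Path) -> Path:
    """Return a sibling .parquet path for a CSV file (hypothetical naming scheme)."""
    return csv_path.with_suffix(".parquet")


def convert_csv_to_parquet(csv_path: Path) -> Path:
    """Convert one CSV file to Parquet with PyArrow and return the new path."""
    # Imported inside the function so the path helper above works
    # even where PyArrow is not installed.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv(str(csv_path))       # parse the CSV into an Arrow table
    out_path = parquet_path_for(csv_path)
    pq.write_table(table, str(out_path))     # default writer settings (assumption)
    return out_path


def read_three_ways(csv_path: Path, parquet_path: Path):
    """Read the same data with the three readers compared in this report."""
    import pandas as pd

    default_csv = pd.read_csv(csv_path)                        # pandas default CSV reader
    pyarrow_csv = pd.read_csv(csv_path, engine="pyarrow")      # pandas with the PyArrow CSV engine
    parquet = pd.read_parquet(parquet_path, engine="pyarrow")  # Parquet read
    return default_csv, pyarrow_csv, parquet
```

The conversion happens once per file. Only the read calls would be repeated and timed.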

Why this is interesting

CSV is a useful exchange format. It is readable, portable, and easy to inspect.

But repeated analysis changes the tradeoff. If a project reads the same public dataset many times, it may be worth converting the source files into a more efficient local working format.

That is the practical question here:

If the Lahman CSV files are converted to Parquet once, how much smaller do they get, and how fast are repeated reads?

How to read the table

The table contains one row per Lahman file.

The main fields are:

CSV file: the original Lahman CSV file.
Parquet file: the generated Parquet file.
CSV size: the original CSV file size.
Parquet size: the generated Parquet file size.
Size ratio: Parquet size divided by CSV size.
CSV default avg: average CSV read time from pandas.read_csv(path).
CSV PyArrow avg: average CSV read time from pandas.read_csv(path, engine="pyarrow").
Parquet avg: average Parquet read time using pandas.read_parquet(path, engine="pyarrow").
Default CSV to Parquet speedup: pandas default CSV average divided by Parquet average.
PyArrow CSV to Parquet speedup: PyArrow CSV average divided by Parquet average.

A size ratio below 1.00 means the Parquet file was smaller than the CSV file. A read speedup above 1.00x means Parquet was faster than that CSV reader.
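The two derived columns are simple divisions. The sketch below shows the arithmetic with made-up numbers. The function names are hypothetical and the values are not benchmark results.

```python
def size_ratio(parquet_bytes: int, csv_bytes: int) -> float:
    """Parquet size divided by CSV size; below 1.00 means Parquet is smaller."""
    return parquet_bytes / csv_bytes


def read_speedup(csv_avg_ms: float, parquet_avg_ms: float) -> float:
    """CSV average read time divided by Parquet average; above 1.00 means Parquet is faster."""
    return csv_avg_ms / parquet_avg_ms


# Illustrative values only, not taken from the report table.
ratio = size_ratio(parquet_bytes=250_000, csv_bytes=1_000_000)
speedup = read_speedup(csv_avg_ms=12.0, parquet_avg_ms=4.0)

print(f"size ratio: {ratio:.2f}")      # Parquet file is a quarter of the CSV size
print(f"read speedup: {speedup:.2f}x") # Parquet read is three times faster
```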

Methodology

Each CSV file was converted once to Parquet using PyArrow.

The benchmark records repeated read time. It does not include downstream joins, group-by operations, modeling, visualization, or report rendering.

The CSV timing values are reused from the raw CSV benchmark. The Parquet read timings were measured separately, as the average in milliseconds over 100 timed runs after two warmup runs.
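A timing loop of the kind described in the table caption (warmup runs that are discarded, followed by timed runs that are averaged) might be sketched as follows. The function name average_read_ms and the choice of time.perf_counter are assumptions for illustration. The benchmark's own script may measure time differently.

```python
import time
from statistics import mean
from typing import Callable


def average_read_ms(
    read: Callable[[], object],
    timed_runs: int = 100,
    warmup_runs: int = 2,
) -> float:
    """Call `read` repeatedly and return the average time per timed call in milliseconds."""
    # Warmup calls are executed but not measured.
    for _ in range(warmup_runs):
        read()

    samples_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        read()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return mean(samples_ms)
```

It could be used by passing a zero-argument callable, for example lambda: pandas.read_parquet(path, engine="pyarrow"), where path points to one of the generated Parquet files.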

What this test does not prove

This test does not prove that CSV should never be used.

CSV is still useful for publishing, sharing, inspecting, and archiving simple data. Parquet is being tested here as a local repeated-read format.

This test also does not claim that Parquet is faster for every small file. Some tiny files still read faster as CSV. That matters because Parquet has its own read overhead, such as parsing the file's metadata, and on very small files that overhead can outweigh the gains.

Accessibility notes

This report is built around text, summary cards, conversion cards, and a table. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it.

What did we learn?

For the full Lahman dataset, Parquet files were much smaller than CSV files and faster than pandas default CSV reads.

Compared with PyArrow CSV reads, Parquet still won overall, but by a smaller margin. That is the important practical lesson. PyArrow already makes CSV loading fast. Parquet still helps, especially as a repeated-read working format, but the advantage is not as dramatic as the comparison against pandas default CSV.

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page tests Parquet as a repeated-read format. It does not claim that CSV is obsolete or that every tiny file benefits from conversion.
