This report is generated from a local JSON file. The table, filters, and sorting run in your browser. No benchmark data is sent to a server, your usage is not tracked, and this page will never have ads.

This benchmark converts Lahman CSV files to Parquet once, then compares repeated CSV reads with repeated Parquet reads.

CSV is useful for sharing and inspection. Parquet is tested here as a local repeated-read format for data work.

- CSV versus Parquet Summary -

This test looks at Parquet as a repeated-read working format, not as a replacement for every use of CSV.

Summary cards:

  • files converted
  • total CSV size
  • total Parquet size
  • Parquet size ratio
  • default CSV to Parquet speedup
  • PyArrow CSV to Parquet speedup

Parquet cut the full Lahman dataset to about one quarter of the CSV size and improved repeated reads overall. The gain over PyArrow CSV was real, but much smaller than the gain over pandas default CSV.

- Conversion Results -

These cards summarize the generated Parquet files before the full sortable table.


- Full CSV and Parquet Table -

Search by CSV file or Parquet file. Filter by whether Parquet was smaller, whether Parquet beat PyArrow CSV, or whether Parquet beat pandas default CSV.

CSV versus Parquet repeated read benchmark. CSV timings are reused from the raw CSV benchmark. Parquet timings are average milliseconds over 100 timed runs after two warmup runs.

Download the data

The page is generated from data.json. You can open the JSON file directly, save it locally, and inspect the benchmark output without scraping the page.
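
As a rough illustration of inspecting the file locally, the sketch below loads a saved JSON file and lists its top-level structure. It deliberately assumes nothing about the field names inside data.json, since the schema is not described on this page. The function name summarize_json is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Describe the top-level structure of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Map each top-level key to the type name of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    if isinstance(data, list):
        return {"type": "list", "length": len(data)}
    return {"type": type(data).__name__}
```

Calling summarize_json("data.json") on a saved copy should show which top-level entries exist before you decide how to load them into a table.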

Download this report's data.json

About this test

This report compares repeated CSV reads with repeated Parquet reads.

Each Lahman CSV file was converted once to Parquet using PyArrow. Then the benchmark compared repeated reads using:

  1. pandas default CSV timing from the raw CSV benchmark
  2. pandas PyArrow CSV timing from the raw CSV benchmark
  3. pandas.read_parquet(path, engine="pyarrow")

The CSV read timings are reused from the raw CSV benchmark so the four reports stay internally consistent.

Why this is interesting

CSV is a useful exchange format. It is readable, portable, and easy to inspect.

But repeated analysis changes the tradeoff. If a project reads the same public dataset many times, it may be worth converting the source files into a more efficient local working format.

That is the practical question here:

If the Lahman CSV files are converted to Parquet once, how much smaller do they get, and how fast are repeated reads?

How to read the table

The table contains one row per Lahman file.

The main fields are:

  • CSV file: the original Lahman CSV file.
  • Parquet file: the generated Parquet file.
  • CSV size: the original CSV file size.
  • Parquet size: the generated Parquet file size.
  • Size ratio: Parquet size divided by CSV size.
  • CSV default avg: average CSV read time from pandas.read_csv(path).
  • CSV PyArrow avg: average CSV read time from pandas.read_csv(path, engine="pyarrow").
  • Parquet avg: average Parquet read time using pandas.read_parquet(path, engine="pyarrow").
  • Default CSV to Parquet speedup: pandas default CSV average divided by Parquet average.
  • PyArrow CSV to Parquet speedup: PyArrow CSV average divided by Parquet average.

A size ratio below 1.00 means the Parquet file was smaller than the CSV file. A read speedup above 1.00x means Parquet was faster than that CSV reader.
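
The derived fields are simple divisions. The sketch below works through them with made-up numbers; the values are hypothetical and chosen only to make the arithmetic easy to follow, not taken from the report.

```python
def size_ratio(parquet_bytes, csv_bytes):
    """Parquet size divided by CSV size; below 1.00 means Parquet is smaller."""
    return parquet_bytes / csv_bytes


def read_speedup(csv_avg_ms, parquet_avg_ms):
    """CSV average divided by Parquet average; above 1.00 means Parquet is faster."""
    return csv_avg_ms / parquet_avg_ms


# Hypothetical values for illustration only; not from the benchmark.
csv_bytes = 1_000_000
parquet_bytes = 250_000
csv_default_avg_ms = 40.0
csv_pyarrow_avg_ms = 12.0
parquet_avg_ms = 8.0

print(f"{size_ratio(parquet_bytes, csv_bytes):.2f}")              # 0.25
print(f"{read_speedup(csv_default_avg_ms, parquet_avg_ms):.2f}x")  # 5.00x
print(f"{read_speedup(csv_pyarrow_avg_ms, parquet_avg_ms):.2f}x")  # 1.50x
```

With these example numbers, the Parquet file is one quarter the size of the CSV, reads 5 times faster than the default CSV reader, and reads 1.5 times faster than the PyArrow CSV reader.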

Methodology

Each CSV file was converted once to Parquet using PyArrow.

The benchmark records repeated read time. It does not include downstream joins, group-by operations, modeling, visualization, or report rendering.

The CSV timing values are reused from the raw CSV test. The Parquet read timings were measured separately.
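
A repeated-read timing loop of the kind described in the table caption (average milliseconds over 100 timed runs after two warmup runs) might look like the sketch below. The helper name average_read_ms and the choice of time.perf_counter are assumptions; the report does not say which timer the benchmark used.

```python
import time


def average_read_ms(read_fn, warmup_runs=2, timed_runs=100):
    """Average wall-clock time of read_fn in milliseconds, after untimed warmups."""
    for _ in range(warmup_runs):
        read_fn()  # warmup runs are executed but not timed

    elapsed_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        read_fn()
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)

    return sum(elapsed_ms) / len(elapsed_ms)
```

For example, with pandas imported as pd, average_read_ms(lambda: pd.read_parquet("People.parquet", engine="pyarrow")) would time one Parquet file, where "People.parquet" is a placeholder file name. Only the read call sits inside the timed region, which matches the scope stated above: no joins, group-by operations, or rendering.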

What this test does not prove

This test does not prove that CSV should never be used.

CSV is still useful for publishing, sharing, inspecting, and archiving simple data. Parquet is being tested here as a local repeated-read format.

This test also does not claim that Parquet is faster for every small file. Some tiny files were still faster to read as CSV. That matters because Parquet has its own read overhead, which a very small file may not offset.

Accessibility notes

This report is built around text, summary cards, conversion cards, and a table. The table is the primary source of the report data.

The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.

If something does not work properly, please contact me so I can fix it.

What did we learn?

For the full Lahman dataset, the Parquet files were about one quarter of the CSV size and were faster to read than pandas default CSV.

Compared with PyArrow CSV reads, Parquet still won overall, but by a smaller margin. That is the important practical lesson. PyArrow already makes CSV loading fast. Parquet still helps, especially as a repeated-read working format, but the advantage is not as dramatic as it is against pandas default CSV.

Related Links

About Data Reports

Data Reports are practical research pages built from public datasets, local scripts, and generated JSON files. The goal is to make the method and result inspectable.

About This Benchmark

This page tests Parquet as a repeated-read format. It does not claim that CSV is obsolete or that every tiny file benefits from conversion.