About this test
This report compares repeated CSV reads with repeated Parquet reads.
Each Lahman CSV file was converted once to Parquet using PyArrow. Then the benchmark compared repeated reads using:
- pandas default CSV timing from the raw CSV benchmark
- pandas PyArrow CSV timing from the raw CSV benchmark
- Parquet timing from pandas.read_parquet(path, engine="pyarrow")
The CSV read timings are reused from the raw CSV benchmark so the four reports stay internally consistent.
Why this is interesting
CSV is a useful exchange format. It is readable, portable, and easy to inspect.
But repeated analysis changes the tradeoff. If a project reads the same public dataset many times, it may be worth converting the source files into a more efficient local working format.
That is the practical question here:
If the Lahman CSV files are converted to Parquet once, how much smaller do they get, and how fast are repeated reads?
How to read the table
The table contains one row per Lahman file.
The main fields are:
- CSV file: the original Lahman CSV file.
- Parquet file: the generated Parquet file.
- CSV size: the original CSV file size.
- Parquet size: the generated Parquet file size.
- Size ratio: Parquet size divided by CSV size.
- CSV default avg: average CSV read time from pandas.read_csv(path).
- CSV PyArrow avg: average CSV read time from pandas.read_csv(path, engine="pyarrow").
- Parquet avg: average Parquet read time using pandas.read_parquet(path, engine="pyarrow").
- Default CSV to Parquet speedup: pandas default CSV average divided by Parquet average.
- PyArrow CSV to Parquet speedup: PyArrow CSV average divided by Parquet average.
A size ratio below 1.00 means the Parquet file was smaller than the CSV file. A read speedup above 1.00x means Parquet was faster than that CSV reader.
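The two derived fields are simple divisions. The sketch below shows the arithmetic using made-up numbers; the function names and the example values are hypothetical and are not taken from the report's table.

```python
def size_ratio(csv_bytes: int, parquet_bytes: int) -> float:
    """Parquet size divided by CSV size. Below 1.00 means Parquet is smaller."""
    return parquet_bytes / csv_bytes


def read_speedup(csv_avg_seconds: float, parquet_avg_seconds: float) -> float:
    """CSV average read time divided by Parquet average read time.

    Above 1.00 means the Parquet read was faster than that CSV reader.
    """
    return csv_avg_seconds / parquet_avg_seconds


# Hypothetical example values, chosen only to illustrate the arithmetic.
print(size_ratio(csv_bytes=1_000_000, parquet_bytes=250_000))              # 0.25
print(read_speedup(csv_avg_seconds=0.5, parquet_avg_seconds=0.125))        # 4.0
```

In this made-up case the Parquet file is a quarter of the CSV size and reads four times faster. The same speedup function applies to both CSV readers, since only the numerator changes.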
Methodology
Each CSV file was converted once to Parquet using PyArrow.
The benchmark records only repeated read time. It does not include downstream joins, group-by operations, modeling, visualization, or report rendering.
The CSV timing values are reused from the raw CSV test. The Parquet read timings were measured separately.
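A repeated-read measurement of this kind can be sketched as below. This is an assumption about the general shape of such a timer, not the report's actual harness: the function name average_read_seconds and the default of five repeats are hypothetical, and the report does not state how many repeats were used.

```python
import time
from statistics import mean


def average_read_seconds(read_fn, path, repeats=5):
    """Call read_fn(path) `repeats` times and return the mean wall-clock seconds.

    Hypothetical helper: the repeat count is an assumption.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)  # the result is discarded; only the read time is recorded
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

With pandas imported as pd, a Parquet read could then be timed by passing lambda p: pd.read_parquet(p, engine="pyarrow") as read_fn, and a CSV read by passing pd.read_csv. Taking the reader as a parameter keeps the timing loop identical for every format being compared.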
What this test does not prove
This test does not prove that CSV should never be used.
CSV is still useful for publishing, sharing, inspecting, and archiving simple data. Parquet is being tested here as a local repeated-read format.
This test also does not claim that Parquet is always faster for every small file. Some tiny files were still faster to read as CSV. Parquet has its own fixed read overhead, and on very small files that overhead can outweigh the benefit of the format.
Accessibility notes
This report is built around text, summary cards, conversion cards, and a table. The table is the primary source of the report data.
The table can be searched, filtered, and sorted. The result count updates after filtering. The data is presented with normal headings, captions, and table headers so screen reader users can move through the report without depending on charts, color, or mouse-only interaction.
If something does not work properly, please contact me so I can fix it.
What did we learn?
For the full Lahman dataset, Parquet files were much smaller than CSV files and faster than pandas default CSV reads.
Compared with PyArrow CSV reads, Parquet still won overall, but by a smaller margin. That is the important practical lesson. PyArrow already makes CSV loading fast. Parquet still helps, especially as a repeated-read working format, but the advantage is not as dramatic as the comparison against pandas default CSV.