NumPy versus PyArrow - CSV versus Parquet

Weird things occasionally seep out of the old brain. For whatever reason, while I was working on this article, I kept thinking of WrestleMania III, when Hulk Hogan body-slammed André the Giant. Why did adding CSV versus Parquet to my NumPy versus PyArrow series bring up wrestling memories? Your guess is as good as mine.

Extending NumPy versus PyArrow with CSV and Parquet

CSV and Parquet are both excellent and highly useful formats in data work. Since the difference between them often comes down to performance, it made sense to run a test on repeated reads with CSV and Parquet.

The methodology

Again, this was not a complicated test. It reuses the CSV read timings from the raw CSV benchmark I wrote about a few days ago, covering both the default pandas reader and the PyArrow engine.

For this test, I converted all the files to Parquet using PyArrow and then compared three things:

pandas default CSV timing from the raw CSV benchmark
pandas CSV timing using the PyArrow engine from the raw CSV benchmark
pandas Parquet timing using pandas.read_parquet(path, engine="pyarrow")

Results

In this case, the results were much as expected:

I converted 27 files.
The total size of these files in CSV was 42.15 MB.
The total size of these files in Parquet was 10.79 MB, or 25.6% of the CSV size.
Compared with the default pandas CSV reader, Parquet was 7.07x faster on this dataset.
Compared with pandas CSV reads using the PyArrow engine, Parquet was 1.29x faster on this dataset.
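
The size figures above are simple to check. This short sketch uses only the two totals reported in this post; the variable names are mine.

```python
csv_total_mb = 42.15      # total CSV size reported above
parquet_total_mb = 10.79  # total Parquet size reported above

ratio = parquet_total_mb / csv_total_mb
saved_mb = csv_total_mb - parquet_total_mb

print(f"Parquet is {ratio:.1%} of the CSV size")
print(f"Space saved: {saved_mb:.2f} MB ({1 - ratio:.1%})")
```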

Implications

This one was predictable, and I do not think it will benefit as much from a more varied dataset as the raw CSV and file-size bucket reports will.

However, I want to make a few points so that they do not get lost. The first is that if you sort the table by CSV size, you will find that 3 of 27 files were larger with Parquet. More interestingly, 6 of 27 were slower with Parquet than with the default CSV read. Parquet has overhead, and that overhead shows up on small files. So, if your dataset consists of lots of very small CSV files, Parquet may not be the right format for you.
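
If you want to see whether your own files fall into the "larger as Parquet" group, a rough check like the one below may help. It is a sketch under assumptions: the function name is hypothetical, and it assumes each Parquet copy sits in one directory and shares its CSV file's stem.

```python
from pathlib import Path


def files_larger_as_parquet(csv_dir: Path, parquet_dir: Path) -> list[str]:
    """Return CSV file names whose Parquet counterpart takes more bytes on disk."""
    larger = []
    for csv_path in sorted(csv_dir.glob("*.csv")):
        # Assumption: the Parquet copy shares the CSV file's stem.
        parquet_path = parquet_dir / (csv_path.stem + ".parquet")
        if not parquet_path.exists():
            continue  # no converted copy to compare against
        if parquet_path.stat().st_size > csv_path.stat().st_size:
            larger.append(csv_path.name)
    return larger
```

This only covers the size side of the overhead. Whether a small file also reads more slowly as Parquet has to be measured by timing the reads.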

Another important point is that even if the shape of your data supports Parquet, this report does not say that CSV is bad. CSV is excellent for sharing and publishing data, and as you can see above, it has some performance value when we are talking about really small files.

Finally, my child was not amused by my parkour jokes. Humour is dead.

There is another implication within the data that will hopefully be of some use to smaller enterprises. Parquet is usually much smaller than CSV and, with rising cloud costs, it can make sense for some businesses to take more of their data work in-house. Storing regularly used data at 25.6% of its CSV size makes a lot of sense for those kinds of companies.

Final words

I have now covered three of the four reports that I wrote on NumPy versus PyArrow performance. In a few days, I will cover the last report, where we can talk about multi-file workloads. Then I am going to do one more report on the NumPy versus PyArrow question. This time, I am going to find a much bigger and more varied dataset using New York City taxi data.

However, I do not think this Parquet question will benefit from that level of dataset variation, so I will not be repeating this exact method on the new data. Fundamentally, this test has shown a few important things:

There is overhead to Parquet that should be planned for.
Most of the time, Parquet files are much smaller than CSV.
Reading Parquet through pandas with the PyArrow engine was 7.07x faster than reading CSV with the default pandas reader in this test.
The speedup was not as large when comparing Parquet reads with CSV reads using the PyArrow engine.

The fourth point is important. If you read data regularly, converting it once to Parquet likely makes sense. But if you already read CSV with PyArrow, check your timings. You may not save much time through conversion, though you may still save substantial storage space.
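
Checking your own timings does not need much code. The helper below is a minimal sketch using only the standard library. Taking the median of a handful of repeats is my assumption for illustration, not necessarily how the numbers in this post were aggregated.

```python
import statistics
import time
from typing import Callable


def median_read_time(read_fn: Callable[[], object], repeats: int = 5) -> float:
    """Call read_fn `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

To compare readers, pass each one in as a zero-argument callable, for example lambda: pd.read_csv(path, engine="pyarrow") against lambda: pd.read_parquet(parquet_path, engine="pyarrow"), and feed the two medians to speedup.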