Open Formats Are Your Insurance Policy: The Case for Parquet + S3 as the Foundation

The file format you choose for your data lake is a ten-year decision. You probably don't treat it that way — it shows up as a checkbox in a sprint, "store processed events in S3," and the format is whatever is convenient today. But that choice determines what processing engines can read your data five years from now, what compression ratios you'll see at ten petabytes, and how painful it is to leave a platform that turns out not to be the right long-term fit.

The argument for open formats — specifically Parquet as your default — is not about performance, though the performance advantages are real. It's about preserving optionality. Proprietary formats are a commitment. Open formats are insurance.

What Parquet Is and Why the Column Layout Matters

Parquet is a columnar storage format developed originally at Twitter and Cloudera, now an Apache project. "Columnar" means that all the values for a given column are stored together, rather than storing each row together. For analytics workloads — where queries typically read a few columns from millions of rows — this is a decisive advantage: the engine reads only the columns the query touches and skips the rest entirely at the I/O layer.

On a wide events table with 40 columns, a query that touches 5 of them reads approximately 12% of the data a row-oriented format would read. That's not a theoretical claim — it's the I/O profile you'll observe in practice, and at cloud storage rates it translates directly to dollar savings on compute and S3 GET request costs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

# Writing Parquet — compression is automatic (snappy by default)
events_df.write \
    .mode("overwrite") \
    .partitionBy("event_date", "event_type") \
    .parquet("s3://data-lake/processed/events/")

# Reading back — Spark applies partition pruning and column pruning automatically
page_views = spark.read \
    .parquet("s3://data-lake/processed/events/") \
    .filter("event_type = 'page_view' AND event_date = '2016-05-01'") \
    .select("user_id", "page_url", "event_ts")
# Spark only reads the event_type=page_view/event_date=2016-05-01 partition
# and only deserializes three columns out of forty

The Compression Numbers

Parquet with Snappy compression typically achieves 4–8x compression over raw JSON for event data, and 8–12x for more structured transactional data. At scale that's meaningful storage cost reduction. With Gzip compression (higher CPU, better ratio) you can push further — 10–15x over raw JSON is common for log-style event data.

The Portability Argument

Parquet is readable by every major processing engine available today: Spark, Hive, Presto, Impala, Drill, and every cloud-managed query service — AWS Athena, Google BigQuery external tables, Azure Data Lake Analytics. If you write your processed data in Parquet to S3, you are not committed to any specific processing engine. Databricks reads it. A future EMR cluster reads it. Athena reads it directly without loading or preprocessing.

Contrast that with building your processed layer inside Redshift or Snowflake in their native formats. When the business case for switching platforms arrives — price increase, capability gap, acquisition — the migration starts with a full data export. From Parquet on S3, it starts with pointing the new system at the existing bucket.

The Objection: "We Don't Plan to Switch"

Nobody plans to switch. Nobody plans for the acquisition that changes the pricing model. Nobody plans for the compliance requirement that arrives three years from now. The cost of using Parquet instead of a proprietary format is essentially zero — Parquet is free, the tooling support is excellent, and the performance is better for analytics workloads. The insurance value is real. There is no argument for proprietary formats on the data lake layer that holds up under examination.

Use proprietary storage where it provides a specific capability advantage you can't get otherwise — some specialized transactional engines have reasons to own their format. Your data lake storage is not that situation.

If you're in the middle of a format migration or evaluating Parquet alternatives like ORC, I'm happy to compare notes. As always, I'm here to help.

Read more