S3 Is Your New HDFS (If You Do It Right)
Teams migrating from on-prem Hadoop to AWS EMR in 2015 are making a predictable mistake: they're treating S3 like a remote HDFS. Same design patterns, same small-file structure, same assumptions about read performance. The migration works — the jobs run — but the costs are higher than expected and the performance is worse than the benchmarks promised.
S3 is not HDFS. It has different consistency guarantees, different performance characteristics under small-file load, and a cost model that rewards completely different design decisions. Getting cloud-native data lake right means unlearning some of what made your Hadoop cluster work.
The Consistency Problem
HDFS gives you immediate read-after-write consistency. You write a file, you can read it back immediately, and every node in the cluster sees the same version. S3, as of 2015, gives you eventual consistency for overwrite PUTs and DELETEs. Write a file, immediately try to read it from a different process — you might get the old version. This is not a theoretical concern: it causes real failures in pipelines that check for output existence immediately after writing it.
The defensive pattern: write to a staging prefix, then rename (copy + delete in S3 terms) to the final location only after the write is confirmed complete. Don't check for output existence in the same process that just wrote it.
The Small File Problem Is Worse on S3
On HDFS, small files waste NameNode memory and slow down MapReduce job initialization. On S3, small files also cost you per-request API charges. An EMR job that produces 50,000 files of 10KB each is making 50,000 PUT requests to write and 50,000 GET requests per downstream read pass. At scale, that API cost rivals the compute cost.
The fix is the same as on Hadoop — compact output files — but the pressure to fix it is higher on S3 because you see it in your AWS bill, not just in your job times.
from pyspark import SparkContext
sc = SparkContext("yarn", "compact-events")
# Bad: default parallelism produces hundreds of small files
events = sc.textFile("s3://data-lake/raw/events/2015/06/01/")
processed = events.map(transform_event)
processed.saveAsTextFile("s3://data-lake/processed/events/2015/06/01/")
# Better: coalesce to a reasonable number of output files
processed.coalesce(20).saveAsTextFile("s3://data-lake/processed/events/2015/06/01/")The right number of output files depends on your data volume and downstream read patterns. A rule of thumb: target 128–256 MB per file. For most daily batch workloads, 20–50 files per partition date is a reasonable starting point.
Parquet Is the Right Format
If you're still writing text files or CSV to S3, stop. Parquet is columnar, splittable, and compresses well — typical 4–10x compression over raw text. Downstream Spark or Hive jobs reading Parquet can skip columns they don't need, which cuts I/O dramatically for analytics queries that touch a subset of a wide schema.
from pyspark.sql import SQLContext
sc = SparkContext("yarn", "parquet-events")
sqlContext = SQLContext(sc)
# Read raw JSON events
events_df = sqlContext.read.json("s3://data-lake/raw/events/2015/06/01/")
# Write Parquet partitioned by event type — Hive and Spark both read this natively
events_df.write \
.partitionBy("event_type") \
.parquet("s3://data-lake/processed/events/2015/06/01/")S3 Prefix Design for Partitioning
HDFS partitioning is typically expressed as directory structure: /data/events/event_date=2015-06-01/. S3 uses the same pattern, and Hive and Spark both understand it as a partition key if you follow the key=value naming convention. The difference is that S3 prefix listing is slower than HDFS directory listing at large scale — a table with three years of daily partitions has thousands of prefixes to scan on every query. Design your partition hierarchy to match your most common access pattern, not your most granular one.
For most event tables: s3://bucket/events/year=2015/month=06/day=01/ works well. Avoid partitioning at the hour or minute level unless your queries almost always filter to a specific hour — you're just multiplying prefix scan costs without adding meaningful selectivity.
The Payoff
Get these decisions right — Parquet format, reasonable file sizes, well-designed partition structure — and S3 as a data lake foundation is genuinely better than HDFS for most batch analytics workloads. Lower operational overhead, cheaper at scale, and readable by every major processing engine. Get them wrong and you'll spend the next year wondering why EMR jobs that ran in 20 minutes on your old Hadoop cluster take 90 minutes on "the cloud."
If you're in the middle of this migration and hitting specific S3 performance issues, I'd like to hear what you're seeing. As always, I'm here to help.