AWS Glue and the Promise of Serverless ETL

AWS announced Glue at re:Invent this week. The pitch: serverless ETL on managed Spark, plus a Data Catalog underneath it that automatically crawls your S3 buckets and makes their schemas queryable via Athena. It's a lot to unpack, and the announcement materials conflate two things that deserve separate evaluation: the Glue ETL job service and the Glue Data Catalog. They are not the same product, and they have very different levels of practical readiness.

The Glue Data Catalog: This Is the Interesting Part

The Glue Data Catalog is a centralized metadata repository for your S3 data lake. You point Glue Crawlers at S3 prefixes, they infer schemas from your Parquet (or JSON, or CSV) files, and they register those schemas as table definitions in the catalog. Those table definitions are then queryable via Athena, Redshift Spectrum, and — in the future — EMR and other services that support the Glue Catalog API.

This solves a real problem. The Hive Metastore, the traditional metadata layer for Hadoop data lakes, requires a running database (typically MySQL) and ties your schema definitions to a specific cluster. When you spin up a new EMR cluster, you either provision a new metastore, share the same MySQL instance, or export and import schemas manually. Glue Catalog is cloud-managed, persistent, and accessible across clusters and services without any of that infrastructure.
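To make the crawler step concrete, here is a toy sketch of what "infer a schema and register a table definition" means. It is plain Python, not the Glue API, and it says nothing about how Glue crawlers are actually implemented. The function names, the type names, and the bucket path are all hypothetical.

```python
def infer_type(value):
    """Map a Python value to a Hive-style type name (names are illustrative)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_table_definition(name, location, records):
    """Union the fields seen across sample records into one column list."""
    columns = {}
    for record in records:
        for field, value in record.items():
            seen = infer_type(value)
            previous = columns.setdefault(field, seen)
            if previous != seen:
                # In this sketch, conflicting types simply fall back to string.
                columns[field] = "string"
    return {
        "name": name,
        "location": location,
        "columns": [{"name": f, "type": t} for f, t in columns.items()],
    }


# Hypothetical sample records, as a crawler might read them from JSON files.
sample = [
    {"user_id": 42, "url": "/home", "duration_s": 1.5},
    {"user_id": 43, "url": "/pricing", "is_bot": False},
]
table = infer_table_definition(
    "page_views", "s3://example-bucket/events/page_views/", sample
)
```

The point is the output shape: a named table, a storage location, and a list of typed columns. That record is what a metadata catalog stores, and it is all a query engine needs to treat a directory of files as a table.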

The Athena integration alone is worth attention. Athena is AWS's serverless SQL query service over S3, also announced at re:Invent this week. With the Glue Catalog, your S3-backed Parquet tables are queryable via standard SQL with no cluster required and no data loading step. Run a crawler to register the schema, and your analysts have SQL access to the raw data lake.

-- After Glue Crawler registers your S3 Parquet data as a table,
-- Athena can query it directly. No cluster, no loading.
SELECT
    user_id,
    COUNT(*) AS event_count,
    MIN(event_ts) AS first_event,
    MAX(event_ts) AS last_event
FROM events.page_views  -- database.table, as registered by the crawler
WHERE event_date = '2016-12-01'
GROUP BY user_id
HAVING COUNT(*) >= 10
ORDER BY event_count DESC
LIMIT 100;

Athena charges $5 per terabyte scanned. With Parquet and partition pruning, a query that would scan 100GB of raw data might touch only 10GB in Parquet. That's about $0.05 per query, with no cluster to provision or maintain. For ad-hoc analytics on a data lake, those are compelling economics.
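The arithmetic is easy to check. The sketch below assumes Hive-style `event_date=...` prefixes in the S3 keys, decimal terabytes (1 TB = 1000 GB), and made-up file sizes. The $5 per terabyte price is the only figure taken from the announcement; every name and path here is hypothetical.

```python
PRICE_PER_TB_USD = 5.0  # Athena list price quoted above
GB_PER_TB = 1000.0      # decimal terabytes, assumed to keep the numbers round


def prune_partitions(files, event_date):
    """Keep only the files stored under the event_date=<value>/ prefix."""
    marker = "event_date={}/".format(event_date)
    return [f for f in files if marker in f["key"]]


def scan_cost_usd(files):
    """Cost of scanning every file in the list at the per-terabyte price."""
    scanned_gb = sum(f["size_gb"] for f in files)
    return scanned_gb / GB_PER_TB * PRICE_PER_TB_USD


# Hypothetical layout: one 10 GB Parquet file per daily partition.
files = [
    {"key": "events/page_views/event_date=2016-11-29/part-0.parquet", "size_gb": 10.0},
    {"key": "events/page_views/event_date=2016-11-30/part-0.parquet", "size_gb": 10.0},
    {"key": "events/page_views/event_date=2016-12-01/part-0.parquet", "size_gb": 10.0},
]

full_cost = scan_cost_usd(files)
pruned_cost = scan_cost_usd(prune_partitions(files, "2016-12-01"))
```

With these numbers, scanning all three partitions costs about $0.15, and the query filtered to a single day costs about $0.05. The saving scales with how many partitions the WHERE clause lets the engine skip.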

The Glue ETL Service: Wait and See

Glue's ETL job service is a different story. It generates PySpark code from a visual mapping interface, runs it on a managed Spark environment, and promises serverless ETL without cluster management. The promise is attractive. The execution at launch is rough.

The generated PySpark code is verbose and opinionated in ways that fight you when you need to customize it. The Glue DynamicFrame abstraction — Glue's wrapper around Spark DataFrames — handles semi-structured data with mixed types better than a strict DataFrame schema, but adds a learning curve for anyone who already knows Spark well. The job startup time (Spark cluster bootstrap) is several minutes, making it unsuitable for low-latency or frequent small jobs.
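The mixed-type problem is easier to see with a small example. The following is a plain-Python illustration of the idea only: it does not use the Glue library, and the helper names are hypothetical. It shows a field that arrives as an integer in some records and as a string in others, which a strict schema would reject, and one way to resolve it by casting to a single type.

```python
def column_types(records, field):
    """Collect the Python type names observed for one field."""
    return {type(r[field]).__name__ for r in records if field in r}


def resolve_by_cast(records, field, cast):
    """Resolve a mixed-type field by casting every value to one type."""
    resolved = []
    for r in records:
        row = dict(r)  # copy so the input records are left untouched
        if field in row:
            row[field] = cast(row[field])
        resolved.append(row)
    return resolved


# Hypothetical semi-structured input: user_id is sometimes an int, sometimes a string.
records = [{"user_id": 42}, {"user_id": "43"}, {"user_id": 44}]

observed = column_types(records, "user_id")
clean = resolve_by_cast(records, "user_id", int)
```

Here `observed` contains both `int` and `str`, and after the cast every `user_id` in `clean` is an integer. Deferring that decision until after the data is loaded, instead of failing at read time, is the convenience the DynamicFrame abstraction is selling.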

My recommendation: use the Glue Data Catalog today — it's a real improvement over the alternatives for S3 data lake metadata management. Evaluate the ETL job service carefully against your actual workload requirements before committing. If you're already running PySpark well on Databricks or EMR, Glue ETL doesn't offer enough at launch to justify migration.

The Bigger Picture

What Glue signals is that AWS is serious about the data lake catalog layer — the metadata infrastructure that makes a pile of S3 files into a queryable, governable data asset. That's the right problem to be solving, and a managed, cross-service catalog is the right shape of solution. The ETL surface will improve over time. The catalog is useful now.

If you're at re:Invent and dug deeper into the Glue session content, I'd like to compare notes on what's actually in the preview.
