What GPT-3 Can (and Can't) Do for a Data Engineer

Shannon Lowder

15 Aug 2020 — 3 min read

OpenAI just opened their GPT-3 API to a small beta. I got access last week. The first thing I did was paste a data engineering problem into the Playground — describe a pipeline, ask for the Spark code.

The results were impressive and wrong at the same time, and the wrong answers taught me more than the right ones.

What I Tested

I ran through a set of representative data engineering tasks, from simple to complex. The goal wasn't to benchmark the model — it was to understand its failure mode taxonomy so I could figure out whether and how to use it in real work.

Simple schema transformation: "Write a PySpark job that reads a JSON file with fields user_id, page_url, event_ts and writes Parquet partitioned by event_date derived from event_ts." The output was correct. The event_date derivation was right. The partitioning logic was right. The Parquet write was right. On well-specified, structurally simple tasks with standard patterns, GPT-3 works.

Business logic: "A session ends when there's a gap of more than 30 minutes between events from the same user. Write a PySpark job that assigns a session_id to each event." The output used groupBy and a UDF that compared consecutive timestamps. The overall structure was reasonable, but it got the ordering semantics wrong, the part a window function would normally handle, so it would produce incorrect session IDs on datasets where events arrive out of order. The code ran. It produced output. The output was wrong.
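To make the out-of-order failure concrete, here is a minimal pure-Python sketch of the 30-minute rule. It is not the model's output and it is not Spark; the function name assign_sessions, the sort_first flag, and the (user_id, event_ts) tuple layout are assumptions made for illustration. It shows one way such a bug can arise: comparing events in arrival order instead of timestamp order.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def assign_sessions(events, sort_first=True):
    """Return {(user_id, event_ts): session_id} for (user_id, event_ts) tuples.

    A new session starts when the gap to the user's previous event is
    strictly more than 30 minutes. Pass sort_first=False to mimic code
    that compares rows in arrival order rather than timestamp order.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        ordered = sorted(stamps) if sort_first else stamps
        session_no = 0
        previous = None
        for ts in ordered:
            if previous is not None and ts - previous > SESSION_GAP:
                session_no += 1
            sessions[(user_id, ts)] = f"{user_id}-{session_no}"
            previous = ts
    return sessions


t = datetime(2020, 8, 15, 9, 0)
# Three events from one user, delivered out of order: 09:00, 10:00, 09:10.
events = [
    ("u1", t),
    ("u1", t + timedelta(minutes=60)),
    ("u1", t + timedelta(minutes=10)),
]

correct = assign_sessions(events)                  # 09:10 joins the 09:00 session
naive = assign_sessions(events, sort_first=False)  # 09:10 lands in the 10:00 session
```

Both calls run without error and both return a session ID for every event; only a small hand-checked case like this one reveals that they disagree. In Spark, the same ordering guarantee would typically come from a window partitioned by user_id and ordered by event_ts (for example with lag), though that is a general pattern rather than a description of what the model produced.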

Library hallucination: I asked for a job using a specific pattern I use with a configuration library. The model invented methods on the library that don't exist. Confident, plausible, completely made up.
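One cheap guard against invented methods is to check, before running anything, that every module.attribute the generated code references actually exists. The following is a minimal sketch using only the standard library. The helper missing_attributes is hypothetical, and json stands in for the configuration library, which isn't named here. It only covers plain imports and direct module.name references, not methods on objects, and it does import the modules it inspects.

```python
import ast
import importlib


def missing_attributes(source):
    """List 'module.attr' references in source whose attr does not exist.

    Only handles `import x` / `import x as y` and direct `x.attr` access.
    """
    tree = ast.parse(source)

    # Map local alias -> real module name for every plain import.
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                aliases[item.asname or item.name] = item.name

    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            local = node.value.id
            if local in aliases:
                module = importlib.import_module(aliases[local])
                if not hasattr(module, node.attr):
                    missing.append(f"{aliases[local]}.{node.attr}")
    return missing


# Illustrative "generated" snippet: json.loads is real, json.parse_file is not.
generated = """
import json
config = json.loads('{"retries": 3}')
settings = json.parse_file("settings.json")
"""

print(missing_attributes(generated))
```

A check like this catches the "confident, plausible, completely made up" case mechanically, which matters most when you don't know the library well enough to spot the invention yourself.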

The Failure Mode Taxonomy

After working through these tests, here's how I categorize GPT-3's failure modes for data engineering work:

Confidently wrong logic. Code that runs and produces output, where the output is incorrect because the algorithm is subtly wrong. This is the most dangerous failure mode because nothing alerts you to it — you need a test suite to catch it.
Hallucinated APIs. Methods, parameters, or library features that don't exist. Easy to catch if you know the API; invisible if you don't.
Context-free defaults. The model makes a choice where a choice is required — data types, NULL handling, partition strategies — without flagging that a choice was made. The choice may be wrong for your data without being wrong in general.
Correct structure, wrong semantics. The code does something. It does not do the thing you described. Especially common with window functions, join semantics, and aggregation logic.
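The context-free default is easy to see in the first task: deriving event_date from event_ts silently picks a time zone. A minimal sketch with the standard library follows; the fixed UTC-5 reporting zone is an assumed business rule chosen for illustration, not something taken from the tests above.

```python
from datetime import datetime, timedelta, timezone

# An event recorded at 02:30 UTC on 15 Aug 2020.
event_ts = datetime(2020, 8, 15, 2, 30, tzinfo=timezone.utc)

# One plausible default: take the calendar date of the timestamp as stored (UTC).
date_utc = event_ts.date()

# Assumed rule for illustration: the business reports in a fixed UTC-5 zone.
reporting_zone = timezone(timedelta(hours=-5))
date_local = event_ts.astimezone(reporting_zone).date()

print(date_utc)    # 2020-08-15
print(date_local)  # 2020-08-14
```

Neither date is wrong in general. The event lands in a different partition depending on a choice that generated code makes without mentioning it.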

What It's Actually Good For

The tasks where GPT-3 earns its keep are the ones that are well-specified and structurally standard: boilerplate Spark DataFrame operations on named columns, Airflow operator configurations, schema DDL, data type conversions. Tasks where the question is "how do I express this known pattern in this API" rather than "how do I solve this problem."

It's also useful as a starting point for code you intend to review carefully. Generate the scaffold, audit it, fix the gaps. Faster than writing from scratch; requires more scrutiny than you'd apply to code written by a senior colleague.

The Trust Calibration Problem

The thing that makes GPT-3 more dangerous than a junior engineer is that it doesn't signal uncertainty. A junior engineer writes code and says "I'm not sure about the window function here, can you check?" GPT-3 writes the same uncertain code with the same confidence it uses for the things it's sure about. There's no verbal tell. You have to supply the skepticism yourself.

Trust calibration is the meta-skill that makes GPT-3 useful rather than harmful. The engineers who will get the most value from this model are the ones who already know enough to recognize the failure modes — which means it augments senior engineers more than it substitutes for junior ones. At least for now.

If you've gotten API access and run through your own test suite of data engineering tasks, I'd like to compare failure mode notes. As always, I'm here to help.

