What GPT-3 Can (and Can't) Do for a Data Engineer
OpenAI just opened their GPT-3 API to a small beta. I got access last week. The first thing I did was paste a data engineering problem into the Playground — describe a pipeline, ask for the Spark code.
The results were impressive and wrong at the same time, and the failures taught me more than the successes did.
What I Tested
I ran through a set of representative data engineering tasks, from simple to complex. The goal wasn't to benchmark the model — it was to understand its failure mode taxonomy so I could figure out whether and how to use it in real work.
Simple schema transformation: "Write a PySpark job that reads a JSON file with fields user_id, page_url, event_ts and writes Parquet partitioned by event_date derived from event_ts." The output was correct. The event_date derivation was right. The partitioning logic was right. The Parquet write was right. On well-specified, structurally simple tasks with standard patterns, GPT-3 works.
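One cheap way to check a generated job like this is to compare its output against a tiny plain-Python reference. The sketch below is not the code GPT-3 produced. It only mirrors the intended behavior on in-memory records, assuming event_ts is an ISO 8601 string (the prompt doesn't say) and using the event_date=YYYY-MM-DD directory naming that Spark's partitionBy writes. The function name partition_by_event_date and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_event_date(records):
    """Group records under the partition directory a date-partitioned write would use.

    Assumption: each record is a dict whose event_ts is an ISO 8601 string.
    """
    partitions = defaultdict(list)
    for record in records:
        # Derive the calendar date from the timestamp, e.g. "2020-07-01".
        event_date = datetime.fromisoformat(record["event_ts"]).date().isoformat()
        partitions[f"event_date={event_date}"].append(record)
    return dict(partitions)


# Made-up sample records for illustration only.
sample = [
    {"user_id": "u1", "page_url": "/home", "event_ts": "2020-07-01T23:59:00"},
    {"user_id": "u1", "page_url": "/cart", "event_ts": "2020-07-02T00:01:00"},
    {"user_id": "u2", "page_url": "/home", "event_ts": "2020-07-02T08:30:00"},
]
expected_partitions = partition_by_event_date(sample)
print(sorted(expected_partitions))
```

If the Parquet output of the generated job has different partition directories or different row counts per directory on the same sample, something in the date derivation deserves a closer look.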
Business logic: "A session ends when there's a gap of more than 30 minutes between events from the same user. Write a PySpark job that assigns a session_id to each event." The output used groupBy and a UDF that compared consecutive timestamps — the right structure, but it got the window function semantics wrong in a way that would produce incorrect session IDs on datasets where events arrive out of order. The code ran. It produced output. The output was wrong.
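To make the out-of-order failure concrete, here is a minimal plain-Python sketch of the 30-minute rule. It is not the PySpark code the model generated. The names assign_sessions and sort_first are hypothetical, and the session_id format is an arbitrary choice. With sort_first=False it compares events in arrival order, which is one way a job can run cleanly and still mislabel a late-arriving event.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def assign_sessions(events, sort_first=True):
    """Assign a session_id to each (user_id, event_ts) pair.

    A new session starts when the gap to the previous event from the same
    user exceeds SESSION_GAP. Returns (user_id, event_ts, session_id) tuples.
    """
    by_user = defaultdict(list)
    for user_id, event_ts in events:
        by_user[user_id].append(event_ts)

    labeled = []
    for user_id, stamps in by_user.items():
        # Sorting by timestamp is the step that makes "previous event" meaningful.
        ordered = sorted(stamps) if sort_first else stamps
        session_no = 0
        previous = None
        for event_ts in ordered:
            if previous is not None and event_ts - previous > SESSION_GAP:
                session_no += 1
            labeled.append((user_id, event_ts, f"{user_id}-{session_no}"))
            previous = event_ts
    return labeled


# Made-up events: the 10:10 event arrives last but happened before 11:00.
late_event = datetime(2020, 7, 1, 10, 10)
events = [
    ("u1", datetime(2020, 7, 1, 10, 0)),
    ("u1", datetime(2020, 7, 1, 11, 0)),
    ("u1", late_event),
]
correct = {ts: sid for _, ts, sid in assign_sessions(events, sort_first=True)}
naive = {ts: sid for _, ts, sid in assign_sessions(events, sort_first=False)}
print(correct[late_event], naive[late_event])
```

With sorting, the 10:10 event joins the session that started at 10:00. In arrival order it gets attached to the session that started at 11:00. Both runs finish without an error, which is why only a test with out-of-order input exposes the difference.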
Library hallucination: I asked for a job using a specific pattern I use with a configuration library. The model invented methods on the library that don't exist. Confident, plausible, completely made up.
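A cheap guard against this is to check generated method names against the installed library before reading any further. Here is a minimal sketch, using the standard library's configparser purely as a stand-in (it is not the configuration library I was testing with). The helper missing_attributes and the method name get_nested are hypothetical.

```python
import configparser


def missing_attributes(obj, names):
    """Return the names that do not exist as attributes on obj."""
    return [name for name in names if not hasattr(obj, name)]


parser = configparser.ConfigParser()
# getint and getboolean are real ConfigParser methods; get_nested is made up
# here to play the role of a hallucinated call.
used_by_generated_code = ["getint", "getboolean", "get_nested"]
invented = missing_attributes(parser, used_by_generated_code)
print(invented)
```

This only proves that a name exists. It says nothing about whether the parameters or the behavior match what the generated code assumes, so it narrows the review rather than replacing it.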
The Failure Mode Taxonomy
After a week of testing, here's how I categorize GPT-3's failure modes for data engineering work:
- Confidently wrong logic. Code that runs and produces output, where the output is incorrect because the algorithm is subtly wrong. This is the most dangerous failure mode because nothing alerts you to it — you need a test suite to catch it.
- Hallucinated APIs. Methods, parameters, or library features that don't exist. Easy to catch if you know the API; invisible if you don't.
- Context-free defaults. The model makes a choice where a choice is required — data types, NULL handling, partition strategies — without flagging that a choice was made. The choice may be wrong for your data without being wrong in general.
- Correct structure, wrong semantics. The code does something. It does not do the thing you described. Especially common with window functions, join semantics, and aggregation logic.
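The context-free defaults item is the easiest one to show in a few lines. Here is a minimal sketch with made-up numbers: two defensible ways to average a column that contains a NULL, which give different answers. Both function names are hypothetical.

```python
def mean_skipping_nulls(values):
    """Average only the values that are present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def mean_nulls_as_zero(values):
    """Average over every row, counting a missing value as 0."""
    if not values:
        return None
    return sum(0 if v is None else v for v in values) / len(values)


# Made-up column with one missing measurement.
latencies_ms = [10, None, 20]
skipped = mean_skipping_nulls(latencies_ms)
zeroed = mean_nulls_as_zero(latencies_ms)
print(skipped, zeroed)  # 15.0 10.0
```

Neither answer is wrong in general. Generated code will pick one silently, so the choice has to be checked against what a missing value means in your data.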
What It's Actually Good For
The tasks where GPT-3 earns its keep are the ones that are well-specified and structurally standard: boilerplate Spark DataFrame operations on named columns, Airflow operator configurations, schema DDL, data type conversions. These are tasks where the question is "how do I express this known pattern in this API" rather than "how do I solve this problem."
It's also useful as a starting point for code you intend to review carefully. Generate the scaffold, audit it, fix the gaps. Faster than writing from scratch; requires more scrutiny than you'd apply to code written by a senior colleague.
The Trust Calibration Problem
The thing that makes GPT-3 more dangerous than a junior engineer is that it doesn't signal uncertainty. A junior engineer writes code and says "I'm not sure about the window function here, can you check?" GPT-3 writes the same uncertain code with the same confidence it uses for the things it's sure about. There's no verbal tell. You have to supply the skepticism yourself.
Trust calibration is the meta-skill that makes GPT-3 useful rather than harmful. The engineers who will get the most value from this model are the ones who already know enough to recognize the failure modes — which means it augments senior engineers more than it substitutes for junior ones. At least for now.
If you've gotten API access and run through your own test suite of data engineering tasks, I'd like to compare failure mode notes. As always, I'm here to help.