The Config-Driven Pipeline: Separating Logic from Data Knowledge
Here is the most common anti-pattern I see in data pipeline code: the pipeline knows the names of its tables. It knows the column names it's going to transform. It knows the business rules. It knows the output schema. All of this — structural knowledge and business logic — lives in the same Python or Scala file, intertwined, with no separation between "how to process data" and "what data to process."
The consequence is that changing anything requires understanding everything. Schema changed upstream? Find every place the column name appears in the processing code. New business rule? Edit the transformation code and hope it doesn't accidentally touch the structural code. Add a new table to the pipeline? Copy the entire job file, change the hardcoded table names, and now maintain two copies of the same logic.
There is a better way. It's called config-driven design, and software engineers have been using it for decades.
The Pattern
Separate your pipeline into two layers: a configuration layer that knows about your specific data (table names, column mappings, business rules, source and target locations) and a logic layer that knows how to process data generically. The logic layer reads the configuration at runtime and applies it. Changing what data you process means changing the config. Changing how you process it means changing the logic — once, in one place, for all tables.
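To make the split concrete before Spark enters the picture, here is a minimal sketch that uses plain Python dicts as stand-ins for rows. The names `TABLE_CONFIG` and `apply_config`, and the `required` key, are illustrative assumptions rather than part of any library. The only point is that the function contains no table or column names of its own.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Configuration layer: everything specific to one dataset lives here.
# (Hypothetical example values, chosen only for illustration.)
TABLE_CONFIG = {
    "name": "page_views",
    "column_mappings": {"page_url": "url", "event_timestamp": "ts"},
    "required": ["page_url"],
}


def apply_config(rows: List[Row], config: Dict[str, Any]) -> List[Row]:
    """Logic layer: rename and filter rows using only what the config says."""
    mappings = config["column_mappings"]
    required = config.get("required", [])
    out = []
    for row in rows:
        # Drop rows that lack a value for any required source column.
        if any(row.get(col) is None for col in required):
            continue
        # Keep only mapped columns, renamed from source name to target name.
        out.append({tgt: row.get(src) for src, tgt in mappings.items()})
    return out


rows = [
    {"page_url": "/home", "event_timestamp": 1700000000, "debug": True},
    {"page_url": None, "event_timestamp": 1700000001, "debug": False},
]
result = apply_config(rows, TABLE_CONFIG)
```

Pointing the same function at a different table is a matter of passing a different config dict; `apply_config` itself does not change.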
A Concrete Example
Suppose you're building daily ingestion for a set of event tables: page views, clicks, and conversions. The naive approach: three separate Spark jobs, each with hardcoded source paths, hardcoded column transformations, and hardcoded target paths. When the source schema changes, you update three files. When you add a fourth event type, you write a fourth job.
The config-driven approach: one config file, one job.
    {
      "tables": [
        {
          "name": "page_views",
          "source_path": "s3://data-lake/raw/page_views/",
          "target_path": "s3://data-lake/processed/page_views/",
          "partition_column": "event_date",
          "column_mappings": {
            "user_id": "user_id",
            "page_url": "url",
            "event_timestamp": "ts"
          },
          "filter": "event_type = 'page_view'"
        },
        {
          "name": "clicks",
          "source_path": "s3://data-lake/raw/clicks/",
          "target_path": "s3://data-lake/processed/clicks/",
          "partition_column": "event_date",
          "column_mappings": {
            "user_id": "user_id",
            "element_id": "target",
            "event_timestamp": "ts"
          },
          "filter": "event_type = 'click'"
        }
      ]
    }

And the job that reads it:

    import json

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F


    def run_ingest(config_path: str, date: str) -> None:
        spark = (
            SparkSession.builder
            .appName("config-driven-ingest")
            # Overwrite only the partitions this run writes, not the whole target
            .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
            .getOrCreate()
        )

        with open(config_path) as f:
            config = json.load(f)

        for table in config["tables"]:
            source = spark.read.parquet(f"{table['source_path']}{date}/")

            # Apply the filter first: it refers to source column names
            if table.get("filter"):
                source = source.filter(table["filter"])

            # Apply column mappings (source name -> target name)
            renamed = source.select([
                F.col(src).alias(tgt)
                for src, tgt in table["column_mappings"].items()
            ])

            # Assumption: the run date in the source path is the partition value
            renamed = renamed.withColumn(table["partition_column"], F.lit(date))

            # Write to target, partitioned by date
            (
                renamed.write
                .mode("overwrite")
                .partitionBy(table["partition_column"])
                .parquet(table["target_path"])
            )

            # Note: count() triggers a second pass over the source data
            print(f"Processed {table['name']}: {renamed.count()} rows")

Adding a third event type means adding one entry to the JSON file. No code change. Adding a new transformation step means updating the logic function once, and it applies to all tables immediately.
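Because the config is now the thing people edit, it helps to check it before any Spark work starts. The following is a minimal sketch of such a check, assuming the JSON layout shown earlier. `validate_config` and `REQUIRED_KEYS` are hypothetical names, and a real project might prefer a schema library instead.

```python
from typing import Any, Dict, List

# Keys every table entry must define, taken from the example config.
REQUIRED_KEYS = ("name", "source_path", "target_path",
                 "partition_column", "column_mappings")


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    tables = config.get("tables", [])
    if not tables:
        problems.append("config defines no tables")
    seen_names = set()
    for i, table in enumerate(tables):
        label = table.get("name", f"tables[{i}]")
        for key in REQUIRED_KEYS:
            if key not in table:
                problems.append(f"{label}: missing '{key}'")
        if label in seen_names:
            problems.append(f"{label}: duplicate table name")
        seen_names.add(label)
        # Two source columns mapped to the same target would collide.
        targets = list(table.get("column_mappings", {}).values())
        if len(targets) != len(set(targets)):
            problems.append(f"{label}: duplicate target column")
    return problems


# A deliberately broken entry: three keys missing, one target used twice.
bad = {"tables": [{"name": "clicks",
                   "column_mappings": {"a": "x", "b": "x"}}]}
problems = validate_config(bad)
```

Calling this at the top of the job and refusing to run when the list is non-empty turns a typo in the config into an immediate, readable failure instead of a Spark error halfway through the table loop.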
The Audit Benefit
When your pipeline config is a structured file, you can log it. Capture the config version (a content hash, or the commit it came from) alongside every pipeline run: in a database table, in the job output metadata, or in a log file. You now have a complete record of what configuration drove what output. When a business analyst asks "why did the conversion rate look different in March than in April?", you can check whether the column mapping changed between those two months rather than excavating git history for the relevant commit.
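One way to get a config version without extra tooling is to hash the config itself. This is a minimal sketch under that assumption. `config_fingerprint` and `build_run_record` are hypothetical helpers, and where the resulting record gets stored is left open.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Hash the config in a canonical form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_record(config: Dict[str, Any], run_date: str) -> Dict[str, Any]:
    """Assemble the metadata to store next to a run's output."""
    return {
        "run_date": run_date,
        "config_sha256": config_fingerprint(config),
        "config": config,  # the full config, so a later diff is possible
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Two runs with the same `config_sha256` were driven by the same configuration. If the hashes differ, diffing the two stored `config` values shows exactly which mapping or filter changed.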
Where This Gets Harder
Config-driven design works cleanly when your transformations are structurally similar across tables. When one table needs custom business logic that's genuinely unlike everything else, forcing it through a generic config structure produces contorted config that's harder to read than just writing the specific code. Know when to break the pattern — use it for the 80% of pipeline work that's structurally regular, and write explicit code for the 20% that isn't.
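One possible way to break the pattern without abandoning it is an escape hatch: a table entry can name a hand-written function, and everything else goes through the generic path. The sketch below uses plain dicts as stand-in rows. The `custom_transform` config key, the `order_id` column, and the `dedupe_conversions` rule are all hypothetical, invented only to illustrate the shape.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Transform = Callable[[List[Row]], List[Row]]

# Registry of hand-written transforms, keyed by the name used in the config.
CUSTOM_TRANSFORMS: Dict[str, Transform] = {}


def register(name: str) -> Callable[[Transform], Transform]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Transform) -> Transform:
        CUSTOM_TRANSFORMS[name] = fn
        return fn
    return decorator


def generic_transform(rows: List[Row], table: Dict[str, Any]) -> List[Row]:
    """The regular path: rename columns according to the config."""
    mappings = table["column_mappings"]
    return [{tgt: row[src] for src, tgt in mappings.items()} for row in rows]


@register("dedupe_conversions")
def dedupe_conversions(rows: List[Row]) -> List[Row]:
    """Hypothetical one-off rule: keep the first row seen per order_id."""
    seen, out = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(row)
    return out


def process(rows: List[Row], table: Dict[str, Any]) -> List[Row]:
    """Use the named custom transform if present, else the generic path."""
    custom = table.get("custom_transform")
    if custom is not None:
        return CUSTOM_TRANSFORMS[custom](rows)  # KeyError if unregistered
    return generic_transform(rows, table)
```

The config stays readable because it only names the exception, and the irregular logic lives in ordinary code where it can be tested on its own.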
If you've implemented this pattern on your pipelines and hit the point where the config model breaks down, I'd like to hear where the seams were. As always, I'm here to help.