Shannon Lowder (Page 26)

Delta Lake + Structured Streaming: ACID for Your Kafka Consumer

Writing a Spark Structured Streaming job that reads from Kafka and writes to Parquet files sounds straightforward until you watch it crash mid-write and leave your output directory in an ambiguous state. Did it write all the records for that micro-batch? Half of them? None? The checkpoint committed, but the

ADF Git Integration One Year Later: The Good, The Gotcha, and What's Still Missing

It's been a year since ADF git integration shipped. My team has been living with it — daily authoring, PRs, deployments — long enough that the novelty has worn off and the real picture is clear. Here's the honest 12-month assessment. What's Genuinely Good No More

Incremental Loads in Databricks: The Patterns That Actually Work

Most data engineering pipelines start with full loads. You truncate and reload. Every run processes every row from the source. It works, it's simple, and it ages badly — as your source tables grow, your pipeline run time grows with them, until you're running a 4-hour job

The Bronze-Silver-Gold Architecture: Organizing Your Data Lake the Right Way

One of the first things I noticed when I started working seriously with data lakes is that the same raw data would get processed in slightly different ways by different teams, and the "current version" of a customer record or a product attribute was genuinely hard to determine.

Cluster Policies: Guardrails for Multi-User Databricks Workspaces

Multi-user Databricks workspaces have a problem that SQL Server DBAs will recognize immediately: people will spin up clusters that are too large, leave them running, and run experimental queries on production data when they think no one is looking. The solution in SQL Server is database permissions and resource governor.

From SQL Agent Jobs to Databricks Jobs: The Operational Model Shift

SQL Server DBAs have SQL Server Agent. It's deeply integrated with the database, it knows about databases and jobs and schedules, and it's been around long enough that everyone knows how it works. Moving to Databricks means you no longer have SQL Server Agent. The question

Databricks Job Clusters vs Interactive Clusters: The Cost Math You Need to Do

SQL Server DBAs think about compute costs in one of two ways: either the server is on and you're paying for it regardless, or you're paying licensing and hardware amortized over years. Either way, you're not used to thinking about "this specific query

Delta Lake Write Modes: Append, Overwrite, Merge, and When to Use Each

When you write to a Delta table, you have four modes to choose from. Most people discover them by trial and error — they try overwrite, wonder why their data disappeared, then slowly piece together when each mode is appropriate. This post is the thing I wish I'd read

ADF Wrangling Data Flows: Power Query Inside Your Pipelines

ADF now has two types of data flows, and most people only know about one of them. Mapping Data Flows — the code-free Spark transformation engine — got all the attention in 2018. Wrangling Data Flows shipped quietly and haven't gotten nearly as much coverage. They're worth understanding,

Great Expectations SQLAlchemy Integration: Validating SQL Server Data with Python

The Great Expectations posts so far have focused on pandas DataFrames — CSV files, in-memory data, the data science use case. But most of the data I work with doesn't live in CSVs. It lives in SQL Server, or Azure SQL, or PostgreSQL. And I want to run the

Installing Python Libraries in Databricks: Cluster Scope vs Notebook Scope

At some point after you've been using Databricks for a few months, the question of how to install Python libraries comes up. Not because the built-in libraries aren't comprehensive — they are — but because your work requires something specific. A data quality library, a custom connector, an

The Delta Transaction Log: What Is Inside _delta_log and Why It Matters

Every Delta Lake table has a directory called _delta_log sitting alongside its data files. Most tutorials tell you it exists and leave it at that. Understanding what's actually in it changes how you reason about Delta's behavior — the ACID guarantees, the time travel, and the
