From SQL Agent Jobs to Databricks Jobs: The Operational Model Shift

SQL Server DBAs have SQL Server Agent. It's deeply integrated with the database, it knows about databases and jobs and schedules, and it's been around long enough that everyone knows how it works. Moving to Databricks means you no longer have SQL Server Agent. The question is what you have instead — and the answer is more interesting than just "a scheduler."

What Databricks Jobs Does

A Databricks job runs a notebook (or a JAR, or a Python script) on a schedule or in response to a trigger. At its core, it's a scheduler wrapped around your code. But there are some important differences from SQL Agent worth naming explicitly.

The cluster is separate from the job. In SQL Server, the Agent job runs on the SQL Server instance — the compute is implicit. In Databricks, you specify what cluster the job runs on. This means you have explicit control over how much compute a job gets, and you can size different jobs differently.

Jobs don't know about each other by default. SQL Agent sequences the steps inside a job through each step's on-success and on-failure actions. Databricks Jobs offers the equivalent through multi-task jobs with dependency chains, but you have to wire them up explicitly. There's no "run task B after task A succeeds" unless you define that relationship in your job configuration.

Notifications are more limited than Database Mail. Databricks Jobs can send email alerts when a run succeeds or fails, but for anything more sophisticated you're building it yourself or using an orchestration layer like Airflow.
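As a minimal sketch of the built-in email alerts, the snippet below adds an email_notifications block to a job's settings before they are submitted. The on_failure list is a Jobs API setting; the helper name with_failure_alerts and the recipient address are hypothetical.

```python
import json


def with_failure_alerts(job_settings, recipients):
    """Return a copy of job_settings that emails recipients when a run fails."""
    updated = dict(job_settings)  # shallow copy; the caller's dict is untouched
    updated["email_notifications"] = {"on_failure": list(recipients)}
    return updated


base_settings = {"name": "Daily Order Summary"}
alerting_settings = with_failure_alerts(base_settings, ["dba-team@example.com"])
print(json.dumps(alerting_settings, indent=2))
```

Anything beyond a plain email (paging, ticket creation, custom message bodies) would have to live in your own code or in an orchestrator.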

Creating a Basic Job

# Via the Jobs API (POST /api/2.1/jobs/create; the "tasks" array requires API 2.1)
{
  "name": "Daily Order Summary",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Chicago"
  },
  "tasks": [{
    "task_key": "summarize_orders",
    "notebook_task": {
      "notebook_path": "/pipelines/daily_order_summary",
      "base_parameters": {
        "environment": "prod"
      }
    },
    "new_cluster": {
      "spark_version": "13.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 2
    }
  }]
}

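Or create it through the Jobs UI (Jobs → Create Job) and configure the schedule there. Cron syntax is Quartz format, which differs slightly from standard Unix cron: it adds a leading seconds field and uses ? to leave either day-of-month or day-of-week unspecified. The expression above, 0 0 6 * * ?, fires at 6:00 AM every day in the configured timezone.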
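If you'd rather script the API call, the sketch below builds the HTTP request with Python's standard library. It targets the 2.1 endpoint, since that is the version that accepts a "tasks" array. The workspace URL, the token, and the helper name build_create_job_request are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_create_job_request(host, token, job_settings):
    """Build (but do not send) a POST request that creates a job."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(job_settings).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace URL and token; substitute your own.
request = build_create_job_request(
    "https://example-workspace.azuredatabricks.net/",
    "dapi-EXAMPLE-TOKEN",
    {"name": "Daily Order Summary"},
)
print(request.get_method(), request.full_url)
# To submit it for real: urllib.request.urlopen(request)
```

Keeping the request construction separate from the send makes it easy to inspect or log the payload before anything touches the workspace.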

Multi-Task Jobs and Dependencies

When you have a pipeline with stages that need to run in order, define them as tasks within a single job with explicit dependencies:

{
  "tasks": [
    {
      "task_key": "extract_raw",
      "notebook_task": {"notebook_path": "/pipelines/extract"}
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{"task_key": "extract_raw"}],
      "notebook_task": {"notebook_path": "/pipelines/transform"}
    },
    {
      "task_key": "load_gold",
      "depends_on": [{"task_key": "transform_silver"}],
      "notebook_task": {"notebook_path": "/pipelines/load"}
    }
  ]
}

Each task runs only if the tasks it depends on succeed. If extract_raw fails, transform_silver and load_gold don't run and the job is marked as failed. This is the equivalent of SQL Agent job steps whose on-failure action is "Quit the job reporting failure."
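To make the dependency semantics concrete, here is a small local simulation of the three-task chain. It is not Databricks code: the status labels SUCCESS, FAILED, and SKIPPED and the function simulate_run are illustrative names, and the sketch assumes the default behavior where a task runs only when everything upstream of it succeeded. It needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Task graph from the example: task_key -> set of upstream task_keys.
DEPENDS_ON = {
    "extract_raw": set(),
    "transform_silver": {"extract_raw"},
    "load_gold": {"transform_silver"},
}


def simulate_run(depends_on, failing=frozenset()):
    """Return {task_key: status} for one simulated run of the job."""
    outcome = {}
    # static_order() yields every task after all of its upstream tasks.
    for task in TopologicalSorter(depends_on).static_order():
        if any(outcome[up] != "SUCCESS" for up in depends_on[task]):
            outcome[task] = "SKIPPED"  # something upstream did not succeed
        elif task in failing:
            outcome[task] = "FAILED"
        else:
            outcome[task] = "SUCCESS"
    return outcome


print(simulate_run(DEPENDS_ON))
print(simulate_run(DEPENDS_ON, failing={"transform_silver"}))
```

In the second call, extract_raw succeeds, transform_silver fails, and load_gold never starts, which mirrors what the job would report.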

The Gap: Cross-Job Orchestration

What Databricks Jobs doesn't do well (yet) is coordinate across multiple independent jobs. If you have 20 separate jobs and job 15 should only run after jobs 3 and 7 complete, you need Airflow or a similar orchestrator to manage that. Databricks Jobs handles pipeline-internal sequencing well. Cross-pipeline orchestration is a separate problem.
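The core check an external orchestrator performs for the "job 15 after jobs 3 and 7" case can be sketched in a few lines. This is only the gating decision, under the assumption that something else collects each job's most recent result; the UPSTREAM map, the status strings, and ready_to_run are hypothetical names, and a real orchestrator adds scheduling, retries, and state tracking on top.

```python
# Hypothetical dependency map: job 15 waits on jobs 3 and 7.
UPSTREAM = {15: {3, 7}}


def ready_to_run(job_id, latest_result, upstream=UPSTREAM):
    """True when every upstream job's most recent run succeeded."""
    return all(
        latest_result.get(dep) == "SUCCESS"
        for dep in upstream.get(job_id, ())
    )


print(ready_to_run(15, {3: "SUCCESS", 7: "SUCCESS"}))
print(ready_to_run(15, {3: "SUCCESS", 7: "FAILED"}))
print(ready_to_run(15, {3: "SUCCESS"}))  # job 7 has not reported yet
```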

For most teams starting out, Databricks Jobs handles 80% of what SQL Agent was doing. The remaining 20% — complex inter-job dependencies, cross-platform orchestration, conditional logic — that's where you eventually add an orchestration layer. Start with Jobs. Add Airflow when you feel the gaps. As always, I'm here to help.
