ADF has been generally available for months. The connector list has grown. Copy Activity has matured. PolyBase support for SQL Data Warehouse is excellent. And there is still no git integration. I've been patient about this. I'm done being patient.
What Git Integration Would Actually Mean
Let me be specific about what I'm asking for, because "git integration" can mean a lot of things. Here's what it needs to mean for ADF:
- Link your data factory to a git repository (GitHub, Azure DevOps, Bitbucket; the host doesn't matter)
- When you create or edit a linked service, dataset, or pipeline in the portal, the change is committed to the repository automatically — or at minimum, saved as a draft and committed on explicit user action
- Before deploying to production, a pull request workflow validates and approves the change
- Full history and diff are available for every factory object — you can see what changed between two versions and who changed it
- The git repository is the source of truth; the portal is an IDE that reads from and writes to the repo
That's it. This is not a novel requirement. Databricks notebooks can be linked to git repositories. Terraform configuration lives in version-controlled files that map to actual infrastructure. Every other artifact-based tool in the modern data stack has figured this out. ADF has not.
What SSIS Developers Have Had for Years
An SSIS project is a folder. The project file is XML. Each package is an XML file. Drop the folder in a git repository and you have complete version history for every package in the project. Diff a package between two commits and you can see exactly which transforms changed, which connections were added, which data flow columns were modified.
This isn't a feature Microsoft shipped for SSIS. It's a consequence of the fact that SSIS stores its artifacts as text files on disk. Files on disk go in git. This has been true since SSIS shipped in 2005. ADF stores its artifacts as JSON documents in Azure — also text — and somehow managed to build a tool that gives you no path to put those JSON documents in git from within the tool itself.
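To make the "it's just text" point concrete, here is a minimal sketch of what a per-object diff looks like once the JSON is on disk. It is written in Python rather than PowerShell so it runs anywhere; the function names and the sample pipeline fragments are hypothetical, not real ADF output. Serializing with sorted keys first is my own assumption about how to keep the diff free of key-order noise.

```python
import difflib
import json


def canonical_lines(definition: dict) -> list[str]:
    """Serialize with sorted keys and fixed indentation so diffs are stable."""
    return json.dumps(definition, indent=2, sort_keys=True).splitlines()


def diff_definitions(old: dict, new: dict, name: str) -> str:
    """Return a unified diff between two versions of one factory object."""
    lines = difflib.unified_diff(
        canonical_lines(old),
        canonical_lines(new),
        fromfile=f"{name} (old)",
        tofile=f"{name} (new)",
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical pipeline fragments for illustration only.
    old = {"name": "DailySalesPipeline", "properties": {"description": "v1"}}
    new = {"name": "DailySalesPipeline", "properties": {"description": "v2"}}
    print(diff_definitions(old, new, "pipelines/DailySalesPipeline.json"))
```

This is the view `git diff` already gives you for free once the files are committed. The sketch only shows that nothing about the artifact format stands in the way.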
The Manual Workaround I Use
I keep a git repository with a folder structure mirroring the factory:
    my-factory/
        linkedservices/
            AzureStorageLinkedService.json
            OnPremSQLLinkedService.json
        datasets/
            DailySalesBlob.json
            StagingSalesTable.json
        pipelines/
            DailySalesPipeline.json
        deploy.ps1
        deploy-config.dev.ps1
        deploy-config.prod.ps1
The deploy script reads environment config (connection strings, account keys — not committed), injects values into linked service templates, and deploys each object type in order (linked services first, then datasets, then pipelines — dependency order matters):
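    # deploy.ps1
    # Run from the repository root, e.g.: .\deploy.ps1 -Environment prod
    param([string]$Environment = "dev")

    # Dot-source the environment config; it is expected to define $Config.
    . ".\deploy-config.$Environment.ps1"
    $ResourceGroup = $Config.ResourceGroup
    $FactoryName = $Config.FactoryName

    # 1. Linked services: inject secrets into each template, then deploy.
    Get-ChildItem ".\linkedservices\*.json" | ForEach-Object {
        $json = Get-Content $_.FullName -Raw
        $json = $json.Replace("{{STORAGE_KEY}}", $Config.StorageKey)
        $json = $json.Replace("{{SQL_PASSWORD}}", $Config.SQLPassword)
        # Build the temp path by hand: GetTempFileName() creates its own .tmp file,
        # which appending ".json" would leave behind as an orphan.
        $tmpFile = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName() + ".json")
        $json | Set-Content $tmpFile
        try {
            New-AzureRmDataFactoryLinkedService -ResourceGroupName $ResourceGroup -DataFactoryName $FactoryName -File $tmpFile -Force
        }
        finally {
            # The temp file contains secrets; remove it even if the deploy fails.
            Remove-Item $tmpFile
        }
    }

    # 2. Datasets (they reference linked services).
    Get-ChildItem ".\datasets\*.json" | ForEach-Object {
        New-AzureRmDataFactoryDataset -ResourceGroupName $ResourceGroup -DataFactoryName $FactoryName -File $_.FullName -Force
    }

    # 3. Pipelines (they reference datasets).
    Get-ChildItem ".\pipelines\*.json" | ForEach-Object {
        New-AzureRmDataFactoryPipeline -ResourceGroupName $ResourceGroup -DataFactoryName $FactoryName -File $_.FullName -Force
    }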
This works. I've been running it for six months. But it requires discipline, and discipline is fragile under pressure.
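One fragile spot in the placeholder injection is that a template can gain a new `{{TOKEN}}` that nobody adds to the script, and the literal token then gets deployed as the secret. Below is a hedged sketch of a stricter substitution step, in Python rather than PowerShell so it can be tried offline. The function name `render_template` and the token pattern are my assumptions, and it assumes every placeholder sits inside a JSON string literal, which is why values are JSON-escaped before insertion.

```python
import json
import re

# Assumed token shape: {{UPPER_SNAKE_CASE}}, matching the templates in this post.
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def render_template(text: str, values: dict[str, str]) -> str:
    """Replace {{NAME}} tokens; raise if any token has no configured value."""
    found = {match.group(1) for match in PLACEHOLDER.finditer(text)}
    missing = sorted(found - set(values))
    if missing:
        raise KeyError(f"no value configured for: {', '.join(missing)}")

    def substitute(match: re.Match) -> str:
        # json.dumps adds surrounding quotes; strip them and keep the escaping,
        # so a quote or backslash in a password cannot break the document.
        return json.dumps(values[match.group(1)])[1:-1]

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    template = '{"connectionString": "AccountKey={{STORAGE_KEY}}"}'
    print(render_template(template, {"STORAGE_KEY": "not-a-real-key"}))
```

Failing loudly on an unknown token turns a silent bad deploy into an error at deploy time, which is the cheapest place to catch it.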
The Discipline Tax
Here is what goes wrong when teams don't maintain the discipline:
Portal edits that don't get committed. Someone fixes a bug in a pipeline directly in the portal — it's faster than editing the JSON locally, deploying, and committing. The fix works. Three weeks later, the next full deploy from git overwrites the fix because the JSON in git doesn't have it. The bug is back. No one can explain why.
Environment drift. Dev factory gets a new dataset added via portal. The developer forgets to add it to the git repo. Prod factory never gets the dataset. A pipeline that worked in dev fails in prod. Hours of debugging determine the root cause is a missing dataset definition that exists in dev but was never committed.
New team members who don't know the convention. The convention lives in a README. New developers read documentation inconsistently. The first time a new developer edits the portal directly, the discipline gap is established. Once the convention breaks, restoring it requires a full reconciliation of factory state vs. git state — which is manual, tedious, and error-prone.
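The reconciliation itself is mechanical once both sides are available as parsed JSON. The sketch below, in Python, assumes you already have some way to export the deployed definitions keyed by object name (this post does not describe one) and that the repo side is loaded the same way. The function name `reconcile` and the category labels are hypothetical. Expect linked services to show up as changed, since the repo holds templates with placeholders, and treat any properties the service adds on its side as noise to filter out.

```python
def reconcile(repo: dict[str, dict], deployed: dict[str, dict]) -> dict[str, list[str]]:
    """Classify factory objects by comparing repo state with deployed state.

    Both arguments map an object name to its parsed JSON definition.
    Dict comparison ignores key order, so only real differences count.
    """
    shared = repo.keys() & deployed.keys()
    return {
        # Committed but never deployed to this environment.
        "only_in_repo": sorted(repo.keys() - deployed.keys()),
        # Created in the portal and never committed.
        "only_in_factory": sorted(deployed.keys() - repo.keys()),
        # Present on both sides with different definitions.
        "changed": sorted(name for name in shared if repo[name] != deployed[name]),
    }


if __name__ == "__main__":
    # Hypothetical, heavily trimmed definitions for illustration only.
    repo = {
        "DailySalesBlob": {"properties": {"type": "AzureBlob"}},
        "StagingSalesTable": {"properties": {"type": "AzureSqlTable"}},
    }
    deployed = {
        "DailySalesBlob": {"properties": {"type": "AzureBlob"}},
        "AdHocPortalDataset": {"properties": {"type": "AzureBlob"}},
    }
    for category, names in reconcile(repo, deployed).items():
        print(category, names)
```

Run against dev and prod separately, a report like this would have flagged the missing dataset from the drift scenario before the pipeline failed.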
What Microsoft Should Build
The ADF portal should become an IDE for a git-linked repository. Link a factory to a repo in settings. Every save in the portal is a commit (or draft). The "publish" action deploys from the repo to the factory. The portal shows the diff between current branch state and what's deployed to prod. Pull request workflow for prod deploys. This is exactly the model that Databricks notebooks use, and it works.
I've heard "this is on the roadmap." I have been hearing this since the preview. The V2 announcement mentions improved authoring — I'll believe git integration when I see it deployed, not when it's described in a blog post.
Until it ships, the workaround works. Build the discipline into your team's process from day one. Make the portal a monitoring-only interface by convention. I'm here to help if you need to set up the PowerShell deployment workflow for your factory.