Delta Lake VACUUM: The Maintenance Task Nobody Tells You About
When you run an UPDATE or DELETE on a Delta table, the old version of the data doesn't disappear. Delta writes new Parquet files with the changed rows and marks the old files as removed in the transaction log — but the old files stay on disk. That's what enables time travel and point-in-time reads.
Over time, those old files accumulate. A table that gets updated daily for a year has 365 sets of data files sitting in your storage account, most of which you'll never read again. VACUUM is how you clean them up.
Running VACUUM
-- Remove files older than the default retention period (7 days / 168 hours)
VACUUM orders
-- Explicitly set the retention period
VACUUM orders RETAIN 168 HOURS
-- Check what would be deleted without actually deleting anything
VACUUM orders DRY RUNRun the DRY RUN first. It shows you exactly which files would be removed. When you're comfortable with the list, run it without the flag.
The Retention Period and Time Travel
The default retention is 7 days. If you run VACUUM with that default, you lose the ability to time-travel to versions older than 7 days — the underlying data files for those versions are gone.
-- This fails after VACUUM removes files older than 7 days
-- (assuming version 50 was created 10 days ago)
SELECT * FROM orders VERSION AS OF 50If you need longer time travel capability, keep your VACUUM retention period at least as long as your recovery window:
-- Keep data files for 30 days
VACUUM orders RETAIN 720 HOURS
-- Also extend the transaction log retention to match
ALTER TABLE orders
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')The log retention and the VACUUM retention work together. Mismatching them — long log retention with short VACUUM — creates log entries that reference data files that no longer exist, which breaks time travel for those versions.
The Safety Check
Delta won't let you VACUUM with less than 168 hours (7 days) retention by default. Try it and you get:
AnalysisException: requirement failed: Are you sure you would like to vacuum files
with such a low retention period?This is a guardrail: a concurrent reader that started before your VACUUM and hasn't finished yet might be trying to access a file you just deleted. You can override it:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM orders RETAIN 0 HOURS;This is appropriate for development environments where you want to immediately free up storage and you're certain nothing is reading the table concurrently. Don't touch it in production.
When to Run It
VACUUM is a scheduled maintenance task, not something that runs at the end of a pipeline. Right cadence by use case:
- Tables updated daily: weekly VACUUM
- Tables with streaming appends: monthly (fewer old files to clean)
- Tables with heavy MERGE or UPDATE operations: more frequent — the file accumulation is faster
Set it up as a separate Databricks job on a schedule. VACUUM on a large table takes time, and you don't want it holding up data loads. The SQL Server analogy is a scheduled shrink or cleanup job — you do it on a maintenance window, not at peak hours. Same principle. As always, I'm here to help.