dbutils.fs: File System Operations When Your Storage Is Not a File System

If you're used to SQL Server, your data lives in a known location: a database, on a server, at a port. When you move to Databricks, your data lives in object storage (S3, ADLS, or GCS), which looks like a filesystem if you squint but behaves differently enough to cause real confusion.

DBFS (Databricks File System) is the abstraction layer. It sits on top of whatever cloud storage backs your workspace and gives you a unified path syntax. The tool for navigating it is dbutils.fs.

The Basic Operations

# List files in a path
display(dbutils.fs.ls("dbfs:/mnt/myproject/"))

# Create a directory
dbutils.fs.mkdirs("dbfs:/mnt/myproject/staging/2019/01/")

# Copy a file
dbutils.fs.cp("dbfs:/mnt/myproject/raw/orders.parquet",
              "dbfs:/mnt/archive/orders_20190112.parquet")

# Recursive directory copy
dbutils.fs.cp("dbfs:/mnt/myproject/raw/", "dbfs:/mnt/archive/", recurse=True)

# Delete a file
dbutils.fs.rm("dbfs:/mnt/myproject/staging/temp.csv")

# Delete a directory recursively
dbutils.fs.rm("dbfs:/mnt/myproject/staging/", recurse=True)

dbutils.fs.ls() returns a list of FileInfo objects with path, name, size, and modificationTime. In a notebook, display() renders it as a table. In a script, iterate over it directly:

for f in dbutils.fs.ls("dbfs:/mnt/myproject/raw/"):
    if f.size > 100_000_000:  # files over 100MB
        print(f.name, f.size)
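
dbutils.fs.ls() lists a single level, so walking a directory tree takes a little extra code. The following is a minimal sketch, not an official utility: walk_files and total_bytes are hypothetical helper names, and the ls parameter exists only so the code can run outside a notebook. It assumes directory entries are reported with a path ending in "/", which is how DBFS listings normally present them.

```python
def walk_files(ls, root):
    """Yield every file entry under root, descending into subdirectories.

    ls is any callable that behaves like dbutils.fs.ls (a hypothetical
    parameter, so the helper can be exercised with a stand-in).
    Assumption: directory entries have a path ending in "/".
    """
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in ls(current):
            if entry.path.endswith("/"):
                stack.append(entry.path)  # directory: visit it later
            else:
                yield entry  # regular file


def total_bytes(ls, root):
    """Sum the sizes of all files found under root."""
    return sum(entry.size for entry in walk_files(ls, root))
```

In a notebook you would pass the real function, for example total_bytes(dbutils.fs.ls, "dbfs:/mnt/myproject/raw/"). On very large trees each ls call is a round trip to object storage, so expect this to be slow.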

Mount Points: Where Your Actual Storage Lives

DBFS root (dbfs:/) is backed by your workspace's default storage — fine for scratch, not right for real data. Your actual datasets live in storage accounts or S3 buckets you've provisioned. You access those through mount points.

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type":
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id":
    dbutils.secrets.get("myproject", "sp-client-id"),
  "fs.azure.account.oauth2.client.secret":
    dbutils.secrets.get("myproject", "sp-client-secret"),
  "fs.azure.account.oauth2.client.endpoint":
    "https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token"
}

dbutils.fs.mount(
  source="abfss://data@mystorageaccount.dfs.core.windows.net/",
  mount_point="/mnt/myproject",
  extra_configs=configs
)
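
Calling dbutils.fs.mount() on a mount point that is already in use raises an error, which makes setup notebooks awkward to re-run. One way around that is to check dbutils.fs.mounts() first. The sketch below assumes the entries it returns expose mountPoint and source attributes; ensure_mounted is a hypothetical helper name, and fs is a parameter standing in for dbutils.fs so the logic can be tested without a cluster.

```python
def ensure_mounted(fs, source, mount_point, extra_configs):
    """Mount source at mount_point unless it is already mounted there.

    fs is expected to behave like dbutils.fs (hypothetical parameter).
    Returns True if a new mount was created, False if the same source
    was already mounted. Raises ValueError if the mount point is taken
    by a different source.
    """
    wanted_point = mount_point.rstrip("/")
    wanted_source = source.rstrip("/")
    for m in fs.mounts():
        if m.mountPoint.rstrip("/") == wanted_point:
            if m.source.rstrip("/") != wanted_source:
                raise ValueError(
                    f"{mount_point} is already mounted from {m.source}"
                )
            return False  # nothing to do
    fs.mount(source=source, mount_point=mount_point,
             extra_configs=extra_configs)
    return True
```

In a notebook the call would look like ensure_mounted(dbutils.fs, "abfss://data@mystorageaccount.dfs.core.windows.net/", "/mnt/myproject", configs).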

Once mounted, /mnt/myproject/ maps to that storage container. A mount is visible to the whole workspace, and every user reads and writes through it with the permissions of the service principal whose credentials created it, so think carefully before mounting production storage into a shared development workspace. Everyone on that workspace gets access to everything under that mount point.

The Path Syntax Gotcha

DBFS paths come in two forms, and documentation uses them interchangeably:

  • dbfs:/mnt/myproject/data.parquet - full DBFS URI, explicit about the file system, the usual form in dbutils.fs calls
  • /mnt/myproject/data.parquet - short form, works in spark.read, Spark SQL, and most APIs

# Both of these work in DataFrameReader
df = spark.read.parquet("dbfs:/mnt/myproject/orders/")
df = spark.read.parquet("/mnt/myproject/orders/")

dbutils.fs accepts the short form too, since it resolves scheme-less paths against DBFS, but the dbfs:/ prefix makes it obvious which file system a call touches, so prefer it there. For everything else, either works. Pick one convention per codebase and stop thinking about it.
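
One way to hold a codebase to a single convention is to normalize paths at the boundary, where they enter your code. This is a minimal sketch in plain Python; to_dbfs_uri and to_short_path are hypothetical helpers, not part of dbutils, and they deliberately reject anything that is not an absolute DBFS path (for example an abfss:// or s3:// URI).

```python
def to_dbfs_uri(path):
    """Return the dbfs:/ form of a DBFS path given in either notation."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/"):
        return "dbfs:" + path
    raise ValueError(f"expected an absolute DBFS path, got {path!r}")


def to_short_path(path):
    """Return the short /... form of a DBFS path given in either notation."""
    if path.startswith("dbfs:/"):
        return path[len("dbfs:"):]
    if path.startswith("/"):
        return path
    raise ValueError(f"expected an absolute DBFS path, got {path!r}")
```

For instance, to_dbfs_uri("/mnt/myproject/orders/") gives "dbfs:/mnt/myproject/orders/", and to_short_path reverses it.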

DBFS Root Is Shared

dbfs:/user/ and dbfs:/tmp/ are shared across all workspace users. Drop a dataset in dbfs:/tmp/ and every user on the workspace can see it. Workspace admins can see all of DBFS root.

Treat DBFS root as temporary scratch space. Anything you want isolated per team, per project, or per sensitivity level goes under /mnt/ in a dedicated storage container. This is where your data governance model shifts when you leave SQL Server: instead of database permissions, you're managing storage account access controls and mount point scoping. It's a different layer, but it's still a layer.