Installing Python Libraries in Databricks: Cluster Scope vs Notebook Scope
At some point after you've been using Databricks for a few months, the question of how to install Python libraries comes up. Not because the built-in libraries aren't comprehensive — they are — but because your work requires something specific. A data quality library, a custom connector, an internal package your team built. The answer changes depending on where and how you need the library.
The Two Scopes
Databricks has two ways to install a Python library, and they're not interchangeable:
- Cluster libraries: installed on the cluster itself, available to all notebooks attached to that cluster, and persistent across cluster restarts (until you remove them)
- Notebook-scoped libraries: installed using %pip in a notebook cell, available only in that notebook, and reset when the cluster restarts or the notebook is detached
Cluster Libraries
Cluster libraries are managed through the Databricks UI (Clusters → your cluster → Libraries tab) or via the Libraries API. You can install from PyPI, Maven, CRAN, DBFS, or a direct file upload.
# Via Databricks CLI
databricks libraries install --cluster-id 1234-567890-abc12 --pypi-package great-expectations==0.13.0
The library installs when the cluster starts or, if the cluster is already running, after a short wait. All notebooks on that cluster can then import it without any additional steps.
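The same install can also be requested over REST. The sketch below only builds the request body, assuming the Libraries API install endpoint (POST /api/2.0/libraries/install). The helper name build_install_payload is hypothetical, and actually sending the request needs your workspace URL and a token, which are not shown here.

```python
import json


def build_install_payload(cluster_id, pypi_packages):
    """Build the JSON body for a library install request (hypothetical helper)."""
    return {
        "cluster_id": cluster_id,
        # One entry per package; each spec can carry a version pin.
        "libraries": [{"pypi": {"package": spec}} for spec in pypi_packages],
    }


payload = build_install_payload(
    "1234-567890-abc12",
    ["great-expectations==0.13.0"],
)
print(json.dumps(payload, indent=2))
```

Keeping the pinned specs in one list like this makes it easier to review the cluster's dependencies in version control rather than in the UI.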
When to use cluster libraries:
- Libraries used by most or all notebooks on a cluster
- Production jobs where the library version needs to be locked
- Libraries that require native extensions or have long install times
Notebook-Scoped Libraries with %pip
%pip install pandas==1.1.0 scipy numpy==1.19.0
Run this in a notebook cell. Databricks resets the notebook's Python state after %pip install runs, discarding variables and imports defined earlier, so put your %pip commands at the top of the notebook before any imports. The library is available for the remainder of the notebook session.
# Always put pip installs before imports
%pip install great-expectations==0.13.0 pyarrow==1.0.0
import great_expectations as ge
import pyarrow as pa
When to use notebook-scoped libraries:
- Notebooks that need a different version of a library than what's on the cluster
- Exploratory work where you want to try a library without touching the cluster config
- Notebooks run by different teams that have different dependency requirements
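The rule that %pip installs come before imports can be checked mechanically. This is a naive sketch, assuming you already have the notebook's cells as a list of source strings. The function name misplaced_pip_cells is hypothetical, and the line matching is deliberately simple (it would also flag the word "import" at the start of a line inside a docstring).

```python
def misplaced_pip_cells(cells):
    """Return indexes of cells that run %pip after an import has already appeared."""
    seen_import = False
    misplaced = []
    for index, cell in enumerate(cells):
        for line in cell.splitlines():
            stripped = line.strip()
            if stripped.startswith("%pip"):
                if seen_import and index not in misplaced:
                    misplaced.append(index)
            elif stripped.startswith(("import ", "from ")):
                seen_import = True
    return misplaced


cells = [
    "%pip install great-expectations==0.13.0 pyarrow==1.0.0",
    "import great_expectations as ge\nimport pyarrow as pa",
    "%pip install scipy",  # comes after imports, so it gets flagged
]
print(misplaced_pip_cells(cells))  # [2]
```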
Version Conflicts
This is where teams get into trouble. A cluster library installs version X of a package. A notebook installs version Y using %pip. Within that notebook, the %pip version wins. But if the cluster library version had native extensions already loaded by the Python runtime, you can end up with unexpected behavior.
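One way to spot a mismatch early is to compare what is installed against the versions you expect. The sketch below uses the standard library's importlib.metadata. The pins and the helper name find_version_mismatches are illustrative assumptions. Note that it reports what is installed in the environment, which can differ from a module that was already imported before the install; checking the module's own `__version__`, where it exists, covers that case.

```python
from importlib import metadata


def find_version_mismatches(pins):
    """Return {package: (expected, installed)} for each pin that does not match.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Hypothetical pins, reusing the versions from the examples above.
EXPECTED_PINS = {"great-expectations": "0.13.0", "pyarrow": "1.0.0"}

for name, (expected, installed) in find_version_mismatches(EXPECTED_PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```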
The cleaner approach for production: use cluster libraries with pinned versions, and don't override them in individual notebooks. Reserve %pip for development and exploration. When you find a library combination that works, pin it in the cluster configuration and remove the %pip calls from the notebooks.
One more thing: if you're installing from a private PyPI repository (your org's internal package server), configure the cluster's init script to set up the pip index URL rather than passing it in each %pip call. Credentials in %pip commands end up in notebook output.