The first time I saw someone read a Databricks table with a hardcoded ADLS path buried in a production notebook, I had a flashback to SQL Server connection strings stored directly in SSIS packages. The path works today. It stops working when the storage account gets renamed. Nobody knows what it points to without reading the code. There's a better way, and it's called the Hive Metastore.
What the Hive Metastore Is
The Hive Metastore is a catalog — a central registry that stores metadata about your data: table names, column names and types, storage locations, partition information, and table statistics. It's the equivalent of SQL Server's sys.tables, sys.columns, and INFORMATION_SCHEMA, but for distributed data stored in cloud storage.
When a table is registered in the Hive Metastore, you can refer to it by name instead of path:
# Without metastore: fragile, path-dependent
df = spark.read.format("delta").load(
"abfss://analytics@mystorageaccount.dfs.core.windows.net/tables/customers/"
)
# With metastore: stable, portable, discoverable
df = spark.table("analytics.customers")
# or equivalently
df = spark.sql("SELECT * FROM analytics.customers")
The metastore knows the path. Your code doesn't need to.
Managed vs. External Tables
This is the decision that matters most when creating tables, and it directly parallels a SQL Server concept.
Managed tables: Databricks manages both the metadata (in the metastore) and the data (stored in Databricks' default location, typically dbfs:/user/hive/warehouse/). When you DROP TABLE a managed table, Databricks deletes both the metastore entry and the underlying data files. Think of it like a SQL Server heap — the server owns the storage.
External tables: Databricks manages the metadata in the metastore, but the data lives at a location you specify (your ADLS container, your S3 bucket). When you DROP TABLE an external table, Databricks removes the metastore entry but leaves the data files intact. Think of it like a SQL Server external table or a linked server — the metadata is managed, but the data isn't.
-- Managed table: Databricks picks the storage location
CREATE TABLE analytics.customers (
customer_id BIGINT,
name STRING,
region STRING,
created_date DATE
) USING DELTA;
-- External table: you specify where the data lives
CREATE TABLE analytics.events
USING DELTA
LOCATION 'abfss://analytics@mystorageaccount.dfs.core.windows.net/tables/events/';
My default recommendation: use external tables for anything production. You want the data to outlive the metastore entry. If someone accidentally drops a managed table, the data is gone. If they drop an external table, you re-register the path and you're back in business.
Databases in the Metastore
The metastore supports a two-level namespace: database (also called schema) and table. This maps directly to SQL Server's schema concept.
-- Create a logical grouping
CREATE DATABASE IF NOT EXISTS analytics;
-- Tables registered under it
CREATE TABLE analytics.customers ...;
CREATE TABLE analytics.orders ...;
-- Query cross-database
SELECT c.name, COUNT(o.order_id) AS order_count
FROM analytics.customers c
JOIN analytics.orders o ON c.customer_id = o.customer_id
GROUP BY c.name;
The default database in Databricks is default. Tables created without specifying a database land in default. Use explicit database names for anything real.
Querying the Metastore
-- List databases
SHOW DATABASES;
-- List tables in a database
SHOW TABLES IN analytics;
-- Describe a table's schema
DESCRIBE TABLE analytics.customers;
-- Describe including storage location and partition info
DESCRIBE TABLE EXTENDED analytics.customers;
-- Show table properties
SHOW TBLPROPERTIES analytics.customers;
In Python:
# List tables in a database
spark.catalog.listTables("analytics")
# Get table metadata
spark.catalog.getTable("analytics.customers")
# Check if a table exists
spark.catalog.tableExists("analytics", "customers")
The Partition Repair Gotcha
Classic Hive tables (non-Delta) track partitions explicitly in the metastore. If you write partitioned data directly to the underlying storage path without going through Spark SQL, the metastore doesn't know about the new partitions:
-- After manually writing files to a partitioned Hive table location:
MSCK REPAIR TABLE analytics.events;
-- This scans the storage path and registers any new partition directories
Delta Lake tables don't have this problem. Delta's transaction log tracks all data files automatically, and Spark always reads the transaction log to determine what files exist. No repair needed. This is one of the concrete reasons to prefer Delta format over plain Parquet or ORC for tables in the metastore.
What the Metastore Doesn't Do
The Hive Metastore doesn't handle access control. In 2020, there's no row-level or column-level security built into the default Databricks Hive Metastore. Access is cluster-level: if you have access to the cluster and the underlying storage, you can query the table. Unity Catalog (a later Databricks feature) addresses this with fine-grained access control, but that's further down the road.
The metastore also doesn't automatically keep statistics updated the way SQL Server's auto-update statistics does. You can run ANALYZE TABLE analytics.customers COMPUTE STATISTICS to generate column statistics that the query optimizer can use, but this is manual. Delta Lake's OPTIMIZE command handles file-level compaction but doesn't update metastore statistics.
For most workloads, the metastore is essential infrastructure — not optional. Get your tables registered in it from the start. The alternative is everyone hardcoding storage paths, and that ends exactly as badly as you'd expect.