Reading and Writing Data in Databricks: The Formats That Actually Matter

Databricks can read almost anything. CSV, JSON, Parquet, ORC, Avro, Delta, JDBC sources, Kafka topics — the API is consistent across all of them. The question isn't "can Databricks read this" but "should I be storing it in this format." Let me save you some trial and error.

The Format Hierarchy

For files you control, the format choice matters a lot for query performance and storage costs. Here's the ranked order:

Delta Lake — for anything you write to in production. Adds ACID transactions, time travel, and schema enforcement on top of Parquet. No performance downside over raw Parquet for reads; Databricks' Spark engine has deeply integrated Delta optimizations.

Parquet — for data you receive from external systems (data vendors, partner teams using different stacks). Columnar format, compressed by default (Snappy), supports predicate pushdown (Spark can skip reading columns and row groups it doesn't need). Significantly better than CSV or JSON for analytical workloads.

CSV/JSON — for ingestion from source systems that don't speak Parquet. Write them straight to Delta on arrival. Never do analytics against raw CSV at scale.

Reading Data

# Delta
df = spark.read.format("delta").load("abfss://container@storage.dfs.core.windows.net/tables/orders/")

# Parquet
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/raw/orders/year=2020/")

# CSV with explicit schema (faster and safer than inferSchema)
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer_id", LongType(), nullable=True),
    StructField("order_date", TimestampType(), nullable=True),
    StructField("total_amount", StringType(), nullable=True),
])

df = spark.read.csv(
    "abfss://container@storage.dfs.core.windows.net/raw/orders/",
    schema=schema,
    header=True
)

# JSON
df = spark.read.json("abfss://container@storage.dfs.core.windows.net/raw/events/")

A note on inferSchema=True: convenient, but it requires a full pass over the data to determine types before any actual processing starts. For large datasets, define the schema explicitly and skip the scan.

Writing Data

The write modes:

  • overwrite — replace the target table/path completely
  • append — add rows to existing data, no deduplication
  • errorIfExists — fail if the target already has data (the default)
  • ignore — silently skip the write if the target already has data
# Write as Delta to a path
df.write     .format("delta")     .mode("overwrite")     .save("abfss://container@storage.dfs.core.windows.net/tables/orders/")

# Write as Delta and register in the metastore
df.write     .format("delta")     .mode("overwrite")     .saveAsTable("analytics.orders")

# Partitioned write: separate directories per partition key
df.write     .format("delta")     .mode("overwrite")     .partitionBy("order_date")     .save("abfss://container@storage.dfs.core.windows.net/tables/orders_partitioned/")

Partitioned Writes and Why They Matter

When you write with partitionBy("order_date"), Spark creates a subdirectory for each distinct value of order_date. Queries that filter on order_date only read the relevant subdirectory — this is partition pruning, and it's how you avoid full table scans on large historical datasets.

The SQL Server equivalent: table partitioning on a date column, where queries against a date range only touch the relevant filegroups. Same concept, file-system implementation.

Choose partition keys carefully. Good partition keys have:

  • Low to medium cardinality (date, region, product category — not customer_id, which has millions of values)
  • Even distribution (no single value containing 80% of the data)
  • Common filter usage (queries almost always filter on this column)

Reading from JDBC Sources

Databricks can read directly from SQL Server, PostgreSQL, and other JDBC sources. This is useful for migration and for pulling reference data:

jdbc_df = spark.read     .format("jdbc")     .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")     .option("dbtable", "dbo.customers")     .option("user", dbutils.secrets.get(scope="kv", key="sql-user"))     .option("password", dbutils.secrets.get(scope="kv", key="sql-password"))     .option("numPartitions", "8")     .option("partitionColumn", "customer_id")     .option("lowerBound", "1")     .option("upperBound", "10000000")     .load()

The numPartitions, partitionColumn, lowerBound, upperBound options are critical. Without them, Spark uses a single connection and pulls all data through the driver — back to the driver-bound workload problem. With them, Spark opens N parallel connections and pulls ranges in parallel across executors.

For large SQL Server tables, always specify the partition options. The performance difference between single-thread JDBC pull and parallelized JDBC pull can be an order of magnitude.

Read more