Databricks Community Edition: Standing Up Your First Spark Cluster in 15 Minutes

Before diving into the architecture and the API, you need a cluster to run code against. Databricks Community Edition gives you one for free — no credit card, no time limit on account access, one active cluster at a time. It's not production-grade (it's a single-node cluster and Databricks terminates it after 2 hours of inactivity), but it's exactly the right environment for learning the platform without incurring a cloud bill.

Getting Set Up

Go to community.cloud.databricks.com and create an account. The sign-up flow is straightforward. Once you're in, you land on the Databricks workspace.

Create a cluster: go to Compute in the left sidebar, click Create Cluster. The Community Edition options are limited by design — you'll see a single machine type available. Give your cluster a name, pick the latest LTS (Long Term Support) Databricks Runtime version available, and hit Create Cluster. The cluster takes about 3–5 minutes to start the first time.

The runtime version matters. In 2020, Databricks Runtime 7.x is the current stable series — it ships with Spark 3.0, Python 3.7, and Delta Lake 0.7. The LTS variants get security patches longer. For learning purposes: pick the latest LTS available to you and you'll be fine.

Your First Notebook

Go to Workspace, create a new notebook (Python), and attach it to the cluster you just created. You'll see a cell prompt. Run this:

spark.version

If you get a version string back instead of an error, your notebook is connected to the cluster and Spark is running. The spark variable is a SparkSession — it was automatically created for you by Databricks when the cluster started. In standalone Spark code outside of Databricks, you'd create it explicitly:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-first-app")
    .getOrCreate()
)

But in Databricks notebooks, spark is always available. Consider it provided by the environment.

Reading Data and Running a Query

Databricks Community Edition includes some sample datasets. Let's use one:

df = spark.read.csv(
    "/databricks-datasets/samples/population-vs-price/data_geo.csv",
    header=True,
    inferSchema=True
)
df.printSchema()
df.show(5)

The printSchema() call shows column names and inferred types. show(5) displays the first 5 rows. If these feel like the DESCRIBE and SELECT ... LIMIT 5 you'd run in a SQL client, that's deliberate: Spark's DataFrame API mirrors SQL concepts.

Now run a simple aggregation:

from pyspark.sql import functions as F

result = (
    df.groupBy("State Code")
    .agg(F.avg("2015 median sales price").alias("avg_price"))
    .orderBy("avg_price", ascending=False)
)

result.show(10)

If you're thinking "this looks a lot like SQL GROUP BY," you're right, and that's intentional. The DataFrame API was designed to feel familiar to SQL practitioners. The difference is in what's happening underneath: groupBy is a distributed operation that runs across partitions in parallel. On Community Edition's single node that parallelism is limited to one machine, so you won't see its full effect until you move to a multi-node cluster.
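
To make "running across partitions" a little more concrete, here is a plain-Python toy that mimics the general shape of a partitioned average: each partition produces partial (sum, count) pairs, and a merge step combines them. This is only a sketch of the idea, not Spark's actual implementation; the partitions list, the sample values, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data: three "partitions" of (state, price) rows.
partitions = [
    [("CA", 500.0), ("TX", 200.0)],
    [("CA", 700.0), ("NY", 400.0)],
    [("TX", 300.0)],
]

def partial_aggregate(rows):
    """Per-partition step: accumulate a (sum, count) pair for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for state, price in rows:
        acc[state][0] += price
        acc[state][1] += 1
    return acc

def merge(partials):
    """Combine the partial (sum, count) pairs, then compute the final averages."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for state, (subtotal, count) in partial.items():
            total[state][0] += subtotal
            total[state][1] += count
    return {state: subtotal / count for state, (subtotal, count) in total.items()}

# In a distributed engine the per-partition step could run on separate cores
# or machines; here it is just a loop in one process.
avg_price = merge(partial_aggregate(p) for p in partitions)

# Equivalent of orderBy("avg_price", ascending=False).
ranked = sorted(avg_price.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

The reason for carrying (sum, count) rather than a per-partition average is that averages of averages are wrong when partitions hold different numbers of rows per key; sums and counts can be merged safely in any order.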

Running SQL Directly

If you'd rather write actual SQL, Databricks supports that too. Register the DataFrame as a temporary view and query it with SQL:

df.createOrReplaceTempView("housing")

spark.sql("""
    SELECT `State Code`, AVG(`2015 median sales price`) AS avg_price
    FROM housing
    GROUP BY `State Code`
    ORDER BY avg_price DESC
    LIMIT 10
""").show()

Same result, standard SQL syntax. This is a genuine choice in Databricks: Python DataFrame API, SQL, or Scala — all first-class citizens. For SQL-heavy teams, starting with spark.sql() is often the lowest-friction on-ramp.

What's Actually Happening Here

Even though none of this feels unusual yet, something important is happening under the hood: Spark built a logical execution plan, optimized it via the Catalyst optimizer, and compiled it down to physical operations that could run on distributed partitions. On Community Edition, those partitions all happen to be on one machine. On a real cluster, they'd be spread across workers.

The API is identical regardless of cluster size. That's the point: write code that works on Community Edition, run it unchanged on a 50-node production cluster. The scale changes; the code doesn't.

Next: how that distribution actually works. The driver, the executors, and why sorting M&Ms by color is the best analogy I've found for explaining it.
