Why I Run Data Engineering and Data Science on the Same Platform
At most organizations, data engineering and data science are separate teams with separate infrastructure. The engineers run production Spark jobs on EMR or Databricks Jobs. The scientists run experiments in Jupyter notebooks on standalone EC2 instances or SageMaker. They share data through S3 and share nothing else — different Spark versions, different Python environments, different library versions, completely separate debugging workflows.
The handoff between them is painful in a specific way: a data scientist proves that a feature engineering approach works in their notebook, and then a data engineer spends one to three weeks "porting" that work to production. The porting is mostly not interesting engineering — it's reconciling environments, rewriting pandas operations as PySpark equivalents, and dealing with edge cases the notebook never encountered because it ran on a sample.
Databricks collapses this split. It's not the only reason I prefer the platform, but it's one of the most practically valuable.
What "Same Platform" Actually Means
On Databricks, a data scientist working in a Databricks Notebook is writing Python (or Scala, or SQL, or R) that runs on a Spark cluster — the same kind of cluster, with the same Databricks Runtime, that your production jobs run on. They're not on a standalone Python instance; they're on Spark. Their pandas code can be replaced with PySpark DataFrame operations incrementally. Their prototype and your production job are the same kind of artifact running in the same environment.
When the data scientist hands off a feature engineering notebook, the data engineer isn't porting it to a different runtime. They're productionizing it: parameterizing the date range, removing visualization code, adding error handling, scheduling it as a Databricks Job. That's an afternoon of work, not a week.
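As a rough illustration of the "parameterizing the date range" and "adding error handling" steps, here is a minimal sketch in plain Python. The function name `parse_date_range`, the `max_days` guard, and the widget names mentioned in the comments are assumptions for illustration, not part of any Databricks API.

```python
from __future__ import annotations

from datetime import date, timedelta


def parse_date_range(start: str, end: str, max_days: int = 366) -> list[date]:
    """Validate two ISO dates and return every day in the inclusive range."""
    try:
        start_day = date.fromisoformat(start)
        end_day = date.fromisoformat(end)
    except ValueError as exc:
        raise ValueError(
            f"dates must be YYYY-MM-DD, got {start!r} and {end!r}"
        ) from exc
    if end_day < start_day:
        raise ValueError(f"end date {end} is before start date {start}")
    span = (end_day - start_day).days + 1
    if span > max_days:
        raise ValueError(f"range of {span} days exceeds limit of {max_days}")
    return [start_day + timedelta(days=offset) for offset in range(span)]


# In a notebook run as a job, the two strings would typically come from job
# parameters, e.g. dbutils.widgets.get("start_date") (widget name is hypothetical).
days = parse_date_range("2024-01-30", "2024-02-01")
print([d.isoformat() for d in days])
```

Failing fast on a bad date range keeps a scheduled run from silently processing the wrong partition, which is the kind of edge case an interactive notebook rarely hits.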
Library Parity
Databricks libraries can be installed as cluster-scoped (available to every notebook and job attached to the cluster) or notebook-scoped (installed from within a notebook, for example with %pip, and visible only to that notebook's session). A data scientist can install a new Python library and have it available immediately in their notebook. When that library gets promoted to the production cluster, the data engineer installs the same pinned version. No "it worked on my machine" arguments: the machine is the same cluster type, running the same runtime, with the same library.
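One way to make "the same version" checkable rather than assumed is a small drift check run on each cluster. This is a sketch using only the Python standard library. The helper name `find_version_drift` and the pinned versions are hypothetical, and real pins would come from wherever your team records the production library list.

```python
from __future__ import annotations

from importlib import metadata
from typing import Callable, Optional


def find_version_drift(
    pinned: dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> dict[str, tuple[str, Optional[str]]]:
    """Return {package: (expected, actual)} for every package that differs.

    `actual` is None when the package is not installed at all.
    """
    drift = {}
    for package, expected in pinned.items():
        try:
            actual = lookup(package)
        except metadata.PackageNotFoundError:
            actual = None
        if actual != expected:
            drift[package] = (expected, actual)
    return drift


# Hypothetical pins for illustration only.
PINNED = {"scikit-learn": "1.4.2", "pandas": "2.2.2"}

for package, (expected, actual) in find_version_drift(PINNED).items():
    print(f"{package}: expected {expected}, found {actual or 'not installed'}")
```

Running the same check in the prototype notebook and as the first task of the production job gives both teams one shared answer to "are we on the same library?".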
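# In a Databricks Notebook: prototype phase
# Data scientist is exploring a new feature
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Small-sample exploration in pandas first (1% sample pulled to the driver)
sample_df = spark.table("events.page_views").sample(fraction=0.01).toPandas()
scaler = StandardScaler()
sample_df["normalized_duration"] = scaler.fit_transform(
    sample_df[["session_duration_seconds"]]
)
# Validated on sample; now scale to the full dataset with PySpark
from pyspark.ml.feature import StandardScaler as SparkScaler, VectorAssembler
full_df = spark.table("events.page_views")
# Spark's scaler operates on a vector column, so assemble one first
assembled_df = VectorAssembler(
    inputCols=["session_duration_seconds"], outputCol="duration_vec"
).transform(full_df)
# withMean=True centers the data, matching scikit-learn's default
spark_scaler = SparkScaler(
    inputCol="duration_vec", outputCol="normalized_duration", withMean=True
)
full_df = spark_scaler.fit(assembled_df).transform(assembled_df)
# The same data, the same cluster, just the full version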
# Transition from pandas to Spark is incremental, not a rewrite
MLlib on the Same Data
Databricks has Spark MLlib available natively. When a data scientist wants to run k-means clustering on the same Parquet data that the engineering team processes for reports, they open a notebook on the shared cluster and run it. No data export to a separate ML environment, no synchronization problem between the data the model trains on and the data the pipeline processes.
This sounds like a small thing. It compounds. When the ML team and the data engineering team share a data platform, they build shared conventions for feature tables, shared naming standards for output datasets, and a shared understanding of the data's shape. When they're on separate infrastructure, those conventions diverge and the integration work grows with every iteration.
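The k-means scenario described above might look roughly like the following sketch. The function name `cluster_sessions`, the output column names, and the `pages_viewed` column in the usage comment are assumptions for illustration. The table and the `session_duration_seconds` column are the ones used earlier in this post.

```python
from __future__ import annotations


def cluster_sessions(spark, table_name: str, feature_cols: list[str],
                     k: int = 3, seed: int = 42):
    """Fit k-means on a table; return (rows with a 'cluster' column, model)."""
    # Imported inside the function so this file still loads on machines
    # where PySpark is not installed (on a Databricks cluster it always is).
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    df = spark.table(table_name)
    # MLlib estimators expect a single vector column of features;
    # handleInvalid="skip" drops rows with nulls in the feature columns.
    assembler = VectorAssembler(
        inputCols=feature_cols, outputCol="features", handleInvalid="skip"
    )
    features_df = assembler.transform(df)
    kmeans = KMeans(k=k, seed=seed, featuresCol="features",
                    predictionCol="cluster")
    model = kmeans.fit(features_df)
    return model.transform(features_df), model


# In a notebook, `spark` is already defined. `pages_viewed` is a hypothetical column.
# clustered, model = cluster_sessions(
#     spark, "events.page_views", ["session_duration_seconds", "pages_viewed"]
# )
# clustered.groupBy("cluster").count().show()
```

The point of the sketch is that the model reads the same table the reporting pipeline reads, so there is no exported copy to keep in sync.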
The Organizational Argument
Beyond the technical efficiency, there's an organizational argument: two teams sharing a platform have a forcing function to communicate about data structure, naming, and shared dependencies. Two teams on separate infrastructure drift apart without anyone making a deliberate decision to let them drift.
If your data science and data engineering teams are currently on separate platforms and the handoff between them is a recurring friction point, Databricks as a shared platform is worth a real evaluation. The unified runtime is not the only solution, but it's the most direct one I've worked with. As always, I'm here to help.