Cross-Cloud Data Engineering: Running the Same Metadata Config on Azure, AWS, and GCP

The metadata-driven pipeline design I've been building toward solves a specific problem: how do you build a pipeline that runs everywhere without rewriting it for each environment? The answer is that the pipeline doesn't know where it's running. The config table knows, and the pipeline reads the config.

This becomes tangible when a client wants to run the same data platform on Azure today and add AWS next year without rebuilding everything. Here's how the cross-cloud design works.

The Abstraction Layer

Each cloud has different storage APIs, different authentication mechanisms, and different native services. The config table abstracts all of this: instead of hardcoding abfss://container@account.dfs.core.windows.net/path in notebook code, the notebook reads a path from config and uses a storage access layer that handles the cloud-specific protocol.

from abc import ABC, abstractmethod
from enum import Enum

from pyspark.sql import DataFrame, SparkSession

# Notebooks usually predefine `spark`; getOrCreate() returns that session if it exists.
spark = SparkSession.builder.getOrCreate()


class CloudProvider(str, Enum):
    AZURE = "azure"
    AWS = "aws"
    GCP = "gcp"


class StorageAccessLayer(ABC):
    """Cloud-agnostic interface that the pipeline code depends on."""

    @abstractmethod
    def read_delta(self, logical_path: str) -> DataFrame:
        pass

    @abstractmethod
    def write_delta(self, df: DataFrame, logical_path: str, mode: str = "append") -> None:
        pass

    @abstractmethod
    def resolve_physical_path(self, logical_path: str) -> str:
        pass


class AzureStorageAccess(StorageAccessLayer):
    def __init__(self, storage_account: str, container: str):
        self.base_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net"

    def resolve_physical_path(self, logical_path: str) -> str:
        return f"{self.base_path}/{logical_path.lstrip('/')}"

    def read_delta(self, logical_path: str) -> DataFrame:
        return spark.read.format("delta").load(self.resolve_physical_path(logical_path))

    def write_delta(self, df: DataFrame, logical_path: str, mode: str = "append") -> None:
        df.write.format("delta").mode(mode).save(self.resolve_physical_path(logical_path))


class AwsStorageAccess(StorageAccessLayer):
    def __init__(self, bucket: str, prefix: str = ""):
        # Normalise the prefix so an empty or slash-wrapped value cannot
        # produce a double slash such as s3a://bucket//bronze/orders/.
        prefix = (prefix or "").strip("/")
        self.base_path = f"s3a://{bucket}/{prefix}" if prefix else f"s3a://{bucket}"

    def resolve_physical_path(self, logical_path: str) -> str:
        return f"{self.base_path}/{logical_path.lstrip('/')}"

    def read_delta(self, logical_path: str) -> DataFrame:
        return spark.read.format("delta").load(self.resolve_physical_path(logical_path))

    def write_delta(self, df: DataFrame, logical_path: str, mode: str = "append") -> None:
        df.write.format("delta").mode(mode).save(self.resolve_physical_path(logical_path))
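The classes above cover Azure and AWS, while the enum also lists GCP. As a rough sketch of how the same logical-to-physical mapping might extend to Google Cloud Storage, the snippet below resolves one logical path against all three clouds. It assumes the gs:// scheme used by the GCS Hadoop connector; join_path, gcs_base_path, and the account and bucket names are hypothetical and exist only for illustration. It is plain string handling, so it runs without a Spark session.

```python
def join_path(base: str, logical_path: str) -> str:
    """Join a base URI and a logical path with exactly one slash between them."""
    return f"{base.rstrip('/')}/{logical_path.lstrip('/')}"


def gcs_base_path(bucket: str, prefix: str = "") -> str:
    """Build the gs:// root for a bucket and optional prefix (hypothetical helper)."""
    prefix = prefix.strip("/")
    return f"gs://{bucket}/{prefix}" if prefix else f"gs://{bucket}"


# Illustrative base paths only; real values would come from the config table.
bases = {
    "azure": "abfss://lake@examplestore.dfs.core.windows.net",
    "aws": "s3a://example-lake",
    "gcp": gcs_base_path("example-lake", prefix="/warehouse/"),
}

# The same logical path resolves to a different physical path per cloud.
resolved = {cloud: join_path(base, "bronze/orders/") for cloud, base in bases.items()}
for cloud, path in resolved.items():
    print(f"{cloud}: {path}")
```

A GCP storage class could wrap gcs_base_path the same way AwsStorageAccess wraps its bucket and prefix, leaving read_delta and write_delta unchanged.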

Wiring to the Config Table

def get_storage_layer(env_config: dict) -> StorageAccessLayer:
    # CloudProvider(...) itself raises ValueError for values outside the enum.
    provider = CloudProvider(env_config['cloud_provider'])
    if provider == CloudProvider.AZURE:
        return AzureStorageAccess(
            storage_account=env_config['storage_account'],
            container=env_config['container']
        )
    elif provider == CloudProvider.AWS:
        return AwsStorageAccess(
            bucket=env_config['s3_bucket'],
            # The column is always present in the row, so a NULL prefix
            # arrives as None rather than as a missing key.
            prefix=env_config.get('s3_prefix') or ''
        )
    raise ValueError(f"Unsupported cloud provider: {provider.value}")

# Config table has a row per environment
row = spark.sql("""
    SELECT cloud_provider, storage_account, container, s3_bucket, s3_prefix
    FROM meta.EnvironmentConfig
    WHERE env_name = 'prod' AND is_active = 1
""").first()
if row is None:
    raise LookupError("No active 'prod' row found in meta.EnvironmentConfig")
env_config = row.asDict()

storage = get_storage_layer(env_config)

# Pipeline code uses the abstraction: no cloud-specific code in the transform logic
raw_df = storage.read_delta("bronze/orders/")
# ... transform (produces silver_df) ...
storage.write_delta(silver_df, "silver/orders/")
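Because one query selects every provider's columns, an Azure row carries NULL S3 fields and an AWS row carries NULL Azure fields. A small validation step before building the storage layer can turn a half-filled row into a clear error instead of a malformed path. This is a minimal sketch: the column names come from the query above, while REQUIRED_KEYS and validate_env_config are hypothetical names introduced for illustration.

```python
# Columns that must be non-empty for each provider (assumed from the query above).
REQUIRED_KEYS = {
    "azure": ("storage_account", "container"),
    "aws": ("s3_bucket",),
}


def validate_env_config(env_config: dict) -> None:
    """Raise ValueError if the row lacks a value its provider needs."""
    provider = env_config.get("cloud_provider")
    if provider not in REQUIRED_KEYS:
        raise ValueError(f"Unsupported cloud provider: {provider!r}")
    # NULL columns arrive as None, so test for falsy values, not key presence.
    missing = [key for key in REQUIRED_KEYS[provider] if not env_config.get(key)]
    if missing:
        raise ValueError(
            f"EnvironmentConfig row for {provider!r} is missing: {', '.join(missing)}"
        )


# A well-formed Azure row passes even though its S3 columns are NULL.
validate_env_config({
    "cloud_provider": "azure",
    "storage_account": "examplestore",
    "container": "lake",
    "s3_bucket": None,
    "s3_prefix": None,
})
```

Calling validate_env_config(env_config) just before get_storage_layer(env_config) keeps the failure next to the config lookup rather than deep inside a Spark read.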

Terraform Artifacts per Cloud

The Terraform generator from earlier builds cloud-specific infrastructure configs from the same metadata. A row in the config table has a cloud_provider field; the generator picks the appropriate Terraform resource types:

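def render_storage_resource(config: dict) -> str:
    if config['cloud_provider'] == 'azure':
        return render_azure_storage_container(config)
    elif config['cloud_provider'] == 'aws':
        return render_aws_s3_bucket(config)
    elif config['cloud_provider'] == 'gcp':
        return render_gcs_bucket(config)
    # Fail loudly instead of silently returning None for an unknown provider.
    raise ValueError(f"Unsupported cloud provider: {config['cloud_provider']}")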
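The render helpers are not shown here, so the following is only a guess at their shape: a minimal sketch that fills string templates for two of the three clouds. The aws_s3_bucket and google_storage_bucket resource types are real Terraform resources; the config keys resource_name, gcs_bucket, and gcs_location, the template constants, and the RENDERERS table are assumptions made for illustration.

```python
from string import Template

# Terraform HCL templates; "$name" placeholders are filled from the config row.
AWS_BUCKET = Template(
    'resource "aws_s3_bucket" "$resource_name" {\n'
    '  bucket = "$bucket"\n'
    '}\n'
)
GCS_BUCKET = Template(
    'resource "google_storage_bucket" "$resource_name" {\n'
    '  name     = "$bucket"\n'
    '  location = "$location"\n'
    '}\n'
)


def render_aws_s3_bucket(config: dict) -> str:
    return AWS_BUCKET.substitute(
        resource_name=config["resource_name"], bucket=config["s3_bucket"]
    )


def render_gcs_bucket(config: dict) -> str:
    return GCS_BUCKET.substitute(
        resource_name=config["resource_name"],
        bucket=config["gcs_bucket"],
        location=config["gcs_location"],
    )


# A lookup table is one alternative to the if/elif chain for choosing a renderer.
RENDERERS = {"aws": render_aws_s3_bucket, "gcp": render_gcs_bucket}

row = {"cloud_provider": "aws", "resource_name": "lake", "s3_bucket": "example-lake"}
print(RENDERERS[row["cloud_provider"]](row))
```

A real generator would likely emit more attributes (tags, versioning, access settings), but the principle is the same: the metadata row supplies the values and the provider picks the template.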

The pipeline code is identical across clouds. The Terraform is different. The config table captures which cloud each environment runs on, and the tooling generates the right artifacts. That separation (shared pipeline logic, environment-specific infrastructure) is what makes cross-cloud operation practical rather than heroic.
