Shannon Lowder

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Shannon Lowder — Fri, 26 Jun 2026 10:00:00 GMT

I wrote recently about Azure Agent Mesh and OpenSharing, two infrastructure layers that together cover how enterprises register, discover, share, and execute agents. They supply a lot of the plumbing that has been missing from the enterprise agent stack.

But there's a gap neither of them touches, and it's the one that determines whether your agents actually produce useful results: the quality of the context you give them.

Agent Mesh tells you how to run agents. OpenSharing tells you how to share agent skills across organizations. Neither tells you how to manufacture context that makes those agents smart about your specific problem, in your specific environment, with your specific history. That's not a protocol problem. It's a memory problem.

The Garbage-In Problem for Agents

The fundamental failure mode I see in production agent deployments is not model capability — it's context quality. An agent reasoning about a pipeline failure has access to a generic system prompt, the immediate error message, and maybe a few recent logs if someone wired that up. It doesn't have the history of how this table has behaved over the last six months. It doesn't know that this exact error pattern appeared twice before and both times it was a schema evolution issue upstream. It doesn't have the context of what remediation worked last time.

That missing context isn't secret or hard to find. It exists in your pipeline run logs, your incident records, your agent's own previous interactions. The problem is that it's scattered across storage systems with no retrieval layer that understands what's relevant right now, for this specific task, weighted by how recent and reliable each piece of information is.

The result is an agent that reasons well but decides poorly, because it's reasoning from an impoverished context. The model isn't the bottleneck. The memory system is.

What Context Manufacturing Actually Requires

Retrieval is the obvious first answer, and it's necessary but not sufficient. A vector similarity search over historical data gets you semantically relevant documents. What it doesn't do:

Weight by recency: a note from two years ago about how this table schema worked under a different ETL system is technically relevant but practically misleading. Context needs temporal decay.
Fuse multiple signals: the best match under vector similarity isn't always the best match under keyword relevance. A hybrid retrieval that combines semantic search, full-text search, and a reranker produces better results than any single method.
Shape for the consumer: a pipeline triage agent needs context in a different shape than a stakeholder report agent. Raw retrieved documents aren't the right unit; consumer-shaped views are.
Improve over time: if the context you provided led to a bad agent decision, the memory system should learn from that — flagging the divergence, surfacing it for correction, tightening the retrieval on the next call.
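The recency point can be made concrete with a small sketch. This is one common way to apply temporal decay (an exponential half-life), not necessarily how Cortex Forge does it; the function name `decayed_score` and the 90-day half-life are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def decayed_score(similarity: float, created_at: datetime, now: datetime,
                  half_life_days: float = 90.0) -> float:
    """Scale a similarity score by exponential recency decay (hypothetical helper)."""
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    # The weight halves every `half_life_days`; 90 days is an assumed default.
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 6, 26, tzinfo=timezone.utc)
# A strong semantic match from two years ago versus a weaker match from last month.
two_year_old_note = decayed_score(0.92, now - timedelta(days=730), now)
recent_note = decayed_score(0.80, now - timedelta(days=25), now)
```

With these inputs the recent note outranks the older one even though its raw similarity is lower, which is the behavior the recency requirement asks for.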

This is what I'm building with Cortex Forge, and why I think of it as a context manufacturing system rather than just a vector database.

How Cortex Forge Approaches It

The architecture follows a medallion model with strict tier separation.

Bronze is the immutable, append-only archive — every raw event, run log, conversation turn, and note captured verbatim. Nothing is ever deleted from Bronze without a human-gated operation. It's the eidetic layer: the guarantee that nothing is lost.

Silver is the derived, system-owned knowledge layer. The engine processes Bronze events into structured notes and extracted facts — cleaned, deduplicated, reconciled. Silver is regenerable from Bronze, which means if the derivation logic improves, you can rebuild it without losing the source record. Human edits are captured as overlay patches — external authority over the system's internal notes, with the human winning on short-term disputes while the system's model of revealed behavior accumulates over time.

Gold is where retrieval happens. Consumer-shaped views over Silver: pgvector indexes for semantic search, BM25 indexes for full-text, per-agent memory sets scoped to specific workflows. A retrieval request against Gold runs hybrid search — HNSW vector + BM25 fused with Reciprocal Rank Fusion, passed through a reranker, filtered by temporal relevance with recency decay. The result isn't a list of documents — it's a ranked, weighted, consumer-shaped context optimized for the specific agent and task making the request.
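For readers who have not seen Reciprocal Rank Fusion before, here is a minimal sketch of the fusion step on its own. It works on plain ranked lists of document IDs and leaves out the reranker and the recency filter. The note IDs are made up, and `k = 60` is the commonly used constant for RRF, assumed here rather than taken from Cortex Forge.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["note-17", "note-03", "note-42"]   # hypothetical HNSW results
bm25_hits = ["note-03", "note-99", "note-17"]     # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear near the top of both lists win, which is why `note-03` ends up ahead of `note-17` even though the vector search alone preferred `note-17`.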

The MCP server is the Gold consumer interface. Any MCP-compatible agent — LangGraph, Claude Code, Copilot, a custom agent behind OmniRoute — hits the same endpoint and gets back context shaped for its declared purpose.

The Reconciler Is What Makes It Self-Improving

The piece that differentiates this from a well-engineered vector store is the reconciler. Cortex Forge tracks the divergence between what the system believes to be true (Silver) and what is revealed through actual behavior (Bronze). When an agent's decision based on the manufactured context led to a correct outcome, that's a signal. When it didn't, that's a signal too.

The reconciler surfaces flagged divergences — "you stated X, but six months of behavior suggests Y" — for human review. The human's verdict is itself a Bronze event, feeding back into the accuracy of future Silver derivations. The system gets less wrong over time not through automated self-modification but through a structured human-in-the-loop feedback cycle that the system itself generates.

The governing rule is simple: autonomous action is permitted only for reversible operations. Destructive or irreversible operations — deleting a Bronze record, modifying a human overlay — require human authorization regardless of the system's confidence. Confidence affects ranking and whether to surface a proposal; it never authorizes irreversible action.
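The rule is easy to state as code. What follows is a sketch of the policy as described above, with hypothetical names (`Operation`, `gate`, `surface_threshold`) that are not part of any published Cortex Forge interface.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    PROPOSE = "propose"      # surface to a human for review
    SUPPRESS = "suppress"    # not confident enough to surface


@dataclass(frozen=True)
class Operation:
    name: str
    reversible: bool
    confidence: float


def gate(op: Operation, human_approved: bool = False,
         surface_threshold: float = 0.5) -> Decision:
    """Decide what the system may do with an operation (illustrative policy)."""
    if op.reversible:
        return Decision.EXECUTE
    if human_approved:
        return Decision.EXECUTE
    # Confidence only decides whether to surface a proposal.
    # It never authorizes an irreversible action.
    return Decision.PROPOSE if op.confidence >= surface_threshold else Decision.SUPPRESS


delete_op = Operation("delete_bronze_record", reversible=False, confidence=0.99)
rebuild_op = Operation("rebuild_silver_note", reversible=True, confidence=0.60)
```

Note that `delete_op` is only proposed at 0.99 confidence. The only path to execution for an irreversible operation is the explicit `human_approved` flag.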

Where This Plugs Into Agent Mesh and OpenSharing

The Cortex Forge MCP server is itself an agent skill in the OpenSharing model. A provider that wants to offer enriched context retrieval — temporal-aware, hybrid-search, consumer-shaped — can publish the skill through OpenSharing's standard share/schema/asset hierarchy, with scoped credentials and zero-copy access. Any recipient who has been granted access can wire the MCP endpoint into their own agent stack without copying any underlying data.

For Azure Agent Mesh, the connection is even more direct. Register the Cortex Forge MCP server in Azure API Center alongside your other agent skills and tools. The API Center data plane MCP server makes it discoverable to any agent in the mesh. An agent running on Foundry Hosted Agents hits the Cortex Forge endpoint the same way it hits any other registered tool — through the unified discovery surface.

Both protocols were already designed to accommodate exactly this kind of infrastructure-as-a-skill. The MCP standard is the seam. Cortex Forge sits on the provider side of that seam, manufacturing context. The agent sits on the consumer side, using it.

The Practical Difference

I've been running agents with and without this kind of memory layer on the same tasks. The difference isn't subtle. An agent with access to a well-manufactured context from Cortex Forge makes better triage decisions on pipeline failures because it can reason about historical patterns, not just the immediate error. It catches recurrences that would otherwise look like new incidents. It proposes remediation approaches that worked before, rather than generating something from first principles.

The models are the same in both cases. The routing layer is the same. The only difference is whether the context going into the model call is generic or manufactured. That difference shows up in every decision the agent makes downstream.

Both OpenSharing and Azure Agent Mesh assume you've solved the context problem. Cortex Forge is my answer to that assumption. As always, I'm here to help if you're thinking through the memory layer for your own agent stack.

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

Shannon Lowder — Wed, 24 Jun 2026 10:00:00 GMT

Photo: “Vicars' hall and gateway” by ell brown, licensed under CC BY 2.0.

Unity AI Gateway, announced at DAIS this week, is the feature I've been waiting for since Agent Bricks shipped last year. It's a centralized governance layer for model access in Databricks — you configure which models are approved for use in your environment, who can call them, with what data access, at what cost budget, and with what logging requirements. Every model call in your Databricks environment goes through the Gateway.

For organizations that have been letting teams call foundation models from notebooks without any governance visibility, this is the compliance and cost control story you've been missing.

What the Gateway Actually Controls

Model allowlisting: your security team approves the set of models available in the environment. A team can't call an unapproved external model from a Databricks notebook once the Gateway is enforcing the allowlist.

Cost budgets: per-team or per-project token budgets with alerting when approaching the limit. The "who spent $40k on OpenAI calls last month" forensics conversation goes away when you have budget enforcement at the platform level.

Unified audit logging: every model call through the Gateway — model invoked, tokens consumed, user, timestamp, output classification if configured — lands in a Unity Catalog table. The same lineage and governance you have for your data applies to your model calls.
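To show how the three controls fit together, here is a toy model of the decision a gateway makes on each call. It is not the Unity AI Gateway API, which I am not reproducing here; `GatewayPolicy`, its fields, and the verdict strings are all invented for illustration, and the audit log is an in-memory list standing in for a governed table.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayPolicy:
    """Toy allowlist + budget + audit check (hypothetical, not a Databricks API)."""
    allowed_models: set[str]
    token_budgets: dict[str, int]                 # team -> token budget for the period
    alert_fraction: float = 0.8                   # assumed alerting threshold
    usage: dict[str, int] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def check(self, team: str, user: str, model: str, tokens: int) -> str:
        used = self.usage.get(team, 0)
        budget = self.token_budgets.get(team, 0)
        if model not in self.allowed_models:
            verdict = "blocked:model_not_allowed"
        elif used + tokens > budget:
            verdict = "blocked:budget_exceeded"
        else:
            self.usage[team] = used + tokens
            near_limit = self.usage[team] >= self.alert_fraction * budget
            verdict = "allowed:alert" if near_limit else "allowed"
        # Every call is logged, including the blocked ones.
        self.audit_log.append(
            {"team": team, "user": user, "model": model, "tokens": tokens, "verdict": verdict}
        )
        return verdict


policy = GatewayPolicy(allowed_models={"approved-model-a"}, token_budgets={"analytics": 1000})
```

The design point is that blocked calls are logged too: the audit trail is only useful for forensics if it records what was refused as well as what was served.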

The Integration With Unity Catalog

The tightest part of the integration is the connection between Unity Catalog permissions and what data a model can be called with. A model call that includes data from a table the calling user doesn't have read access to can be blocked at the Gateway level. That's the data access governance story for AI that's been missing from every platform I've worked with. It's still early, but the architecture is right. I'm here to help design the Gateway policy structure for your environment.

You Don't Need Fable. You Need a Router.

Shannon Lowder — Sat, 20 Jun 2026 10:00:00 GMT

Photo: “Riverside Path signpost directions in Northwich” by kitchenkraft, licensed under CC BY 2.0.

The performance gap between open-weight models and closed frontier models has spent the last year collapsing faster than anyone predicted. Epoch AI's tracking puts open weights at roughly a three-to-four-month lag behind state-of-the-art closed models on average. For coding tasks, the gap has effectively closed — DeepSeek V3.2 and MiMo V2 Pro sit within striking distance of Opus 4.8 on real-world workloads. For complex reasoning, the closed frontier still holds a meaningful edge.

That remaining gap matters less than people think, for a reason that's easy to miss: the tasks in your pipeline are not uniformly hard. Most of them are nowhere near the frontier. And if you're routing every request to a frontier model because "it's the best," you're paying frontier prices for work that a well-prompted 7B model handles correctly — while also handing your data to a vendor you can't audit, on infrastructure you don't control.

The mature architecture isn't "pick the best model." It's "build a routing layer."

What the Performance Landscape Actually Looks Like

The wave of MoE (mixture-of-experts) open-weight models that landed in the first half of this year changed the economics more than any single benchmark result. Models like DeepSeek V4-Pro, Qwen 3.6-35B-A3B, and Mistral Small 4 achieve very high active-parameter efficiency — only a fraction of total parameters activate per token, which means they run fast on modest hardware while delivering quality that rivals much larger dense models.

The result is a bifurcated market. For routine tasks — classification, extraction, structured generation, code templating — open-weight models are now the volume leaders. For the hardest reasoning, long-context synthesis, and nuanced generation, the closed frontier still earns its premium. The right response to this landscape is not to pick a side. It's to route.

The Stack I'm Running

I've been building toward a multi-provider routing architecture, and after spending time testing different approaches, I landed on OmniRoute as the gateway layer. It's an OpenAI-compatible endpoint that routes across 200+ providers — closed APIs, local inference endpoints, everything — with 15 routing strategies, 4-tier auto-fallback, and a prompt compression pipeline that cuts token counts 15-75% per request before the model ever sees them.

The compression piece matters more than I initially gave it credit for. Fewer input tokens means lower cost at every provider, lower latency everywhere, and meaningfully better performance on smaller models that struggle under bloated prompts.

Behind OmniRoute I'm running about a dozen providers. The interesting one for this post is the local tier: models running on a Mac Mini via MLX and Ollama. MLX became the dominant Apple Silicon inference backend after Ollama switched its Metal backend to MLX — it's 30-60% faster than the previous llama.cpp approach, and 3-4x faster on prompt processing on M4 hardware. On a Mac Mini M4 Pro with 64GB unified memory, a MoE model like Qwen3-Coder-30B runs at around 130 tokens per second — fast enough for real pipeline work, not just demos.

That local tier covers three things nothing in the cloud can: zero per-token cost at any volume, full data sovereignty (the payload never leaves the machine), and offline operation when the cluster is down.

The Three-Tier Routing Model

The routing logic I've arrived at isn't complicated, but it has to be explicit. Here's the actual decision tree:

Local models (Mac Mini / self-hosted): Anything involving sensitive client data, high-volume routine tasks where per-token cost accumulates, anything that needs to work offline, and anything where I want absolute certainty the payload doesn't leave my network. These run at zero marginal cost and with complete sovereignty.

Mid-tier cloud (Mistral Small 4, DeepSeek V3.2, Haiku 4.5, open-weight providers): Tasks that need more quality than local models reliably deliver, but don't need frontier reasoning — complex extraction, multi-step code generation, structured analysis. Cost is a fraction of frontier, latency is acceptable, and quality meets the bar.

Frontier cloud: Reserved for tasks where the quality difference is real and worth paying for — complex multi-step reasoning, high-stakes decision points in agent pipelines, content where prose quality visibly matters. Maybe 5-10% of total request volume in a mature pipeline. Which specific frontier provider fills this slot at any given moment is, as it turns out, not something you can assume in advance.

The routing decision itself runs on a fast small model. Task classification is cheap; sending the wrong task to the wrong tier is expensive.
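Written out as code, the decision tree looks roughly like this. It is a sketch of the logic described above, not OmniRoute configuration: the `Task` flags are hypothetical and are assumed to come from the small classifier model, and the precedence (sovereignty constraints first) is my reading of the tiers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    MID = "mid_tier_cloud"
    FRONTIER = "frontier_cloud"


@dataclass(frozen=True)
class Task:
    # Flags assumed to be produced upstream by a fast classifier model.
    sensitive: bool = False
    offline_required: bool = False
    high_volume_routine: bool = False
    needs_frontier_reasoning: bool = False


def route(task: Task) -> Tier:
    # Sovereignty and offline constraints win over quality preferences.
    if task.sensitive or task.offline_required:
        return Tier.LOCAL
    if task.needs_frontier_reasoning:
        return Tier.FRONTIER
    if task.high_volume_routine:
        return Tier.LOCAL
    # Everything else: more quality than local, no frontier premium.
    return Tier.MID
```

Putting the sensitivity check first means a sensitive task stays local even when the classifier thinks it deserves frontier reasoning. That ordering is a policy choice, and it is the kind of thing that has to be explicit.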

Provider Risk Is Not Theoretical

The argument for multi-provider routing used to feel like defensive engineering: sensible in principle but unlikely to matter in practice. Then the US government issued an export-control directive requiring Anthropic to immediately suspend access to Fable 5 and Mythos 5 for all foreign nationals, citing national security concerns over a reported jailbreak. Anthropic couldn't separate foreign nationals from US-based users across a user base in the hundreds of millions on same-day notice, so they pulled both models for everyone, US customers included.

Teams whose workflows depended on Fable 5 had no warning and no graceful fallback. Those who had been routing through a gateway with multiple frontier options configured fell over to the next available provider automatically, with no human intervention and no pipeline downtime.

The shutdown wasn't caused by a technical failure, a pricing decision, or a vendor relationship gone bad. It was a regulatory action that the vendor had no choice but to comply with. That's a category of risk that doesn't show up in SLA discussions or uptime metrics, and it's one that a single-provider dependency can't hedge against.

I'm not going to speculate on the merits of the directive or predict when or whether access is restored. What the situation makes undeniable is the architecture point: if your pipeline has a hard dependency on any specific closed model, you're exposed to every kind of availability risk that model's provider faces — technical, commercial, and regulatory. A routing layer with multiple frontier providers configured doesn't eliminate that exposure. It makes recovery automatic instead of manual.
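Automatic recovery is nothing exotic. A minimal sketch of ordered fallback, assuming each provider is a plain callable and that unavailability surfaces as an exception, looks like this. The provider names, the `ProviderUnavailable` exception, and the stub functions are all hypothetical, and a real gateway would add timeouts, retries, and health checks.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised by a provider stub when it cannot serve the request."""


def call_with_fallback(prompt: str,
                       providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def suspended(prompt: str) -> str:
    raise ProviderUnavailable("access suspended")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    "triage this failure",
    [("frontier-a", suspended), ("frontier-b", healthy)],
)
```

The caller never learns why `frontier-a` was unavailable, and it doesn't need to. A regulatory suspension and a network outage take the same path through the loop.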

The Sovereignty and Cost Math

After several months of running this architecture, I find the economics instructive. The local tier absorbs most of the volume for high-frequency pipeline tasks. The mid-tier cloud handles the work that needs more quality than local provides. The frontier tier handles a small fraction of requests. The blended cost across all three tiers is dramatically lower than routing everything to a frontier API, and the data exposure surface is dramatically smaller because the sensitive volume stays on-prem.
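The blended-cost arithmetic is worth writing down, because the shape of the result matters more than the exact figures. Every number below is an illustrative placeholder, not a measurement from my pipeline or a real price list. The only inputs carried over from the text are that the local tier has zero marginal cost and that the frontier tier takes a small share of requests.

```python
# Illustrative placeholders only: shares of token volume and prices per million tokens.
traffic_share = {"local": 0.70, "mid": 0.22, "frontier": 0.08}
price_per_million_tokens = {"local": 0.0, "mid": 0.50, "frontier": 15.00}


def blended_price(shares: dict[str, float], prices: dict[str, float]) -> float:
    """Volume-weighted average price across tiers."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(shares[tier] * prices[tier] for tier in shares)


blended = blended_price(traffic_share, price_per_million_tokens)
frontier_only = price_per_million_tokens["frontier"]
savings_fraction = 1.0 - blended / frontier_only
```

Under these made-up inputs the frontier tier still dominates the bill despite its small share, so the quickest way to see where your own money goes is to plug in your real shares and prices.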

The other dimension people undercount is exactly what the Fable situation illustrated: provider resilience. A routing architecture that runs across a dozen providers with auto-fallback absorbs outages, pricing changes, model deprecations, and regulatory shutdowns, without a pipeline change or an on-call incident.

What This Is Not

This is not a recommendation to avoid frontier models or to treat any specific provider as unreliable. The closed frontier produces genuinely better results on hard reasoning tasks, and a well-configured routing setup should absolutely include frontier options. The point is that which frontier provider fills that slot should be a routing decision, not an architectural dependency.

The goal is calibration: the right model for the right task, routed automatically, with fallback when something fails — for whatever reason. Build the router first. Then figure out where each tier actually earns its slot in your pipeline. As always, I'm here to help.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving

Shannon Lowder — Fri, 19 Jun 2026 10:00:00 GMT

Photo: “jigsaw puzzle pieces” by Electric-Eye, licensed under CC BY 2.0.

The central message from DAIS this week, delivered by Ali Ghodsi in the opening keynote, was direct: AI doesn't have an intelligence problem, it has a context problem. If your CFO can't get an AI system to explain why margins changed, that's not a model capability failure — it's a context gap. The model doesn't have the enterprise-specific data, semantics, and business context it needs to give a meaningful answer.

That framing explains the entire 2026 Databricks product roadmap.

Genie One and Genie Ontology

Genie One is positioned as a smart AI coworker that understands your data — natural language queries against your lakehouse that produce accurate, business-contextual answers rather than technically-correct-but-business-wrong SQL. The underlying technology is Genie Ontology: a continuously-learning semantic layer that maps business terms to their underlying data representations in your catalog.

The ontology piece is the hard part that previous natural language to SQL systems got wrong. Knowing that "revenue" means SUM(net_order_amount) from a specific table, with specific filtering for refunds, in your specific business context — that's not something a general model knows. Genie Ontology learns it from your data and your corrections over time.
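To make the idea concrete, here is a tiny hand-written stand-in for what a semantic layer stores. This is not Genie Ontology's format or API, which learns these mappings rather than having them typed in. The table name `sales.orders` and the refund filter are hypothetical, and only `SUM(net_order_amount)` comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    expression: str
    table: str
    filters: tuple[str, ...] = ()


# Hypothetical business-term mapping; a real semantic layer would learn and version this.
ONTOLOGY = {
    "revenue": Metric(
        expression="SUM(net_order_amount)",
        table="sales.orders",
        filters=("is_refund = FALSE",),
    ),
}


def compile_metric(term: str) -> str:
    """Expand a business term into the SQL that the business actually means."""
    key = term.lower()
    metric = ONTOLOGY[key]
    sql = f"SELECT {metric.expression} AS {key} FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    return sql


revenue_sql = compile_metric("Revenue")
```

A general model asked for "revenue" has no way to know about the refund filter. The mapping is where that business knowledge lives.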

LTAP: The Transactional-Analytical Convergence

The other major architectural announcement is LTAP — Lake Transaction and Analytical Processing — which brings transactional and analytical workloads together at the storage layer rather than requiring separate systems with ETL between them. Combined with Lakebase now GA, this is Databricks making a serious structural argument that the lakehouse should be the operational database too.

The implications for pipeline architecture are significant: if your operational data and analytical data live in the same governed store, the data movement pipelines between them are reduced to transformation pipelines. That simplifies a lot of architecture that currently exists only to bridge the operational/analytical divide. I'm here to help think through what that means for your specific architecture.

LLMs as a Tool, Not a Solution

Shannon Lowder — Mon, 15 Jun 2026 12:00:00 GMT

Photo: “The Workbench” by Phil Gradwell, licensed under CC BY 2.0.

Two and a half years in, with a production knowledge system, a working orchestration layer, local inference running alongside cloud providers, and prompt hygiene enforced at the infrastructure level — I want to be direct about something the AI industry is not particularly incentivized to say clearly: building this has been hard, it has taken a lot of time, and it is still not something I would recommend to most engineers as a place to invest significant effort without understanding exactly what they're signing up for.

That's not pessimism. It's the honest accounting that I would want from anyone describing a technology investment of this magnitude.

What LLMs Actually Are

Language models are pattern-completion engines trained on text at massive scale. They are remarkably good at generating text that follows patterns similar to their training data. They are not reasoning systems in the way a human expert reasons — they do not maintain internal models of the world that they update as new information arrives, they do not have persistent memory across sessions by default, and they do not know when they are wrong.

Everything in my stack that makes LLMs useful for professional work is infrastructure that works around these properties: the knowledge system compensates for the lack of persistent memory; the orchestration layer compensates for the lack of reliable multi-step reasoning; the output verification compensates for the fact that models don't know when they're wrong; the provider routing compensates for the fact that no single model is best for all tasks.

The tools are powerful. The infrastructure required to make them reliable for professional use is significant. That's the honest framing.

The Pattern-Driven Development Connection

The part of this story that I find most interesting — and that doesn't get enough attention in the AI tooling conversation — is how much the LLM value proposition depends on having well-established patterns in the first place.

I've been writing about pattern-driven development for over a decade: metadata-driven pipelines, configuration-driven frameworks, template-based code generation. The core idea is that consistent patterns enable automation. LLMs extend that idea in a specific direction: if your work follows consistent, recognizable patterns, language models can assist with the pattern-following parts — generating the boilerplate, applying the conventions, producing the structural scaffolding — freeing human attention for the parts that require genuine judgment.

The implication: engineers who already work in highly pattern-consistent ways get more value from LLMs than engineers whose work is more ad hoc. If you've invested in framework design, in consistent naming conventions, in well-documented architectural decisions — all of that investment pays forward into LLM assistance quality. The knowledge base I built is essentially an explicit representation of patterns I had already developed implicitly. Making those patterns explicit also made them accessible to a model.

Who Is Ready for This

The audience for serious AI tooling investment, as of right now, is narrow. You need enough engineering depth to evaluate model output critically — to catch the confident wrong answers and the subtle logic errors that look correct but aren't. You need enough infrastructure comfort to build and maintain the surrounding systems without being blocked by the operational complexity. You need enough patience to invest in tooling that pays back over months rather than days.

You also need either deep pockets or the hardware you bought before GPU scarcity made local inference inaccessible. The engineers running capable local models today are largely those who acquired the hardware before the market moved against them. That window may or may not reopen.

The honest assessment: AI-assisted development, done at the level of investment I've described, is currently viable for a specific population of technically senior, infrastructure-comfortable, high-pain-tolerance engineers. The consumer-ready version — the one that works well without significant setup, without ongoing maintenance, without deep expertise in the tools being used — does not yet exist. It is probably coming. The trajectory of improvement is real. It is not here yet.

What Comes Next

The work continues. The knowledge system needs better automatic ingestion. The orchestration layer needs better error recovery. The provider routing needs more sophisticated cost awareness. The de-identification pipeline needs to handle more edge cases. None of these are done; all of them are improving.

The question I keep returning to is not "when will LLMs be good enough?" — they're already good enough for a significant fraction of the work I do. The question is "when will the infrastructure required to use them reliably be accessible enough that the investment makes sense for a broader population?" The answer to that question depends on how the tooling ecosystem matures, and the answer is not yet obvious.

I'll keep building, keep documenting, and keep being honest about what's working and what isn't. If you're on a similar path — building the infrastructure, not just using the models — I'd genuinely like to compare notes. As always, I'm here to help.

Claude Fable 5 and the New Capability Tier Above Opus

Shannon Lowder — Wed, 10 Jun 2026 10:00:00 GMT

Photo: “Sunrise on the High Country” by Zach Dischner, licensed under CC BY 2.0.

Anthropic announced Claude Fable 5 and Claude Mythos 5 yesterday — the first models in the new Mythos-class tier that sits above Opus in the model family hierarchy. Fable 5 is the accessible tier of the Mythos class; Mythos 5 is the top of the stack. If you've been tracking the model naming conventions, the message is clear: Anthropic is building upward from the existing Opus tier, not just iterating horizontally.

From a data engineering workflow perspective, here's what this announcement actually means for you.

The Capability Ceiling Moved Again

The Mythos class is positioned for the hardest reasoning tasks — the ones where current Opus models make errors that matter. For most data engineering agent use cases, Opus 4.8 is already more capable than what the task requires. Mythos class is for the tail of genuinely hard problems: complex multi-constraint optimization, synthesizing contradictory information from large corpora, tasks that require reasoning about reasoning.

Don't automatically upgrade your pipeline infrastructure to the highest tier. Evaluate against your actual task distribution.

The Tiering Strategy This Implies

With Haiku 4.5 at the bottom, Sonnet 4.5 in the middle, Opus 4.8 for hard tasks, and now Fable/Mythos 5 at the frontier, Anthropic has the most complete model tier lineup in the market. The right response for pipeline architects is to make your model routing explicit and tier-aware — and to build eval infrastructure that tells you which tier each of your pipeline tasks actually needs.
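The eval-driven part of that advice reduces to a simple selection rule: for each task, take the cheapest tier whose measured pass rate clears your bar. The sketch below assumes you already have per-tier pass rates from your own eval suite. The tier labels are shorthand, and the pass rates are invented numbers.

```python
from typing import Optional

# Cheapest first; shorthand labels for the tiers named above.
TIERS_CHEAPEST_FIRST = ["haiku", "sonnet", "opus", "mythos"]


def cheapest_passing_tier(pass_rates: dict[str, float], required: float) -> Optional[str]:
    """Return the cheapest tier whose eval pass rate meets the requirement, if any."""
    for tier in TIERS_CHEAPEST_FIRST:
        if pass_rates.get(tier, 0.0) >= required:
            return tier
    return None


# Invented eval results for one hypothetical pipeline task.
schema_mapping_evals = {"haiku": 0.91, "sonnet": 0.97, "opus": 0.99}
chosen = cheapest_passing_tier(schema_mapping_evals, required=0.95)
```

A `None` result is informative too: it means no evaluated tier meets the bar, and the task needs either a higher tier or a redesign.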

Most of your tasks will run on Haiku or Sonnet. A few will benefit from Opus. Reserve the Mythos class for the use cases that justify the cost and can actually benefit from the additional capability. As always, I'm here to help design the routing and eval layer for your specific pipeline.

Claude Opus 4.8 and Dynamic Workflows: The Honest Assessment

Shannon Lowder — Fri, 29 May 2026 10:00:00 GMT

Photo: “Input_Register_Display_Clutch” by tony_duell, licensed under CC BY 2.0.

Claude Opus 4.8 shipped yesterday, emphasizing honesty and reliability as the headline characteristics alongside the Dynamic Workflows research preview in Claude Code. The reliability framing is an interesting choice — it's a signal that Anthropic sees predictable, trustworthy behavior in agentic contexts as the frontier to push on, not raw benchmark performance.

Let me give you the practitioner's take on both pieces.

The Reliability Focus

Opus 4.8's reliability improvements are most visible in long, multi-step agentic workflows. The model is less likely to contradict itself between turns, less likely to abandon a well-formed plan when it hits a minor obstacle, and more consistent about following complex instruction sets through a long context. For the kind of pipeline orchestration I build — where the agent needs to maintain coherent reasoning across 20+ tool calls — this matters more than benchmark scores on capability evaluations.

In practice, I've seen this translate to fewer "the agent got confused and started over" failures in production workflows, which is a meaningful reliability improvement even if it doesn't show up cleanly in public benchmarks.

Dynamic Workflows Research Preview

Dynamic Workflows in Claude Code is the research preview that lets the model adapt its task plan based on what it discovers during execution — rather than committing to a fixed sequence of steps upfront. For data engineering automation tasks where the right sequence of operations depends on what you find when you look at the current state of the system, adaptive planning produces better outcomes than a rigid predefined plan.

It's a research preview, which means it's not production-ready and the behavior can be surprising. Worth experimenting with on non-critical workflows; not yet the basis for production pipeline automation. I'm here to help design the evaluation if you want to test it on your use cases.

Azure Agent Mesh and the Enterprise Multi-Agent Infrastructure Question

Shannon Lowder — Tue, 19 May 2026 10:00:00 GMT

Photo: “A large, modern geodesic dome structure with a translucent roof under a bright, sunny sky, viewed behind lush green trees and a stone wall.” by Alina Kakshapati, licensed under CC0.

Azure Agent Mesh, announced at Build this week, is Microsoft's answer to a problem that's been building for the past year: as organizations deploy more agents — in Copilot Studio, in Azure AI Foundry, in custom LangGraph deployments — the coordination problem grows. How does an agent in one environment know what capabilities another agent has? How do you control which agents can call which agents, with access to what data?

Agent Mesh is an infrastructure layer that addresses the discovery, routing, and governance problem at the multi-agent level.

What Agent Mesh Provides

Agent discovery: agents register their capabilities in a shared registry. A calling agent can query the registry to find an agent that can perform a specific function, rather than having hard-coded endpoint URLs. This is the same pattern as service discovery in microservices architecture, applied to the agent layer.

Governed invocation: access control at the agent call level, not just the API level. You can specify that a data analysis agent can invoke a SQL generation agent but not a write-to-production agent. That governance layer has been missing from every multi-agent framework I've worked with.

Discovery plus governance: find an agent by capability in the registry, then gate the call by caller and Unity Catalog data permissions.
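Here is the discovery-plus-governance pattern in miniature. It is a conceptual sketch with an in-memory registry, not the Agent Mesh API. The class, method names, and agent names are hypothetical, and it models only caller-to-callee permissions, leaving the Unity Catalog data permissions out.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRegistry:
    """In-memory stand-in for a capability registry with call-level access control."""
    capabilities: dict[str, set[str]] = field(default_factory=dict)    # agent -> capabilities
    allowed_calls: set[tuple[str, str]] = field(default_factory=set)   # (caller, callee)

    def register(self, agent: str, caps: set[str]) -> None:
        self.capabilities[agent] = set(caps)

    def allow(self, caller: str, callee: str) -> None:
        self.allowed_calls.add((caller, callee))

    def discover(self, capability: str) -> list[str]:
        # Look agents up by what they can do instead of by hard-coded endpoint.
        return sorted(a for a, caps in self.capabilities.items() if capability in caps)

    def can_invoke(self, caller: str, callee: str) -> bool:
        # Deny by default: a call is permitted only if explicitly allowed.
        return (caller, callee) in self.allowed_calls


registry = AgentRegistry()
registry.register("sql-generation-agent", {"generate_sql"})
registry.register("prod-writer-agent", {"write_to_production"})
registry.allow("data-analysis-agent", "sql-generation-agent")
```

Discovery and authorization are deliberately separate calls. An agent can find `prod-writer-agent` in the registry and still be refused when it tries to invoke it.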

How This Interacts With Databricks

The practical integration point for Databricks environments is Agent Bricks and the Unity Catalog governance layer. An agent deployed through Agent Bricks can register in Agent Mesh with the data access permissions it has in Unity Catalog — which means the mesh knows what data the agent can access before it's invoked. No more "the agent called a downstream agent that had broader data permissions than it should have" problems.

This is early and the integration is not turnkey yet. But the architectural direction is right. If you're running a hybrid Azure/Databricks environment and thinking about multi-agent architecture, the Agent Mesh pattern is worth designing toward even before it's fully GA. I'm here to help think through the architecture.

Prompt Hygiene at Scale

Shannon Lowder — Fri, 15 May 2026 12:00:00 GMT

Photo: “Home Inspection” by MarkMoz12, licensed under CC BY 2.0.

Prompt hygiene is not a glamorous topic. It doesn't make a good conference talk. It doesn't generate engagement on social media. It is, however, the difference between an AI-assisted workflow that you can trust with real work and one that will eventually cause you a problem you can't fully explain to a client.

Two and a half years of building AI workflows into professional consulting work has produced specific, hard-won rules about what goes into prompts and what doesn't. Here they are.

Rule 1: No Secrets in Prompts. Ever.

API keys, database credentials, client account identifiers, personal information, and anything that belongs in a secrets manager do not belong in prompts. This sounds obvious. It's violated constantly.

The violation usually happens because someone pastes code into a prompt and the code happens to contain a secret — an API key that should be in an environment variable but ended up hardcoded, a connection string with credentials embedded, a config file with an account identifier. The code looks fine in the local IDE context where the developer knows what they're looking at. It looks like a data breach in a model provider's request log.

The prevention is automatic secret scanning at the prompt level, run before the prompt reaches the provider abstraction layer:

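import re

class PromptSecretError(ValueError):
    """Raised when a prompt appears to contain a secret."""

# Heuristic patterns. Expect false positives (a UUID matches the first one).
SENSITIVE_PATTERNS = [
    r'[A-Za-z0-9_-]{32,}',          # long random strings (API keys)
    r'postgres://[^@]+@[^\s]+',      # database connection strings
    r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',  # email addresses
    r'\b\d{3}-\d{2}-\d{4}\b',       # SSNs
    r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b',   # Visa/Mastercard-style card numbers
]

def scan_for_secrets(text: str) -> list[str]:
    """Return every substring of text that matches a sensitive pattern."""
    findings = []
    for pattern in SENSITIVE_PATTERNS:
        matches = re.findall(pattern, text)
        findings.extend(matches)
    return findings

def validate_prompt(prompt: str) -> None:
    """Raise PromptSecretError if the prompt contains anything that looks like a secret."""
    findings = scan_for_secrets(prompt)
    if findings:
        # Report only the count so the secret itself never lands in an error log.
        raise PromptSecretError(
            f"Potential secrets detected in prompt: {len(findings)} matches. "
            "Review and remove before sending to model provider."
        )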

This scan runs on every prompt that goes through the orchestration layer. False positives are possible — a long random string might be a UUID that's not actually a secret. The scan raises an exception and requires human review of the flagged content. That friction is intentional.

Rule 2: Scope the Context

Every token in a prompt has a cost — monetary for cloud models, latency for local models, and cognitive for the model's attention. Pasting entire files into a prompt when only a function is relevant is wasteful and counterproductive. The model's attention is diluted by the irrelevant content.

The discipline of scoping context — sending only what's relevant to the current task — also has a secondary benefit: it forces clarity about what is actually relevant. If you can't identify the relevant portion of a file, you probably haven't thought clearly enough about what you're asking the model to do.

The knowledge retrieval system is part of the answer here. Retrieved context is already scoped — the retrieval step returns the five or ten most relevant entries, not the entire knowledge base. The limit on retrieved context length is enforced at the retrieval layer.
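As a rough illustration of enforcing the limit at the retrieval layer, here is a minimal sketch. The `Entry` type, the relevance `score`, and the character budget are assumptions for the example; a real retriever would supply its own scoring and would likely budget in tokens rather than characters.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    score: float  # relevance score from whatever retriever you use (assumed)


def scope_context(entries: list[Entry], top_k: int = 5, max_chars: int = 2000) -> list[str]:
    """Keep the top_k highest-scoring entries that fit inside a character budget."""
    selected: list[str] = []
    used = 0
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)[:top_k]
    for entry in ranked:
        if used + len(entry.text) > max_chars:
            continue  # skip anything that would exceed the budget
        selected.append(entry.text)
        used += len(entry.text)
    return selected
```

The caller never sees more than `top_k` entries or `max_chars` characters, so an oversized knowledge base cannot leak into the prompt by accident.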

Rule 3: Explicit Instructions Over Implicit Assumptions

A prompt that relies on the model inferring what you want is a prompt that will fail in unpredictable ways. The model will infer something. Whether it infers what you intended depends on how similar your intent is to the intent of the prompts in the model's training data.

Explicit instructions don't guarantee correct output, but they define what correct output looks like in a way that the output verification step can check against. "Return a JSON object with keys 'findings' (list of strings) and 'severity' (one of: low, medium, high)" is verifiable. "Review this code and tell me what's wrong" is not.
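Here is a minimal sketch of what "verifiable" can mean for that JSON instruction. The function name is hypothetical and the checks cover only the contract quoted above: two keys, a list of strings, and one of three severities.

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high"}


def verify_review_output(raw: str) -> dict:
    """Parse model output and check it against the contract stated in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"findings", "severity"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    findings = data["findings"]
    if not isinstance(findings, list) or not all(isinstance(f, str) for f in findings):
        raise ValueError("'findings' must be a list of strings")
    severity = data["severity"]
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        raise ValueError("'severity' must be one of: low, medium, high")
    return data
```

Any output that fails these checks can be rejected or retried without a human reading it, which is the practical payoff of the explicit instruction.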

Rule 4: Log Everything

Every prompt, every response, every model call in the orchestration layer gets logged. Not just the final output — the intermediate steps, the retrieved context, the routing decision, the model and provider selected, the token counts, and the verification outcome.

This logging is not for debugging during development. It's for auditing when something goes wrong in production. "The AI generated incorrect code" is not useful after the fact. "The AI generated incorrect code because the retrieved context included a stale entry from three months ago that described a different version of the API" is useful — it tells you what to fix.

The log also provides the data to improve the system over time. Patterns in failures point at specific failure modes. Patterns in successes point at what's working. Without structured logs, you're flying blind on where to invest improvement effort.
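One way to keep those logs structured is a single record type per model call, written as one JSON line. This is a sketch under my own assumptions: the field names mirror the list above but are not a standard schema, and where the lines get written is left to you.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelCallRecord:
    prompt: str
    response: str
    retrieved_context: list[str]
    routing_decision: str
    provider: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    verification_passed: bool
    # UTC timestamp captured when the record is created
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: ModelCallRecord) -> str:
    """Serialize one model call as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because every line carries the retrieved context and the verification outcome, a question like "which failures involved stale context?" becomes a filter over the log rather than an investigation.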

The Cumulative Effect

None of these rules is individually surprising. Applied consistently, together, they produce a workflow that's meaningfully more trustworthy than one that treats prompts as ad hoc text. The secret scanning catches the most dangerous failure mode. The context scoping improves output quality. The explicit instructions make output verifiable. The logging makes the system auditable.

Four rules as one gate before the provider: scan for secrets, scope the context, demand explicit instructions, log everything.

Prompt hygiene is infrastructure, not polish. It belongs in the design phase, not the cleanup phase. As always, I'm here to help if you want to compare notes on specific hygiene rules or enforcement approaches.

Microsoft Build 2026: Project Polaris and What It Means When Microsoft Builds Its Own Model

Shannon Lowder — Fri, 08 May 2026 10:00:00 GMT

Photo: “Vermilion Cliffs National Monument” by BLMArizona, licensed under Public Domain.

The headline from Microsoft Build this week that landed hardest in the developer community: Project Polaris, Microsoft's own foundation model, replaces GPT-4 in GitHub Copilot by August. After years of deep dependency on OpenAI, Microsoft is shipping its own model into its flagship developer product. That's a significant strategic signal worth unpacking.

Why This Matters Beyond the Microsoft Story

The OpenAI-Microsoft relationship has been the most consequential partnership in the AI industry over the past three years. Microsoft's move to build its own model — MAI-Code-1 for GitHub Copilot, with MAI-Thinking-1 for reasoning tasks — signals that even the largest and most committed AI buyers in the world are hedging against vendor dependency. If Microsoft is building its own models, the "just use the best API" strategy has limits that even trillion-dollar companies have decided to address.

For enterprise data engineering teams: the lesson is architectural, not product-specific. Your infrastructure should not have a hard dependency on any single model provider. Build against a model abstraction layer today; you'll need it when the model landscape shifts again.
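A model abstraction layer can be very small. The sketch below is illustrative only: `ModelProvider`, `EchoProvider`, and `ModelRouter` are hypothetical names, and the echo provider stands in for a real API client or local inference server.

```python
from __future__ import annotations

from typing import Protocol


class ModelProvider(Protocol):
    """Anything with a complete() method can serve as a provider."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in for illustration; a real provider would call an API or local server."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


class ModelRouter:
    """Callers depend on this class, never on a specific vendor SDK."""

    def __init__(self, providers: dict[str, ModelProvider], default: str) -> None:
        self.providers = providers
        self.default = default

    def complete(self, prompt: str, provider: str | None = None) -> str:
        return self.providers[provider or self.default].complete(prompt)


router = ModelRouter(
    {"closed": EchoProvider("closed-api"), "open": EchoProvider("self-hosted")},
    default="closed",
)
```

Swapping the default provider is then a configuration change in one place rather than a rewrite of every call site.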

Azure Agent Mesh and What It Enables

The other significant announcement is Azure Agent Mesh — a coordination layer that lets agents built on different frameworks and deployed in different environments discover and communicate with each other. An agent running in Databricks can invoke an agent running in Azure AI Foundry through a standardized interface. This is the enterprise multi-agent infrastructure story that's been missing from the Azure stack.

Combined with the MCP and A2A protocols Microsoft has been advancing, the Fabric plus Databricks plus Azure agent coordination story is becoming more technically coherent. The governance piece — who can invoke which agent, with what data access — is still being worked out, but the plumbing is coming together. I'm here to help if you're mapping out the agent infrastructure for a hybrid Microsoft/Databricks environment.

The Open vs. Closed Model Decision in 2026: A Framework That Actually Works

Shannon Lowder — Sat, 25 Apr 2026 10:00:00 GMT

Photo: “Balance” by Rekyt, licensed under CC BY 2.0.

The open-weight vs. closed-API model decision has gotten more nuanced and more contentious over the past 18 months. There are strong opinions on both sides that are often driven by ideology rather than analysis. Here's the framework I actually use with clients, which is grounded in what matters operationally.

Factors That Favor Closed APIs

Task quality ceiling: on tasks where the absolute best output quality matters (complex reasoning, nuanced prose, multi-step code generation), closed frontier models still tend to have an edge. If you're optimizing for quality and cost is secondary, start with the best closed model.

Operational simplicity: a closed API is an HTTP call. No GPU infrastructure, no model serving layer, no CUDA compatibility issues. For small teams without dedicated MLOps capability, the operational simplicity of a closed API is worth a significant cost premium.

Factors That Favor Open Weights

Data sovereignty: if your data cannot leave your network perimeter, you have no choice but to run on-premises or in your own cloud environment. Open weights are the only option.

Cost at scale: at high enough request volumes, the per-token premium of closed APIs exceeds the infrastructure cost of self-hosted inference. The crossover point depends on your volume and hardware costs, but it's real and calculable.
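The crossover calculation is simple arithmetic once you pick your numbers. Here is a sketch with a deliberately simplified cost model: a flat API price per million tokens versus a fixed monthly self-hosting cost plus an optional marginal cost per million tokens. The prices in the usage note are placeholders, not quotes from any provider.

```python
def crossover_tokens_per_month(
    api_cost_per_million: float,
    selfhost_fixed_per_month: float,
    selfhost_cost_per_million: float = 0.0,
) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API.

    Solves: api * V == fixed + selfhost_marginal * V, with V in millions of tokens.
    """
    saving_per_million = api_cost_per_million - selfhost_cost_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return selfhost_fixed_per_month / saving_per_million * 1_000_000
```

For example, with a hypothetical API price of 10 per million tokens, a fixed self-hosting cost of 5,000 per month, and a marginal self-hosting cost of 2 per million, the break-even volume is 625 million tokens per month. Below that, the API is cheaper under this model.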

Fine-tuning on domain data: you can fine-tune open weights on your data. You cannot fine-tune a closed model on data that doesn't leave your environment.

The Hybrid Is Usually the Answer

Most production environments I work with use closed APIs for development and experimentation, and migrate high-volume stable workloads to open-weight self-hosted when the economics justify it. The architecture should support both through a model abstraction layer. Don't paint yourself into one corner. I'm here to help design the right split for your workload profile.

Closed for quality and ops simplicity, open for sovereignty, scale economics, and fine-tuning — behind one abstraction layer so you're never painted into a corner.

Llama 4 Behemoth and What Trillion-Scale Parameters Actually Buy You

Shannon Lowder — Fri, 17 Apr 2026 10:00:00 GMT

Photo: “Musical Orchestra Conductor” by gavinwhitner, licensed under CC BY 2.0.

Meta's Llama 4 Behemoth — 2 trillion total parameters, 288 billion active per token — is the largest open-weight model ever previewed. The numbers are designed to impress, and the benchmark results on STEM reasoning and mathematics are genuinely strong. But "biggest model wins" is a lazy frame, and the question worth asking is more precise: what does parameter scale at this level actually give you that a smaller model with more context doesn't?

The answer is more specific than most coverage acknowledges, and it's not what you probably assume.

What the Scaling Laws Actually Say

The empirical data on parameter scaling has been accumulating for years and the picture is clear: different capabilities plateau at very different scales. Code generation performance flattens out around 34 billion parameters. Reasoning on benchmark tasks like GSM8K plateaus around 70 billion. Language understanding hits diminishing returns even earlier, around 13 billion. Beyond those thresholds, you're paying more compute and memory for smaller and smaller marginal gains on those specific tasks.

The Chinchilla result — a 70B model trained on 1.4 trillion tokens outperforming 280B Gopher on nearly every benchmark — is the cleanest demonstration that parameter count isn't the primary lever. Training data quality and volume move the needle more reliably than adding parameters beyond the point of diminishing returns.

So what does scale buy once you're past the task-specific plateaus?

The Long Tail Is the Real Answer

World knowledge follows a power-law distribution. A small number of facts and entities appear constantly in training data; the vast majority appear rarely. Rare medical conditions, regional legal specifics, obscure programming language internals, niche scientific subfields, historical events outside the Western canon — these get seen once or twice during training and need parameter capacity to stick.

A 30B model simply can't memorize enough of the long tail. It will hallucinate confidently on rare domain questions not because it's bad at reasoning, but because the relevant facts were never reinforced enough to be retrievable. A trillion-parameter model, trained on the same corpus, retains more of that low-frequency knowledge — not because it reasons better, but because it has more room to store what it saw.

This is the honest core of what Behemoth's scale buys: breadth of coverage, not depth on any specific task.

Cross-Domain Synthesis on Rare Associations

The other genuine advantage of scale is synthesizing knowledge across domain boundaries that rarely appear together in training data. Connecting distributed systems theory to immunological network topology, or reasoning about the intersection of Byzantine fault tolerance and epidemiology, requires the model to have absorbed rare co-occurrences across disparate fields. Those joint distributions are sparse in any training corpus, and smaller models don't see them often enough to build reliable associations.

This shows up most clearly in STEM tasks that require multi-domain reasoning — which is exactly why Behemoth's benchmark wins cluster there. It's not that larger models reason more deeply within a domain. It's that they've absorbed more of the unusual cross-domain associations that hard STEM problems require.

What Scale Does Not Buy: The Code Generation Case

Here's where the conventional narrative breaks down. Complex code generation — specifically the repository-level kind, where you're generating a multi-file transformation that handles edge cases across a large, complex schema — is not primarily a parameter problem. It's a context problem.

The benchmark evidence is direct. Research comparing models across parameter counts on coding tasks shows a 165M parameter model with an 8K context window matching a 20B model with a 2K context window on the APPS benchmark. At the repository level, what matters is fitting the relevant codebase, schema definitions, and upstream context into the model's working window — not how many parameters the model has. Llama 4 Scout's 10-million-token context window is a more powerful tool for this task than Behemoth's 2 trillion parameters.

There's a compounding factor here that's easy to miss: larger models tend to generate more verbose output. On long-context tasks, verbosity fills the window faster and degrades performance. A well-prompted mid-sized model with a large context often outperforms a frontier model with a cluttered context on this dimension.

If your task is "write a PySpark transformation that correctly handles schema evolution and null propagation across a 40-table medallion architecture," you want Scout with the schema and upstream pipeline context in the window. You don't need Behemoth.

The Orchestrator Frame

The most accurate mental model for what a trillion-parameter model is good for is a generalist orchestrator that needs to cover an enormous surface area. It knows a lot of things about a lot of domains. It can follow unusual instruction formats it wasn't specifically fine-tuned for. It can zero-shot generalize to task types that specialist models don't handle well. It synthesizes across disciplines in ways that require broad training coverage.

That's a valuable capability in an architecture where the orchestrator routes to specialist models for execution. The generalist at the top needs breadth. The specialists underneath need depth and speed. Confusing the two — expecting Behemoth to outperform a fine-tuned 34B model on your specific data pipeline task — is a category error.

Scale buys breadth, not depth: use a trillion-parameter generalist as the orchestrator and route execution to specialists — big-context Scout for code, small models for structured work.

The Weights Still Haven't Shipped

One more thing worth stating plainly: Behemoth's weights have not been publicly released as of this writing. Meta previewed it at the Llama 4 launch and cited internal benchmark results. The claims about STEM performance are from Meta's own evaluations on their own benchmarks. Until the weights ship and the community runs independent evals, particularly on tasks outside the specific benchmark suite Meta chose, the real-world performance profile is not fully known.

That matters for how you plan. If you're making infrastructure decisions based on Behemoth, you're betting on a model you can't currently run, evaluated on benchmarks chosen by the organization that trained it. Build the architecture that the weights you actually have access to can support, and make room to plug in Behemoth if and when it ships and earns its slot. As always, I'm here to help think through how that fits in your specific stack.

De-Identifying Prompts: Protecting Business Logic From the Model

Shannon Lowder — Wed, 15 Apr 2026 12:00:00 GMT

Photo: “Old brass key” by Ivan Radic, licensed under CC BY 2.0.

When you send a prompt to a cloud model, you are sending data to a server you don't control, operated by a company whose data handling practices you've agreed to in a terms of service document you may not have read carefully. For personal projects, that's a risk you can evaluate and accept. For client work, the calculus is different — the data in that prompt may not be yours to send.

De-identification is the practice of removing or obscuring identifying information before it reaches a model provider. It's not a perfect solution, and it introduces its own engineering complexity. It is, for certain categories of work, the only approach that keeps client data under appropriate control.

What's Actually in a Prompt

The risk isn't always obvious because the sensitive information isn't always explicit. Consider a prompt that asks a model to review a data transformation function. The function itself might reference a specific table name, a specific client's naming convention, or a column name that — even without other context — reveals what kind of data the system handles. A function called transform_phi_encounters tells a reader with any domain knowledge that you're handling protected health information.

More subtle: the structure of your code reveals your architecture. Naming conventions reveal your domain. Schema patterns reveal your data model. None of this is a trade secret in isolation. In aggregate, and combined with other context, it describes your client's system in more detail than most clients would consent to if the question were asked directly.

The question I started asking: before I paste this into a cloud model, would I be comfortable if this exact text appeared in a training dataset, or in a log file that got subpoenaed in a legal proceeding? If the answer is no, the text needs to be de-identified before it leaves the machine.

The De-Identification Pipeline

De-identification in code prompts is different from de-identification in text documents. The common approach for documents — named entity recognition, regex patterns for known identifier formats — doesn't translate directly to code, where the sensitive identifiers are function names, variable names, and schema definitions rather than person names and account numbers.

The approach I use is substitution-based: replace domain-specific identifiers with generic equivalents before the prompt is sent, and reverse the substitution on the response. A function that handles fiscal period calculations becomes a function that handles period_calculations. A table named client_accounts becomes entity_records. The model sees generic names; the response comes back with generic names; the substitution map translates the response back to the actual names in the codebase.

Substitute domain identifiers locally, let the cloud model see only generic names, then reverse the map on the way back — the business logic never leaves your machine.

class PromptDeidentifier:
    """Session-scoped, reversible substitution of domain identifiers."""

    def __init__(self):
        self.substitution_map: dict[str, str] = {}  # original -> generic
        self.reverse_map: dict[str, str] = {}       # generic -> original
        self._counter = 0

    def substitute(self, identifier: str, category: str = "entity") -> str:
        # Reuse an existing mapping so the same identifier always gets the same generic name.
        if identifier not in self.substitution_map:
            self._counter += 1
            generic = f"{category}_{self._counter:04d}"
            self.substitution_map[identifier] = generic
            self.reverse_map[generic] = identifier
        return self.substitution_map[identifier]

    def deidentify(self, text: str, identifiers: list[str]) -> str:
        result = text
        # Longest first, so an identifier that contains a shorter one is replaced whole.
        for identifier in sorted(identifiers, key=len, reverse=True):
            generic = self.substitute(identifier)
            result = result.replace(identifier, generic)
        return result

    def reidentify(self, text: str) -> str:
        result = text
        # Longest generic first, so entity_10000 is not partially rewritten as entity_1000.
        for generic in sorted(self.reverse_map, key=len, reverse=True):
            result = result.replace(generic, self.reverse_map[generic])
        return result

The substitution map is scoped to a session — the same identifier gets the same generic replacement consistently within a session, so the model can reason about relationships between entities without the identifiers being meaningful outside their substituted context.
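Plain substring replacement will also rewrite an identifier where it appears inside a longer name. If that is not what you want, one variant is to match identifiers only as whole tokens. The sketch below is an alternative I'm offering under that assumption, not the pipeline described above; the function names and the `entity_NNNN` naming are illustrative.

```python
import re


def build_pattern(names: list[str]) -> re.Pattern[str]:
    """Match any of the names, longest first, only when not embedded in a longer token."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    return re.compile(rf"(?<![A-Za-z0-9_])(?:{alternation})(?![A-Za-z0-9_])")


def deidentify(text: str, identifiers: list[str]) -> tuple[str, dict[str, str]]:
    """Return the scrubbed text and a generic -> original map for the reverse trip."""
    if not identifiers:
        return text, {}
    mapping: dict[str, str] = {}  # original -> generic

    def replace(match: re.Match[str]) -> str:
        original = match.group(0)
        if original not in mapping:
            mapping[original] = f"entity_{len(mapping) + 1:04d}"
        return mapping[original]

    scrubbed = build_pattern(identifiers).sub(replace, text)
    return scrubbed, {generic: original for original, generic in mapping.items()}


def reidentify(text: str, reverse_map: dict[str, str]) -> str:
    """Translate generic names in a model response back to the real identifiers."""
    if not reverse_map:
        return text
    pattern = build_pattern(list(reverse_map))
    return pattern.sub(lambda m: reverse_map[m.group(0)], text)
```

The trade-off is that an identifier embedded in a longer name, such as `client_accounts` inside `client_accounts_archive`, is left alone unless the longer name is also listed. Whether that is safer or leakier depends on how complete your identifier list is.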

Using Local Models to Break Down Business Logic

There's a more interesting application of this pattern: using a local model as a pre-processor to translate business logic into technical terms before sending the result to a cloud model.

The idea: your business problem contains domain-specific value. "Calculate the yield on a non-standard bond with these terms" reveals that you're building financial software. "Transform this time series with these aggregation rules" could describe almost anything. The business concept contains the intellectual property; the technical pattern does not.

Running a local model to translate the business description into a technical specification — substituting domain concepts with technical abstractions — strips the identifiable business logic before the cloud model sees it. The cloud model gets the technical problem. The local model holds the translation key. The intellectual property never leaves the machine.

This isn't foolproof — clever inference from technical patterns can sometimes reconstruct the business domain. But it significantly reduces the information density of what the cloud model receives, which is the achievable goal. Perfect de-identification doesn't exist. Meaningful de-identification does.

The Operational Discipline

De-identification only works if it's applied consistently. A pipeline that de-identifies most prompts but occasionally sends raw text to a cloud model is not meaningfully more private than a pipeline that sends everything raw. The consistency requirement means the de-identification has to be enforced at the infrastructure level — not left to developer discretion on each call.

In the ForgeAI orchestration layer, de-identification is applied automatically for any task that the routing logic classifies as high data sensitivity. The developer doesn't decide whether to de-identify; the routing configuration decides, and the de-identification runs before the model call. Removing the opt-in decision removes the failure mode of forgetting to opt in. As always, I'm here to help if you want to compare notes on de-identification approaches for your specific situation.

Agent Reliability After Six Months of Production LangGraph Pipelines

Shannon Lowder — Fri, 27 Mar 2026 10:00:00 GMT

Photo: “Toddler's hand on a tramboline safety net. Symbol of carefree childhood” by Ivan Radic, licensed under CC BY 2.0.

I've been running LangGraph-based agents in production for the better part of a year now. Here's an honest accounting of where the reliability challenges actually show up — not the architectural problems that show up in conference talks, but the operational ones that show up on a Tuesday afternoon when something breaks.

The Prompt Drift Problem

Agent behavior is sensitive to prompt changes in ways that don't always surface in unit tests. A change to the system prompt that improves performance on one class of inputs can silently degrade performance on another. If you're iterating on prompts in production without eval coverage, you're flying blind. The fix is non-negotiable: every prompt version gets a version number, every change gets run through your eval suite before deployment.
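A version-and-gate workflow can be sketched in a few lines. Everything here is hypothetical: `eval_suite` stands in for whatever scoring harness you run, and the threshold is a placeholder you would tune for your own suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    eval_score: float


class PromptRegistry:
    """Only prompts that pass the eval gate receive a version number."""

    def __init__(self, eval_suite: Callable[[str], float], min_score: float) -> None:
        self.eval_suite = eval_suite
        self.min_score = min_score
        self.deployed: list[PromptVersion] = []

    def deploy(self, text: str) -> PromptVersion:
        score = self.eval_suite(text)
        if score < self.min_score:
            raise ValueError(
                f"eval score {score:.2f} is below the gate {self.min_score:.2f}"
            )
        candidate = PromptVersion(len(self.deployed) + 1, text, score)
        self.deployed.append(candidate)
        return candidate

    def current(self) -> PromptVersion:
        return self.deployed[-1]
```

Keeping the score next to the version makes regressions visible: if version 7 scored lower than version 6 on the same suite, the history says so.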

The Long-Tail Failure Modes

The failure modes that show up most often in production aren't the ones you designed for — they're the inputs that fall outside the distribution you tested against. An extraction agent trained on clean text will fail quietly on text that includes special characters, unicode issues, or unexpected formatting. The fix is adding those cases to your eval suite as you discover them, and maintaining a quarantine path for inputs the agent expresses low confidence on.
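The quarantine path can be as simple as a threshold check on the agent's reported confidence. This sketch assumes the agent returns a confidence value between 0 and 1; the `Extraction` and `ConfidenceRouter` names and the 0.8 default are illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class Extraction:
    record_id: str
    value: str
    confidence: float  # assumed to be reported by the agent, in [0, 1]


@dataclass
class ConfidenceRouter:
    threshold: float = 0.8
    accepted: list[Extraction] = field(default_factory=list)
    quarantined: list[Extraction] = field(default_factory=list)

    def route(self, result: Extraction) -> str:
        """Accept confident results; hold everything else for human review."""
        if result.confidence >= self.threshold:
            self.accepted.append(result)
            return "accepted"
        self.quarantined.append(result)
        return "quarantined"
```

The quarantined list doubles as a source of new eval cases, since every record in it is an input the agent was unsure about.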

Latency Budgets and Their Violations

Multi-step agent workflows accumulate latency. A five-node graph that runs sequentially, where each node calls an LLM with a 2-second median latency, takes roughly 10 seconds end to end in the typical case, and much longer at the 99th percentile when one call hits a cold model or a network hiccup. For workflows that block downstream processes, set explicit timeouts at the graph level and design the timeout path to be safe (quarantine the record, alert a human) rather than destructive (assume success). I'm here to help design the reliability layer for your specific pipeline.

Version and eval-gate prompts, then make the failure path safe — quarantine and alert a human, never assume success.
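For the timeout path, here is a minimal sketch using `asyncio.wait_for` around a whole workflow. It is framework-agnostic rather than LangGraph-specific: `workflow` is any coroutine function taking a record ID, and the returned dictionary shape is an assumption for the example.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_budget(
    workflow: Callable[[str], Awaitable[str]],
    record_id: str,
    budget_seconds: float,
) -> dict:
    """Run the workflow under a latency budget; on timeout, take the safe path."""
    try:
        result = await asyncio.wait_for(workflow(record_id), timeout=budget_seconds)
        return {"record_id": record_id, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        # Safe path: never assume success. Quarantine the record and flag a human.
        return {"record_id": record_id, "status": "quarantined", "alert": True}


async def slow_workflow(record_id: str) -> str:
    """Stand-in for a multi-node graph run."""
    await asyncio.sleep(0.2)
    return f"processed {record_id}"
```

Calling `asyncio.run(run_with_budget(slow_workflow, "r1", 0.01))` returns a quarantined result because the stand-in workflow needs 0.2 seconds, while a budget of 1.0 returns status `ok`.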

MCP Six Months After GA: What's Actually Working in Production

Shannon Lowder — Wed, 18 Mar 2026 10:00:00 GMT

Photo: “Switch!” by andrewfhart, licensed under CC BY-SA 2.0.

Model Context Protocol went GA at Microsoft Build in May. Six months in, I have enough production experience with it to give you a more honest take than the launch-day excitement warranted.

The short version: MCP is the right abstraction and the ecosystem is growing fast, but there are friction points in the production deployment story that nobody talks about at conferences.

What's Working Well

Tool portability is exactly as advertised. An MCP server I wrote to expose Unity Catalog table metadata works with Claude.ai, LangGraph, Copilot Studio, and the OpenAI Agents SDK without any modification. Write the server once, connect it to any client. That's a genuine win for teams that don't want to maintain multiple tool integration implementations.

Write the tool server once; every MCP client connects to it unchanged — the portability win that holds up in production.

The inspector tooling for debugging MCP servers is solid. You can run an MCP server locally, point the MCP inspector at it, and interactively test tool calls without a full agent workflow. That's a much faster development loop than "spin up an agent, run it, check the logs."

The Friction Points

Authentication is underspecified in the current protocol. Each MCP client implements auth differently, which means your server needs to handle the lowest common denominator or maintain client-specific auth logic. For internal tools where everything is in the same network this is fine; for external-facing MCP servers it's a real problem.

Error handling semantics are inconsistent across clients. When a tool call fails, different clients treat the error differently — some retry, some surface to the model, some raise an exception. Design your error responses defensively.
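One defensive pattern is to make every error response self-describing, so it reads sensibly whether the client retries, surfaces it to the model, or raises. The sketch below builds a plain dictionary shaped like an MCP tool result with `isError` and a text `content` item, as I understand the protocol; the `error_code` and `retryable` fields inside the text payload are my own convention, not part of the spec.

```python
import json


def tool_error(code: str, message: str, retryable: bool) -> dict:
    """Build an error result that is readable by both clients and models."""
    payload = {
        "error_code": code,      # hypothetical, stable identifier for the failure
        "message": message,      # short and human-readable, safe to show a model
        "retryable": retryable,  # hint for clients that choose to retry
    }
    return {
        "isError": True,
        "content": [{"type": "text", "text": json.dumps(payload)}],
    }
```

Putting the retry hint in the payload means a client that surfaces the error to the model still gives the model enough information to decide whether to try again.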

The production deployment story (running an MCP server in a container, managing lifecycle, handling restarts) is still mostly DIY. The ecosystem around this is immature compared to typical API service deployment. Worth it for the portability; just go in with realistic expectations about the ops overhead. I'm here to help work through the implementation.