
The traditional model risk management playbook assumed models could be approved before deployment, then audited periodically. Agentic AI breaks that assumption. Here's what's replacing it, and why the change is more architectural than regulatory.

For thirty years, enterprise model risk management has rested on a quiet assumption: a model is a static artifact. You train it, validate it against a fixed test set, run a deployment review, and put it into production. Audit checks happen at intervals. The model itself doesn't change between audits, and when it does (through retraining, recalibration, or version upgrade) it goes back through the same approval cycle.

This is the world that produced the UK Prudential Regulation Authority's Supervisory Statement SS1/23 on model risk management, the European Banking Authority's machine learning guidelines, and the bulk of internal model governance frameworks at every major financial institution.

The assumption is so foundational that most practitioners never name it. But it has a name now: pre-deployment approval. And it's quietly becoming obsolete for an increasingly large class of enterprise AI systems.

Agentic AI changes the underlying object of governance. When an LLM-driven agent can call tools, query databases, modify state, and chain decisions across multiple steps in seconds, the artifact you're governing is no longer "the model." It's the trajectory: the actual sequence of decisions the agent took on a specific request, at a specific time, with a specific intermediate state. And trajectories don't sit still for pre-deployment review.

The shift Gartner is naming

Gartner's projection that spending on AI governance platforms will reach $492 million in 2026 and exceed $1 billion by 2030 isn't really about platforms. It's about the recognition that the governance surface itself has moved.
The same Gartner research notes that by 2030, roughly half of AI agent deployment failures will trace to insufficient runtime enforcement, not insufficient pre-deployment testing.

That phrase, runtime enforcement, is doing real work. It points to a category of controls that operate while the agent is executing, not before. Examples include:

- Deterministic execution constraints that bound which tool calls an agent can make, and under which conditions
- Audit-trace persistence that records every state transition with sufficient fidelity to reconstruct the trajectory after the fact
- Confidence-threshold gating that blocks autonomous action when the posterior probability falls below a governance-defined bound
- Attribution-stability checks that validate explanations against runtime variance before allowing the decision to commit
- Reproducibility-aware replay that allows post-hoc verification of decisions on supervisory challenge

These aren't pre-deployment checks. They're runtime instruments. And the engineering work required to build them is qualitatively different from the work required to validate a static model.

Why regulated industries hit this first

The shift from pre-deployment approval to runtime enforcement is being driven hardest by industries where individual decisions are subject to legal challenge: banking credit decisions, KYC/AML screening, fraud detection, insurance underwriting, healthcare triage. In each of these domains, a regulator can challenge a specific decision and require the institution to reconstruct exactly what happened and why.

Pre-deployment approval cannot answer that challenge. It can demonstrate that the model was validated in aggregate. It cannot demonstrate what happened in this particular case, on this particular day, with this particular intermediate state. For deterministic statistical models, this gap was tolerable: the model's behavior was reproducible from inputs, so the decision could be reconstructed offline.
For agentic LLM systems, the same inputs can produce different trajectories across runs, and reconstruction requires the audit-trace persistence that runtime enforcement provides.

The UK PRA's SS1/23 doesn't yet name "runtime enforcement" explicitly, but its requirements increasingly imply it. The statement requires firms to maintain "sufficient validation evidence" and to demonstrate "effective challenge" of model decisions. For a stochastic agentic system, the only way to provide that evidence on a specific decision is to have recorded the trajectory at runtime. The FCA's emerging AI-in-financial-services guidance moves in the same direction, emphasizing reconstructibility of algorithmic decisions on supervisory challenge.

The architectural primitives

What does runtime enforcement actually look like in engineering terms? Three primitives keep recurring across implementations:

1. Deterministic state machines for agent orchestration. Traditional LangChain agent loops (observe, plan, act, observe) are non-deterministic by design. The same input can produce different action trajectories across runs because the LLM's intermediate decisions vary. For regulated workflows, this is incompatible with audit requirements. The alternative is explicit state-machine orchestration where transitions are bounded, replayable, and audit-traceable. Frameworks like LangGraph allow this, but require deliberate engineering: you have to design the state machine, not let it emerge from the LLM's improvisation.

2. Checkpoint-enabled execution replay. A state machine that doesn't persist its intermediate states isn't actually replayable. Runtime enforcement requires that each transition write a checkpoint (capturing the input, the LLM's intermediate response, the action selected, and the resulting state) to durable storage with sufficient fidelity to reconstruct the trajectory.
This is closer to database transaction logging than to ML model artifacts, and it's where most existing agent frameworks fall short.

3. Governance-defined confidence thresholds. Autonomous action under uncertainty is the default mode of LLM agents. Runtime enforcement adds an explicit gate: action only proceeds when the posterior probability of the intended outcome exceeds a governance-defined threshold. Formally, this looks like P(s | o_{1:t}) > θ_governance, reading s as the intended outcome and o_{1:t} as the observations through step t, where the threshold is set by policy, not by the model. Below threshold, the system either escalates to human review or aborts the trajectory. This is a fundamentally different control surface from "approve the model once, then trust it."

What's actually being measured

The interesting engineering question isn't whether runtime enforcement is necessary; Gartner and the regulators are converging on that. It's what specifically gets measured at runtime, and how the measurements feed back into governance decisions.

In my own work on imbalance-aware financial distress prediction (preprint at arXiv:2605.14067), three measurement dimensions consistently surfaced as governance-relevant:

- Operating-point sensitivity: how much does classification behavior change as the decision threshold moves? In credit decisioning, a model evaluated at the default 0.5 threshold may behave very differently at a regulator-mandated minority-class recall floor. Threshold sensitivity has to be measured under deployment-time class priors, not training-time priors.
- Attribution stability: TreeSHAP and similar attribution methods depend on the choice of background sample, so when that sample is drawn at random the same prediction can produce materially different feature attributions across runs. For audit defense, what matters isn't that the attribution exists. It's whether it's stable enough to defend on challenge.
I measured this on the public Taiwan Bankruptcy Prediction benchmark (UCI ID 572) under K=50 rotated background samples; the attribution variance across runs was substantial, and SMOTE-based imbalance handling made it materially worse.
- Probability calibration: most production deployments use raw model probabilities as if they were calibrated. They typically aren't. Brier score and Expected Calibration Error relative to deployment-time class priors are the relevant runtime measurements, and they degrade systematically when training-time and deployment-time priors diverge.

The full measurement protocol is open-source on Zenodo (DOI 10.5281/zenodo.20454212), licensed CC BY 4.0, with environment fingerprinting that lets any third party reproduce the numerical results exactly. That reproducibility infrastructure is itself an engineering primitive for runtime enforcement: it's how supervisory challenge gets answered.

The honest engineering implication

The shift from pre-deployment approval to runtime enforcement isn't primarily about better governance tooling. It's about building AI systems differently. Specifically:

- State-machine orchestration becomes a design requirement, not an optional architecture choice. If you can't replay the trajectory, you can't defend the decision.
- Reproducibility infrastructure becomes part of the production path, not a research artifact. Environment fingerprinting, deterministic seed handling, and persistent audit traces have to be engineered into the deployment, not bolted on at audit time.
- Measurement protocols become governance interfaces. The numbers you measure at runtime (calibration, stability, threshold sensitivity) are what regulators will eventually challenge you on. They need to be measured systematically, not opportunistically.
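
To make the calibration measurement concrete, here is a minimal sketch of a Brier score and an equal-width-bin Expected Calibration Error computed over a batch of scored decisions. The function names, the ten-bin default, and the binary-outcome framing are assumptions made for illustration. This is not the protocol published in the Zenodo archive, and it does not include any adjustment for prior shift; it only assumes the batch is drawn from deployment traffic.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE for a binary outcome (illustrative variant).

    Each bin contributes |observed positive rate - mean predicted probability|,
    weighted by the share of decisions that fall into the bin.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins carry no weight
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed - mean_conf)
    return ece
```

In a runtime setting, numbers like these would be recomputed on a rolling window of decisions with known outcomes and compared against a policy-defined tolerance, so that calibration drift becomes an observable event instead of something discovered at the next periodic review.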
A formalisation of the deterministic orchestration approach (a superposition-aware Markov decision process for non-Markovian state transitions in regulated AI orchestration) is currently under peer review at the 2nd IEEE International Conference on Cybersecurity and AI-Based Systems. The framework integrates SHAP attribution-stability constraints, checkpoint-enabled execution replay, and governance-oriented decision thresholds in a way that's designed for FCA-aligned audit environments.

Where this goes next

The platforms Gartner is tracking (Bifrost, Kong AI Gateway, Cloudflare AI Gateway, LiteLLM, OpenRouter) are converging on similar architectural patterns: runtime policy enforcement, deterministic gating, audit-trace persistence. None of them describe themselves primarily in those terms, because the marketing vocabulary hasn't caught up to the engineering reality yet. But the underlying convergence is real.

For enterprise AI teams, the practical implication is straightforward: if your AI governance strategy is built around pre-deployment approval cycles (model cards, deployment reviews, periodic re-validation), you're optimizing for the world that's ending. The world that's beginning is one where governance happens continuously, at runtime, against measurements that didn't exist five years ago.

Building for that world requires different engineering primitives than the ones most teams have today. Deterministic state machines instead of free-running agent loops. Reproducibility infrastructure instead of model artifacts. Measurement protocols instead of point estimates. Audit traces instead of approval signatures.

The regulators aren't going to mandate this all at once. They're going to challenge specific decisions, find that the existing pre-deployment artifacts can't answer the challenge, and incrementally push firms toward runtime enforcement.
The firms that get there first, by building the engineering primitives now rather than waiting for the regulatory forcing function, will have a structural advantage when the rest of the industry catches up. That advantage is engineering, not governance. The governance vocabulary will follow.
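
As a closing illustration, the three primitives discussed in this article (a bounded state machine, a checkpoint written on every transition, and a policy-set confidence gate) can be sketched together in a few dozen lines. Everything named here is hypothetical: the states, the `THETA_GOVERNANCE` value of 0.90, and the hash-chained in-memory log are illustrative choices, not the formalisation under review and not any framework's API. A real deployment would write each record to durable storage.

```python
import hashlib
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class State(str, Enum):
    RECEIVED = "received"
    SCORED = "scored"
    APPROVED = "approved"
    ESCALATED = "escalated"


# Allowed transitions are fixed by policy, not chosen by the LLM at runtime.
ALLOWED: Dict[State, Set[State]] = {
    State.RECEIVED: {State.SCORED},
    State.SCORED: {State.APPROVED, State.ESCALATED},
}

THETA_GOVERNANCE = 0.90  # hypothetical threshold, set by policy


def _digest(body: dict) -> str:
    """Stable SHA-256 over a JSON rendering with sorted keys."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class Trajectory:
    state: State = State.RECEIVED
    log: List[dict] = field(default_factory=list)

    def transition(self, target: State, payload: dict) -> None:
        """Move to `target` if policy allows it, and checkpoint the step."""
        if target not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        record = {
            "from": self.state.value,
            "to": target.value,
            "payload": payload,
            "prev": self.log[-1]["hash"] if self.log else "",
        }
        record["hash"] = _digest(record)  # chained to the previous checkpoint
        self.log.append(record)
        self.state = target


def decide(traj: Trajectory, confidence: float) -> State:
    """Gate: act autonomously only above the threshold, otherwise escalate."""
    target = State.APPROVED if confidence > THETA_GOVERNANCE else State.ESCALATED
    traj.transition(target, {"confidence": confidence, "theta": THETA_GOVERNANCE})
    return target


def verify(log: List[dict]) -> bool:
    """Recompute the hash chain to confirm the trace was not altered."""
    prev = ""
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["prev"] != prev or _digest(body) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

The point of the chain is that a reviewer can replay the recorded steps in order and detect any record that was changed after the fact, which is the property a supervisory challenge on a single decision depends on.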
View original source — Hacker Noon ↗


