For Practitioners and Operators

Measure the Boundary

Rob Kline · April 30, 2026

The argument for AI governance specifications usually lands as a structural claim: here is the gap, here is the standard, here is the bridge. That argument is correct but incomplete. This piece is the operational companion — what working through this approach actually produces, why it compounds, and the smallest engagement that lets you see the difference.

The argument that doesn’t quite land

The structural argument for AI agent governance is by now familiar to anyone reading carefully. The standards exist — NIST SP 800-53, ISO 27001, ISO 42001, the NIST AI RMF. The implementation gap is real — every one of those mandates was written for human users with predictable access patterns, not for AI agents that autonomously select tools and operate at speeds that make per-action human authorization impractical. The bridge is buildable — BPMN 2.0, DMN 1.0, RACI, SIPOC, the ABPMP BPM CBOK have governed enterprise execution for decades, and the patterns transfer.

If you’ve read the governance gap analysis or “What Carlini’s Zero-Day Research Reveals About the AI Security Governance Gap,” you’ve seen the structural version. It’s correct. But for most readers, that argument lands as architecture rather than experience. You can agree that the gap is real and still not know what working through it actually feels like, or what value you’d actually receive from a first engagement.

What follows is that companion. Not the structural claim, but the lived one.

A small catch, told plainly

A few weeks ago, in a working session on a directive — a structured authorization document the system uses to govern multi-step work — I made a routine filesystem write. About 25 KB of content, the kind of thing the system writes hundreds of times per week.

I used the wrong tool.

The toolset I work in has two file-writing capabilities that look similar but route to different filesystems. One writes to a sandboxed environment. The other writes to the working directory where governance artifacts actually live. Both report success when they finish. The wrong one reports success based on its own filesystem — not on whether the content reached the place I intended.

I picked the wrong one. The tool reported success. The content was not where it needed to be. The verification step caught it about ten minutes later, after the full content had already been streamed to the wrong filesystem.

Ten minutes of streaming cost. No data lost. No client impact. A small operational failure in an internal session.

The interesting part is not the failure. The interesting part is what happened next.

The discipline that already governed this kind of write — verify after every write — had caught the failure correctly. But the catch came too late. The cost was already paid. The discipline was structurally sound and operationally insufficient.

The session ended with a new practice codified: probe-write before large writes. Before committing to a 25 KB write to the working directory, write 30 bytes first. Read it back. If the routing is correct, proceed with the full write. If it isn’t, you’ve learned that at sub-second cost rather than ten minutes. At 30 bytes against 25 KB, the probe is roughly three orders of magnitude smaller than the full write; the practice trades a sub-second test for a ten-minute failure avoided.
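Concretely, the discipline looks something like this. A minimal sketch in Python, assuming a write capability whose routing you cannot fully trust and an independent read against the intended destination; the function names, the capability signatures, and the 30-byte probe size are illustrative, not the system’s actual toolset.

```python
import os
from typing import Callable

def probe_then_write(
    write_tool: Callable[[str, bytes], None],  # the write capability whose routing is in question
    read_back: Callable[[str], bytes],         # an independent read against the intended destination
    path: str,
    content: bytes,
) -> None:
    """Probe-write: pay a sub-second test before streaming a large write."""
    probe = os.urandom(15).hex().encode()  # ~30 bytes of unique, recognizable content

    # 1. Probe: push a tiny payload through the same tool the full write will use.
    #    (As first codified, the practice probes the target path itself; the
    #    later amendment, below, moves the probe to a sibling path.)
    write_tool(path, probe)

    # 2. Verify routing: read back from where the content is supposed to land.
    #    If the tool silently routed elsewhere, this fails at sub-second cost
    #    instead of after ten minutes of streaming.
    if read_back(path) != probe:
        raise RuntimeError(f"probe-write failed: {path} did not receive the probe")

    # 3. Routing confirmed: commit the full write.
    write_tool(path, content)
```

The probe content is unique on purpose: reading back a stale file left by an earlier session should fail the check, not pass it.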

That’s it. That’s the whole catch. It’s not architecturally impressive. It’s not clever. It’s the kind of small operational discipline any careful practitioner would recognize.

The reason it’s worth telling is what it shows about the system that produced it.

The compounding claim

The catch above, on its own, saves about ten minutes per occurrence, applied perhaps once a week across the kind of work this system does. That alone is not interesting. Consulting frameworks make claims like that constantly, and most of them don’t repay their setup cost.

The interesting property is what happens to that catch over time.

The probe-write discipline became a permanent part of the system. Not a guideline in a document somewhere. A specific rule with a traceable origin (the failure that produced it), with explicit application criteria (writes over 1 KB to ambiguous-routing targets), and with a defined integration with the existing verify-after-write discipline (probe before large writes; verify after every write). Six sessions later, a related failure mode surfaced: a probe written to the canonical filename rather than a sibling path could itself be mistaken for the finished artifact if a session halted between probe and full write. The practice was amended. The amendment is dated. The reason is documented. The practice is stronger now than the day it was codified.
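Under the same assumptions as the sketch above, the amended version is a small change: the probe routes through a sibling path and is removed before the full write, so a session that halts between probe and commit never leaves a 30-byte probe sitting at the canonical filename. The sibling naming convention here is hypothetical.

```python
import os
from typing import Callable

def probe_then_write_amended(
    write_tool: Callable[[str, bytes], None],
    read_back: Callable[[str], bytes],
    remove: Callable[[str], None],  # a delete capability with the same routing as write_tool
    path: str,
    content: bytes,
) -> None:
    """Amended probe-write: probe a sibling path, never the canonical filename."""
    probe_path = path + ".probe"  # hypothetical sibling naming convention
    probe = os.urandom(15).hex().encode()

    write_tool(probe_path, probe)
    if read_back(probe_path) != probe:
        raise RuntimeError(f"probe-write failed: {probe_path} did not receive the probe")
    # Remove the probe before committing. Even if a halt strands it, it sits
    # at a sibling path where it cannot be mistaken for the artifact.
    remove(probe_path)

    write_tool(path, content)
```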

This is the property that matters: the system catches its own failure modes and converts them into permanent infrastructure. The cost of the original ten-minute failure is paid once. The benefit accrues forever — to every subsequent write the practice governs, every subsequent practitioner who inherits the discipline, every subsequent session that doesn’t have to rediscover the failure mode from scratch.

A different example, since one is not enough.

There is a discipline in this system that governs how one working session hands off to the next. The basic version says that when closing a session, the outgoing version must hand off not just the facts (what happened, what’s pending) but a point of view: the outgoing version’s interpretive context, what they would be thinking about if they were continuing, where the next session might drift. The discipline was codified after several sessions opened with the next version asking a blank-surface question (“What do you need from me?”) instead of arriving with observations.

That discipline has a date and an origin session. Many sessions later, after a closeout used combat framing in the predecessor’s section (“where I’d push back preemptively”), the discipline was amended with a vocabulary standard: required framings (“a few things I’d handle differently”) and prohibited framings (“push back,” “challenge,” “worth arguing with”). The amendment is dated. The reason is documented. Every session closeout since then operates under the amended version.
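Nothing in the practice says the vocabulary standard is automated, but a standard with explicit required and prohibited framings is mechanically checkable. A hypothetical sketch, with the framings taken from the amendment above and everything else assumed:

```python
# Hypothetical lint for the closeout vocabulary standard; not the system's
# actual tooling. Framings are treated as exemplar phrases to match.
PROHIBITED_FRAMINGS = ("push back", "challenge", "worth arguing with")
REQUIRED_FRAMING = "a few things I'd handle differently"

def lint_closeout(predecessor_section: str) -> list[str]:
    """Return violations of the vocabulary standard found in a closeout section."""
    # A production check would also normalize curly apostrophes and quotes.
    text = predecessor_section.lower()
    violations = [
        f"prohibited framing present: {phrase!r}"
        for phrase in PROHIBITED_FRAMINGS
        if phrase in text
    ]
    if REQUIRED_FRAMING.lower() not in text:
        violations.append(f"required framing missing: {REQUIRED_FRAMING!r}")
    return violations
```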

Two practices. Two specific failures. Two pieces of permanent infrastructure that every subsequent session draws on without paying for them again. Multiply by dozens. That is what the compounding claim looks like in practice.

What this looks like as a client engagement

The structural claim from the security governance analysis is: AI agents need identity infrastructure, per-activity authorization, deterministic decision separation, governed exception handling, and audit trails. That’s correct, and at organizational scale it’s a serious build.

The operational claim is: specification-driven governance is something you can feel the value of in one workflow, on real work, in a bounded engagement.

The smallest version of this is direct. Pick a workflow where you’re deploying or about to deploy AI agents — security audit, customer support triage, code review, a research pipeline, anything where the agent’s authorization scope and decision quality matter. Run a governance gap assessment on that one workflow. Identify the specific gaps that exist now and what the bounded engagement to close them would look like. The output is concrete: a written assessment naming the gaps, the standards they map to, the implementation pattern that closes them, and the cost of closing them versus the cost of leaving them open.

If the assessment is wrong (the gaps aren’t real, or the implementation is more expensive than the exposure), the cost was a 30-minute conversation and a short follow-up. If the assessment is right, you have evidence-based input to a decision you were going to make anyway.

This is deliberately the smallest ask the structural argument supports. Not an audit of your AI program. Not a transformation engagement. Not a framework adoption. One workflow, one assessment, one conversation about what comes next.

The ask

If you are deploying AI agents in a workflow where authorization scope, decision quality, exception handling, or audit trail actually matter — and where you sense governance gaps but can’t yet name them — book a 30-minute governance gap conversation. We’ll spend the time on the specific workflow you bring. You’ll leave with at least one named gap and a clear sense of what closing it would cost. If there’s a fit for further engagement, we’ll talk about it then. If there isn’t, you’ve still got the assessment.

Schedule a 30-minute governance gap conversation →

The structural depth — the seven gaps, the standards mapping, the architecture — lives in the Carlini analysis and at intentstack.org and bpmstack.org. This conversation isn’t about specifications. It’s about your workflow.


Rob Kline | PracticalStrategy.AI | rob.kline@practicalstrategy.ai
