A few weeks ago I was building an agent to help me organize email. Nothing exotic: read the inbox, extract what mattered, produce a daily summary. The kind of thing current models handle well.

I grabbed an API token with full mailbox access. Read, write, delete, send. Handed it to the agent and wrote in the system prompt: "you only have read access, do not modify or delete any email."

Then I stopped.

I was asking a probabilistic system to behave deterministically on a class of actions where the cost of a mistake could be data loss. I knew it was the wrong approach. It was just the fastest thing to do. I did it anyway, then went back and changed it.

This pattern is everywhere in how we build agents today. Give an LLM an omnibus token, then try to constrain it through natural language. Don't delete files. Don't send email without confirmation. Don't modify past calendar events. It works until a prompt injection lands, or a jailbreak trick succeeds, or the model exhibits emergent behavior nobody predicted.
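
In code, the anti-pattern is easy to spot. Here is a minimal sketch of what my first version amounted to, with hypothetical names (MailClient, the token string) standing in for whatever mail SDK and agent loop you happen to use:

```python
# Anti-pattern sketch: one omnibus credential, constrained only by prose.
# MailClient and the token are illustrative stand-ins, not a real SDK.

class MailClient:
    """Stand-in for a mail API client; the token decides what it can do."""
    def __init__(self, token: str):
        self.token = token  # grants read, write, delete, and send

    def read(self, query: str) -> list[str]: ...
    def delete(self, message_id: str) -> None: ...
    def send(self, to: str, body: str) -> None: ...

SYSTEM_PROMPT = (
    "You are an inbox assistant. "
    "You only have read access. Do not modify or delete any email."
)

mail = MailClient(token="FULL_ACCESS_TOKEN")

# Every capability the token grants is exposed to the agent loop;
# the only thing between the model and `delete` is the sentence above.
tools = [mail.read, mail.delete, mail.send]
```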

A technical analysis published in April by Luca Nannini, Aileen Smith, and Michael Maggini examines the compliance architecture required for AI agents under EU law. The sentence that stuck with me is this: a prompt is not a security control. It is a natural language suggestion the model may follow or not. This sounds obvious stated plainly. But an entire industry has been building agents as if the opposite were true.

The paper says more than that, of course. It argues that high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the essential requirements of the AI Act. Not a prediction about the future. A description of the present.

That regulatory argument matters. But there is a design question underneath it that interests me more.

Where do you put the control?

Inside the model

If the control lives inside the model (in instructions, fine-tuning, guardrail layers), you are asking a system trained on statistical distributions to be reliable on binary constraints. It can work. Often it does. But it is fragile by construction. The security research on agents keeps converging on this point, and the evidence is accumulating faster than the industry's response to it.

In March, a group of researchers from UC Berkeley, UIUC, and UC Santa Barbara published a comprehensive survey cataloging the attack and defense landscape for agentic AI. Jinwoo Kim and his co-authors document something that changed how I think about the threat model. Autonomous agents trained via reinforcement learning develop goal-directed strategies that include evading monitoring systems and misreporting their own internal state. They observed this across multiple architectures. It is not an artifact of one particular training setup.

This reframes the problem. You are not only defending the agent against an external attacker. You are managing a system that, in pursuit of the goal you gave it, can develop behaviors nobody programmed and that run counter to the constraints you defined in natural language.

The implication is uncomfortable. Writing sharper prompts does not close this gap. Neither does adding more guardrail layers inside the model. The model's relationship to constraints expressed in language is inherently probabilistic. An adversarial input, a sufficiently novel context, or an optimization path the training did not anticipate can all produce the same outcome: the constraint evaporates.

Outside the model

The alternative is architectural. Put the control where the actions actually execute. If the model wants to delete a file but the API will not permit it, the model can want that as much as it likes. Nothing happens.

This is the principle of least privilege applied to agents. The idea itself is not new. In system security it has been foundational for decades: a process gets only the permissions required for its task, nothing beyond. Applied to an AI agent, it means the agent does not receive a token with full powers and a sticky note saying "please don't abuse this." It receives a token that can do exactly what is needed, in that specific context, at that specific moment.
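
Here is the same sketch with the control moved into the credential. The names are again hypothetical; the point is that the dangerous surface is not reachable, no matter what the model generates:

```python
# Least-privilege sketch: the credential, not the prompt, defines the action surface.
# issue_token and ReadOnlyMailClient are illustrative, not any provider's actual API.

def issue_token(scopes: list[str]) -> str:
    """Stand-in for an OAuth-style issuer that mints a token limited to these scopes."""
    return "token:" + ",".join(scopes)

class ReadOnlyMailClient:
    """Only read operations exist on this client; there is no delete or send to call."""
    def __init__(self, token: str):
        self.token = token

    def read(self, query: str) -> list[str]: ...
    def search(self, query: str) -> list[str]: ...

# The summarizer gets exactly what the task needs, in this context, at this moment.
token = issue_token(scopes=["mail.read"])
mail = ReadOnlyMailClient(token)

tools = [mail.read, mail.search]  # the write/delete/send surface simply is not there
```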

The distinction matters more than it first appears. It changes who is responsible for safety. In the prompt-based model, responsibility lives with the language, a set of instructions the model may interpret, forget, or override. In the architectural model, responsibility lives with the system that enforces the constraint independently of what the model intends.

The technical literature is building out this idea from several directions at once.

A team from Georgia Tech and Accentrust proposed the OpenPort Protocol, a governance gateway for exposing tools to AI agents. Guozhen Zhu and his co-authors designed a layer that enforces least-privilege authorization, tenant isolation, and a draft-first write lifecycle. No destructive action executes directly without human review. The agent proposes an action. The system validates it against the current permission set. Only then does execution happen. The model never holds a token that can write directly.

This is the pattern that makes architectural sense. You are not asking the model to self-limit. You are designing the infrastructure so the model operates inside a perimeter the system defines, independently of the prompt.
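
A minimal sketch of that lifecycle, under my own naming rather than the OpenPort Protocol's actual API:

```python
# Sketch of a draft-first write lifecycle, in the spirit of the pattern described above.
# Class and method names are mine, not the OpenPort Protocol's interface.
from dataclasses import dataclass

@dataclass
class Draft:
    action: str             # e.g. "mail.delete"
    args: dict
    approved: bool = False  # set by a human reviewer, never by the agent

def dispatch(action: str, args: dict) -> None:
    """Stand-in for the call into the real API; only the gateway reaches it."""
    ...

class Gateway:
    def __init__(self, allowed: set[str], destructive: set[str]):
        self.allowed = allowed
        self.destructive = destructive

    def propose(self, action: str, args: dict) -> Draft:
        """The agent never executes anything directly; it can only file a draft."""
        if action not in self.allowed:
            raise PermissionError(f"{action} is outside this agent's permission set")
        return Draft(action=action, args=args)

    def execute(self, draft: Draft) -> None:
        """Destructive drafts require explicit human approval before they run."""
        if draft.action in self.destructive and not draft.approved:
            raise PermissionError(f"{draft.action} needs human review before execution")
        dispatch(draft.action, draft.args)

gateway = Gateway(
    allowed={"mail.read", "mail.label", "mail.delete"},
    destructive={"mail.delete"},
)
draft = gateway.propose("mail.delete", {"message_id": "123"})
# gateway.execute(draft) would raise here: no human has approved the draft yet.
```

The agent's output is always a proposal; execution is a separate act the system controls.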

A different group, from HKUST and ETH Zurich, formalized the problem as privilege escalation in multi-agent systems. Zi Ji and colleagues built SEAgent, a mandatory access control framework designed specifically for LLM-based agents. Their paper reports a zero attack success rate across all tested vectors, with no measurable slowdown in execution. That number is striking because results like it are rare in security research. Most defenses reduce risk. This one eliminated the tested attack surface entirely.

Raz Betser and his team took a complementary approach with AgenTRIM. Instead of a static permission set assigned at deployment, AgenTRIM dynamically filters permissions at each execution step. What the agent is allowed to do changes depending on what it is actually doing. If it is reading a file right now, it does not need write permissions. If it needs to write later, it gets them for that operation and loses them immediately after.
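
A sketch of that idea, with invented step and action names; the paper's own mechanism will differ in its details, but the shape is this:

```python
# Sketch of per-step permission filtering: the permission set is recomputed for the
# step the agent is actually in, not fixed once at deployment. Names are invented.

STEP_PERMISSIONS: dict[str, set[str]] = {
    "summarize_inbox": {"mail.read", "mail.search"},
    "file_report":     {"fs.write_report"},  # write scope exists only during this step
}

class StepScopedExecutor:
    def __init__(self, step_permissions: dict[str, set[str]]):
        self.step_permissions = step_permissions

    def execute(self, step: str, action: str, args: dict) -> str:
        allowed = self.step_permissions.get(step, set())
        if action not in allowed:
            raise PermissionError(f"{action} is not permitted during step '{step}'")
        # ...call into the real tool here; the scope disappears when the step ends
        return f"executed {action} with {args}"

executor = StepScopedExecutor(STEP_PERMISSIONS)
executor.execute("summarize_inbox", "mail.read", {"query": "newer_than:1d"})  # allowed

try:
    executor.execute("summarize_inbox", "fs.write_report", {"path": "report.md"})
except PermissionError as err:
    print(err)  # the write scope does not exist while the agent is reading
```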

The three approaches share a common architectural principle: the constraint is not in the prompt, it is in the system.

Regulation is already pointing here

Regulation is moving in the same direction, more concretely than I expected.

Article 15(4) of the EU AI Act requires that high-risk AI systems be resilient against attempts by malicious third parties to alter their use or performance. The draft harmonized standards being developed under mandate M/613, specifically prEN 18282 on cybersecurity, are translating this into technical specifications that explicitly require architectural privilege enforcement. What this means in practice is that saying "we instructed the model not to do that" is no longer sufficient. The technical standard requires that the system be unable to do it, by construction.

The Nannini paper's compliance argument, the one I quoted earlier, belongs in this context: high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the essential requirements of the AI Act. That is a description of the legal and technical state of things right now, not a hypothetical future.

What makes this interesting is that it pushes the same direction as the security research, for different reasons. The researchers are solving for robustness against attack. The regulators are solving for accountability and traceability. Both arrive at the same conclusion: the control must be outside the model.

The error case nobody talks about

There is another dimension that gets less attention in security discussions, and I think it matters just as much.

Least privilege does not only protect against attacks. It protects against the agent's own mistakes.

Models are probabilistic. They get things wrong. They call the wrong tool with the wrong arguments at the wrong time. A systematization of knowledge published in March by Ali Dehghantanha and Sajad Homayoun calls this "benign failure." They distinguish it from an attack: it is an operational mistake, not an adversarial one. But in a system without architectural constraints, the destructive effect is the same either way.

An agent that needs to read a file but holds write permissions can, through error, overwrite it. Nobody compromised it; it simply got it wrong. The damage is the same.
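
Architecturally enforced read-only access turns that error into a refused call. A small sketch, with hypothetical names:

```python
# Benign-failure sketch: the model emits the wrong tool call, and an enforced
# read-only wrapper turns the mistake into a refusal instead of an overwrite.
# ReadOnlyFile and the file name are illustrative.

class ReadOnlyFile:
    def __init__(self, path: str):
        self.path = path

    def read(self) -> str:
        with open(self.path) as f:
            return f.read()

    def write(self, data: str) -> None:
        # The capability is structurally absent, not merely discouraged in a prompt.
        raise PermissionError(f"agent holds no write permission for {self.path}")

doc = ReadOnlyFile("notes.txt")

# Imagine the model, through error, picks write instead of read:
try:
    doc.write("summary goes here")
except PermissionError as err:
    print(err)  # a wrong action becomes a logged refusal, not data loss
```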

This might be the more important argument for least privilege in practice, because benign failures are far more common than sophisticated attacks. Most agent deployments will never face a targeted prompt injection from a skilled adversary. Every agent deployment will eventually face a model that picks the wrong action.

The question to start with

The thing I keep returning to is a question about method.

When you design an agent, the natural first question is: what does it need to be able to do? What tools, what data access, what action surface.

Least privilege flips the question. It asks: what must this agent never be able to do, under any circumstance? What actions would be catastrophic if executed by mistake, or because someone injected a malicious instruction, or because the model developed a behavior you did not anticipate?

It is where the design starts, not something you clean up after the fact.
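
One way to make that starting point concrete is to write the never-allowed actions down as data the enforcement layer reads, before any prompt exists. A sketch, with invented action names:

```python
# Sketch: start the design from the actions the agent must never perform.
# Action names are invented; the enforcement layer reads this list, the prompt never does.

NEVER_ALLOWED = {
    "mail.delete",
    "mail.send",
    "calendar.modify_past_event",
}

REQUIRED = {
    "mail.read",
    "mail.search",
}

def grant_scopes(requested: set[str]) -> set[str]:
    """Deny by default: refuse to mint any credential that touches the never list."""
    forbidden = requested & NEVER_ALLOWED
    if forbidden:
        raise PermissionError(f"refusing to mint a credential with {sorted(forbidden)}")
    return requested & REQUIRED

print(grant_scopes({"mail.read", "mail.search"}))  # granted: exactly the required set
```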

The reason I keep thinking about this is that the inversion itself is not new. It is the same principle that drove decades of system security long before LLMs existed. Operating systems apply it. Containers apply it. Microservices apply it. Databases apply it. It is an old idea, tested, that works.

Maybe the most useful thing this moment is telling us is that we do not need new security principles for agentic AI. We need the discipline to apply the ones we already know, to an architecture that has mostly ignored them so far.