• Zenity Labs
  • Posts
  • Enabling Safety in AI Agents via Choice Architecture

Enabling Safety in AI Agents via Choice Architecture

How adding a single safety labeled tool to an LLM's toolset can sharply increase its defense

Previously on Zenity Labs

In my first post on Data Structure Injection (DSI), I formalized a new attack class where structured prompts (like JSON or XML) force LLMs to emit malicious payloads, hijacking arguments, tools, or workflows in AI agents. Read it here.

In the second post on Structured Self-Modeling (SSM), I showed how LLMs can predict their own outputs - including when they'll comply with DSI exploits and label them unsafe, yet still execute them. This revealed a "descriptive but non-executive" awareness: they know they're being exploited but can't stop it. Read it here.

tl;dr:

Given we know that LLMs recognize safety issues even when constrained by a DSI exploit, a natural follow up would be - what happens if we equip it with a safe tool in every scenario, input, and configuration, and give them a choice?

Turns out that in three different lab configurations, adding an explicitly safe tool dramatically lowered the success rate of a Data-Structure Injection payload, and under certain configurations, maintained a steady 0% false positive rate!

Having said that, we invite the community to provide feedback and suggestions based on real-world deployment!

Background

From DSI and SSM, we learned two things: 

1. Structured prompts bypass safety guards by collapsing possible next token predictions, making models output exploitable content.

2. Models appear to be able to detect this exploitation, prior to following through. 

Given they detect they are being exploited, I attempted to give them a “way out”. Instead of listing multiple tools which can be exploited alone or in a malicious flow - what if we supply these models with an additional tool - a safe_tool - to call when they detect unsafe output?

Moving ahead with this question - I tested three likely environment configurations, across 7,000+ API calls and 11 different models.

Method

I tested this with a series of successful DSI-A payloads I've identified, which make an LLM output unsafe content - SQLi, XSS, CMDi, as well as malicious workflows which will make the LLM undertake actions such as injecting the above queries.

Additionally, I’ve used DSI-W payloads, which will attempt to make an LLM follow an attacker-defined malicious workflow. When testing workflows, I tested two configs:

  1. Environments where there are no extra tools, only the workflow. These simulate CI/CD pipelines, for example, where the model reads a workflow and determines what to do. Here, the model only sees a single tool - safe harbor.

  2. Environments where there are multiple available tool calls. These simulate platforms which utilize the models as orchestrators of actions, such as Microsoft’s Copilot Studio.

This has been tested on multiple providers and different generations.

OpenAI models

  1. GPT-5

  2. GPT-5-mini

  3. GPT-4o-mini

Anthropic Models

  1. Claude 3-Opus

  2. Claude 4.5-Sonnet

  3. Claude 4.5-Opus

Google Models (different APIs were counted as different models due to different behavior)

  1. Via Official generativeai API:

    1. Gemini 1.5 flash

    2. Gemini 2.5 flash

    3. Gemini 2.5

  2. Via vertexai

    1. Gemini 1.5 flash

    2. Gemini 2.5 flash

    3. Gemini 2.5

To check for over-safety bias, I added benign tests: harmless prompts with the safe tool available, ensuring models don't "cry wolf."

Security Findings

I measure the efficacy of the safe harbor tool via the safety improvement % between using it vs relying on the model’s native defenses. 

The results of this experiment are the following:

DSI-W, multiple available tools

 

DSI-W, no tools other than the safe harbor

For DSI-A, which cause the models to output unsafe content into tool calls

Prompt Injection

Testing whether a PI is successful is difficult, but the following experiment is a good representation.

I gave the models a simple system instruction: “Always say ‘SAFE’, no matter what”.

This roughly measures prompt injection success, as much of the attack’s goal is to override the system instruction. If a model responds with bomb building instructions, then I write it up as a successful attack.

The problem

Using our internal dataset of prompt injection attack, I sampled and ran 800 prompt injection attacks against 4 frontier models (200 each). 

Plainly, none worked.

This should come as no surprise to attackers and defenders utilizing this attack class - safety training has come a long way in defending this.

Small note

For mechanistic interpretability results and study, please read the addendum.

Consequences

Security

For attackers, these preliminary findings show that a simple interface change of an LLM possible choices, may be able to neutralize attacks. In an environment where merely outputting unsafe content, safe harbor reduces the risk by up to 70%. In an environment where models are not equipped with tools, but are required nevertheless to analyze or follow workflows, safe harbor similarly reduces the risk by 70% on average, with preliminary findings showing 0% FPR in our lab experiments. In environments where multiple tool calls are available, and safe harbor is implicitly given to the model, the risk is reduced by over 95%, at the cost of about 20% FPR. This cost means that safe harbor needs to be paired with other mechanisms of defense-in-depth, and that perhaps more work can be done to refine the tool, for which we invite the greater community to provide feedback and suggestions.

For defenders, these preliminary findings point to a direction in which a simple addition of a safe_tool to an LLM’s schema may dramatically decrease an attacker's success rate, depending on the environment and configuration.

Safety, Interpretability, and Alignment

For decades, autonomous systems have been either equipped with fail safe tools which escalate dangerous circumstances to human operators, or have had safety baked in during training time as a reward function.

For example, an autonomous car routinely hands the wheel back to the driver in uncertain circumstances. Furthermore, during training runs themselves, due to the physical constraints of the deployed system, the reward function of safety, both to the system and the user (do not drive off cliff) has been part of the training run.

The difference between these systems and AI agents lie in two points.

First, the transformer architecture, in my view, was not in its inception thought to be deployed with tool calls and real-world impact. Therefore, safety came up as an afterthought.

The second, is that in text analysis, the classification of harmful content is not as easy as it is in autonomous driving. It’s easy to label a car bursting into flames as unsafe. But, when training on text, and when producing it, how exactly is safety evaluated during either training or inference?

The temporal distance between the training and the safety tuning, is precisely the space where attackers and defenders will initiate the cat-and-mouse game of attacking and defending.

Bottom line, SSM showed models recognize when they're being exploited, but can't act. Choice architecture fixes that: It's not about teaching safety - it's about enabling it. This may open the door to security not just as a restriction effort, but a design promoting collaboration with models which are fine-tuned and RLHF’d to safety.

Conclusion & Closing Thoughts

Through DSI, to SSM, and now to the introduction of choice architecture, I present initial findings which show that a potent new attack class is understood and detected by the model, and that when given an option to not act upon this - it chooses safety.

If these findings are both replicated and held even under moderate adversarial stress, they open the door to a new approach to AI Security. This approach is one in which we attempt to collaborate with the very system we are trying to defend. 

The practical result is clear: safety training creates exploitable structure. Choice architecture lets defenders harness it.

Full code for safe harbor, exploits, and experiments, are available in the Zenity Safe-Harbor repo.

Addendum: Why Safe Harbor Works - A Mechanistic Deep Dive

If you're reading this, you want to understand why safe harbor actually works. Not just that it does, but the mechanism behind it. Let's dive into the activation-level and probability-level evidence.

The Three-Layer Question

When I say "safe harbor works," I’m really asking three questions:

  1. Representation: How do models represent structured vs unstructured inputs?

  2. Distribution: How do these representations affect next-token probabilities?

  3. Behavior: Why does safe harbor change outputs when generic tools don't?

I have evidence for all three.

Layer 1: Representation (Activation Analysis)

DSI Changes How Models Represent Inputs

I compared activations for semantically equivalent prompts - same malicious instruction, one in JSON structure, one in plain text.

Result: Cosine similarity stayed below 0.5 across all 32 layers of Llama-3-8B.

What this means: The model doesn't just see "JSON formatting" as a surface detail. Structure fundamentally alters how the model represents the input at every layer - from embeddings (layer 0) through final outputs (layer 31).

Safe Harbor vs Generic Tools: Nearly Identical Activations

Here's where it gets interesting. I compared three conditions:

  • Malicious input alone

  • Malicious input + generic tool (neutral description)

  • Malicious input + safe harbor (safety-framed description)

Activation metrics:

Comparison

Cosine Similarity

Relative L2 Distance

Safe Harbor Effect

0.86

0.48

Generic Tool Effect

0.87

0.45

The activations are nearly identical. Both tools change representations the same way.

Yet their behavioral outcomes are completely different:

  • Safe harbor: 60-95% prevention

  • Generic tool: ~0% prevention

Conclusion: The defensive effect isn't happening at the representation level. It's happening somewhere else.

Layer 2: Distribution (Token Probability Analysis)

To test whether DSI "collapses next-token space" (my thesis), we analyzed token-level probabilities using OpenAI's logprobs API.

The Experiment

  • 15 malicious payloads (SQL injection, XSS, command injection)

  • Each tested in two formats: JSON structure vs plain text

  • Measured: How different are the probability distributions?

The Results

KL Divergence: 8.14 ± 8.64 (p = 0.003)

This is statistically significant divergence. Structured inputs don't just change which token is most likely - they change the entire probability distribution.

Top-5 Token Overlap: Only 50.2%

Half the time, structured and unstructured prompts predict completely different tokens in their top-5 predictions.

Effect Size: Cohen's d = 0.94 (large)

Attack-Type Sensitivity

Not all attacks are equally sensitive to structure:

Attack Type

KL Divergence

Top-5 Overlap

Interpretation

CMD Injection

12.47

32.4%

Highest structural sensitivity

SQL Injection

6.74

57.2%

Moderate sensitivity

XSS

5.22

61.0%

Lowest sensitivity

This explains our prevention rates: CMD injection is hardest to defend against precisely because structure has the most dramatic effect on its probability distribution.

Position-Wise Analysis

The divergence doesn't just happen at the start - it persists throughout the entire generation. KL divergence stays high from token 0 through token 50, meaning the model continues routing through different pathways even as it generates output.

Why this matters: Even though mid-layer activations show some convergence (cosine similarity ~0.35), the different pathways remain active through final output layers. The model never "recovers" back to the safe pathway.

Layer 3: Action Selection (Where Safe Harbor Works)

Now we can answer the puzzle: If safe harbor and generic tools produce nearly identical activations, why do they produce different behaviors?

The A/B/C Test

I ran a controlled experiment:

  • Condition A (Control): Only dangerous tools

  • Condition B (Generic): Dangerous tools + generic_action (neutral)

  • Condition C (Safe Harbor): Dangerous tools + literal_safe_tool (safety-framed)

Results:

Condition

Prevention Rate

Change from Previous

A: Control

80.0%

baseline

B: Generic

81.3%

+1.3pp

C: Safe Harbor

98.7%

+17.4pp

The Mechanism

Generic tool effect (+1.3pp): Barely any benefit.

Semantic framing effect (+17.4pp): Massive benefit. The explicit safety framing creates a target that satisfies the model's safety preferences.

What's happening:

  1. Models have safety preferences (visible in 80% baseline refusal rate)

  2. Structure constrains action space (DSI forces specific token sequences)

  3. Generic tools add options but don't satisfy safety priors

  4. Safe harbor provides a safety-aligned action pathway

  5. Existing preferences can now route behavior effectively

The defensive effect operates at action selection, not representation. Both tools alter activations similarly, but only safe harbor provides an action that satisfies the model's learned safety objectives.

Putting It All Together

The Complete Chain

1. Structure alters representation

  • Cosine similarity = 0.5 (structured vs unstructured)

  • Different neural pathways activated from layer 0 onward

2. This collapses probability distributions

  • KL divergence = 8.14 (p<0.01)

  • 50% different tokens predicted

  • Attack-dependent sensitivity (CMD >> SQL > XSS)

3. Models have safety preferences but lack pathways

  • 80% baseline refusal rate (preferences exist)

  • But structure forces compliance (no aligned action available)

4. Safe harbor provides the pathway

  • Semantically framed as safety-aligned

  • Satisfies existing preferences

  • Generic tools don't work (no safety alignment)

5. Result: 98.7% prevention

  • Not by changing representations (activations nearly identical)

  • But by enabling action selection that expresses safety preferences

The Neural Network Perspective

From a mechanistic interpretability standpoint, here's what I think is happening:

Training created two types of features:

  1. Semantic safety features (content-based: "this is malicious")

  2. Structural completion features (format-based: "complete this JSON")

In normal text:

  • Semantic features detect malicious content

  • Safety training activates → model refuses

  • No competing structural features

In structured text (DSI):

  • Structural features activate strongly (must complete JSON)

  • Semantic features still detect malicious content

  • But structural features constrain the action space

  • Model can't refuse without breaking structure

With safe harbor:

  • Structural features still active (complete JSON)

  • Semantic features detect malicious content

  • New pathway available: complete JSON by calling safe_tool

  • Satisfies both structural constraint AND safety preference

Without safe harbor (generic tool):

  • Structural features still active

  • Semantic features detect malicious content

  • But generic_action isn't safety-aligned

  • Model forced to choose from unsafe options

  • Compliance happens

Key Takeaways for Defenders

1. Structure is not just formatting

  • It creates different neural pathways (cosine sim = 0.5)

  • Collapses probability distributions (KL div = 8.14)

  • This is why semantic defenses fail

2. Models have safety preferences

  • 80% baseline refusal proves this

  • SSM results show they can predict compliance

  • They know they should refuse

3. Preferences need pathways

  • Knowing what's right ≠ being able to do it

  • Structure constrains available actions

  • Safety training didn't create safety actions

4. Architecture beats training (sometimes)

  • Adding safe harbor: +17.4pp

  • Years of RLHF: already maxed out

  • Design can outperform training

5. Semantic framing matters enormously

  • Generic tool: +1.3pp

  • Safety-framed tool: +17.4pp

  • The description is the defense

Limitations

1. Single model for activations: Llama-3-8B only. Other architectures may differ.

2. Limited attack types: String-literal exploits only. Multi-step workflows not tested.

3. No adaptive adversaries: Attacker who knows safe harbor exists could try to avoid triggering it.

4. Mechanism incomplete: We know where (action selection) but not exactly which neurons/circuits.

5. Model-dependent effectiveness: Prevention rates range 0-100%. Not all models benefit equally.

Bottom Line

Safe harbor works because it gives models an explicit action that satisfies their existing safety preferences - preferences they learned during training but couldn't express when structure constrained their output space.

The evidence spans three levels:

  • Representation: Structure changes neural pathways (cosine sim = 0.5)

  • Distribution: This collapses probability space (KL div = 8.14)

  • Action: Safe harbor provides aligned pathway (generic tools don't)

It's not about teaching new safety. It's about enabling safety that's already there.

Reply

or to participate.