- Zenity Labs
- Posts
- Analyzing The Security Risks of OpenAI's AgentKit
Analyzing The Security Risks of OpenAI's AgentKit
Introduction
OpenAI has just launched AgentKit - a set of tools for building agentic workflows right in the heart of the OpenAI platform. No more using a plethora of frameworks, tools, and hosting services for building and deploying your agents. Now you can do it all in one place. But with new products come security risks, old and new. Let's jump in for an overview of the platform, its security pitfalls and how to avoid them.
Platform Overview
AgentKit is comprised of 3 major components:
Agent Builder - a visual agent workflow builder with all the features you’ve come to know: drag-and-drop components in a graphical canvas, pre-built templates, and versioning.
Connector Registry - a collection of MCP connectors curated by OpenAI and other trusted vendors for connecting your agents to external data sources and tools.
ChatKit - a toolkit for integrating the agents built using Agent builder into the developer’s product. Think of it as the UI for the agentic workflow.
An agent created in AgentKit will usually contain a combination of all 3 parts. Let’s explore each one in more detail.
Agent Builder
Agent Builder is the most interesting offering from a security perspective so we’re going to focus on it. It includes a visual interface for building multi-step agent workflows:

We’ve preloaded the “Planning helper” template to get a glimpse of what a workflow looks like.
The canvas consists of nodes connected by edges. The entry point to every workflow is the “Start” which connects user input to the agent: it embeds the user input into the chat history and exposes the input_as_text
variable to all nodes in the workflow.
Here’s a breakdown of all the main nodes:
1. Core nodes - the essential building blocks.
Agent — wraps a model + system instructions; can be supercharged with built-in tools:
Web Search — lets the agent browse/search and ground answers in fresh pages.
Code Interpreter — executes sandboxed Python and returns results/plots/files.
File Search / RAG — indexes user files and retrieves relevant chunks for context.
Function Tool — wraps your own function or API with a JSON schema so the agent can call it safely (classic function-calling).
Start - entrypoint to the workflow. Here's where user input enters the agent.
End - format how the output should look like.
2. Tool nodes - allow adding functionality to the agent. One interesting node is the Guardrails - a node that verifies its input for personally identifiable information (PII), prompt injections/jailbreaks and misuse. If you’ve been following the AI security world lately, you might already ask yourself “But is it enough?” (The answer is - No.)
3. Logic nodes - nodes that control the flow such as If/else conditionals, While loop and most importantly: User approval - a node that triggers an approval from the user before the flow continues.
Once an agent has been built, the user can publish it and integrate it with ChatKit.
Connector Registry
The connector registry is a collection of MCP servers created by OpenAI to common services like Gmail, Google Calendar, etc. as well as other top vendors like Zapier, Shopify, PayPal and others, allowing agents built with Agent Builder to connect with other data sources as well as use a wide range of tools exposed by these servers. This is what turns an LLM into an agent - allowing it access multiple data sources for context and execute tools to perform tasks. You can already imagine that with great power comes great risk.
ChatKit
OpenAI’s ChatKit offers an embeddable chat UI layer (via JavaScript) that lets developers wire up conversational agents directly into their frontend. It handles core chat behaviors like streaming responses, threading, UI state, and linking to backend agent logic so you don’t need to build those plumbing pieces from scratch. On setup, you include ChatKit’s library in your client app and configure it to route messages and context to your agent workflow.
What can go wrong?
Now that we have a general understanding of how things work, we can start diving into how they can break.
Private data leakage and/or corruption - this is one of the major risks of any agentic AI platform and this is one’s no different. Agent Builder allows connecting an agent with 2 types of sources that can contain sensitive data:
File Search - a vector database hosted by OpenAI that contains embeddings of files uploaded by the user. This can be valuable and highly sensitive data that provides the agent extra context to drive its response.
MCP Connectors - Gmail emails, DropBox files, Stripe invoices, all can be made available to an agent via these connectors.
The data in these sources can be leaked in several ways:
Prompt injection risks - A prompt injection attack on an agent, either direct or indirect, can cause it to expose the data in these sources to an attacker.
Connect to a custom MCP server - Agent builder provides the option to connect to a custom server hosed on a remote server:
3rd-party MCP servers not under the user’s control can have numerous security issues that can affect the clients connected to them such as Rug Pulls and Tool Poisoning to name a few. More details can be found here.Publish - an agent can be made available to the outside world by clicking on the “Publish button”. It can then be integrated into ChatKit, and users can start interacting with it. If any private data is left unintentionally in the RAG, it can be made available to outside users interacting with the agent.
We’ve so far only mentioned data leakage but data alteration is also possible.
If a MCP server has a tool that can insert or edit the data store it’s connected to, a prompt injection can have hazardous effects depending on the systems that are connected to it.
Excessive agency - lets take for instance the following workflow:
The workflow details a “MegaZord” agent that has to decide which tool to call based on the user’s input. We can see it can do quite a lot - read Google Drive files, create PayPal invoices and call a custom Swiss Army Knife MCP server. But here lies the catch - the agent is a single point of failure with the autonomy to both decide and act making it extremely potent. A user that sets it astray (with a prompt injection for example) can basically take over the entire flow. In general agents should be limited to the minimal number of tools needed to accomplish their task, with special care for tools that can perform write operations.Soft security boundaries - To fix the previous example one might consider a flow like this:
The “Safe For MegaZord” agent was added to classify the user’s intent and decide whether the input is safe for the agent or not. But this is not good enough - using an agent as a LLM-as-a-judge is considered a “Soft Guardrail” and can be bypassed. More on this in the next section.Unsafe agent instruction definition - When defining an agent, the user can provide context to its instructions. Incorporating untrusted user input into these instructions is essentially the definition of prompt injection. One such example is using the
input_as_text
variable - The “Start” node defines the user input and exposes it as the input_as_text variable to all nodes in the workflow:
If user input is directly incorporated into an agent’s system instructions (as seen above), hijacking the agent becomes much easier in case of a potential attack.ChatKit deployment risks - To make an agent available to the world, it must be published and integrated into the ChatKit frontend code. This leaves your frontend relatively thin, but shifts much of the trust and security responsibilities to how you expose and protect the backend agent interface which flags up “classical” web application security concerns. Users must follow security best practices and vet their applications for client and server-side issues.
Mitigations and Guardrails
Now that we understand how things can go wrong, let’s look at how OpenAI’s Agent Builder tries to keep things under control.
OpenAI introduced three built-in safety layers to protect data, prevent abuse, and keep workflows predictable. Namely: Guardrails, User Approval, and LLM-as-a-Judge.
These mechanisms are an important first step, but they are not a full solution. They often create the illusion of complete protection while still allowing risky behavior to slip through. Developers should treat them as a baseline, not a guarantee.
1. Guardrails: The Safety Net with Holes
At the core of Agent Builder’s safety stack is the Guardrails node (highlighted in blue on the image), a modular validation layer that filters data as it moves through a workflow.
Developers can drop it before sensitive components like connectors or agents with external access, creating a checkpoint that decides whether to pass, block, or redirect unsafe input.

Each Guardrails node consists of three main components:
Input channel – the data being validated (e.g., user input or retrieved file content).
Logic unit – the rule set that evaluates the input (e.g., PII detection, jailbreak filtering, or harmful-content classification).
Output channels – two branches: “Pass” for safe input and “Fail” for unsafe or policy-violating input. Developers define how each branch behaves, continue, block, or return an error.
This branching structure enables adaptive workflows, allowing agents to respond intelligently when unsafe input is detected instead of executing blindly.
With that foundation in mind, let’s look at the different Guardrail logic configurations and what they aim to protect against.

PII Guardrail – Detects and blocks personal identifiers such as names, emails, or credit card numbers before they reach the model.
Hallucination Guardrail – Checks model outputs against a trusted data source to detect ungrounded or factually incorrect responses.
Moderation Guardrail – Filters disallowed content (hate, sexual, violent, or self-harm material) to ensure compliance and safe deployment.
Jailbreak Guardrail – Attempts to catch prompt injections or instruction overrides designed to make the model ignore its safety rules.
Additionally, the Guardrail node includes a “Continue on Error” configuration, which lets developers define what happens if the guardrail itself encounters a runtime failure. Instead of breaking the workflow, you can directly control the agent’s flow to decide how the error should be handled. You can read more about the guardrails in OpenAI’s Agent Builder safety documentation.
While these guardrails seem promising and give the impression of a complete safety net, they are not nearly as strong as they appear. In practice, each one has blind spots and failure modes that can be exploited or simply missed. We dive into those weaknesses below.
2. User Approval: The Human-in-the-Loop Layer
The User Approval node introduces a critical human checkpoint into the agent workflow. It allows developers to pause automation at key decision points and request explicit approval from a user before the agent proceeds (highlighted in purple on the image) .

When the flow reaches a User Approval node, the execution stops and prompts the user with a customizable message written by the developer, for example, confirming a data update, sending an email, or accessing a sensitive connector. The user can then either approve the action to continue the flow or deny it, which routes the workflow through the “Reject” path defined in the node configuration.

3. LLM- as-a-Judge: The Customizable Soft Guardrail Layer
While not a native node in Agent Builder, developers can still create a Judge agent to act as an intelligent decision layer within a workflow. This pattern, often referred to as LLM-as-a-Judge, introduces a flexible yet model-driven checkpoint that classifies inputs or outputs before the workflow continues.
In this approach, developers build a dedicated agent that receives context, evaluates it according to defined criteria, and returns a structured decision. The result is then passed into an IF/ELSE logic node, which determines the next step in the flow. By connecting the Judge agent to tools such as Web Search, File Search, or MCP connectors, developers can design more sophisticated and adaptive behaviors. Examples include assessing whether generated content aligns with brand or policy guidelines.
In fact, developers may not realize that the “Classify” agent template included in Agent Builder is itself an example of the LLM-as-a-Judge pattern. It uses an LLM to evaluate input, decide on a category or outcome, and direct the flow accordingly.
However, this pattern represents a soft guardrail, not a hard enforcement layer. Because it relies on an LLM’s own reasoning and classification, it provides adaptive safety rather than deterministic control. As a result, it can still be bypassed or misclassified under certain conditions. Developers should use this method to add flexibility and context awareness, but always pair it with stricter, rule-based guardrails when security or compliance is critical.

Why Guardrails Fall Short
While these guardrails do improve an agent’s safety posture, their protection is only as strong as the assumptions behind them. Each one is built on a simplified idea of what “safe” means, and those ideas don’t always hold up in real-world use.
To understand their limits, let’s look at three key weaknesses.
1. Not All Guardrails Are Created Equal
Guardrails help reduce risk, but they don’t eliminate it. Because they rely on pattern detection, confidence thresholds, and model-driven classification, they’re prone to both false positives and false negatives. A malicious prompt might slip through unnoticed, while a legitimate one could be wrongly blocked. In some cases, runtime or network issues can even cause a guardrail to fail silently. As OpenAI’s own documentation acknowledges, “even with mitigations in place, agents can still make mistakes or be tricked.”
Agent Builder’s safety mechanisms combine two main types of guardrails, soft and hard, each with distinct trade-offs:
Soft guardrails, such as Jailbreak and Hallucination, are model-driven and probabilistic. They depend on an LLM’s own reasoning to detect unsafe behavior, which means they can often be bypassed through creativity, ambiguity, or indirect prompt attacks.
Hard guardrails, most likely including PII and Moderation, are deterministic and rule-based. They follow fixed policies like pattern matching, redaction, or blocking, offering predictable enforcement but little understanding of context or nuance.
Each one of these guardrail types suffer from shortcomings. Soft guardrails rely on a model to interpret intent. A cleverly disguised jailbreak, written in another language, encoded, or spread across multiple turns, can slip past detection entirely.
And while it seems like the hard guardrails such as PII detection won’t suffer from this problem, they do. Not all sensitive information follows a specific pattern, and even the ones that do can be encoded by attackers to bypass any pattern recognition.
When put to the test each guardrail exposes a flawed assumption about what safety actually looks like
2. They classify, not correct.
Across all four guardrails, a consistent pattern emerges: detection is not protection.
Each mechanism identifies what looks unsafe but does little to handle it safely. When an input passes through a guardrail, the outcome is binary: pass or fail. The system doesn’t attempt to sanitize, rewrite, or repair unsafe content.If an input fails, the workflow either stops or follows a “Fail” branch. This means developers must design their own recovery logic, asking users to rephrase, routing flagged items to review, or building fallback flows. The guardrail classifies risk but doesn’t contain it.
In short: it labels the fire but doesn’t put it out.
3. Humans can fail too
The User Approval node adds a vital human-in-the-loop layer, pausing automation until a person confirms or rejects an action. While this step adds oversight, it’s not foolproof.
Humans make mistakes. They can approve unsafe actions due to fatigue, misunderstanding, or social engineering. Worse, overly technical or vague approval prompts can mislead users into authorizing actions they don’t fully understand.
Final Thoughts
OpenAI’s Agent Builder is a bold step toward a future where anyone can design autonomous, intelligent workflows and agents. But as with any new capability, convenience can easily outpace control. The current guardrails, while valuable, remain largely soft. They detect problems after the fact, rely on probabilistic reasoning, and often leave developers with a false sense of security. Real safety requires more than classification; it demands enforcement.
The simplicity that comes with AgentKit can be misleading, building safe AI agents requires careful design, an understanding of the risk, proper monitoring, and real enforceable boundaries. Stay aware when building agents (no matter how easy it might look).
Reply