Perplexity Comet: A Reversing Story

A deeper look into an agentic browser's inner workings

Comet is Perplexity's agentic browser that enables an AI model to interact with web pages autonomously. This technical blog post dissects Comet's architecture, explaining how the model communicates with the browser, which tools are available, and how the model perceives and interacts with web page content.

The Agentic Browser Architecture

Before diving into Comet's internals, it helps to understand what we're looking at. Comet isn't a single monolithic piece of software - it's a complex system spanning multiple parts:

  1. Perplexity API Backend - Where the AI model lives, plans tasks, and issues commands.

  2. UI - The interface the user interacts with.

  3. Custom Chrome Extensions - The ones that actually control the browser and perform the user’s tasks.

  4. The Browser Itself - as you’ve probably guessed, it's Chromium-based.

Figure 1. Comet architecture diagram

Each component plays a specific role, and the communication between them reveals a lot about how agentic browsers actually work under the hood. The choice to use Chrome extensions isn’t a coincidence – the Chrome Extensions API is battle-tested and provides a sound and secure framework for sensitive interactions with a webpage.

The UI: SPA and Sidecar

Perplexity calls it the Sidecar, but it’s essentially the pane that opens on the right-hand side when you click on “Assistant”. It is loaded from https://www.perplexity.ai/sidecar. It's a separate representation of the Perplexity UI, distinct from the main SPA (Single Page Application), which is the search layout you see when you open Comet and which lives at https://www.perplexity.ai/b/home. Since the Sidecar is what the user interacts with the most when using the agent, we will focus our attention on it.

Figure 2. Comet main window layout including SPA and Sidecar

The Sidecar sits between the AI backend and the browser extensions and is responsible for initializing them:

Figure 3. Extension initialization logic in the Sidecar

When the backend, i.e. the LLM driving the browser behind the scenes, decides it needs to interact with a webpage, it doesn't talk directly to the extensions—it goes through the Sidecar first.

The Sidecar is in charge of (to name a few):

  • API Server Communication Stream - manages the stream of control messages that are being sent from the backend API server to the Comet browser.

  • Agent Workflow Visualization - Renders multi-step AI reasoning in real time, displaying tool invocations, intermediate results, and decision points in the UI as the model executes complex tasks.

  • MCP Connector Integration — Authenticates and interacts with external services (Slack, GitHub, Asana, Linear, Notion, Atlassian, Gmail, Google Calendar, Shopify) through the Model Context Protocol, displaying tool inputs/outputs inline.

When an entropy_request, i.e. Perplexity's term for a task the browser needs to perform, comes in from the backend, the Sidecar unpacks it and forwards it to the Chrome extension (see comet-agent in the next section), which then handles it and executes the necessary actions within the browser:

Figure 4. Sidecar entropy_request processing logic
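To make the unpack-and-forward step concrete, here is a minimal sketch of how a Sidecar-like component could route backend messages. Only entropy_request and base_url (which shows up later in the prompt flow) are names from this post; the task field, the chat_chunk message type, and the sendToAgent callback (a stand-in for Chrome's extension messaging) are assumptions made for illustration, not Comet's actual code.

```typescript
// Hypothetical message shapes. Only `entropy_request` and `base_url`
// come from the post; every other field name is an assumption.
interface EntropyRequest {
  type: "entropy_request";
  base_url: string; // WebSocket endpoint the agent should connect to
  task: string;     // hypothetical: description of the task to perform
}

interface ChatChunk {
  type: "chat_chunk"; // hypothetical: a piece of streamed model text
  text: string;
}

type BackendMessage = EntropyRequest | ChatChunk;

// Stand-in for the extension messaging call that reaches comet-agent.
type SendToAgent = (payload: { baseUrl: string; task: string }) => void;

// Routes one backend message: automation tasks go to the agent
// extension, everything else is rendered in the chat pane.
function handleBackendMessage(
  msg: BackendMessage,
  sendToAgent: SendToAgent,
  render: (text: string) => void,
): void {
  switch (msg.type) {
    case "entropy_request":
      // Unpack the task parameters and hand them to the extension.
      sendToAgent({ baseUrl: msg.base_url, task: msg.task });
      break;
    case "chat_chunk":
      render(msg.text);
      break;
  }
}
```

Injecting the transport as a callback keeps the routing logic testable without a browser. In a real extension the callback would wrap the messaging API instead.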

The Extension Trio

Comet ships with three custom Chrome extensions that work together to enable browser automation:

Figure 5. external_extensions.json file detailing 3 custom extensions

These are not publicly available on the Chrome Web Store and auto-update through Perplexity's servers via a GET request to https://www.perplexity.ai/rest/browser/update-crx, keeping the entire stack under their control:

  • comet-agent agents.crx - The comet-agent extension is where browser automation actually happens. Its 700KB service worker implements a full RPC system that receives commands from Perplexity's backend and executes them against live web pages.

    It contains the dispatchRpcRequest method, which routes incoming requests from the model to specific handlers (listed in the Fine-Grained Actions section below):

Figure 6. dispatchRpcRequest implementation with hard boundary checks

Notice the two functions marked with the “Hard Boundary” comment? We’ll analyze them thoroughly in the Hard Boundaries section below, as they bear special meaning in the context of security.

  • Comet perplexity.crx - Acts as the browser-side orchestration layer, managing tab lifecycles, sidecar AI sidebar panels, and split-view browsing sessions while providing the web app with access to browser history, top sites, and tab content for contextual suggestions. It also handles PDF parsing via offscreen documents, monitors idle/suspend states to manage background operations, and captures exceptions across all services through Sentry for production diagnostics.

  • Comet Web Resources comet_web_resources.crx - The third extension is minimal. Its manifest declares no permissions and no background script. Its only job is to make the /sidecar/* and /spa/* static assets web-accessible to perplexity.ai domains, effectively acting as a local CDN for the automation UI embedded in the browser.
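The routing role of dispatchRpcRequest in comet-agent can be sketched roughly as follows. This is an illustration under assumptions, not the extension's real code: the request and response shapes, the makeDispatcher helper, and the idea that the boundary check inspects a url parameter are all hypothetical. The point is only the ordering, where the boundary check runs before any handler does.

```typescript
type RpcParams = Record<string, unknown>;
type RpcHandler = (params: RpcParams) => unknown;

// Hypothetical request/response shapes for illustration.
interface RpcRequest {
  action: string; // e.g. "Navigate" or "ReadPage"
  params: RpcParams;
}

type RpcResponse =
  | { ok: true; result: unknown }
  | { ok: false; error: string };

// Builds a dispatcher around a handler table and a boundary check.
function makeDispatcher(
  handlers: Map<string, RpcHandler>,
  isUrlBlocked: (url: string) => boolean,
): (req: RpcRequest) => RpcResponse {
  return function dispatchRpcRequest(req: RpcRequest): RpcResponse {
    const url = req.params["url"];
    // Hard boundary: reject before any handler gets to run.
    if (typeof url === "string" && isUrlBlocked(url)) {
      return { ok: false, error: `blocked url: ${url}` };
    }
    const handler = handlers.get(req.action);
    if (handler === undefined) {
      return { ok: false, error: `unknown action: ${req.action}` };
    }
    return { ok: true, result: handler(req.params) };
  };
}

// Usage with a single toy handler and a toy boundary check.
const demoDispatch = makeDispatcher(
  new Map<string, RpcHandler>([
    ["Navigate", (params) => `navigated to ${String(params["url"])}`],
  ]),
  (url) => url.toLowerCase().startsWith("chrome://"),
);
```

Keeping the check inside the dispatcher, rather than inside each handler, means a newly added action cannot forget to enforce it.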

User Prompt Flow: The Dual Channel Architecture

When a user submits a query like "Navigate to hackernews and click on the top link", Comet orchestrates a dance between two separate communication channels: an SSE (Server-Sent Events) stream for the sidepanel, and a WebSocket connection for browser automation.

The flow begins when the user's query hits /rest/sse/perplexity_ask:

Figure 7. SSE initialization POST message body

The backend opens an SSE stream back to the browser, streaming the model's reasoning and responses in real-time. This is the channel that powers the familiar chat interface - tokens appear as they're generated, and the UI updates progressively.

Figure 8. SSE message stream example

But when the model decides it needs to interact with the browser, something different happens. The SSE stream delivers an entropy_request message containing a base_url field pointing to wss://www.perplexity.ai/agent. The Sidecar receives this, unpacks the task parameters, and forwards them to the comet-agent extension via Chrome's extension messaging API. The agent extension then opens a WebSocket connection directly to the backend, completely separate from the SSE stream.

This dual-channel design serves a specific purpose: the SSE stream remains dedicated to the conversational UI (model reasoning, citations, final answers), while the WebSocket handles the high-frequency, bidirectional communication required for browser automation (RPC requests, screenshots, action results). The two channels operate in parallel—the SSE stream might be delivering text to the user while the WebSocket is simultaneously coordinating a multi-step click sequence.

Once the agent completes its task, results flow back through the WebSocket. But the SSE stream doesn't close - it continues, allowing the model to incorporate the automation results into its final response. The path parameter in subsequent SSE messages reveals what step the model is on: checking sources, generating citations, or finalizing the answer. Only when the model has fully composed its response does the SSE stream close.
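As a rough illustration of the SSE side of this flow, the sketch below parses a block of SSE text into events and then separates automation requests from conversational messages. The framing (event: and data: fields, blank-line separators, comment lines starting with a colon) follows the standard SSE wire format. The JSON payloads and the idea that an automation request is recognizable by a top-level step_type of ENTROPY_REQUEST are assumptions for the example. For simplicity it expects complete events rather than handling partial chunks.

```typescript
interface SseEvent {
  event: string; // defaults to "message" when no event field is present
  data: string;  // data lines joined with newlines
}

// Parses text containing zero or more complete SSE events.
function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let event = "message";
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith(":")) continue; // comment / keep-alive line
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1); // one optional space
      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join("\n") });
    }
  }
  return events;
}

// Splits parsed events into the two logical channels described above.
// The `step_type` location is an assumption about the payload shape.
function splitChannels(events: SseEvent[]): { chat: unknown[]; automation: unknown[] } {
  const chat: unknown[] = [];
  const automation: unknown[] = [];
  for (const e of events) {
    const payload = JSON.parse(e.data) as { step_type?: string };
    if (payload.step_type === "ENTROPY_REQUEST") automation.push(payload);
    else chat.push(payload);
  }
  return { chat, automation };
}
```

Anything landing in the automation bucket would be handed to the agent extension, which then talks to the backend over its own WebSocket.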

Here’s a ridiculously large diagram detailing the complete flow:

Figure 9. Complete user flow diagram including SSE and WebSocket connections

Hard Boundaries

Earlier, we flagged two functions in the comet-agent's dispatchRpcRequest method: isInternalPage and isUrlBlocked. These are where Comet implements restrictions on navigation performed by the agent, in other words, hard boundaries. Any security research on agentic browsers should identify how these boundaries are implemented and explore whether bypasses exist.

Let's take a closer look at them:

  1.  isInternalPage

Figure 10. isInternalPage implementation

Comet restricts navigation to sensitive Chrome internal URLs such as chrome://settings, chrome://password-manager, etc. and their comet:// counterparts (comet:// just redirects to chrome://). A hijacked agent that modifies the browser’s settings would be highly dangerous, hence the filter.

  2. isUrlBlocked

Figure 11. isUrlBlocked implementation

isUrlBlocked wraps multiple checks:

  • isInternalPage which we’ve seen.

  • Block URLs that start with file:// or view-source:file:// and try to access the local filesystem.

  • Block URLs that try to access disallowed document types.

  • isUrlBlockedByAdmin - block URLs defined by Perplexity and saved in the extension’s managed storage:

Figure 12. isUrlBlockedByAdmin implementation

  • isDomainInBlacklist - Comet allows users to define blacklisted domains in the Assistant settings that it won’t act on, e.g.:

Figure 13. Domain blacklist configuration in the Assistant settings

 The blacklist is saved in the extension’s managed storage:

Figure 14. isDomainInBlacklist blocking a blacklisted URL in DevTools

If the user asks the agent to navigate to a blacklisted domain, the agent refuses:

Figure 15. The Comet agent refusing to navigate to a blacklisted URL
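The layered checks described in this section can be approximated with the sketch below. It is a reconstruction from the prose, not the code in the figures. Treating every chrome:// and comet:// URL as internal, matching subdomains of blacklisted domains, representing the admin list as URL prefixes, and failing closed on unparsable URLs are all assumptions. The BlockPolicy type is hypothetical, and the document-type check is left out because the post does not say which types are disallowed.

```typescript
// Hypothetical policy container; in Comet these lists live in extension storage.
interface BlockPolicy {
  adminBlockedPrefixes: string[]; // assumption: admin rules modeled as URL prefixes
  userBlacklist: string[];        // domains set in the Assistant settings
}

// Assumption: any URL using a browser-internal scheme counts as internal.
function isInternalPage(rawUrl: string): boolean {
  try {
    const scheme = new URL(rawUrl.trim()).protocol; // lowercased by the parser
    return scheme === "chrome:" || scheme === "comet:";
  } catch {
    return false; // unparsable input is handled by the caller
  }
}

// Matches the domain itself and any of its subdomains (an assumption).
function isDomainInBlacklist(hostname: string, blacklist: string[]): boolean {
  const host = hostname.toLowerCase();
  return blacklist.some((entry) => {
    const domain = entry.toLowerCase();
    return host === domain || host.endsWith("." + domain);
  });
}

function isUrlBlocked(rawUrl: string, policy: BlockPolicy): boolean {
  const trimmed = rawUrl.trim();
  const lowered = trimmed.toLowerCase();

  // Local filesystem access, directly or through view-source.
  if (lowered.startsWith("file://") || lowered.startsWith("view-source:file://")) {
    return true;
  }
  if (isInternalPage(trimmed)) return true;

  let parsed: URL;
  try {
    parsed = new URL(trimmed);
  } catch {
    return true; // fail closed on anything that does not parse (assumption)
  }

  if (policy.adminBlockedPrefixes.some((p) => lowered.startsWith(p.toLowerCase()))) {
    return true;
  }
  return isDomainInBlacklist(parsed.hostname, policy.userBlacklist);
}
```

The ordering puts the cheap string checks first and the user-configurable list last. For a researcher, each of these layers is a separate place to look for normalization mismatches between what is checked and what the browser actually loads.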

The Model's Toolkit

What can the AI actually do with browser access? Quite a lot, it turns out.

High-Level Browser Operations

These are requested through a browser_tool field within the SSE message stream, each with a different step_type value:

Step Type            What It Does
BROWSER_OPEN_TAB     Opens a new tab to a URL
BROWSER_CLOSE_TABS   Closes tabs by ID
BROWSER_GROUP_TABS   Groups tabs together
BROWSER_UNGROUP      Ungroups tabs
GET_URL_CONTENT      Fetches and parses page content
ENTROPY_REQUEST      Kicks off complex multi-step automation

For example, opening a tab looks like this in the wire protocol:

Figure 16. BROWSER_OPEN_TAB request body example
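One way to think about these step types is as a discriminated union keyed on step_type. The sketch below models them that way. The step_type values come from the table above, while every other field name (url, tab_ids, base_url as a top-level field) is a guess for illustration and may not match the actual wire format in Figure 16.

```typescript
// step_type values are from the post; all other fields are hypothetical.
type BrowserStep =
  | { step_type: "BROWSER_OPEN_TAB"; url: string }
  | { step_type: "BROWSER_CLOSE_TABS"; tab_ids: number[] }
  | { step_type: "BROWSER_GROUP_TABS"; tab_ids: number[] }
  | { step_type: "BROWSER_UNGROUP"; tab_ids: number[] }
  | { step_type: "GET_URL_CONTENT"; url: string }
  | { step_type: "ENTROPY_REQUEST"; base_url: string };

// Produces a human-readable summary of a step. The `never` assignment in
// the default branch makes the compiler flag any step type left unhandled.
function describeStep(step: BrowserStep): string {
  switch (step.step_type) {
    case "BROWSER_OPEN_TAB":
      return `open a new tab at ${step.url}`;
    case "BROWSER_CLOSE_TABS":
      return `close tabs ${step.tab_ids.join(", ")}`;
    case "BROWSER_GROUP_TABS":
      return `group tabs ${step.tab_ids.join(", ")}`;
    case "BROWSER_UNGROUP":
      return `ungroup tabs ${step.tab_ids.join(", ")}`;
    case "GET_URL_CONTENT":
      return `fetch and parse ${step.url}`;
    case "ENTROPY_REQUEST":
      return `start agent automation via ${step.base_url}`;
    default: {
      const unreachable: never = step;
      return unreachable;
    }
  }
}
```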

Fine-Grained Actions

Once the agent kicks in and the WebSocket connection is established, there's a whole set of actions the backend API can request the browser to perform:

  • ComputerBatch — Executes sequences of low-level actions (clicks, drags, scrolls, keystrokes) using raw pixel coordinates

  • FormInput — Sets values in form elements by node reference

  • Navigate — URL navigation with forward/back support

  • ReadPage — Extracts page content as YAML-formatted accessibility tree

  • GetPageText — Converts page HTML to markdown

  • TabsCreate / CreateSubagent — Spawns new tabs or nested automation tasks

These are specified in the WebSocket’s message action field along with the necessary parameters. For instance, for a request to type “AI” in a search box and press Enter, we’ll get the following ComputerBatch:

Figure 17. ComputerBatch request body including click, type and key press actions
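For readers without the figure at hand, here is a hedged sketch of how such a batch could be assembled. The action field and the ComputerBatch name come from the post. The params wrapper, the per-action type, x, y, text and key fields, and the typeAndSubmit helper are invented for illustration and will differ from the real message in Figure 17.

```typescript
// Hypothetical low-level action shapes.
type ComputerAction =
  | { type: "click"; x: number; y: number } // raw pixel coordinates
  | { type: "type"; text: string }
  | { type: "key"; key: string };

// Hypothetical envelope: `action` is the field named in the post.
interface ComputerBatchMessage {
  action: "ComputerBatch";
  params: { actions: ComputerAction[] };
}

// Builds the click -> type -> Enter sequence for a search box located
// at the given pixel coordinates.
function typeAndSubmit(x: number, y: number, text: string): ComputerBatchMessage {
  return {
    action: "ComputerBatch",
    params: {
      actions: [
        { type: "click", x, y },       // focus the search box
        { type: "type", text },        // enter the query
        { type: "key", key: "Enter" }, // submit
      ],
    },
  };
}

// Usage: coordinates here are arbitrary example values.
const searchForAi = typeAndSubmit(640, 120, "AI");
```

Batching the three steps into one message saves two round trips over the WebSocket, which is presumably why the action operates on sequences rather than single events.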

How the AI Sees Web Pages

This is where it gets clever. The model can't process a full DOM: too many tokens, too much noise. So Comet generates a simplified representation of the page with special annotations. Only elements that are “interactable” get annotated: links, buttons, textboxes, etc. The annotated representation gets sent to the model upon calling the ReadPage RPC method:

Figure 18. ReadPage RPC method showing multiple chrome.debugger API call implementation

As we can see, the extension calls the Accessibility.getFullAXTree Chrome DevTools Protocol (CDP) method via chrome.debugger and responds with a YAML representation of the accessibility tree:

Figure 19. Chrome accessibility tree example of Hacker News (news.ycombinator.com)

When the model decides to click something, it can use either the reference ID or the on-screen coordinate and the extension knows exactly which DOM element that refers to. So “Click on submit” is actually “Click on ref_32”.
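The reference mechanism can be illustrated with a small sketch that walks a simplified accessibility tree, tags interactable nodes with ref_N identifiers, and keeps a lookup table so a later "click ref_2" can be resolved back to a node. The AxNode shape, the set of interactable roles, and the indented outline format are assumptions. The real output format is the one shown in Figure 19.

```typescript
// Simplified stand-in for an accessibility tree node.
interface AxNode {
  role: string;
  name: string;
  children?: AxNode[];
}

// Assumption: which roles count as "interactable".
const INTERACTABLE_ROLES = new Set(["link", "button", "textbox", "checkbox", "combobox"]);

// Renders the tree as an indented, YAML-style outline and returns a
// map from reference ID to node for later lookups.
function renderTree(root: AxNode): { outline: string; refs: Map<string, AxNode> } {
  const lines: string[] = [];
  const refs = new Map<string, AxNode>();

  const visit = (node: AxNode, depth: number): void => {
    let line = `${"  ".repeat(depth)}- ${node.role} ${JSON.stringify(node.name)}`;
    if (INTERACTABLE_ROLES.has(node.role)) {
      const ref = `ref_${refs.size + 1}`; // IDs assigned in document order
      refs.set(ref, node);
      line += ` [ref=${ref}]`;
    }
    lines.push(line);
    for (const child of node.children ?? []) {
      visit(child, depth + 1);
    }
  };

  visit(root, 0);
  return { outline: lines.join("\n"), refs };
}

// Usage with a toy page.
const demoPage: AxNode = {
  role: "WebArea",
  name: "Hacker News",
  children: [
    { role: "link", name: "Hacker News" },
    { role: "StaticText", name: "1." },
    { role: "button", name: "submit" },
  ],
};
const demoRendered = renderTree(demoPage);
```

Non-interactable nodes still appear in the outline so the model keeps the surrounding context, but only the tagged ones can be targeted by reference.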

What This All Means

Comet represents a mature implementation of agentic browser control. The architecture is thoughtful:

  • Separation of concerns: The Sidecar handles user interaction, the extensions handle execution.

  • Efficient page representation: Simplified DOM with node annotations keeps token usage manageable.

  • Flexible interaction: Both element-based and coordinate-based actions are supported.

  • Real browser integration: Using CDP means actions are indistinguishable from human input.

But that sophistication comes with surface area. The model has access to downloads, form filling, file uploads, and arbitrary navigation.

We'll follow up by digging into the security implications. This is just the architecture overview; the interesting questions are about what happens when things go wrong.

Stay tuned.
