Accessibility Tree vs Screenshots: Two Approaches to Browser AI
Every AI browser agent must answer a fundamental question: how does the AI "see" the web page? The answer determines the agent's speed, cost, accuracy, and reliability. There are two dominant approaches in 2026: reading the accessibility tree (a structured text representation of the page) and analyzing screenshots (sending a visual image to a vision model). Prophet uses the accessibility tree. Anthropic's Claude computer use and most screenshot-based agents use the visual approach. This article provides a technical comparison of both methods so you can understand the tradeoffs.
What Is the Accessibility Tree?
Every modern web browser maintains an accessibility tree alongside the visual render tree. The accessibility tree is a hierarchical representation of the page's interactive and semantic elements, originally created for screen readers and other assistive technologies. It contains:
- Element roles: button, link, textbox, checkbox, heading, list, table, etc.
- Names and labels: The text label associated with each element (button text, input placeholder, ARIA label)
- States: Whether an element is enabled, disabled, checked, expanded, selected, or focused
- Values: Current values of form inputs, selected options in dropdowns, progress bar values
- Hierarchy: Parent-child relationships between elements (a list contains list items, a form contains inputs)
- Properties: Additional attributes like URL targets for links, required/optional for inputs, multiline for text areas
The tree does not contain visual information: colors, positions, sizes, fonts, images, or layout. It is a pure semantic and interactive representation of the page.
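To make this concrete, here is a minimal sketch of how such a tree might be modeled and flattened into the compact text an LLM actually receives. The AXNode shape and the serialization format are illustrative assumptions, not any specific browser's API:

```typescript
// Illustrative accessibility-tree node: roles, names, states, values,
// and hierarchy -- but no colors, positions, or layout.
interface AXNode {
  role: string;          // e.g. "button", "textbox", "heading"
  name?: string;         // accessible label, if any
  value?: string;        // current form value, if any
  checked?: boolean;     // state for checkboxes and radio buttons
  children?: AXNode[];   // parent-child hierarchy
}

// Flatten the tree into indented text, one element per line.
function serialize(node: AXNode, depth = 0): string {
  const parts = [node.role];
  if (node.name) parts.push(`"${node.name}"`);
  if (node.value !== undefined) parts.push(`value=${node.value}`);
  if (node.checked !== undefined) parts.push(`checked=${node.checked}`);
  const line = "  ".repeat(depth) + parts.join(" ");
  const kids = (node.children ?? []).map((c) => serialize(c, depth + 1));
  return [line, ...kids].join("\n");
}

const form: AXNode = {
  role: "form",
  name: "Sign up",
  children: [
    { role: "textbox", name: "Email", value: "a@b.com" },
    { role: "checkbox", name: "Subscribe", checked: true },
    { role: "button", name: "Submit" },
  ],
};
```

Serializing the example form yields four short lines of text, which is why a whole page fits in a few thousand tokens.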
What Is the Screenshot Approach?
The screenshot approach captures a rendered image of the visible portion of the web page and sends it to a vision-capable language model (such as GPT-4o or Claude). The model processes the image, identifies UI elements, reads text via OCR, and determines the coordinates of elements to interact with.
Some implementations enhance the screenshot by overlaying element IDs or bounding boxes on interactive elements before sending the image to the model. This hybrid approach helps the model identify clickable regions more reliably but still relies on visual processing as the primary perception method.
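The ID-overlay step described above can be sketched as a pure labeling function: given bounding boxes of interactive elements, assign each a numeric label the model can reference ("click [3]"). The Box shape and the reading-order heuristic are illustrative assumptions:

```typescript
// Bounding box of one interactive element detected on the page.
interface Box { x: number; y: number; w: number; h: number; role: string }

// Assign numeric IDs in reading order (top-to-bottom, left-to-right)
// before drawing them onto the screenshot.
function labelBoxes(boxes: Box[]): { id: number; box: Box }[] {
  const sorted = [...boxes].sort((a, b) => a.y - b.y || a.x - b.x);
  return sorted.map((box, i) => ({ id: i + 1, box }));
}

const boxes: Box[] = [
  { x: 200, y: 10, w: 80, h: 30, role: "button" },
  { x: 10, y: 10, w: 120, h: 30, role: "textbox" },
  { x: 10, y: 60, w: 80, h: 30, role: "link" },
];
const labeled = labelBoxes(boxes);
```

The model then responds with a label rather than raw pixel coordinates, which is what makes the hybrid overlay approach more reliable than free-form coordinate prediction.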
Speed Comparison
Accessibility Tree
Extracting the accessibility tree from the browser takes 50-200 milliseconds depending on page complexity. The result is a text string typically 2,000-10,000 tokens long. Sending this as part of a text-only API call to Claude means the perception step adds negligible latency beyond the normal API call time.
For a typical page, the full cycle (extract tree, send to API, receive response) completes in 1-3 seconds, dominated by API response time rather than perception time.
Screenshots
Capturing a screenshot is fast (under 100ms), but processing it through a vision model is slow. Vision API calls typically take 3-8 seconds because the model must process the image pixels, perform OCR, identify UI elements, and then reason about the content. This is 2-4x slower than a text-only call of equivalent complexity.
For a typical page, the full cycle (capture screenshot, send to vision API, receive response) takes 4-10 seconds. For multi-step tasks requiring 10-20 perception cycles, this adds up to 40-200 seconds of cumulative latency versus 10-60 seconds with the accessibility tree approach.
Speed Verdict
The accessibility tree approach is 2-4x faster per step. For single-step tasks (asking a question about a page), the difference is noticeable but not critical. For multi-step automation tasks, the cumulative speed difference is significant: a 15-step task might take 30 seconds with the accessibility tree versus 90 seconds with screenshots.
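The cumulative difference is simple arithmetic. Using midpoint per-step times from the ranges above (illustrative figures, not benchmarks):

```typescript
// Back-of-the-envelope latency model: total task time is just
// steps multiplied by per-step perception-plus-API time.
function taskSeconds(steps: number, perStepSeconds: number): number {
  return steps * perStepSeconds;
}

taskSeconds(15, 2); // accessibility tree at ~2 s/step → 30
taskSeconds(15, 6); // screenshots at ~6 s/step → 90
```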
Cost Comparison
Accessibility Tree
The tree is sent as text tokens. A typical page's accessibility tree is 3,000-8,000 tokens. At Claude Sonnet's input rate of $3/MTok, this costs $0.009-0.024 per perception step. Output tokens (the model's action decision) add another $0.005-0.015. Total per step: approximately $0.015-0.04.
Screenshots
Image tokens are more expensive. A typical screenshot encoded for a vision model consumes the equivalent of 1,000-2,000 tokens at image pricing rates, but the actual cost varies by provider. With additional text context and output tokens, each vision-based perception step costs approximately $0.03-0.08.
Cost Verdict
The accessibility tree approach costs roughly half as much per perception step. Over a 15-step task, this means $0.30-0.60 with the accessibility tree versus $0.50-1.20 with screenshots. For users on credit-based pricing like Prophet, this difference directly affects how many tasks they can complete per dollar.
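The arithmetic behind these per-step figures is straightforward; the sketch below uses the $3/MTok input rate cited above (rates and token counts are illustrative and vary by provider and model):

```typescript
// Input cost for one perception step: tokens times the per-million-token rate.
function inputCostUSD(tokens: number, usdPerMTok: number): number {
  return (tokens / 1_000_000) * usdPerMTok;
}

inputCostUSD(3_000, 3); // ≈ $0.009 (small page)
inputCostUSD(8_000, 3); // ≈ $0.024 (large page)
```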
Accuracy Comparison
Element Identification
The accessibility tree identifies elements deterministically. A button labeled "Submit" with ID "submit-42" is always identifiable as a button with that exact label and ID. There is no ambiguity about what the element is or where it is.
Screenshot-based identification is probabilistic. The vision model must interpret the image, decide that a certain region looks like a button, read its text via OCR, and assign coordinates. This works well for large, clearly labeled buttons but struggles with small elements, elements with low contrast, overlapping elements, and elements that look similar (multiple "Edit" buttons on the same page).
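The deterministic lookup described above reduces to a tree walk with an exact match on role and label; no pixels, no probability. The node shape here is an illustrative assumption:

```typescript
// Minimal tree node for lookup purposes.
interface AXNode { role: string; name?: string; children?: AXNode[] }

// Return the first node matching the exact role and accessible name,
// or null if no such element exists on the page.
function find(node: AXNode, role: string, name: string): AXNode | null {
  if (node.role === role && node.name === name) return node;
  for (const child of node.children ?? []) {
    const hit = find(child, role, name);
    if (hit) return hit;
  }
  return null;
}

const page: AXNode = {
  role: "main",
  children: [
    { role: "heading", name: "Checkout" },
    { role: "button", name: "Cancel" },
    { role: "button", name: "Submit" },
  ],
};
```

Either the element exists with that exact role and label or it does not; there is no confidence score to tune.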
Text Reading
The accessibility tree provides exact text content with no OCR errors. Screenshot-based reading occasionally misreads characters, especially in small fonts, stylized text, or non-Latin scripts. The error rate is low (under 2% for standard web pages) but non-zero, and errors compound across multi-step tasks.
Dynamic Content
The accessibility tree reflects the current DOM state, including elements loaded by JavaScript, AJAX responses, and single-page application navigation. If an element exists in the DOM, it appears in the tree.
Screenshots only capture what is currently visible on screen. Elements below the fold, behind modals, in collapsed sections, or loaded after the screenshot is taken are invisible. Some screenshot-based agents address this with full-page screenshots, but these increase image size and processing cost.
Form States
The accessibility tree explicitly reports form states: which radio button is selected, what text is in an input field, whether a checkbox is checked, which option is selected in a dropdown. Screenshots can sometimes infer these states visually, but the inference is unreliable (a checked checkbox and an unchecked one may look nearly identical at certain resolutions).
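Reading form state from the tree amounts to inspecting node properties directly, with no visual inference at all. The node shape and role names below are illustrative assumptions:

```typescript
// Minimal form-control node with explicit state properties.
interface FormNode { role: string; name: string; checked?: boolean; value?: string }

// Collect a name -> state map straight from the tree's reported properties.
function formState(nodes: FormNode[]): Record<string, string> {
  const state: Record<string, string> = {};
  for (const n of nodes) {
    if (n.role === "checkbox" || n.role === "radio") {
      state[n.name] = n.checked ? "checked" : "unchecked";
    } else if (n.role === "textbox" || n.role === "combobox") {
      state[n.name] = n.value ?? "";
    }
  }
  return state;
}

const state = formState([
  { role: "checkbox", name: "Subscribe", checked: true },
  { role: "textbox", name: "Email", value: "a@b.com" },
]);
```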
Accuracy Verdict
The accessibility tree is more accurate for element identification, text reading, dynamic content, and form state detection. Screenshots have one advantage: they capture visual layout, which is useful when the task depends on spatial relationships ("click the button to the right of the price").
Reliability in Production
Failure Modes: Accessibility Tree
- Poorly built websites: Sites with bad accessibility practices (missing ARIA labels, non-semantic HTML) produce sparse or uninformative trees. A button implemented as a styled div with no role or label appears as a generic element rather than a clickable button.
- Canvas and WebGL: Content rendered on HTML canvas or in WebGL contexts (games, some data visualizations) is invisible to the accessibility tree because it bypasses the DOM.
- Shadow DOM: Some web components use Shadow DOM encapsulation. Depending on the implementation, these elements may not appear in the accessibility tree exposed to extensions.
Failure Modes: Screenshots
- Page not fully loaded: If the screenshot is captured before all elements render, the agent sees an incomplete page.
- Overlays and popups: Cookie consent banners, chat widgets, and notification popups can obscure the underlying content, confusing the vision model.
- Responsive layouts: The same page may look completely different at different viewport sizes. An element visible on a desktop layout may be hidden behind a hamburger menu on a narrower viewport.
- Anti-bot measures: Some sites detect automated screenshot capture and serve different content or CAPTCHAs.
Reliability Verdict
Both approaches have failure modes, but they are different failure modes. The accessibility tree fails on poorly built or non-standard websites. Screenshots fail on dynamic, cluttered, or responsive pages. In practice, the accessibility tree produces more consistent results across the broader web because most modern websites follow basic accessibility standards (even if imperfectly), while visual complexity and dynamic content are ubiquitous.
When Screenshots Win
Despite the accessibility tree's advantages in speed, cost, and accuracy for most tasks, there are scenarios where screenshots are genuinely better:
- Visual verification: Tasks that require confirming what a page looks like (design review, visual QA, layout comparison) need visual information that the accessibility tree does not provide.
- Image-based content: Pages where critical information is embedded in images (infographics, charts, scanned documents) require vision model processing.
- Spatial reasoning: Tasks that depend on the physical layout of elements ("the navigation menu on the left" vs "the sidebar widget on the right") benefit from visual context.
- Canvas and rich media: Games, interactive visualizations, and canvas-based applications are invisible to the accessibility tree.
Prophet's Approach
Prophet uses the accessibility tree as its primary perception method for the reasons outlined above: it is faster, cheaper, more accurate for interactive tasks, and more reliable across the general web. The accessibility tree aligns with Prophet's core use cases: interacting with web pages, extracting information, filling forms, and navigating between pages.
For tasks that require visual information, users can describe what they see to the AI in the conversation, or use complementary tools. The architectural choice to prioritize the accessibility tree means that every dollar of Prophet credits goes further because each perception step costs less than a screenshot-based alternative. Read more about how AI web agents work for a broader perspective on agent architectures.
The Future: Hybrid Approaches
The most capable agents in the near future will likely use both approaches adaptively: accessibility tree for fast, routine interactions and screenshots for visual verification and edge cases. This hybrid approach would combine the speed and cost efficiency of text-based perception with the visual completeness of image-based perception, using each method where it performs best.
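Such an adaptive policy could be as simple as a per-step dispatch. The step flags and heuristics below are assumptions for illustration, not a description of any shipping agent:

```typescript
type Method = "accessibility-tree" | "screenshot";

// Fall back to screenshots only when the tree cannot see the content
// (canvas/WebGL) or the task explicitly asks about appearance;
// otherwise use the faster, cheaper text-based perception.
function pickMethod(step: { needsVisualCheck: boolean; canvasContent: boolean }): Method {
  return step.needsVisualCheck || step.canvasContent
    ? "screenshot"
    : "accessibility-tree";
}
```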
Until that convergence happens, the choice between accessibility tree and screenshot approaches reflects a real tradeoff. For browser automation, data extraction, and interactive web tasks, the accessibility tree is the more practical choice. For visual analysis and design-oriented tasks, screenshots are necessary. Prophet's bet on the accessibility tree reflects its focus on productivity and automation rather than visual analysis.