
February 11, 2026 · 8 min read

Building Autonomous Agents with OpenAI's Computer Use API: A Practical Guide

Learn to build automation agents with OpenAI's computer use API. Practical examples for form filling, multi-app workflows, and production reliability.


By the end of this post, you'll have an agent that fills multi-page web forms, moves data between desktop apps, and recovers from errors, all by reading screenshots and clicking. No selectors, no brittle XPaths. Here's the setup.

Setup

You'll need the OpenAI Python SDK (2.8+), Playwright for browser control, and Pillow for image handling. The computer-use tool runs through the computer-use-preview model in the Responses API, not Chat Completions.

terminal
pip install "openai>=2.8.0" pillow playwright
playwright install chromium

The action loop works like this: you send a screenshot and a goal, the model returns a structured action (click, type, press, etc.), you execute it and send back the updated screenshot. OpenAI's computer use guide defines this protocol, including how computer_call and computer_call_output response items fit together. You don't need to invent the orchestration yourself.
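If you orchestrate the loop yourself, the feedback half of that protocol is a computer_call_output item carrying the call id and the fresh screenshot. Here's a minimal sketch of that item's shape, following the field names in OpenAI's guide (the helper name and placeholder screenshot string are mine):

call_output.py

```python
def build_call_output(call_id: str, screenshot_b64: str) -> dict:
    # Shape of the item you send back after executing a computer_call,
    # so the model can see the result of its last action.
    return {
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {
            "type": "computer_screenshot",
            "image_url": f"data:image/png;base64,{screenshot_b64}",
        },
    }
```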

Here's the base client:

computer_use_client.py
import openai
import base64
from PIL import Image
from io import BytesIO
 
client = openai.OpenAI()
 
def encode_screenshot(image: Image.Image) -> str:
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
 
def execute_computer_action(screenshot: Image.Image, goal: str, history: list) -> dict:
    response = client.responses.create(  # highlight-line
        model="computer-use-preview",
        tools=[{
            "type": "computer_use_preview",
            "display_width": 1280,
            "display_height": 720,
            "environment": "browser",
        }],
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": goal},
                # The screenshot is what the model actually reasons over
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{encode_screenshot(screenshot)}",
                },
            ],
        }],
        truncation="auto"
    )
    # Return the first pending computer_call action as a plain dict;
    # if the model returned none, it considers the task complete.
    for item in response.output:
        if item.type == "computer_call":
            return item.action.model_dump()
    return {"type": "complete"}

Screenshots at 1280x720 hit a practical sweet spot. Higher resolutions rarely improve action accuracy but cost noticeably more tokens per step.

Build: Form-Filling Agent

This agent fills a multi-page web form. The pattern works for CRM data entry, compliance submissions, or any workflow built around a sequence of inputs.

form_agent.py
from playwright.sync_api import sync_playwright
from PIL import Image
from io import BytesIO
from computer_use_client import execute_computer_action
import json
 
def capture_screenshot(page) -> Image.Image:
    screenshot_bytes = page.screenshot()
    return Image.open(BytesIO(screenshot_bytes))
 
def run_form_agent(url: str, form_data: dict):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url)
 
        goal = f"Fill out the form with this data: {json.dumps(form_data)}. Submit when complete."
        history = []
        max_steps = 30
 
        for step in range(max_steps):
            screenshot = capture_screenshot(page)
            action = execute_computer_action(screenshot, goal, history)  # highlight-line
 
            if action["type"] == "complete":
                print(f"Form submitted after {step + 1} steps")
                break
 
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "press":
                page.keyboard.press(action["key"])

            page.wait_for_timeout(500)  # let the UI settle before the next screenshot
            history.append(action)
 
        browser.close()
 
form_data = {
    "company_name": "Corvus Tech",
    "contact_email": "hello@corvustech.ca",
    "service_type": "AI Integration",
    "budget_range": "$50,000-$100,000"
}
run_form_agent("https://example.com/contact", form_data)

The loop is straightforward: screenshot, ask the model what to do, do it, repeat. I set max_steps to roughly 2-3x the expected number of interactions as a safety net.

One thing I didn't expect: the model handles field labels it's never seen before by reading them from the screenshot. Unlike selector-based automation, this survives UI redesigns without any code changes.

Build: Multi-App Workflow Agent

Real workflows span multiple apps. This agent pulls data from one desktop application and enters it into another.

multi_app_agent.py
import subprocess
import time
from PIL import Image, ImageGrab
from computer_use_client import execute_computer_action
import pyautogui

def capture_desktop() -> Image.Image:
    return ImageGrab.grab()

def execute_desktop_action(action: dict):
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.02)
    elif action["type"] == "hotkey":
        pyautogui.hotkey(*action["keys"])

def cross_app_workflow(source_app: str, dest_app: str, task: str):
    subprocess.Popen(["open", "-a", source_app])  # macOS
    subprocess.Popen(["open", "-a", dest_app])
    time.sleep(2)  # give both apps a moment to launch before the first screenshot
 
    goal = f"""
    Task: {task}
    Source application: {source_app}
    Destination application: {dest_app}
 
    Steps:
    1. Switch to {source_app} and locate the required data
    2. Copy the data
    3. Switch to {dest_app} and paste/enter the data
    4. Verify the data appears correctly
    """
 
    history = []
    for step in range(50):  # highlight-line
        screenshot = capture_desktop()
        action = execute_computer_action(screenshot, goal, history)
 
        if action["type"] == "complete":
            return {"success": True, "steps": step + 1}
 
        execute_desktop_action(action)
        history.append(action)
 
    return {"success": False, "reason": "max_steps_exceeded"}
 
result = cross_app_workflow(
    source_app="Preview",
    dest_app="QuickBooks",
    task="Extract the invoice total, date, and vendor name from the open PDF and create a new bill entry"
)

Desktop automation needs Accessibility permission on macOS (System Settings > Privacy & Security > Accessibility), and capturing the screen with ImageGrab additionally requires Screen Recording permission. On Windows, you'll need to run as administrator if the target app is elevated. Worth verifying this before you spend time debugging what looks like a click miss.

Making It Reliable

Two things will trip you up in production: flaky execution and context bloat.

Retry with backoff. Network hiccups, unexpected dialogs, and application crashes are all normal. Here's my standard wrapper:

reliable_agent.py
import time
from typing import Callable
 
class AgentError(Exception):
    pass
 
def with_retry(
    action_fn: Callable,
    max_retries: int = 3,
    backoff_base: float = 1.0
) -> dict:
    last_error = None

    for attempt in range(max_retries):
        try:
            result = action_fn()

            if result.get("verification", {}).get("passed", True):  # highlight-line
                return result

            reason = result["verification"].get("reason", "unknown")
            raise AgentError(f"Verification failed: {reason}")

        except Exception as e:
            last_error = e
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            wait_time = backoff_base * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time}s")
            time.sleep(wait_time)

    raise AgentError(f"All {max_retries} attempts failed. Last error: {last_error}")

Also add stuck detection. If the model returns the same action three times in a row, force a screenshot refresh or inject a recovery prompt asking it to try a different approach. This catches cases where an unexpected modal is blocking progress and the model keeps clicking the same coordinate.
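One way to sketch that stuck check (the class name and the fields used as the action fingerprint are my choices, not part of the API):

stuck_detection.py

```python
from collections import deque

class StuckDetector:
    """Flags when the model returns the same action several times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)

    def check(self, action: dict) -> bool:
        # Fingerprint the action by the fields that matter for repetition
        key = (action.get("type"), action.get("x"), action.get("y"), action.get("text"))
        self.recent.append(key)
        return len(self.recent) == self.threshold and len(set(self.recent)) == 1
```

When `check` returns True, force a fresh screenshot or inject a recovery prompt before continuing the loop.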

Manage context. Screenshots cost roughly 1,500-2,500 tokens each depending on complexity. A 50-step workflow burns 75,000-125,000 tokens in images alone. Keep recent history in full, summarise the rest:

context_management.py
def compress_history(history: list, keep_recent: int = 10) -> list:
    if len(history) <= keep_recent:
        return history

    old_actions = history[:-keep_recent]
    shown = old_actions[-5:]
    remainder = len(old_actions) - len(shown)
    summary = {
        "type": "history_summary",
        "description": f"Completed {len(old_actions)} prior actions: " +
                       ", ".join(a.get("description", a["type"]) for a in shown) +
                       (f" (and {remainder} more)" if remainder else "")
    }

    return [summary] + history[-keep_recent:]  # highlight-line

For workflows past 100 steps, save application state at key milestones so you can resume from the last checkpoint instead of replaying the full history.

Keeping Costs Down

Computer-use workflows get expensive fast. Three things that help:

Scale screenshots to 1280x720 before encoding. Higher resolutions rarely improve accuracy.
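A small Pillow helper covers this; `thumbnail` preserves aspect ratio and never upscales (the function name is mine):

scale_screenshot.py

```python
from PIL import Image

def scale_for_model(image: Image.Image, max_w: int = 1280, max_h: int = 720) -> Image.Image:
    # Downscale to fit within max_w x max_h, keeping aspect ratio.
    # thumbnail() only ever shrinks, so small captures pass through untouched.
    img = image.copy()
    img.thumbnail((max_w, max_h))
    return img
```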

Skip screenshots between pure keyboard steps. If an action is just typing text into a field you've already focused, you don't need a fresh screenshot before the next keystroke.

Break long workflows into sub-goals. Complete one chunk, verify, then start the next with a shorter context. The planning step is cheap compared to the savings on the execution side:

cost_optimized.py
def chunk_workflow(full_goal: str, chunk_size: int = 5) -> list:
    planning_response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{
            "role": "user",
            "content": f"Break this task into {chunk_size}-step chunks: {full_goal}"
        }]
    )
    return planning_response.choices[0].message.content.split("\n\n")

One more production tip: set hard step limits and wall-clock timeouts on every agent loop. Runaway agents are expensive and difficult to debug after the fact.

sandboxed_runner.py
import signal  # SIGALRM-based timeouts are Unix-only; Windows needs a different mechanism
 
def timeout_handler(signum, frame):
    raise TimeoutError("Agent exceeded maximum runtime")
 
def run_with_timeout(agent_fn, timeout_seconds: int = 300):
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)  # highlight-line
    try:
        return agent_fn()
    finally:
        signal.alarm(0)

Don't give production agents access to credentials, payment systems, or admin interfaces without a human confirmation step. Computer-use models can be manipulated by adversarial content visible on screen, and a model with full desktop control can click anything it sees.
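One cheap gate is a keyword screen over the pending action and any text visible on screen; anything that trips it pauses the loop for human sign-off. This is a sketch, the keyword list is illustrative and far from exhaustive:

confirmation_gate.py

```python
SENSITIVE_KEYWORDS = {"password", "credit card", "pay", "admin", "delete"}

def requires_confirmation(action: dict, screen_text: str) -> bool:
    # Pause for a human when the action text or the visible screen
    # mentions anything sensitive. A real gate would also check URLs
    # and app names, not just keywords.
    blob = (action.get("text", "") + " " + screen_text).lower()
    return any(keyword in blob for keyword in SENSITIVE_KEYWORDS)
```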

Running in Production

I've had the invoice-processing version of this running for a week. It's processed around 200 forms with three failures, all from a third-party site swapping its CAPTCHA provider. For that kind of unstructured, UI-heavy workflow, that's a better success rate than I expected.

Start with the form-filling agent from this post. Pick a low-stakes internal workflow, get comfortable with the action loop, and build up from there. The code here is enough to get something real running in an afternoon. If you want help taking it from experiment to production, we do this kind of work.
