Gemini Computer Use Tutorial: Control a Browser with Gemini 3.5 Flash

A practical guide to Gemini API Computer Use: how to enable computer_use through the Interactions API, why Playwright or another client execution layer is still required, and which safety boundaries matter when building browser automation agents.

Google has made Computer Use a more visible capability in the Gemini API. In simple terms, it lets a model do more than answer: it can inspect screenshots, decide where to click or what to type, and hand those actions to your client code for execution.

This is useful for browser automation agents: testing website flows, filling repetitive forms, collecting page information, or doing simple research across sites. It does not mean “give the model a computer and the job is done.” The model plans actions; your code executes them safely, sends screenshots back, and decides when to stop.

Official documentation:

https://ai.google.dev/gemini-api/docs/computer-use?hl=zh-cn

How It Works

The basic loop is:

  1. Your program sends the task, tool configuration, and current screen state to Gemini.
  2. Gemini returns an action, such as click, type, scroll, or open a page.
  3. Your program executes that action with Playwright, mobile automation, or desktop automation.
  4. After execution, it captures a new screenshot and sends the updated state back.
  5. The loop repeats until the task is complete, a safety confirmation is triggered, or the program stops.

The key point is that the Gemini API does not click the browser for you. It returns suggested UI operations; execution remains in your client.
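
The loop above can be sketched in Python with the model call and the executor passed in as plain callables. This is a minimal sketch, not the API's own interface: the names `run_loop`, `plan`, `execute`, and `capture`, and the use of `None` as a "task complete" signal, are assumptions for illustration. In a real agent, `plan` would wrap the Interactions API request and `execute` would wrap Playwright.

```python
from typing import Callable, Optional

Action = dict  # e.g. {"name": "click", "arguments": {"x": 450, "y": 120}}


def run_loop(
    task: str,
    plan: Callable[[str, bytes], Optional[Action]],
    execute: Callable[[Action], None],
    capture: Callable[[], bytes],
    max_steps: int = 20,
) -> list:
    """Run the plan -> execute -> capture loop and return the executed actions."""
    executed = []
    screenshot = capture()  # initial screen state
    for _ in range(max_steps):
        # Steps 1-2: send task and screen state, receive a suggested action.
        action = plan(task, screenshot)
        if action is None:  # hypothetical "task complete" signal
            break
        # Step 3: the client, not the API, performs the action.
        execute(action)
        executed.append(action)
        # Step 4: capture fresh state for the next request.
        screenshot = capture()
    return executed
```

Passing the three callables in keeps the loop testable without a browser or an API key: a scripted `plan` and a recording `execute` are enough to check the control flow.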

The recommended model is:

gemini-3.5-flash

Use the Interactions API and enable:

{
  "type": "computer_use",
  "environment": "browser"
}

environment tells the model what kind of interface it is operating. Browser automation is the easiest starting point because Playwright already handles clicks, typing, screenshots, and viewport control.

Minimal Python Example

Install dependencies:

pip install google-genai playwright
playwright install chromium

Send a request with computer_use enabled:

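from google import genai

# The client reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()

# Ask for the first step. The response describes a suggested UI operation;
# nothing is executed until your own code performs it.
interaction = client.interactions.create(
    model="gemini-3.5-flash",
    input="Search for 'Gemini API' on Google.",
    tools=[
        {
            "type": "computer_use",
            "environment": "browser",
            # Extra screening for instructions hidden in page content.
            "enable_prompt_injection_detection": True,
        }
    ],
)

print(interaction)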

It is worth enabling enable_prompt_injection_detection. Computer Use reads screen content, and web pages can contain prompt-injection text. Detection is not a full security boundary, but it is a useful extra layer.

Playwright Executes the Actions

Initialize a browser environment:

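from playwright.sync_api import sync_playwright

# Keep the viewport fixed so coordinates map to the same pixels on every step.
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

playwright = sync_playwright().start()
# headless=False shows the browser window, which helps when debugging actions.
browser = playwright.chromium.launch(headless=False)

# A fresh context starts without cookies or saved logins.
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
)
page = context.new_page()
page.goto("https://www.google.com")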

Gemini usually returns coordinates normalized to a 1000-unit grid rather than pixel positions, so they do not depend on the screenshot size. Convert them to real viewport coordinates:

def denormalize_x(x: int, screen_width: int) -> int:
    return int(x / 1000 * screen_width)


def denormalize_y(y: int, screen_height: int) -> int:
    return int(y / 1000 * screen_height)

If the model returns:

{
  "type": "function_call",
  "name": "click",
  "arguments": {
    "x": 450,
    "y": 120,
    "intent": "Click the search box to type the query."
  }
}

Your client can execute:

actual_x = denormalize_x(450, SCREEN_WIDTH)
actual_y = denormalize_y(120, SCREEN_HEIGHT)
page.mouse.click(actual_x, actual_y)
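
Generalizing the single click above, a small dispatcher can route each returned function_call to the matching Playwright call. This is a minimal sketch under stated assumptions: only `click` appears in this guide, so the action names `type`, `scroll`, and `navigate` and their argument keys (`text`, `delta_y`, `url`) are hypothetical and should be checked against the official action list. `execute_action` and `to_pixels` are illustrative names.

```python
from typing import Any

SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900


def to_pixels(value: int, size: int) -> int:
    """Convert a coordinate on the 1000-unit grid to viewport pixels."""
    return int(value / 1000 * size)


def execute_action(page: Any, call: dict) -> None:
    """Dispatch one model function_call to a Playwright-style page object."""
    name = call["name"]
    args = call.get("arguments", {})
    if name == "click":
        page.mouse.click(
            to_pixels(args["x"], SCREEN_WIDTH),
            to_pixels(args["y"], SCREEN_HEIGHT),
        )
    elif name == "type":  # assumed action name and "text" argument
        page.keyboard.type(args["text"])
    elif name == "scroll":  # assumed action name and "delta_y" argument
        page.mouse.wheel(0, args["delta_y"])
    elif name == "navigate":  # assumed action name and "url" argument
        page.goto(args["url"])
    else:
        # Refuse anything unknown instead of guessing what the model meant.
        raise ValueError(f"Unsupported action: {name}")
```

Raising on unknown actions is a deliberate choice: an executor that silently ignores an action lets the model and the real page state drift apart.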

After that, capture another screenshot and send the result back to the model. The hard part is keeping the whole loop stable: every model response, action execution, screenshot return, and follow-up request has to succeed in sequence.

Start Small

Do not start with a real admin workflow. Begin with low-risk tasks:

Open Google, search for Gemini API, and summarize the first few results.

Or:

Open a local test page, click the login button, and check whether an error message appears.

These tasks do not touch real accounts or money, are easy to reproduce, and make it easier to inspect whether each intent is reasonable.

Safety Boundaries

The main risk is that the model can operate an interface. Add limits in both system instructions and client logic.

  • Use a sandbox browser, container, or VM instead of your main browser profile.
  • Block access to browser history, autofill, and saved passwords.
  • Require user confirmation for login, payment, order submission, sending messages, posting content, or accepting agreements.
  • Do not let the model solve CAPTCHAs or bypass human verification.
  • Use allowlists or blocklists for websites.
  • Log prompts, screenshots, model actions, safety decisions, and executed actions.

Once an agent touches a real account, the risk is no longer just a wrong click. It can become data leakage, accidental sending, accidental purchase, or irreversible submission.
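
Some of the limits above can be enforced in client logic before any action runs. The sketch below combines a host allowlist with a confirmation check on the action's `intent` text. It is a rough illustration, not a complete safety layer: the names `review_action`, `ALLOWED_HOSTS`, and `RISKY_KEYWORDS`, the `url` argument key, and the keyword list itself are all assumptions, and plain substring matching will produce both false positives and misses.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.google.com", "localhost"}  # example allowlist
# Crude hints that an action may log in, pay, send, post, or accept something.
RISKY_KEYWORDS = ("log in", "login", "sign in", "pay", "purchase",
                  "order", "send", "post", "accept")


def is_allowed_url(url: str) -> bool:
    """True when the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS


def needs_confirmation(intent: str) -> bool:
    """True when the intent text mentions a risky operation."""
    lowered = intent.lower()
    return any(keyword in lowered for keyword in RISKY_KEYWORDS)


def review_action(call: dict, current_url: str) -> str:
    """Return "block", "confirm", or "allow" for one proposed action."""
    args = call.get("arguments", {})
    if not is_allowed_url(current_url):
        return "block"  # the page itself is outside the allowlist
    target = args.get("url")  # assumed key for navigation targets
    if target and not is_allowed_url(target):
        return "block"
    if needs_confirmation(args.get("intent", "")):
        return "confirm"  # pause and ask the user before executing
    return "allow"
```

A "confirm" result should stop the loop until a person approves the step; treating it as a warning that is only logged defeats the purpose.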

Function Calling vs Computer Use

Function calling is like asking the model to choose an API and fill structured parameters. Computer Use is closer to asking the model to look at a real UI and operate it.

If the target system has a stable API, use function calling first. Use Computer Use when you must operate the interface, test real user flows, or handle web interaction.

Common Pitfalls

1. Thinking the model controls the browser by itself

It does not. The model returns actions; you must provide the executor.

2. Forgetting coordinate conversion

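Keep the browser size, zoom level, and screenshot dimensions constant, and convert every normalized coordinate to viewport pixels before clicking; otherwise clicks will drift.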

3. Starting from a messy page

Pop-ups, cookie banners, notification prompts, and changing login state all affect model decisions.

4. Missing stop conditions

Set maximum steps, maximum runtime, and confirmation for risky actions.

5. Using it for critical decisions

This is a preview capability. Avoid fully automating financial, medical, government, account-security, or irreversible tasks.
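
For the stop conditions in pitfall 4, a small guard object can track the step count and elapsed time and tell the loop when to quit. This is a sketch with assumed names (`StopGuard`, `record_step`, `stop_reason`) and arbitrary default limits; the clock is injectable only so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional


class StopGuard:
    """Tracks step count and elapsed time for one agent run."""

    def __init__(
        self,
        max_steps: int = 25,
        max_seconds: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()  # the run starts when the guard is created
        self.steps = 0

    def record_step(self) -> None:
        """Call once after each executed action."""
        self.steps += 1

    def stop_reason(self) -> Optional[str]:
        """Return why the run must stop, or None if it may continue."""
        if self.steps >= self.max_steps:
            return "max_steps"
        if self._clock() - self._start >= self.max_seconds:
            return "max_runtime"
        return None
```

In the agent loop, check `stop_reason()` before requesting the next action and call `record_step()` after each execution, so a stuck or looping model cannot run indefinitely.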

Practical Project Layout

agent/
  client.py          # Call Gemini Interactions API
  browser.py         # Playwright browser control
  actions.py         # click/type/scroll execution
  safety.py          # allowlists, confirmations, risky-action blocking
  recorder.py        # screenshots, logs, step records
  prompts.py         # system prompts and task templates

Main loop:

Create browser environment
Open the start page
Send task and current screenshot
while not complete:
    Read Gemini function_call
    Check safety rules
    Execute the action
    Capture screenshot and log the step
    Send function_result back to Gemini
Close browser
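
The "capture screenshot and log the step" line in the loop could be backed by a small recorder like the one below, roughly what recorder.py in the layout above might hold. The class name `StepRecorder`, the file names, and the log fields are assumptions for illustration; it writes one PNG per step plus a JSON Lines log that can be replayed when a run goes wrong.

```python
import json
import time
from pathlib import Path


class StepRecorder:
    """Saves one screenshot and one JSON log line per agent step."""

    def __init__(self, run_dir) -> None:
        self.run_dir = Path(run_dir)
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.log_path = self.run_dir / "steps.jsonl"
        self.step = 0

    def record(self, call: dict, decision: str, screenshot: bytes) -> Path:
        """Store the screenshot and append the step to the log."""
        self.step += 1
        shot_path = self.run_dir / f"step_{self.step:03d}.png"
        shot_path.write_bytes(screenshot)
        entry = {
            "step": self.step,
            "time": time.time(),
            "call": call,          # the model's proposed action
            "decision": decision,  # e.g. "allow", "confirm", "block"
            "screenshot": shot_path.name,
        }
        # Append mode keeps earlier steps if the run crashes midway.
        with self.log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")
        return shot_path
```

With Playwright, the bytes returned by `page.screenshot()` can be passed directly as the `screenshot` argument.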

Good and Poor Fits

Good fits:

  • Web app end-to-end testing.
  • Repetitive admin data entry.
  • Public webpage information collection.
  • Web flow validation with screenshots.
  • Low-risk internal tool automation.

Poor fits:

  • Sensitive account operation.
  • Payments, purchases, transfers, or orders.
  • CAPTCHA or anti-bot bypass.
  • Irreversible production admin actions.
  • Privacy, finance, or medical data handling.

Summary

Gemini Computer Use moves browser automation from fixed scripts toward decisions based on screen state. It is powerful for testing, data entry, and page research, but it works more like an agent loop that you build and operate than like a single API call. Keep the environment fixed, build a reliable action loop, and put safety rules first.
