Teaching an Agent to Use a Browser
For the assistant to do real chores, it needed hands. This is the first half of giving it some.
I have a pile of small browser chores that only I can do because they live behind my own logins: register a warranty on an appliance, check a dashboard that has no API, fill in some form that wants my account. None of it is hard, all of it is tedious. Could I hand that work to one of the coding agents I already run, and have it do the same clicking and typing I’d do, on my own accounts, while I was off doing something else?
Note: most of this happened before Claude Cowork was generally available. In any case, I wanted to build my own generic screen control, both for experimentation and for things that wouldn’t have Cowork connectors, at least initially.
The catch: an agent that can read and write files and run shell commands still can’t see a web page the way I can, and it can’t bring my logged-in session. So: how do you let an agent drive a browser at all?
What “drive a browser” actually means
At the cheap end is no browser at all - just curl and reading HTML. That’s fine for a static page, and it’s where you should start, but it falls over the instant the page needs JavaScript to render, or needs you to be logged in, or has a form that posts through some framework’s state machine. Most of the chores I cared about were that kind of page.
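A rough sketch of that cheap end, in Python rather than curl: fetch the raw HTML and check whether the text you care about is there before any JavaScript runs. The function names and the "expected text" check are illustrative assumptions, not a standard test; if the text only shows up inside a script tag or not at all, the page probably needs a real browser.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see without running any JavaScript."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def static_fetch_is_enough(url, expected_text, timeout=10):
    """Hypothetical helper: True if expected_text is in the un-rendered page."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return expected_text in visible_text(html)
```

If `static_fetch_is_enough(...)` comes back false for text that is plainly there in a normal browser, that is the signal to stop reading HTML and reach for a rendered page.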
The other end is a real browser, fully rendered, with a real session, that something other than my hands is operating. That’s what I needed. The question is which “something”, and there are a few options.
Selenium, with my real Firefox profile
First thing that worked: Selenium driving Firefox, pointed at a copy of my own Firefox profile.
My day-to-day Firefox profile holds all my cookies and saved sessions. If I launch a fresh, empty browser, the agent is a stranger everywhere and has to log in from scratch, which means handling passwords and second factors, exactly the kind of thing that’s a bad idea to automate. If instead I copy my existing profile and launch the browser against the copy, the agent shows up already logged in, as if I’d opened the browser myself. No password handling, no fresh login.
It has to be a copy, not the profile itself, because Firefox locks its profile directory - you can only have one instance using it at a time. So the pattern I settled on is: at the start of a task, copy the default profile to a throwaway directory; point Selenium at the copy; do the work; delete the copy when it’s done.
On NixOS, getting Selenium and geckodriver in place to run the script is a one-liner:
nix-shell -p geckodriver python3Packages.selenium --run "python3 script.py"
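For illustration, here is a minimal sketch of what a script.py following the copy, point, work, delete pattern could look like. It is not the script from this post: the helper names are made up, the profile location is something you would supply yourself, and skipping the `lock` and `.parentlock` files during the copy is an assumption about what keeps the copy from looking like it is already in use.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def throwaway_profile(source_profile):
    """Copy a Firefox profile to a temp dir, yield the copy, then delete it."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-profile-"))
    copy = workdir / "profile"
    try:
        shutil.copytree(
            source_profile,
            copy,
            # Assumption: leave Firefox's lock files behind so the copy
            # does not appear to be held by a running instance.
            ignore=shutil.ignore_patterns("lock", ".parentlock"),
        )
        yield copy
    finally:
        # The copy holds live session cookies, so always clean it up.
        shutil.rmtree(workdir, ignore_errors=True)


def page_title_as_me(source_profile, url):
    """Open url in headless Firefox using a throwaway copy of the profile."""
    # Imported here so the profile helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    with throwaway_profile(source_profile) as profile:
        options = Options()
        options.add_argument("-headless")
        options.add_argument("-profile")
        options.add_argument(str(profile))
        driver = webdriver.Firefox(options=options)
        try:
            driver.get(url)
            return driver.title
        finally:
            driver.quit()
```

The context manager is the part that matters: whatever happens during the task, the copy of the cookies and sessions is removed afterwards, and the real profile is never opened by a second Firefox.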
Selenium is good at the structured stuff. It talks to the page’s actual DOM, so the agent can find an element, read its text, click it, type into it - all the things you’d do by hand, but addressed by selector instead of by pixel. For a warranty form or a settings page, that’s precise and reliable in a way that pointing-and-clicking never is.
It’s not good at anything that doesn’t behave like a plain HTML form. Modern single-page apps were a recurring headache. A lot of the fancy dropdowns and pickers you see on a polished web app aren’t real <select> elements - they’re a pile of divs and ARIA spans wired up to some framework’s internal state, and a plain Selenium click() doesn’t update that state the way a genuine human click does. I lost an embarrassing amount of time to selects that looked clicked but weren’t. The fix was usually to stop being clever about the DOM and instead drive it the way a person would: move the mouse to the element, click, pause, type the search text, hit Enter, and let the component’s own JavaScript do its thing.
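The "drive it the way a person would" sequence maps fairly directly onto Selenium's ActionChains. This is a sketch under assumptions: the function name, the 0.3 second pause, and the idea that typing then Enter selects the option are all illustrative, and real components will need their own timing.

```python
def pick_like_a_person(actions, element, text, submit_key, pause_s=0.3):
    """Choose an option in a custom dropdown using real input events.

    `actions` is expected to behave like Selenium's ActionChains(driver);
    `submit_key` is whatever confirms the choice, typically Keys.ENTER.
    """
    (
        actions.move_to_element(element)  # move the pointer onto the widget
        .click()                          # click at the pointer position
        .pause(pause_s)                   # give the component time to open
        .send_keys(text)                  # type the search text
        .pause(pause_s)                   # let the option list filter
        .send_keys(submit_key)            # confirm the highlighted option
        .perform()                        # run the queued actions in order
    )


# Usage with a live driver (requires Selenium):
#   from selenium.webdriver import ActionChains, Keys
#   pick_like_a_person(ActionChains(driver), dropdown, "Germany", Keys.ENTER)
```

The difference from `element.click()` followed by `element.send_keys(...)` is that the keystrokes go to whatever the component focused after the click, which is often a hidden input the script never located itself.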
Still, for “log in on my account and fill in this form”, headless Selenium against a copy of my profile is what I reach for first.
Playwright, for the fresh-session jobs
Selenium leans on my existing logins, which is perfect when the task is about my account. But sometimes the task is the opposite - I want a clean browser that knows nothing about me. Scraping a public page, poking at something where I don’t want my session attached, browsing as a nobody. For that I used Playwright instead.
Playwright runs a headless Chrome with its own persistent profile, separate from my real one, and it’s better than Selenium for the “look at the page and decide what to do” loop. Screenshots come back inline, so the agent can see a rendering of the page, not just its HTML, and decide where to click from that. The rhythm is navigate, take a snapshot and a screenshot, click or type, then screenshot again to confirm something changed. When a page is mostly visual and the DOM is a mess, being able to see it is worth a lot.
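The screenshot, act, screenshot rhythm can be sketched with Playwright's Python sync API. This assumes the plain library rather than whatever harness the agent was actually wired to; the function names and the profile directory are hypothetical, and comparing raw screenshot bytes is a crude stand-in for the agent looking at the image.

```python
def click_and_confirm(page, selector):
    """Screenshot, click, screenshot again; report whether anything changed.

    Returns (changed, after_png). Byte comparison is deliberately naive:
    animations or a blinking cursor will also count as a change.
    """
    before = page.screenshot()
    page.click(selector)
    after = page.screenshot()
    return before != after, after


def run_once(url, selector, profile_dir):
    """Open url in headless Chromium with its own profile and click once."""
    # Imported here so click_and_confirm can be exercised without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context keeps its state in profile_dir, separate
        # from any everyday browser profile.
        context = p.chromium.launch_persistent_context(
            profile_dir, headless=True
        )
        try:
            page = context.new_page()
            page.goto(url)
            changed, _ = click_and_confirm(page, selector)
            return changed
        finally:
            context.close()
```

In an agent loop the second screenshot would go back to the model rather than being compared byte for byte, but the shape is the same: never assume a click worked, look again.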
The two tools split the work. Selenium when the job needs to be me: my logins, my account, my dashboards. Playwright when the job needs a clean room: a browser that carries none of my history and none of my sessions. I kept both around because they’re good at opposite things, and used whichever matched the task.
Where it stopped being enough
For a long stretch, that two-tool setup covered everything I threw at it. The agent could log into my accounts, fill in forms, read dashboards, scrape the odd public page. Job done, mostly.
Two kinds of task broke that, both because they weren’t really in the browser the way the library assumed.
A couple of the sites I wanted to automate are openly hostile to anything that looks automated. I was trying to do ordinary things on my own accounts, the same clicks I’d do by hand, just without my hands. But some sites treat a headless or scripted browser as suspicious by default, and they’re within their rights to. On those sites the copied-profile trick stopped working, because the browser presenting my cookies wasn’t my actual running browser. The session lives partly in the fingerprint of the specific browser instance, and a headless copy isn’t the same instance.
The other kind isn’t a web page at all. Some of the chores I wanted to hand off live in a native desktop app, or in a UI so far from a plain form that driving its DOM is more pain than it’s worth. A browser-control library only controls a browser. The moment the task is “click this button in this window”, Selenium and Playwright have nothing to say.
Both needed the same fix: drive the screen. Move the real mouse and press real keys on my desktop, against whatever’s in front of it, browser or not. If the agent could operate the machine the way I do, then “my real Firefox, already logged in” and “that native app with no API” become the same problem, and the hostile-site question mostly goes away because it’s my own browser doing the clicking.
On Wayland, none of that was simple. Pointing a mouse at a pixel and clicking it meant getting past broken screenshot tools, a mouse that lied about where it was, and a trip into the compositor’s own remote-control API. That’s a story for its own post.
To be continued - screen control.