Teaching an Agent to Click on Wayland

Second half of giving the assistant hands: driving the actual screen.

In the last post I got an agent driving a browser: Selenium against a copy of my Firefox profile for anything on my own accounts, Playwright for the clean-room jobs. That didn’t cover native desktop apps with no DOM to drive, or the occasional site where only my actual running browser, with my actual session, would do.

Instead of controlling a browser, control the screen: move the real mouse, press real keys, against whatever window is in front of me. If the agent can operate the machine the way I do when I’m sitting at it, the browser and the native app stop being two different problems.

On GNOME on Wayland, that took a lot longer than I expected.

Why Wayland makes this hard

On X11, faking input and grabbing the screen were solved problems: tools could shove fake mouse events into the server and read pixels back out. Wayland closed most of that off for security. An application can’t snoop the whole screen or synthesise input for other windows, which is exactly what I was trying to do.

The loop I needed (take a screenshot, let the agent look at it, decide where to click, move there, click, screenshot again) has two halves that are both awkward on Wayland: seeing the screen, and moving the mouse. I got a proof of concept working by stitching together whatever worked, then replaced each piece as I found out how it was lying to me.

ydotool, and a mouse that drifts

For input, the tool that works on Wayland is ydotool. It injects events at the kernel level through uinput, below the display server entirely, so Wayland’s restrictions don’t apply. A daemon runs as a service, you talk to it over a socket, and on NixOS it’s a flag to flip plus adding myself to a group (and then, annoyingly, logging out and back in for the group to take effect).

The keyboard half was reliable from the start. Key codes are just key codes: Ctrl+L to focus the address bar, type, Enter. Resolution doesn’t matter.
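A rough sketch of what “key codes are just key codes” looks like from Python. It assumes ydotool 1.x, where the key command takes code:state pairs (1 for press, 0 for release); older releases used a different syntax. The helper names are made up for illustration, and the numbers are the standard Linux input event codes.

```python
import subprocess

# Linux input event codes from <linux/input-event-codes.h>.
KEY_CODES = {"ctrl": 29, "shift": 42, "enter": 28, "l": 38}


def chord_args(*names):
    """Build ydotool key arguments: press each key in order, release in reverse."""
    codes = [KEY_CODES[name] for name in names]
    presses = [f"{code}:1" for code in codes]
    releases = [f"{code}:0" for code in reversed(codes)]
    return presses + releases


def press(*names):
    """Send one chord through a running ydotoold (assumes ydotool 1.x syntax)."""
    subprocess.run(["ydotool", "key", *chord_args(*names)], check=True)
```

With that, press("ctrl", "l") focuses the address bar and press("enter") submits; none of it depends on screen resolution or pointer position.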

The mouse half was not. ydotool is a relative input tool: it’s good at “move twenty pixels left”. When you ask it to go to an absolute coordinate, it fakes it by first slamming the pointer to a corner and then moving relative from there. On GNOME, shoving the pointer into the top-left corner triggers the hot corner and pops open the activities overview, right in the middle of whatever I was trying to click. You can disable the hot corner with a gsettings toggle, which I did, but it was a sign the approach was fighting the system.

Even with that papered over, the coordinates were never quite right. Mouse acceleration distorts a relative move, so the same “go to x, y” landed in different places depending on where the pointer started. A flat acceleration profile helped. The real killer was my monitor, a 4K display with GNOME’s display scaling on. That puts two coordinate systems in play: the physical 3840 by 2160 pixels, and the smaller logical space that applications think in. ydotool was working in one and I was reading targets off the other, and the mismatch showed up as the pointer landing tens of pixels away from where I aimed. Big targets it could hit. Small buttons it would miss, silently, and then the agent would click on nothing and carry on as if it had succeeded.
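The two coordinate systems are related by the display scale factor. A minimal sketch of the conversion, assuming a uniform scale; the 2.0 in the usage note is only an example, since the actual factor depends on the display settings.

```python
def physical_to_logical(x, y, scale):
    """Convert a point in physical pixels to the logical space applications use."""
    return round(x / scale), round(y / scale)


def logical_to_physical(x, y, scale):
    """Convert a point in logical pixels back to physical pixels."""
    return round(x * scale), round(y * scale)
```

At an assumed scale of 2.0, physical_to_logical(3840, 2160, 2.0) gives (1920, 1080). A target read off a full-resolution screenshot has to go through this conversion before it means anything to a tool that works in logical pixels, and the reverse holds too.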

A click that misses isn’t an error: the command runs fine, it just clicks empty space, and the agent has no way to know. So I’d get confident, wrong sequences where every step “worked” and the net result was nonsense. For an agent meant to run while I’m not watching, “succeeds and does the wrong thing” is the failure you can’t tolerate.

Seeing the screen, which also didn’t work

gnome-screenshot is broken on GNOME 46 and later under Wayland: the Shell’s screenshot D-Bus method returns a flat “not allowed” for a third-party caller like mine. scrot runs but only sees the X11 world through XWayland, so for native Wayland windows it hands back a black rectangle, which the agent then tries to read coordinates from.

What did work was GNOME’s own built-in screenshot, bound to Shift+PrintScreen, which saves a PNG to my screenshots folder. I couldn’t call it as a tool, but I could fake the keypress with ydotool, wait a beat, and read the file it dropped. Ugly, but it’s the most reliable way I’ve found to get a real picture of a Wayland screen from a script. 4K screenshots are huge, both in pixels and in megabytes, so I downscale them before handing them to the model for the “where is everything” overview pass, and only crop a region at full resolution when I need pixel precision on one spot.
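The keypress-then-read-the-file trick and the overview-then-crop arithmetic can be sketched as below. This is an illustration, not the script’s actual code: the function names are hypothetical, the trigger is passed in as a callable (in practice it would fake the screenshot keypress), and the downscale factor is whatever was used to produce the overview image.

```python
import time
from pathlib import Path


def wait_for_new_png(folder, trigger, timeout=5.0, poll=0.1):
    """Call trigger(), then return the newest PNG that appeared in folder afterwards."""
    folder = Path(folder)
    before = set(folder.glob("*.png"))
    trigger()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new = set(folder.glob("*.png")) - before
        if new:
            return max(new, key=lambda p: p.stat().st_mtime)
        time.sleep(poll)
    raise TimeoutError(f"no new screenshot appeared in {folder}")


def overview_to_full(x, y, factor):
    """Map a point found on the downscaled overview back to full-resolution pixels."""
    return round(x * factor), round(y * factor)


def crop_box(cx, cy, size, width, height):
    """A size-by-size box around (cx, cy), shifted to stay inside the image."""
    half = size // 2
    left = min(max(cx - half, 0), max(width - size, 0))
    top = min(max(cy - half, 0), max(height - size, 0))
    return left, top, min(left + size, width), min(top + size, height)
```

The sketch returns as soon as the file exists, which is not the same as the file being fully written, so a short pause before reading it is still sensible.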

The loop worked, for large targets, as long as I didn’t mind it missing small ones and lying about it.

The actual fix: ask the compositor

I’d been faking input underneath the display server and then fighting its coordinate system. GNOME’s compositor, Mutter, already has a sanctioned way to let a program control the desktop: the same machinery that screen-sharing and remote-desktop tools use, exposed through xdg-desktop-portal. Two portal interfaces matter: RemoteDesktop, which injects pointer and keyboard events, and ScreenCast, which gives a video stream of the screen.

RemoteDesktop offers NotifyPointerMotionAbsolute: move the pointer to an absolute position, expressed in the same logical pixels that applications use. No relative faking, no corner-slamming, no acceleration math, no 4K-versus-logical mismatch. You hand it the coordinate an app would understand and the pointer goes there. The first time I clicked a search box at a coordinate I’d read off a screenshot, it landed dead on, after weeks of near-misses.
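For reference, a hedged sketch of what that one call looks like over D-Bus. It assumes PyGObject (Gio) as the D-Bus binding, which is a choice made for this example, and it assumes a session that has already been created and started on the same connection. The session handle and stream id are placeholders supplied by the caller.

```python
PORTAL_BUS = "org.freedesktop.portal.Desktop"
PORTAL_PATH = "/org/freedesktop/portal/desktop"
REMOTE_DESKTOP = "org.freedesktop.portal.RemoteDesktop"


def motion_absolute_args(session_handle, stream_id, x, y):
    """Arguments for NotifyPointerMotionAbsolute, D-Bus signature (oa{sv}udd)."""
    # Session object path, empty options, stream node id, then x and y as
    # doubles in the stream's logical coordinate space.
    return (session_handle, {}, int(stream_id), float(x), float(y))


def move_pointer(session_handle, stream_id, x, y):
    """Send the call on the session bus. Needs PyGObject and a started session."""
    from gi.repository import Gio, GLib  # imported lazily: only needed for a real call

    bus = Gio.bus_get_sync(Gio.BusType.SESSION, None)
    bus.call_sync(
        PORTAL_BUS, PORTAL_PATH, REMOTE_DESKTOP,
        "NotifyPointerMotionAbsolute",
        GLib.Variant("(oa{sv}udd)", motion_absolute_args(session_handle, stream_id, x, y)),
        None, Gio.DBusCallFlags.NONE, -1, None,
    )
```

The point of the sketch is how little there is to it: one coordinate pair, no corner, no acceleration profile, no scale factor.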

Driving the portal directly has a few sharp edges, and the documentation is thin.

The portal session is process-scoped. GNOME hands you a session handle, and that handle is only valid inside the process that created it. You can’t open a session in one script, save the handle, and reuse it from a separate invocation: the moment the process exits, the session is gone. So every action has to happen inside a single run: open the session, move, click, type, all in one go. The tool parses a whole sequence of actions off the command line and replays them inside one session.
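A minimal sketch of the “parse a whole sequence, replay it in one session” idea. The grammar here is an assumption made for illustration (in particular the sleep verb and the way actions are simply concatenated on the command line); the real tool’s syntax may differ.

```python
def parse_actions(argv):
    """Turn ['move', '960', '540', 'sleep', '0.2', 'click'] into a list of steps."""
    arity = {"move": 2, "sleep": 1, "type": 1, "key": 1}
    steps, i = [], 0
    while i < len(argv):
        verb = argv[i]
        i += 1
        if verb == "click":
            # The button is optional and defaults to left.
            button = "left"
            if i < len(argv) and argv[i] in ("left", "right"):
                button = argv[i]
                i += 1
            steps.append(("click", button))
        elif verb in arity:
            n = arity[verb]
            args = argv[i:i + n]
            if len(args) < n:
                raise ValueError(f"{verb} needs {n} argument(s)")
            i += n
            if verb == "move":
                steps.append(("move", int(args[0]), int(args[1])))
            elif verb == "sleep":
                steps.append(("sleep", float(args[0])))
            else:
                steps.append((verb, args[0]))
        else:
            raise ValueError(f"unknown action: {verb}")
    return steps
```

Parsing everything up front means a typo in the fifth action fails before the session opens, rather than after four real clicks have already happened.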

The consent dialog: the first time anything asks to remote-control the desktop, GNOME pops up a dialog with an “allow remote interaction” toggle and a Share button. Good friction for security, bad friction on every run. The portal’s answer is a restore token, a credential the compositor hands back after you consent, which you present next time to skip the dialog. The token is single-use, so each run gets a fresh one back. First run, I approve it by hand once; the tool stashes the restore token in a state file under ~/.local/share/; every run after that presents the stored token, saves its replacement, and starts silently.
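The token handling can be sketched roughly like this. The state file location is passed in by the caller, and the JSON layout is a hypothetical one chosen for the example. The option names (types, persist_mode, restore_token) are the ones the RemoteDesktop portal documents for SelectDevices; the portal also documents the token as single-use, so the sketch assumes the caller saves whatever token comes back from each run.

```python
import json
import os
from pathlib import Path


def load_token(path):
    """Return the saved restore token, or None on a first run or unreadable state."""
    try:
        return json.loads(Path(path).read_text())["restore_token"]
    except (OSError, ValueError, KeyError, TypeError):
        return None


def save_token(path, token):
    """Write the token atomically, readable only by the owner."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"restore_token": token}, f)
    os.replace(tmp, path)


def select_devices_options(token):
    """Plain-Python view of the options; a real call wraps each value in a D-Bus variant."""
    # types: 1 = keyboard, 2 = pointer. persist_mode 2 = keep until explicitly revoked.
    options = {"types": 1 | 2, "persist_mode": 2}
    if token:
        options["restore_token"] = token
    return options
```

If the stored token is missing or rejected, the consent dialog simply appears again, so a corrupt state file costs one click rather than a broken tool.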

The ScreenCast half: the absolute-motion call needs to know which screen it’s pointing at, and you get that identifier from setting up the screen-cast stream. Even though I mostly care about input, not video, the tool sets up both.
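As a sketch of where that identifier comes from: once the session starts, the portal’s response carries a streams list, each entry a node id plus a dictionary of properties. The helper below assumes the response has already been unpacked into plain Python values; the function name and the sample numbers in the usage note are illustrative only.

```python
def pick_stream(start_results, index=0):
    """Return (node_id, (width, height)) for one stream in a Start response."""
    streams = start_results.get("streams") or []
    if index >= len(streams):
        raise LookupError("the session has no screen-cast stream to point at")
    node_id, props = streams[index]
    # "size" is the stream's size in the compositor's coordinate space.
    return node_id, tuple(props.get("size", (0, 0)))
```

Given a response such as {"devices": 3, "streams": [(57, {"size": (1920, 1080)})]}, this returns (57, (1920, 1080)); the 57 is what the absolute-motion call takes as its stream argument.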

What it landed as

All of that wrapped into a small Python script I keep at ~/.local/bin/screen-control. It speaks D-Bus to the portal, handles the session and the restore token, and exposes a handful of verbs:

screen-control move <x> <y>        # absolute logical pixels
screen-control click [left|right]  # click at the current position
screen-control type <text>         # type via keysyms
screen-control key <keysym>        # press a single key
screen-control status              # is there a live session?

Because the session is process-scoped, you chain a sequence into one invocation (move, a short sleep to let the UI settle, click) and the tool runs the lot inside a single portal session. A click sends the Linux input event code for the left mouse button (BTN_LEFT, which is 272) through the RemoteDesktop interface, once as a press and once as a release.
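Spelled out, a click is two button events with the same code. A small sketch, with hypothetical helper names; each pair would be handed to the portal’s NotifyPointerButton method as its button and state arguments.

```python
BTN_LEFT = 0x110   # 272 in <linux/input-event-codes.h>
BTN_RIGHT = 0x111  # 273
RELEASED, PRESSED = 0, 1


def click_events(button="left"):
    """Expand one click into the (button code, state) pairs to send, in order."""
    code = {"left": BTN_LEFT, "right": BTN_RIGHT}[button]
    return [(code, PRESSED), (code, RELEASED)]
```

Keeping press and release as separate events also leaves room for a drag later: press, a few absolute moves, then release.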

This sits at the bottom of a three-tier escalation. Most browser work goes to Selenium with my profile copy. Some goes to Playwright. When neither can reach it (a native app, or a case where only my real Firefox will do) I drop to screen-control and drive the real screen, the same pixels I’d be clicking myself.

Where it earned its keep

The clearest payoff was a fight with my UniFi gateway.

I was adding a source-NAT rule to route outbound mail traffic on specific ports out through a particular WAN IP, the kind of thing that should be a quick API call. UniFi’s newer config does have an API endpoint for NAT rules, so I had the agent poke at it. Every minimal request came back with a flat HTTP 500: a bare internal server error, a null-pointer exception somewhere inside the gateway, with no documentation telling me the full set of required fields. I couldn’t reverse-engineer the schema from the errors because the errors told me nothing.

Create-in-the-UI, read-back-from-the-API broke the deadlock. Get the rule created once through the web UI, then ask the API to hand it back, and read off the exact, complete JSON it expected, every field properly filled in. From there, scripting future rules was trivial.

screen-control drove the UniFi web UI in my real browser: Settings, the policy table, create a new NAT rule, pick source-NAT, type in the ports and the translated address, save. Clicking through the panel by absolute pixel, the way I would have by hand. Then a single API GET read the freshly-created rule back, and there was the schema I’d been unable to guess. I had the rule working and verified end to end shortly after, confirmed by watching the source IP on the receiving mail server.
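Reading the schema off the returned rule is mechanical once the JSON exists. A minimal sketch of that step; the sample object is entirely made up, and its field names are not UniFi’s.

```python
import json


def describe(value):
    """Summarise a JSON value as type names, recursing into objects and lists."""
    if isinstance(value, dict):
        return {key: describe(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        return [describe(value[0])] if value else []
    if value is None:
        return "null"
    return type(value).__name__


# Hypothetical rule for illustration only.
sample = json.loads('{"enabled": true, "ports": [25, 587], "translated_ip": "203.0.113.7", "note": null}')
schema = describe(sample)
```

Run against the real rule, the result lists every field the API filled in and its type, which is the list the error responses never gave up.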

The API was a dead end, the UI was the only source of truth, and the only way to get an agent through that UI unattended was to let it drive the actual screen. A purely-in-the-browser tool would have been stuck at the HTTP 500.

Most days the browser libraries do the job and screen-control sits unused. But having a way to drive the real screen, when nothing else can reach what’s in front of it, was worth the Wayland yak-shaving it took to build.