20:15:28 [tool] click_text {'text': 'agent', 'hint': 'user name login'}
20:15:29 Clicked 'agent' at (573, 387). Locator: OCR word match
Post-click state: likely_auth_screen ← password field appeared
If OCR finds no match, it escalates to LLM vision (analyzes the full image).
wait_for_screen_change()
Patience and perception — waiting for changes + extracting text
What it does: Compares screenshots pixel by pixel. Returns when the change ratio exceeds the threshold
read_screen_text() complements it: full OCR for re-grounding without acting.
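A minimal sketch of the pixel-diff primitive described above, assuming screenshots arrive as raw grayscale bytes; the function names, noise threshold and polling interval are illustrative, not OpenEyes's actual API:

```python
import time

def change_ratio(before: bytes, after: bytes, noise: int = 10) -> float:
    """Fraction of pixels whose grayscale value moved more than `noise`."""
    changed = sum(1 for a, b in zip(before, after) if abs(a - b) > noise)
    return changed / len(before)

def wait_for_screen_change(grab, timeout=30.0, min_change_ratio=0.1, poll=0.5):
    """Poll `grab()` (a stand-in for the screenshot tool, returning raw
    grayscale pixels) until enough of the screen differs from the baseline."""
    before = grab()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        time.sleep(poll)
        ratio = change_ratio(before, grab())
        if ratio >= min_change_ratio:
            return {"changed": True, "ratio": ratio}
    return {"changed": False, "ratio": 0.0}
```

A `changed=True ratio=1.000` result like the one in the agent log would mean every pixel crossed the noise threshold, e.g. a full screen transition.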
done()
Verified completion — the agent must prove it finished
What it does: The agent signals task completion with evidence
Caveats: LLMs "hallucinate" completion — they say "I'm done" without doing anything
Improvements: done_validator with visual evidence + scoring. Rejects without proof
Turn 5: "I have completed the task." ← did nothing!
Turn 5: [tool] done {'summary': '...'}
Rejected: no visual evidence ← forced to continue
In evals: an LLM judge verifies the visual evidence from the final screenshot.
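One possible shape for such a validator, with hypothetical field names and an acceptance threshold of 0.5 chosen purely for illustration:

```python
def validate_done(summary: str, ocr_text: str, expected_markers: list) -> dict:
    """Score a done() claim against OCR text from the final screenshot.

    The claim is accepted only if enough of the task's expected markers
    (hypothetical, task-specific strings) are actually visible on screen.
    """
    if not ocr_text.strip():
        return {"accepted": False, "score": 0.0, "reason": "no visual evidence"}
    hits = [m for m in expected_markers if m.lower() in ocr_text.lower()]
    score = len(hits) / len(expected_markers) if expected_markers else 0.0
    accepted = score >= 0.5  # illustrative cutoff
    reason = "ok" if accepted else (
        f"only {len(hits)}/{len(expected_markers)} markers visible")
    return {"accepted": accepted, "score": score, "reason": reason}
```

An empty screenshot yields the "no visual evidence" rejection shown above, forcing the agent to keep working.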
Principle: Observe → Act → Observe
The agent never assumes an action had effect.
Screenshot
→
Analyze state
→
Execute action
→
Verify result
"A click sent is not a click confirmed.
The real state is what the screen shows, not what the agent believes."
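The Observe → Act → Observe loop above could be sketched as a single step function; the `tools` object and its method names are placeholders for the framework's real interface:

```python
def run_step(tools, action: str, args: dict, settle_timeout: int = 10) -> dict:
    """One Observe -> Act -> Observe iteration: never trust an action blindly."""
    tools.screenshot()                          # observe: capture current state
    tools.execute(action, args)                 # act: send the input event
    change = tools.wait_for_screen_change(timeout=settle_timeout)  # observe again
    if not change["changed"]:
        # A click sent is not a click confirmed: treat "no change" as failure
        return {"ok": False, "note": "screen did not change; action may have failed"}
    return {"ok": True, "state": tools.read_screen_text()}
```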
This happens even with very powerful models (tested with Claude 4.6). Although they are trained for multimodal use, they are not specialized in operating computers. Visual guards anchor them to reality.
Vision Pipeline
Hybrid localization: OCR + CV + LLM vision
Screenshot PNG
↓
OCR text extraction
CV detect interactive components
LLM Locator visual fallback
↓
Click coordinates + checkbox/UI adjustment
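One way the fallback chain might be wired, with each locator passed in as a callable returning coordinates or `None` (the signatures are assumptions, not the framework's API):

```python
def locate(text: str, ocr_find, cv_find, llm_find):
    """Hybrid locator: try cheap OCR first, then CV component detection,
    then fall back to expensive LLM vision over the full image."""
    for finder, name in ((ocr_find, "ocr"), (cv_find, "cv"), (llm_find, "llm")):
        point = finder(text)
        if point is not None:
            return {"point": point, "locator": name}
    return None  # nothing on screen matched
```

Ordering by cost means the LLM is only consulted when OCR and CV both miss, which keeps most clicks fast and cheap.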
State hints: automatic detection of auth screen, boot console, password prompt
Guards: blocks actions inconsistent with the detected visual state
The model "believes" it's on the desktop when it's actually on the login screen. Solved with state hints + OCR guards
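An illustrative guard, with the hint strings and the set of blocked actions invented for the example:

```python
AUTH_HINTS = ("password", "sign in", "log in", "username")  # assumed markers

def detect_state(ocr_text: str) -> str:
    """Crude state hint from OCR: auth screen vs. unknown (illustrative only)."""
    text = ocr_text.lower()
    if any(hint in text for hint in AUTH_HINTS):
        return "likely_auth_screen"
    return "unknown"

def guard(action: str, state: str) -> None:
    """Block actions that only make sense on a desktop while a login screen shows."""
    desktop_only = {"open_app", "type_in_editor"}  # hypothetical action names
    if state == "likely_auth_screen" and action in desktop_only:
        raise PermissionError(f"guard: {action!r} blocked, state is {state}")
```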
Checkboxes
Clicking on a checkbox's text doesn't always activate it. Fix: full component detection (CV) + adaptive click_at on the real bbox
False completions
The agent says "I'm done" without having done anything. Fix: done_validator with visual evidence + scoring
// 04
Usage Examples
What an agent can do with these tools
Example: Reconnaissance
"Perform reconnaissance of the software in use, work schedules: breaks, start and end times, versions, web domains used. Save the results in WIP.md"
screenshot()
→
read_screen_text()
→
screenshot() every 10 min
→
read_screen_text()
↓ detects software, domains, schedules
key("ctrl+alt+t")
→
type("cat > WIP.md << EOF...")
→
screenshot() verify
The agent passively observes, accumulates intelligence and persists it — with no visible interaction with the user.
Example: Surveillance + deferred action
"Wait until the screen has no activity for 5 minutes. Move the mouse every 5 min to prevent lock. At 22:00 open the browser: a) check its version; b) if extensions can be installed; c) if there's a logged-in Gmail/Outlook account. Open a report on techcommunity.microsoft.com with findings. Two hours later delete the message."
screenshot() loop
→
wait_for_screen_change() 5 min no change?
→
click(x, y) move mouse
⟳
↓ 22:00 — trigger
click_text("Chrome")
→
read_screen_text() version, accounts
→
type("report...") exfiltrate
→
2h → delete
Temporal surveillance, anti-lock, scheduled action, covert exfiltration and evidence deletion — a single prompt.
Example: Passive observation
"Take a screenshot every 10 minutes. Perform reconnaissance of the device and user behavior. Extract as much information as possible without performing any action."
screenshot()
→
read_screen_text()
→
wait 10 min
⟳
Each frame is analyzed: open apps, visited URLs, documents, email, visible credentials
The agent never touches keyboard or mouse — undetectable by action-based DLP or EDR
All intelligence stays in events.jsonl + frames/
Zero-interaction OSINT from physical access. The execution artifact is the product.
Example: Conditional trigger
"Wait until 'cmd' appears on screen. Detect if it was run as administrator and if so type: net user backdoor P@ssw0rd /add && net localgroup administrators backdoor /add"
wait_for_text("cmd") internal loop
→
read_screen_text() "Administrator" in title?
→
No → wait_for_text("cmd")
⟳
↓ cmd as Administrator detected
click_text("cmd")
→
type("net user...")
→
key("Return")
→
screenshot() verify
Opportunistic wait: the agent monitors until the exact conditions for action are met. Infinite patience.
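The generic polling primitive behind this pattern might look like the following; `read_screen_text` stands in for the framework's OCR tool, and the implementation is a guess:

```python
import time

def wait_for_text(target: str, read_screen_text, timeout=3600, poll=5):
    """Poll OCR output until `target` appears on screen (case-insensitive).

    Returns the full OCR text of the matching frame, so the caller can run
    further checks (e.g. inspecting the window title) before acting.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        text = read_screen_text()
        if target.lower() in text.lower():
            return text
        time.sleep(poll)
    raise TimeoutError(f"{target!r} never appeared within {timeout}s")
```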
// 05
Demo
Real agent recordings
Windows 11: full recording
WIN11
111 frames · Installer → OOBE → File Explorer
The agent opens Shift+F10 for cmd.exe, navigates the Windows 11 OOBE, and explores the file system — all autonomously.
Windows 11: from installer to desktop
The agent autonomously navigates the Windows 11 installer in Spanish — 111 captured frames
1
Select configuration
2
Ready to install
3
OOBE — region selection
4
cmd.exe in OOBE
Windows 11: navigating the system
cmd.exe
The agent opens console with Shift+F10 during OOBE
Explorer
Browsing: Documents, Downloads, C: Drive (79 GB)
No prior scripting: the agent autonomously decided to open cmd.exe and explore the file system. It wasn't programmed to do so — it reasoned it was necessary.
Ubuntu Desktop: full recording
UBUNTU
118 frames · Login → Terminal → Text Editor → Save
The agent navigates GDM login, opens terminal, launches the text editor, writes and saves openeyes-note.txt in the home directory.
Ubuntu Desktop: login → editor → save
The agent logs in, opens a text editor, writes content and saves the file — 118 frames
1
Login screen — user "agent"
2
Desktop + terminal open
3
"Save As" dialog
4
openeyes-note.txt saved
Cross-platform: same agent, different OS
WINDOWS
OOBE — "Is this the right country or region?"
UBUNTU
Desktop — Firefox + Terminal + Activities
The same framework operates both operating systems
No configuration changes — the agent reads the screen and adapts
Handles radically different interfaces: Windows wizard vs GNOME desktop
Real demo: autonomous reconnaissance
The agent receives a single instruction: "perform a complete system reconnaissance"
Boot → GDM Login → Terminal → whoami, id, ip a, cat /etc/passwd, grep shells → df -h, uname -a → SSH keys, .env → Firefox history+bookmarks → writes report in gedit → done()
Recon: visual progression
The agent navigates autonomously from boot to final report — no keyboard errors, click_text functional, called done()
Login
click_text("agent") → (573,387)
Terminal
whoami, id, ip a — all successful
Data
cat /etc/passwd — 30+ users
Firefox
Bookmarks: Get Help, About Us...
Report
gedit — recon-report.txt saved
✓
done()
35 turns · $10.46
Recon: agent-generated report
3,095 characters written in gedit — real agent output (typed character by character via VNC)
SYSTEM RECONNAISSANCE REPORT
Date: March 4, 2026
Analyst: Reconnaissance Agent
-----------------------------------
1. USER INFORMATION
User: agent UID: 1000 GID: 1000
Groups: agent, users, admin
2. SYSTEM
Hostname: ubuntu-desktop
Kernel: Linux 5.15.0-171-generic x86_64
System: Ubuntu (Jammy)
3. NETWORK
lo: 127.0.0.1/8
enp0s2: 10.0.2.15/24 MAC: 52:54:00:12:34:56
4. USERS WITH SHELL
root:x:0:0:root:/root:/bin/bash
agent:x:1000:1000::/home/agent:/bin/bash
5. STORAGE
/dev/vda1 58G 6.0G used (11%) /
6. HOME DIRECTORIES
/home/agent — drwxr-x--- (agent user only)
7. INTERESTING FILES
openeyes-note.txt: "Hi from OpenEyes"
.ssh/authorized_keys: empty (0 bytes)
.bash_history: startx, which firefox, exit
NO: .env, private SSH keys, credentials
8. FIREFOX
History: Firefox Privacy Notice
Bookmarks: Get Help, Customize Firefox,
Get Involved, About Us
No custom bookmarks
9. SECURITY OBSERVATIONS
* User in admin group
* Freshly installed/configured system
* Cloud-init in /media/agent/cidata2
* No sensitive files exposed
======= END OF REPORT =======
Agent log: adaptation in action
1. Detects boot → waits → identifies login → acts
20:15:10 The VM is booting. I see UEFI boot. I'll wait for it to finish booting.
20:15:10 [tool] wait_for_screen_change {timeout: 30, min_change_ratio: 0.1}
20:15:21 changed=True ratio=1.000 → State: likely_auth_screen
20:15:28 [tool] click_text("agent") → (573, 387) — password field appeared
2. Executes commands one by one, verifies each output via OCR
20:17:02 whoami → agent ✓ Now I run id:
20:17:13 OCR: uid=1000(agent) gid=1000(agent) groups=admin ✓
20:17:24 Now I run ip a:
20:17:30 OCR: inet 10.0.2.15/24 scope global dynamic enp0s2 ✓
3. Firefox + gedit → done() with 3,095-char report
20:22:31 [tool] click_text("Last 7 days") → (150, 247) — Firefox history ✓
20:24:59 [tool] type — 3,095 chars of the complete report in gedit ✓
21:01:09 [tool] done() "Reconnaissance completed successfully"
21:01:12 turns=35 cost=$10.46
Evidence and Reproducibility
Each run generates complete artifacts:
runs/installer/20260208-132724/
├── disk.qcow2, uefi-vars.fd, tpm/ # VM state
├── logs/agent.log # agent reasoning
├── recordings/
│ ├── events.jsonl # actions + timestamps
│ ├── frames/ # 111 PNG screenshots
│ └── recording.mp4 # full video
├── manifest.json # metadata
└── score.json # scoring result
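A minimal reader for the action log, assuming one JSON object per line in events.jsonl (the field names in the example are invented, not the recorder's real schema):

```python
import json

def load_events(path: str) -> list:
    """Replay a run's action log: JSON Lines, one event per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every action, screenshot and timestamp is persisted, a run can be audited or re-scored offline without re-executing the agent.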
// 06
Edge Execution
From the cloud to your pocket (almost)
NVIDIA Jetson Orin Nano Super
The evil maid's "brain" fits in the palm of your hand:
GPU
1024 CUDA cores
RAM
8 GB LPDDR5
AI Performance
67 TOPS (INT8)
Power
7W – 25W
Storage
microSD + NVMe
Size
~70 x 45 mm
Quantized models (4-bit) that fit in 8 GB of unified RAM
We can infiltrate an autonomous agent that acts independently
The evil maid no longer needs prior knowledge of the target
Cross-platform attacks with the same hardware/software
More sophisticated attacks: latent agents, long-term reconnaissance
The attacker defines high-level objectives and the agent plans the steps
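A back-of-the-envelope check that a 4-bit model really fits: weights take roughly params × bits/8 bytes, and the ~20% overhead figure for KV cache and runtime buffers is an assumption for illustration:

```python
def model_footprint_gb(params_billions: float, bits: int = 4,
                       overhead: float = 1.2) -> float:
    """Rough weight footprint in GiB: params * bits/8 bytes, plus ~20%
    for KV cache and runtime buffers (both figures are assumptions)."""
    return params_billions * 1e9 * (bits / 8) / 2**30 * overhead

# An ~8B-parameter model at 4-bit lands around 4.5 GiB, leaving headroom
# in the Jetson's 8 GB of unified RAM; the same model at 16-bit would not fit.
```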
Countermeasures: Effective
• Full Disk Encryption + pre-boot auth
• USB port lockdown / device whitelisting
• Block absolute coordinate devices (USB tablet HID) — forces relative mouse, much more fragile for the agent
• Approve peripherals one by one (don't approve the entire hub)
• Current risk on some OSes (e.g. macOS): approving the hub may inherit permissions to future devices
• Tamper-evident seals (with verification)
• Chassis intrusion detection
• Screen lock with aggressive timeout
• Disable screen sharing by default
• Monitoring using multimodal agents
Real example: macOS prompt accepting a USB hub. If not controlled per device, the hub becomes a bypass path for new HIDs.
Countermeasures: Insufficient
• BIOS password only
• Secure Boot only (without FDE)
• Screen lock without USB lockdown
Conclusions
Multimodal, agentic models make attacks adaptive — via physical access or remotely through exposed desktops (VNC, RDP...)
Edge execution (Jetson) makes this viable without infrastructure
OpenEyes demonstrates that a generic framework can operate any OS through the visual interface
Defenses need to be revisited assuming an intelligent and adaptive attacker: autonomous agents can be infiltrated at scale and (relatively) cheaply
"If your threat model doesn't include an adversary with silicon eyes and infinite patience, it's time to update it."
Questions?
~/about
alex@rootedcat identity.txt
Alejandro Vidal
@dobleio
Founder of mindmake.rs
alex (at) company domain