Screenshot Vision

Your agents can see what you see. Hit the camera icon to grab any region of your screen with the native macOS crosshair, or drag an image straight onto the Composer. It drops in as a thumbnail, rides along with your next message, and Claude Code reads it from disk and sees exactly what you saw: the broken layout, the design mockup, the wall of red in some other window.

How it works

The flow has two doors in, both ending at the same place: a path on disk that your agent can open.

Here is the friendly version. See something on your screen you want your helper to look at? You have two easy options:

Click the camera icon in the Composer. Your screen dims and a crosshair appears (this is built into your Mac). Drag a box around anything: a button that looks wrong, a chart, an error message. Let go, and a little picture of it pops into the Composer.
Or just drag an image file (or any file) from Finder right onto the Composer. Same result: a thumbnail tile shows up.

Then you type or say what you want ("why is this button off-center?") and hit send. Behind the scenes, Sparkle hands your agent the location of that image on your computer, and Claude Code (the AI that does the actual building) opens it and looks. The very first time you capture a screen region, your Mac will ask permission to record the screen. Say yes once and you are set.

Two inputs, one outcome. The camera icon fires the macOS crosshair so you can grab a region without leaving the app; a drag-and-drop from Finder takes images or any file. Both land as tiles in the Composer.

The leverage move: stop describing visual bugs in prose. Do not type three paragraphs trying to explain that the modal is 4px too low and the wrong shade of blue. Screenshot it, drop it, and say "fix this." Pair it with parallel agents and you can fan out a design review: one agent on the header, one on the footer, each with its own screenshot, each in its own lane. The picture carries the context your words would have fumbled.

Mechanics, since you will want them. The camera path shells out to /usr/sbin/screencapture -i -x: interactive crosshair, shutter sound silenced, PNG written to the OS temp dir. Esc is a clean cancel (screencapture exits 0 but writes no file, treated as a no-op, not an error). First invocation trips the macOS Screen Recording permission prompt; the grant persists in System Settings.

Drag-and-drop runs through load_attachment: images (png, jpg, jpeg, gif, webp, bmp) get a base64 data: URL for the inline thumbnail; anything else, plus images over 40MB, falls through to a plain file tile (still attachable, just no preview). HEIC is deliberately excluded since the WebView cannot render it in a data URL.

On send, the attachment paths are space-joined, shell-quoted, and prefixed to the message body, so the agent receives the literal path and the Claude Code CLI reads the image from disk natively. No upload, no copy into a sandbox: the file stays where it is on your machine.

Why it matters

Words run out fast when the problem is visual. Try describing the exact shade of a button, or where a chart is overlapping its label, and you will tie yourself in knots. A screenshot skips all of that. You show your helper the thing, the way you would lean over and point at a friend's screen. That is a huge deal when you are new: you do not need the right technical vocabulary for a bug, because the picture says it for you.

This is raw speed. The round-trip of "describe the visual problem in text, get a guess back, correct it, repeat" collapses into one capture. UI bugs, mockups you want built, an error in some other app, a competitor's layout you want to riff on: all of it becomes context in one drag. You ship the fix instead of narrating the symptom.

Vision input closes the gap between what is on screen and what the model knows. Rendering bugs, broken responsive breakpoints, a screenshot of a failing CI run in the browser, a Figma export: these are exactly the inputs that are painful to serialize into text and trivial to capture. Zero config, no MCP server, no clipboard juggling. It is the shortest path from "I can see the problem" to "the agent can too."

vs. the 1980s terminal

A terminal is text in, text out. There is no notion of an image. You cannot paste a picture into it, you cannot drop a file onto it and have it shown, and it certainly cannot grab a region of your screen. The closest you get is typing a file path and hoping the tool on the other end knows what to do with it.

The old black window only understands typing. If your problem is something you can see but struggle to describe, the terminal is no help at all. Sparkle sits on top of it and adds the thing it never had: eyes. You point, your agent looks.

The terminal never grew a way to show it anything. Sparkle bolts vision on as a layer above the same shell: the capture, the thumbnail, the drag target all live in the Composer, while the unchanged terminal underneath just receives a path. You get image input without giving up the real tooling beneath it.

The terminal is a text stream, full stop, and that is a feature for most of what it does. But a text stream cannot carry pixels. Sparkle does not fight that: it keeps the path-based interface the CLI already understands (Claude Code reads image paths natively) and adds the capture and preview UI the terminal was never going to have. The shell stays exactly as load-bearing as it was; you just gained a way to feed it what you see.

Screenshot Vision

How it works

Why it matters

vs. the 1980s terminal

Where to go next