TLDR
Computer-use agents let AI operate real user interfaces by perceiving what’s on screen and taking OS-level actions, useful when selectors are missing or unstable.
With the AskUI Python SDK, you can run intent-based instructions with agent.act() / agent.get() and attach Tool Store tools (e.g., screenshots, file tools) to make runs easier to debug and reuse.
If you’re automating a web-only app with stable DOM selectors, Playwright may be simpler. AskUI fits when you need cross-app, OS-level automation beyond the DOM.
Note on naming
This guide reflects the AskUI Python SDK naming introduced in the v0.23.1 release. In code, you typically create a VisionAgent.
Introduction
In the agent era, automation is shifting from brittle selector scripts to agents that can execute intent directly on interfaces users actually operate. A computer-use agent perceives what’s on screen and takes OS-level actions such as click, type, and navigate, making it practical when DOM-based automation is fragile or not available.
What you’ll build in this guide
- Run a first intent-based agent with VisionAgent
- Save a screenshot artifact for debugging
- Parameterize a run with input.txt → write results to output/result.txt
If you want a deeper architecture explanation, read: Understanding AskUI: The Eyes and Hands of AI Agents
Computer Use Agent vs Selector Based Automation
Traditional automation relies on DOM selectors or object identifiers. Computer use agents rely on visible cues such as text, layout, icons, and images, which makes them useful when selectors are missing or brittle.
| Feature | Computer use agents | Selector based tools |
|---|---|---|
| Element targeting | Visual cues such as text, layout, icons, and images | DOM selectors such as id, class, XPath, and CSS |
| Most likely to break when | UI changes visually in meaningful ways | DOM structure or selectors change |
| App coverage | Any UI you can see on screen | Mostly web apps with accessible DOM |
| Maintenance | Lower selector maintenance, more resilient to refactors | Ongoing selector maintenance and brittleness over time |
| Best fit | Desktop apps, virtualized environments, kiosks, and custom-rendered UIs | Modern web apps with stable selectors |
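To make the difference concrete, here is a minimal sketch of the same click expressed both ways. The URL, the #submit-button selector, and the button label are illustrative, not taken from a real app:

```python
# Selector-based (Playwright): targets the DOM; breaks if the id changes,
# even when the button still looks the same on screen.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # illustrative URL
    page.click("#submit-button")      # hypothetical selector
    browser.close()
```

```python
# Intent-based (AskUI): targets what is visible on screen, independent of the DOM.
from askui import VisionAgent

with VisionAgent() as agent:
    agent.act("Open a browser, go to example.com, and click the Submit button.")
```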
Key Applications and Use Cases
Computer use agents are especially useful when:
- You need to automate beyond stable DOM selectors (desktop apps, virtualized environments, custom-rendered UIs)
- The UI is canvas based or custom rendered, where selectors are brittle or missing
- You want end-to-end workflows that reflect real user behavior and catch UI regressions
They also work well for:
- Cross application workflows
- Document assisted processes where saving screenshots improves debugging and auditability
Getting Started: Build Your First Agent with Python
Prerequisites
- Python 3.10 or higher
- VS Code or any Python IDE
- Windows, macOS, or Linux
Step 1: Installation
```bash
pip install "askui[all]"
```
For the latest installation notes and platform-specific extras, see the docs →
Step 2: Sign up with AskUI
To run the examples, you’ll need an AskUI workspace and access token.
- Sign up at hub.askui.com.
- Copy your Workspace ID and Access Token from the Hub.
Step 3: Configure environment variables
macOS / Linux:
```bash
export ASKUI_WORKSPACE_ID="<your-workspace-id>"
export ASKUI_TOKEN="<your-access-token>"
```
Windows PowerShell:
```powershell
$env:ASKUI_WORKSPACE_ID="<your-workspace-id>"
$env:ASKUI_TOKEN="<your-access-token>"
```
Optional (Anthropic models):
```bash
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"
```
Step 4: Verify your setup with a first script
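Before writing the first script, you can optionally confirm the variables are visible to Python with a quick sanity check:

```python
import os

# Fail fast if either credential is missing from the environment.
for var in ("ASKUI_WORKSPACE_ID", "ASKUI_TOKEN"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")
print("AskUI credentials found in the environment.")
```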
Create a file named agent_demo.py:
```python
from askui import VisionAgent

def main():
    with VisionAgent() as agent:
        agent.act(
            "Open a browser, go to wikipedia.org, open the English Wikipedia main page, "
            "and find the 'On this day' section."
        )
        text = agent.get("Read the first bullet point under 'On this day' and return it as plain text.")
        print(f"\n📌 On this day: {text}\n")

if __name__ == "__main__":
    main()
```
Run it:
```bash
python agent_demo.py
```
If you see logs in the terminal and a line like 📌 On this day: ..., you're ready: your agent is operating the interface and extracting information from the screen.
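Beyond plain text, recent SDK versions also support structured extraction: agent.get() accepts a response_schema built on ResponseSchemaBase. The sketch below assumes that API (check your SDK version's docs) and assumes the Wikipedia page from the script above is still on screen:

```python
from askui import ResponseSchemaBase, VisionAgent

class OnThisDay(ResponseSchemaBase):
    year: int
    text: str

with VisionAgent() as agent:
    # Assumes the 'On this day' section is already visible on screen.
    entry = agent.get(
        "Read the first bullet point under 'On this day' and return its year and text.",
        response_schema=OnThisDay,
    )
    print(entry.year, entry.text)
```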
Step 5: Save artifacts (screenshots) for faster debugging
Screenshots are a useful debugging artifact because they show what the agent actually saw on the screen. With AskUI’s Tool Store, you can attach optional tools to your runs to capture artifacts like screenshots for easier debugging and repeatability.
5.1 Create the screenshots folder (first-time setup)
Create a local folder where screenshots can be written:
```bash
mkdir -p screenshots
```
5.2 Run the same flow + save a screenshot
Create a new file named agent_demo_with_artifacts.py:
```python
from askui import VisionAgent
from askui.tools.store.computer import ComputerSaveScreenshotTool
from askui.tools.store.universal import PrintToConsoleTool

def main():
    with VisionAgent() as agent:
        agent.act(
            "Open a browser, go to wikipedia.org, open the English Wikipedia main page.",
        )
        agent.act(
            "Now take a screenshot and save it into the screenshots folder (e.g., wiki.png). "
            "Also print 'screenshot saved'.",
            tools=[
                ComputerSaveScreenshotTool(base_dir="./screenshots"),
                PrintToConsoleTool(),
            ],
        )

if __name__ == "__main__":
    main()
```
Tip: For more reliable results, keep "do the task" and "save an artifact" as separate agent.act() calls.
Run it:
```bash
python agent_demo_with_artifacts.py
```
After it finishes, check:
```bash
ls -la screenshots
```
You should see at least one image saved in the ./screenshots folder.
Note: PrintToConsoleTool is optional. It’s nice if you want extra log messages, but screenshots do not require it.
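Screenshots are most valuable when something goes wrong. One simple pattern, sketched below using only the tools introduced above, is to capture the screen in an except branch before re-raising (this assumes the agent session is still usable after the failed act):

```python
from askui import VisionAgent
from askui.tools.store.computer import ComputerSaveScreenshotTool

def main():
    with VisionAgent() as agent:
        try:
            agent.act("Open a browser, go to wikipedia.org, open the English Wikipedia main page.")
        except Exception:
            # Capture what the agent saw at the moment of failure, then re-raise.
            agent.act(
                "Take a screenshot and save it into the screenshots folder as failure.png.",
                tools=[ComputerSaveScreenshotTool(base_dir="./screenshots")],
            )
            raise

if __name__ == "__main__":
    main()
```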
Step 6: Extend runs with Tool Store (files)
Once your first run works, the next step is making it repeatable: move inputs and outputs into files so you can rerun the same flow with different data and keep artifacts for debugging. Tool Store's universal file tools (ReadFromFileTool / WriteToFileTool) let you read inputs from disk and persist outputs back to files.
6.1 Create an input file
Create input.txt:
echo "Artificial intelligence" > input.txt6.2 Read input → run the flow → write output Create a new file named agent_demo_with_files.py:
```python
from askui import VisionAgent
from askui.tools.store.universal import ReadFromFileTool, WriteToFileTool

def main():
    with VisionAgent() as agent:
        agent.act(
            "1) Use ReadFromFileTool to read 'input.txt'. "
            "2) Open a browser, go to wikipedia.org, search for that exact text, and open the first result. "
            "3) Read the first sentence of the article introduction. "
            "4) Save ONLY that first sentence into 'result.txt' using WriteToFileTool.",
            tools=[
                ReadFromFileTool(base_dir="."),
                WriteToFileTool(base_dir="./output"),
            ],
        )
        print("\n✅ Done. Check ./output/result.txt\n")

if __name__ == "__main__":
    main()
```
Run it:
```bash
python agent_demo_with_files.py
```
Check the output:
```bash
cat output/result.txt
```
At this point you have a reusable run:
- Change one line in input.txt
- Run the script again
- Get a new result in ./output/result.txt
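If you want to go one step further and batch several inputs, you can rewrite input.txt per topic and give each run its own output file. A sketch (the topic list and the result_*.txt naming are illustrative):

```python
from pathlib import Path
from askui import VisionAgent
from askui.tools.store.universal import ReadFromFileTool, WriteToFileTool

# Hypothetical topic list; each run rewrites input.txt and writes its own result file.
TOPICS = ["Artificial intelligence", "Machine learning"]

def run_topic(topic: str) -> None:
    Path("input.txt").write_text(topic)
    out_name = f"result_{topic.replace(' ', '_')}.txt"
    with VisionAgent() as agent:
        agent.act(
            "1) Use ReadFromFileTool to read 'input.txt'. "
            "2) Open a browser, go to wikipedia.org, search for that exact text, and open the first result. "
            "3) Read the first sentence of the article introduction. "
            f"4) Save ONLY that first sentence into '{out_name}' using WriteToFileTool.",
            tools=[
                ReadFromFileTool(base_dir="."),
                WriteToFileTool(base_dir="./output"),
            ],
        )

for topic in TOPICS:
    run_topic(topic)
```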
6.3 Load an image from disk (optional)
If your workflow needs a reference image (e.g., compare against a baseline screenshot), you can load images from disk with LoadImageTool for analysis or visual inspection.
```python
from askui import VisionAgent
from askui.tools.store.universal import LoadImageTool

def main():
    with VisionAgent() as agent:
        agent.act(
            "Describe the logo image called './images/logo.png'.",
            tools=[LoadImageTool(base_dir="./images")],
        )

if __name__ == "__main__":
    main()
```
Step 7: Where to go next
At this point you have a working baseline:
- Run intent-based automation with VisionAgent (agent.act() / agent.get())
- Capture screenshots as debugging artifacts
- Parameterize runs with input.txt → persist results to output/result.txt
Next, you can level this up by:
- Saving screenshots around key actions to make debugging faster
- Splitting long instructions into smaller steps for stability (see the sketch below)
- Adding more Tool Store tools as your workflows grow
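For instance, the "smaller steps" advice can look like this: read the input in plain Python so each agent.act() call stays narrow and self-contained. A sketch; it relies on screen state persisting between acts within one session, as in Step 5:

```python
from pathlib import Path
from askui import VisionAgent
from askui.tools.store.universal import WriteToFileTool

term = Path("input.txt").read_text().strip()

with VisionAgent() as agent:
    # Act 1: navigation only.
    agent.act(f"Open a browser, go to wikipedia.org, search for '{term}', and open the first result.")
    # Act 2: extraction and persistence only. Relies on the page from act 1 still being on screen.
    agent.act(
        "Read the first sentence of the article introduction and save ONLY that sentence "
        "into 'result.txt' using WriteToFileTool.",
        tools=[WriteToFileTool(base_dir="./output")],
    )
```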
Demo project (optional)
If you want a complete end-to-end example (Tool Store, custom tools, CSV-driven steps, caching, and HTML reports), check out the AskUI Demo Project.
For more examples and platform-specific setup, see the AskUI documentation.
