How to Build Vision Agentic AI with Claude and AskUI

November 18, 2024

The range of computer tasks that can be automated keeps growing. One combination that extends it is AskUI's Vision Agent used together with Anthropic's Claude model. With this integration, you describe a task in natural language and the agent carries it out on your computer.

Getting Started with AskUI and Claude

Setting Up Your Environment

To begin building your vision agent with AskUI and Claude, you will first need to install the necessary software:

1. Install AskUI Agent OS: Start by downloading and running the AskUI installer for your operating system. It supports Windows, Linux, and macOS. Note that it currently does not work on Wayland, so if you are using Linux, switch to an Xorg session.

2. Install Vision Agent in Python: After installing AskUI Agent OS, install the Vision Agent into your Python environment with pip:

pip install askui

3. Authenticate with Anthropic: To connect to Anthropic's Claude, you need an API key from Anthropic. Store the key in an environment variable named ANTHROPIC_API_KEY.
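Before running an agent, it can help to verify the two setup conditions above that are easy to miss: the API key and, on Linux, the display server. The following is a minimal sketch, not part of AskUI. The function name `preflight_problems` is made up for this example, and the Wayland check assumes the desktop session sets the standard `XDG_SESSION_TYPE` variable, which not every environment does.

```python
import os
import sys


def preflight_problems(env=None, platform=None):
    """Return a list of human-readable setup problems (empty if none were found)."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    problems = []

    # Step 3: the Anthropic API key must be present and non-empty.
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        problems.append("ANTHROPIC_API_KEY is not set")

    # Step 1: on Linux, a Wayland session is not supported.
    # Assumption: the session type is exposed through XDG_SESSION_TYPE.
    if platform.startswith("linux"):
        if env.get("XDG_SESSION_TYPE", "").lower() == "wayland":
            problems.append("Wayland session detected; switch to Xorg")

    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print(f"Setup problem: {problem}")
```

An empty list only means that these two checks passed; it does not confirm that Agent OS itself is installed or running.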

Building Your Vision Agent

Once set up, you can start configuring your vision agent. The foundation of your automation comes from the following Python snippet:

from askui import VisionAgent

# Initialize your agent context manager
with VisionAgent() as agent:
    # Use the webbrowser tool to start browsing
    agent.tools.webbrowser.open_new("http://www.google.com")

    # Start to automate individual steps
    agent.click("url bar")
    agent.type("http://www.google.com")
    agent.keyboard("enter")

    # Extract information from the screen
    datetime = agent.get("What is the datetime at the top of the screen?")
    print(datetime)

    # Or let the agent work on its own
    agent.act("search for a flight from Berlin to Paris in January")

Exploring the Capabilities

This script demonstrates core features of the AskUI Vision Agent:

- Opening a Web Browser: It can launch a browser window and navigate to a specified webpage.

- Automating User Interactions: It mimics user interactions, such as clicking, typing, and keyboard inputs.

- Extracting Information: The agent is capable of querying and extracting data displayed on your screen, like the current date and time.

- Autonomous Actions: Beyond simple tasks, the agent can autonomously perform complex sequences, such as searching for flights.
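The scripted and autonomous styles listed above can be wrapped in small helper functions so they are reusable and can be dry-run without touching the screen. The sketch below uses only the agent methods shown in the script (`click`, `type`, `keyboard`, `get`, `act`). The helper names and the `RecordingAgent` class are hypothetical stand-ins invented for this example and are not part of AskUI.

```python
class RecordingAgent:
    """Hypothetical stand-in for a VisionAgent that only records calls (for dry runs)."""

    def __init__(self, answer=""):
        self.calls = []
        self.answer = answer

    def click(self, target):
        self.calls.append(("click", target))

    def type(self, text):
        self.calls.append(("type", text))

    def keyboard(self, key):
        self.calls.append(("keyboard", key))

    def get(self, question):
        self.calls.append(("get", question))
        return self.answer

    def act(self, goal):
        self.calls.append(("act", goal))


def open_and_read(agent, url, question):
    """Scripted steps: navigate to `url`, then ask `question` about the screen."""
    agent.click("url bar")
    agent.type(url)
    agent.keyboard("enter")
    return agent.get(question)


def search_flight(agent, origin, destination, month):
    """Autonomous step: hand a natural-language goal to the agent."""
    agent.act(f"search for a flight from {origin} to {destination} in {month}")


# Dry run against the stand-in; pass a real VisionAgent instance instead to automate.
dry_run = RecordingAgent(answer="(no screen in dry run)")
open_and_read(dry_run, "http://www.google.com", "What is the datetime at the top of the screen?")
search_flight(dry_run, "Berlin", "Paris", "January")
for call in dry_run.calls:
    print(call)
```

To run the same helpers for real, you would call them inside a `with VisionAgent() as agent:` block and pass that `agent` in place of the stand-in.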

Additional Features of AskUI Vision Agent

AskUI Vision Agent comes equipped with a range of additional functionalities:

- Logging: By setting the `log_level` to DEBUG, you gain comprehensive insights into your agent’s activities.

- Multi-Monitor Support: You can specify which display to target by adjusting the `display` parameter, which is particularly useful in multi-monitor setups.
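As a rough sketch, both settings might be passed when the agent is created. This assumes that `log_level` and `display` are keyword arguments of the `VisionAgent` constructor and that `log_level` accepts the standard `logging.DEBUG` constant; the article names the parameters but not where they go, so check the AskUI documentation for your version. The helper names `agent_options` and `run_on_display` are invented for this example.

```python
import logging


def agent_options(debug=False, display=None):
    """Build keyword arguments for VisionAgent (assumed to be constructor parameters)."""
    options = {}
    if debug:
        # Assumption: log_level takes a standard logging level.
        options["log_level"] = logging.DEBUG
    if display is not None:
        # Assumption: display is a number identifying the target monitor.
        options["display"] = display
    return options


def run_on_display(display, goal):
    """Run one autonomous goal on the chosen display with debug logging enabled."""
    # Imported lazily so this module still loads where askui is not installed.
    from askui import VisionAgent

    with VisionAgent(**agent_options(debug=True, display=display)) as agent:
        agent.act(goal)
```

Keeping the options in a plain dictionary makes it easy to switch logging and the target display from a config file or command-line flag without editing the automation steps.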

Benefits and Advancements

AskUI Vision Agent extends your automation capabilities by combining Agent OS with Anthropic's Claude Sonnet 3.5 v2. Used together with AskUI's Prompt-to-Action series, it offers benefits such as:

- Multi-Screen Support: Effective management across multiple displays.

- Compatibility: Seamless operation across major operating systems such as Windows, macOS, and Linux.

- Enhanced Visualization: Enables process visualizations to track automation progress.

- Full Unicode Support: Ensures accurate character representation.

- Future Enhancements: Upcoming features include application selection, in-background automation, and video streaming.

Conclusion

Combining AskUI Vision Agent with Claude gives you a practical framework for automating computer tasks. Because Claude understands natural-language instructions and the Vision Agent can click, type, and read the screen, you can describe multi-step automations in plain language. To try it, set up your AskUI Vision Agent and see how it fits into your own automation work.

Recommended read: Getting started: Vision Agents
