The Missing Link: Device UI Controller Features Necessary for Vision AI Agents

October 31, 2024
Academy
Header illustration: a humanoid figure with a cybernetically enhanced head gazes at a large, hovering digital eye, symbolizing artificial intelligence.

With Claude's Computer Use feature being all the hype for the amazing things it can do on your desktop, we can safely say that AI will eventually reach everybody.

AI models can now understand images well enough to handle Visual Question Answering: detecting relations between objects! This is a big step forward, because humans solve this task effortlessly, yet Computer Vision had struggled with it since the 1960s.

With the rapid growth in computing power, namely GPUs, and the rise of Large Language Models (LLMs), AI models can now reason well enough to decide what needs to be done on a User Interface (UI) to achieve a goal.
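Conceptually, such a Vision Agent runs a simple perceive-reason-act loop: capture a screenshot, let a vision-capable LLM decide which UI action brings it closer to the goal, execute that action through a device controller, and repeat. Here is a minimal sketch of that loop; the `model` and `controller` objects are placeholders, not a specific library API:

```python
# Minimal perceive-reason-act loop of a Vision Agent (illustrative pseudocode;
# `model` and `controller` are placeholder objects, not a specific library API).
def run_agent(goal: str, model, controller, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        screenshot = controller.screenshot()                # perceive: capture the current UI
        action = model.decide(goal=goal, image=screenshot)  # reason: pick the next UI step
        if action.kind == "done":                           # the model decides the goal is reached
            return True
        if action.kind == "click":
            controller.click(action.x, action.y)            # act: real mouse movement and click
        elif action.kind == "type":
            controller.type(action.text)                    # act: real keypresses
    return False
```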

Unfortunately, a critical component needed for widespread adoption inside real businesses is often missing from the demos: a reliable Device UI Controller that can act like a real human user. At AskUI we believe that true UI automation is only possible if you control and automate your UI like a real human, with mouse movements, keypresses, and clicks/taps.

Only then can everybody build reliable and intelligent Vision Agents for their use case. In this blog post we discuss what is necessary for a working Device UI Controller.

Missing From The Demos: A Reliable Device UI Controller

While the demos look impressive, there are massive hindrances to using them anywhere beyond an impressive demo. Most demos rely on a library like PyAutoGUI. These libraries serve their purpose well for that use case, but they cannot be used in enterprise production applications because there are too many edge cases where they fall flat:

  • Cross-platform compatibility
  • Real Unicode character typing
  • Multi-screen support
  • Typing into a command line
  • No need for administrator permissions
  • Support for all desktop and native mobile operating systems
  • Application selection
  • Process visualization

If you check the current landscape, no tool or library fulfills all of these requirements. Most of them work on only a few operating systems or struggle with real Unicode character typing. This renders them fairly useless for business applications, as the sketch below illustrates.
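To make this concrete, here is roughly what such a demo script looks like with PyAutoGUI. The coordinates and text are made up, but the limitations noted in the comments are the ones that bite in production:

```python
import pyautogui

# Click a button at fixed screen coordinates. On several platforms the
# coordinates only cover the primary screen, so multi-monitor setups break.
pyautogui.click(x=540, y=320)

# Type into the focused field. write() only emits characters reachable via the
# keyboard, so non-ASCII input such as "Zürich" or "東京" is silently dropped.
pyautogui.write("Zurich", interval=0.05)
pyautogui.press("enter")

# Keyboard shortcut; on macOS this only works after granting the script
# accessibility permissions.
pyautogui.hotkey("ctrl", "s")
```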

AskUI's Controller: Production-Ready for Intelligent Vision Agents

AskUI has developed its own Device UI Controller from scratch with all these requirements in mind and keeps extending it. By integrating it deeply into each operating system, we achieve superior performance and features on every platform (an illustrative interface sketch follows the feature list below):

  • Cross-platform compatibility: Windows, Linux, macOS, Android
  • Multi-screen support
  • Real Unicode character typing
  • Typing into a command line
  • No need for administrator rights
  • Application selection
  • Process visualization
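
In contrast to the limitations above, the sketch below outlines the kind of interface such a controller needs to expose. This is an illustrative Python protocol, not the actual AskUI API; the exact interface is documented in the AskUI docs:

```python
from typing import Protocol

class DeviceController(Protocol):
    """Illustrative interface sketch only -- not the actual AskUI API."""

    def set_display(self, index: int) -> None: ...          # multi-screen support
    def focus_application(self, name: str) -> None: ...     # application selection
    def click(self, x: int, y: int) -> None: ...             # real mouse movement and click
    def type(self, text: str) -> None: ...                   # real Unicode keystrokes ("Zürich", "東京")
    def type_in_terminal(self, command: str) -> None: ...    # typing into a command line
    def screenshot(self) -> bytes: ...                       # perception input for the Vision Agent
```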

Coming soon:

  • iOS support
  • In-background automation
  • Native tasks
  • Video streaming

Conclusion

The missing link in all the demos, a reliable Device UI Controller that can act like a real human user, is already available today and ready for building Agentic AI.

Check out our AskUI Vision Agent Implementation

And if you want to use AskUI's Device Controller to build reliable, production-ready enterprise Vision Agents:

Johannes Dienst · October 31, 2024