With Claude's Computer Use feature generating hype for what it can do on your desktop, it is safe to say that AI will eventually reach everybody.
AI models can now understand images well enough to perform solid Visual Question Answering: detecting relations between objects. This is a big step forward, because humans solve this task effortlessly, yet computer vision had failed at it ever since the field's beginnings in the 1960s.
With the rapid growth in computing power (namely GPUs) and the rise of Large Language Models (LLMs), AI models can now reason well enough to decide what needs to be done on a User Interface (UI) to achieve a goal.
But unfortunately, a critical component needed for widespread adoption in real businesses is often missing from the demos: a reliable Device UI Controller that can act as a real human user. At AskUI we believe that true UI automation is only possible if you control and automate your UI like a real human: with mouse movements, keypresses, and clicks/taps.
Only then can everybody build reliable and intelligent Vision Agents for their use case. In this blog post we discuss what a working Device UI Controller requires.
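To make concrete what "controlling the UI like a real human" means, here is a minimal, hypothetical sketch of the primitives such a controller has to expose. All names are illustrative, not AskUI's actual API; the logging backend stands in for a real OS-level implementation:

```python
from abc import ABC, abstractmethod


class DeviceController(ABC):
    """Hypothetical interface: the human-like input primitives a
    Device UI Controller must provide."""

    @abstractmethod
    def move_mouse(self, x: int, y: int) -> None: ...

    @abstractmethod
    def click(self, button: str = "left") -> None: ...

    @abstractmethod
    def type_text(self, text: str) -> None: ...


class LoggingController(DeviceController):
    """Dummy backend that records actions instead of touching a device."""

    def __init__(self) -> None:
        self.actions: list[str] = []

    def move_mouse(self, x: int, y: int) -> None:
        self.actions.append(f"move({x},{y})")

    def click(self, button: str = "left") -> None:
        self.actions.append(f"click({button})")

    def type_text(self, text: str) -> None:
        self.actions.append(f"type({text!r})")


# A Vision Agent drives the controller the way a human would:
ctrl = LoggingController()
ctrl.move_mouse(120, 340)
ctrl.click()
ctrl.type_text("hello")
```

The point of the abstraction is that the agent's reasoning stays the same while the backend that synthesizes real mouse and keyboard events differs per operating system.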
Missing From The Demos: A Reliable Device UI Controller
While the demos look impressive, there are massive hindrances to using them anywhere beyond an impressive demo. Most demos rely on a library such as PyAutoGUI. These libraries serve their purpose well, but they cannot be used in enterprise production applications because there are too many edge cases where they fall flat:
- Cross-platform compatibility
- Real Unicode character typing
- Multi-screen support
- Typing into a command line
- No need for administrator permissions
- Support for all desktop and native mobile operating systems
- Application selection
- Process visualization
If you check the current landscape, no tool or library fulfills all of these requirements. Most work on only a few operating systems or struggle with real Unicode character typing, which renders them fairly useless for business applications.
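The Unicode problem, for example, stems from how most of these libraries synthesize keystrokes: each character is looked up in a fixed keyboard-layout table (usually US), and anything outside that table simply cannot be typed. The sketch below is a simplified stand-in for such a table, not any library's actual implementation, but it illustrates the failure mode:

```python
# Simplified stand-in for a fixed US-layout key table, as used by
# keystroke-based typing libraries (illustrative, not a real library's table).
US_LAYOUT_KEYS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789 !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\t\n"
)


def typeable(text: str) -> bool:
    """True only if every character has a key in the layout table."""
    return all(ch in US_LAYOUT_KEYS for ch in text)


print(typeable("Hello, World!"))  # plain ASCII works
print(typeable("Héllo wörld"))    # accented characters have no key mapping
print(typeable("Preis: 42€"))     # currency symbols outside the table fail too
```

A controller that injects text at the operating-system level (rather than replaying per-character keystrokes against one layout) avoids this class of failure entirely.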
AskUI's Controller: Production-Ready for Intelligent Vision Agents
AskUI developed its own Device UI Controller from scratch with all of these requirements in mind, and continues to extend it. By integrating it deeply into the operating system we achieve superior performance and features on each platform:
- Cross-platform compatibility: Windows, Linux, macOS, Android
- Multi-screen support
- Real Unicode character typing
- Typing into a command line
- No need for administrator rights
- Application selection
- Process visualization
Up and coming:
- iOS support
- In-background automation
- Native Tasks
- Video Streaming
Conclusion
The missing link in all the demos, a reliable Device UI Controller that can act as a real human user, is already available today and ready for building Agentic AI.
Check out our AskUI Vision Agent implementation.
And if you want to use AskUI's Device Controller to build reliable, enterprise production-ready Agents with Vision: