If you’ve followed generative AI since ChatGPT debuted in 2022, and watched the space fill up with AI chatbots of varying capabilities, you may agree that we are approaching a saturation point.
By mid-2024, the arrival of new large language models (LLMs) that generate video, text, and code was greeted with indifference, and as more arrived, each lost its unique selling point.
Now the attention is on AI agents. More specifically, will they be able to “jump into an operating system and start using it”?
Microsoft explored how adept a range of artificial intelligence systems are at navigating everyday GUIs. The results are a mixed bag for what may be the missing link in putting AI to work with our everyday tools.
Key Takeaways
- AI GUI agents aim to automate tasks by navigating computer interfaces such as Windows or macOS.
- Specialized Large Action Models (LAMs) could provide a path for AI to interact with our everyday tools and programs.
- AI agents are arriving fast, and the global AI agent market is set to grow to $47B by 2030, but the technology on the GUI side is not quite ready for the mainstream.
- Microsoft studies show promise among mainstream AI services but also highlight complexity and dataset gaps.
- Ethical concerns and job losses loom as adoption increases.
The Hunger for AI Assistants in the Enterprise
When AI startup Anthropic announced “Computer Use” in October 2024, it was deemed the next big leap in the GenAI race, and perhaps rightly so.
Basically, “computer use” describes a graphical user interface (GUI) agent or AI agent that can click for us, or as Anthropic put it, an AI agent that can “use computers the way people do.”
The prospect of automating tasks starting with a simple text description is incredibly enticing.
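To make the idea concrete, a GUI agent can be thought of as a loop that observes the screen, picks an action, and performs it until the task is done. The sketch below is a toy simulation in Python, not Anthropic's implementation: the `Screen`, `Action`, `choose_action`, and `run_agent` names are hypothetical, and a simple rule stands in for the model that would normally decide the next click.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the UI element to act on
    text: str = ""     # text to enter, for "type" actions


@dataclass
class Screen:
    """A toy stand-in for a real desktop: one text field and a Save button."""
    fields: dict = field(default_factory=lambda: {"filename": ""})
    saved: bool = False

    def apply(self, action: Action) -> None:
        if action.kind == "type" and action.target in self.fields:
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "save":
            # Saving only works once a filename has been entered.
            self.saved = bool(self.fields["filename"])


def choose_action(goal: str, screen: Screen) -> Action:
    """Hypothetical policy. A real agent would send a screenshot and the
    goal to a model here; this rule-based stub just fills in the form."""
    if screen.saved:
        return Action("done")
    if not screen.fields["filename"]:
        return Action("type", target="filename", text=goal)
    return Action("click", target="save")


def run_agent(goal: str, screen: Screen, max_steps: int = 10) -> list:
    """Observe, decide, act, and repeat until done or out of steps."""
    history = []
    for _ in range(max_steps):
        action = choose_action(goal, screen)
        history.append(action.kind)
        if action.kind == "done":
            break
        screen.apply(action)
    return history


screen = Screen()
steps = run_agent("report.txt", screen)
print(steps, screen.saved)
```

The `max_steps` cap is a deliberate design choice in this sketch: an agent that misreads the screen could otherwise loop forever, so the loop gives up after a fixed number of actions.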
While Anthropic has pulled ahead in the race by being the first to roll out an AI agent (in public beta), other players like Microsoft and Google have announced similar efforts. OpenAI is reportedly readying one, codenamed “Operator,” for a launch in January 2025.
The hunger for AI agents is growing fast because of their potential to boost productivity and cut costs for businesses.
According to PRNewswire, the global AI agents market is projected to take off next year, growing from $5 billion in 2024 to $47 billion by 2030.
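For a sense of scale, the two figures quoted above imply a compound annual growth rate that can be worked out directly. The short Python sketch below assumes steady compounding over the six years from 2024 to 2030, which is a simplification, and the function name is illustrative.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoints."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the projection above, in billions of US dollars.
cagr = implied_cagr(5.0, 47.0, 2030 - 2024)
print(f"Implied annual growth: {cagr:.1%}")
```

Under these assumptions, the projection works out to roughly 45% growth per year.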
As we near the end of 2024, companies are racing to stake their claim in this lucrative market. Salesforce, for instance, has already signed up over 200 companies to implement its AI agents, including big names like Accenture, FedEx, and IBM.
With such high stakes, the race for enterprise AI agents is truly heating up, and it’s anyone’s game to win.
Are AI Agents Ready for the Desktop? Microsoft Study
To gauge the capabilities and accuracy of these agents, especially in enterprise workplaces, Microsoft researchers and their academic partners set out to determine to what extent LLM-powered GUI agents can be applied to workflows.
The study also explored how these agents handle complex software navigation across different operating systems and across mobile and desktop interfaces.
On the positive side, they found that, unlike traditional software agents, LLM-based agents can process visual data from screens and follow spoken or written instructions, which allows them to manage intricate tasks without direct human intervention.
They also found that these agents adapt quickly to new tasks within familiar software environments.
The study also tested their ability to handle ambiguous instructions and adapt across different software applications.
Results showed they could make sense of unclear commands and switch between desktop and web environments with ease. This opens the door for integrating these agents into broader AI systems, further extending their utility.
On the downside, the study found that desktop GUI agents are let down by a relative lack of dedicated datasets, especially when compared with mobile and web platforms. This is despite the desktop’s crucial role in applications like productivity tools and enterprise software.
In addition, current GUI agents are built on foundation models such as GPT-4o and Claude 3.5 Sonnet (Computer Use). The researchers point out that while these models are capable enough to serve as a starting point, they fall short in tackling the unique complexities of GUI-based tasks.
These findings are consistent with a recent study that found Anthropic’s Claude 3.5 AI agent struggled to handle complex multi-step operations, despite showing an 87% success rate on basic computing tasks and 92% on navigation tasks.
These are great numbers, but perhaps not good enough to let AI take over your spreadsheet for you.
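One way to see why multi-step work is harder is to look at how errors compound. The Python sketch below makes a strong simplifying assumption that does not come from either study: it treats each step of a workflow as succeeding independently with the same probability, and it borrows the 87% figure purely for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Assumption: 0.87 is used as a per-step rate for illustration only.
for steps in (1, 5, 10):
    print(f"{steps:>2} steps: {chain_success(0.87, steps):.0%}")
```

Under that assumption, a five-step task would finish cleanly about half the time and a ten-step task only about a quarter of the time.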
LLM-Driven AI Agents May Fall Short Without LAMs
To improve the efficiency and accuracy of GUI agents, Microsoft researchers propose building on foundation LLMs by fine-tuning them into specialized Large Action Models (LAMs).
They wrote that these specialized models are “tailored to improve the performance and efficiency of GUI agents. These LAMs bridge the gap between general-purpose capabilities and the specific demands of GUI-based interactions.”
They also argue that LAMs would allow GUI agents to handle intricate tasks more smoothly and consistently.
This shift would not only enhance their overall effectiveness but also help businesses rely more on these agents to cut down on repetitive tasks and improve productivity across the board.
To address the lack of dedicated datasets for desktop GUI agents, the researchers recommend prioritizing the development of specialized, high-quality datasets tailored to desktop environments.
The researchers argue that with these targeted datasets, developers can train LAMs to better understand and navigate the unique difficulties of desktop interfaces.
They emphasize that putting resources into creating these datasets would not only close the performance gap between desktop agents and those designed for mobile or web platforms but also pave the way for broader enterprise adoption of AI agents.
The Bottom Line
AI agents represent a promising frontier in automating tasks and boosting efficiency using artificial intelligence. OpenAI CEO Sam Altman calls them “the next giant breakthrough.”
Despite studies suggesting that they still require a great deal of fine-tuning, AI players will likely bring them to market by 2025.
As the technology matures, AI agents have the potential to handle a wide range of tasks, from customer service to project management. No doubt, jobs will be lost.
If AI agents are to be pushed into the market, then the big players need to provide adequate human oversight. Regulatory bodies will also have a governance role to play, ensuring caution is not thrown to the wind when these agents start clicking for us.