Can AI Tap Phones and Use Computers by Itself? A Reading of the Mobile-Agent Project

X-PLUG’s open-source Mobile-Agent is no longer just a phone automation project. Judging by the repository’s current positioning, it reads more like a body of GUI-agent work accumulated by Tongyi Lab: Mobile-Agent-v1/v2/v3/v3.5, Mobile-Agent-E, PC-Agent, GUI-Critic-R1, UI-S1, GUI-Owl, ToolCUA, and more all sit under the same project umbrella.

This line of work is worth watching. Earlier discussions of GUI agents often centered on whether a model could understand a screenshot and tap the right place. Mobile-Agent goes further: it tries to let agents move among mobile, desktop, browser, and tool use, so they can handle longer and more complex real-world tasks.

What Problem It Solves

GUI agents do not face standard APIs; they face application interfaces. They need to understand the screen, locate controls, plan steps, perform taps or typing, and correct their path after failure. Mobile scenarios are especially complex because tasks often cross multiple apps, and interface state changes with login, permissions, popups, network conditions, and personalized recommendations.
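
The cycle described above (observe the screen, plan, act, recover after a failure) can be sketched as a small loop. This is only an illustration under stated assumptions, not Mobile-Agent's implementation: `FakeScreen`, `Action`, `plan`, and `run` are hypothetical names, and the "device" is a toy object where a popup blocks the first action.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "tap", "type", or "done"
    target: str = ""   # hypothetical control identifier
    text: str = ""


@dataclass
class FakeScreen:
    """Toy stand-in for a device: a popup blocks everything until closed."""
    popup_open: bool = True
    typed: str = ""
    log: list = field(default_factory=list)

    def observe(self) -> dict:
        return {"popup_open": self.popup_open, "typed": self.typed}

    def execute(self, action: Action) -> bool:
        self.log.append(f"{action.kind}:{action.target}")
        if action.kind == "tap" and action.target == "close_popup":
            self.popup_open = False
            return True
        if self.popup_open:
            return False  # the popup covers the UI, so the action fails
        if action.kind == "type":
            self.typed = action.text
            return True
        return action.kind == "tap"


def plan(state: dict, last_failed: bool) -> Action:
    # Recovery rule: after a failure with a popup visible, close the popup.
    if state["popup_open"] and last_failed:
        return Action("tap", "close_popup")
    if state["typed"] == "":
        return Action("type", "search_box", "weather")
    return Action("done")


def run(env: FakeScreen, max_steps: int = 10) -> bool:
    last_failed = False
    for _ in range(max_steps):
        state = env.observe()              # perceive
        action = plan(state, last_failed)  # plan
        if action.kind == "done":
            return True
        last_failed = not env.execute(action)  # act, remember failures
    return False
```

In a real agent the `plan` step would be a model call over a screenshot rather than a hand-written rule; the point of the sketch is only the shape of the loop and the fact that failure feeds back into the next plan.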

The Mobile-Agent series breaks the problem into several directions:

Mobile-Agent-v1/v2 explore visual perception and multi-agent collaboration for phone GUIs.
PC-Agent extends multi-agent operation to PC scenarios.
Mobile-Agent-v3 and v3.5 advance a multi-platform GUI agent framework.
GUI-Owl models provide cross-platform GUI perception, grounding, and end-to-end operation.
GUI-Critic-R1, UI-S1, ToolCUA, and related work add error diagnosis, reinforcement learning, and GUI/tool path orchestration.

This makes it less like a single demo and more like a research and engineering program built around “computer-use agents”.

The Focus of v3.5

The repository README shows that Mobile-Agent-v3.5 can be tried through an online demo on ModelScope and another on Alibaba Cloud Bailian, and Bailian also provides a v3.5 API. In March 2026, v3.5 also launched on Alibaba Cloud Wuying cloud phones, offering mobile-use experiences in cloud Android environments.

This suggests the project is filling in usage modes beyond “run experiments locally”. For GUI agents, cloud phones and cloud desktops matter: they provide more stable and reproducible runtime environments, reducing differences caused by local devices, OS versions, resolution, and app state.

If you want to evaluate this kind of agent, a stable environment is easy to underestimate. Without a controllable execution environment, it is hard to know whether a failure came from weak model capability, interface changes, device issues, or an unclear task definition.
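
One way to make that attribution concrete is to record environment details with every run and triage failures against a known baseline. The sketch below is a hypothetical heuristic, not something the project ships: `RunRecord`, `env_fingerprint`, `start_screen_matched`, and `attribute_failure` are assumed names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task_id: str
    success: bool
    env_fingerprint: str        # e.g. OS version + resolution + app version
    start_screen_matched: bool  # did the first screen match the baseline?


def attribute_failure(run: RunRecord, baseline_fingerprint: str) -> str:
    """Coarse triage: rule out environment and UI drift before blaming the model."""
    if run.success:
        return "none"
    if run.env_fingerprint != baseline_fingerprint:
        return "environment"
    if not run.start_screen_matched:
        return "interface_change"
    # Environment and UI look unchanged, so the model or the task
    # definition is the remaining suspect.
    return "model_or_task"


def summarize(runs, baseline_fingerprint: str) -> Counter:
    return Counter(attribute_failure(r, baseline_fingerprint) for r in runs)
```

In a fixed cloud environment the first two categories should be rare, which is what makes the remaining failures informative about the model itself.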

GUI-Owl Is a Deeper Change

Since Mobile-Agent-v3, GUI-Owl has become a key model layer in this line of work. The README describes GUI-Owl as a multimodal, cross-platform GUI VLM with GUI perception, grounding, and end-to-end operation abilities. As of GUI-Owl-1.5, the model series spans the 2B, 4B, 8B, 32B, and 235B sizes and supports desktop, mobile, and browser automation.

The significance of this kind of model is that it does not only answer “what is on the screen”. It must connect the natural-language goal, screenshot content, UI element positions, and next operation. For GUI agents, visual understanding, coordinate grounding, action planning, and state memory are all necessary.
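
Coordinate grounding is the most mechanical of these pieces, so it is easy to illustrate. The sketch below assumes the model emits a click point as normalized coordinates in the range 0 to 1; that is an assumption made for the example, not a statement about GUI-Owl's actual output format. `to_pixels` is a hypothetical helper.

```python
def to_pixels(norm_x: float, norm_y: float, width: int, height: int) -> tuple:
    """Map a normalized (0..1) point onto a concrete screen size."""
    if not (0.0 <= norm_x <= 1.0 and 0.0 <= norm_y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that 1.0 maps to the last valid pixel, not one past it.
    px = min(int(norm_x * width), width - 1)
    py = min(int(norm_y * height), height - 1)
    return px, py


# The same predicted point lands on different pixels on different devices.
phone_point = to_pixels(0.5, 0.25, 1080, 2400)
desktop_point = to_pixels(0.5, 0.25, 1920, 1080)
```

A resolution-independent representation like this is one reason a single model can plausibly serve phones, desktops, and browsers, although the executor still has to know the real screen size at action time.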

Of course, the more general the model becomes, the more important engineering boundaries are. Real deployment still needs executors, permission control, task logs, rollback mechanisms, and human confirmation. For high-risk actions involving payment, accounts, files, or message sending, a GUI agent must not only complete tasks automatically; it must also clearly explain what it is about to do.
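
A confirmation gate of that kind can be sketched in a few lines. Everything here is hypothetical: the action dictionaries, the `HIGH_RISK_KINDS` set, and the `guarded_execute` function are illustrative assumptions rather than part of any Mobile-Agent component.

```python
# Action kinds that should never run without explicit human approval.
HIGH_RISK_KINDS = {"payment", "account_change", "file_delete", "send_message"}


def describe(action: dict) -> str:
    """Plain-language summary shown to the human before a risky step."""
    return f"{action['kind']} -> {action['target']}"


def guarded_execute(action: dict, execute, confirm, audit_log: list) -> bool:
    """Run the action, but ask for confirmation first if it is high risk.

    `execute` performs the action, `confirm` takes the summary string and
    returns True or False, and every decision is appended to `audit_log`.
    """
    summary = describe(action)
    if action["kind"] in HIGH_RISK_KINDS and not confirm(summary):
        audit_log.append(("blocked", summary))
        return False
    execute(action)
    audit_log.append(("executed", summary))
    return True
```

Passing `confirm` in as a callable keeps the policy separate from the UI: the same gate can sit behind a terminal prompt, a mobile notification, or an approval queue, and the audit log gives the traceability the paragraph above asks for.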

The Direction Implied by ToolCUA

In May 2026, project news mentioned ToolCUA, positioned as an end-to-end Computer Use Agent for optimal GUI and tool path orchestration. This direction is interesting because it recognizes a practical fact: not every task should be completed by clicking through screens.

Some work suits GUI operation, such as logging into back offices, handling complex forms, or reading app state without APIs. Some work is better done through tools, such as search, calculation, file parsing, or structured interface access. A usable computer-use agent needs to learn when to switch between the two.
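
As a rough illustration of that switch, the sketch below routes each step to a tool when one is registered and falls back to GUI operation otherwise. This is a rule-based stand-in written for this article; ToolCUA is described as end-to-end, so its actual decision process should not be assumed to look like this. `TOOLS`, `route`, and `run_plan` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: step kinds that have a direct, non-GUI path.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculation": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "search": lambda arg: f"results for {arg!r}",
}


def route(step_kind: str) -> str:
    """Prefer a tool when one exists; otherwise operate the interface."""
    return "tool" if step_kind in TOOLS else "gui"


def run_plan(steps: List[Tuple[str, str]]) -> list:
    trace = []
    for kind, arg in steps:
        if route(kind) == "tool":
            trace.append(("tool", kind, TOOLS[kind](arg)))
        else:
            # Placeholder: a real agent would drive the screen here.
            trace.append(("gui", kind, f"would operate the screen for {arg!r}"))
    return trace
```

For example, `run_plan([("calculation", "2+3"), ("fill_form", "expense report")])` sends the arithmetic to a tool and leaves the form to the GUI path.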

This is why the Mobile-Agent series deserves more attention than early phone automation projects. It no longer only asks whether an agent can tap through apps like a human. It asks when an agent should look at the screen, when it should use tools, and when it should stop for confirmation.

Who Should Follow It

If you only want an out-of-the-box phone automation assistant, this is probably not it yet: Mobile-Agent is still more of a research and engineering framework. It involves models, runtime environments, evaluation tasks, and concrete executors, so a complete run usually carries a setup cost.

But if you care about the following questions, it is worth tracking:

How mobile GUI agents move from demos to stable execution.
Whether desktop, browser, and phone automation can be unified under one agent framework.
How GUI models handle grounding, reflection, memory, and error diagnosis.
How agents choose between GUI operation and tool use.
Whether cloud phones and cloud desktops will become important runtime environments for GUI agents.

These questions directly affect personal assistants, enterprise workflow automation, remote desktop operation, app testing, and integration with systems that lack APIs.

My Take

The value of Mobile-Agent lies less in any single version’s metrics than in how it pushes GUI agents from “phone screenshot and tap” into a larger system problem: how models, execution environments, evaluation, tool use, error diagnosis, and cross-platform tasks work together.

In the short term, it is better suited for researchers and developers observing the technical path of GUI agents. In the long term, projects like this may influence the shape of personal AI assistants and enterprise automation tools. The real difficulty is not only making an agent operate interfaces, but making it complete tasks in real applications in a stable, controllable, and traceable way.

Project link: X-PLUG/MobileAgent