We make Windows, MacOS, Linux, Android, and iOS accessible for AI agents by finding any element on screen.
Key features of AskUI include:
- Support for Windows, Linux, MacOS, Android and iOS device automation (Citrix supported)
- Support for single-step UI automation commands (RPA like) as well as agentic intent-based instructions
- In-background automation on Windows machines (agent can create a second session; you do not have to watch it take over mouse and keyboard)
- Flexible model use (hot swap of models) and infrastructure for reteaching of models (available on-premise)
- Secure deployment of agents in enterprise environments
Join the AskUI Discord.
[Demo video: AskUI Vision Agents for Enterprise]
Agent OS is a device controller that allows agents to take screenshots, move the mouse, click, and type on the keyboard across any operating system.
Linux
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run
MacOS
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run
pip install askui
Note: Requires Python version >=3.10.
| | AskUI | Anthropic |
|---|---|---|
| ENV Variables | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN` | `ANTHROPIC_API_KEY` |
| Supported Commands | `act()`, `click()`, `get()`, `locate()`, `mouse_move()` | `act()`, `click()`, `get()`, `locate()`, `mouse_move()` |
| Description | Faster inference, European servers, enterprise ready | Supports complex actions |
To get started, set the environment variables required to authenticate with your chosen model provider.
Linux & MacOS
Use `export` to set an environment variable:
export ASKUI_WORKSPACE_ID=<your-workspace-id-here>
export ASKUI_TOKEN=<your-token-here>
export ANTHROPIC_API_KEY=<your-api-key-here>
Windows PowerShell
Set an environment variable with $env:
$env:ASKUI_WORKSPACE_ID="<your-workspace-id-here>"
$env:ASKUI_TOKEN="<your-token-here>"
$env:ANTHROPIC_API_KEY="<your-api-key-here>"
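Alternatively, you can set these variables from Python before initializing the agent; a minimal sketch (placeholder values shown, set only the variables your model provider needs):
import os

# Placeholder credentials; replace with your own values
os.environ["ASKUI_WORKSPACE_ID"] = "<your-workspace-id-here>"
os.environ["ASKUI_TOKEN"] = "<your-token-here>"
os.environ["ANTHROPIC_API_KEY"] = "<your-api-key-here>"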
Example Code:
from askui import VisionAgent
with VisionAgent() as agent:
    # AskUI used as default model
    agent.click("search field")
    # Use Anthropic (Claude 4 Sonnet) as model
    agent.click("search field", model="claude-sonnet-4-20250514")
You can test the Vision Agent with Hugging Face models via their Spaces API. Please note that the API is rate-limited, so for production use cases it is recommended to choose step 3a.
Note: Hugging Face Spaces host model demos provided by individuals not associated with Hugging Face or AskUI. Don't use these models on screens with sensitive information.
Supported Models:
- AskUI/PTA-1
- OS-Copilot/OS-Atlas-Base-7B
- showlab/ShowUI-2B
- Qwen/Qwen2-VL-2B-Instruct
- Qwen/Qwen2-VL-7B-Instruct
Example Code:
agent.click("search field", model="OS-Copilot/OS-Atlas-Base-7B")
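The snippet above assumes an `agent` is already in scope; for reference, a minimal self-contained sketch using one of the models listed above:
from askui import VisionAgent

with VisionAgent() as agent:
    # Locate and click via the rate-limited Hugging Face Spaces API
    agent.click("search field", model="AskUI/PTA-1")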
You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoint.
1. Host the model locally or in the cloud. More information about hosting UI-TARS can be found here.
2. Provide the `TARS_URL` and `TARS_API_KEY` environment variables to Vision Agent.
3. Use the `model="tars"` parameter in your `click()`, `get()`, `act()` etc. commands or when initializing the `VisionAgent` (see the sketch below).
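A minimal sketch, assuming `TARS_URL` and `TARS_API_KEY` are already set and point to your UI-TARS endpoint:
from askui import VisionAgent

# Option 1: select UI-TARS per command
with VisionAgent() as agent:
    agent.click("search field", model="tars")

# Option 2: select UI-TARS for all commands when initializing the agent
with VisionAgent(model="tars") as agent:
    agent.act("search for a flight from Berlin to Paris in January")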
from askui import VisionAgent
# Initialize your agent context manager
with VisionAgent() as agent:
    # Use the webbrowser tool to start browsing
    agent.tools.webbrowser.open_new("http://www.google.com")
    # Start to automate individual steps
    agent.click("url bar")
    agent.type("http://www.google.com")
    agent.keyboard("enter")
    # Extract information from the screen
    datetime = agent.get("What is the datetime at the top of the screen?")
    print(datetime)
    # Or let the agent work on its own, needs an Anthropic key set
    agent.act("search for a flight from Berlin to Paris in January")
You can choose different models for each `click()` (`act()`, `get()`, `locate()`, etc.) command using the `model` parameter.
from askui import VisionAgent
# Use AskUI's combo model for all commands
with VisionAgent(model="askui-combo") as agent:
agent.click("Next") # Uses askui-combo
agent.get("What's on screen?") # Uses askui-combo
# Use different models for different tasks
with VisionAgent(model={
"act": "claude-sonnet-4-20250514", # Use Claude for act()
"get": "askui", # Use AskUI for get()
"locate": "askui-combo", # Use AskUI combo for locate() (and click(), mouse_move())
}) as agent:
agent.act("Search for flights") # Uses Claude
agent.get("What's the current page?") # Uses AskUI
agent.click("Submit") # Uses AskUI combo
# You can still override the default model for individual commands
with VisionAgent(model="askui-combo") as agent:
agent.click("Next") # Uses askui-combo (default)
agent.click("Previous", model="askui-pta") # Override with askui-pta
agent.click("Submit") # Back to askui-combo (default)
The following models are available:
AskUI AI Models
Supported commands are: `act()`, `click()`, `get()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|---|---|---|---|---|---|
| `askui` | AskUI is a combination of all of the following models (`askui-pta`, `askui-ocr`, `askui-combo`, `askui-ai-element`), where AskUI chooses the best model for the task depending on the input. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0.05$ per step | Recommended for production usage, can be (at least partially) retrained |
| `askui-pta` | PTA-1 (Prompt-to-Automation) is a vision language model (VLM) trained by AskUI to address all kinds of UI elements by a textual description, e.g. "Login button", "Text login" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0.05$ per step | Recommended for production usage, can be retrained |
| `askui-ocr` | AskUI OCR is an OCR model trained to address texts on UI screens, e.g. "Login", "Search" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0.05$ per step | Recommended for production usage, can be retrained |
| `askui-combo` | AskUI Combo is a combination of the `askui-pta` and `askui-ocr` models to improve accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0.05$ per step | Recommended for production usage, can be retrained |
| `askui-ai-element` | AskUI AI Element allows you to address visual elements like icons or images by demonstrating what you are looking for: crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0.05$ per step | Recommended for production usage, deterministic behaviour |
Note: Configure your AskUI Model Provider here
Anthropic AI Models
Supported commands are: `act()`, `get()`, `click()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|---|---|---|---|---|---|
| `claude-sonnet-4-20250514` | The Computer Use model from Anthropic is a Large Action Model (LAM) that can autonomously achieve goals, e.g. "Book me a flight from Berlin to Rome" | Slow, >1s per step | Model hosting by Anthropic | High, up to 1.5$ per act | Not recommended for production usage |
Note: Configure your Anthropic Model Provider here
Hugging Face AI Models (Spaces API)
Supported commands are: `click()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|---|---|---|---|---|---|
| `AskUI/PTA-1` | PTA-1 (Prompt-to-Automation) is a vision language model (VLM) trained by AskUI to address all kinds of UI elements by a textual description, e.g. "Login button", "Text login" | Fast, <500ms per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production applications |
| `OS-Copilot/OS-Atlas-Base-7B` | OS-Atlas-Base-7B is a Large Action Model (LAM) that can autonomously achieve goals, e.g. "Please help me modify VS Code settings to hide all folders in the explorer view". This model is not available in the `act()` command. | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production applications |
| `showlab/ShowUI-2B` | showlab/ShowUI-2B is a Large Action Model (LAM) that can autonomously achieve goals, e.g. "Search in Google Maps for Nahant". This model is not available in the `act()` command. | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
| `Qwen/Qwen2-VL-2B-Instruct` | Qwen/Qwen2-VL-2B-Instruct is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
| `Qwen/Qwen2-VL-7B-Instruct` | [Qwen/Qwen2-VL-7B-Instruct](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
Note: No authentication required! But rate-limited!
Self-Hosted UI Models
Supported commands are: `act()`, `click()`, `get()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|---|---|---|---|---|---|
| `tars` | UI-TARS is a Large Action Model (LAM) based on Qwen2 and fine-tuned by ByteDance on UI data, e.g. "Book me a flight to Rome" | Slow, >1s per step | Self-hosted | Depending on infrastructure | Out-of-the-box not recommended for production usage |
Note: These models need to be self-hosted by you. (See here)
You can create and use your own models by subclassing `ActModel` (used for `act()`), `GetModel` (used for `get()`), or `LocateModel` (used for `click()`, `locate()`, `mouse_move()`) and registering them with the `VisionAgent`.
Here's how to create and use custom models:
import functools
from askui import (
    ActModel,
    GetModel,
    LocateModel,
    Locator,
    ImageSource,
    MessageParam,
    ModelComposition,
    ModelRegistry,
    OnMessageCb,
    Point,
    ResponseSchema,
    VisionAgent,
)
from typing import Type
from typing_extensions import override


# Define custom models
class MyActModel(ActModel):
    @override
    def act(
        self,
        messages: list[MessageParam],
        model_choice: str,
        on_message: OnMessageCb | None = None,
    ) -> None:
        # Implement custom act logic, e.g.:
        # - Use a different AI model
        # - Implement custom business logic
        # - Call external services
        if len(messages) > 0:
            goal = messages[0].content
            print(f"Custom act model executing goal: {goal}")
        else:
            error_msg = "No messages provided"
            raise ValueError(error_msg)


# Because Python supports multiple inheritance, we can subclass both `GetModel` and
# `LocateModel` (and even `ActModel`) to create a model that can both get and locate elements.
class MyGetAndLocateModel(GetModel, LocateModel):
    @override
    def get(
        self,
        query: str,
        image: ImageSource,
        response_schema: Type[ResponseSchema] | None,
        model_choice: str,
    ) -> ResponseSchema | str:
        # Implement custom get logic, e.g.:
        # - Use a different OCR service
        # - Implement custom text extraction
        # - Call external vision APIs
        return f"Custom response to query: {query}"

    @override
    def locate(
        self,
        locator: str | Locator,
        image: ImageSource,
        model_choice: ModelComposition | str,
    ) -> Point:
        # Implement custom locate logic, e.g.:
        # - Use a different object detection model
        # - Implement custom element finding
        # - Call external vision services
        return (100, 100)  # Example coordinates


# Create model registry
custom_models: ModelRegistry = {
    "my-act-model": MyActModel(),
    "my-get-model": MyGetAndLocateModel(),
    "my-locate-model": MyGetAndLocateModel(),
}

# Initialize agent with custom models
with VisionAgent(models=custom_models) as agent:
    # Use custom models for specific tasks
    agent.act("search for flights", model="my-act-model")

    # Get information using custom model
    result = agent.get(
        "what's the current page title?",
        model="my-get-model"
    )

    # Click using custom locate model
    agent.click("submit button", model="my-locate-model")

    # Mix and match with default models
    agent.click("next", model="askui")  # Uses default AskUI model
You can also use model factories if you need to create models dynamically:
class DynamicActModel(ActModel):
    @override
    def act(
        self,
        messages: list[MessageParam],
        model_choice: str,
        on_message: OnMessageCb | None = None,
    ) -> None:
        pass


# Going to be called each time the model is chosen using the `model` parameter
def create_custom_model(api_key: str) -> ActModel:
    return DynamicActModel()


# If you don't want to recreate a new model on each call but rather just initialize
# it lazily
@functools.cache
def create_custom_model_cached(api_key: str) -> ActModel:
    return DynamicActModel()


# Register model factories
custom_models: ModelRegistry = {
    "dynamic-model": lambda: create_custom_model("your-api-key"),
    "dynamic-model-cached": lambda: create_custom_model_cached("your-api-key"),
    "askui": lambda: create_custom_model_cached("your-api-key"),  # overrides default model
    "claude-sonnet-4-20250514": lambda: create_custom_model_cached("your-api-key"),  # overrides model
}

with VisionAgent(models=custom_models, model="dynamic-model") as agent:
    agent.act("do something")  # creates and uses a new instance of DynamicActModel
    agent.act("do something")  # creates and uses a new instance of DynamicActModel
    agent.act("do something", model="dynamic-model-cached")  # uses a new instance of DynamicActModel as it is the first call
    agent.act("do something", model="dynamic-model-cached")  # reuses the cached instance
You can use Vision Agent with OpenRouter to access a wide variety of models via a unified API.
Set your OpenRouter API key:
Linux & MacOS
export OPEN_ROUTER_API_KEY=<your-openrouter-api-key>
Windows PowerShell
$env:OPEN_ROUTER_API_KEY="<your-openrouter-api-key>"
Example: Using OpenRouter with a custom model registry
from askui import VisionAgent
from askui.models import (
    OpenRouterModel,
    OpenRouterSettings,
    ModelRegistry,
)

# Register OpenRouter model in the registry
custom_models: ModelRegistry = {
    "my-custom-model": OpenRouterModel(
        OpenRouterSettings(
            model="anthropic/claude-opus-4",
        )
    ),
}

with VisionAgent(models=custom_models, model={"get": "my-custom-model"}) as agent:
    result = agent.get("What is the main heading on the screen?")
    print(result)
Under the hood, agents use a set of tools. You can access these tools directly.
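A minimal sketch of direct tool access, reusing the `webbrowser` tool from the example above (other tool names are not listed here and may vary by AskUI version):
from askui import VisionAgent

with VisionAgent() as agent:
    # Open a URL directly through the agent's webbrowser tool
    agent.tools.webbrowser.open_new("http://www.google.com")
    # The agent's UI commands can then operate on the opened page
    agent.click("url bar")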