
🦾 Dum-E | The Embodied AI Agent


What if your favorite AI agents had hands?

🚀 Get Started • 📖 Documentation • 🗺️ Roadmap • 🤝 Contributing


🎯 Mission

Inspired by Tony Stark's robotic assistant Dum-E, the mission of this project is to create an intelligent, voice & vision enabled AI agent with robotic arm(s) capable of real-time human interaction, physical operations, and orchestration of a wide range of tools and services.

✨ Key Features

  • 🎤 Real-time Voice Interface - Natural voice conversation with configurable voice and language support
  • 🧠 Long-horizon Task Planning - Orchestrated by state-of-the-art VLMs with multi-modal reasoning and tool use
  • 📡 Asynchronous Task Execution - Tasks run asynchronously with streaming progress updates
  • 👁️ Hybrid Robot Control - Deep learning policies for generalized physical operations combined with classical control for precise manipulation
  • 🔧 Modular Architecture - Flexible interfaces to add your own custom embodiments and tools
  • 🌐 MCP Protocol Support - Agents and tools available via MCP for custom client integration (coming soon)

📺 Demo

Recorded as part of the LeRobot Worldwide Hackathon: grabbing fries.

🚀 Get Started

βš™οΈ Recommended System Requirements

Hardware

  • Robot: SO100/101 robotic arm assembled with a wrist camera
  • Webcam: External 1080p USB webcam for flexible vision input
  • GPU: RTX 3060 (12GB VRAM) or better for local policy inference

Software

  • OS: Linux (Ubuntu 22.04+) or Windows with WSL2
  • Python: 3.10

🔧 Installation

📦 Prerequisites

  1. Install required system dependencies

    sudo apt-get update
    sudo apt-get install ffmpeg libsm6 libxext6
  2. Install CUDA toolkit 12.4

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-12-4
  3. Create a new conda or venv environment using Python 3.10, for example:

    conda create -n dum-e python=3.10
    conda activate dum-e
  4. Download LeRobot and Isaac-GR00T source code:

    git clone https://github.com/huggingface/lerobot.git
    git clone https://github.com/NVIDIA/Isaac-GR00T
  5. Make sure you are in the conda / venv environment and install LeRobot:

    # conda activate dum-e
    cd lerobot
    # use the stable version up to 5th June 2025 before hardware API redesign
    git checkout b536f47e3ff8c3b340fc5efa52f0ece0a7212a57
    # install with the feetech extension (for SO-ARM10x)
    pip install -e ".[feetech]"

    Calibrate the robot following the instructions for SO-100 or SO-101. If you have already calibrated the robot, you can simply copy the .cache folder that includes main_follower.json into the root of the Dum-E repository later (see the example after this list).

  6. Install Isaac-GR00T:

    # conda activate dum-e
    cd ../Isaac-GR00T
    pip install --upgrade setuptools
    pip install -e .[base]
    pip install --no-build-isolation flash-attn==2.7.1.post4

    Download a fine-tuned GR00T model from Hugging Face for the task you want to perform. For example, to pick up an orange lego block as shown in our demo, you can download our checkpoint by running:

    huggingface-cli download aaronsu11/GR00T-N1.5-3B-FT-PICK-0613 --local-dir ./GR00T-N1.5-3B-FT-PICK --exclude "optimizer.pt"
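
If you are reusing an existing calibration (step 5 above), a minimal sketch of copying it over, assuming lerobot and Dum-E sit side by side and Dum-E has already been cloned as described in the next section:

    # Illustrative only - adjust the paths to your own directory layout
    cp -r ./lerobot/.cache ./Dum-E/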

🦾 Dum-E

  1. Clone the Repository

    git clone https://github.com/aaronsu11/Dum-E.git
    cd Dum-E
  2. Install Dum-E Dependencies

    # conda activate dum-e
    pip install -r requirements.txt
  3. Start policy server

    In a new terminal, run the following command to start a local policy server from the Isaac-GR00T repository:

    cd <path-to-Isaac-GR00T>
    python scripts/inference_service.py \
    --server \
    --model_path ./GR00T-N1.5-3B-FT-PICK \
    --embodiment_tag new_embodiment \
    --data_config so100_dualcam

    This needs to be running in the background as long as you are using the policy for inference.

  4. Test policy execution

    Back in the Dum-E terminal/repository, run the following command, replacing <wrist_cam_idx> and <front_cam_idx> with your actual camera indices:

    python -m embodiment.so_arm10x.client \
    --use_policy \
    --host 0.0.0.0 \
    --port 5555 \
    --wrist_cam_idx <wrist_cam_idx> \
    --front_cam_idx <front_cam_idx> \
    --lang_instruction "Pick up the lego block."
  5. Environment Configuration

    Sign up for free accounts with ElevenLabs, Deepgram, and Anthropic, and obtain the API keys. Then copy the environment template and update the .env file with your credentials (an example sketch follows this list):

    cp .env.example .env

    Edit .env with your credentials:

    • Anthropic API key for LLM
    • ElevenLabs API key for TTS
    • Deepgram API key for STT
  6. Test the robot agent

    Once the policy is working, you can test the robot agent with text instructions by running the following command:

    python -m embodiment.so_arm10x.agent
  7. Start Dum-E

    Finally, let's start Dum-E!

    python dum_e.py

    This will launch a web interface at http://localhost:7860 where you can connect and speak to Dum-E using your microphone. Have fun!
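
For reference, the .env from step 5 holds the three API keys. The variable names below are illustrative assumptions; use the exact names defined in .env.example:

    # Hypothetical key names for illustration - check .env.example for the real ones
    ANTHROPIC_API_KEY=<your-anthropic-api-key>
    ELEVENLABS_API_KEY=<your-elevenlabs-api-key>
    DEEPGRAM_API_KEY=<your-deepgram-api-key>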

πŸ—οΈ Architecture Overview

Core Components

graph TB
    subgraph Interface Layer
        A[Voice Interface<br/>Pipecat]
        B[Third-party Clients<br/>Dum-E MCP Server]
    end

    subgraph Agent Layer
        C[Robot Agent<br/>Reasoning VLM]
        D[Task Manager<br/>Lifecycle Tracking]
        E[Tool Registry<br/>Shared Tools]
        F[Event Publisher<br/>Streaming Updates]
    end

    subgraph Tool Layer
        G[Classical Control<br/>MoveIt]
        H[VLA Policy<br/>Gr00t Inference]
        I[Custom Tools<br/>Public MCP Servers]
    end

    subgraph Hardware Layer
        J[Physical Robot<br/>SO10x]
    end

    A --> C
    B --> C
    C --> D
    C --> E
    C --> F
    E --> G
    E --> H 
    E --> I
    G --> J
    H --> J

🔄 Data Flow

  1. Voice Interaction → Voice / Vision Streams → Pipecat → Conversation and Task Delegation
  2. Task Planning → Task Manager → Reasoning VLM → Multi-step Instruction Breakdown
  3. Task Execution → Tool Registry → Hardware Interface → SO100Robot
  4. Robot Control → Camera Images → DL Policy / Classical Control → Joint Commands
  5. Feedback Loop → Progress Events → Model Context → Voice Updates
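
The sketch below mirrors this flow in plain Python. All class and method names are illustrative and not taken from the Dum-E codebase; in the real system the reasoning VLM produces the step breakdown and the tools drive the physical robot.

    from dataclasses import dataclass, field
    from typing import Callable


    @dataclass
    class EventPublisher:
        """Streams progress events back to the voice interface."""
        listeners: list[Callable[[str], None]] = field(default_factory=list)

        def publish(self, message: str) -> None:
            for listener in self.listeners:
                listener(message)


    @dataclass
    class ToolRegistry:
        """Maps tool names (e.g. a VLA policy) to callables."""
        tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

        def run(self, name: str, instruction: str) -> str:
            return self.tools[name](instruction)


    @dataclass
    class TaskManager:
        """Breaks a task into steps and dispatches each step to a tool."""
        registry: ToolRegistry
        publisher: EventPublisher

        def execute(self, task: str, steps: list[tuple[str, str]]) -> None:
            self.publisher.publish(f"Task started: {task}")
            for tool_name, instruction in steps:
                result = self.registry.run(tool_name, instruction)
                self.publisher.publish(f"{tool_name}: {result}")
            self.publisher.publish("Task complete")


    if __name__ == "__main__":
        registry = ToolRegistry(tools={"vla_policy": lambda text: f"executed '{text}'"})
        publisher = EventPublisher(listeners=[print])
        manager = TaskManager(registry=registry, publisher=publisher)
        # In Dum-E the reasoning VLM produces the step breakdown; here it is hard-coded.
        manager.execute("Tidy the desk", [("vla_policy", "Pick up the lego block.")])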

πŸ—ΊοΈ Roadmap

Q3 2025

  • Voice Interaction

    • Multi-language support (Mandarin, Japanese, Spanish etc.)
    • Emotional understanding with speech-to-speech models
  • MCP Servers

    • Access to robot agent via MCP
    • Configurable MCP server endpoints for Dum-E
  • Local Model Support

    • Integration with Ollama for local language model inference
  • Upgrade Dependency

    • LeRobot new hardware APIs

Q4 2025

  • Cross-Platform Support

    • Docker containers for platform-agnostic deployment
  • ROS2 Integration

    • Native ROS2 node implementation
    • Integration with existing ROS toolkits

🔬 Ongoing: Research Initiatives

  • Embodied AI Research

    • Generalizable & scalable policies for physical tasks
    • Efficient dataset collection and training
  • Human-Robot Interaction

    • Natural multi-modal understanding
    • Contextual conversation memory
    • Self-evolving personality and skillset

🤝 Contributing

We welcome contributions from the robotics and AI community! Here's how you can help:

🌟 Ways to Contribute

  • πŸ› Bug Reports: Found an issue? Create a detailed bug report
  • πŸ’‘ Feature Requests: Have ideas? Share them in our discussions or Discord
  • πŸ“ Documentation: Help improve our docs and tutorials
  • πŸ§ͺ Testing: Add test cases and improve coverage
  • πŸš€ Code: Submit pull requests with new features or fixes

📋 Development Guidelines

  1. Fork the Repository

    # Fork the repository on GitHub, then clone your fork
    git clone https://github.com/<your-username>/Dum-E.git
    cd Dum-E
    git checkout -b feature/issue-<number>/<your-feature-name>
  2. Follow Code Standards (see the example after this list)

    • Use Python 3.10 type hints
    • Follow PEP 8 style guidelines
    • Add comprehensive docstrings
    • Maintain test coverage >95%
  3. Testing Requirements

    # Run tests and formatters before submitting
    python -m pytest tests/
    python -m black <your-file-or-directory>
    python -m isort <your-file-or-directory>
  4. Pull Request Process

    • Create detailed PR description
    • Link related issues
    • Ensure CI/CD passes
    • Request review from maintainers
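
To illustrate the standards from step 2, a hypothetical utility and its matching pytest test might look like this (the function is an example only, not part of the codebase):

    def clamp_joint_angle(
        angle_deg: float, lower: float = -90.0, upper: float = 90.0
    ) -> float:
        """Clamp a joint angle to the arm's safe range.

        Args:
            angle_deg: Requested joint angle in degrees.
            lower: Minimum allowed angle in degrees.
            upper: Maximum allowed angle in degrees.

        Returns:
            The angle limited to the [lower, upper] range.
        """
        return max(lower, min(upper, angle_deg))


    def test_clamp_joint_angle() -> None:
        assert clamp_joint_angle(120.0) == 90.0
        assert clamp_joint_angle(-120.0) == -90.0
        assert clamp_joint_angle(45.0) == 45.0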

👥 Community

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

This project builds on top of open-source projects including LeRobot, Isaac-GR00T, and Pipecat.


⭐ Star us on GitHub - it motivates us a lot!

🚀 Get Started • 📖 Documentation • 🤝 Join Community • 💼 Commercial Use

Built with ❀️ for the future of robotics
