8000 docs: sim and bench docs by jmatejcz · Pull Request #598 · RobotecAI/rai · GitHub

docs: sim and bench docs #598

Merged: 7 commits, May 29, 2025
Binary file added docs/imgs/manipulation_benchmark.png
Binary file added docs/imgs/tool_calling_agent_benchmark.png
7 changes: 4 additions & 3 deletions docs/simulation_and_benchmarking/overview.md
Expand Up @@ -10,21 +10,22 @@ RAI Sim provides a simulator-agnostic interface that allows RAI to work with any
- Easy integration with new simulators
- Seamless switching between simulation backends

The package also provides simulator bridges for concrete simulators, currently supporting only O3DE.
For detailed information about the simulation interface, see [RAI Sim Documentation](rai_sim.md).

## RAI Bench

RAI Bench builds on top of RAI Sim to provide a framework for creating and running benchmarks. It uses the simulator-agnostic interface to:
RAI Bench provides benchmarks with ready-to-use tasks and a framework for creating your own tasks. It enables you to:

- Define and execute tasks in any supported simulator
- Define and execute tasks
- Measure and evaluate performance
- Collect and analyze results

For detailed information about the benchmarking framework, see [RAI Bench Documentation](rai_bench.md).

## Integration

RAI Sim and RAI Bench work together to provide a complete simulation and evaluation environment:
RAI Sim and RAI Bench work together to provide benchmarks which utilize simulations for evaluation:

1. **Simulation Interface**: RAI Sim provides the foundation with its simulator-agnostic interface
2. **Task Definition**: RAI Bench defines tasks that can be executed in any supported simulator
Expand Down
164 changes: 77 additions & 87 deletions d 8000 ocs/simulation_and_benchmarking/rai_bench.md
@@ -1,146 +1,136 @@
# RAI Bench

RAI Bench is a framework for creating and running benchmarks in simulation environments. It builds on top of RAI Sim to provide a structured way to define tasks, scenarios, and evaluate performance.
RAI Bench is a comprehensive package that both provides benchmarks with ready-to-use tasks and offers a framework for creating new tasks. It's designed to evaluate the performance of AI agents in various environments.

## Core Components
### Available Benchmarks

### Task
- [Manipulation O3DE Benchmark](#manipulation-o3de-benchmark)
- [Tool Calling Agent Benchmark](#tool-calling-agent-benchmark)

## Manipulation O3DE Benchmark

The `Task` class is an abstract base class that defines the interface for benchmark tasks. Each task must implement:
Evaluates agent performance in robotic arm manipulation tasks within the O3DE simulation environment. The benchmark measures how well agents can process sensor data and use tools to manipulate objects in the environment.

- `get_prompt()`: Returns the task instruction for the agent
- `validate_config()`: Verifies if a simulation configuration is suitable for the task
- `calculate_result()`: Computes the task score (0.0 to 1.0)
### Framework Components

### ManipulationTask
Manipulation O3DE Benchmark provides a framework for creating custom tasks and scenarios with these core components:

A specialized `Task` class for manipulation tasks that provides common functionality for:
![Manipulation Benchmark Framework](../imgs/manipulation_benchmark.png)

### Task

- Object type filtering
- Placement validation
- Score calculation based on object positions
The `Task` class is an abstract base class that defines the interface for tasks used in this benchmark.
Each concrete Task must implement:

- prompts that will be passed to the agent
- validation of simulation configurations
- calculating results based on scene state
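To make the interface concrete, here is a minimal sketch of what a task might look like. The class, method, and field names are illustrative stand-ins for the real `rai_bench` API, not a copy of it:

```python
from abc import ABC, abstractmethod
from typing import Any


class Task(ABC):
    """Illustrative stand-in for the benchmark's Task interface."""

    @abstractmethod
    def get_prompt(self) -> str:
        """Instruction passed to the agent."""

    @abstractmethod
    def validate_config(self, simulation_config: Any) -> bool:
        """Check whether a scene configuration suits this task."""

    @abstractmethod
    def calculate_result(self, scene_state: Any) -> float:
        """Score the final scene state on a 0.0-1.0 scale."""


class MoveCubesLeftTask(Task):
    """Toy task: every cube should end up with a positive y coordinate."""

    def get_prompt(self) -> str:
        return "Move all cubes to the left side of the table."

    def validate_config(self, simulation_config: Any) -> bool:
        # The scene must contain at least one cube to be meaningful.
        return any(e["type"] == "cube" for e in simulation_config["entities"])

    def calculate_result(self, scene_state: Any) -> float:
        cubes = [e for e in scene_state["entities"] if e["type"] == "cube"]
        if not cubes:
            return 0.0
        return sum(1 for c in cubes if c["y"] > 0) / len(cubes)
```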

### Scenario

A `Scenario` represents a specific test case combining:

- A task to be executed
- A simulation configuration
- The path to the configuration file

### Benchmark
### ManipulationO3DEBenchmark

The `Benchmark` class manages the execution of scenarios and collects results. It provides:
The `ManipulationO3DEBenchmark` class manages the execution of scenarios and collects results. It provides:

- Scenario execution management
- Performance metrics tracking
- Results logging and export
- Logs and results
- The required robotic stack, provided as a `LaunchDescription`

## Available Tasks
### Available Tasks

The framework includes several predefined manipulation tasks:
The benchmark includes several predefined manipulation tasks:

1. **MoveObjectsToLeftTask**
1. **MoveObjectsToLeftTask** - Move specified objects to the left side of the table

- Moves specified objects to the left side of the table
- Success measured by objects' y-coordinate being positive
2. **PlaceObjectAtCoordTask** - Place specified objects at specific coordinates

2. **PlaceObjectAtCoordTask**
3. **PlaceCubesTask** - Place specified cubes adjacent to each other

- Places an object at specific coordinates
- Success measured by distance from target position
4. **BuildCubeTowerTask** - Stack specified cubes to form a tower

3. **PlaceCubesTask**
5. **GroupObjectsTask** - Group specified objects of specified types together

- Places cubes adjacent to each other
- Success measured by proximity to other cubes
Tasks are parametrizable so you can configure which objects should be manipulated and how much precision is needed to complete a task.

4. **BuildCubeTowerTask**
Tasks are scored on a scale from 0.0 to 1.0, where:

- Stacks cubes to form a tower
- Success measured by height and stability
- 0.0 indicates no improvement over the initial placement (or a worse one)
- 1.0 indicates perfect completion

5. **GroupObjectsTask**
The score is typically calculated as:

- Groups objects of specified types together
- Success measured by object proximity
```
score = (correctly_placed_now - correctly_placed_initially) / initially_incorrect
```
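The formula above can be checked with a small worked example; clamping negative scores to 0.0 is an assumption for illustration:

```python
def manipulation_score(correct_now: int, correct_initially: int, total: int) -> float:
    """Fraction of initially misplaced objects that ended up correctly placed."""
    initially_incorrect = total - correct_initially
    if initially_incorrect == 0:
        return 1.0  # nothing was misplaced to begin with
    return max(0.0, (correct_now - correct_initially) / initially_incorrect)

# e.g. 1 of 5 objects starts correctly placed and the agent finishes with 3:
# (3 - 1) / 4 = 0.5
```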

## Usage
### Available Scene Configs and Scenarios

### Creating Scenarios
You can find predefined scene configs in `rai_bench/manipulation_o3de/predefined/configs/`.

Scenarios can be created manually:
Predefined scenarios can be imported like:

```python
scenario = Scenario(
task=MoveObjectsToLeftTask(obj_types=["cube"]),
simulation_config=simulation_config,
simulation_config_path="path/to/config.yaml"
)
from rai_bench.manipulation_o3de import get_scenarios

get_scenarios(levels=["easy", "medium"])
```

Or automatically using the `Benchmark.create_scenarios()` method:
Choose which tasks you want by selecting difficulty levels, which range from trivial to very hard.

```python
scenarios = Benchmark.create_scenarios(
tasks=tasks,
simulation_configs=configs,
simulation_configs_paths=config_paths
)
```
## Tool Calling Agent Benchmark

### Running Benchmarks
Evaluates agent performance independently of any simulation, based only on the tool calls that the agent makes. To stay simulation-independent, this benchmark introduces tool mocks which can be adjusted for different tasks. This makes the benchmark more universal and significantly faster.

```python
benchmark = Benchmark(
simulation_bridge=bridge,
scenarios=scenarios,
results_filename="results.csv"
)
```
### Framework Components

## Scoring System
![Tool Calling Benchmark Framework](../imgs/tool_calling_agent_benchmark.png)

Tasks are scored on a scale from 0.0 to 1.0, where:
### SubTask

- 0.0 indicates no improvement or worse performance
- 1.0 indicates perfect completion
The `SubTask` class validates a single tool call. The following classes are available:

The score is typically calculated as:
- `CheckArgsToolCallSubTask` - verify that a certain tool was called with the expected arguments
- `CheckTopicFieldsToolCallSubTask` - verify that a message published to a ROS 2 topic was of the proper type and included the expected fields
- `CheckServiceFieldsToolCallSubTask` - verify that a message sent to a ROS 2 service was of the proper type and included the expected fields
- `CheckActionFieldsToolCallSubTask` - verify that a goal sent to a ROS 2 action was of the proper type and included the expected fields

```
score = (correctly_placed_now - correctly_placed_initially) / initially_incorrect
```
### Validator

The `Validator` class combines one or more subtasks into a single validation step. The following validators are available:

- `OrderedCallsValidator` - requires a strict order of subtasks: the next subtask is validated only after the previous one has been completed. The validator passes when all subtasks pass.
- `NotOrderedCallsValidator` - does not enforce an order of subtasks; every subtask is validated against every tool call. The validator passes when all subtasks pass.
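A self-contained sketch of the subtask/validator pattern, using simplified stand-ins for the real classes; whether the real ordered validator tolerates unrelated calls between matches is an assumption made here for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, object]


@dataclass
class CheckArgsSubTask:
    """Passes when a tool call matches the expected name and arguments."""
    expected_tool: str
    expected_args: Dict[str, object]

    def validate(self, call: ToolCall) -> bool:
        return call.name == self.expected_tool and call.args == self.expected_args


def ordered_calls_validator(subtasks: List[CheckArgsSubTask],
                            calls: List[ToolCall]) -> bool:
    """Subtasks must be satisfied in order; unrelated calls in between are skipped."""
    idx = 0
    for call in calls:
        if idx < len(subtasks) and subtasks[idx].validate(call):
            idx += 1
    return idx == len(subtasks)
```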

### Task

## Integration with RAI Sim
A `Task` represents a specific prompt and a set of available tools. A list of validators is assigned to validate performance.

RAI Bench leverages RAI Sim's simulator-agnostic interface to:
??? info "Task class definition"

- Execute tasks in any supported simulation environment
- Access and manipulate simulation entities
- Monitor scene state and object positions
- Manage simulation configurations
::: rai_bench.tool_calling_agent.interfaces.Task

This integration allows for:
The framework is very flexible: any `SubTask` can be combined into any `Validator`, which can later be assigned to any `Task`.

- Consistent task execution across different simulators
- Reliable performance measurement
- Flexible task definition
- Comprehensive result analysis
### ToolCallingAgentBenchmark

## Configuration
The `ToolCallingAgentBenchmark` class manages the execution of tasks and collects results.

Simulation configurations are defined in YAML files that specify:
### Available Tasks

- Scene setup
- Object types and positions
- Task-specific parameters
Tasks of this benchmark are grouped by type:

## Error Handling
- Basic - basic usage of tools
- Navigation
- Spatial reasoning - questions about surroundings with images attached
- Manipulation
- Custom Interfaces - requires using messages with custom interfaces

The framework includes comprehensive error handling for:
For details about each task, see `rai_bench/tool_calling_agent/tasks`.

- Invalid configurations
- Task validation failures
- Simulation errors
- Performance tracking
## Test Models
20 changes: 13 additions & 7 deletions docs/simulation_and_benchmarking/rai_sim.md
Expand Up @@ -13,16 +13,13 @@ The `SimulationBridge` is an abstract base class that defines the interface for
- Object pose retrieval
- Scene state monitoring

### SimulationConfig
### SceneConfig

The `SimulationConfig` is a base configuration class that specifies the entities to be spawned in the simulation. Each simulation bridge can extend this with additional parameters specific to its implementation.
The `SceneConfig` is a configuration class that specifies the entities to be spawned in the simulation.

Key features:
### SimulationConfig

- Entity list management
- Unique name validation
- YAML configuration loading
- Frame ID specification
The `SimulationConfig` is an abstract configuration class. Each simulation bridge can extend this with additional parameters specific to its implementation.
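A rough sketch of the configuration side, with illustrative field names; the unique-name check mirrors the validation described in the docs, while the real classes also handle YAML loading:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    entity_type: str
    # Pose fields trimmed to a bare minimum for the sketch.
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


@dataclass
class SceneConfig:
    """Entities to spawn; names must be unique within a scene."""
    entities: List[Entity] = field(default_factory=list)

    def __post_init__(self) -> None:
        names = [e.name for e in self.entities]
        if len(names) != len(set(names)):
            raise ValueError("entity names must be unique")
```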

### SceneState

Expand Down Expand Up @@ -54,6 +51,7 @@ To use RAI Sim with a specific simulation environment:
1. Create a custom `SimulationBridge` implementation for your simulator
2. Extend `SimulationConfig` with simulator-specific parameters
3. Implement the required abstract methods:
- `init_simulation`
- `setup_scene`
- `_spawn_entity`
- `_despawn_entity`
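The steps above can be sketched as a skeleton; the method names follow the docs, but the bodies and the toy in-memory bridge are illustrative (the real abstract interface may include more methods):

```python
from abc import ABC, abstractmethod


class SimulationBridge(ABC):
    """Illustrative skeleton of a simulator bridge."""

    @abstractmethod
    def init_simulation(self, simulation_config) -> None: ...

    @abstractmethod
    def setup_scene(self, scene_config) -> None: ...

    @abstractmethod
    def _spawn_entity(self, entity) -> None: ...

    @abstractmethod
    def _despawn_entity(self, entity) -> None: ...


class InMemoryBridge(SimulationBridge):
    """Toy bridge that tracks spawned entities in a dict instead of a simulator."""

    def __init__(self):
        self.entities = {}

    def init_simulation(self, simulation_config) -> None:
        self.entities.clear()

    def setup_scene(self, scene_config) -> None:
        for entity in scene_config:
            self._spawn_entity(entity)

    def _spawn_entity(self, entity) -> None:
        self.entities[entity["name"]] = entity

    def _despawn_entity(self, entity) -> None:
        self.entities.pop(entity["name"], None)
```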
Expand Down Expand Up @@ -100,3 +98,11 @@ RAI Sim serves as the foundation for RAI Bench by providing:
- Configuration management

This allows RAI Bench to focus on task definition and evaluation while remaining simulator-agnostic.

## LaunchManager

RAI Sim also provides a `ROS2LaunchManager` class that manages the startup and shutdown of a ROS 2 `LaunchDescription`.

??? info "ROS2LaunchManager class definition"

::: rai_sim.launch_manager.ROS2LaunchManager
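A toy sketch of the start/shutdown lifecycle such a class manages; the method names and the context-manager wrapper are assumptions made for illustration, not the real `ROS2LaunchManager` API:

```python
class LaunchManagerSketch:
    """Toy lifecycle manager mirroring a start/shutdown pattern (not the real API)."""

    def __init__(self, launch_description):
        self.launch_description = launch_description
        self.running = False

    def start(self) -> None:
        # A real manager would hand the description to the ROS 2 launch service here.
        self.running = True

    def shutdown(self) -> None:
        self.running = False

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc) -> None:
        # Tears the stack down even if a benchmark scenario fails mid-run.
        self.shutdown()
```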