Authors:
Zhipeng Hou,
Junyi Tang,
Yipeng Wang
Contact:
japhonehou@gmail.com
If you find our work useful in your research, please consider citing HALO as follows:
@misc{hou2025halohierarchicalautonomouslogicoriented,
title={HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems},
author={Zhipeng Hou and Junyi Tang and Yipeng Wang},
year={2025},
eprint={2505.13516},
archivePrefix={arXiv},
primaryClass={cs.MA},
url={https://arxiv.org/abs/2505.13516},
}
Abstract
Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks.

Installation

conda create -n halo python=3.10
conda activate halo
pip install -r requirements.txt
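For orientation, the snippet below is a minimal, hypothetical sketch of the three-level hierarchy described in the abstract: a high-level planning agent decomposes the query, mid-level role-design agents instantiate subtask-specific agents, and low-level inference agents execute each subtask. All names and signatures are illustrative and are not the repository's actual API; the MCTS-based workflow search is collapsed into a single LLM call here for brevity.

```python
# Hypothetical sketch of HALO's three-level hierarchy (illustrative only,
# not the repository's actual implementation).
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model client


@dataclass
class Agent:
    role: str            # e.g. "Python developer", "algebra tutor"
    system_prompt: str


def plan(llm: LLM, query: str) -> List[str]:
    """High-level planning agent: split the query into ordered subtasks."""
    reply = llm(f"Decompose the following task into numbered subtasks:\n{query}")
    return [line for line in reply.splitlines() if line.strip()]


def design_agent(llm: LLM, subtask: str) -> Agent:
    """Mid-level role-design agent: instantiate a subtask-specific persona."""
    role = llm(f"Name the single best expert role for this subtask: {subtask}")
    return Agent(role=role, system_prompt=f"You are {role}. Solve the subtask rigorously.")


def execute(llm: LLM, agent: Agent, subtask: str) -> str:
    """Low-level inference agent: solve the subtask (MCTS workflow search omitted)."""
    return llm(f"{agent.system_prompt}\n\nSubtask: {subtask}")


def run_pipeline(llm: LLM, query: str) -> str:
    """End-to-end pass: plan, instantiate agents, execute, and join the results."""
    results = [execute(llm, design_agent(llm, s), s) for s in plan(llm, query)]
    return "\n".join(results)
```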
Create an api_setting.json file in the HALO/configs directory and insert the following contents (GPT-4o is recommended):
{
"endpoints": "<base_url>/chat/completions",
"api_key": "sk-xxx",
"model": "xxx"
}
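HALO loads this file itself at runtime. The snippet below is only an optional, illustrative sanity check (not part of the repository) that the endpoint and key work, assuming an OpenAI-compatible /chat/completions API and the requests library.

```python
# Optional standalone check that configs/api_setting.json is valid (illustrative only).
import json

import requests

with open("configs/api_setting.json") as f:
    cfg = json.load(f)

resp = requests.post(
    cfg["endpoints"],  # "<base_url>/chat/completions"
    headers={"Authorization": f"Bearer {cfg['api_key']}"},
    json={
        "model": cfg["model"],
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```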
Locate the code between the "user input begin" and "user input ended" markers in the HALO/run.py script, and set "QUERY" to the question you want to ask.
python run.py
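For reference, the user-input region typically reduces to assigning a single string. The example below is illustrative; the exact marker comments and surrounding code live in HALO/run.py.

```python
# --- user input begin ---
QUERY = "Implement a function that checks whether a string is a palindrome."
# --- user input ended ---
```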
Performance of HALO across three benchmarks.

| Method | Structure | HumanEval | MMLU | MATH | Avg. |
|---|---|---|---|---|---|
| HALO (Ours) | Hierarchical architecture + MCTS | 95.2 | 81.6 | 58.9 | 78.6 |
Ablation study of removing the Adaptive Prompt Refinement module and the high-level planning agent on GPT-4o across three benchmarks.
1.1 For Windows, modify the human-eval package script; please refer here.
python ./experiment/human_eval/run.py
2.1 Download MMLU datasets
python ./experiment/MMLU/run.py
3.1 Download MATH datasets
python ./experiment/MATH/run.py
python ./experiment/ablation_study/run_humaneval_w_o_prompt.py
python ./experiment/ablation_study/run_humaneval_w_o_task.py
python ./experiment/ablation_study/run_math_w_o_prompt.py
python ./experiment/ablation_study/run_math_w_o_task.py
python ./experiment/ablation_study/run_mmlu_w_o_prompt.py
python ./experiment/ablation_study/run_mmlu_w_o_task.py
In our experiments, data preprocessing was standardized across all tasks; for this stage we followed the preprocessing procedures of DyLAN, HumanEval, MMLU, and MATH.