4.1 Safe and Trustworthy Tool Learning
Armed with external tools, AI systems can be unprecedentedly capable and human-like. Although we are eager to witness how tool learning will change our lives, it is paramount to take a step back and contemplate the underlying risks. In the spirit of responsible AI research, here we discuss the safety and trustworthiness problems of tool learning.
Adversaries. As with all other AI systems, we can foresee that external adversaries will emerge once tool learning models are deployed in the real world, and thus defending against these threats is of great significance [
42,
52,
133,
143]. Recent works suggest that large foundation models like ChatGPT are more robust to hard and adversarial examples [
134,
147], which improves their utility in the complicated real world. However, attempts to craft misleading or even harmful queries will undoubtedly persist [
102]. Moreover, due to training on massive web data, foundation models are faced with long-lasting training-time security issues in deep learning, such as backdoor attacks [
30,
59] and data poisoning attacks [
144]. In addition to models, tools could become new attack targets for adversaries. For example, attackers could maliciously modify the manual documentation or even the tools themselves (e.g., attacking a news API so that it gives biased reports) to mislead the model into erroneous outcomes. A safe and robust system requires models not only to learn to use tools but also to scrutinize, rectify, and secure them.
Governance. There is long-standing worry about the misuse of foundation models [
13]. Under the paradigm of tool learning, governance over foundation models is more urgently needed. The pertinent question at hand is
which tools should be involved? Given the countless tools human beings have manufactured, we must consider whether it is appropriate to allow models to master all of them. Certain tools, such as calculators and translators, may be deemed safe as they pose no harm to individuals. However, granting models access to the internet or permitting them to make decisions in the real world could be perilous, as they could exert negative or even dangerous influences, such as disseminating falsehoods [
167] and harming human lives. In this regard, research communities and companies need to deliberate carefully before permitting machines to master a certain tool. In addition, governance over tool usage is also a pertinent issue. As highlighted by Amodei et al. [
5], the end-to-end training paradigm in deep learning does not regulate how models achieve their objectives. Foundation models are not only expected to finish tasks with the help of tools but also should follow the regulations and constraints of tool usage.
Trustworthiness. The goal of tool learning lies in creating advanced intelligent agents. However, determining whether these agents are trustworthy or not is a complex challenge. Even though tool learning delivers enhanced interpretability and robustness, the core foundation models are still considered “black boxes”. Recent research [
24] shows that although large models achieve better performance, they are unable to predict when they will be wrong, leaving the calibration problem unresolved. When equipped with tools, the circumstances under which a model will call on them are also unpredictable. Therefore, it is essential to thoroughly discuss to what extent we should allow AI to engage in human lives. Moreover, the morality of foundation models has emerged as a contentious issue in recent times. Despite OpenAI’s commendable efforts to imbue InstructGPT [
99] and GPT-4 [
93] with human values and preferences, given the discomforting “jailbreak” responses by ChatGPT [
14] and New Bing [
114], whether these large models will remain mild and compliant is doubtful. Once models can learn actively from the world via tools, the challenge of controlling their actions will become more daunting than ever before.
4.2 From Tool User to Tool Maker: AI’s Evolutionary Role
Throughout the annals of human civilization, the evolution of tools has occupied a pivotal position [
57,
86]. The progression of human civilization is inextricably intertwined with this evolution, and human beings have been the creators and users of almost all tools from the Stone Age to the 21st century. Although we take this for granted, things change when foundation models are involved: given that they have demonstrated tool-use capabilities to a certain extent, it is also possible to involve them in the lifecycle of tool creation.
Tools for AI. Humans create tools to satisfy our own needs, so their design naturally suits human preferences and convenience. However, current tools may not be optimal for models: most are specifically designed for human use, while models process information in a different way. Therefore, it is necessary to create tools that are specifically suited for models. Possible solutions include: (1) modularity: decomposing tools into smaller, more modular units makes them more adaptable and flexible for AI models, allowing models to use these components in a more fine-grained and compositional manner; (2) new input and output formats: developing input and output formats specifically tailored to the needs of AI models can improve how models interact with and utilize tools, enabling more seamless integration and communication between models and tools.
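As a toy illustration of solution (1), the sketch below (Python, with hypothetical names and stubbed implementations) exposes a tool as small, composable operations behind a machine-readable schema that a model can call in a structured way; the schema itself hints at solution (2), a model-oriented input/output format rather than a human-facing interface.

```python
# A minimal sketch of a model-oriented tool interface (all names hypothetical).
# Instead of a human-facing GUI, each capability is exposed as a small,
# composable operation described by a machine-readable schema.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class ToolOp:
    name: str
    description: str              # natural-language hint for the model
    input_schema: Dict[str, str]  # argument name -> type, e.g. {"city": "str"}
    fn: Callable[..., Any]        # the underlying executable

@dataclass
class ModularTool:
    name: str
    ops: Dict[str, ToolOp] = field(default_factory=dict)

    def register(self, op: ToolOp) -> None:
        self.ops[op.name] = op

    def call(self, op_name: str, **kwargs: Any) -> Any:
        # The model emits structured calls such as ("forecast", {"lat": ..., "lon": ...}).
        return self.ops[op_name].fn(**kwargs)

# Example: a weather tool decomposed into fine-grained, composable units.
weather = ModularTool("weather")
weather.register(ToolOp("geocode", "Resolve a city name to coordinates",
                        {"city": "str"}, lambda city: (52.5, 13.4)))
weather.register(ToolOp("forecast", "Hourly temperatures at given coordinates",
                        {"lat": "float", "lon": "float"},
                        lambda lat, lon: [18.0, 19.5, 21.0]))
lat, lon = weather.call("geocode", city="Berlin")
temps = weather.call("forecast", lat=lat, lon=lon)
```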
Tools by AI. The creation and utilization of tools have traditionally been considered exclusive to human intelligence. However, increasing evidence indicates that the ability to create advanced tools is no longer limited to human beings. For instance, large code models [
22] can generate executable programs from language descriptions. These programs can be deemed tools that help accomplish specific tasks. Besides writing code from scratch, foundation models can also encapsulate existing tools into stronger ones. In Figure
7, we showcase ChatGPT’s ability to encapsulate existing APIs into more advanced functions. The first example shows how it extends a weather forecast API to compute the average temperature over a specified period, while the second integrates external stock market data and trends for investment recommendations. The third example highlights a fully automated medical diagnosis system that not only performs diagnostic tests but also adapts treatment plans based on real-time vitals. These examples underscore the transition from tool usage to tool creation, illustrating the potential for foundation models to autonomously develop sophisticated solutions and evolve from mere tool users into tool makers.
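A minimal sketch of the first of these examples is given below, assuming a hypothetical `get_daily_temperature` endpoint: the model-authored wrapper composes repeated calls to an existing API into a higher-level tool.

```python
# Sketch of encapsulating an existing API into a stronger tool
# (hypothetical get_daily_temperature endpoint; stubbed for illustration).
from datetime import date, timedelta
from statistics import mean

def get_daily_temperature(city: str, day: date) -> float:
    """Stand-in for a call to an existing weather-forecast API."""
    return 20.0  # a real implementation would query the API

def average_temperature(city: str, start: date, end: date) -> float:
    """Model-authored wrapper: compose per-day API calls into a new tool."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return mean(get_daily_temperature(city, d) for d in days)

# e.g. average_temperature("Berlin", date(2023, 3, 1), date(2023, 3, 7))
```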
Creativity of AI. However, whether foundation models can exhibit genuine creativity in creating novel tools remains an open problem. This issue is important because the capacity for novel tool creation is a defining characteristic that distinguishes humans from animals [
4]. Understanding the extent of their creativity, beyond simply memorizing, composing, and interpolating between the human-made tools encountered during pre-training, is crucial for assessing their potential to contribute to the development of new tools. Such investigations may involve the development of novel evaluation metrics and benchmarks [
67], as well as the exploration of new techniques that prioritize creative problem-solving.
4.3 From General Intelligence to Personalized Intelligence
Foundation models are typically trained on a generic domain and calibrated with broadly-defined human preferences that prioritize helpfulness and harmlessness [
89,
99]. As a result, they struggle to process personal information and provide personalized assistance to users with varying needs for tool learning. User-centric and personalized natural language generation has received increasing attention in recent years [
56,
160]. Existing works cover a wide range of tasks, such as dialogue generation [
77,
80,
128,
173], machine translation [
83,
84,
157], and summarization [
159]. These methods utilize external user-specific modules, such as user embeddings and user memory modules [
156,
172], to inject the preferences, writing styles, and personal information of different users into the generated content. However, these works are often designed for specific tasks and evaluated with limited user information. How to integrate user information into general-purpose tool learning models is still under-explored. We discuss the key challenges of personalized tool learning in the following.
Aligning User Preference with Tool Manipulation. Personalized tool learning emphasizes the importance of considering user-specific information in tool manipulation. There are three main challenges: (1) heterogeneous user information modeling: in real-world scenarios, personal information can come from numerous heterogeneous sources. For instance, when using an e-mail tool, models need to consider the user’s language style from historical conversation records and gather relevant information from the user’s social networks. This requires modeling user information with diverse structures into a unified semantic space, allowing models to utilize this information jointly; (2) personalized tool planning: different users tend to have different preferences for tool planning and selection. For example, when completing a purchasing task, different users prefer different online shopping platforms. Therefore, models need to develop personalized tool execution plans based on user preferences; (3) personalized tool calling: adaptively calling tools according to the user’s preferences is also an important direction in personalized tool learning. Most tools are designed without consideration of personalized information, which requires the model to generate different inputs for tools based on the user’s preferences.
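To make challenges (2) and (3) concrete, a toy sketch follows; the profile fields, tool names, and selection rule are all hypothetical simplifications rather than a proposed method.

```python
# Minimal sketch of personalized tool planning and tool calling
# (hypothetical profile fields and tool names; illustration only).
from typing import Any, Dict

USER_PROFILE: Dict[str, Any] = {
    "preferred_shopping_platform": "platform_a",        # learned from history
    "language_style": "concise",
    "shipping_address": "<stored privately on device>",
}

def plan_purchase(item: str, profile: Dict[str, Any]) -> Dict[str, Any]:
    """Personalized tool planning: choose which tool to call and fill its
    arguments from user-specific information rather than global defaults."""
    platform = profile.get("preferred_shopping_platform", "platform_default")
    return {
        "tool": f"{platform}.search_and_order",
        "arguments": {
            "query": item,
            "ship_to": profile["shipping_address"],
            "reply_style": profile["language_style"],
        },
    }

# e.g. plan_purchase("wireless earphones", USER_PROFILE)
```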
From Reactive Systems to Proactive Systems. Currently, most foundation models are designed as reactive systems, which respond to user queries without initiating any actions on their own. A paradigm shift is underway toward proactive systems that can take action on behalf of the user. By leveraging the history of user interactions, proactive systems can continually improve their performance and tailor their responses to specific users, providing a more personalized and seamless user experience. However, the introduction of proactive systems also raises several concerns regarding their safety and ethical implications. Proactive systems can initiate actions that have unintended consequences, particularly in complex and dynamic environments. This highlights the importance of designing proactive systems with safety in mind and incorporating fail-safe mechanisms to prevent catastrophic outcomes. To address these risks and challenges, proactive systems should be designed with the ability to identify and mitigate potential risks, as well as the flexibility to adapt and respond to unexpected situations.
Privacy-Preserving Technologies. Personalized tool learning requires models to learn user preferences from private user information, which inevitably raises privacy concerns. On one hand, previous work has shown that training data extraction attacks can be applied to recover sensitive personal information from foundation models [
20], which is a critical challenge for personalized tool learning. On the other hand, models with high computational costs must be deployed on cloud servers, which requires uploading private data to the cloud to enable personalized responses. It is crucial to develop secure and trustworthy mechanisms to access and process user data while protecting user privacy. Addressing these challenges will help unlock the potential of personalized tool learning, enabling more effective and tailored tool manipulation to meet individual user needs.
4.4 Knowledge Conflicts in Tool Augmentation
Tools can be leveraged as complementary resources that augment foundation models and enhance their generation [
82], enabling models to effectively incorporate domain-specific or up-to-date knowledge. Below we first give a
brief review of prior efforts in augmenting foundation models with tools.
The most representative tool used for augmentation is the text retriever. Early endeavors resort to retrieving knowledge from local repositories to augment language generation. Researchers propose to retrieve knowledge using a
frozen knowledge retriever (e.g.,
kNN-LM [
55]), or train the retriever and the PLM in an end-to-end fashion [
39,
50,
64]. Later works have gone beyond local repositories by leveraging the entire web as the knowledge source, which allows for improved temporal generalization and higher factual accuracy [
62,
81,
103]. Instead of treating the retriever as a passive agent, researchers further demonstrate that PLMs can actively interact with a search engine like humans [
124,
136].
Apart from the retrieval tool, researchers have explored employing other tools to perform specific sub-tasks and then integrating the execution results into foundation models. For instance, Cobbe et al. [
27] train a PLM to employ a calculator to perform basic arithmetic operations. Liu et al. [
72] use a physics simulation engine (MuJoCo [
137]) to make PLMs’ reasoning grounded to the real world. Chen et al. [
23], Gao et al. [
37] propose to augment PLMs with Python interpreters. Specifically, given a complex task, PLMs first understand it and generate
programs as intermediate thoughts. After that, the execution of
programs is offloaded to Python interpreters. Nye et al. [
91] augment PLMs with a scratchpad, allowing them to emit intermediate task-solving procedures into a buffer before entering the final answer, which enhances PLMs in performing complex discrete computations.
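As a minimal sketch of this offloading pattern (not the exact pipelines of the cited works): the model emits a short program as its intermediate reasoning, and a restricted Python interpreter executes it and returns the result.

```python
# Minimal sketch of program-aided reasoning: the model writes a program as
# its intermediate "thought", and execution is offloaded to the interpreter.
# The generated program below is a stand-in for actual model output.

def solve_with_interpreter(generated_program: str) -> object:
    """Execute model-generated code in a restricted namespace and read
    back the value bound to `answer`."""
    namespace: dict = {}
    exec(generated_program, {"__builtins__": {}}, namespace)  # no builtins exposed
    return namespace.get("answer")

program = "answer = 17 * 24 + 8"          # e.g. for "What is 17 * 24 + 8?"
print(solve_with_interpreter(program))    # -> 416
```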
Knowledge Conflicts. In practice, foundation models can be augmented by a variety of knowledge sources, including model knowledge memorized from training data and augmented knowledge derived from tool execution. Nonetheless, different sources of knowledge may inevitably contain conflicts, posing a challenge to the accuracy and reliability of model generation and planning in domains such as medical assistance and legal advice. In the following, we first introduce different types of knowledge conflicts and then discuss potential solutions.
—
Conflicts between Model Knowledge and Augmented Knowledge. Conflicts arise when there are discrepancies between the model knowledge and the knowledge augmented by tools. Such conflicts result from three primary reasons: (1) the model knowledge may become outdated, as most foundation models do not frequently update their parameters over time. In contrast, most tools provide real-time responses that are not covered in pre-training data [
107]; (2) the pre-training data is typically less curated than common AI datasets and may contain false knowledge such as human misconception and false beliefs [
68]. When augmented with responses from reliable sources like Wikipedia, this false knowledge can lead to conflicts; (3) the execution results from tools can also be misleading, and it is crucial to carefully discriminate whether a knowledge source is trustworthy or not.
—
Conflicts among Augmented Knowledge from Different Tools. In practice, the controller may retrieve knowledge from multiple tools to acquire more comprehensive and precise knowledge. However, the information returned by different tools may result in conflicts for several reasons: (1) the credibility of different tools can vary significantly, meaning that not all tools are equally reliable or authoritative in all areas; (2) different tools may have biases that can influence the information they provide; (3) even tools sharing the same functionality may produce different responses due to differences in their algorithms and implementation.
Potential Solutions for Knowledge Conflicts. Knowledge conflicts can lead to a lack of explainability in model prediction, and it is crucial to guide models to integrate tool responses correctly and reliably. Research in open-domain QA has shown that small-scale models like T5 [
110] may rely too heavily on their own knowledge after being fine-tuned on a specific dataset [
74]. In contrast, more advanced foundation models like ChatGPT handle such issues far better. In Figures
8 and
9, we conduct case studies of ChatGPT (Mar 23, 2023 version) by testing its behavior when conflicts arise. We find that ChatGPT is able to correct its own belief given retrieved information and to discern knowledge conflicts among different sources. We contend that models should have the ability to distinguish and verify the reliability of various sources. To achieve this goal, we suggest the following research directions: (1)
conflict detection: models should first detect potential conflicts among different sources and flag them for further investigation; (2)
conflict resolution: after conflicts are detected, it is also important to verify sources and choose reliable ones. Meanwhile, models should also provide explanations for their generation by indicating which knowledge sources were considered and how they were incorporated into the final response.
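As a toy sketch of direction (1) together with a crude resolution heuristic, the snippet below assumes the controller has already gathered candidate answers labeled by source, with hypothetical reliability priors; it is illustrative rather than a proposed algorithm.

```python
# Minimal sketch of detecting and resolving knowledge conflicts across sources
# (hypothetical source names and reliability priors; illustration only).
from collections import defaultdict
from typing import Dict, List, Tuple

RELIABILITY = {"model_memory": 0.5, "wikipedia_api": 0.9, "news_api": 0.7}

def detect_and_resolve(answers: List[Tuple[str, str]]) -> Dict[str, object]:
    """answers: (source, answer) pairs. Flag a conflict when sources disagree,
    then prefer the answer whose supporters have the highest summed prior."""
    by_answer: Dict[str, List[str]] = defaultdict(list)
    for source, answer in answers:
        by_answer[answer].append(source)
    best = max(by_answer, key=lambda a: sum(RELIABILITY.get(s, 0.1)
                                            for s in by_answer[a]))
    return {
        "conflict_detected": len(by_answer) > 1,   # direction (1): detection
        "chosen_answer": best,                     # direction (2): resolution
        "supporting_sources": by_answer[best],     # basis for an explanation
    }

# e.g. detect_and_resolve([("model_memory", "answer A"),
#                          ("wikipedia_api", "answer B"),
#                          ("news_api", "answer B")])
```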
4.5 Open Problems
In this section, we discuss the open problems revolving around the tool-using ability of AI models. Each of these problems presents unique challenges and opportunities.
Evaluation for Tool Learning. Evaluating tool-learning models remains an open challenge due to the variety of tasks and tools involved. The key to effective evaluation lies in determining how well models can leverage external tools to enhance task performance in a meaningful manner. Several factors need careful consideration: (1) Task-Specific Performance Metrics: A fundamental challenge is identifying appropriate metrics for evaluating tool learning across different tasks. Metrics such as F1-score, accuracy, or API success rates should be used to quantify how tool integration impacts specific tasks, ensuring that tools contribute to improved outcomes. (2) Tool Utilization Efficiency: Another open question is how best to measure a model’s efficiency in using tools. Monitoring the success rate of API calls can offer insights into whether the model understands how and when to use a tool effectively, but more refined methods may be needed to capture deeper aspects of tool interaction. (3) Impact of Tool Complexity: While some tools are simple to use, others require intricate interactions or parameter generation. A key open problem is determining how tool complexity influences a model’s ability to utilize tools effectively and where improvements can be made. (4) Tool-Assisted vs. Internal Knowledge: Understanding when tool assistance provides a clear benefit over a model’s internal knowledge is crucial. Comparing tool-assisted performance with baseline results is essential for determining the actual value of tool learning. (5) Error Handling and Robustness: Finally, evaluating models on their ability to handle errors in tool usage is vital. Robustness in dealing with API failures, incomplete outputs, or unexpected responses is an area that needs further exploration, as it directly impacts the model’s reliability in real-world scenarios.
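To make factors (1) and (2) concrete, the hedged sketch below computes an API-call success rate and the accuracy gain of tool-assisted answers over a no-tool baseline; the log format and field names are assumptions, not an established benchmark interface.

```python
# Minimal sketch of two evaluation signals: API-call success rate and the
# gain of tool-assisted answers over a no-tool baseline (hypothetical log format).
from typing import Dict, List

def api_success_rate(call_logs: List[Dict]) -> float:
    """Fraction of issued tool calls that returned without error."""
    if not call_logs:
        return 0.0
    return sum(1 for c in call_logs if c.get("status") == "ok") / len(call_logs)

def tool_gain(tool_answers: List[str], baseline_answers: List[str],
              gold: List[str]) -> float:
    """Accuracy with tool access minus accuracy without it, on the same items."""
    def accuracy(preds: List[str]) -> float:
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return accuracy(tool_answers) - accuracy(baseline_answers)

# e.g. api_success_rate([{"status": "ok"}, {"status": "timeout"}])  -> 0.5
```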
Addressing these open problems is essential for advancing the field of tool learning. A comprehensive evaluation framework will not only assess task performance but also examine how efficiently and effectively models integrate tools, generalize across domains, and adapt to complex and error-prone environments.
Striking a Balance between Internalized Capabilities and External Tools. The question of whether the capabilities of these models should be primarily internalized or rely more heavily on external tools is a complex one. We have previously discussed the tool learning ability of foundation models, suggesting the possibility of developing modular architectures that seamlessly integrate with external tools to enhance their capabilities. This modular approach could enable a more flexible and customizable AI system, allowing for rapid expansion of model capabilities. However, implementing such an approach also presents challenges in terms of system complexity and the need for robust interfaces between the model and the external tools.
On the other hand, foundation models have demonstrated an increasing ability to internalize and perform various AI tasks that previously required separate tools. For example, the emerging multilingual abilities of models can reduce the need for external translation APIs [
15]. This internalization of capabilities simplifies the system architecture and reduces dependence on external tools while placing greater demands on the model’s learning and generalization abilities.
In addition to tool learning, there is also the question of whether the knowledge or factual information stored in large language models can be externalized. Individuals may not possess as much factual knowledge as these models, yet their intelligence is still significant; large language models, in contrast, devote an extensive number of parameters to storing factual information. This suggests that there is potential to externalize knowledge as well.
Determining the optimal balance between internalized capabilities and reliance on external tools, as well as the potential externalization of knowledge, remains an open question. The position of future models on the spectrum between modular and uniform architectures will likely depend on the specific task and available resources. It will require careful empirical investigation and theoretical analysis to reach meaningful conclusions.
Tool Use as a Gauge for Machine Intelligence. The ability to effectively use tools has long been considered a hallmark of human intelligence, and it is increasingly seen as an important measure of machine intelligence as well. The evaluation of tool use performance serves as a valuable indicator of an AI system’s capabilities. By assessing how well an AI system utilizes tools to accomplish specific tasks, we can better align its abilities with real-world applications and the concept of practical intelligence [
130].
However, evaluating tool use performance in AI systems is not without its challenges. To effectively evaluate the use of tools, it is crucial to develop appropriate benchmarks and evaluation metrics that reflect the complexity and diversity of real-world tool use scenarios. Additionally, a deep understanding of the cognitive processes involved in tool use is necessary, and this remains an active area of research in cognitive science and neuroscience.
In recent work, the introduction of the ToolBench framework [
108] has provided a valuable tool for evaluating the tool use capabilities of large-scale models. It offers a standardized approach to assess an AI system’s performance in utilizing various tools for problem-solving. However, even with this benchmark in place, there are still fundamental questions that need to be addressed, including understanding the limitations of tool use in AI systems, exploring the boundaries of tool integration, and investigating the potential biases or ethical considerations that may arise when relying heavily on tool use.
Despite these challenges and open questions, evaluating tool use performance remains a crucial avenue for gaining insights into the progress of AI systems. It not only helps researchers assess the ability of AI systems to assist human decision-making and collaborate effectively in problem-solving but also contributes to the development of AI applications in a wide range of real-world scenarios. Continual research and exploration in this area will further advance our understanding of machine intelligence and its alignment with human intelligence.
Towards Better Tool Use Capabilities of LLMs. Although closed-source models like OpenAI’s GPT series exhibit robust tool-use capabilities, their open-source counterparts still lag behind, primarily due to inadequate training methodologies. Currently, most efforts center on behavior cloning to instruct these models [
21,
108,
166,
168], leveraging GPT to generate tool-use trajectories for specific tasks and subsequently employing these trajectories to fine-tune open-source models. This approach is straightforward and efficacious, yet it faces challenges, notably the risk that the trained models fail to generalize to novel tools. Additionally, this method can lead to overfitting on specific task scenarios and a lack of adaptability in dynamic environments, restricting the broader applicability and scalability of the models.
Addressing these limitations, recent advancements in training methodologies have shifted toward leveraging RL to enhance LLM tool-using capabilities. For example, ETO [
129] adopts RL to further enhance the tool-use capability of LLMs. Starting from the behavior-cloned model, it gathers data from model-environment interactions, scores the resulting trajectories, and constructs both positive and negative examples. These data are then used with DPO [
109], a preference-optimization algorithm widely adopted in LLM training, resulting in substantial improvements in tool-use proficiency.
Although the integration of RL algorithms into training LLMs for tool use remains under-explored, RL is indispensable: it provides a framework that allows models to learn better behaviors through trial and error, adapting to changes and optimizing decisions based on dynamic feedback. This adaptability is crucial for developing LLMs that can effectively handle a variety of tools and environments, enhancing their practical utility and robustness in real-world applications.
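A minimal sketch of the preference step described above is given below, under strong simplifications: trajectories are reduced to scalar log-probabilities, rewards are assumed to be available, and the pairing rule is a stand-in rather than the actual ETO procedure.

```python
# Minimal sketch of an ETO-style preference step: trajectories gathered from
# environment interaction are paired by reward and optimized with a DPO-style
# loss. Names and the scalar log-prob interface are simplifying assumptions.
import math
from typing import Callable, List, Tuple

def build_pairs(trajectories: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair high-reward trajectories (winners) with low-reward ones (losers);
    assumes all trajectories answer the same task instruction."""
    ranked = sorted(trajectories, key=lambda t: t["reward"], reverse=True)
    return [(ranked[i], ranked[-(i + 1)]) for i in range(len(ranked) // 2)]

def dpo_loss(logp_policy: Callable[[dict], float],
             logp_ref: Callable[[dict], float],
             pairs: List[Tuple[dict, dict]], beta: float = 0.1) -> float:
    """DPO objective on trajectory pairs: increase the policy's preference
    for the winning trajectory relative to a frozen reference model."""
    total = 0.0
    for win, lose in pairs:
        margin = beta * ((logp_policy(win) - logp_ref(win))
                         - (logp_policy(lose) - logp_ref(lose)))
        total -= math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return total / max(len(pairs), 1)
```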
Ethical Human-Model Collaboration in Tool Use. The integration of foundation models with human labor raises critical ethical concerns that warrant careful consideration. Employing human labor in conjunction with AI systems could result in more robust and accurate knowledge. However, this approach may also conflict with the widely accepted ethical principle that “human beings should be treated as ends in themselves, and not merely as means to an end” [
54].
The ethical implications of human-model collaboration are complex and multifaceted. They involve questions about the value of human labor, the distribution of benefits and risks, and the potential for exploitation or discrimination. Moreover, they involve broader societal and cultural issues, such as the impact of AI on employment and the role of humans in a world increasingly dominated by intelligent machines.
To address these ethical concerns, it is essential for the community to establish guidelines and safeguards that prioritize human dignity and agency when integrating human labor with foundation models. This may involve setting clear boundaries on the types of tasks that can be delegated to humans, ensuring fair compensation and working conditions, and promoting transparency in the development of AI systems [
78]. Moreover, fostering collaboration between AI researchers, ethicists, policymakers, and other stakeholders is crucial to develop a comprehensive understanding of the ethical implications of human-model collaboration and to create effective regulations that safeguard human rights and dignity [
154].
Tool Learning for Code Generation. The integration of external tools with foundation models presents an important avenue for enhancing code generation, particularly when handling complex, real-world software engineering tasks. By leveraging tools like API search engines or development environments, models can overcome the limitations of memorized knowledge, ensuring more accurate and context-specific code outputs. Tool learning has the potential to shift code generation from a purely generative task to an interactive one, where models dynamically consult external resources for better decision-making. This is particularly crucial for real-world repo-level coding challenges, where understanding contextual dependencies across files and integrating code testing mechanisms are paramount. The ability of a model to actively search for and apply the correct APIs, as demonstrated by ToolCoder [
171], or to operate within more sophisticated development interfaces, like those proposed in SWE-Agent [
161] and CodeAgent [
170], reflects a broader shift in how AI can augment software development. Such approaches not only improve accuracy but also align better with human-like programming workflows, which rely heavily on external resources. Moreover, frameworks like AppWorld [
139] illustrate how tool learning can extend beyond mere API usage to orchestrate complex workflows across multiple apps, emphasizing the versatility and scalability of tool learning in diverse programming environments.
Tool Learning for Scientific Discovery. AI for science has shown great potential in various scientific scenarios, such as HyperTree Proof Search for proving Metamath theorems [
61] and protein structure prediction in structural biology [
53]. Overall, AI systems have proven effective in capturing rules and patterns from scientific data and providing hints for human researchers. Nevertheless, without training on professional scientific knowledge and reasoning abilities, the range of scientific problems that AI can solve remains limited.
Tool learning brings new solutions to this problem. Specifically, AI systems hold promise for manipulating scientific tools, playing more important roles in scientific discovery, and solving multidisciplinary problems (e.g., in mathematics, cybernetics, and materials science). For instance, MATLAB [
79] is designed for algorithm development, data visualization/analysis, and numerical computation. With MATLAB, AI systems can analyze raw materials, design algorithms, and verify assumptions by conducting simulations.
Apart from the software level, it is also possible for AI systems to manipulate practical platforms such as synthetic robots [
17], and to conduct synthetic experiments independently. Recently, Boiko et al. [
12] show the potential of this direction and build a system that uses foundation models to design, plan, and execute scientific experiments (e.g., catalyzed cross-coupling reactions). However, this approach also presents challenges in terms of the complexity of scientific problems, the need for robust and interpretable models, and the ethical and safety considerations of autonomous scientific discovery.