4.1 Safe and Trustworthy Tool Learning
Armed with external tools, AI systems can be unprecedentedly capable and human-like. Although we are eager to witness how tool learning will change our lives, it is paramount to take a step back and contemplate the underlying risks. In the spirit of responsible AI research, here we discuss the safety and trustworthiness problems of tool learning.
Adversaries. As with all other AI systems, we can foresee that external adversaries will emerge once tool learning models are deployed in the real world, and thus defending against these threats is of great significance [
42,
52,
133,
143]. Recent works suggest that large foundation models like ChatGPT are more robust to hard and adversarial examples [
134,
147], which improves their utility in the complicated real world. However, attempts to craft misleading or even harmful queries will undoubtedly persist [
102]. Moreover, due to training on massive web data, foundation models are faced with long-lasting training-time security issues in deep learning, such as backdoor attacks [
30,
59] and data poisoning attacks [
144]. In addition to models, tools could become new attack targets for adversaries. For example, attackers could maliciously modify the manual documentation or even the tools themselves (e.g., attacking a news API so that it gives biased reports) to mislead the model into erroneous outcomes. A safe and robust system requires models not only to learn to use tools but also to scrutinize, rectify, and secure them.
Governance. There is long-standing worry about the misuse of foundation models [
13]. Under the paradigm of tool learning, governance over foundation models is more urgently needed. The pertinent question at hand is
which tools should be involved? Given the countless tools human beings have manufactured, we must consider whether it is appropriate to allow models to master all of them. Certain tools, such as calculators and translators, may be deemed safe as they pose no harm to individuals. However, granting models access to the internet or permitting them to make decisions in the real world could be perilous, as they could exert negative or even dangerous influences, such as disseminating falsehoods [
167] and harming human lives. In this regard, research communities and companies need to deliberate carefully before permitting machines to master a certain tool. In addition, governance over tool usage is also a pertinent issue. As highlighted by Amodei et al. [
5], the end-to-end training paradigm in deep learning does not regulate how models achieve their objectives. Foundation models are not only expected to finish tasks with the help of tools but also should follow the regulations and constraints of tool usage.
Trustworthiness. The goal of tool learning lies in creating advanced intelligent agents. However, determining whether these agents are trustworthy or not is a complex challenge. Even though tool learning delivers enhanced interpretability and robustness, the core foundation models are still considered “black boxes”. Recent research [
24] shows that although large models achieve better performance, they are unable to predict when they will be wrong, leaving the calibration problem unresolved. When equipped with tools, the circumstances under which a model will call on them are also unpredictable. Therefore, it is essential to thoroughly discuss to what extent we should allow AI to engage in human lives. Moreover, the morality of foundation models has emerged as a contentious issue in recent times. Despite OpenAI’s commendable efforts to imbue InstructGPT [
99] and GPT-4 [
93] with human values and preferences, given the discomforting “jailbreak” responses by ChatGPT [
14] and New Bing [
114], whether these large models will remain mild and compliant is doubtful. Once models can learn actively from the world via tools, the challenge of controlling their actions will become more daunting than ever before.
4.2 From Tool User to Tool Maker: AI’s Evolutionary Role
Throughout the annals of human civilization, the evolution of tools has occupied a pivotal position [
57,
86]. The progression of human civilization is inextricably intertwined with this evolution, and human beings have been the creators and users of almost all tools from the Stone Age to the 21st century. Although we take this for granted, things change when foundation models are involved: given that they have demonstrated tool-use capabilities to a certain extent, it is also possible to involve them in the lifecycle of tool creation.
Tools for AI. Humans create tools to satisfy our own needs, so their design naturally suits human preferences and convenience. However, current tools may not be optimal for models: most are specifically designed for human use, while models process information in a different way. Therefore, it is necessary to create tools that are specifically suited for models. Possible solutions include: (1) modularity: decomposing tools into smaller, more modular units makes them more adaptable and flexible for AI models, allowing models to use these components in a more fine-grained and compositional manner; (2) new input and output formats: developing input and output formats specifically tailored to the needs of AI models can improve how models interact with and utilize tools, enabling more seamless integration and communication between models and tools.
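As a toy illustration of solution (1), the sketch below (Python, with hypothetical names and stubbed implementations) exposes a tool as small, composable operations behind a machine-readable schema that a model can call in a structured way; the schema itself hints at solution (2), a model-oriented input/output format rather than a human-facing interface.

```python
# A minimal sketch of a model-oriented tool interface (all names hypothetical).
# Instead of a human-facing GUI, each capability is exposed as a small,
# composable operation described by a machine-readable schema.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class ToolOp:
    name: str
    description: str              # natural-language hint for the model
    input_schema: Dict[str, str]  # argument name -> type, e.g. {"city": "str"}
    fn: Callable[..., Any]        # the underlying executable

@dataclass
class ModularTool:
    name: str
    ops: Dict[str, ToolOp] = field(default_factory=dict)

    def register(self, op: ToolOp) -> None:
        self.ops[op.name] = op

    def call(self, op_name: str, **kwargs: Any) -> Any:
        # The model emits structured calls such as ("forecast", {"lat": ..., "lon": ...}).
        return self.ops[op_name].fn(**kwargs)

# Example: a weather tool decomposed into fine-grained, composable units.
weather = ModularTool("weather")
weather.register(ToolOp("geocode", "Resolve a city name to coordinates",
                        {"city": "str"}, lambda city: (52.5, 13.4)))
weather.register(ToolOp("forecast", "Hourly temperatures at given coordinates",
                        {"lat": "float", "lon": "float"},
                        lambda lat, lon: [18.0, 19.5, 21.0]))
lat, lon = weather.call("geocode", city="Berlin")
temps = weather.call("forecast", lat=lat, lon=lon)
```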
Tools by AI. The creation and utilization of tools have traditionally been considered exclusive to human intelligence. However, increasing evidence indicates that the ability to create advanced tools is no longer limited to human beings. For instance, large code models [
22] can generate executable programs from language descriptions. These programs can be deemed tools that help accomplish specific tasks. Besides writing code from scratch, foundation models can also encapsulate existing tools into stronger ones. In Figure
7, we showcase ChatGPT’s ability to encapsulate existing APIs into more advanced functions. The first example shows how it extends a weather forecast API to compute the average temperature over a specified period, while the second integrates external stock market data and trends for investment recommendations. The third example highlights a fully automated medical diagnosis system that not only performs diagnostic tests but also adapts treatment plans based on real-time vitals. These examples underscore the transition from tool usage to tool creation, illustrating the potential for foundation models to autonomously develop sophisticated solutions and evolve from mere tool users into tool makers.
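A minimal sketch of the first of these examples is given below, assuming a hypothetical `get_daily_temperature` endpoint: the model-authored wrapper composes repeated calls to an existing API into a higher-level tool.

```python
# Sketch of encapsulating an existing API into a stronger tool
# (hypothetical get_daily_temperature endpoint; stubbed for illustration).
from datetime import date, timedelta
from statistics import mean

def get_daily_temperature(city: str, day: date) -> float:
    """Stand-in for a call to an existing weather-forecast API."""
    return 20.0  # a real implementation would query the API

def average_temperature(city: str, start: date, end: date) -> float:
    """Model-authored wrapper: compose per-day API calls into a new tool."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return mean(get_daily_temperature(city, d) for d in days)

# e.g. average_temperature("Berlin", date(2023, 3, 1), date(2023, 3, 7))
```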
Creativity of AI. However, whether foundation models can exhibit genuine creativity in creating novel tools remains an open problem. This issue is important because the capacity for novel tool creation is a defining characteristic that distinguishes humans from animals [
4]. Understanding the extent of their creativity, beyond simply memorizing, composing, and interpolating between the human-made tools encountered during pre-training, is crucial for assessing their potential to contribute to the development of new tools. Such investigations may involve the development of novel evaluation metrics and benchmarks [
67], as well as the exploration of new techniques that prioritize creative problem-solving.
4.3 From General Intelligence to Personalized Intelligence
Foundation models are typically trained on a generic domain and calibrated with broadly-defined human preferences that prioritize helpfulness and harmlessness [
89,
99]. As a result, they struggle to process personal information and provide personalized assistance to users with varying needs for tool learning. User-centric and personalized natural language generation has received increasing attention in recent years [
56,
160]. Existing works cover a wide range of tasks, such as dialogue generation [
77,
80,
128,
173], machine translation [
83,
84,
157], and summarization [
159]. These methods utilize external user-specific modules, such as user embeddings and user memory modules [
156,
172], to inject the preferences, writing styles, and personal information of different users into the generated content. However, these works are often designed for specific tasks and evaluated with limited user information. How to integrate user information into general-purpose tool learning models is still under-explored. We discuss the key challenges of personalized tool learning in the following.
Aligning User Preference with Tool Manipulation. Personalized tool learning emphasizes the importance of considering user-specific information in tool manipulation. There are three main challenges: (1) heterogeneous user information modeling: in real-world scenarios, personal information can come from numerous heterogeneous sources. For instance, when using an e-mail tool, models need to consider the user’s language style from historical conversation records and gather relevant information from the user’s social networks. This requires modeling user information with diverse structures into a unified semantic space, allowing models to utilize this information jointly; (2) personalized tool planning: different users tend to have different preferences for tool planning and selection. For example, when completing a purchasing task, different users prefer different online shopping platforms. Therefore, models need to develop personalized tool execution plans based on user preferences; (3) personalized tool calling: adaptively calling tools according to the user’s preferences is also an important direction in personalized tool learning. Most tools are designed without consideration of personalized information, which requires the model to generate different inputs for tools based on the user’s preferences.
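To make challenges (2) and (3) concrete, a toy sketch follows; the profile fields, tool names, and selection rule are all hypothetical simplifications rather than a proposed method.

```python
# Minimal sketch of personalized tool planning and tool calling
# (hypothetical profile fields and tool names; illustration only).
from typing import Any, Dict

USER_PROFILE: Dict[str, Any] = {
    "preferred_shopping_platform": "platform_a",        # learned from history
    "language_style": "concise",
    "shipping_address": "<stored privately on device>",
}

def plan_purchase(item: str, profile: Dict[str, Any]) -> Dict[str, Any]:
    """Personalized tool planning: choose which tool to call and fill its
    arguments from user-specific information rather than global defaults."""
    platform = profile.get("preferred_shopping_platform", "platform_default")
    return {
        "tool": f"{platform}.search_and_order",
        "arguments": {
            "query": item,
            "ship_to": profile["shipping_address"],
            "reply_style": profile["language_style"],
        },
    }

# e.g. plan_purchase("wireless earphones", USER_PROFILE)
```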
From Reactive Systems to Proactive Systems. Currently, most foundation models are designed as reactive systems, which respond to user queries without initiating any actions on their own. A paradigm shift is underway toward proactive systems that can take action on behalf of the user. By leveraging the history of user interactions, proactive systems can continually improve their performance and tailor their responses to specific users, providing a more personalized and seamless user experience. However, the introduction of proactive systems also raises several concerns regarding their safety and ethical implications. Proactive systems can initiate actions that have unintended consequences, particularly in complex and dynamic environments. This highlights the importance of designing proactive systems with safety in mind and incorporating fail-safe mechanisms to prevent catastrophic outcomes. To address these risks and challenges, proactive systems should be designed with the ability to identify and mitigate potential risks, as well as the flexibility to adapt and respond to unexpected situations.
Privacy-Preserving Technologies. Personalized tool learning requires models to learn user preferences from private user information, which inevitably raises privacy concerns. On one hand, previous work has shown that training data extraction attacks can be applied to recover sensitive personal information from foundation models [
20], which is a critical challenge for personalized tool learning. On the other hand, models with high computational costs must be deployed on cloud servers, which requires uploading private data to the cloud to enable personalized responses. It is crucial to develop secure and trustworthy mechanisms to access and process user data while protecting user privacy. Addressing these challenges will help unlock the potential of personalized tool learning, enabling more effective and tailored tool manipulation to meet individual user needs.
4.4 Knowledge Conflicts in Tool Augmentation
Tools can be leveraged as complementary resources that augment foundation models and enhance their generation [
82], enabling models to effectively incorporate domain-specific or up-to-date knowledge. Below we first give a
brief review of prior efforts in augmenting foundation models with tools.
The most representative tool used for augmentation is the text retriever. Early endeavors resort to retrieving knowledge from local repositories to augment language generation. Researchers propose to retrieve knowledge using a
frozen knowledge retriever (e.g.,
kNN-LM [
55]), or train the retriever and the PLM in an end-to-end fashion [
39,
50,
64]. Later works have gone beyond local repositories by leveraging the entire web as the knowledge source, which allows for improved temporal generalization and higher factual accuracy [
62,
81,
103]. Instead of treating the retriever as a passive agent, researchers further demonstrate that PLMs can actively interact with a search engine like humans [
124,
136].
Apart from the retrieval tool, researchers have explored employing other tools to perform specific sub-tasks and then integrating the execution results into foundation models. For instance, Cobbe et al. [
27] train a PLM to employ a calculator to perform basic arithmetic operations. Liu et al. [
72] use a physics simulation engine (MuJoCo [
137]) to make PLMs’ reasoning grounded to the real world. Chen et al. [
23], Gao et al. [
37] propose to augment PLMs with Python interpreters. Specifically, given a complex task, PLMs first understand it and generate
programs as intermediate thoughts. After that, the execution of
programs is offloaded to Python interpreters. Nye et al. [
91] augment PLMs with a scratchpad, allowing them to emit intermediate task-solving procedures into a buffer before entering the final answer, which enhances PLMs in performing complex discrete computations.
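As a minimal sketch of this offloading pattern (not the exact pipelines of the cited works): the model emits a short program as its intermediate reasoning, and a restricted Python interpreter executes it and returns the result.

```python
# Minimal sketch of program-aided reasoning: the model writes a program as
# its intermediate "thought", and execution is offloaded to the interpreter.
# The generated program below is a stand-in for actual model output.

def solve_with_interpreter(generated_program: str) -> object:
    """Execute model-generated code in a restricted namespace and read
    back the value bound to `answer`."""
    namespace: dict = {}
    exec(generated_program, {"__builtins__": {}}, namespace)  # no builtins exposed
    return namespace.get("answer")

program = "answer = 17 * 24 + 8"          # e.g. for "What is 17 * 24 + 8?"
print(solve_with_interpreter(program))    # -> 416
```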
Knowledge Conflicts. In practice, foundation models can be augmented by a variety of knowledge sources, including model knowledge memorized from training data and augmented knowledge derived from tool execution. Nonetheless, different sources of knowledge may inevitably contain conflicts, posing a challenge to the accuracy and reliability of model generation and planning in domains such as medical assistance and legal advice. In the following, we first introduce different types of knowledge conflicts and then discuss potential solutions.
—
Conflicts between Model Knowledge and Augmented Knowledge. Conflicts arise when there are discrepancies between the model knowledge and the knowledge augmented by tools. Such conflicts result from three primary reasons: (1) the model knowledge may become outdated, as most foundation models do not frequently update their parameters over time. In contrast, most tools provide real-time responses that are not covered in pre-training data [
107]; (2) the pre-training data is typically less curated than common AI datasets and may contain false knowledge such as human misconception and false beliefs [
68]. When augmented with responses from reliable sources like Wikipedia, this false knowledge can lead to conflicts; (3) the execution results from tools can also be misleading, and it is crucial to carefully discriminate whether a knowledge source is trustworthy or not.
—
Conflicts among Augmented Knowledge from Different Tools. In practice, the controller may retrieve knowledge from multiple tools to acquire more comprehensive and precise knowledge. However, the information returned by different tools may result in conflicts for several reasons: (1) the credibility of different tools can vary significantly, meaning that not all tools are equally reliable or authoritative in all areas; (2) different tools may have biases that can influence the information they provide; (3) even tools sharing the same functionality may produce different responses due to differences in their algorithms and implementation.
Potential Solutions for Knowledge Conflicts. Knowledge conflicts can lead to a lack of explainability in model prediction, and it is crucial to guide models to integrate tool responses correctly and reliably. Research in open-domain QA has shown that small-scale models like T5 [
110] may rely too heavily on their own knowledge after being fine-tuned on a specific dataset [
74]. In contrast, more advanced foundation models like ChatGPT handle such issues far better. In Figures
8 and
9, we conduct case studies of ChatGPT (Mar 23, 2023 version) by testing its behavior when conflicts arise. We find that ChatGPT is able to correct its own belief given retrieved information and to discern knowledge conflicts among different sources. We contend that models should have the ability to distinguish and verify the reliability of various sources. To achieve this goal, we suggest the following research directions: (1)
conflict detection: models should first detect potential conflicts among different sources and flag them for further investigation; (2)
conflict resolution: after conflicts are detected, it is also important to verify sources and choose reliable ones. Meanwhile, models should also provide explanations for their generation by indicating which knowledge sources were considered and how they were incorporated into the final response.
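As a toy sketch of direction (1) together with a crude resolution heuristic, the snippet below assumes the controller has already gathered candidate answers labeled by source, with hypothetical reliability priors; it is illustrative rather than a proposed algorithm.

```python
# Minimal sketch of detecting and resolving knowledge conflicts across sources
# (hypothetical source names and reliability priors; illustration only).
from collections import defaultdict
from typing import Dict, List, Tuple

RELIABILITY = {"model_memory": 0.5, "wikipedia_api": 0.9, "news_api": 0.7}

def detect_and_resolve(answers: List[Tuple[str, str]]) -> Dict[str, object]:
    """answers: (source, answer) pairs. Flag a conflict when sources disagree,
    then prefer the answer whose supporters have the highest summed prior."""
    by_answer: Dict[str, List[str]] = defaultdict(list)
    for source, answer in answers:
        by_answer[answer].append(source)
    best = max(by_answer, key=lambda a: sum(RELIABILITY.get(s, 0.1)
                                            for s in by_answer[a]))
    return {
        "conflict_detected": len(by_answer) > 1,   # direction (1): detection
        "chosen_answer": best,                     # direction (2): resolution
        "supporting_sources": by_answer[best],     # basis for an explanation
    }

# e.g. detect_and_resolve([("model_memory", "answer A"),
#                          ("wikipedia_api", "answer B"),
#                          ("news_api", "answer B")])
```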
4.5 Open Problems
In this section, we discuss the open problems revolving around the tool-using ability of AI models. Each of these problems presents unique challenges and opportunities.
Evaluation for Tool Learning. Evaluating tool-learning models remains an open challenge due to the variety of tasks and tools involved. The key to effective evaluation lies in determining how well models can leverage external tools to enhance task performance in a meaningful manner. Several factors need careful consideration: (1) Task-Specific Performance Metrics: A fundamental challenge is identifying appropriate metrics for evaluating tool learning across different tasks. Metrics such as F1-score, accuracy, or API success rates should be used to quantify how tool integration impacts specific tasks, ensuring that tools contribute to improved outcomes. (2) Tool Utilization Efficiency: Another open question is how best to measure a model’s efficiency in using tools. Monitoring the success rate of API calls can offer insights into whether the model understands how and when to use a tool effectively, but more refined methods may be needed to capture deeper aspects of tool interaction. (3) Impact of Tool Complexity: While some tools are simple to use, others require intricate interactions or parameter generation. A key open problem is determining how tool complexity influences a model’s ability to utilize tools effectively and where improvements can be made. (4) Tool-Assisted vs. Internal Knowledge: Understanding when tool assistance provides a clear benefit over a model’s internal knowledge is crucial. Comparing tool-assisted performance with baseline results is essential for determining the actual value of tool learning. (5) Error Handling and Robustness: Finally, evaluating models on their ability to handle errors in tool usage is vital. Robustness in dealing with API failures, incomplete outputs, or unexpected responses is an area that needs further exploration, as it directly impacts the model’s reliability in real-world scenarios.
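To make factors (1) and (2) concrete, the hedged sketch below computes an API-call success rate and the accuracy gain of tool-assisted answers over a no-tool baseline; the log format and field names are assumptions, not an established benchmark interface.

```python
# Minimal sketch of two evaluation signals: API-call success rate and the
# gain of tool-assisted answers over a no-tool baseline (hypothetical log format).
from typing import Dict, List

def api_success_rate(call_logs: List[Dict]) -> float:
    """Fraction of issued tool calls that returned without error."""
    if not call_logs:
        return 0.0
    return sum(1 for c in call_logs if c.get("status") == "ok") / len(call_logs)

def tool_gain(tool_answers: List[str], baseline_answers: List[str],
              gold: List[str]) -> float:
    """Accuracy with tool access minus accuracy without it, on the same items."""
    def accuracy(preds: List[str]) -> float:
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return accuracy(tool_answers) - accuracy(baseline_answers)

# e.g. api_success_rate([{"status": "ok"}, {"status": "timeout"}])  -> 0.5
```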
Addressing these open problems is essential for advancing the field of tool learning. A comprehensive evaluation framework will not only assess task performance but also examine how efficiently and effectively models integrate tools, generalize across domains, and adapt to complex and error-prone environments.
Striking a Balance between Internalized Capabilities and External Tools. The question of whether the capabilities of these models should be primarily internalized or rely more heavily on external tools is a complex one. We have previously discussed the tool learning ability of foundation models, suggesting the possibility of developing modular architectures that seamlessly integrate with external tools to enhance their capabilities. This modular approach could enable a more flexible and customizable AI system, allowing for rapid expansion of model capabilities. However, implementing such an approach also presents challenges in terms of system complexity and the need for robust interfaces between the model and the external tools.
On the other hand, foundation models have demonstrated an increasing ability to internalize and perform various AI tasks that previously required separate tools. For example, the emerging multilingual abilities of models can reduce the need for external translation APIs [
15]. This internalization of capabilities simplifies the system architecture and reduces dependence on external tools while placing greater demands on the model’s learning and generalization abilities.
In addition to tool learning, there is also the question of whether the knowledge or factual information stored in large language models can be externalized. Individuals may not possess as much factual knowledge as these models, yet their intelligence is still significant; large language models, in contrast, devote an extensive number of parameters to storing factual information. This suggests that there is potential to externalize knowledge as well.
Determining the optimal balance between internalized capabilities and reliance on external tools, as well as the potential externalization of knowledge, remains an open question. The position of future models on the spectrum between modular and uniform architectures will likely depend on the specific task and available resources. It will require careful empirical investigation and theoretical analysis to reach meaningful conclusions.
Tool Use as a Gauge for Machine Intelligence. The ability to effectively use tools has long been considered a hallmark of human intelligence, and it is increasingly seen as an important measure of machine intelligence as well. The evaluation of tool use performance serves as a valuable indicator of an AI system’s capabilities. By assessing how well an AI system utilizes tools to accomplish specific tasks, we can better align its abilities with real-world applications and the concept of practical intelligence [
130].
However, evaluating tool use performance in AI systems is not without its challenges. To effectively evaluate the use of tools, it is crucial to develop appropriate benchmarks and evaluation metrics that reflect the complexity and diversity of real-world tool use scenarios. Additionally, a deep understanding of the cognitive processes involved in tool use is necessary, and this remains an active area of research in cognitive science and neuroscience.
In recent work, the introduction of the ToolBench framework [
108] has provided a valuable tool for evaluating the tool use capabilities of large-scale models. It offers a standardized approach to assess an AI system’s performance in utilizing various tools for problem-solving. However, even with this benchmark in place, there are still fundamental questions that need to be addressed, including understanding the limitations of tool use in AI systems, exploring the boundaries of tool integration, and investigating the potential biases or ethical considerations that may arise when relying heavily on tool use.
Despite these challenges and open questions, evaluating tool use performance remains a crucial avenue for gaining insights into the progress of AI systems. It not only helps researchers assess the ability of AI systems to assist human decision-making and collaborate effectively in problem-solving but also contributes to the development of AI applications in a wide range of real-world scenarios. Continual research and exploration in this area will further advance our understanding of machine intelligence and its alignment with human intelligence.
Towards Better Tool Use Capabilities of LLMs. Although closed-source models like OpenAI’s GPT series exhibit robust tool-use capabilities, their open-source counterparts still lag behind, primarily due to inadequate training methodologies. Currently, most efforts center on behavior cloning to instruct these models [
21,
108,
166,
168], leveraging GPT to generate tool-use trajectories for specific tasks and subsequently employing these trajectories to fine-tune open-source models. This approach is straightforward and efficacious, yet it faces challenges, notably the risk that the trained models fail to generalize to novel tools. Additionally, this method can lead to overfitting on specific task scenarios and a lack of adaptability in dynamic environments, restricting the broader applicability and scalability of the models.
Addressing these limitations, recent advancements in training methodologies have shifted toward leveraging RL to enhance LLM tool-using capabilities. For example, ETO [
129] adopts RL to further enhance the tool-use capability of LLMs. Starting from the behavior-cloned model, it gathers data from model-environment interactions, scores the resulting trajectories, and constructs both positive and negative examples. These data are then used with DPO [
109], a preference-optimization algorithm widely adopted in LLM training, resulting in substantial improvements in tool-use proficiency.
Although the integration of RL algorithms into training LLMs for tool use remains under-explored, RL is indispensable: it provides a framework that allows models to learn better behaviors through trial and error, adapting to changes and optimizing decisions based on dynamic feedback. This adaptability is crucial for developing LLMs that can effectively handle a variety of tools and environments, enhancing their practical utility and robustness in real-world applications.
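A minimal sketch of the preference step described above is given below, under strong simplifications: trajectories are reduced to scalar log-probabilities, rewards are assumed to be available, and the pairing rule is a stand-in rather than the actual ETO procedure.

```python
# Minimal sketch of an ETO-style preference step: trajectories gathered from
# environment interaction are paired by reward and optimized with a DPO-style
# loss. Names and the scalar log-prob interface are simplifying assumptions.
import math
from typing import Callable, List, Tuple

def build_pairs(trajectories: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair high-reward trajectories (winners) with low-reward ones (losers);
    assumes all trajectories answer the same task instruction."""
    ranked = sorted(trajectories, key=lambda t: t["reward"], reverse=True)
    return [(ranked[i], ranked[-(i + 1)]) for i in range(len(ranked) // 2)]

def dpo_loss(logp_policy: Callable[[dict], float],
             logp_ref: Callable[[dict], float],
             pairs: List[Tuple[dict, dict]], beta: float = 0.1) -> float:
    """DPO objective on trajectory pairs: increase the policy's preference
    for the winning trajectory relative to a frozen reference model."""
    total = 0.0
    for win, lose in pairs:
        margin = beta * ((logp_policy(win) - logp_ref(win))
                         - (logp_policy(lose) - logp_ref(lose)))
        total -= math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return total / max(len(pairs), 1)
```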
Ethical Human-Model Collaboration in Tool Use. The integration of foundation models with human labor raises critical ethical concerns that warrant careful consideration. Employing human labor in conjunction with AI systems could result in more robust and accurate knowledge. However, this approach may also conflict with the widely accepted ethical principle that “human beings should be treated as ends in themselves, and not merely as means to an end” [
54].
The ethical implications of human-model collaboration are complex and multifaceted. They involve questions about the value of human labor, the distribution of benefits and risks, and the potential for exploitation or discrimination. Moreover, they involve broader societal and cultural issues, such as the impact of AI on employment and the role of humans in a world increasingly dominated by intelligent machines.
To address these ethical concerns, it is essential for the community to establish guidelines and safeguards that prioritize human dignity and agency when integrating human labor with foundation models. This may involve setting clear boundaries on the types of tasks that can be delegated to humans, ensuring fair compensation and working conditions, and promoting transparency in the development of AI systems [
78]. Moreover, fostering collaboration between AI researchers, ethicists, policymakers, and other stakeholders is crucial to develop a comprehensive understanding of the ethical implications of human-model collaboration and to create effective regulations that safeguard human rights and dignity [
154].
Tool Learning for Code Generation. The integration of external tools with foundation models presents an important avenue for enhancing code generation, particularly when handling complex, real-world software engineering tasks. By leveraging tools like API search engines or development environments, models can overcome the limitations of memorized knowledge, ensuring more accurate and context-specific code outputs. Tool learning has the potential to shift code generation from a purely generative task to an interactive one, where models dynamically consult external resources for better decision-making. This is particularly crucial for real-world repo-level coding challenges, where understanding contextual dependencies across files and integrating code testing mechanisms are paramount. The ability of a model to actively search for and apply the correct APIs, as demonstrated by ToolCoder [
171], or to operate within more sophisticated development interfaces, like those proposed in SWE-Agent [
161] and CodeAgent [
170], reflects a broader shift in how AI can augment software development. Such approaches not only improve accuracy but also align better with human-like programming workflows, which rely heavily on external resources. Moreover, frameworks like AppWorld [
139] illustrate how tool learning can extend beyond mere API usage to orchestrate complex workflows across multiple apps, emphasizing the versatility and scalability of tool learning in diverse programming environments.
Tool Learning for Scientific Discovery. AI for science has shown great potential in various scientific scenarios, such as HyperTree Proof Search for proving Metamath theorems [
61] and protein structure prediction in structural biology [
53]. Overall, AI systems have proven effective in capturing rules and patterns from scientific data and providing hints for human researchers. Nevertheless, without training on professional scientific knowledge and reasoning abilities, the range of scientific problems that AI can solve remains limited.
Tool learning brings new solutions to this problem. Specifically, AI systems hold promise for manipulating scientific tools, playing more important roles in scientific discovery, and solving multidisciplinary problems (e.g., in mathematics, cybernetics, and materials science). For instance, MATLAB [
79] is designed for algorithm development, data visualization/analysis, and numerical computation. With MATLAB, AI systems can analyze raw materials, design algorithms, and verify assumptions by conducting simulations.
Apart from the software level, it is also possible for AI systems to manipulate practical platforms such as synthetic robots [
17], and to conduct synthetic experiments independently. Recently, Boiko et al. [
12] show the potential of this direction and build a system that uses foundation models to design, plan, and execute scientific experiments (e.g., catalyzed cross-coupling reactions). However, this approach also presents challenges in terms of the complexity of scientific problems, the need for robust and interpretable models, and the ethical and safety considerations of autonomous scientific discovery.