Open access

Robotic Vision for Human-Robot Interaction and Collaboration: A Survey and Systematic Review

Published: 16 February 2023

Abstract

Robotic vision, otherwise known as computer vision for robots, is a critical process for robots to collect and interpret detailed information related to human actions, goals, and preferences, enabling robots to provide more useful services to people. This survey and systematic review presents a comprehensive analysis of robotic vision in human-robot interaction and collaboration (HRI/C) over the past 10 years. From a detailed search of 3,850 articles, systematic extraction and evaluation were used to identify and explore 310 papers in depth. These papers described robots with some level of autonomy using robotic vision for locomotion, manipulation, and/or visual communication to collaborate or interact with people. This article provides an in-depth analysis of current trends, common domains, methods and procedures, technical processes, datasets and models, experimental testing, sample populations, performance metrics, and future challenges. Robotic vision was often used in action and gesture recognition, robot movement in human spaces, object handover and collaborative actions, social communication, and learning from demonstration. Few high-impact and novel techniques from the computer vision field had been translated into HRI/C. Overall, notable advancements have been made on how to develop and deploy robots to assist people.

1 Introduction

This article presents a comprehensive survey and review of robotic vision methods for Human-Robot Interaction and Collaboration (HRI/C), based on a review of 3,850 articles that yielded a collection of 310 eligible articles for in-depth analysis. The selected 310 published papers examine how robotic vision is used to facilitate human-robot interaction tasks such as robot navigation in human spaces, social interaction with people to exchange information, and human-robot handovers of everyday objects. Combining a systematic review, which quantifies trends and prevalence, with a comprehensive survey in each section helps to surface emerging patterns, statistical trends, and recommendations on how robotic vision can improve HRI/C.

1.1 Purpose

The purpose of this systematic review and survey was to provide detailed insight into the underlying emergent research themes pursued by the community, and to explore the trajectory and impact that robotic vision will have on enabling robots to better interact and collaborate with humans. For the purpose of this work, robotic vision is defined as computer vision used to inform or direct a robot on what actions to perform in order to achieve a chosen goal. In practice, robotic vision can enable robots to sense, perceive, and respond to people by capturing and responding to a rich, continuous information stream. Visual information provided by humans can help robots better understand a scenario and plan their actions, such as interpreting hand gesture movements as communication signals and human body movements as an indication of future intent to perform an action. Robots with robotic vision can therefore help to create and facilitate an important information exchange between the human and the robot, opening up new communication channels that are natural and intuitive to people and improving the effectiveness of collaborative tasks.

1.2 Scope

This survey and review explored published papers from the past 10 years using a systematic search, screen, and evaluation protocol to extract a general overview of current research trends, common applications and domains, methods and procedures, technical processes, relevant datasets and models, experimental testing setups, sample populations, vision algorithm metrics, and performance evaluations. To create the systematic search strategy, several key parameters needed to be defined before commencing the extraction and evaluation of papers. First, given the extensive scope of reviewing all relevant papers in the broad field of robotic vision for HRI/C, this review focused on the past 10 years (i.e., 2010–2020). This time frame helped to showcase the more contemporary use of robotic vision based on newly emergent techniques, and was chosen to coincide with the introduction of critical camera hardware that boosted the applied use of robotic vision to enable robots to be more reactive and suitable for human interaction, such as the release date of the Kinect camera [173].

1.3 Related Surveys and Systematic Reviews

To the best of our knowledge, no systematic review or comprehensive survey on the development and use of robotic vision for HRI/C had been conducted. Current surveys or reviews have focused on other areas, such as specific domains including robotics in industry [183, 436], agriculture [429], public areas [140], healthcare [365], and education [43]. All of these works described different robots or methods relevant to the domain of interest, including describing robot types and use cases that did not use robotic vision. Other reviews or surveys focused on components related to the process of HRI/C, such as the use of physical touch and tactile sensing techniques [22], safety bounds for vision-based safety systems [171, 480], trust modeling and trust-related factors [174, 218], distance between humans and robots [243], and the use of non-verbal communication [377]. There have also been other published works on specific methods used in human-robot interaction that did not have a direct focus on vision, such as tests with psycho-physiological measures [47], exploring general robot perception methods [462], investigating a single robot platform [413], or a specific form of robot behavior [295].
The computer vision field has contributed detailed surveys and reviews that show the technical process for computer vision related to humans, such as gesture-based human-machine interaction [399] and multi-modal machine collaboration with a focus on body, gesture, gaze, and affective interaction [200]. Others have explored more detailed and specific use cases such as action recognition [486], hand gesture recognition [256, 357, 456], and human motion capture [302]. There have also been detailed surveys and reviews that explored vision-based techniques in robots, for instance, reviews or surveys covering recent developments in robotic vision techniques [82], learning for robotic vision [381], and the use of computer vision for a specific type of robot, such as aerial robots [261]. Other surveys and reviews instead gave more general overviews of vision for robots, such as object recognition and modeling, site reconstruction and inspection, robotic manipulation, localization, path following, map construction, autonomous navigation, and exploration [53, 83]. Others also included a brief mention of applications to people but did not provide a detailed analysis of how this could better facilitate HRI/C across different technique types. Taken together, the identified surveys and reviews provided an excellent commentary on their respective fields and target areas, but few works presented a detailed investigation into robotic vision techniques, hardware integration, and evaluation of their use in real-world scenarios for HRI/C.

1.4 Contribution

The contribution of this work is the systematic extraction, discovery, and detailed synthesis of the literature to showcase the current use of robotic vision for robots that can interact and collaborate with people. This survey and systematic review contributes new knowledge on how robots can be improved by integrating and refining functionality related to robotic vision, showcases real-world use of robots with vision capabilities to improve collaborative outcomes, and provides a critical discussion to help push the field forward.

2 Background

2.1 A Brief History of the Field of Computer Vision

Computer vision is important to help machines better understand and interact with the real world, making relevant actions and decisions based on visual information [98]. Common sensors in computer vision include RGB (red, green, and blue wavelength) cameras that provide detailed information by capturing light in RGB to create a color representation of the world. The use of visual information to understand the world can help emulate how humans perceive the world, creating a common language and understanding between humans and robots when sharing details, objects, and task-related information. Computer vision involves techniques such as object detection to localize where an object is in the scene, image classification to determine what is in the image, and pixel-level classification to determine which parts of the image belong to an area of interest [98, 138]. In relation to computer vision for humans, computer vision can address the detection and analysis of humans in visual scenes, including methods such as face detection [438], pose estimation [64], and human motion tracking [182]. This type of visual information can then further assist in creating shared knowledge and understanding between humans and machines.
The field of computer vision has evolved rapidly in the past decade from 2010 to 2020. Deep learning has played a dominant role since its success at the 2012 ImageNet competition [227]. Learning complex parameterized functions from data has also served to make computer vision algorithms more robust and effective in real-world situations, making it ideal for the field of HRI/C. However, this comes at the expense of increased hardware requirements and longer development time, such as the need for data collection, labeling, and network training. Another significant change at the start of this period was the advent of more readily available RGB-D sensors with the Microsoft Kinect camera released in 2010 [173]. This allowed researchers to reason about color and 2.5D geometry jointly, facilitating new breakthroughs such as real-time 3D reconstruction [197].
In the decades before this period, computer vision had several major successes relevant to HRI/C. The first was the codification of the principles of multiple-view geometry [175] and their successful application in large-scale reconstruction tasks [7] using the techniques of structure from motion. The period was also marked by increasingly sophisticated handcrafted features, such as SIFT [266] and HOG (histogram of oriented gradients) [103], and by more powerful learning algorithms, such as the kernel Support Vector Machine (SVM) [54] and AdaBoost [143], the latter used to great effect in the Viola–Jones face detector [437]. The topics of image classification, object detection, image segmentation, and optical flow received significant research attention, among many others. Some highlights include deformable part models [133, 134] that demonstrated unprecedented performance on object detection benchmarks before deep learning, conditional random fields for image segmentation [156, 226, 419], graph cuts for tasks such as stereo depth estimation [57], and variational methods for optical flow estimation [186, 268]. These approaches continue to be used in robotics and embodied vision settings due to their efficiency and low hardware requirements.
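To ground these classical techniques, the following is a minimal sketch of Viola–Jones-style face detection using the Haar cascade bundled with OpenCV; it is an illustrative example rather than the setup of any surveyed system, and the image filename is a placeholder.

# Minimal sketch: classical Viola-Jones-style face detection with an OpenCV Haar cascade.
# Illustrative only; "scene.jpg" is a placeholder path, not from any surveyed system.
import cv2

# Load the pre-trained frontal-face cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces at multiple scales; returns (x, y, w, h) boxes.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("scene_faces.jpg", image)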

2.2 A Brief History of the Field of Robotic Vision

Robotic vision, by contrast, exists at the intersection of robotics and computer vision, enabling robots to sense, perceive, and respond to people by providing rich, continuous information about human states, actions, intentions, and communication. Robotic vision involves a vision sensor (RGB, RGB-D) and supporting algorithms that translate raw images into a control signal for a robot. In other words, any computer vision technique used to guide a robot on what action to perform can be considered robotic vision. Robotic vision has benefited from advancements in the computer vision research community, such as large datasets, computing power, complex algorithms, and scientific methods. Robotic vision has started to become a key perception channel for the robot to interact with and provide assistance to people. Robotic vision has important advantages for enabling robots to interact intelligently with the environment, such as active camera control, physical movement around the space, and the capacity to adapt the viewpoint to gather further information [98, 402]. There have also been notable advances in effectively handling multi-modal data in robotic sensing, including visual processing for intelligent robot decisions and actions [344, 402]. Robotic vision can also create new opportunities for humans to interact with robots in a way that does not inhibit natural actions, such as removing the need to use a computer terminal or to wear a physical apparatus. Improvements to robots through visual perception can therefore contribute to creating more general-purpose robots, extending the range of tasks that a robot can complete for a person [381].
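To make the "raw images to control signal" definition concrete, below is a minimal sketch of a robotic-vision loop: a classical HOG-plus-SVM person detector drives a simple proportional steering command. The send_velocity() function is a hypothetical stand-in for a real robot driver, and the gains are placeholder values.

# Minimal sketch of a robotic-vision control loop: camera frame -> perception -> robot command.
import cv2

# Classical HOG + linear-SVM pedestrian detector bundled with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_person(frame):
    """Return the image-plane centre (cx, cy) of the largest detected person, or None."""
    boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8))
    if len(boxes) == 0:
        return None
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])
    return x + w / 2, y + h / 2

def send_velocity(linear, angular):
    """Placeholder actuation step: a real system would forward this to the robot driver."""
    print(f"cmd: linear={linear:.2f} m/s, angular={angular:.2f} rad/s")

cap = cv2.VideoCapture(0)              # RGB camera assumed at index 0
while cap.isOpened():                  # runs until the camera stream ends
    ok, frame = cap.read()
    if not ok:
        break
    target = detect_person(frame)
    if target is None:
        send_velocity(0.0, 0.0)        # nobody in view: stop
        continue
    cx, _ = target
    error = cx - frame.shape[1] / 2    # horizontal pixel offset from image centre
    send_velocity(0.3, -0.002 * error) # proportional steering toward the person
cap.release()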

2.3 Human-Robot Interaction and Collaboration

Human-robot interaction focuses on the interactivity between humans and robots, and often involves creating a robotic system that can identify and respond to the complexities of human behavior. For the robot to behave in socially acceptable ways, the robot should be able to sense, perceive, and respond to human states, actions, intentions, and emotions. Human-robot interaction related topics include improving robot social acuity using visual perception of the person [415]. Human-robot collaboration instead focuses on how humans and robots work together to achieve shared goals with a common purpose and directed outcome. In HRC, robots work to complement or add value to the intended goal of the human [35]. Collaboration with a robot can help to improve task speed and work productivity, reduce the number of errors, and improve human safety to minimize repetition fatigue and injuries [166, 429].

2.3.1 Robots with Computer Vision to Improve Collaborative Outcomes.

Robotic vision techniques have been used to create new interaction methods and to improve existing human-robot interaction processes, such as enabling people to communicate with the robot by signaling information or commands. These capabilities help the robot provide a more functional service to the human. Visual information captured through the robot's camera system can then be used to inform the robot's next set of actions. Examples include detecting target objects in the field of view when a human requests a specific object, understanding events and scenarios occurring in the scene for social group dynamics, and classifying human actions to offer predictive assistance [66, 132, 151, 173, 179, 258, 368]. For instance, visual information from people can help robots make informed decisions on how to interact with or assist the person, such as how to approach a person [363], how to follow a person [196], or when to offer to hand over an item [326]. Robotic vision can help to identify, classify, or predict human movements through action or activity recognition [39]. Activity recognition to perceive human movements can give robots the ability to better predict or recognize what a human is doing in the environment so that the robot can provide useful information, advice, or assistance, for example in industrial settings [367] or when multiple people are in the robot's field of view [154]. Gesture recognition has often been tested as a communication and control method, translating human pose into a command signal, an action to trigger a state change, or the start of an information exchange between the human and the robot [154, 202, 213, 388, 421]. Gestures can also signal which object the robot should use [388, 389] and where the robot should move [202]. Visual information can also support collaborative robot actions with the person, such as human-robot object handover, through perception and interpretation of humans and objects in the scene, including human reachability, motion, and collision range [237, 326]. There is significant opportunity to draw from principles and concepts of computer vision to improve robot capacity to perceive and act upon visual information for human-robot collaboration. This includes robotic vision intended to improve robot functionality, user experience, interface design, control methods, and robot utility for certain actions or tasks.

3 Review Protocol

This survey and systematic review will provide insight into the underlying emergent research themes pursued by the community, and explore the broader use of robotic vision to enhance human-robot interactivity and collaborative outcomes. The purpose of this systematic review is to inform readers about the current state of robotic vision applied to interpreting and responding to human actions, activities, tasks, states, and emotions. For the purpose of this survey and review, a robot was defined as a system that can, (semi-)autonomously and through one or more algorithms, act in the world through one or more actuators in response to perception through one or more sensors, with the potential inclusion of an externally provided goal directive.
This systematic review protocol followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology for systematic research, which specifies the search, screening, and evaluation steps. This method involves a comprehensive and reproducible search strategy to present and critically appraise the research findings related to the topic of interest [306]. PRISMA guidelines [306] are considered to be a gold-standard reporting method with more than 80,000 citations in the past 15 years. The PRISMA method also allows for clear conclusions to be drawn across an expansive pool of studies with minimal selection bias alongside balanced reporting of research findings [306]. All of the studies were assessed for inclusion/exclusion criteria to provide a defined view of the topic, as well as to assess the included studies for quality assurance. The search strategy of the systematic review was designed to capture different aspects related to robotic vision for HRI/C, including robot behaviors, collaborative tasks, and communicative behaviors. This review was designed to answer the following Research Questions (RQ):
RQ1.
What is the general trend of robotic vision work in human-robot collaboration and interaction in the past 10 years?
RQ2.
What are the most common application areas and domains for robotic vision in human-robot collaboration and interaction?
RQ3.
What is the human-robot interaction taxonomy for robots with robotic vision in human-robot collaboration and interaction?
RQ4.
What are the vision techniques and tools used in human-robot collaboration and interaction?
RQ5.
What are the datasets and models that have been used for robotic vision in human-robot collaboration and interaction?
RQ6.
What has been the main participant sample, and how is robotic vision in human-robot collaboration and interaction evaluated?
RQ7.
What is the state of the art in vision algorithm performance for robotic vision in human-robot collaboration and interaction?
RQ8.
What are the upcoming challenges for robotic vision in human-robot collaboration and interaction?
Preliminary searches were undertaken in field-relevant journals and conferences to help inform search criteria keywords. The search terms went through extensive iteration and the final terms were chosen to be broad enough to capture works across multiple disciplines and topic keywords, but scoped as best as possible to systematically extract papers on the intended topic of interest: robots, vision, and humans: ((((“Abstract”: Robot* OR “Abstract”: UAV OR “Abstract”: AUV OR “Abstract”: UUV OR “Abstract”: Drone OR “Abstract”: Humanoid OR “Abstract”: Manipulator) AND (“Abstract”: Vision OR Abstract: Image OR “Abstract”: Camera OR “Abstract”: RGB* OR “Abstract”: Primesense OR “Abstract”: Realsense OR “Abstract”: Kinect) AND (“Abstract”: Human OR “Abstract”: Humans OR “Abstract”: Person OR “Abstract”: People OR “Abstract”: User OR “Abstract”: Users) AND (“Abstract”: HRI OR “Abstract”: HRC OR “Abstract”: Collaborat* OR “Abstract”: Interact* OR “Abstract”: “Human-in-the-Loop” OR “Abstract”: Team* OR “Abstract”: “Human-to-Robot” OR “Abstract”: “Robot-to-Human”)))). To create the search method, the following databases were chosen for systematic search and data extraction, representing multi-disciplinary avenues for published works: IEEE Xplore, ACM Library, and Scopus. Inclusion and exclusion criteria were generated, reviewed, and approved by subject matter experts across robotics, human-robot interaction, and behavioral science to confirm keyword relevance to identify suitable papers for the topic of interest, and to reduce the chance of extracting unrelated works. The final inclusion criteria markers were used in a sequential order when categorizing extracted papers to determine their inclusion in this systematic review:
(C1)
The research must include at least one physically embodied robot that can perceive through a vision system.
(C2)
The robot(s) must be capable of at least one closed-loop interaction or information exchange between the human and the robot(s), where the robot(s) vision system is utilized in the exchange, and a human is the focus of the vision system.
(C3)
The robot(s) must be able to make a decision and/or action based on visual input that is real time or at least fast enough for an interactive channel to occur between the human and the robot (i.e., within 60 seconds).
The purpose of C1 was to ensure that only physical robot systems with a vision system were analyzed, with digital avatars and software systems running on computers excluded from analysis. The purpose of C2 was to ensure that the robots were able to perceive visual information relevant to creating a robot signal, task, or action based on the vision system, and that the robot could in fact perceive at least part of the person during the interaction or collaborative exchange. The purpose of C3 was to ensure that robots could perform a decision and/or action based on interpretation of the visual information, and that the interaction exchange occurred without extended wait times. Taken together, these criteria ensure that the robot was acting on the visual information, that the human was classified and/or involved in the process, and that the information and/or exchange was occurring in a functional amount of time for an interaction.
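As an illustration of how such criteria can be applied in a fixed order during screening, the sketch below encodes C1 through C3 as sequential checks over assumed annotation fields; the field names are hypothetical and do not reflect the authors' actual coding sheet.

# Hedged sketch: applying inclusion criteria C1-C3 sequentially to screening annotations.
# The dictionary keys are hypothetical field names, not the review's actual coding sheet.
def first_failed_criterion(paper):
    """Return 'C1', 'C2', 'C3' for the first unmet criterion, or None if all are met."""
    if not (paper["physical_robot"] and paper["has_vision_sensor"]):
        return "C1"  # must be a physically embodied robot with a vision system
    if not (paper["closed_loop_interaction"] and paper["human_in_vision"]):
        return "C2"  # vision of the human must feed a closed-loop exchange
    if not (paper["acts_on_vision"] and paper["response_time_s"] <= 60):
        return "C3"  # decision/action from vision within an interactive time frame
    return None

example = {"physical_robot": True, "has_vision_sensor": True,
           "closed_loop_interaction": True, "human_in_vision": True,
           "acts_on_vision": True, "response_time_s": 2.0}
print(first_failed_criterion(example))  # -> None (paper would be included)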
To maintain the proposed review theme of humans, robotic vision, and interaction or collaboration, several exclusions were created and used in this review. Papers were excluded if they contained non-embodied agents that did not operate as a robotic system, such as having no actuation system and/or capacity to make or execute decisions (i.e., cameras, computers, smartphones, tele-operated devices, avatars). Papers were also excluded if there was no physical or verbal robot action involved in the process as a result of processing visual information, as well as instances in which the robot could have been substituted with a camera on its own, a computer screen on its own, or another simple input signal such as using the robot as a speaker only. Given the clear focus on robotic vision, this review excluded papers that did not meet the criteria of a vision sensor (RGB, RGB-D) paired with an algorithm/s that could translate raw images to a control signal for a robot. Papers were also excluded if vision was not central to the system’s operation or lacked control, such as vision being a function of the robot, but not used to inform or update the decision-making process or resulting action of the robot.
Papers that did not have any human-relevant information, use case application, or research experiment with people were also excluded, because they did not meet the inclusion criteria reflecting the intention to explore robotic vision in relation to human-robot interaction or collaboration. Examples of such papers include early-stage design work on proposed concepts of robot systems that had not yet been built, and robot competition papers where the robot was intended for a human environment but its proposed performance or relationship with people was not reported at all. For this review, only papers in which there was a clear interaction or collaboration between the human and the robot were included. Therefore, robots that only used an open-loop interaction were excluded from analysis for not meeting the criteria for an interaction, especially if the visual signal input was independent of the robot output and did not influence the action or decision making of the robot. Simple devices such as children's toys were also excluded, given the limited interaction set often involved in these devices and the intention to focus on robots that could provide benefit to support a person's work or lifestyle.
All papers must have been published and available for access from the publication venue between January 1, 2010 and December 31, 2020 in a peer-reviewed journal or conference. Papers that were not formally published between these dates were not extracted. If authors published multiple versions of the work, the most complete version was included. E-print services (e.g., arXiv.org) were not included for three reasons: (1) the abundance of early-stage work that did not yet include humans in the proposed system, (2) the limited quality control without a peer-review process to ensure that only high-quality papers were identified in an unbiased way, and (3) the risk that reporting on early-stage work that had not yet undergone peer review could create a skewed commentary on the current prevalence and impact of the field. We do acknowledge the importance of robotic vision use cases that occurred in works falling into the excluded categories for this systematic review. As such, a later section acknowledges and investigates the use cases that did not meet the search strategy criteria, including any key papers that were not captured as part of the systematic search. Examples include tele-operated robots, robots with an open-loop system, simple devices, and early-stage work that did not include tests with people.

3.1 Review Information and Categorization

Each eligible article underwent systematic data extraction informed by robot classification and human-robot interaction taxonomies (e.g., [40, 464]). For each, manuscript information was extracted into categories such as task type, task criticality (low, medium, high), robot morphology (anthropomorphic, zoomorphic, functional), ratio of people to robots (i.e., a non-reduced fraction with number of humans over number of robots), composition of robot teams (formation), level of shared interaction among teams, interaction roles, type of human-robot physical proximity, time/space taxonomy [464], level of autonomy [40], task evaluation, sensor fusion (i.e., vision and speech), camera system and type, vision techniques and algorithm, training method, and datasets. User study information was extracted if a user study was reported, including participant details and experimental outcomes. A custom metric for overall task evaluation was computed using 3-point scaling (low, medium, high) for task complexity, risk, importance, and robot complexity.
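The aggregation behind the custom overall task evaluation metric is not spelled out here, so the sketch below shows one plausible way to combine the four 3-point ratings (a simple sum mapped back to low/medium/high); the thresholds are assumptions for illustration only.

# Hedged sketch of an overall task-evaluation metric from four 3-point ratings.
# The aggregation (sum of scores, then thresholds) is an assumption; the review
# does not specify how its custom metric combined the individual ratings.
SCALE = {"low": 1, "medium": 2, "high": 3}

def overall_task_evaluation(task_complexity, risk, importance, robot_complexity):
    total = sum(SCALE[r] for r in (task_complexity, risk, importance, robot_complexity))
    if total <= 6:        # 4..6  -> low
        return "low"
    if total <= 9:        # 7..9  -> medium
        return "medium"
    return "high"         # 10..12 -> high

print(overall_task_evaluation("low", "medium", "high", "medium"))  # -> medium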
Application areas were clustered and labeled using the following criteria. Gestures were defined as a hand, arm, head, or body movement intended to indicate, convey a message, or send information. Action recognition was defined as the recognition of human actions or activities that were not related to explicit gestures. Robot movement in human spaces was classified if the robot had physical movement in a human environment and robot movement did not require the human to perform a set pose to signal movement commands to the robot, including if the robot was classified as a (semi-)autonomous vehicle. Object handover and collaborative action papers included a robot capable of manipulating objects while the interaction did not require the person to perform a set pose (i.e., the person was detected without performing a gesture or action). Categorization for social communication captured papers in which the robot needed to perform a social behavior, or be capable of socially interacting with a person. Last, learning from demonstration must have used some form of demonstration learning.

4 Results

4.1 Selected Articles

The initial search across three databases found 6,771 papers, 2,034 of which were identified as duplicate records (Figure 1). The remaining 4,737 papers were screened by title and abstract to assess initial eligibility, and 887 papers were excluded. Exclusions based on format comprised textbook chapters that did not include original research work (n = 63, 7%), reviews or surveys (n = 92, 10%), works with no English version (n = 16, 2%), and other non-related works (n = 460, 52%) such as front pages, tables of contents, plenary talks, copyright notices, and keynotes. A total of 255 papers were excluded based on title alone: robotic surgical tools only (n = 121, 14%) and tele-operation only (n = 135, 15%). The remaining 3,850 papers were assessed by detailed review of the text, and 3,310 papers were omitted for not meeting the C1–C3 inclusion criteria: 2,020 on C1 (61%), 1,005 on C2 (30%), and 285 on C3 (9%). The large volume of papers omitted on C1 showed that most works had no physical robot (e.g., [165, 288, 422, 495]), involved simulation testing such as using a virtual robot (e.g., [44, 119, 329, 461]), involved cameras on their own that were not connected to robotic systems (e.g., [307]), or the robot/s did not use a vision system as part of the interaction with the person (e.g., [294, 346]). Papers omitted on C2 often had a clear focus on other components such as speech (e.g., [257]) or visual servoing (e.g., [205]). C2 papers often had no humans at all (e.g., [110]), or no human vision involved in the vision process of the interaction or collaboration (e.g., [93, 167, 217, 355, 460]). Papers omitted on C3 often did not have a robot perform an action based on visual input (e.g., [153, 284, 289, 417, 485]), used robots that were tele-operated (e.g., [360]), or had no near real-time information exchange (e.g., [77]). A total of 540 papers that met the C1–C3 inclusion criteria were then subject to another round of investigation. Papers were then excluded if the same study was published across multiple venues (n = 59, 26%), or if there was no clear experiment or demonstration of human-robot interaction or collaboration despite reporting on a system designed for HRI/C (n = 171, 74%). A final total of 310 papers (8% of full articles assessed for eligibility) met the final inclusion criteria, providing a significant pool of research works for detailed analysis on the chosen topic. The CONSORT chart of inclusion and exclusion steps can be seen in Figure 1. Two independent raters went through 10% of the 310 eligible papers and achieved 100% consensus on the inclusion and exclusion criteria. Some notable works that would have fallen into the excluded categories for this systematic review are reported in a separate section as part of presenting a comprehensive survey on the topic, but these papers were not included in the final systematic review.
Fig. 1. CONSORT chart for systematic review to determine inclusion.

5 RQ1. What Is the General Trend of Robotic Vision in Human-Robot Collaboration and Interaction in the Past 10 Years?

This section presents the general trend of robotic vision in human-robot collaboration and interaction in the past 10 years. Robotic vision for HRI/C showed a moderate but steady increase, which might be attributed to several factors, such as limited accessibility to robot platforms, integration challenges, the interdisciplinary nature of HRI/C, limited technical capacity for robots to operate consistently in robust use cases with people, limited engineering knowledge of human-robot testing, limited capacity to test robots in human spaces, and human-centered robotics representing a much smaller research field compared to its robotics and computer vision counterparts. Figure 2(a) depicts a modest increase in publications over the period, with a small decline in 2020/2021 that is likely attributable to the COVID-19 pandemic. Figure 2(b) depicts the publication themes of robotic vision work, including interaction (human-robot interaction, human-machine systems), robotics (robotics, automation, mechatronics), sensors (sensors, vision, signal processing), engineering (engineering, systems, industry, control, science), and computers and Artificial Intelligence (AI). Figure 2(c) depicts that the most relevant papers were published in conferences, followed by journals and then book series.
Fig. 2. Statistical summaries of total paper counts and trends.
Figure 2(d) depicts the most common application areas clustered into groups (N = 335): action recognition (13%), gesture recognition (35%), robot movement in human spaces (22%), object handover and collaborative actions (17%), learning from demonstration (3%), and social communication (10%). If a paper had more than one application, each area was included in the final total. Individual application breakdowns are presented in the next section. Figure 2(e) depicts that common domains included field, industrial (i.e., manufacturing and warehouses), domestic (i.e., home use), and urban (i.e., shopping centers, schools, restaurants, and hotels) settings. Robots that work with and around humans were often proposed for domestic and urban environments. Figure 2(f) depicts that the most common robot type was mobile robots, followed by fixed manipulators, social robots, mobile manipulators, and aerial robots. If multiple robots were tested, only the first or most detailed test was reported in the total. Most works had a single focus on a specific vision application for a target purpose, and the intended outcome was often for robots to better integrate into human-populated environments in a direct (i.e., controlled via gesture) or non-direct way (i.e., following a person).
Figure 3 depicts that for camera type, RGB-D cameras such as the Kinect were the most frequently used, followed by monocular, stereo, and omni-directional cameras. Figure 3 also shows an increasing uptake of RGB-D cameras over time. RGB-D cameras were used extensively across all use cases, environments, and robot types, showing the value of this sensor's capacity to provide critical visual information for robot tasks. Figure 4 depicts global trends in domains and robot types. Figure 4(a) depicts that the highest volume of work was conducted in gesture recognition or robot movement in human spaces, with the exception of Europe, which had a higher focus on object handover and collaborative actions. Figure 4(b) depicts that the most common robot types were mobile robots and fixed manipulators across all continents.
Fig. 3. Statistical summary of application, domain, and robot types.
Fig. 4. Statistical summaries of total paper counts and trends per continent.

6 RQ2. What Are the Most Common Application Areas and Domains for Robotic Vision in Human-Robot Collaboration and Interaction?

This section provides a detailed breakdown of the following application areas: gesture and action recognition, robot movement in human spaces, object handover and collaborative actions, learning from demonstration, and social communication. If a paper had more than one application, each area was included in the final total. Papers often reported tasks and actions that were simplified or well contained within their relevant context or domain. In addition, papers often focused on using the human to improve the robot's performance, such as human gestures providing greater control over robot actions, humans improving robot handover accuracy, and humans contributing to better mobile robot safety on pathways. A summary of the identified papers is reviewed in this section, and a detailed exploration of the technical content of the papers is presented in Section 10.

6.1 Gesture Recognition

6.1.1 Overview.

Gestures were defined as a hand, arm, head, or body movement intended to indicate, convey a message, or send information. A total of 116 papers (37% of the eligible total) were found to have at least one form of gesture recognition that used vision to identify and respond to the person. Figure 5 depicts the number of gesture-related works, common domains, camera types, robot types, gesture types, and level of autonomy. Of the 75 papers that used RGB-D cameras, 66 used the Kinect (88%). Of the total 116 papers, 79 (68%) involved gestures from body pose, and 37 (32%) involved a hand gesture. Most papers used static gestures (i.e., a stationary human pose; N = 91, 78%) that did not require visual detection of movement. A smaller portion used dynamic methods (N = 25, 22%), requiring multiple frames to classify the gesture. Some papers had a blended approach, for example, a static gesture to signify the start of a dynamic gesture (e.g., [458]). Gesture recognition to control robots was used for different robot types: industrial robot arms (e.g., [129, 254, 282, 393]), mobile ground robots (e.g., [79, 296, 491]), and mobile manipulators (e.g., [63, 117, 303, 397]), and less commonly for social robots (e.g., [209]) and aerial robots (e.g., [253]). Across robot types, there was little consistency, with a wide variety of models used for gesture recognition. Continuous control was often used, such as interacting with mobile robots using hand position [333] and with small mobile robots using head position [230]. Last, gestures were also used with teams of robots [13, 63, 253, 297, 323] and by multiple humans in the same scene [270].
Fig. 5. Gesture recognition totals and summaries.
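Many of the static-gesture systems summarized above reduce to simple geometric rules over tracked skeleton joints, such as those provided by a Kinect. The sketch below illustrates that pattern; the joint names, image-style coordinates (y increases downward), and gesture labels are assumptions rather than any particular paper's method.

# Hedged sketch: rule-based static gesture classification from one skeleton frame.
# Joint names and the image-style coordinate convention (y increases downward)
# are assumptions; Kinect-style trackers expose similar 2D/3D joint maps.
def classify_static_gesture(joints):
    """joints: dict of joint name -> (x, y) pixel coordinates for a single person."""
    lw, rw = joints["left_wrist"], joints["right_wrist"]
    ls, rs = joints["left_shoulder"], joints["right_shoulder"]
    left_up = lw[1] < ls[1]       # wrist above shoulder => arm raised
    right_up = rw[1] < rs[1]
    if left_up and right_up:
        return "both_arms_up"     # e.g., an attention / stop signal
    if right_up:
        return "right_arm_up"     # e.g., start person following
    if left_up:
        return "left_arm_up"      # e.g., stop / park
    return "none"

frame = {"left_wrist": (210, 400), "right_wrist": (430, 150),
         "left_shoulder": (250, 300), "right_shoulder": (400, 300)}
print(classify_static_gesture(frame))  # -> right_arm_up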

6.1.2 Use Case Examples: Mobile Robot Control.

In mobile ground robots, hand gestures were often used to control the robot to move forward, left, right, or stop (e.g., [79, 84, 94, 129, 131, 150, 229, 236, 248, 263, 277, 296, 338, 447, 459, 491]). Pointing gestures were often used to direct mobile robots to a specified location (e.g., [6, 81, 202, 330, 348, 416, 426, 472]). Human body movements were used as the control signal, such as shoulder angle to control robot direction based on discrete angles [445]. More dynamic motions were also used to control a robot to move forward, back, left, and right by movement of an arm up and down, left to right, or in circles (e.g., [127]), or hand waving to signify a follow-me or goodbye command (e.g., [145]). Body gestures were used to control an otherwise autonomous mobile robot to turn and stop by moving arms up or down [150], or human shoulder position to control robot velocity [279]. Last, a spherical vision system was used with a mobile robot with three omni-directional wheels to detect pointing gestures from a person wearing a red coat with a blue glove [472].
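The discrete command style described above typically amounts to a lookup from a recognized gesture label to a velocity pair. A minimal, hypothetical mapping is sketched below; the labels and speed values are illustrative and not taken from the cited systems.

# Hedged sketch: mapping recognized gesture labels to (linear, angular) velocity commands.
# Labels and numeric values are illustrative; real systems choose their own vocabulary.
GESTURE_TO_CMD = {
    "forward": (0.3, 0.0),    # m/s, rad/s
    "left":    (0.0, 0.5),
    "right":   (0.0, -0.5),
    "stop":    (0.0, 0.0),
}

def command_from_gesture(label):
    """Return a velocity command for a recognized gesture, defaulting to stop."""
    return GESTURE_TO_CMD.get(label, (0.0, 0.0))

print(command_from_gesture("left"))     # -> (0.0, 0.5)
print(command_from_gesture("unknown"))  # -> (0.0, 0.0)  safe default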

6.1.3 Use Case Examples: Manipulator Robot Control.

In manipulators, robots were controlled using hand gestures such as an open palm [129, 254, 282]. Hand gestures were used to command robot actions, such as to lift or lower the arm [109, 129, 272, 393], rotate the arm [91, 393], open or close the gripper [129, 254, 282], place an object into an open palm [21], return to position when the palm is closed [21], and set positions for lifting and lowering [109, 129, 272]. Pointing was commonly used for selecting an object for grasping [303, 353, 397, 425], including having the robot arm confirm object selection by pointing at the object [353]. Hand gestures were also paired with other body movements for controlling manipulators [254, 282]. Other works included more collaborative actions, such as the robot helping to cook by dropping confirmed toppings over a pizza base [353]. A robot equipped with two arms, stereo vision, and tactile sensors could also pick up an object (sponge cube, wooden cube, ping-pong ball) that was selected by a hand pointing gesture from a human, and could release the object onto the palm of the person [192].

6.1.4 Use Case Examples: Mobile Manipulators and Aerial Robots.

In mobile manipulators, pointing gestures were similarly used to select desired objects for the robot to pick up (e.g., [63, 117, 303, 351, 397]). In one instance, a mobile manipulator responded to gestures (left and right hand) and user speech to identify, fetch, and handover objects such as a water bottle [62]. Last, a mobile base with arms could wave back to a person waving at the robot and perform a behavior as commanded by a dynamic gesture [248]. Aerial examples include the use of body pose to control an aerial robot, such as right arm up to take off and right arm out to turn right [375] and pointing gestures to select an aerial robot and confirm the selection by touching the right arm to the left hand [253].

6.1.5 Use Case Examples: State Changes.

Gestures were also used to signal the robot to commence state changes (e.g., [79, 122, 229, 339]). Some examples include initiating person guiding or following [339] or indicating a path direction change for an otherwise autonomous robot [122, 229]. Hand and body gestures were used to start/stop a walk action for a small humanoid [356, 387], body gestures to start/stop person following in an indoor environment [263], or a left/right arm raised to change between robot following or parking behavior [296]. In one example, an autonomous mobile navigation robot explored a laboratory and asked humans for directions when a person was detected, translating pointing gestures to a goal in the robot's map [426]. Gestures were also used in learning from demonstration to determine when a demonstration had commenced or concluded (hand [287] and body [398]), or to update a robot's behavior online [118, 340]. Further works on learning from demonstration are discussed in Section 6.6.

6.1.6 Use Case Examples: Team-Based Scenarios.

Gestures were also used in team-based scenarios, such as four mobile robots responding to gestures from a human operator [297]. In this example, the human selected a group of robots by drawing a circle around them and then directing them to go to a chosen location [297]. Other team-based examples include the use of gesture-based interaction to signal to aerial and ground robot teams [318]. Gestures were used to command a small swarm of mobile robots to move into a set configuration using body poses (i.e., arms out front, or above the person's head) [13]. Pointing gestures were often used to select a specific robot from a team of aerial robots [253] and to command a selected group of mobile robots [297]. This included pointing to direct robot attention to other human targets [270]. Last, one example showed a multi-person interaction, in which a mobile robot identified a person by localizing an audio source and then determined which person to track when they waved at the robot [322].

6.1.7 Use Case Examples: Implicit (Non-Verbal) Communication and Social Interactivity.

There were fewer papers on gestures being performed by anthropomorphic robots to mimic human gestures for the purpose of social interaction. An example includes hand waving from a humanoid robot in response to a human wave [63], helping to facilitate non-verbal communication. In another example, gesture recognition was used for humanoid robots (Pepper and NAO) to perform finger-spelling gestures to communicate with hearing-impaired individuals at a public service center [209]. Social interactivity with robots also involved gesture-based games, such as paper-scissors-rock, which required the robot to classify human pose to determine the result [213, 476]. Last, a game with the iCub robot required the robot to recognize each gesture performed by the person in order to participate in the game [155].

6.1.8 Included Papers.

Papers related to gesture recognition are listed here: [6, 13, 15, 21, 28, 30, 32, 58, 60, 62, 63, 71, 74, 76, 78, 79, 80, 84, 87, 91, 94, 99, 101, 107, 109, 117, 118, 122, 127, 129, 130, 131, 145, 146, 150, 155, 159, 168, 172, 176, 187, 192, 202, 207, 209, 214, 223, 229, 231, 236, 241, 248, 252, 253, 254, 263, 270, 272, 276, 277, 278, 279, 280, 281, 282, 286, 287, 291, 296, 297, 303, 304, 310, 311, 318, 322, 323, 330, 338, 339, 340, 341, 348, 351, 352, 353, 356, 372, 375, 384, 387, 393, 397, 398, 408, 416, 423, 424, 425, 426, 439, 442, 445, 447, 457, 458, 459, 469, 472, 474, 475, 476, 477, 483, 484, 491].

6.2 Action Recognition

6.2.1 Overview.

Action recognition was defined as the recognition of human actions or activities that were not related to explicit gestures. A total of 44 papers (14% of the eligible total) involved some form of action or activity recognition. Figure 6 shows the number of action recognition-related works, common domains, camera types, robot types, action types, and level of autonomy. Of the 36 papers that used RGB-D cameras, 33 used the Kinect (92%). Action recognition most often involved recognizing a person's activities for action recognition and response (N = 23, 52%), followed by activities of daily living (N = 7, 16%), exercise pose (N = 6, 14%), and recognition of walking motion (N = 3, 7%). Action recognition was often used on humanoid robots, including the NAO [27, 121, 158, 452], Pepper [160, 233, 394], and other humanoids [27, 347], as well as on mobile robots [239, 430, 493].
Fig. 6. Action recognition totals and summaries.

6.2.2 Use Case Examples.

Action recognition often involved the identification of human states. For example, a robot could decide when to offer a person a footrest [493], ask whether to call an ambulance if the person had fallen down [394], or respond to a human handing a bottle to the robot [347]. This included helping robots recognize multiple activities such as eating, brushing teeth, and making a phone call (e.g., [239]). Action recognition was used to help the robot predict human motion such as walking, eating, and smoking, and to infer the remaining motion sequence after the camera was occluded [160], as well as to predict human actions in a shared workspace, such as avoiding collision during tool use by recognizing the use of a hammer or reaching for a cup [446]. Action recognition was also used to allow a robot to detect other body actions (i.e., shake head, wave hand) and then perform a corresponding behavior [233]. Other examples include understanding and copying human motions, such as joint positions [26, 102, 191, 193, 267, 405, 418, 467, 499], head positions [69], and facial expressions [283, 292, 395, 405], or following the continuous position of the person's hand [333] or head [230]. This also involved more rigorous body motions, such as humans performing physical activity with the robot giving feedback on pose quality for a chosen exercise [27, 158, 452]. Other physical activity examples include recording pose count and signaling to change exercises if the person waved their hand [27], and the robot learning exercises through action recognition and response to a human demonstrator [158]. In a more applied environment, action recognition was used to detect walking ability so that a service robot could guide people of different walking capacities (i.e., wheelchairs, crutches, or walkers) to a suitable entrance [430]. In particular, motion analysis helped the robot track the person's position to adapt the motion of a robotic walking aid for different mobility levels [72]. For service assistance, action recognition was used to determine when to provide domestic chore assistance, such as filling a glass of water, opening a fridge door [224], clearing a table, or pushing a trivet to the person [240]. Last, some forms of action recognition were used in playful contexts, such as a child and a humanoid robot taking turns to perform and recognize a pantomime action such as swimming, painting a wall, or digging a hole [121].
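Many of the recognition-and-response examples above operate on short windows of tracked poses rather than single frames. The sketch below illustrates that sliding-window idea with a deliberately simple nearest-centroid classifier over joint-displacement features; this toy formulation is an assumption standing in for the learned models used in the surveyed papers.

# Hedged sketch: sliding-window action recognition over a sequence of skeleton frames.
# The feature (mean per-joint displacement) and nearest-centroid classifier are toy
# assumptions standing in for the learned models used in the surveyed papers.
import numpy as np

def window_feature(window):
    """window: array (T, J, 2) of J joint positions over T frames -> motion feature (J*2,)."""
    displacement = np.abs(np.diff(window, axis=0)).mean(axis=0)  # mean |frame-to-frame| motion
    return displacement.reshape(-1)

def classify(window, centroids):
    """centroids: dict label -> feature vector; return label nearest to the window feature."""
    f = window_feature(window)
    return min(centroids, key=lambda label: np.linalg.norm(f - centroids[label]))

# Toy data: 10 frames of 2 joints; "waving" moves a lot, "idle" barely moves.
rng = np.random.default_rng(0)
wave = np.cumsum(rng.normal(0, 5.0, (10, 2, 2)), axis=0)
idle = np.cumsum(rng.normal(0, 0.1, (10, 2, 2)), axis=0)
centroids = {"waving": window_feature(wave), "idle": window_feature(idle)}
print(classify(idle + rng.normal(0, 0.05, idle.shape), centroids))  # -> idle (expected)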

6.2.3 Included Papers.

Papers related to action recognition are listed here: [5, 8, 26, 27, 69, 72, 95, 102, 111, 121, 158, 160, 191, 193, 224, 230, 233, 239, 240, 259, 267, 283, 292, 305, 324, 333, 347, 371, 376, 390, 394, 395, 396, 405, 418, 430, 435, 446, 452, 463, 467, 493, 498, 499].

6.3 Robot Movement in Human Spaces

6.3.1 Overview.

Robot movement in human spaces was classified if the robot had physical movement in a human-based environment and the robot movement did not require the human to perform a set pose to signal movement commands to the robot, including if the robot was classified as a (semi-)autonomous vehicle. A total of 74 papers (24% of the eligible total) used robot movement in human spaces. Figure 7 shows the number of related works on robot movement in human spaces, common domains, camera types, robot types, common tasks, and level of autonomy. Of the 37 papers that used RGB-D cameras, 29 used the Kinect (78%). Common robot tasks were following the person (n = 55, 74%), avoiding a person (n = 9, 12%), and approaching one or more people (n = 7, 9%). In total, body pose detection was the most common method for identifying a person in an image (n = 57, 77%), followed by face detection (n = 14, 19%). Other methods involved tracking or detection of clothing [298, 453]. Mobile robots often had laser range sensors for person detection [14, 188, 219, 271, 339, 450], obstacle avoidance [10, 465, 479], and navigation (i.e., Simultaneous Localization and Mapping (SLAM)) [16, 113, 296, 475]. Depth images were also used for obstacle avoidance [37] and SLAM [406]. Ultrasonic sensors were also used for person following [178] and for navigation [89], as well as audio to localize a person not in view [37, 271, 322]. Re-identification of a person who had become occluded during following was addressed in several papers [97, 271, 450, 453, 492]. Some papers used multi-modal detectors such as laser or ultrasonic range sensors to identify a person (i.e., detecting legs (n = 9) or shoulders (n = 1)), and audio localization to determine whether a person was out of view (n = 4). Others required minimal intervention from the person through gesture commands (n = 9). Proxemics was often considered for appropriate social distance to approach [290] and avoid [41, 406, 428, 465] people, including adjusting velocity when a person was detected [428]. Person-following environments included both urban [113] and indoor settings [219, 263, 453, 454, 455]. Commonly used mobile platforms included Pioneer mobile robots (n = 16), the SCITOS G5 [448, 450], iRobot Create [113], a wheelchair robot [453], the Turtlebot [97], and even a robotic blimp [470].
Fig. 7. Robot movement in human spaces totals and summaries.

6.3.2 Use Case Examples: Social and Functional Navigation.

There were several key examples of robots that followed people for both social and functional reasons [18, 97, 190, 263, 273, 454, 455, 492]. Many works required the robot to approach the person to commence navigation [61, 81, 86, 137, 290], and used human detection and tracking to know where the person was in order to avoid colliding with them [41, 150, 406, 428, 465]. In addition, other works included integrating humans into a map for a mobile robot [16]. Other works had the robot approach and then interact with the person by initiating a dialogue [137], or by creating a new goal through the use of human gestures to change the robot's state, such as to commence a variant of a person-following behavior [79, 145, 263, 296, 339, 348, 475]. Other state changes included guiding the people avoiding the robot toward a particular path [150]. In works where the robot approached the person, a line-following robot approached a person when their face was detected [61] and a PR2 robot approached a person using a real-time proxemic controller [290], helping to bridge functional navigation methods toward social interaction points. In one example, a robot had an omni-directional camera for person detection and a laser used to navigate through the environment using SLAM [113]. This included a social force model that allowed for appropriate social distance when passing people and avoided having the robot cross into a person's personal or intimate space when passing from behind [41, 465]. Robot movement through human spaces was also tested in busy environments, such as an iRobot that could navigate through urban environments, detect human faces, and report back to a supervisory person [113]. Multiple people were also detected and used as landmarks for integration into a SLAM solution for a Pioneer robot following a path [16]. This included robust person following, even when multiple people were present in a scene and when the target became occluded [316]. The iRobot was also used to identify the face of a specific person and keep the target person's face in view [490]. In robot navigation for human environments, some robots also conducted multiple tasks together as part of the use case. For example, a mobile R2D2 robot with arms carried a smaller mobile robot with a gripper. The robot could wave when it detected a person's face and delivered a drink based on the distance to the target face. The main robot could also deploy the smaller robot with a gripper to pick up small objects in its path [86]. Others had different methods of person following, such as following a person through airspace. Two examples involved a two-camera vision system on a small aerial robot to detect and hover above the hand of a person wearing a glove, with the intention to pass the robot between people [298], and a monocular camera attached to an autonomous blimp robot to detect and follow a person's face [470].
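A recurring pattern in the person-following work above is to regulate heading from the person's image position and distance from depth, stopping inside a proxemic buffer. The sketch below shows that control rule in isolation; the gains, target distance, and 1.2 m buffer are placeholder assumptions.

# Hedged sketch: person-following control from a detected person's image position and depth.
# Gains, distances, and the 1.2 m proxemic buffer are placeholder assumptions.
def follow_command(person_cx, image_width, person_depth_m,
                   target_dist_m=1.8, stop_dist_m=1.2,
                   k_ang=0.003, k_lin=0.6, max_lin=0.5):
    """Return (linear m/s, angular rad/s) to keep the person centred at ~target distance."""
    angular = -k_ang * (person_cx - image_width / 2)     # steer toward image centre
    if person_depth_m <= stop_dist_m:
        return 0.0, angular                              # inside personal space: stop advancing
    linear = min(max_lin, k_lin * (person_depth_m - target_dist_m))
    return max(0.0, linear), angular                     # never reverse in this simple rule

print(follow_command(person_cx=400, image_width=640, person_depth_m=3.0))
# -> roughly (0.5, -0.24): move forward and turn slightly toward the person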

6.3.3 Included Papers.

Papers related to robot movement in human spaces are listed here: [10, 11, 12, 14, 16, 17, 18, 34, 37, 41, 59, 61, 79, 81, 85, 86, 90, 97, 108, 113, 125, 137, 145, 148, 149, 150, 157, 163, 178, 181, 188, 190, 204, 208, 215, 219, 242, 263, 265, 271, 273, 285, 290, 296, 298, 299, 300, 316, 322, 328, 337, 339, 348, 349, 361, 379, 406, 428, 440, 448, 450, 451, 453, 454, 455, 465, 470, 475, 478, 479, 488, 490, 492, 494].

6.4 Object Handover and Collaborative Actions

6.4.1 Overview.

Object handover and collaborative action papers included a robot capable of manipulating objects while the interaction did not require the person to perform a set pose (i.e., the person was detected without performing a gesture or action). A total of 56 papers (18% of the eligible total) involved an object handover or collaborative action. Figure 8 shows the number of works, common domains, camera types, robot types, common tasks, and level of autonomy. Of the 45 papers that used RGB-D cameras, 34 used the Kinect (75.5%). The most common use case involved a human-aware workspace where the robot had to operate safely in a shared space (N = 26, 46%), followed by direct control of a robot arm (N = 15, 27%), object handover (N = 7, 13%), and collaborative manipulation (N = 3, 5%). For shared-space work, most actions were to improve safety outcomes for the human. For instance, if a human was detected in the shared area, the robot would come to a halt [20, 312, 319, 385, 407], slow down [51, 482], or change its trajectory to avoid contact [73, 112, 136, 232, 262, 313, 314, 331, 409]. In some shared-work instances, the person was required to be in a safe standing pose before robot commands were accepted [169]. In object handover cases, the handover process often involved both passing objects from robot to human [25, 38, 192, 392] and from human to robot [382]. These handover actions often used force [38] or tactile sensors [192] to determine when to release the object. Another use case was robotic vision to control robot arms by matching the tool center point with human hand positions [25, 52, 92, 169, 185]. Last, other related works also used both robotic vision and an IMU [169]. Commonly used platforms were the KUKA (N = 9), Universal Robots (UR) series (N = 8), ABB industrial arms (N = 4), Franka Emika Panda (N = 3), Stäubli (N = 2), Baxter (N = 3), Lynxmotion (N = 2), Sawyer (N = 1), WAM robot (N = 1), and other types of arms (N = 19).
Fig. 8. Object handover and collaborative actions totals and summaries.

6.4.2 Use Case Examples.

Collaborative actions often involved the robot providing assistance during a specific task. For specific tasks, robotic vision was used to track a person's hand, with a force sensor to sense contact, in a screwing task [88], and to match the end-effector position with the hand position to pick up and pass an item [25]. Others moved the tool center point of a robot arm to follow a human hand and execute a grasp action when the human placed both hands out in front [52], or performed cooperative sheet folding, with a hand detected against the fabric corner and the robot arm picking up the opposite corner to perform the fold [225]. Some collaborative actions took a mixed-sensor approach. Examples include joint manipulation of a wooden stick with a Franka Emika arm, with the goal of keeping the stick horizontal, using data from wearable IMU devices attached to the wrists, elbows, and hip fused with extracted skeleton information to provide the robot with an accurate human pose [487], and a camera mounted to the end effector of a Stäubli industrial arm to follow a human hand using visual servoing, with a force sensor to know when the robot should release the object [38]. Collaborative actions also had specific person-centered applications, such as robots to assist persons with disabilities through assistive dressing with a jacket [425], soft robots to assist in bathing [114], and a robot that a surgeon could instruct to fetch an item during an operation [221]. Other applications were oriented around a specific outcome, such as improving safety. Safety outcomes included robotic vision to identify a potential collision with a person and stop the robot in a collaborative assembly task [312], to deactivate the robot if a person was too close [319], to assist in consideration of proxemics when handing over a water bottle [392], and image and torque sensing to help reduce robot arm speed based on human proximity and contact [51]. Other safety-related functions included using the shoulder and hand position to determine whether the person was oriented toward the task and, if not, pausing until the person was again engaged or no person was present [55], and robotic vision to help address ergonomics, such as adjusting the working height of the end effector based on human pose (i.e., height and arm position) [427].

6.4.3 Included Papers.

Papers related to object handover and collaborative actions are listed here: [20, 25, 29, 30, 38, 45, 51, 52, 55, 56, 73, 88, 92, 100, 107, 108, 112, 114, 126, 136, 144, 147, 162, 169, 185, 192, 221, 225, 232, 262, 301, 312, 313, 314, 319, 320, 321, 331, 374, 378, 380, 382, 385, 392, 401, 404, 407, 409, 412, 425, 427, 444, 482, 484, 487, 496].

6.5 Social Communication

6.5.1 Overview.

Categorization for social communication required the robot to perform a social behavior or to be capable of socially interacting with a person. A total of 33 papers (11% of the eligible total) involved a social interaction between a person and a robot. Figure 9 shows the number of social interaction works, common domains, camera types, robot types, common social tasks, and level of autonomy. In the 16 papers that used RGB-D cameras, 12 were the Kinect (75%). Common tasks required a robot to converse with a person (N = 10, 30%), detect social engagement (N = 6, 18%), or approach people in a social way (N = 3, 9%).
Fig. 9. Social communication totals and summaries.

6.5.2 Use Case Examples.

Social actions included having the robot face toward the person who was talking [70, 102, 249, 343], detecting the active speaker in a group of people using facial recognition and audio [70], commencing a conversation when a person was detected [343], identifying when the person had finished talking [49, 201], performing face detection and gestures during a conversation with a person [102], waving when a waving gesture was detected [152], and recognizing engagement levels through facial expression and gaze detection [68, 373, 405] or from head movements such as nodding or shaking [373]. Other actions related to social interaction included classifying facial expressions [216, 250], gender and person identification [89], and age and gender estimation using facial cues [70]. Social communication was also used in mobile robot situations, such as detecting whether a person wanted to interact with the robot [247, 309, 432], and determining social group configurations to enact appropriate social distance conventions [420]. Multi-person applications were also explored, such as speech and vision sensing with an iCat robot head with two arms to greet people, take orders, and serve drinks to multiple people [139], and the Pepper robot in a restaurant where the robot was required to point to seating locations, repeat bar orders, relocate a person, and deliver an item to a person even if they had moved from their original position [238]. Several works also tested situations in which more than one human was in the robot's field of view (e.g., [247, 420]).

6.5.3 Included Papers.

Papers related to social communication are listed here: [19, 33, 42, 49, 68, 70, 89, 102, 106, 139, 152, 177, 201, 206, 216, 234, 238, 246, 247, 249, 250, 309, 327, 343, 354, 373, 391, 400, 405, 420, 432, 473, 489].

6.6 Learning from Demonstration

6.6.1 Overview.

A total of 12 papers (4% of the eligible total) involved some form of learning from demonstration. Figure 10 shows the number of learning from demonstration works, common domains, camera types, robot types, common tasks, and level of autonomy. In the 10 papers that used RGB-D cameras, 9 were the Kinect (90%). Learning from demonstration tasks included manufacturing assistance (\(N=6\), 50%), human interaction (\(N=2\), 16%), scene understanding (\(N=2\), 16%), and behavior learning (\(N=2\), 16%). Robots could learn by watching a person perform a task, or through collecting data from an interaction between two people. Gestures were often used to enter a demonstration mode, including when a human would move the robot (\(N=3\), 25%), or to provide instructions (\(N=2\), 16%). Robots were often humanoids (\(N=7\)), which included the iCub (\(N=2\)) [370, 481], Pepper (\(N=1\)) [473], SARCOS humanoid (\(N=1\)) [340], imNeu (\(N=1\)) [251], a small humanoid (\(N=1\)) [347], and an anthropomorphic robot head (\(N=1\)) [471]. There were five industrial robot arms: Kuka (\(N=3\)) [118, 287, 463], WAM robot (\(N=1\)) [424], and FANUC (\(N=1\)) [398].
Fig. 10. Learning from demonstration.

6.6.2 Use Case Examples.

Gestures were often used in learning from demonstration tasks, such as hand and body gestures to signal the beginning and end of a demonstration [287, 398], or to teach the robot online [118, 340]. Other examples involved pointing and speech to show an industrial arm where to work [118], teaching the robot to perform a peg grasping task [251], and following actions in a watering task from visual gaze [471]. Learning methods included programming the robot to become compliant once a hand gesture command had been received, so that the end effector could be moved for a chosen action, with a second gesture to signal the completion of programming and for the robot to begin executing the new task [287]. Other methods included fine-tuning actions when the robot end effector was close to the desired position [118], changing the periodic motion of a humanoid end effector, and changing the motion of its hand in response to a human coaching gesture [340]. Learning from demonstration also included humans teaching a small humanoid [347], and learning from mirrored human examples, such as observing related actions with a bottle (hold, place, and take) and learning to replicate them [370].

6.6.3 Included Papers.

Papers related to learning from demonstrations are listed here: [118, 251, 287, 340, 347, 370, 398, 424, 463, 471, 473, 481].

7 RQ3. What Is the Human-Robot Interaction Taxonomy for Robotic Vision in Human-Robot Collaboration AND Interaction?

7.1 Summary

This section explores the human-robot interaction taxonomy data (Figures 11 and 12) as informed by human-robot interaction and robot classification taxonomies (e.g., [40, 464]). Detailed explanation of the taxonomy hierarchy and the relevant classification labels can be found elsewhere [40, 464], and a brief summary of labels is listed in the review information and categorization section (Section 3.1). Task type was relatively broad with limited consistency between studies and has been reported in the individual sections prior to this section (see Section 6). Task criticality for most robot use cases was identified as low (N = 271, 88%) compared to medium (N = 32, 10%) or high (\(N=7\), 2%), which may further support the emergent nature of robot roles, with easier use cases serving as a first application. There was also a link between task difficulty and frequency, where less difficult tasks were more commonly investigated and harder tasks were less represented. These classification patterns were similar to our own custom metric, a single score for overall task evaluation based on task complexity, risk, importance, and robot complexity: low (N = 249, 80%), medium (N = 51, 17%), and high (\(N=10\), 3%). Robot morphology was categorized as anthropomorphic (human-like), zoomorphic (animal-like), and functional (neither human-like nor animal-like, but related to function). Most systems were functional (N = 203, 65%) compared to anthropomorphic (N = 104, 34%) or zoomorphic (\(N=3\), 1%). Functional robots were more likely to be used in medium and high task criticality studies compared to anthropomorphic or zoomorphic robots. A high volume of works had a 1:1 human to robot ratio (N = 281, 91%), with an overall mean ratio of 1.08, 9 works with more than one robot, 20 with more than one human, a maximum reported ratio of 5 [246], and the lowest reported ratios being 0.1 [13] and 0.07 [318]. Human-robot teams therefore mostly had 1:1 compositions, showing that the methods and team setups focused on a single human, potentially to assist the robustness and utility of robotic vision in the collaborative scenario. Homogeneous teams were used in 307 (99%) of the reported studies, with only 3 (1%) of studies using heterogeneous robot team compositions (e.g., [86, 238, 253]). Level of shared interaction among teams was predominantly the 'A' formation (N = 280, 90%), consistent with the earlier reported ratio of people to robots (N = 281, 90%). There were 8 papers with a 'B' formation (one human with multiple robots using a single interaction), 5 with a 'C' formation (one human with multiple robots using a separate interaction for each robot), 6 with a 'D' formation (multiple humans with one robot, where the robot interacts with the humans through a single interaction), and 11 with an 'E' formation (multiple humans with one robot, where each human interacts with the robot separately). In terms of the type of human-robot physical proximity, interacting (N = 186, 60%) was the most common, followed by following (N = 64, 21%) and then avoiding (N = 24, 8%). No eligible studies used tele-operation, and nearly all of the remaining studies (N = 309, 99%) had the robot synchronous (same time) and collocated (same place) with the human, with the exception of one non-collocated study [229]. Autonomy level scoring by Beer et al. [40] was used, but no papers were classified at level 1 because the exclusion criteria required that the robot not be manually operated by the human.
For the remainder, robots were often high on autonomy, which may have been skewed by the initial entry criteria that required the robot to use robotic vision to perform an action or response. Figure 13(a) depicts that papers over the past 10 years often had level 2 (tele-operation: robot prompted to assist but sensing and planning left to the human), level 6 (shared control with human initiative: robot senses the environment, develops plans/goals, and implements actions while the human monitors the robot's progress), or level 10 (full autonomy: robot performs all task aspects autonomously without human intervention). Figure 13(b) depicts a relatively even spread of autonomy levels across the four domains, Figure 13(c) depicts that mobile and fixed manipulators were most often used with level 10 autonomy, with similar trends seen across the autonomy levels, and Figure 13(d) depicts camera types per robot autonomy level.
Fig. 11. Taxonomy data summary.
Fig. 12. Taxonomy data summary: team metrics.
Fig. 13. Autonomy level trends and summaries.

7.1.1 Included Papers.

Papers where a mobile robot was used are listed here: [6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 28, 34, 37, 41, 59, 61, 63, 72, 79, 80, 81, 84, 85, 89, 90, 94, 97, 101, 113, 117, 122, 125, 127, 131, 137, 145, 148, 149, 150, 157, 158, 159, 163, 177, 178, 181, 187, 188, 190, 202, 204, 207, 208, 215, 219, 223, 229, 230, 236, 239, 241, 242, 247, 248, 263, 265, 271, 273, 277, 278, 279, 280, 281, 285, 290, 296, 297, 299, 300, 309, 316, 322, 323, 328, 330, 333, 337, 338, 339, 343, 348, 349, 361, 372, 379, 406, 408, 416, 420, 423, 426, 428, 430, 432, 440, 445, 447, 448, 450, 451, 453, 454, 455, 458, 459, 465, 472, 475, 477, 478, 479, 483, 488, 490, 491, 492, 493, 494]. Papers where a fixed manipulator was used are listed here: [20, 21, 29, 30, 38, 45, 51, 52, 55, 56, 60, 71, 73, 74, 78, 88, 91, 92, 100, 107, 109, 112, 114, 118, 126, 130, 136, 144, 147, 169, 176, 185, 192, 214, 221, 225, 231, 232, 240, 251, 252, 254, 262, 282, 286, 287, 301, 312, 313, 314, 319, 320, 321, 331, 340, 352, 353, 370, 374, 378, 380, 382, 385, 393, 398, 401, 404, 407, 409, 412, 424, 425, 427, 442, 444, 446, 463, 469, 474, 482, 484, 487, 496]. Papers that used a mobile manipulator are listed here: [25, 58, 62, 86, 108, 129, 162, 224, 238, 272, 303, 351, 392, 397]. Papers that used an aerial robot are listed here: [76, 99, 253, 276, 291, 298, 304, 310, 311, 318, 341, 375, 384, 470]. Papers that used a social robot are listed here: [5, 8, 19, 26, 27, 32, 33, 42, 49, 68, 69, 70, 87, 95, 102, 106, 111, 121, 139, 146, 152, 155, 160, 168, 172, 191, 193, 201, 206, 209, 216, 233, 234, 246, 249, 250, 259, 267, 270, 283, 292, 305, 324, 327, 347, 354, 356, 371, 373, 376, 387, 390, 391, 394, 395, 396, 400, 405, 418, 435, 439, 452, 457, 467, 471, 473, 476, 481, 489, 498, 499].

8 RQ4. What Are the Vision Techniques AND Tools Used in Human-Robot Collaboration AND Interaction?

This section provides a detailed review of robotic vision techniques used in selected papers, including methods, algorithms, datasets, cameras, and methods to allow robots to provide information or take action. Common techniques are discussed in detail to provide clear trends on how methods and techniques have been adapted from computer vision to robotic vision problems. However, many emergent techniques from computer vision are yet to be seen in HRI/C works. The use of these techniques may create many new opportunities for robots to help people in domains that have not yet been explored due to technical challenges, and to the speed or accuracy needed to be useful to the person, as discussed in Section 12. Yet this level of development and testing will likely experience a translation delay from the advances seen in computer vision, given the need for human study approvals, extensive testing on hardware components that are subject to error and malfunction, and robust results that can meet peer-review standards for publication.
This RQ is important to the field of HRI/C because understanding the visual world is fundamental for interacting with the environment and its actors. It is important for the HRI/C researcher to understand the computer vision techniques and tools that have been used so far, since these are often robust, well understood, and well suited to HRI/C applications. More significantly, these patterns of usage allow us to identify areas for improvement, where an over-reliance on traditional or proprietary techniques might be inhibiting progress.
To begin, robotic vision requires the robot to perceive a human and/or their actions to be able to provide a function, task, or service to the person during human-robot collaboration and interaction. Robotic vision is often performed in a two-step process. First, localization detects where the human is located in the robot's field of view, often to the granularity of the position of specific parts of the body, such as the hands. Second, classification determines what gesture, action, or expression is being shown by the person. This can include tracking across image frames to help resolve actions that are ambiguous or reliant on motion cues, and to facilitate continuous interaction between the person and the robot. This visual information can then help the robot to determine its next action or movement, making the vision process central to the function and utility of the robot to the person. Multiple sensors are also commonly used to provide different types of relevant information, especially the combination of color and depth measurements. The next section will discuss implemented solutions for human detection and pose estimation, gesture classification, action classification, tracking, and multi-sensor fusion. A summary of camera types, including relevant information and examples, is listed in Table 1.
Table 1. Camera Types Represented in the Corpus of Papers and Their Properties

Camera | RGB Resolution | Depth Resolution | Depth Range | Field of View (Degrees) | Frame Rate | Examples
Microsoft Kinect | 1920 × 1080 | 512 × 424 | 0.5–4.5 m | 70 × 60 | 30 fps | [262, 267, 487]
Intel RealSense | 1920 × 1080 | 1280 × 720 | 0.3–3 m | 69 × 42 | 30 fps | [28, 327, 390]
PrimeSense | 1280 × 960 | 640 × 480 | 0.35–3 m | 54 × 45 | 30 fps | [406, 484]
Asus Xtion Pro Live | 1920 × 1080 | 640 × 480 | 0.5–6 m | 57 × 44 | 30 fps | [8, 314, 409]
Logitech c9xx | 1920 × 1080 | — | — | 78 | 30 fps | [32, 68, 427]
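To make the two-step localization-and-classification process described above concrete, the following minimal Python sketch runs a per-frame detect-then-classify loop. The localization step uses OpenCV's built-in HOG pedestrian detector purely as an accessible stand-in (it is not a method drawn from the surveyed corpus), and the classification step is a placeholder where a gesture, action, or expression classifier would sit.

```python
import cv2

# Step 1 (localization): OpenCV's built-in HOG + linear SVM pedestrian detector,
# used here only as a readily available stand-in for the detectors surveyed below.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Step 2 (classification): placeholder. In the surveyed systems this would be a
# gesture/action/expression classifier (e.g., an HMM, SVM, or CNN) applied to the
# localized region or to tracked skeleton joints.
def classify_person(person_crop):
    return "unknown-action"  # stand-in label; a real classifier goes here

capture = cv2.VideoCapture(0)   # RGB stream; an RGB-D camera would also supply depth
while True:
    ok, frame = capture.read()
    if not ok:
        break
    boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8))
    for (x, y, w, h) in boxes:
        label = classify_person(frame[y:y + h, x:x + w])
        # The (box, label) pair would then feed the robot's planner/controller.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("two-step robotic vision", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
cv2.destroyAllWindows()
```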

8.1 Human Detection and Pose Estimation

Detection of the location of the person or people in a stream of visual data is often the first step for reasoning about them. Pose estimation techniques go beyond coarse detection (e.g., of bounding boxes) to find the location of body parts and their connections (i.e., skeleton estimation).

8.1.1 Commercial Software.

The most common approach from the corpus of selected papers was to apply skeleton extraction and tracking software to depth images from RGB-D cameras (N = 107), which combines detection and pose estimation. This software was primarily sourced from the Microsoft SDK for the Kinect camera or OpenNI for PrimeSense cameras. This approach has the advantage of using commercial-grade software that is easily available, real time, and robust. Example papers include the following: [18, 20, 26, 41, 55, 58, 73, 79, 88, 92, 94, 102, 118, 127, 136, 150, 158, 185, 191, 193, 225, 232, 233, 251, 253, 262, 263, 267, 292, 296, 309, 312, 313, 340, 343, 347, 382, 394, 398, 405, 418, 420, 425, 444, 454, 465, 467, 475, 487, 493, 499].

8.1.2 Face Detection.

The next most common method (especially for earlier papers) was to detect a person by first detecting their face using the Viola–Jones method [438] (\(N=34\)). This approach uses Haar filters to extract features from an image, followed by AdaBoost [143] to make predictions using a cascade of classifiers. This approach has the advantage of being extremely fast and performant, without requiring depth information. Depth can be used, if available, to disambiguate false positives based on the realistic size of a face [490]. Example papers that use this method include the following: [10, 25, 37, 61, 97, 101, 102, 113, 117, 137, 160, 216, 219, 230, 250, 271, 281, 283, 339, 373, 426, 453, 470, 471]. Once the face location was identified, it was sometimes used to help detect other body parts, such as the hands [117, 270, 297].
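As a minimal illustration of this approach, the sketch below applies the Haar cascade frontal-face detector bundled with OpenCV to a single image; the image path is a hypothetical placeholder, and the pre-trained cascade stands in for the detectors used in the cited papers.

```python
import cv2

# Load OpenCV's pre-trained Viola-Jones frontal-face cascade (Haar features + cascade of
# boosted classifiers), shipped with the opencv-python package.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("person.jpg")                         # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans the image at multiple scales with the cascade of classifiers.
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                       minSize=(40, 40))
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)

print(f"Detected {len(faces)} face(s)")
cv2.imwrite("faces.jpg", image)
```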

8.1.3 Segmentation.

Color segmentation is a simple and fast method to distinguish regions containing skin in a video stream. However, color segmentation is less robust than many other methods to different skin tones, scene colors, and occlusions such as clothing and hair. Segmentation is typically performed in HSV [231, 277, 397] or YCrCb [91, 270, 459] color spaces. This includes segmentation using skin threshold values [84, 91, 139, 270], or histogram analysis from a skin sample [231, 277, 459]. For example, Xu et al. [459] used color segmentation thresholds tuned for human skin tones and incorporated depth to remove segmentation errors. Color segmentation was also made more reliable by using distinctly colored items, such as clothing [80, 131, 178, 472] or colored tape around the person's body [236]. For example, Miyoshi et al. [298] had an aerial robot follow a glove of a known color that could be easily segmented. Given a segmented image, morphology and edge detection operations can be used to approximately extract the person from the image [385]. Depth segmentation is an alternative that relies on some assumptions about the scene and setup [117, 287, 333, 459]. For example, Paulo et al. [333] segmented hand gestures by setting minimum and maximum distances to the sensor, and Mazhar et al. [287] expanded a hand keypoint to a hand segmentation using depth.
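A typical HSV skin-segmentation pipeline of the kind described above might be sketched as follows; the threshold values are purely illustrative and, as noted, would need tuning for different skin tones, cameras, and lighting conditions.

```python
import cv2
import numpy as np

frame = cv2.imread("hand.jpg")                      # hypothetical input frame

# Segment skin-colored pixels in HSV space; these bounds are illustrative only and
# typically require tuning per person, camera, and lighting condition.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
lower_skin = np.array([0, 40, 60], dtype=np.uint8)
upper_skin = np.array([25, 180, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower_skin, upper_skin)

# Morphological opening/closing removes small speckles and fills holes in the mask.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Keep the largest skin-colored region as the hand/person candidate.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    hand = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(hand)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("segmented.jpg", frame)
```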

8.1.4 Region of Interest.

Regions of interest, such as a bounding box around a body, face, or hands, can also be acquired using deep learning techniques. The most prevalent models include the Region-Based Convolutional Neural Network (R-CNN) family of architectures [151] and single-shot detectors [258, 359]. These approaches are fast and robust, and are able to detect multiple people in a single image. The R-CNN family consists of two-stage networks that generate many object proposals before filtering and classifying them, whereas single-stage networks predict the final bounding boxes in one go, saving computation time at the expense of accuracy. From the corpus of selected papers, Vasquez et al. [430] used Fast R-CNN to extract a human bounding box from an image of a person using a walking aid. The SSD network was used in other papers to locate bodies [97, 190] or faces [69, 247, 448] in an image. For example, Weber et al. [448] used an SSD network for face detection, fit faces to a deformable template model, and tracked the detections using LSD-SLAM (large-scale direct monocular SLAM) [123]. Although there are many other methods for detecting humans that have been deployed on robots, such as detection from 2D range data [24, 48, 203], these approaches were not present in the corpus of papers.
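The following sketch shows one way to obtain person bounding boxes with an off-the-shelf two-stage detector (a Faster R-CNN from torchvision pre-trained on COCO, in which class 1 is "person"); it is a generic illustration of the R-CNN family rather than the specific networks used in the cited works, and the image path is assumed.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained Faster R-CNN (two-stage detector) from torchvision, trained on COCO.
# Older torchvision versions use pretrained=True instead of the weights argument.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("scene.jpg").convert("RGB")      # hypothetical input image
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

# Keep confident detections of the COCO 'person' class (label id 1).
for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if label.item() == 1 and score.item() > 0.7:
        print("person at", [round(v, 1) for v in box.tolist()],
              "score", round(score.item(), 2))
```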

8.1.5 Pose Estimation.

Deep learning is also commonly used for human pose estimation from RGB images. Image-based pose estimation usually refers to extracting keypoints and skeletons from an image of one or more people, without any additional depth information. Out of the HRI/C corpus surveyed, Convolutional Pose Machines (CPM) [449] and OpenPose [64] were used most often (\(N=15\)) for robotic vision. The OpenPose approach trains a network to predict the location of all joints in the image, then assembles skeletons using learned part affinity fields in a post-processing step. The main advantage of this bottom-up approach is that it runs in real time, and runtime is independent of the number of people in the image, making it particularly suitable for human-robot interaction. An additional explanation for its prevalence among HRI/C papers is that it has a well-known, well-maintained, and high-quality codebase that is user-friendly and publicly available. Example papers can be seen here: [72, 160, 239, 282, 287, 296, 427, 482].
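Because OpenPose ships with its own C++/Python codebase, the sketch below instead uses torchvision's pre-trained Keypoint R-CNN, a top-down (rather than bottom-up) detector, simply as an accessible way to obtain COCO-style skeleton keypoints; it is not the method used in the cited papers, and the input image is a placeholder.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Keypoint R-CNN predicts 17 COCO keypoints (nose, eyes, shoulders, elbows, wrists, ...)
# per detected person. Note this is a top-down detector, unlike OpenPose's bottom-up approach.
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("person.jpg").convert("RGB")     # hypothetical input image
with torch.no_grad():
    out = model([to_tensor(image)])[0]

for keypoints, score in zip(out["keypoints"], out["scores"]):
    if score.item() > 0.8:
        # keypoints has shape (17, 3): x, y, and a visibility flag per joint.
        for joint_id, (x, y, visible) in enumerate(keypoints.tolist()):
            print(f"joint {joint_id}: ({x:.1f}, {y:.1f}) visible={visible}")
```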

8.2 Gesture Classification

For human-robot interaction or collaboration, the human is often required to perform a specific hand or body configuration to interact with the robot. Once the region of interest or skeleton is extracted, hand and body gestures are classified through a variety of methods, depending on the dynamic or static nature of the gestures being performed. For static hand gestures, classification was often performed from segmented images by determining the contours (boundary pixels of a region) and the convex hull (smallest convex polygon to contain the region) as a method of counting the number of fingers being held up [91, 109, 231, 397, 459]. The distance of each contour point to the centroid of the segmented image was also used for finger counting [270, 277]. From skeleton information, gestures were often classified from the 3D joint positions of the human skeleton using the angular configuration of the joints [27, 87, 129, 145, 158, 172, 233, 253, 296, 303, 323, 444, 452, 469]. For example, Ghandour et al. [150] trained a neural network classifier to recognize body pose gestures using joint positions from a single image as input. In contrast, dynamic gestures, which consist of a sequence of poses, require multiple image frames for classification. Common algorithms used to track and classify dynamic gestures include hidden Markov models [62, 117, 145, 330, 408, 458], particle filters [62, 118, 330, 348], and dynamic time warping [63, 338, 351, 439]. For example, Tao and Liu [408] used a hidden Markov model to classify hand waves in various directions, Cicirelli et al. [94] used a neural network to recognize gestures from an input of Fourier-transformed joint positions, and Li et al. [248] reasoned about temporal information using an LSTM to determine the intention of the person. Although traditional machine learning techniques have been used for gesture and pose classification, such as k-nearest neighbors [146, 339], SVM [80, 84, 122, 155], and multi-layered perceptrons [281, 408], deep neural networks have become increasingly prevalent. Convolutional neural networks have been used to classify gestures directly from color images [21, 287, 459] or depth images [209, 254]. Other architectures have been used to classify gestures from intermediate representations such as skeletal information [94, 150, 248] and other sensory inputs [281, 408]. Practitioners can fine-tune pre-trained models on task-specific data [287], or train from scratch with a custom dataset [447, 459]. As previously indicated, neural networks have been used to classify both static [150] and dynamic [94, 250] body poses.
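The contour and convex-hull finger-counting approach described above can be sketched with OpenCV as follows, assuming a binary hand mask such as the one produced by the earlier segmentation example; the convexity-defect depth threshold is illustrative.

```python
import cv2
import numpy as np

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical binary hand mask

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
hand = max(contours, key=cv2.contourArea)                  # largest region = hand candidate

# Convexity defects are the "valleys" between extended fingers: the number of
# sufficiently deep defects approximates (number of raised fingers - 1).
hull_indices = cv2.convexHull(hand, returnPoints=False)
defects = cv2.convexityDefects(hand, hull_indices)

finger_gaps = 0
if defects is not None:
    for start, end, farthest, depth in defects[:, 0]:
        if depth / 256.0 > 20:     # depth is in 1/256-pixel units; threshold is illustrative
            finger_gaps += 1

num_fingers = finger_gaps + 1 if finger_gaps > 0 else 0
print("estimated raised fingers:", num_fingers)
```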

8.3 Non-Gestural Action Classification

Human action classification involves the identification of specific types of human motion from video streams. For this section, gestural actions are considered to involve gestures where the person is explicitly trying to communicate with the robot, whereas non-gestural actions involve actions such as walking, eating, running, and sitting down. Non-gestural actions can be important for the robot to recognize to facilitate contextual understanding, but, unlike a gesture, they may not require an immediate response from the robot. Several action classification methods operated on pre-detected human keypoints [72, 158, 160, 233, 239, 394, 493]. For example, Görer et al. [158] compared the keypoint positions of motion identifier joints on an elderly user performing an exercise pose with those of a human demonstrator to identify any disparities. In another example, Vasquez et al. [430] classified the category and estimated the 3D position and velocity of a person with a walking aid using a ResNet-50 [180] network and a hidden Markov model. Efthymiou et al. [121] showed that dense trajectories [443] provide better features that result in more accurate action classifications than convolutional neural networks when there is a mismatch between training and testing data (e.g., identifying the actions of children from models trained on large action recognition datasets where adults are more prevalent). Last, Gui et al. [160] trained a generative adversarial network from a sequence of skeleton keypoints over time, extracted using OpenPose [64], to generate plausible motion predictions.
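As a simplified version of the keypoint-comparison idea used in the exercise-coaching example above, the following NumPy sketch measures per-joint disparity between a user's pose and a demonstrator's pose; the joint arrays, normalization, and threshold are illustrative assumptions rather than the published method.

```python
import numpy as np

# Hypothetical (x, y) keypoints for a subset of joints, e.g., from a skeleton tracker:
# order = [shoulder_l, shoulder_r, elbow_l, elbow_r, wrist_l, wrist_r]
demonstrator = np.array([[0.30, 0.40], [0.70, 0.40], [0.25, 0.60],
                         [0.75, 0.60], [0.20, 0.80], [0.80, 0.80]])
user = np.array([[0.32, 0.41], [0.69, 0.42], [0.27, 0.58],
                 [0.74, 0.65], [0.35, 0.70], [0.78, 0.83]])

def normalize(pose):
    """Translate to the shoulder midpoint and scale by shoulder width so the comparison
    is invariant to where the person stands and how large they appear in the image."""
    mid = (pose[0] + pose[1]) / 2.0
    width = np.linalg.norm(pose[0] - pose[1])
    return (pose - mid) / width

disparity = np.linalg.norm(normalize(user) - normalize(demonstrator), axis=1)
for joint, err in zip(["shoulder_l", "shoulder_r", "elbow_l",
                       "elbow_r", "wrist_l", "wrist_r"], disparity):
    flag = "off target" if err > 0.15 else "ok"          # illustrative threshold
    print(f"{joint}: disparity {err:.2f} ({flag})")
```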

8.4 Social Classification

This section explores the classification of social factors that were not directly associated with a specific gesture or action, including facial expression, level of engagement, and intent to interact with the robot. For example, Saleh and Berns [373] classified head movements, such as nodding and shaking, with an SVM, using the direction and magnitude patterns of depth pixels as features. Facial expressions were classified by fitting an active appearance model [283], by using Gabor filters with principal component analysis and an SVM [216], or by using a neural network classifier on facial keypoints [292]. Gender classification was performed from principal component analysis of face regions [216], or using an SVM with local binary patterns and histogram-of-gradient features [89]. Intention to interact with a robot was identified using random forest regression on facial expression features [247], or from a combination of the user's line of sight to the robot, shoulder orientation, and speech activity [309]. Last, level of engagement was estimated using gaze analysis techniques to determine whether a person was averting their gaze during conversation with a social robot [49].
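A minimal SVM-based social classifier in the spirit of the head-movement example above might look like the following sketch; the feature vectors (summary statistics of head motion direction and magnitude) and labels are synthetic placeholders, not data from the cited study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: each row summarizes head motion over a short window,
# e.g., [horizontal displacement, vertical displacement, horizontal var, vertical var].
rng = np.random.default_rng(0)
nod_features = rng.normal([0.0, 2.0, 0.2, 0.1], 0.3, size=(50, 4))    # mostly vertical motion
shake_features = rng.normal([2.0, 0.0, 0.1, 0.2], 0.3, size=(50, 4))  # mostly horizontal motion
X = np.vstack([nod_features, shake_features])
y = np.array(["nod"] * 50 + ["shake"] * 50)

# Standardize features, then fit an RBF-kernel SVM.
classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
classifier.fit(X, y)

new_window = np.array([[0.1, 1.8, 0.25, 0.12]])   # hypothetical features from a new observation
print("predicted head movement:", classifier.predict(new_window)[0])
```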

8.5 Human Motion Tracking

To interact with a human, robots are often required to track the person through multiple frames, which includes detecting them, tracking their motion, and re-acquiring the intended person if they have been occluded or fall out of frame. Human motion tracking for the purpose of robotic vision has been performed using particle filters [10, 25, 125, 219, 309, 312, 319, 348], optical flow [454], or SLAM [16, 41, 448]. In specific examples, Fahn and Lin [125] used a particle filter to track a face from the center of the image using its color properties, whereas Nair et al. [319] tracked multiple people with a particle filter by first segmenting out the static background and then tracking bounding boxes in the foreground. Image keypoints and features, such as SURF features [36], have been used to find correspondences across frames [453], where the track can be initialized by providing a known pattern worn by the target [85] or by automatically identifying features on a detected person's clothing [453]. Kernelized correlation filters [182] have been used for efficient tracking [190, 273], and Kalman filters have been frequently used to reason about and reduce tracking errors from noisy sensors and odometry [139, 163, 263, 273, 312, 406, 430]. For example, Foster et al. [139] propagated a set of pixel hypotheses for segmented skin-colored blobs with a Kalman filter. Last, human motion prediction [79, 163, 188, 232, 312] and robot kinematic models [444, 492] have been used to better track the person. To demonstrate, Landi et al. [232] used a neural network to predict hand positions to avoid collisions, together with inverse kinematics to plan a trajectory that could avoid the human [313, 482]. Although not expressly used in the corpus of HRI/C papers, there is a significant body of work on techniques for detecting and tracking groups of people, which is likely to be used in future work [199, 235, 255, 317, 410, 411, 433, 434]. For example, Lau et al. [235] and Linder and Arras [255] cast group detection and tracking as a multi-hypothesis model selection problem, a probabilistic model that allows for the splitting and merging of clusters. RGB-D sensors are frequently used by robots for multi-person tracking [199, 317, 410]; the last of these [410] predicts social groups from egocentric RGB-D data by reasoning about joint motion and proximity estimates. These approaches have significant potential for use in HRI/C applications, since reasoning about group behavior is likely to be critical for robots in social settings.
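A constant-velocity Kalman filter of the kind frequently used to smooth noisy person detections and bridge short occlusions can be sketched with OpenCV as follows; the measurement stream is a hypothetical sequence of detected image positions.

```python
import cv2
import numpy as np

# Constant-velocity Kalman filter: state = [x, y, vx, vy], measurement = [x, y].
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

# Hypothetical noisy detections of a person's image position; None simulates an occlusion.
detections = [(100, 200), (104, 203), (None, None), (112, 210), (116, 214)]

for (x, y) in detections:
    prediction = kf.predict()                       # predicted position even without a detection
    if x is not None:
        measurement = np.array([[x], [y]], dtype=np.float32)
        estimate = kf.correct(measurement)          # fuse the prediction with the new detection
    else:
        estimate = prediction                       # occlusion: fall back on the prediction
    print("estimated position:", float(estimate[0, 0]), float(estimate[1, 0]))
```

In practice, the measurements would come from any of the detectors discussed above, and the predicted state can keep the robot's controller updated during brief occlusions before the person is re-acquired.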

8.6 Multiple Sensors

In the selected papers, robotic vision was often paired with other sensors to enhance the robot's capacity to perceive and respond to the human (see Table 2). Other sensors often involved microphones for speech recognition [62, 118, 145, 286, 442, 475], laser range sensors [14, 188, 219, 271, 339, 450] and ultrasonic sensors [178] to help determine distance, audio sensors for locating the active speaker [70, 322] and relocating people who were out of view [37, 271, 322], and Leap Motion sensors [209] and inertial measurement units to help track movement and orientation [118, 169, 487]. Force sensors were used with vision sensors during applied tasks such as object handover [38] and collaborative manipulation [88, 225], and to determine contact in a safe workspace scenario [51]. Tactile sensors were also used in object handover to assist physical interaction with the environment [192]. Humans who operated the robot (N = 138) were often provided with additional information from other sensors or information sources. Operators made informed decisions based on information from sources such as different LED colors [13, 101, 146, 253, 297], feedback on a display screen [52, 79, 145, 236], video feedback [229, 397], augmented reality [231], haptic feedback [379], a spoken response from the robot [145, 146, 287, 490], or robot movement [303, 490] to signal or confirm that the command had been received by the robot. In most examples, the position or configuration of the robot was sufficient (N = 96; e.g., [21, 26]).
Table 2. Datasets Used in Robotic Vision for Human-Robot Interaction/Collaboration

Name of Dataset | Type of Data | Volume of Data | Usage | Used by
NTU RGB-D dataset [383] | RGB-D images and 3D skeletal data | 60 action classes, 56,880 video samples | Action recognition | [239, 394]
Manipulation action dataset (MAD) [135] | Video of object/action samples (e.g., cup: drink, pound, shake, move) | 625 recordings | Action recognition | [446]
H3.6M dataset [194] | RGB-D images with bounding regions, 3D pose data | 3.6 million 3D human poses | Human motion prediction | [160]
Market-1501 dataset [497] | RGB images | 32,000 annotated bounding boxes | Person detection | [238]
INRIA dataset [104] | RGB images | 1,805 images of humans | Person detection | [14, 316]
HollywoodHeads dataset [441] | RGB video frames | Annotated head regions for 224,740 video frames | Head detection | [448]
The extended Cohn-Kanade (CK+) dataset [210, 269] | RGB images | 593 image sequences | Facial expression recognition | [216, 250, 395]
The AffectNet Database [308] | RGB images | 1,000,000 facial images | Facial expression recognition | [250]
The WIDER FACE dataset [468] | RGB images | 32,203 images, with 393,703 bounding regions | Facial expression recognition | [247]
The Aberdeen Facial Database [2] | RGB images | 687 faces | Face detection | [89]
Glasgow Unfamiliar Face Database (GUFD) [3] | RGB images | 6,000 images of faces | Face detection | [89]
Utrecht ECVP Facial Database [4] | RGB images | 131 images | Face detection | [89]
ChaLearn Looking at People Challenge [124] | RGB-D images | 14,000 images of hand gestures | Hand gestures | [60]
Kinect Tracking Precision (KTP) dataset [316] | RGB-D images | 8,475 frames, 14,766 instances of people | Person detection | [316]
Annotated hospital dataset [430] | RGB-D images | 17,000 annotated images | Mobility detection | [430]
OpenSign [287] | RGB-D images | 20,950 images | Hand gestures | [287]
Dataset by Lima et al. [254] | RGB images | 160,000 images of open and closed hands | Gesture recognition | [254]

9 RQ6. What Has Been the Main Participant Sample, AND How Is Robotic Vision in Human-Robot Collaboration AND Interaction Evaluated?

9.1 Main Participant Sample

Many published works reported little human-relevant information. Of the 310 papers, only 66 (21%) reported details of a human experiment or testing with people. In the studies that did report participant numbers, there was a calculated total of 1,228 participants across all papers (\(M = 20\), range = 1–150, \(SD = 22\); Christiernin and Augustsson [92] did not report numbers). In studies that reported participant age (22%), participants were on average 32 years old (range = 1–88, \(SD = 24\)). For papers that reported gender (35%), there was an average split of 70% male and 30% female participants. No studies reported participant country of origin. Gesture recognition had the highest number of participants at 302, across 23 of 116 papers (20%). Few experiments directly evaluated robotic vision performance. Instead, experiments often focused on evaluating the robot as part of its overall intended task or role in the interaction. Evaluation metrics were more likely to be objective (N = 384) than subjective (N = 40), with metrics typically centered on robot components such as overall perception of the robot, robot task performance, or preference for the robot compared to other modalities or system setups. Common quantitative questionnaires included the NASA-TLX (e.g., [202, 221, 425]) and GODSPEED (e.g., [139, 452, 493]). Others included the PARADISE framework [139], the Positive and Negative Affect Scale [452], the Robot Acceptance Scale [452], or a custom-made scale, such as a 9-point evaluation [348], a 7-point comfort rating [379], or a 5-point robot performance rating [240]. Figure 14 depicts the number of team metrics and team metric categories, and team metrics for each application area, domain, robot type, and camera type. A selection of exemplar use cases is provided in the next section (Section 10) to describe state-of-the-art robotic vision in HRI/C.
Fig. 14. Trends in robotic vision evaluation in HRI/C.

9.2 Evaluation Metrics in Each Application Area

9.2.1 Gesture Recognition.

Robot evaluation scores often covered preference ratings, how enjoyable people found the interaction, and how favorably people viewed the robot. For instance, the robot was found to be engaging during a social interaction [102], the interaction was enjoyable in a navigation task [229], and the robot was the preferred choice when compared to other control methods such as a joystick or gamepad [472], or when compared to other modalities such as a screen in a public service center [209]. People reported high ratings or preference for the robotic vision component involved in the interaction, such as rating gestures as natural and easy to use (79%, \(n=24\) [63]) or preferring specific gesture styles (62% preference for elbow-to-finger and 38% for eye-to-finger pointing [6]). This was not systematic, with modalities not related to robotic vision also rated favorably in some studies, such as physical interaction rated as the least demanding and most accurate method to guide a mobile robot to different waypoints (\(n=24\) [202]), or handheld devices being easier to use than gestures (\(n=23\) [229]). Other robot evaluations included learning speed for use of the robot in a gesture-controlled grasping task (\(n=10\) [254]), the number of errors in terms of distance from the robot when signaling to pick up items (\(n=16\) [117]), or accuracy during a gesture game with a robot (15% of gestures lost due to out-of-distribution gestures and 10% due to classification error; \(n=30\) [155]).

9.2.2 Action Recognition.

Robot evaluations often involved preference scores, willingness to use, and satisfaction levels with the robot. People rated their impression of the robot's behavior compared to several baselines on a 1 to 5 scale (\(n=12\) [240]). Trust was assessed in a home service robot that could detect the person's actions, with 75% of people (\(n=16\)) reporting that the feature was important [394], and more than 40 out of 50 people (exact number not reported) reporting a satisfaction level of at least 4 out of 5 [233]. In preference scores, 77% (\(n=30\)) preferred a robot for exercise compared to training videos [452], and 92% (\(n=32\)) reported willingness to continue the robot program [27]. Other evaluations included 65% (13 out of 20) of participants being unable to tell from its behavior whether the robot was autonomously operated or tele-operated [493]. Last, a robot that used activity recognition to support physical activity led to an increase in exercise success rates for 12 elderly users over a 3-week period [158].

9.2.3 Robot Movement in Human Spaces.

Evaluations often covered performance, preference, and acceptability. In one example, a mobile robot was successfully controlled in 77.4% of the interactions, with 8.9% of unsuccessful interactions attributed to fast rotational robot movement or a large distance between the camera and person [122]. Considering preference, a human following task revealed that people in general found the robot's behavior appropriate, but most reported being uneasy with it (\(n=13\) [348]); 46% (6 out of 13) were willing to adapt their living environment to accommodate a mobile robot, but 10 did not want to adjust their walking speed [348].

9.2.4 Object Handover and Collaborative Actions.

Robot evaluations involved both performance and preference rubrics. For performance, 12 people used a UR10 for a user-controlled pick and place task, and after three trials those with no experience achieved greater than 75% success with an average interaction time of 69 seconds [52]. In addition, higher mental load was found for seven novice and experienced operators in a shared work space when the robot operated closer to the person with a higher velocity [407]. For preference, 4 people (100% of the sample) reported that a robot arm did not always match their expectations [92], participants reported that collaborative actions from robots in surgical assistance could help them perform their role more efficiently (n = 16 [221]), and a robot personalized with user preferences reduced workload for a shoe fitting task, including a shorter completion time and fewer commands [425]. Last, some found no differences in time or error rates compared to humans when passing instruments to a person [221].

9.2.5 Social Communication.

There were performance and preference scores for social communication. For performance, detection accuracy was high across 16 people both for interest to engage (99%) and for when the person was not interested (92%) [373]. A robot served drinks successfully with high accuracy (100% in a single-person scenario), a fast response time (658 ms), and the expected interaction time (49.4 seconds) [139]. Gesture recognition in a human-robot gesture game had good recognition accuracy and speed (92%, \(n = 5\) [476]). Other performance results were reported for a PR2 robot that achieved a 100% success rate in answering user requests (72.7% on the first attempt, 18.2% on the second, and the remainder on the third [309]). Some behaviors were also improved with robotic vision. For example, a humanoid receptionist robot conversed with 26 people to compare engagement-aware behavior, which produced small improvements in eye gaze toward the robot (78.8% with and 73.7% without [249]). People perceived the robot as more intelligent and were more satisfied with the interaction, although no effect was noticed on task performance [249]. A robot bartender was rated as likable, intelligent, and safe by 31 people [139]. Others found larger increases with robotic vision. For example, gaze and pause detection to determine when a person had finished speaking resulted in a twofold increase in talking time compared with filled pause detection (n = 28 [50]). Some also reported engagement with the robot but provided no statistics [405].

10 RQ7. What Is the State of the Art in Vision Algorithm Performance for Robotic Vision in Human-Robot Collaboration AND Interaction?

Considering robot evaluation and vision algorithm performance, state-of-the-art performance is paramount to the functional benefit of the robot, including what tasks or services the robot can provide to the human. However, Section 9 demonstrated that few papers reported standardized metrics that directly evaluated robotic vision performance, which makes it challenging to fairly compare the performance of the vision algorithms used in these works. It is nonetheless important to highlight works that make superior use of robotic vision in HRI/C systems. Therefore, we present selected works that well represent the use of robotic vision in human-robot collaboration and interaction with respect to the criteria of novelty, impact, and/or robustness. Exemplar studies were identified that showcased creative and/or robust use cases of robotic vision, given that systematic differences could not be calculated across studies from the reported metrics and results. These examples help to point to future pathways for the field: increasing experimental rigor through more experimentation, and more systematically evaluating the feasibility, speed, and accuracy of robotic vision used with people.

10.1 Gesture Recognition

Mazhar et al. [287] demonstrated control of a KUKA arm via hand gestures by fine-tuning an Inception V3 convolutional neural network on a custom dataset (OpenSign). This resulted in a system that was able to detect 5 gestures in a row at 40 Hz (250 ms per detection). OpenPose was used to localize hands in the dataset images, and the Kinect V2 depth map was used to segment the hands from the background, allowing background substitution for data augmentation. Inception V3 fine-tuning resulted in a validation accuracy of 99.1% and a test accuracy of 98.9%. The dataset contained RGB-D images of 10 gestures performed by 10 people, including 8,646 original images and 12,304 synthetic images created through background substitution. Waskito et al. [447] tested the robustness of their hand gesture classifier as the hand was rotated or as lighting conditions were varied, finding an average accuracy of 96.7% and a 0.141-second average response time per gesture. Last, Pentiuc and Vultur [338] used skeleton data from the Kinect and the dynamic time warping algorithm to detect 5 gestures with an accuracy of more than 86%. Although these methods cannot be compared directly, since different gestures and settings were being evaluated, the overall trend was to use higher-capacity models trained with more data to increase the accuracy and robustness of gesture recognition.
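Fine-tuning a pre-trained CNN for a small set of gestures follows a standard recipe, sketched below with PyTorch; for brevity it uses a ResNet-18 backbone rather than the Inception V3 network used in [287], and the dataset directory layout and class count are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_GESTURES = 10          # assumed number of gesture classes

# Pre-trained backbone; the final layer is replaced to output the gesture classes.
model = models.resnet18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, NUM_GESTURES)

# Hypothetical dataset laid out as gesture_data/train/<class_name>/<image>.jpg.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("gesture_data/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):                               # a small number of fine-tuning epochs
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```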

10.2 Human Detection and Tracking

To detect and track a specific person, Hwang et al. [190] integrated a Single-Shot Detector, FaceNet, and a Kernelized Correlation Filter. With this system, they were able to detect humans up to 8 m away and recognize specific faces, achieving a maximum position error of 4 cm and a maximum orientation error of 5 degrees. For tracking a person from a mobile robot, Weber et al. [448] achieved a 59.7% mean average precision with a Single-Shot Detector and a tracking-as-repeated-detection strategy. Zhang et al. [492] compared their detection and tracking method using target contour bands to several others on videos from the object tracking benchmark dataset, showing that the presented method was more accurate (94%) with the fastest processing time (34 fps). Once deployed on a mobile robot, the robot could follow an identified target for around 648 m. Fang et al. [127] found that dynamic body poses could be recognized with high accuracy (>96% classification accuracy on 300 tests), but limited details were provided on the method. To detect and distinguish people based on walking aids, Vasquez et al. [430] found that combining a Kalman filter, a hidden Markov model, and a Fast R-CNN region extractor improved system performance by a factor of 7 compared to a dense sliding window method.

10.3 Non-Gestural Action Recognition

In an action recognition task, Lee and Ahn [239] achieved an accuracy of 71% from an RGB camera on the NTU RGB-D dataset (75% from the Kinect) at 15 fps. For recognizing actions from a child, Efthymiou et al. [121] used dense trajectories from multi-view fusion as the input to their action recognition system and evaluated it on a test set of 25 children performing 12 actions, with comparisons to a test set of 14 adults. Finally, to adapt the motion of a robotic assistant rollator to the patient, Chalvatzaki et al. [72] found that their model-based reinforcement learning method, which uses predicted human actions, obtained a smaller tracking error than several other control methods.

11 RQ8. What Are the Upcoming Challenges for Robotic Vision in Human-Robot Collaboration AND Interaction?

This section will discuss potential and known challenges for robotic vision in human-robot collaboration and interaction, as well as a brief discussion of general robotic vision challenges. Overall summaries on future human-related challenges and general robotic vision challenges demonstrate target areas for consideration in the design, deployment, and future use of robotic vision in human spaces. Challenges specific to vision during human-robot collaboration include the ethical use of human data, human model selection and optimization, experimental design and validation, and appropriate trust.

11.1 General Challenges of Robotic Vision

In addition to challenges specific to human-robot interaction, there are also more general challenges related to robotic and computer vision. Although the performance of robotic vision systems is often bounded by what can be achieved by state-of-the-art computer vision algorithms, there are many reasons why state-of-the-art computer vision techniques have not been transferred to robotic platforms.
An algorithm that uses visual data may not be sufficiently robust to perform in the real-world conditions and edge cases relevant for HRI/C [98, 402]. Although visual data may contribute to better multi-modal understanding of human states, actions, and involvement with the robot [39], vision still presents challenges. To understand a given task or action, robots may require large training datasets for new tasks and/or access to processing capabilities that are unavailable on the robotic platform (e.g., multiple GPUs). Hardware performance can be limited by on-board processing compared to cloud processing, and by the availability, performance, and cost of hardware solutions, such as the RGB-D camera as an inexpensive source of depth information compared with expensive alternatives like laser range sensors [403]. Computationally intensive algorithms can require a GPU on the robot, which can be heavy, noisy, and power hungry [336].
Sim-to-real transfer can be a particular challenge for many learning-based robotics applications, and progress is often left to large companies that can implement large-scale data collection and testing [244]. In addition, deep neural networks often fail to generalize, with a reduction in accuracy when tested outside of benchmark datasets [358]; a method that achieves state-of-the-art performance on a benchmark dataset may not generalize immediately to a real-world setting on a robotic platform. Cutting-edge computer vision techniques may also be too recent to have been adopted in physically embodied robots, let alone applied to scenarios related to HRI/C. When techniques have been transferred into HRI/C applications, robots can encounter failures due to computational delays or to the complexity of the vision-based activities related to the task, such as perceiving and understanding the diversity of hand-based gestures from multiple people across different countries [399]. Real-world vision-based challenges can also occur with robots operating around people, such as important information being occluded when perceiving the person, and critical information not being identified or perceived during the interaction. However, robots can help overcome some of these challenges, such as by controlling camera positioning, adjusting to capture missing information, and orienting visual capture to help fill in the missing gaps [402]. Additional challenges are software related, such as the availability of open source libraries and software development kits, and the lack of training data for specific use cases.
Robots must also be able to act upon the visual information in a relevant and suitable way. For instance, there continue to be challenges around translating signals from computer vision into actionable and useful robot functions, such as movement and manipulation actions that can improve the robot's utility for a given task [275]. This could be impacted by the limited use of participatory design to select suitable applications for robots to assist people [315], or by the inability of current vision systems to address the tasks and actions that people want robots to assist them with [462]. Either of these could have contributed to slowing the deployment of robotic vision use cases in human domains. However, human detection and tracking is clearly a key capability for many HRI/C systems, and therefore it is likely that the current state of the art in computer vision will be rapidly transferred to these systems in the near future.

11.2 Fair and Ethical Use of Human Data

First, there are important challenges around the fair and ethical use of human data for the purpose of HRI/C when interacting and collaborating with robots [96, 293]. For example, the General Data Protection Regulation (GDPR) describes regulations on the processing of personal data in Europe, including the consent of individuals for the use of their personal data [1]. Fair and ethical use will therefore require data processing and management methods that comply with national and/or international data protection regulations and laws. Over the next decade, robots are likely to continue to enter human spaces, and there may be limited public knowledge and awareness of how interacting with a robot may use personal information to help facilitate the interaction. Examples include the robot using its camera system to perceive and classify the person's facial features, body pose, and actions, as well as using visual information to make inferences on ways to engage the person, such as by classifying the person's age range, intent to interact with the robot, and future actions. Similar to other data-intensive fields, there are important implications for the use of human data in HRI/C, such as the capacity for people to give informed consent and the appropriate collection, management, and storage of data in the context of human-robot interaction. Consent to use data should be obtained with a clear explanation of, or the capacity to access detailed information on, what data is being collected, how it will be used, and how long it will be stored, particularly if the data will be used for personal, private, or third-party purposes. For instance, previously recorded data (images and videos) could be captured and approved for use in improving future robotic interactions, as is common in computer vision when fine-tuning pre-trained neural networks (e.g., [50]). Other future challenges around fair and ethical use will include data storage and/or ownership of any images and video collected from robots in human spaces, including the right for people to access, edit, or request deletion of any or all images or video streams collected from them. Robot interactions with a physical hardware system do not always include screens or terminals, and these interactions can be located in high-volume areas with frequent turnover of people, such as a public space. These robot interactions also often do not facilitate user agreements or consent notices similar to other digital methods such as website or smartphone application use [274]. Future work should therefore address clear and transparent notices of intent to use human data when images or video are captured for core vision-based features, such as following the correct person or identifying a specific customer to complete an order transaction. This could include active or passive consent, and/or accessible information about the robot deployed in the public space, with the potential for people to avoid or remove themselves from the robot's field of view. This could also include detailed consideration of processes to obtain appropriate informed consent for groups that require a second person to be involved in the consent process, such as guardians of young children or of those who are unable to consent for themselves.
Other challenges involve the concept of privacy and helping to mitigate negative effects around invasion of privacy from robot use [274]. For instance, weighing up the potential benefits and risks of each deployment helps to ensure that visual information is collected only when required for functionality and, when it is collected, that it is handled and stored with proper care. Furthermore, there have been advances in privacy-preserving computer vision, such as anonymizing faces during action detection [362], as well as privacy-preserving visual SLAM [386], given that point clouds can retain a sufficient level of information to re-create the surrounding environment, potentially compromising privacy if people were intentionally or unintentionally involved in the scene [345]. Continued deployment of robots in public spaces or in private or sensitive contexts, such as the home, may require the consistent use of privacy-preserving vision techniques to ensure that human data is handled appropriately, and that humans do not have unresolved concerns about how robots with robotic vision capacity will operate safely in their own space.
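As a concrete example of this kind of privacy-preserving processing, a robot could blur detected faces before any frame is logged or transmitted; the sketch below combines OpenCV's bundled face cascade with a Gaussian blur and is illustrative only, not the method of the cited works.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def anonymize_faces(frame):
    """Blur every detected face region so stored/transmitted frames do not retain identity."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        face = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 30)
    return frame

image = cv2.imread("robot_camera_frame.jpg")        # hypothetical captured frame
cv2.imwrite("robot_camera_frame_anonymized.jpg", anonymize_faces(image))
```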

11.3 Human Models

Second, there are important challenges around model selection and optimization for human behaviors, including the reliability and validity of the behavioral phenomena to be captured and responded to through robotic vision. This systematic review presents several important use cases, including gesture and action recognition, human walking trajectories, object handover, social communication, and learning from human demonstration. These range from simple to complex behaviors that require the robot to understand the person, and some are not straightforward to capture through model selection and optimization. One key example is emotion recognition. Robotic vision that depends on classifying human intent or emotion into static categories (e.g., happy or sad) can produce inaccurate identifications and/or irrelevant robot responses, because it does not consider the more complex human emotional spectrum [31]. For instance, behavioral science research has drawn attention to the unsuitability of current computer vision methods for detecting a person’s emotional state from their facial movements, instead calling for research that explores how people actually move their faces to express emotion or other signals in different contexts [31]. This demonstrates that care should be taken in selecting what robotic vision should and should not be used for during human interaction with robots, for example, inferring human characteristics from visual data that could cause more harm than benefit to the interaction, such as classifying sexuality, race, or serious underlying medical conditions not known to the person. Incorrect model selection and optimization could also cause notable long-term problems in future robot deployments if development and testing continue to optimize for behavior that is not accurate or representative of the person, further contributing to bias [293]. This also raises the question of whether simulated humans need to be involved in the simulation process, and what level of realism is needed for this to be meaningful to the learning process. In addition, current methods that require large datasets may rely on datasets originally collected for other purposes, which may not translate easily to a new context, such as a dataset of a busy crowd repurposed to help robots learn about social norms in small groups. Last, people may eventually develop long-term hesitancy toward, or outright rejection of, robotic systems if repeated errors lead to the perception that these systems do not function, given the importance of robot performance for trust in the system [174].

11.4 Experimental Design and Evaluation

Third, there has been limited human experimentation and evaluation of robotic vision with humans. As reported in the preceding sections, few studies reported direct testing with humans, and those that did involved limited participant numbers. This is a challenge for future deployments: exploration of participant characteristics found that many studies did not involve a broad range of people representative of the general population, and instead relied on narrow samples with limited diversity [293]. Robotic vision for HRI/C could therefore end up designed and optimized for a very restricted sample, further contributing to bias [293]. For instance, the people who provide feedback during experimental testing effectively become the lead designers of future iterations of robot behavior and function. This can create barriers to wider-scale adoption when robots are deployed in the general community and inevitably encounter kinds of people who were not taken into consideration during design and refinement. Greater inclusion of different people has been the recent focus of co-design methodology for human-robot interaction and collaboration, to ensure that robots demonstrate more inclusive behavior for the wider range of people who are likely to use them (e.g., [366]). In addition, reported experiments often used single or simple evaluation metrics to measure robot perceptions and human-robot team performance, which may skew the evaluation of robotic vision in collaborative scenarios by not taking the human, the robot, and the team dynamic into consideration [105, 184]. Such a simplified approach to testing and evaluation could further skew development around how robotic vision should work to help people if testing with human participants remains rare and restricted to narrow samples.

11.5 Appropriate Trust

Last, there is a high potential for humans to perceive robots that can interpret visual information as having a high level of intelligence and a general capacity to function with the person and within the environment [228, 414]. For instance, it may not be clear to the person that the robot can identify only a limited visual field or targeted areas of interest, leading them to assume that the robot can view all of its surroundings and the activities that occur within them. The use of visual information to interact with the person may also contribute to an increased sense of anthropomorphic interpretation of the robot, leading people to perceive the robot as having more emotional expression or intelligence than it does [120]. This can result in people inflating their confidence in, trust in, or perception of the system, and relinquishing greater autonomy or responsibility to the robot than it is capable of handling on its own [364]. For instance, people may falsely assume that the robot has human-like perception and cognitive abilities [120, 228, 414], leading them to expect the robot to perform better than it can. Misunderstanding the capability of the robot could have notable consequences for safe and effective HRI/C, for instance, the human assuming that the robot will detect visual hazards for the person, or recognize its own errors or mistakes in collaborative work. Therefore, visual information in human-robot collaboration and interaction should be used for the functional purpose of the robot, and its potential strengths and limitations within the intended context should be explained to the people who use the robot, to help regulate expectations and define the robot’s intended role [228, 414].

12 Promising Areas of Future Research

There were several relevant computer vision methods that were not represented in the corpus of selected HRI/C papers. Four prominent examples are video convolutional networks, 3D human pose estimators, human–object interaction classifiers, and sign language recognition. Each of these could potentially have a significant impact on HRI/C research in the future.
To begin, recent state-of-the-art methods for action classification process video data using 3D convolutional networks [66, 132], unlike the predominantly frame-based classification approaches used in the HRI/C literature (see Section 8). New techniques include the inflated 3D convolutional network [66] and the two-stream SlowFast network [132]. Action classification from video is critical to many HRI/C systems, but 3D convolutional networks tend to require substantial training data for fine-tuning and substantial GPU resources for inference, making them challenging to transfer to many HRI/C systems. However, the growing availability of models pre-trained on diverse datasets, coupled with developments in transfer learning, helps mitigate this difficulty. It is therefore expected that pre-trained video action recognition networks could become a commonly used tool in HRI/C research. Although 2D human pose estimation was well represented, 3D human pose estimation was mostly absent, despite the fact that the spatial and shape information it provides could be very useful for HRI/C, enabling the robot to perform more accurate and functional actions with the person. Monocular 3D human pose estimation from a single image or video is a very popular topic in computer vision. Model-free approaches [334, 335] include VideoPose3D [335], which estimates 3D joint locations using temporal convolutions over a sequence of 2D joint detections. Model-based approaches [211, 212, 220, 222] predict the parameters (e.g., joint angles, shape, and transformation) of a body model, such as the SMPL mesh model [264]. Adversarial learning can be used [211, 212, 220] to encourage realistic body poses and motions, which tends to generalize better to unseen datasets and may therefore be more appropriate for HRI/C tasks. Modeling humans in 3D also allows physics to be taken into account, so that the robot can plan and respond more appropriately and avoid making non-physical predictions.
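As a sketch of how a pre-trained video action recognition network might be dropped into an HRI/C pipeline, the following Python/PyTorch example runs a Kinetics-400 pre-trained 3D ResNet on a short clip. It assumes a recent torchvision with bundled video weights, and the random tensor stands in for a preprocessed window of the robot’s camera stream.

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# A 3D-convolutional video classifier pre-trained on Kinetics-400.
weights = R3D_18_Weights.DEFAULT
model = r3d_18(weights=weights).eval()

# Dummy clip standing in for a short window of camera frames:
# (batch, channels, frames, height, width). Real frames should be
# preprocessed with weights.transforms() before inference.
clip = torch.randn(1, 3, 16, 112, 112)

with torch.no_grad():
    probs = model(clip).softmax(dim=1)

labels = weights.meta["categories"]  # Kinetics-400 class names bundled with the weights
top5 = probs.topk(5, dim=1)
for score, idx in zip(top5.values[0].tolist(), top5.indices[0].tolist()):
    print(f"{labels[idx]}: {score:.3f}")
```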
Another set of techniques of relevance to HRI/C are those developed for human-object interaction classification [75, 164, 350]. This task aims to extend action recognition to interaction recognition: localizing and describing pairs of interacting humans and objects in the scene. A robot that collaborates with people to perform a task would strongly benefit from knowing which object the person is interacting with at that point in time, and what type of interaction is taking place. For example, an airport assistance robot may need to detect instances of “person carrying suitcase” to determine where best to provide support to the person. Methods for this task almost always detect human and object bounding boxes first, before combining information from different modalities (appearance, relative geometry, human pose) using multi-stream networks [75, 164] or graph neural networks [142, 350]. There is a clear case for widespread use of these techniques in HRI/C, to facilitate higher-order reasoning about what the people proximal to the robot are doing, and with what objects.
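A minimal PyTorch sketch of the multi-stream fusion pattern follows. It is not the architecture of [75, 164]; it only illustrates how appearance, box geometry, and pose cues from an upstream detector could be combined into per-pair interaction scores. The feature dimensions and the number of verb classes are illustrative.

```python
import torch
import torch.nn as nn

class MultiStreamHOIHead(nn.Module):
    """Fuse per-pair cues into interaction scores, one score per verb class."""

    def __init__(self, appearance_dim=1024, pose_dim=34, num_verbs=10):
        super().__init__()
        self.human_stream = nn.Sequential(nn.Linear(appearance_dim, 256), nn.ReLU())
        self.object_stream = nn.Sequential(nn.Linear(appearance_dim, 256), nn.ReLU())
        # Relative geometry of the two boxes, e.g., normalized offsets and size ratios.
        self.spatial_stream = nn.Sequential(nn.Linear(4, 64), nn.ReLU())
        self.pose_stream = nn.Sequential(nn.Linear(pose_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(256 + 256 + 64 + 64, num_verbs)

    def forward(self, human_feat, object_feat, box_geometry, pose_keypoints):
        fused = torch.cat([
            self.human_stream(human_feat),
            self.object_stream(object_feat),
            self.spatial_stream(box_geometry),
            self.pose_stream(pose_keypoints),
        ], dim=-1)
        return self.classifier(fused)  # logits over interaction verbs, e.g., "carry"

# Dummy features for one human-object pair from an upstream detector and pose estimator.
head = MultiStreamHOIHead()
scores = head(torch.randn(1, 1024), torch.randn(1, 1024),
              torch.randn(1, 4), torch.randn(1, 34))
```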
In general, substantial progress has been made in computer vision and machine learning since 2020. Although beyond the scope of this work, these developments present significant opportunities for HRI/C. In particular, the Transformer architecture [431], originally proposed for natural language processing, has begun to supplant or supplement convolutional neural networks for vision, with large performance gains across many tasks, including image recognition [115], object detection [65], video understanding [23, 46, 332], and human-object interaction detection [141]. This represents a significant opportunity for the HRI/C community, because it is a general-purpose architecture that facilitates multi-modal sensor processing [198], allowing a robot to reason jointly about its video, audio, and other inputs. This is likely to expand and strengthen robot capabilities when interacting or collaborating with people.
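To illustrate the kind of joint reasoning meant here, the following PyTorch sketch projects video and audio token sequences into a shared space and fuses them with a standard Transformer encoder. It is a toy example rather than the Perceiver IO architecture of [198], and the feature dimensions and prediction head are hypothetical.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Project per-modality features to a shared token space and fuse with self-attention."""

    def __init__(self, d_model=256, num_layers=2):
        super().__init__()
        self.video_proj = nn.Linear(512, d_model)   # e.g., per-frame visual embeddings
        self.audio_proj = nn.Linear(128, d_model)   # e.g., audio spectrogram embeddings
        self.modality_embed = nn.Embedding(2, d_model)  # lets attention tell modalities apart
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 4)  # hypothetical interaction states to predict

    def forward(self, video_tokens, audio_tokens):
        v = self.video_proj(video_tokens) + self.modality_embed.weight[0]
        a = self.audio_proj(audio_tokens) + self.modality_embed.weight[1]
        tokens = torch.cat([v, a], dim=1)      # (batch, video + audio tokens, d_model)
        fused = self.encoder(tokens)
        return self.head(fused.mean(dim=1))    # pooled prediction over the whole window

model = MultiModalFusion()
logits = model(torch.randn(1, 16, 512), torch.randn(1, 32, 128))
```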
There are also additional areas in which computer vision research has the potential to adapt to or improve HRI/C across different settings. One notable area is sign language recognition [9, 245, 342], which was not present in the corpus of papers and represents an opportunity to further develop methods for non-verbal communication in HRI/C. The techniques involved, such as fine-grained gesture recognition and multi-modal learning, could benefit human-robot communication as well as general situational awareness. Another area is autonomous and assisted driving, in which robotic vision for HRI/C could have a notable impact on the uptake, efficiency, and safety of autonomous vehicles [116, 170, 369]. For example, autonomous driving requires that the vehicle detect and predict the trajectories of others around and on the road, such as drivers, cyclists, and pedestrians. This can involve close monitoring and coordination to ensure that humans can move safely around autonomous vehicles while the vehicles still reach their intended destinations. There is a growing body of work on the human component of autonomous driving, but most of it has so far been tested in simulated environments and without vision, creating notable opportunities to explore new areas of robotic vision for HRI/C-style tasks in the near future [116, 128, 170, 369, 466]. Other areas include further exploring the capacity to anticipate human actions in advance, enabling robot behavior that is anticipatory rather than purely reactive in responding to dynamic interaction patterns over time [161, 189, 325].
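As a concrete, deliberately simple example of trajectory prediction around a robot or vehicle, the following Python sketch implements a constant-velocity baseline of the kind often used for comparison in this literature. The learned predictors in the cited work are far richer; the sampling interval and positions below are hypothetical.

```python
import numpy as np

def constant_velocity_forecast(track, num_future=12, dt=0.4):
    """Extrapolate a pedestrian track (N x 2 array of x, y positions) at fixed time steps."""
    track = np.asarray(track, dtype=float)
    velocity = (track[-1] - track[-2]) / dt           # most recent displacement per second
    steps = np.arange(1, num_future + 1).reshape(-1, 1)
    return track[-1] + steps * velocity * dt          # (num_future, 2) predicted positions

# Observed positions of a pedestrian in the robot's frame, sampled every 0.4 s (hypothetical).
observed = [[0.0, 0.0], [0.2, 0.1], [0.4, 0.2], [0.6, 0.3]]
predicted = constant_velocity_forecast(observed)
# A planner can then check whether any predicted position intersects the robot's path.
```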

13 Conclusion

This survey and systematic review provided a comprehensive overview of robotic vision for HRI/C, with a detailed review of papers published in the past 10 years describing robots that can perceive and take action to facilitate a high-level task. Robotic vision has the capacity to improve HRI/C, including by creating new ways for humans and robots to work together. This survey and systematic review provided an extensive analysis of the use of robotic vision in human-robot interaction and collaboration, covering common domains, application areas, and performance metrics, and explored how computer vision has been adapted and translated through robotic vision to improve aspects of HRI/C. It also identified application areas that have not yet been attempted and showed how techniques from computer vision research could help to inform human-focused vision research in robotics. It was found that robotic vision for improving the capacity of robots to collaborate with people is still an emerging domain. Most works involved a one-on-one interaction and focused on using robotic vision to enhance a specific feature or function related to the interaction. It was also found that only some high-impact and novel techniques from the computer vision field had been translated for HRI/C, highlighting an important opportunity to improve the capacity of robots to engage and assist people. Newer and emerging areas in the HRI/C field, such as multi-human, multi-robot teams, were less represented in the corpus of papers [67, 195, 260]. Furthermore, robotic vision was often tested in a single or simple application field for each specific use case, showing limited depth in its current form.
Future pathways for HRI/C involve the development of robotic platforms that use vision-related information to create more competent robots able to operate in dynamic environments with people, for instance, by improving how robots handle multiple visual inputs at once to open up new domains and collaborative tasks, such as multi-human, multi-robot teams. Robotic vision could thereby help to break down some of the barriers present in long-term human-robot teamwork, such as adapting to dynamic environments and to different kinds of people over long periods of time.

References

[1]
European Commission. n.d. 2018 Reform of EU Data Protection Rules. Retrieved December 11, 2022 from https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-changes_en.pdf.
[2]
University of Stirling. n.d. Aberdeen Facial Database. Retrieved December 11, 2022 from http://pics.psych.stir.ac.uk/zips/Aberdeen.zip.
[3]
University of York. n.d. Glasgow Unfamiliar Face Database (GUFD). Retrieved December 11, 2022 from http://www.facevar.com/glasgow-unfamiliar-face-database.
[4]
University of Stirling. n.d. Utrecht ECVP Facial Database. Retrieved December 11, 2022 from http://pics.psych.stir.ac.uk/zips/utrecht.zip.
[5]
W. Z. B. W. Z. Abiddin, R. Jailani, and A. R. Omar. 2018. Development of robot-human imitation program for telerehabilitation system. In Proceedings of the 2018 11th International Conference on Developments in eSystems Engineering (DeSE’18). 198–201.
[6]
S. Abidi, M. Williams, and B. Johnston. 2013. Human pointing as a robot directive. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 67–68.
[7]
Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. 2009. Building Rome in a day. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision (ICCV’09). IEEE, Los Alamitos, CA, 72–79.
[8]
R. Agrigoroaie, F. Ferland, and A. Tapus. 2016. The ENRICHME project: Lessons learnt from a first interaction with the elderly. In Social Robotics. Lecture Notes in Computer Science, Vol. 9979. Springer, 735–745.
[9]
Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, and Andrew Zisserman. 2020. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In Proceedings of the European Conference on Computer Vision. 35–53.
[10]
B. Ali, Y. Ayaz, M. Jamil, S. O. Gilani, and N. Muhammad. 2015. Improved method for stereo vision-based human detection for a mobile robot following a target person. South African Journal of Industrial Engineering 26, 1 (2015), 102–119.
[11]
S. Ali, A. Lam, H. Fukuda, Y. Kobayashi, and Y. Kuno. 2019. Smart wheelchair maneuvering among people. In Intelligent Computing Methodologies. Lecture Notes in Computer Science, Vol. 11645. Springer, 32–42.
[12]
D. Almonfrey, A. P. do Carmo, F. M. de Queiroz, R. Picoreti, R. F. Vassallo, and E. O. T. Salles. 2018. A flexible human detection service suitable for Intelligent Spaces based on a multi-camera network. International Journal of Distributed Sensor Networks 14, 3 (2018), 1–22.
[13]
J. Alonso-Mora, R. Siegwart, and P. Beardsley. 2014. Human-robot swarm interaction for entertainment: From animation display to gesture based control. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 98.
[14]
V. Alvarez-Santos, X. M. Pardo, R. Iglesias, A. Canedo-Rodriguez, and C. V. Regueiro. 2012. Feature analysis for human recognition and discrimination: Application to a person-following behaviour in a mobile robot. Robotics and Autonomous Systems 60, 8 (2012), 1021–1036.
[15]
A. Angani, J.-W. Lee, T. Talluri, J.-Y. Lee, and K. J. Shin. 2020. Human and robotic fish interaction controlled using hand gesture image processing. Sensors and Materials 32, 10 (2020), 3479–3490.
[16]
A. T. Angonese and P. F. Ferreira Rosa. 2017. Multiple people detection and identification system integrated with a dynamic simultaneous localization and mapping system for an autonomous mobile robotic platform. In Proceedings of the 6th International Conference on Military Technologies (ICMT’17). 779–786.
[17]
Alberto Torres Angonese and Paulo Fernando Ferreira Rosa. 2016. Integration of people detection and simultaneous localization and mapping systems for an autonomous robotic platform. In Proceedings of the 2016 XIII Latin American Robotics Symposium and IV Brazilian Robotics Symposium (LARS/SBR’16). 251–256.
[18]
U. A. D. N. Anuradha, K. W. S. N. Kumari, and K. W. S. Chathuranga. 2020. Human detection and following robot. International Journal of Scientific and Technology Research 9, 3 (2020), 6359–6363. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85082858300&partnerID=40&md5=76978175cda39acb033058f4d669c257.
[19]
S. M. Anzalone, S. Ivaldi, O. Sigaud, and M. Chetouani. 2013. Multimodal people engagement with iCub. Advances in Intelligent Systems and Computing 196 AISC (2013), 59–64.
[20]
D. Araiza-Lllan and A. De San Bernabe Clemente. 2018. Dynamic regions to enhance safety in human-robot interactions. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation (ETFA’18). 693–698.
[21]
J. O. P. Arenas, R. D. H. Beleno, and R. J. Moreno. 2017. Deep convolutional neural network for hand gesture recognition used for human-robot interaction. Journal of Engineering and Applied Sciences 12, 11 (2017), 9278–9285.
[22]
Brenna D. Argall and Aude G. Billard. 2010. A survey of tactile human–robot interactions. Robotics and Autonomous Systems 58, 10 (2010), 1159–1176.
[23]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViVit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6836–6846.
[24]
Kai O. Arras, Oscar Martinez Mozos, and Wolfram Burgard. 2007. Using boosted features for the detection of people in 2D range data. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation. IEEE, Los Alamitos, CA, 3402–3407.
[25]
A. K. Arumbakkam, T. Yoshikawa, B. Dariush, and K. Fujimura. 2010. A multi-modal architecture for human robot communication. In Proceedings of the 2010 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids’10). 639–646.
[26]
A. Augello, A. Ciulla, A. Cuzzocrea, S. Gaglio, G. Pilato, and F. Vella. 2020. Towards an intelligent system for supporting gesture acquisition and reproduction in humanoid robots. In Proceedings of the 26th International DMS Conference on Visualization and Visual Languages (DMSVIVA’20). 82–86.
[27]
O. Avioz-Sarig, S. Olatunji, V. Sarne-Fleischmann, and Y. Edan. 2020. Robotic system for physical training of older adults. International Journal of Social Robotics 13 (2020), 1109–1124.
[28]
Bita Azari, Angelica Lim, and Richard Vaughan. 2019. Commodifying pointing in HRI: Simple and fast pointing gesture detection from RGB-D images. In Proceedings of the 2019 16th Conference on Computer and Robot Vision (CRV’19). 174–180.
[29]
X. Bai, C. Li, K. Chen, Y. Feng, Z. Yu, and M. Xu. 2018. Kinect-based hand tracking for first-person-perspective robotic arm teleoperation. In Proceedings of the 2018 IEEE International Conference on Information and Automation (ICIA’18). 684–691.
[30]
G. Baron, P. Czekalski, D. Malicki, and K. Tokarz. 2013. Remote control of the artificial arm model using 3D hand tracking. In Proceedings of the 2013 International Symposium on Electrodynamic and Mechatronic Systems (SELM’13). 9–10.
[31]
Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M. Martinez, and Seth D. Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest 20, 1 (2019), 1–68.
[32]
Pablo Barros, German I. Parisi, Doreen Jirak, and Stefan Wermter. 2014. Real-time gesture recognition using a humanoid robot with a deep neural architecture. In Proceedings of the 2014 IEEE-RAS International Conference on Humanoid Robots. 646–651.
[33]
Michael Barz, Peter Poller, and Daniel Sonntag. 2017. Evaluating remote and head-worn eye trackers in multi-modal speech-based HRI. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction (HRI’17). ACM, New York, NY, 79–80.
[34]
N. C. Batista and G. A. S. Pereira. 2015. A probabilistic approach for fusing people detectors. Journal of Control, Automation and Electrical Systems 26, 6 (2015), 616–629.
[35]
Andrea Bauer, Dirk Wollherr, and Martin Buss. 2008. Human–robot collaboration: A survey. International Journal of Humanoid Robotics 5, 01 (2008), 47–66.
[36]
Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 3 (June 2008), 346–359.
[37]
B. Bayram and G. Ince. 2016. Audio-visual multi-person tracking for active robot perception. In Proceedings of the 2015 IEEE/SICE International Symposium on System Integration (SII’15). 575–580.
[38]
M. Bdiwi, J. Suchý, and A. Winkler. 2013. Handing-over model-free objects to human hand with the help of vision/force robot control. In Proceedings of the 10th International MultiConference on Systems, Signals, and Devices (SSD’13). 1–6.
[39]
Djamila Romaissa Beddiar, Brahim Nini, Mohammad Sabokrou, and Abdenour Hadid. 2020. Vision-based human activity recognition: A survey. Multimedia Tools and Applications 79, 41-42 (Nov. 2020), 30509–30555.
[40]
Jenay M. Beer, Arthur D. Fisk, and Wendy A. Rogers. 2014. Toward a framework for levels of robot autonomy in human-robot interaction. Journal of Human-Robot Interaction 3, 2 (2014), 74.
[41]
A. Bellarbi, S. Kahlouche, N. Achour, and N. Ouadah. 2017. A social planning and navigation for tour-guide robot in human environment. In Proceedings of the 2016 8th International Conference on Modelling, Identification, and Control (ICMIC’16). 622–627.
[42]
José Pedro Ribeiro Belo, Felipe Padula Sanches, and Roseli Aparecida Francelin Romero. 2019. Facial recognition experiments on a robotic system using one-shot learning. In Proceedings of the 2019 Latin American Robotics Symposium (LARS’19), the 2019 Brazilian Symposium on Robotics (SBR’19), and the 2019 Workshop on Robotics in Education (WRE’19). 67–73.
[43]
Tony Belpaeme, James Kennedy, Aditi Ramachandran, Brian Scassellati, and Fumihide Tanaka. 2018. Social robots for education: A review. Science Robotics 3, 21 (2018), eaat5954.
[44]
Ismail Ben Abdallah, Yassine Bouteraa, and Chokri Rekik. 2016. Kinect-based sliding mode control for Lynxmotion robotic arm. Advances in Human-Computer Interaction 2016, 7 (2016), 1–10.
[45]
Ismail Benabdallah, Yassine Bouteraa, Rahma Boucetta, and Chokri Rekik. 2015. Kinect-based computed Torque control for Lynxmotion robotic arm. In Proceedings of the 2015 7th International Conference on Modelling, Identification, and Control (ICMIC’15). 1–6.
[46]
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning. 813–824.
[47]
Cindy L. Bethel, Kristen Salomon, Robin R. Murphy, and Jennifer L. Burke. 2007. Survey of psychophysiology measurements applied to human-robot interaction. In Proceedings of the 16th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’07). 732–737.
[48]
Lucas Beyer, Alexander Hermans, Timm Linder, Kai O. Arras, and Bastian Leibe. 2018. Deep person detection in two-dimensional range data. IEEE Robotics and Automation Letters 3, 3 (2018), 2726–2733.
[49]
M. Bilac, M. Chamoux, and A. Lim. 2017. Gaze and filled pause detection for smooth human-robot conversations. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots. 297–304.
[50]
Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. 2016. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3034–3042.
[51]
M. C. Bingol and O. Aydogmus. 2020. Practical application of a safe human-robot interaction software. Industrial Robot 47, 3 (2020), 359–368.
[52]
G. Bolano, A. Tanev, L. Steffen, A. Roennau, and R. Dillmann. 2018. Towards a vision-based concept for gesture control of a robot providing visual feedback. In Proceedings of the 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO’18). 386–392.
[53]
Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. 2008. Visual navigation for mobile robots: A survey. Journal of Intelligent and Robotic Systems 53, 3 (2008), 263–296.
[54]
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory. 144–152.
[55]
K. Bothe, A. Winkler, and L. Goldhahn. 2018. Effective use of lightweight robots in human-robot workstations with monitoring via RGBD-camera. In Proceedings of the 2018 23rd International Conference on Methods and Models in Automation and Robotics (MMAR’18). 698–702.
[56]
Y. Bouteraa and I. B. Abdallah. 2017. A gesture-based telemanipulation control for a robotic arm with biofeedback-based grasp. Industrial Robot 44, 5 (2017), 575–587.
[57]
Yuri Boykov, Olga Veksler, and Ramin Zabih. 2001. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 11 (2001), 1222–1239.
[58]
G. Broccia, M. Livesu, and R. Scateni. 2011. Gestural interaction for robot motion control. In Proceedings of the 2011 Eurographics Italian Chapter Conference. 61–66.
[59]
J. Brookshire. 2010. Person following using histograms of oriented gradients. International Journal of Social Robotics 2, 2 (2010), 137–146.
[60]
A. Brás, M. Simão, and P. Neto. 2018. Gesture recognition from skeleton data for intuitive human-machine interaction. In Transdisciplinary Engineering Methods for Social Innovation of Industry 4.0, Vol. 7. Advances in Transdisciplinary Engineering. IOS Press, 271–280.
[61]
W. Budiharto, A. Jazidie, and D. Purwanto. 2010. Indoor navigation using adaptive neuro fuzzy controller for servant robot. In Proceedings of the 2010 2nd International Conference on Computer Engineering and Applications (ICCEA’10), Vol. 1. 582–586.
[62]
B. Burger, I. Ferrané, F. Lerasle, and G. Infantes. 2012. Two-handed gesture recognition and fusion with speech to command a robot. Autonomous Robots 32, 2 (2012), 129–147.
[63]
G. Canal, C. Angulo, and S. Escalera. 2015. Gesture based human multi-robot interaction. In Proceedings of the International Joint Conference on Neural Networks.
[64]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).
[65]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision. 213–229.
[66]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 6299–6308.
[67]
J. Casper and R. R. Murphy. 2003. Human-robot interactions during the robot-assisted urban search and rescue response at the World Trade Center. IEEE Transactions on Systems, Man, and Cybernetics: Part B (Cybernetics) 33, 3 (2003), 367–385.
[68]
G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, and P. W. McOwan. 2014. Context-sensitive affect recognition for a robotic game companion. ACM Transactions on Interactive Intelligent Systems 4, 2 (2014), Article 10, 25 pages.
[69]
D. Cazzato, C. Cimarelli, J. L. Sanchez-Lopez, M. A. Olivares-Mendez, and H. Voos. 2019. Real-time human head imitation for humanoid robots. In Proceedings of the 2019 3rd International Conference on Artificial Intelligence and Virtual Reality (AIVR’19). 65–69.
[70]
J. Cech, R. Mittal, A. Deleforge, J. Sanchez-Riera, X. Alameda-Pineda, and R. Horaud. 2015. Active-speaker detection and localization with microphones and cameras embedded into a robotic head. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots. 203–210.
[71]
Ibrahim Baran Celik and Mehmet Kuntalp. 2012. Development of a robotic-arm controller by using hand gesture recognition. In Proceedings of the 2012 International Symposium on Innovations in Intelligent Systems and Applications. 1–5.
[72]
G. Chalvatzaki, X. S. Papageorgiou, P. Maragos, and C. S. Tzafestas. 2019. Learn to adapt to human walking: A model-based reinforcement learning approach for a robotic assistant rollator. IEEE Robotics and Automation Letters 4, 4 (Oct. 2019), 3774–3781.
[73]
C.-C. Chan and C.-C. Tsai. 2020. Collision-free path planning based on new navigation function for an industrial robotic manipulator in human-robot coexistence environments. Journal of the Chinese Institute of Engineers 43, 6 (2020), 508–518.
[74]
F. Chao, F. Chen, Y. Shen, W. He, Y. Sun, Z. Wang, C. Zhou, and M. Jiang. 2014. Robotic free writing of Chinese characters via human-robot interactions. International Journal of Humanoid Robotics 11, 1 (2014), 1450007.
[75]
Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. 2018. Learning to detect human-object interactions. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision.
[76]
B. Chen, C. Hua, B. Dai, Y. He, and J. Han. 2019. Online control programming algorithm for human-robot interaction system with a novel real-time human gesture recognition method. International Journal of Advanced Robotic Systems 16, 4 (2019), 1–18.
[77]
Dingping Chen, Jilin He, Guanyu Chen, Xiaopeng Yu, Miaolei He, Youwen Yang, Junsong Li, and Xuanyi Zhou. 2020. Human-robot skill transfer systems for mobile robot based on multi sensor fusion. In Proceedings of the 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN’20). IEEE, Los Alamitos, CA, 1354–1359.
[78]
H. Chen, M. C. Leu, W. Tao, and Z. Yin. 2020. Design of a real-time human-robot collaboration system using dynamic gestures. In Proceedings of the ASME International Mechanical Engineering Congress and Exposition (IMECE’20), Vol. 2B.
[79]
J. Chen and W.-J. Kim. 2019. A human-following mobile robot providing natural and universal interfaces for control with wireless electronic devices. IEEE/ASME Transactions on Mechatronics 24, 5 (2019), 2377–2385.
[80]
K.-Y. Chen, C.-C. Chien, W.-L. Chang, and J.-T. Teng. 2010. An integrated color and hand gesture recognition approach for an autonomous mobile robot. In Proceedings of the 2010 3rd International Congress on Image and Signal Processing (CISP’10), Vol. 5. 2496–2500.
[81]
L. Chen, Z. Dong, S. Gao, B. Yuan, and M. Pei. 2014. Stereovision-only based interactive mobile robot for human-robot face-to-face interaction. In Proceedings of the International Conference on Pattern Recognition. 1840–1845.
[82]
S. Y. Chen. 2011. Kalman filter for robot vision: A survey. IEEE Transactions on Industrial Electronics 59, 11 (2011), 4409–4420.
[83]
Shengyong Chen, Youfu Li, and Ngai Ming Kwok. 2011. Active vision in robotic systems: A survey of recent developments. International Journal of Robotics Research 30, 11 (2011), 1343–1377.
[84]
T.-D. Chen. 2010. Approaches to robotic vision control using image pointing recognition techniques. In Advances in Neural Network Research and Applications. Lecture Notes in Electrical Engineering, Vol. 67. Springer, 321–328.
[85]
W. Chen and S. Guo. 2012. Person following of a mobile robot using Kinect through features detection based on SURF. Advanced Materials Research 542-543 (2012), 779–784.
[86]
C.-Y. Cheng, Y.-Y. Zhuo, and G.-H. Kuo. 2013. A multiple-robot system for home service. In Proceedings of the 2013 CACS International Automatic Control Conference (CACS’13). 79–84.
[87]
L. Cheng, Q. Sun, H. Su, Y. Cong, and S. Zhao. 2012. Design and implementation of human-robot interactive demonstration system based on Kinect. In Proceedings of the 2012 24th Chinese Control and Decision Conference (CCDC’12). 971–975.
[88]
A. Cherubini, R. Passama, P. Fraisse, and A. Crosnier. 2015. A unified multimodal control framework for human-robot interaction. Robotics and Autonomous Systems 70 (2015), 106–115.
[89]
J.-C. Chien, Z.-Y. Dang, and J.-D. Lee. 2019. Navigating a service robot for indoor complex environments. Applied Sciences (Switzerland) 9, 3 (2019), 491.
[90]
Hui-Min Chou, Yu-Cheng Chou, and Hsin-Hung Chen. 2020. Development of a monocular vision deep learning-based AUV diver-following control system. In Proceedings of Global Oceans 2020: Singapore—U.S. Gulf Coast. 1–4.
[91]
G. B. Choudhary and R. B. V. Chethan. 2015. Real time robotic arm control using hand gestures. In Proceedings of the 2014 International Conference on High Performance Computing and Applications (ICHPCA’14).
[92]
L. G. Christiernin and S. Augustsson. 2016. Interacting with industrial robots—A motion-based interface. In Proceedings of the Workshop on Advanced Visual Interfaces (AVI’16), Vol. 07. 310–311.
[93]
Marcelo Cicconet, Mason Bretan, and Gil Weinberg. 2013. Human-robot percussion ensemble: Anticipation on the basis of visual cues. IEEE Robotics & Automation Magazine 20, 4 (2013), 105–110.
[94]
G. Cicirelli, C. Attolico, C. Guaragnella, and T. D’Orazio. 2015. A Kinect-based gesture recognition approach for a natural human robot interface. International Journal of Advanced Robotic Systems 12 (2015), 22.
[95]
Felipe Cid, José Augusto Prado, Pablo Bustos, and Pedro Núñez. 2013. A real time and robust facial expression recognition and imitation approach for affective human-robot interaction using Gabor filtering. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2188–2193.
[96]
Karin Clark, Matt Duckham, Marilys Guillemin, Assunta Hunter, Jodie McVernon, Christine O’Keefe, Cathy Pitkin, et al. 2019. Advancing the ethical use of digital data in human research: Challenges and strategies to promote ethical practice. Ethics and Information Technology 21, 1 (2019), 59–73.
[97]
I. Condés and J. M. Cañas. 2019. Person following robot behavior using deep learning. Advances in Intelligent Systems and Computing 855 (2019), 147–161.
[98]
Peter I. Corke. 2011. Robotics, Vision and Control: Fundamental Algorithms in MATLAB, Vol. 73. Springer.
[99]
Gabriele Costante, Enrico Bellocchio, Paolo Valigi, and Elisa Ricci. 2014. Personalizing vision-based gestural interfaces for HRI with UAVs: A transfer learning approach. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. 3319–3326.
[100]
Marco Costanzo, Giuseppe De Maria, Gaetano Lettera, Ciro Natale, and Dario Perrone. 2019. A multimodal perception system for detection of human operators in robotic work cells. In Proceedings of the 2019 IEEE International Conference on Systems, Man, and Cybernetics (SMC’19). 692–699.
[101]
A. Couture-Beil, R. T. Vaughan, and G. Mori. 2010. Selecting and commanding individual robots in a vision-based multi-robot system. In Proceedings of the 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI’10). 355–356.
[102]
A. Csapó, E. Gilmartin, J. Grizou, J. Han, R. Meena, D. Anastasiou, K. Jokinen, and G. Wilcock. 2012. Multimodal conversational interaction with a humanoid robot. In Proceedings of the 3rd IEEE International Conference on Cognitive Infocommunications (CogInfoCom’12). 667–672.
[103]
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1. IEEE, Los Alamitos, CA, 886–893.
[104]
N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’05).
[105]
Praveen Damacharla, Ahmad Y. Javaid, Jennie J. Gallimore, and Vijay K. Devabhaktuni. 2018. Common metrics to benchmark human-machine teams (HMT): A review. IEEE Access 6 (2018), 38637–38655.
[106]
Dipankar Das, Yoshinori Kobayashi, and Yoshinori Kuno. 2013. Attracting attention and establishing a communication channel based on the level of visual focus of attention. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2194–2201.
[107]
Alessandro De Luca and Fabrizio Flacco. 2012. Integrated control for pHRI: Collision avoidance, detection, reaction and collaboration. In Proceedings of the 2012 4th IEEE RAS EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob’12). 288–295.
[108]
D. De Schepper, B. Moyaers, G. Schouterden, K. Kellens, and E. Demeester. 2020. Towards robust human-robot mobile co-manipulation for tasks involving the handling of non-rigid materials using sensor-fused force-torque, and skeleton tracking data. Procedia CIRP 97 (2020), 325–330.
[109]
M. Deepan Raj, I. Gogul, M. Thangaraja, and V. S. Kumar. 2017. Static gesture recognition based precise positioning of 5-DOF robotic arm using FPGA. In Proceedings of the 9th International Conference on Trends in Industrial Measurement and Automation (TIMA’17).
[110]
Angel P. del Pobil, Mario Prats, and Pedro J. Sanz. 2011. Interaction in robotics with a combination of vision, tactile and force sensing. In Proceedings of the 2011 5th International Conference on Sensing Technology. IEEE, Los Alamitos, CA, 21–26.
[111]
Maxime Devanne, Sao Mai Nguyen, Olivier Remy-Neris, Beatrice Le Gals-Garnett, Gilles Kermarrec, and Andre Thepaut. 2018. A co-design approach for a rehabilitation robot coach for physical rehabilitation based on the error classification of motion errors. In Proceedings of the 2018 2nd IEEE International Conference on Robotic Computing (IRC’18). 352–357.
[112]
H. Ding, K. Wijaya, G. Reißig, and O. Stursberg. 2011. Optimizing motion of robotic manipulators in interaction with human operators. In Intelligent Robotics and Applications. Lecture Notes in Computer Science, Vol. 7101. Springer, 520–531.
[113]
H. M. Do, C. Mouser, M. Liu, and W. Sheng. 2014. Human-robot collaboration in a Mobile Visual Sensor Network. In Proceedings of the IEEE International Conference on Robotics and Automation. 2203–2208.
[114]
A. C. Dometios, X. S. Papageorgiou, A. Arvanitakis, C. S. Tzafestas, and P. Maragos. 2017. Real-time end-effector motion behavior planning approach using on-line point-cloud data towards a user adaptive assistive bath robot. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 5031–5036.
[115]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
[116]
Katherine Driggs-Campbell, Vijay Govindarajan, and Ruzena Bajcsy. 2017. Integrating intuitive driver models in autonomous planning for interactive maneuvers. IEEE Transactions on Intelligent Transportation Systems 18, 12 (2017), 3461–3472.
[117]
D. Droeschel, J. Stückler, D. Holz, and S. Behnke. 2011. Towards joint attention for a domestic service robot—Person awareness and gesture recognition using Time-of-Flight cameras. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation. 1205–1210.
[118]
G. Du, M. Chen, C. Liu, B. Zhang, and P. Zhang. 2018. Online robot teaching with natural human-robot interaction. IEEE Transactions on Industrial Electronics 65, 12 (2018), 9571–9581.
[119]
Guanglong Du and Ping Zhang. 2014. Markerless human–robot interface for dual robot manipulators using Kinect sensor. Robotics and Computer-Integrated Manufacturing 30, 2 (2014), 150–159.
[120]
Brian R. Duffy. 2003. Anthropomorphism and the social robot. Robotics and Autonomous Systems 42, 3-4 (2003), 177–190.
[121]
N. Efthymiou, P. Koutras, P. P. Filntisis, G. Potamianos, and P. Maragos. 2018. Multi-view fusion for action recognition in child-robot interaction. In Proceedings of the International Conference on Image Processing (ICIP’18). 455–459.
[122]
K. Ehlers and K. Brama. 2016. A human-robot interaction interface for mobile and stationary robots based on real-time 3D human body and hand-finger pose estimation. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation (ETFA’16).
[123]
Jakob Engel, Thomas Schöps, and Daniel Cremers. 2014. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision.
[124]
Sergio Escalera, Xavier Baró, Jordi Gonzàlez, Miguel A. Bautista, Meysam Madadi, Miguel Reyes, Víctor Ponce-López, Hugo J. Escalante, Jamie Shotton, and Isabelle Guyon. 2015. ChaLearn Looking at People Challenge 2014: Dataset and results. In Computer Vision—ECCV 2014 Workshops. Lecture Notes in Computer Science, Vol. 8925. Springer, 459–473.
[125]
C.-S. Fahn and Y.-T. Lin. 2010. Real-time face tracking techniques used for the interaction between humans and robots. In Proceedings of the 2010 5th IEEE Conference on Industrial Electronics and Applications (ICIEA’10). 12–17.
[126]
Navid Fallahinia and Stephen A. Mascaro. 2020. Comparison of constrained and unconstrained human grasp forces using fingernail imaging and visual servoing. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA’20). 2668–2674.
[127]
J. Fang, M. Qiao, and Y. Pei. 2019. Vehicle-mounted with tracked robotic system based on the Kinect. In Proceedings of the 2019 2nd World Conference on Mechanical Engineering and Intelligent Manufacturing (WCMEIM’19). 521–524.
[128]
Zhijie Fang and Antonio M. López. 2020. Intention recognition of pedestrians and cyclists by 2D pose estimation. IEEE Transactions on Intelligent Transportation Systems 21, 11 (2020), 4773–4783.
[129]
M. M. F. M. Fareed, Q. I. Akram, S. B. A. Anees, and A. H. Fakih. 2015. Gesture based wireless single-armed robot in Cartesian 3D space using Kinect. In Proceedings of the 2015 5th International Conference on Communication Systems and Network Technologies (CSNT’15). 1210–1215.
[130]
G. A. Farulla, L. O. Russo, C. Pintor, D. Pianu, G. Micotti, A. R. Salgarella, D. Camboni, et al. 2014. Real-time single camera hand gesture recognition system for remote deaf-blind communication. In Augmented and Virtual Reality. Lecture Notes in Computer Science, Vol. 8853. Springer, 35–52.
[131]
A. A. M. Faudzi, M. H. K. Ali, M. A. Azman, and Z. H. Ismail. 2012. Real-time hand gestures system for mobile robots control. Procedia Engineering 41 (2012), 798–804.
[132]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6202–6211.
[133]
Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2009. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 9 (2009), 1627–1645.
[134]
Pedro F. Felzenszwalb and Daniel P. Huttenlocher. 2005. Pictorial structures for object recognition. International Journal of Computer Vision 61, 1 (2005), 55–79.
[135]
Cornelia Fermüller, Fang Wang, Yezhou Yang, Konstantinos Zampogiannis, Yi Zhang, Francisco Barranco, and Michael Pfeiffer. 2018. Prediction of manipulation actions. International Journal of Computer Vision 126, 2 (April 2018), 358–374.
[136]
F. Ferraguti, C. Talignani Landi, S. Costi, M. Bonfè, S. Farsoni, C. Secchi, and C. Fantuzzi. 2020. Safety barrier functions and multi-camera tracking for human-robot shared environment. Robotics and Autonomous Systems 124 (2020), 103388.
[137]
G. Ferrer, A. Garrell, M. Villamizar, I. Huerta, and A. Sanfeliu. 2013. Robot interactive learning through human assistance. Intelligent Systems Reference Library 48 (2013), 185–203.
[138]
David Forsyth. 2012. Computer Vision: A Modern Approach (2nd ed.). Pearson, Boston, MA.
[139]
M. E. Foster, A. Gaschler, M. Giuliani, A. Isard, M. Pateraki, and R. P. A. Petrick. 2012. Two people walk into a bar: Dynamic multi-party social interaction with a robot agent. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI’12). 3–10.
[140]
Mary Ellen Foster, Rachid Alami, Olli Gestranius, Oliver Lemon, Marketta Niemelä, Jean-Marc Odobez, and Amit Kumar Pandey. 2016. The MuMMER project: Engaging human-robot interaction in real-world public spaces. In Social Robotics, Arvin Agah, John-John Cabibihan, Ayanna M. Howard, Miguel A. Salichs, and Hongsheng He (Eds.). Springer International Publishing, Cham, Switzerland, 753–763.
[141]
Frederic Z. Zhang, Dylan Campbell, and Stephen Gould. 2021. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. arXiv preprint arXiv:2112.01838 (2021).
[142]
Frederic Z. Zhang, Dylan Campbell, and Stephen Gould. 2021. Spatially conditioned graphs for detecting human-object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21). 13319–13327.
[143]
Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.
[144]
Muhammad Fuad. 2015. Skeleton based gesture to control manipulator. In Proceedings of the 2015 International Conference on Advanced Mechatronics, Intelligent Manufacture, and Industrial Automation (ICAMIMIA’15). 96–101.
[145]
T. Fujii, J. H. Lee, and S. Okamoto. 2014. Gesture recognition system for human-robot interaction and its application to robotic service task. In Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS’14). 63–68. https://www.scopus.com/inward/record.uri?eid=2-s2.0-84938237053&partnerID=40&md5=65a01757df8b0aa92518a19dc3e25b06.
[146]
X. Gao, M. Zheng, and M. Q.-H. Meng. 2015. Humanoid robot locomotion control by posture recognition for human-robot interaction. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO’15). 1572–1577.
[147]
Y. Gao, H. J. Chang, and Y. Demiris. 2020. User modelling using multimodal information for personalised dressing assistance. IEEE Access 8 (2020), 45700–45714.
[148]
A. Gardel, F. Espinosa, R. Nieto, J. L. Lázaro, and I. Bravo. 2016. Wireless camera nodes on a cyber-physical system. In Proceedings of the 10th International Conference on Distributed Smart Camera (ICDSC’16). ACM, New York, NY, 31–36.
[149]
J. Gemerek, S. Ferrari, B. H. Wang, and M. E. Campbell. 2019. Video-guided camera control for target tracking and following. IFAC-PapersOnLine 51, 34 (2019), 176–183.
[150]
M. Ghandour, H. Liu, N. Stoll, and K. Thurow. 2017. Human robot interaction for hybrid collision avoidance system for indoor mobile robots. Advances in Science, Technology and Engineering Systems 2, 3 (2017), 650–657.
[151]
Ross Girshick. 2015. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15). IEEE, Los Alamitos, CA, 1440–1448.
[152]
J. Gong, H. Wang, Z. Lu, N. Feng, and F. Hu. 2018. Research on human-robot interaction security strategy of movement authorization for service robot based on people’s attention monitoring. In Proceedings of the 2018 International Conference on Intelligence and Safety for Robotics (ISR’18). 521–526.
[153]
Jonas Gonzalez-Billandon, Alessandra Sciutti, Matthew Tata, Giulio Sandini, and Francesco Rea. 2020. Audiovisual cognitive architecture for autonomous learning of face localisation by a Humanoid Robot. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA’20). IEEE, Los Alamitos, CA, 5979–5985.
[154]
I. Gori, J. K. Aggarwal, L. Matthies, and M. S. Ryoo. 2016. Multitype activity recognition in robot-centric scenarios. IEEE Robotics and Automation Letters 1, 1 (Jan. 2016), 593–600.
[155]
I. Gori, S. R. Fanello, G. Metta, and F. Odone. 2012. All gestures you can: A memory game against a humanoid robot. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots. 330–336.
[156]
Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. 2008. Multi-class segmentation with relative location prior. International Journal of Computer Vision 80, 3 (2008), 300–316.
[157]
Consuelo Granata, Joseph Salini, Ragou Ady, and Philippe Bidaud. 2013. Human whole body motion characterization from embedded Kinect. In Proceedings of the 2013 IEEE 4th International Conference on Cognitive Infocommunications (CogInfoCom’13). 133–138.
[158]
B. Görer, A. A. Salah, and H. L. Akin. 2017. An autonomous robotic exercise tutor for elderly people. Autonomous Robots 41, 3 (2017), 657–678.
[159]
Ye Gu, Ha Do, Yongsheng Ou, and Weihua Sheng. 2012. Human gesture recognition through a Kinect sensor. In Proceedings of the 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO’12). 1379–1384.
[160]
L.-Y. Gui, K. Zhang, Y.-X. Wang, X. Liang, J. M. F. Moura, and M. Veloso. 2018. Teaching robots to predict human motion. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 562–567.
[161]
Liang-Yan Gui, Kevin Zhang, Yu-Xiong Wang, Xiaodan Liang, José M. F. Moura, and Manuela Veloso. 2018. Teaching robots to predict human motion. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’18). 562–567.
[162]
Liang Guo, Chenxi Liu, Xiaoyan Wen, Haohua Chen, and Jianghui Zhang. 2016. A control system of human-computer interaction based on Kinect somatosensory equipment. In Proceedings of the 2016 Chinese Control and Decision Conference (CCDC’16). 5170–5175.
[163]
M. Gupta, L. Behera, V. K. Subramanian, and M. M. Jamshidi. 2015. A robust visual human detection approach with UKF-based motion tracking for a mobile robot. IEEE Systems Journal 9, 4 (2015), 1363–1375.
[164]
Tanmay Gupta, Alexander Schwing, and Derek Hoiem. 2019. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In Proceedings of the IEEE International Conference on Computer Vision.
[165]
Akif Hacinecipoglu, Erhan Konukseven, and Ahmet Koku. 2020. Pose invariant people detection in point clouds for mobile robots. International Journal of Mechanical Engineering and Robotics Research 9, 5 (2020), 709–715.
[166]
Sami Haddadin, Alin Albu-Schäffer, and Gerd Hirzinger. 2009. Requirements for safe robots: Measurements, analysis and new insights. International Journal of Robotics Research 28, 11–12 (2009), 1507–1527.
[167]
Sami Haddadin, Michael Suppa, Stefan Fuchs, Tim Bodenmüller, Alin Albu-Schäffer, and Gerd Hirzinger. 2011. Towards the robotic co-worker. In Robotics Research. Springer, 261–282.
[168]
Saad Hafiane, Yasir Salih, and Aamir S. Malik. 2013. 3D hand recognition for telerobotics. In Proceedings of the 2013 IEEE Symposium on Computers Informatics (ISCI’13). 132–137.
[169]
A. Haghighi, M. Bdiwi, and M. Putz. 2019. Integration of camera and inertial measurement unit for entire human robot interaction using machine learning algorithm. In Proceedings of the 16th International MultiConference on Systems, Signals, and Devices (SSD’19). 741–746.
[170]
Anaïs Halin, Jacques G. Verly, and Marc Van Droogenbroeck. 2021. Survey and synthesis of state of the art in driver monitoring. Sensors 21, 16 (2021), 5558.
[171]
Roni-Jussi Halme, Minna Lanz, Joni Kämäräinen, Roel Pieters, Jyrki Latokartano, and Antti Hietanen. 2018. Review of vision-based safety systems for human-robot collaboration. Procedia CIRP 72 (2018), 111–116.
[172]
J. Han, W. Jang, D. Jung, and E. C. Lee. 2017. Human robot interaction method by using hand gesture recognition. In Advanced Multimedia and Ubiquitous Engineering. Lecture Notes in Electrical Engineering, Vol. 448. Springer, 97–102.
[173]
Jungong Han, Ling Shao, Dong Xu, and Jamie Shotton. 2013. Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Transactions on Cybernetics 43, 5 (2013), 1318–1334.
[174]
Peter A. Hancock, Deborah R. Billings, Kristin E. Schaefer, Jessie Y. C. Chen, Ewart J. de Visser, and Raja Parasuraman. 2011. A meta-analysis of factors affecting trust in human-robot interaction. Human Factors 53, 5 (2011), 517–527.
[175]
Richard Hartley and Andrew Zisserman. 2003. Multiple View Geometry in Computer Vision. Cambridge University Press.
[176]
F. Hartmann and A. Schlaefer. 2013. Feasibility of touch-less control of operating room lights. International Journal of Computer Assisted Radiology and Surgery 8, 2 (2013), 259–268.
[177]
Md. Hasanuzzaman and Tetsunari Inamura. 2010. Adaptation to new user interactively using dynamically calculated principal components for user-specific human-robot interaction. In Proceedings of the 2010 IEEE/SICE International Symposium on System Integration. 164–169.
[178]
M. S. Hassan, A. F. Khan, M. W. Khan, M. Uzair, and K. Khurshid. 2016. A computationally low cost vision based tracking algorithm for human following robot. In Proceedings of the 2016 2nd International Conference on Control, Automation, and Robotics (ICCAR’16). 62–65.
[179]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[180]
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 770–778.
[181]
F. Hegger, N. Hochgeschwender, G. K. Kraetzschmar, and P. G. Ploeger. 2013. People detection in 3D point clouds using local surface normals. In RoboCup 2012: Robot Soccer World Cup XVI. Lecture Notes in Computer Science, Vol. 7500. Springer, 154–165.
[182]
João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 3 (March 2015), 583–596.
[183]
Abdelfetah Hentout, Mustapha Aouache, Abderraouf Maoudj, and Isma Akli. 2019. Human–robot interaction in industrial collaborative robotics: A literature review of the decade 2008–2017. Advanced Robotics 33, 15–16 (2019), 764–799.
[184]
Guy Hoffman. 2019. Evaluating fluency in human–robot collaboration. IEEE Transactions on Human-Machine Systems 49, 3 (2019), 209–218.
[185]
C. Hong, Z. Chen, J. Zhu, and X. Zhang. 2018. Interactive humanoid robot arm imitation system using human upper limb motion tracking. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO’17). 2746–2751.
[186]
Berthold K. P. Horn and Brian G. Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1–3 (1981), 185–203.
[187]
Roy Chaoming Hsu, Po-Cheng Su, Jia-Le Hsu, and Chi-Yong Wang. 2020. Real-time interaction system of human-robot with hand gestures. In Proceedings of the 2020 IEEE Eurasia Conference on IOT, Communication, and Engineering (ECICE’20). 396–398.
[188]
J.-S. Hu, J.-J. Wang, and D. M. Ho. 2014. Design of sensing system and anticipative behavior for human following of mobile robots. IEEE Transactions on Industrial Electronics 61, 4 (2014), 1916–1927.
[189]
Chien-Ming Huang and Bilge Mutlu. 2016. Anticipatory robot control for efficient human-robot collaboration. In Proceedings of the 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI’16). 83–90.
[190]
C.-L. Hwang, D.-S. Wang, F.-C. Weng, and S.-L. Lai. 2020. Interactions between specific human and omnidirectional mobile robot using deep learning approach: SSD-FN-KCF. IEEE Access 8 (2020), 41186–41200.
[191]
R. R. Igorevich, E. P. Ismoilovich, and D. Min. 2011. Behavioral synchronization of human and humanoid robot. In Proceedings of the 2011 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI’11). 655–660.
[192]
T. Ikai, S. Kamiya, and M. Ohka. 2016. Robot control using natural instructions via visual and tactile sensations. Journal of Computer Science 12, 5 (2016), 246–254.
[193]
W. Indrajit and A. Muis. 2013. Development of whole body motion imitation in humanoid robot. In Proceedings of the 2013 International Conference on Quality in Research (QiR’13) in Conjunction with ICCS 2013: The 2nd International Conference on Civic Space. 138–141.
[194]
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2014. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (July 2014), 1325–1339.
[195]
Tariq Iqbal and Laurel D. Riek. 2017. Coordination dynamics in multihuman multirobot teams. IEEE Robotics and Automation Letters 2, 3 (2017), 1712–1717.
[196]
Md. Jahidul Islam, Jungseok Hong, and Junaed Sattar. 2019. Person-following by autonomous robots: A categorical overview. International Journal of Robotics Research 38, 14 (2019), 1581–1618.
[197]
Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, et al. 2011. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. 559–568.
[198]
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, et al. 2021. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795 (2021).
[199]
Omid Hosseini Jafari, Dennis Mitzel, and Bastian Leibe. 2014. Real-time RGB-D based people detection and tracking for mobile robots and head-worn cameras. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA’14). IEEE, Los Alamitos, CA, 5636–5643.
[200]
Alejandro Jaimes and Nicu Sebe. 2007. Multimodal human–computer interaction: A survey. Computer Vision and Image Understanding 108, 1–2 (2007), 116–134.
[201]
M. Jarosz, P. Nawrocki, L. Placzkiewicz, B. Sniezynski, M. Zielinski, and B. Indurkhya. 2019. Detecting gaze direction using robot-mounted and mobile-device cameras. Computer Science 20, 4 (2019), 455–476.
[202]
A. Jevtić, G. Doisy, Y. Parmet, and Y. Edan. 2015. Comparison of interaction modalities for mobile indoor robot guidance: Direct physical interaction, person following, and pointing control. IEEE Transactions on Human-Machine Systems 45, 6 (2015), 653–663.
[203]
Dan Jia, Alexander Hermans, and Bastian Leibe. 2020. DR-SPAAM: A spatial-attention and auto-regressive model for person detection in 2D range data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’20). IEEE, Los Alamitos, CA, 10270–10277.
[204]
S. Jia, L. Zhao, X. Li, W. Cui, and J. Sheng. 2011. Autonomous robot human detecting and tracking based on stereo vision. In Proceedings of the 2011 IEEE International Conference on Mechatronics and Automation (ICMA’11). 640–645.
[205]
Lihua Jiang, Weitian Wang, Yi Chen, and Yunyi Jia. 2018. Personalize vision-based human following for mobile robots by learning from human-driven demonstrations. In Proceedings of the 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’18). IEEE, Los Alamitos, CA, 726–731.
[206]
Mitsuru Jindai and Tomio Watanabe. 2010. A small-size handshake robot system based on a handshake approaching motion model with a voice greeting. In Proceedings of the 2010 IEEE/ASME International Conference on Advanced Intelligent Mechatronics. 521–526.
[207]
Z. Ju, X. Ji, J. Li, and H. Liu. 2017. An integrative framework of human hand gesture segmentation for human-robot interaction. IEEE Systems Journal 11, 3 (2017), 1326–1336.
[208]
H. M. Kahily and A. P. Sudheer. 2016. Real-time human detection and tracking from a mobile armed robot using RGB-D sensor. In Proceedings of the 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (WCTFTR’16).
[209]
N. Kalidolda and A. Sandygulova. 2018. Towards interpreting robotic system for fingerspelling recognition in real time. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 141–142.
[210]
T. Kanade, J. F. Cohn, and Yingli Tian. 2000. Comprehensive database for facial expression analysis. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition.
[211]
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 7122–7131.
[212]
Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. 2019. Learning 3D human dynamics from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 5614–5623.
[213]
Yugo Katsuki, Yuji Yamakawa, and Masatoshi Ishikawa. 2015. High-speed human/robot hand interaction system. In Proceedings of the 10th Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts (HRI’15 Extended Abstracts). ACM, New York, NY, 117–118.
[214]
Y. Katsuki, Y. Yamakawa, and M. Ishikawa. 2015. High-speed human/robot hand interaction system. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 117–118.
[215]
Y. Kawasaki, A. Yorozu, M. Takahashi, and E. Pagello. 2020. A multimodal path planning approach to human robot interaction based on integrating action modeling. Journal of Intelligent and Robotic Systems: Theory and Applications 100, 3–4 (2020), 955–972.
[216]
X. Ke, Y. Zhu, Y. Yang, J. Xing, and Z. Luo. 2016. Vision system of facial robot SHFR-III for human-robot interaction. In Proceedings of the 13th International Conference on Informatics in Control, Automation, and Robotics (ICINCO’16), Vol. 2. 472–478.
[217]
Maram Khatib, Khaled Al Khudir, and Alessandro De Luca. 2017. Visual coordination task for human-robot collaboration. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’17). IEEE, Los Alamitos, CA, 3762–3768.
[218]
Zahra Rezaei Khavas, S. Reza Ahmadzadeh, and Paul Robinette. 2020. Modeling trust in human-robot interaction: A survey. In Proceedings of the International Conference on Social Robotics. 529–541.
[219]
Y. Kobayashi and Y. Kuno. 2010. People tracking using integrated sensors for human robot interaction. In Proceedings of the IEEE International Conference on Industrial Technology. 1617–1622.
[220]
Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. 2020. VIBE: Video inference for human body pose and shape estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’20). 5253–5263.
[221]
A. Kogkas, A. Ezzat, R. Thakkar, A. Darzi, and G. Mylonas. 2019. Free-View, 3D gaze-guided robotic scrub nurse. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2019. Lecture Notes in Computer Science, Vol. 11768. Springer, 164–172.
[222]
Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. 2019. Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 4501–4510.
[223]
Kishore Reddy Konda, Achim Königs, Hannes Schulz, and Dirk Schulz. 2012. Real time interaction with mobile robots using hand gestures. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI’12). ACM, New York, NY, 177–178.
[224]
H. S. Koppula and A. Saxena. 2016. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2016), 14–29.
[225]
P. N. Koustoumpardis, K. I. Chatzilygeroudis, A. I. Synodinos, and N. A. Aspragathos. 2016. Human robot collaboration for folding fabrics based on force/RGB-D feedback. Advances in Intelligent Systems and Computing 371 (2016), 235–243.
[226]
Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems 24.
[227]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25.
[228]
Minae Kwon, Malte F. Jung, and Ross A. Knepper. 2016. Human expectations of social robots. In Proceedings of the 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI’16). 463–464.
[229]
A. Lalejini, D. Duckworth, R. Sween, C. L. Bethel, and D. Carruth. 2015. Evaluation of supervisory control interfaces for mobile robot integration with tactical teams. In Proceedings of IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO’15). 1–6.
[230]
M. C. Lam, A. S. Prabuwono, H. Arshad, and C. S. Chan. 2011. A real-time vision-based framework for human-robot interaction. In Visual Informatics: Sustaining Research and Innovations. Lecture Notes in Computer Science, Vol. 7066. Springer, 257–267.
[231]
J. Lambrecht and J. Kruger. 2012. Spatial programming for industrial robots based on gestures and Augmented Reality. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 466–472.
[232]
C. T. Landi, Y. Cheng, F. Ferraguti, M. Bonfe, C. Secchi, and M. Tomizuka. 2019. Prediction of human arm target for robot reaching movements. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 5950–5957.
[233]
X. Lang, Z. Feng, and X. Yang. 2020. Research on human-robot natural interaction algorithm based on body potential perception. In Proceedings of the 2020 ACM 6th International Conference on Computing and Data Engineering (ICCDE’20). 260–264.
[234]
Stéphane Lathuilière, Benoit Massé, Pablo Mesejo, and Radu Horaud. 2018. Deep reinforcement learning for audio-visual gaze control. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’18). 1555–1562.
[235]
Boris Lau, Kai O. Arras, and Wolfram Burgard. 2010. Multi-model hypothesis group tracking and group size estimation. International Journal of Social Robotics 2, 1 (2010), 19–30.
[236]
K. N. Lavanya, D. R. Shree, B. R. Nischitha, T. Asha, and C. Gururaj. 2018. Gesture controlled robot. In Proceedings of the International Conference on Electrical, Electronics, Communication, Computer Technologies, and Optimization Techniques (ICEECCOT’17). 465–469.
[237]
D. Leal and Y. Yihun. 2019. Progress in human-robot collaboration for object handover. In Proceedings of the 2019 IEEE International Symposium on Measurement and Control in Robotics (ISMCR’19). C3-2-1–C3-2-6.
[238]
C.-Y. Lee, H. Lee, I. Hwang, and B.-T. Zhang. 2020. Visual perception framework for an intelligent mobile robot. In Proceedings of the 2020 17th International Conference on Ubiquitous Robots (UR’20). 612–616.
[239]
J. Lee and B. Ahn. 2020. Real-time human action recognition with a low-cost RGB camera and mobile robot platform. Sensors (Switzerland) 20, 10 (2020), 2886.
[240]
J. Lee and M. S. Ryoo. 2017. Learning robot activities from first-person human videos using convolutional future regression. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 1497–1504.
[241]
J.-E. Lee, J. Park, G.-S. Kim, J.-H. Lee, and M.-H. Kim. 2012. Interactive multi-resolution display using a projector mounted mobile robot in intelligent space. International Journal of Advanced Robotic Systems 9, 5 (2012), 196.
[242]
S. J. Lee, G. Shah, A. A. Bhattacharya, and Y. Motai. 2012. Human tracking with an infrared camera using curve matching framework. EURASIP Journal on Advances in Signal Processing 2012, 1 (2012), 99.
[243]
Benedikt Leichtmann and Verena Nitsch. 2020. How much distance do humans keep toward robots? Literature review, meta-analysis, and theoretical considerations on personal space in human-robot interaction. Journal of Environmental Psychology 68 (2020), 101386.
[244]
Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. International Journal of Robotics Research 37, 4–5 (2018), 421–436.
[245]
Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, and Hongdong Li. 2020. Transferring cross-domain knowledge for video sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 6205–6214.
[246]
Hanchuan Li, Peijin Zhang, Samer Al Moubayed, Shwetak N. Patel, and Alanson P. Sample. 2016. ID-Match: A hybrid computer vision and RFID system for recognizing individuals in groups. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA’16). ACM, New York, NY, 7.
[247]
K. Li, S. Sun, X. Zhao, J. Wu, and M. Tan. 2019. Inferring user intent to interact with a public service robot using bimodal information analysis. Advanced Robotics 33, 7–8 (2019), 369–387.
[248]
K. Li, J. Wu, X. Zhao, and M. Tan. 2019. Real-time human-robot interaction for a service robot based on 3D human activity recognition and human-mimicking decision mechanism. In Proceedings of the 8th Annual IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER’18). 498–503.
[249]
L. Li, Q. Xu, G. S. Wang, X. Yu, Y. K. Tan, and H. Li. 2015. Visual perception based engagement awareness for multiparty human-robot interaction. International Journal of Humanoid Robotics 12, 4 (2015), 1550019.
[250]
T.-H. S. Li, P.-H. Kuo, T.-N. Tsai, and P.-C. Luan. 2019. CNN and LSTM based facial expression analysis model for a humanoid robot. IEEE Access 7 (2019), 93998–94011.
[251]
X. Li, H. Cheng, G. Ji, and J. Chen. 2018. Learning complex assembly skills from Kinect based human robot interaction. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO’17). 2646–2651.
[252]
Y.-T. Li, M. Jacob, G. Akingba, and J. P. Wachs. 2013. A cyber-physical management system for delivering and monitoring surgical instruments in the OR. Surgical Innovation 20, 4 (2013), 377–384.
[253]
M. Lichtenstern, M. Frassl, B. Perun, and M. Angermann. 2012. A prototyping environment for interaction between a human and a robotic multi-agent system. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI’12). 185–186.
[254]
B. Lima, G. L. N. Júnior, L. Amaral, T. Vieira, B. Ferreira, and T. Vieira. 2019. Real-time hand pose tracking and classification for natural human-robot control. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging, and Computer Graphics Theory and Applications(VISIGRAPP’19), Vol. 5. 832–839.
[255]
Timm Linder and Kai O. Arras. 2014. Multi-model hypothesis tracking of groups of people in RGB-D data. In Proceedings of the 17th International Conference on Information Fusion (FUSION’14). IEEE, Los Alamitos, CA, 1–7.
[256]
Hongyi Liu and Lihui Wang. 2018. Gesture recognition for human-robot collaboration: A review. International Journal of Industrial Ergonomics 68 (2018), 355–367.
[257]
Phoebe Liu, Dylan F. Glas, Takayuki Kanda, and Hiroshi Ishiguro. 2018. Learning proactive behavior for interactive social robots. Autonomous Robots 42, 5 (2018), 1067–1085.
[258]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision.
[259]
Xiaofeng Liu, Xu Zhou, Ce Liu, Jianmin Wang, Xiaoqin Zhou, Ning Xu, and Aimin Jiang. 2016. An interactive training system of motor learning by imitation and speech instructions for children with autism. In Proceedings of the 2016 9th International Conference on Human System Interactions (HSI’16). 56–61.
[260]
Yugang Liu and Goldie Nejat. 2016. Multirobot cooperative learning for semiautonomous control in urban search and rescue applications. Journal of Field Robotics 33, 4 (2016), 512–536.
[261]
Yu-Chi Liu and Qiong-Hai Dai. 2010. A survey of computer vision applied in aerial robotic vehicles. In Proceedings of the 2010 International Conference on Optics, Photonics, and Energy Engineering (OPEE’10), Vol. 1. 277–280.
[262]
Z. Liu, X. Wang, Y. Cai, W. Xu, Q. Liu, Z. Zhou, and D. T. Pham. 2020. Dynamic risk assessment and active response strategy for industrial human-robot collaboration. Computers and Industrial Engineering 141 (2020), 106302.
[263]
Y. Long, Y. Xu, Z. Xiao, and Z. Shen. 2018. Kinect-based human body tracking system control of medical care service robot. In Proceedings of the 2018 WRC Symposium on Advanced Robotics and Automation (WRC SARA’18). 65–69.
[264]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics 34, 6 (2015), 1–16.
[265]
Percy W. Lovon-Ramos, Yessica Rosas-Cuevas, Claudia Cervantes-Jilaja, Maria Tejada-Begazo, Raquel E. Patiño-Escarcina, and Dennis Barrios-Aranibar. 2016. People detection and localization in real time during navigation of autonomous robots. In Proceedings of the 2016 XIII Latin American Robotics Symposium and IV Brazilian Robotics Symposium (LARS/SBR’16). 239–244.
[266]
David G. Lowe. 1999. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision, Vol. 2. IEEE, Los Alamitos, CA, 1150–1157.
[267]
G. Lu, W. Tang, J. Zheng, T. Chen, and X. Zou. 2020. Research and implementation of real-time motion control of robot based on Kinect. Smart Innovation, Systems and Technologies 166 (2020), 779–792.
[268]
Bruce D. Lucas and Takeo Kanade. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI’81).
[269]
P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. 2010. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
[270]
R. C. Luo, S.-R. Chang, and Y.-P. Yang. 2011. Tracking with pointing gesture recognition for human-robot interaction. In Proceedings of the 2011 IEEE/SICE International Symposium on System Integration (SII’11). 1220–1225.
[271]
R. C. Luo, C. H. Huang, and T. T. Lin. 2010. Human tracking and following using sound source localization for multisensor based mobile assistive companion robot. In Proceedings of the 36th Annual Conference on IEEE Industrial Electronics Society (IECON’10). 1552–1557.
[272]
X. Luo, A. Amighetti, and D. Zhang. 2019. A human-robot interaction for a mecanum wheeled mobile robot with real-time 3D two-hand gesture recognition. Journal of Physics: Conference Series 1267 (2019), 012056.
[273]
X. Luo, D. Zhang, and X. Jin. 2019. A real-time moving target following mobile robot system with depth camera. IOP Conference Series: Materials Science and Engineering 491 (2019), 012004.
[274]
Christoph Lutz, Maren Schöttler, and Christian Pieter Hoffmann. 2019. The privacy implications of social robots: Scoping review and expert interviews. Mobile Media & Communication 7, 3 (2019), 412–434.
[275]
Ryan A. MacDonald and Stephen L. Smith. 2019. Active sensing for motion planning in uncertain environments via mutual information policies. International Journal of Robotics Research 38, 2–3 (2019), 146–161.
[276]
A. Maher, C. Li, H. Hu, and B. Zhang. 2017. Realtime human-UAV interaction using deep learning. In Biometric Recognition. Lecture Notes in Computer Science, Vol. 10568. Springer, 511–519.
[277]
M. Manigandan and I. M. Jackin. 2010. Wireless vision based mobile robot control using hand gesture recognition through perceptual color space. In Proceedings of the 2010 International Conference on Advances in Computer Engineering (ACE’10). 95–99.
[278]
Sotiris Manitsaris, Apostolos Tsagaris, Alina Glushkova, Fabien Moutarde, and Frédéric Bevilacqua. 2016. Fingers gestures early-recognition with a unified framework for RGB or depth camera. In Proceedings of the 3rd International Symposium on Movement and Computing (MOCO’16). ACM, New York, NY.
[279]
L. Mao and P. Zhu. 2018. The medical service robot interaction based on Kinect. In Proceedings of the 2017 IEEE International Conference on Intelligent Techniques in Control, Optimization, and Signal Processing (INCOS’17). 1–7.
[280]
Dardan Maraj, Arianit Maraj, and Adhurim Hajzeraj. 2016. Application interface for gesture recognition with Kinect sensor. In Proceedings of the 2016 IEEE International Conference on Knowledge Engineering and Applications (ICKEA’16). 98–102.
[281]
C. Martin, F.-F. Steege, and H.-M. Gross. 2010. Estimation of pointing poses for visually instructing mobile robots under real world conditions. Robotics and Autonomous Systems 58, 2 (2010), 174–185.
[282]
J. B. Martin and F. Moutarde. 2019. Real-time gestural control of robot manipulator through deep learning human-pose inference. In Computer Vision Systems. Lecture Notes in Computer Science, Vol. 11754. Springer, 565–572.
[283]
R. Masmoudi, M. Bouchouicha, and P. Gorce. 2011. Expressive robot to support elderly. Assistive Technology Research Series 29 (2011), 557–564.
[284]
Jean Massardi, Mathieu Gravel, and Éric Beaudry. 2020. PARC: A plan and activity recognition component for assistive robots. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA’20). IEEE, Los Alamitos, CA, 3025–3031.
[285]
A. Mateus, P. Miraldo, P. U. Lima, and J. Sequeira. 2016. Human-aware navigation using external omnidirectional cameras. Advances in Intelligent Systems and Computing 417 (2016), 283–295.
[286]
I. Maurtua, I. Fernández, A. Tellaeche, J. Kildal, L. Susperregi, A. Ibarguren, and B. Sierra. 2017. Natural multimodal communication for human-robot collaboration. International Journal of Advanced Robotic Systems 14, 4 (2017), 1–12.
[287]
O. Mazhar, B. Navarro, S. Ramdani, R. Passama, and A. Cherubini. 2019. A real-time human-robot interaction framework with robust background invariant hand gesture detection. Robotics and Computer-Integrated Manufacturing 60 (2019), 34–48.
[288]
Grace McFassel, Sheng-Jen Hsieh, and Bo Peng. 2018. Prototyping and evaluation of interactive and customized interface and control algorithms for robotic assistive devices using Kinect and infrared sensor. International Journal of Advanced Robotic Systems 15, 2 (2018), 1729881418769521.
[289]
Stephen McKeague, Jindong Liu, and Guang-Zhong Yang. 2013. Hand and body association in crowded environments for human-robot interaction. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation. IEEE, Los Alamitos, CA, 2161–2168.
[290]
R. Mead and M. J. Mataric. 2012. A probabilistic framework for autonomous proxemic control in situated and mobile human-robot interaction. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI’12). 193–194.
[291]
A. C. S. Medeiros, P. Ratsamee, Y. Uranishi, T. Mashita, and H. Takemura. 2020. Human-drone interaction: Using pointing gesture to define a target object. In Human-Computer Interaction. Multimodal and Natural Interaction. Lecture Notes in Computer Science, Vol. 12182. Springer, 688–705.
[292]
A. Meghdari, S. B. Shouraki, A. Siamy, and A. Shariati. 2017. The real-time facial imitation by a social humanoid robot. In Proceedings of the 4th RSI International Conference on Robotics and Mechatronics (ICRoM’16). 524–529.
[293]
Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys 54, 6 (July 2021), Article 115, 35 pages.
[294]
Nuno Mendes, João Ferrer, João Vitorino, Mohammad Safeea, and Pedro Neto. 2017. Human behavior and hand gesture classification for smart human-robot interaction. Procedia Manufacturing 11 (2017), 91–98.
[295]
Zhen-Qiang Mi and Yang Yang. 2013. Human-robot interaction in UVs swarming: A survey. International Journal of Computer Science Issues 10, 2 Pt. 1 (2013), 273.
[296]
J. Miller, S. Hong, and J. Lu. 2019. Self-driving mobile robots using human-robot interactions. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC’18). 1251–1256.
[297]
B. Milligan, G. Mori, and R. Vaughan. 2011. Selecting and commanding groups in a multi-robot vision based system. In Proceedings of the 2011 6th ACM/IEEE International Conference on Human-Robot Interaction (HRI’11). 415–415.
[298]
K. Miyoshi, R. Konomura, and K. Hori. 2014. Above your hand: Direct and natural interaction with aerial robot. In Proceedings of ACM SIGGRAPH 2014 Emerging Technologies (SIGGRAPH’14).
[299]
S. Müller, T. Wengefeld, T. Q. Trinh, D. Aganian, M. Eisenbach, and H.-M. Gross. 2020. A multi-modal person perception framework for socially interactive mobile service robots. Sensors (Switzerland) 20, 3 (2020), 722.
[300]
J. A. Méndez-Polanco, A. Muñoz-Meléndez, and E. F. Morales-Manzanares. 2010. Detection of multiple people by a mobile robot in dynamic indoor environments. In Advances in Artificial Intelligence—IBERAMIA 2010. Lecture Notes in Computer Science, Vol. 6433. Springer, 522–531.
[301]
Signe Moe and Ingrid Schjølberg. 2013. Real-time hand guiding of industrial manipulator in 5 DOF using Microsoft Kinect and accelerometer. In Proceedings of the 2013 IEEE International Workshop on Robot and Human Communication (RO-MAN’13). 644–649.
[302]
Thomas B. Moeslund and Erik Granum. 2001. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81, 3 (2001), 231–268.
[303]
J. J. Moh, T. Kijima, B. Zhang, and H.-O. Lim. 2019. Gesture recognition and effective interaction based dining table cleaning robot. In Proceedings of the 2019 7th International Conference on Robot Intelligence Technology and Applications (RiTA’19). 72–77.
[304]
Sepehr MohaimenianPour and Richard Vaughan. 2018. Hands and faces, fast: Mono-camera user detection robust enough to directly control a UAV in flight. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’18). 5224–5231.
[305]
F. Mohammad, K. R. Sudini, V. Puligilla, and P. R. Kapula. 2013. Tele-operation of robot using gestures. In Proceedings of the 2013 7th Asia Modelling Symposium (AMS’13). 67–71.
[306]
David Moher, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and the PRISMA Group. 2009. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine 6, 7 (2009), e1000097.
[307]
M. W. C. N. Moladande and B. G. D. A. Madhusanka. 2019. Implicit intention and activity recognition of a human using neural networks for a service robot eye. In Proceedings of the 2019 International Research Conference on Smart Computing and Systems Engineering (SCSE’19). IEEE, Los Alamitos, CA, 38–43.
[308]
Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. 2019. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10, 1 (Jan. 2019), 18–31.
[309]
C. Mollaret, A. A. Mekonnen, J. Pinquier, F. Lerasle, and I. Ferrane. 2016. A multi-modal perception based architecture for a non-intrusive domestic assistant robot. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 481–482.
[310]
Mani Monajjemi, Jake Bruce, Seyed Abbas Sadat, Jens Wawerla, and Richard Vaughan. 2015. UAV, do you see me? Establishing mutual attention between an uninstrumented human and an outdoor UAV in flight. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’15). 3614–3620.
[311]
V. M. Monajjemi, J. Wawerla, R. Vaughan, and G. Mori. 2013. HRI in the sky: Creating and commanding teams of UAVs with a vision-mediated gestural interface. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 617–623.
[312]
C. Morato, K. N. Kaipa, B. Zhao, and S. K. Gupta. 2014. Toward safe human robot collaboration by using multiple Kinects based real-time human tracking. Journal of Computing and Information Science in Engineering 14, 1 (2014), 011006.
[313]
R. J. Moreno, M. Mauledoux, and O. F. Avilés. 2016. Path optimization planning for human-robot interaction. International Journal of Applied Engineering Research 11, 22 (2016), 10822–10827. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85002736320&partnerID=40&md5=f881e78233bc762de3474a62d985513c.
[314]
D. Mronga, T. Knobloch, J. de Gea Fernández, and F. Kirchner. 2020. A constraint-based approach for human-robot collision avoidance. Advanced Robotics 34, 5 (2020), 265–281.
[315]
Michael J. Muller. 2007. Participatory Design: The Third Space in HCI. CRC Press, Boca Raton, FL.
[316]
M. Munaro and E. Menegatti. 2014. Fast RGB-D people tracking for service robots. Autonomous Robots 37, 3 (2014), 227–242.
[317]
Matteo Munaro and Emanuele Menegatti. 2014. Fast RGB-D people tracking for service robots. Autonomous Robots 37, 3 (2014), 227–242.
[318]
J. Nagi, H. Ngo, L. M. Gambardella, and Gianni A. Di Caro. 2015. Wisdom of the swarm for cooperative decision-making in human-swarm interaction. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA’15). 1802–1808.
[319]
S. Nair, E. Dean-Leon, and A. Knoll. 2011. 3D position based multiple human servoing by low-level-control of 6 DOF industrial robot. In Proceedings of the 2011 IEEE International Conference on Robotics and Biomimetics (ROBIO’11). 2816–2823.
[320]
Hugo Nascimento, Martin Mujica, and Mourad Benoussaad. 2020. Collision avoidance in human-robot interaction using Kinect vision system combined with robot’s model and data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’20). 10293–10298.
[321]
Samira Nazari, Mostafa Charmi, Maryam Hassani, and Ghazale Ahmadi. 2015. A simplified method in human to robot motion mapping schemes. In Proceedings of the 2015 3rd RSI International Conference on Robotics and Mechatronics (ICROM’15). 545–550.
[322]
Q. Nguyen, S.-S. Yun, and J. Choi. 2014. Audio-visual integration for human-robot interaction in multi-person scenarios. In Proceedings of the 19th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA’14).
[323]
T. Nishiyama, M. Takimoto, and Y. Kambayashi. 2013. Human intervention for searching targets using mobile agents in a multi-robot environment. Frontiers in Artificial Intelligence and Applications 254 (2013), 154–163.
[324]
Takenori Obo, Chu Kiong Loo, and Naoyuki Kubota. 2015. Robot posture generation based on genetic algorithm for imitation. In Proceedings of the 2015 IEEE Congress on Evolutionary Computation (CEC’15). 552–557.
[325]
Dimitri Ognibene, Eris Chinellato, Miguel Sarabia, and Yiannis Demiris. 2013. Contextual action recognition and target localization with an active allocation of attention on a humanoid robot. Bioinspiration & Biomimetics 8, 3 (2013), 035002.
[326]
Valerio Ortenzi, Akansel Cosgun, Tommaso Pardi, Wesley P. Chan, Elizabeth Croft, and Dana Kulić. 2021. Object handovers: A review for robotics. IEEE Transactions on Robotics 37, 6 (2021), 1855–1873.
[327]
Maike Paetzel and Ginevra Castellano. 2019. Let me get to know you better: Can interactions help to overcome uncanny feelings? In Proceedings of the 7th International Conference on Human-Agent Interaction (HAI’19). ACM, New York, NY, 59–67.
[328]
L. Pang, Y. Zhang, S. Coleman, and H. Cao. 2020. Efficient hybrid-supervised deep reinforcement learning for person following robot. Journal of Intelligent and Robotic Systems: Theory and Applications 97, 2 (2020), 299–312.
[329]
Christos Papadopoulos, Ioannis Mariolis, Angeliki Topalidou-Kyniazopoulou, Grigorios Piperagkas, Dimosthenis Ioannidis, and Dimitrios Tzovaras. 2019. An advanced human-robot interaction interface for collaborative robotic assembly tasks. In Rapid Automation: Concepts, Methodologies, Tools, and Applications. IGI Global, 794–812.
[330]
C.-B. Park and S.-W. Lee. 2011. Real-time 3D pointing gesture recognition for mobile robots with cascade HMM and particle filter. Image and Vision Computing 29, 1 (2011), 51–63.
[331]
S. Pasinetti, C. Nuzzi, M. Lancini, G. Sansoni, F. Docchio, and A. Fornaser. 2018. Development and characterization of a safety system for robotic cells based on multiple time of flight (TOF) cameras and point cloud analysis. In Proceedings of the 2018 Workshop on Metrology for Industry 4.0 and IoT (MetroInd 4.0 and IoT’18). 34–39.
[332]
Mandela Patrick, Dylan Campbell, Yuki Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and João F. Henriques. 2021. Keeping your eye on the ball: Trajectory attention in video transformers. In Advances in Neural Information Processing Systems 34.
[333]
T. Paulo, R. Fernando, and L. Gil. 2012. Vision-based hand segmentation techniques for human-robot interaction for real-time applications. In Proceedings of the 3rd ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing. 31–35. https://www.scopus.com/inward/record.uri?eid=2-s2.0-84856694609&partnerID=40&md5=a6a3bbed4537f2d964f66ea6f3d5bf9c.
[334]
Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. 2018. Ordinal depth supervision for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 7307–7316.
[335]
Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 2019. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 7753–7762.
[336]
Dexmont Pena, Andrew Forembski, Xiaofan Xu, and David Moloney. 2017. Benchmarking of CNNs for low-cost, low-power robotics applications. In Proceedings of the RSS 2017 Workshop: New Frontier for Deep Learning in Robotics. 1–5.
[337]
A. Pennisi, F. Previtali, C. Gennari, D. D. Bloisi, L. Iocchi, F. Ficarola, A. Vitaletti, and D. Nardi. 2015. Multi-robot surveillance through a distributed sensor network. Studies in Computational Intelligence 604 (2015), 77–98.
[338]
S.-G. Pentiuc and O.-M. Vultur. 2018. “Drive me”: An interaction system between human and robot. In Proceedings of the 2018 14th International Conference on Development and Application Systems (DAS’18). 144–149.
[339]
F. G. Pereira, R. F. Vassallo, and E. O. T. Salles. 2013. Human-robot interaction and cooperation through people detection and gesture recognition. Journal of Control, Automation and Electrical Systems 24, 3 (2013), 187–198.
[340]
T. Petric, A. Gams, L. Zlajpah, A. Ude, and J. Morimoto. 2014. Online approach for altering robot behaviors based on human in the loop coaching gestures. In Proceedings of the IEEE International Conference on Robotics and Automation. 4770–4776.
[341]
K. P. Pfeil, S. L. Koh, and J. J. LaViola Jr. 2013. Exploring 3D gesture metaphors for interaction with unmanned aerial vehicles. In Proceedings of the International Conference on Intelligent User Interfaces (IUI’13). 257–266.
[342]
Tomas Pfister, James Charles, and Andrew Zisserman. 2013. Large-scale learning of sign language by watching TV (using co-occurrences). In Proceedings of the British Machine Vision Conference.
[343]
N. T. T. Phong, L. H. T. Nam, and N. T. Thinh. 2020. Vietnamese service robot based on artificial intelligence. International Journal of Mechanical Engineering and Robotics Research 9, 5 (2020), 701–708.
[344]
Harry A. Pierson and Michael S. Gashler. 2017. Deep learning in robotics: A review of recent research. Advanced Robotics 31, 16 (2017), 821–835.
[345]
Francesco Pittaluga, Sanjeev J. Koppal, Sing Bing Kang, and Sudipta N. Sinha. 2019. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 145–154.
[346]
Alexandru Pop and Ovidiu Stan. 2019. Control a 6DOF anthropomorphic robotic structure with computer vision as MEMS input. In Proceedings of the 2019 22nd International Conference on Control Systems and Computer Science (CSCS’19). IEEE, Los Alamitos, CA, 700–706.
[347]
S. Potdar, A. Sawarkar, and F. Kazi. 2016. Learning by demonstration from multiple agents in humanoid robots. In Proceedings of the 2016 IEEE Students’ Conference on Electrical, Electronics, and Computer Science (SCEECS’16).
[348]
M. Prediger, A. Braun, A. Marinc, and A. Kuijper. 2014. Robot-supported pointing interaction for intelligent environments. In Distributed, Ambient, and Pervasive Interactions. Lecture Notes in Computer Science, Vol. 8530. Springer, 172–183.
[349]
Alexandru Ionut Pustianu, Adriana Serbencu, and Daniela Cristina Cernega. 2011. Mobile robot control using face recognition algorithms. In Proceedings of the 15th International Conference on System Theory, Control, and Computing. 1–6.
[350]
Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision.
[351]
K. Qian and C. Hu. 2013. Visually gesture recognition for an interactive robot grasping application. International Journal of Multimedia and Ubiquitous Engineering 8, 3 (2013), 189–196. https://www.scopus.com/inward/record.uri?eid=2-s2.0-84878477530&partnerID=40&md5=983edd9a03f6a4690e16308da763716e.
[352]
C. P. Quintero, R. T. Fomena, A. Shademan, O. Ramirez, and M. Jagersand. 2014. Interactive teleoperation interface for semi-autonomous control of robot arms. In Proceedings of the Conference on Computer and Robot Vision (CRV’14). 357–363.
[353]
C. P. Quintero, R. Tatsambon, M. Gridseth, and M. Jagersand. 2015. Visual pointing gestures for bi-directional human robot interaction in a pick-and-place task. In Proceedings of the IEEE International Workshop on Robot and Human Interactive Communication. 349–354.
[354]
A. A. Ramírez-Duque, A. Frizera-Neto, and T. F. Bastos. 2019. Robot-assisted autism spectrum disorder diagnostic based on artificial reasoning. Journal of Intelligent and Robotic Systems: Theory and Applications 96, 2 (2019), 267–281.
[355]
Gabriele Randelli, Taigo Maria Bonanni, Luca Iocchi, and Daniele Nardi. 2013. Knowledge acquisition through human–robot multimodal interaction. Intelligent Service Robotics 6, 1 (2013), 19–31.
[356]
A. U. Ratul, M. T. Ali, and R. Ahasan. 2016. Gesture based wireless shadow robot. In Proceedings of the 2016 5th International Conference on Informatics, Electronics, and Vision (ICIEV’16). 351–355.
[357]
Siddharth S. Rautaray and Anupam Agrawal. 2015. Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review 43, 1 (2015), 1–54.
[358]
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? arXiv:1902.10811 [cs.CV] (2019).
[359]
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 779–788.
[360]
Matthias Rehm and Anders Krogsager. 2013. Negative affect in human robot interaction—Impoliteness in unexpected encounters with robots. In Proceedings of the 2013 IEEE International Workshop on Robot and Human Communication (RO-MAN’13). IEEE, Los Alamitos, CA, 45–50.
[361]
S. Rehman, T. Ashraf, M. Umair, U. Zubair, Y. Ayaz, and H. Khan. 2017. Target detection and tracking using intelligent wheelchair. International Journal of Simulation: Systems, Science and Technology 18, 1 (2017), 4.1–4.8.
[362]
Zhongzheng Ren, Yong Jae Lee, and Michael S. Ryoo. 2018. Learning to anonymize faces for privacy preserving action detection. In Proceedings of the European Conference on Computer Vision (ECCV’18).
[363]
Jorge Rios-Martinez, Anne Spalanzani, and Christian Laugier. 2015. From proxemics theory to socially-aware navigation: A survey. International Journal of Social Robotics 7, 2 (2015), 137–153.
[364]
Paul Robinette, Ayanna Howard, and Alan R. Wagner. 2017. Conceptualizing Overtrust in Robots: Why Do People Trust a Robot That Previously Failed? Springer International Publishing, Cham, Switzerland, 129–155.
[365]
Nicole Lee Robinson, Timothy Vaughan Cottier, and David John Kavanagh. 2019. Psychosocial health interventions by social robots: Systematic review of randomized controlled trials. Journal of Medical Internet Research 21, 5 (May 2019), e13203.
[366]
Wendy A. Rogers, Travis Kadylak, and Megan A. Bayles. 2022. Maximizing the benefits of participatory design for human-robot interaction research with older adults. Human Factors 64, 3 (2022), 441–450.
[367]
Alina Roitberg, Alexander Perzylo, Nikhil Somani, Manuel Giuliani, Markus Rickert, and Alois Knoll. 2014. Human activity recognition in the context of industrial human-robot interaction. In Proceedings of the 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA’14). 1–10.
[368]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. 234–241.
[369]
Dorsa Sadigh, Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. 2016. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems, Vol. 2. 1–9.
[370]
R. Saegusa, L. Natale, G. Metta, and G. Sandini. 2011. Cognitive robotics-active perception of the self and others. In Proceedings of the 4th International Conference on Human System Interaction (HSI’11). 419–426.
[371]
Mohammad Taghi Saffar, Mircea Nicolescu, Monica Nicolescu, and Banafsheh Rekabdar. 2015. Context-based intent understanding using an Activation Spreading architecture. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’15). 3002–3009.
[372]
N. SaiChinmayi, Ch. Hasitha, B. Sravya, and V. K. Mittal. 2015. Gesture signals processing for a silent spybot. In Proceedings of the 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN’15). 756–761.
[373]
S. Saleh and K. Berns. 2015. Nonverbal communication with a humanoid robot via head gestures. In Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments (PETRA’15).
[374]
Ricardo Sanchez-Matilla, Konstantinos Chatzilygeroudis, Apostolos Modas, Nuno Ferreira Duarte, Alessio Xompero, Pascal Frossard, Aude Billard, and Andrea Cavallaro. 2020. Benchmark for human-to-robot handovers of unseen containers with unknown filling. IEEE Robotics and Automation Letters 5, 2 (April 2020), 1642–1649.
[375]
A. Sanna, F. Lamberti, G. Paravati, E. A. Henao Ramirez, and F. Manuri. 2012. A Kinect-based natural interface for quadrotor control. In Intelligent Technologies for Interactive Entertainment. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, Vol. 78. Springer, 48–56.
[376]
L. Santos, A. Geminiani, I. Olivieri, J. Santos-Victor, and A. Pedrocchi. 2020. CopyRobot: Interactive mirroring robotics game for ASD children. IFMBE Proceedings 76 (2020), 2014–2027.
[377]
Shane Saunderson and Goldie Nejat. 2019. How robots influence humans: A survey of nonverbal communication in social human–robot interaction. International Journal of Social Robotics 11, 4 (2019), 575–608.
[378]
Matteo Saveriano and Dongheui Lee. 2014. Safe motion generation and online reshaping using dynamical systems. In Proceedings of the 2014 11th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI’14). 45–45.
[379]
S. Scheggi, F. Morbidi, and D. Prattichizzo. 2014. Human-robot formation control via visual and vibrotactile haptic feedback. IEEE Transactions on Haptics 7, 4 (2014), 499–511.
[380]
B. Schmidt and L. Wang. 2013. Contact-less and programming-less human-robot collaboration. Procedia CIRP 7 (2013), 545–550.
[381]
Tanner Schmidt and Dieter Fox. 2020. Self-directed lifelong learning for robot vision. In Robotics Research, Nancy M. Amato, Greg Hager, Shawna Thomas, and Miguel Torres-Torriti (Eds.). Springer International Publishing, Cham, Switzerland, 109–114.
[382]
L. S. Scimmi, M. Melchiorre, S. Mauro, and S. Pastorelli. 2019. Experimental real-time setup for vision driven hand-over with a collaborative robot. In Proceedings of the 2019 International Conference on Control, Automation, and Diagnosis (ICCAD’19).
[383]
A. Shahroudy, J. Liu, T. Ng, and G. Wang. 2016. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
[384]
N. G. Shakev, S. A. Ahmed, A. V. Topalov, V. L. Popov, and K. B. Shiev. 2018. Autonomous flight control and precise gestural positioning of a small quadrotor. Studies in Computational Intelligence 756 (2018), 179–197.
[385]
M. Shariatee, H. Khosravi, and E. Fazl-Ersi. 2017. Safe collaboration of humans and SCARA robots. In Proceedings of the 4th RSI International Conference on Robotics and Mechatronics (ICRoM’16). 589–594.
[386]
Mikiya Shibuya, Shinya Sumikura, and Ken Sakurada. 2020. Privacy preserving visual SLAM. In Computer Vision—ECCV 2020, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, Switzerland, 102–118.
[387]
M.-Y. Shieh, C.-Y. Hsieh, and T.-M. Hsieh. 2014. Fuzzy visual detection for human-robot interaction. Engineering Computations (Swansea, Wales) 31, 8 (2014), 1709–1719.
[388]
D. Shukla, O. Erkent, and J. Piater. 2015. Probabilistic detection of pointing directions for human-robot interaction. In Proceedings of the 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA’15). 1–8.
[389]
D. Shukla, O. Erkent, and J. Piater. 2017. Proactive, incremental learning of gesture-action associations for human-robot collaboration. In Proceedings of the 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’17). 346–353.
[390]
Vinicius Silva, Filomena Soares, and João Sena Esteves. 2016. Mirroring emotion system—On-line synthesizing facial expressions on a robot face. In Proceedings of the 2016 8th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT’16). 213–218.
[391]
Nishikanto Sarkar Simul, Nusrat Mubin Ara, and Md. Saiful Islam. 2016. A support vector machine approach for real time vision based human robot interaction. In Proceedings of the 2016 19th International Conference on Computer and Information Technology (ICCIT’16). 496–500.
[392]
E. A. Sisbot, L. F. Marin-Urias, X. Broquère, D. Sidobre, and R. Alami. 2010. Synthesizing robot motions adapted to human presence: A planning and control framework for safe and socially acceptable robot motions. International Journal of Social Robotics 2, 3 (2010), 329–343.
[393]
H. Song, W. Feng, N. Guan, X. Huang, and Z. Luo. 2017. Towards robust ego-centric hand gesture analysis for robot control. In Proceedings of the 2016 IEEE International Conference on Signal and Image Processing (ICSIP’16). 661–666.
[394]
M. Sorostinean and A. Tapus. 2018. Activity recognition based on RGB-D and thermal sensors for socially assistive robots. In Proceedings of the 2018 15th International Conference on Control, Automation, Robotics, and Vision (ICARCV’18). 1298–1304.
[395]
S. Sosnowski, C. Mayer, K. Kühnlenz, and B. Radig. 2010. Mirror my emotions! Combining facial expression analysis and synthesis on a robot. In Proceedings of the 2nd International Symposium on New Frontiers in Human-Robot Interaction—A Symposium at the AISB 2010 Convention. 108–112. https://www.scopus.com/inward/record.uri?eid=2-s2.0-84863926598&partnerID=40&md5=109248e6985b1ff75829cc333a0d272e.
[396]
A. Sripada, H. Asokan, A. Warrier, A. Kapoor, H. Gaur, R. Patel, and R. Sridhar. 2019. Teleoperation of a humanoid robot with motion imitation and legged locomotion. In Proceedings of the 2018 3rd International Conference on Advanced Robotics and Mechatronics (ICARM’18). 375–379.
[397]
K. N. V. Sriram and S. Palaniswamy. 2019. Mobile robot assistance for disabled and senior citizens using hand gestures. In Proceedings of the 1st International Conference on Power Electronics Applications and Technology in Present Energy Scenario (PETPES’19).
[398]
T. Stipancic, B. Jerbic, A. Bucevic, and P. Curkovic. 2012. Programming an industrial robot by demonstration. In Proceedings of the 2012 23rd DAAAM International Symposium on Intelligent Manufacturing and Automation, Vol. 1. 15–18. https://www.scopus.com/inward/record.uri?eid=2-s2.0-84896948058&partnerID=40&md5=6e496dc214a79d3349c1c4326736d866.
[399]
V. Suma. 2019. Computer vision for human-machine interaction-review. Journal of Trends in Computer Science and Smart Technology 1, 02 (2019), 131–139.
[400]
Xiaowen Sun, Ran Zhao, Abdul Mateen Khattak, Kaite Shi, Yanzhao Ren, Wanlin Gao, and Minjuan Wang. 2019. Intelligent interactive robot system for agricultural knowledge popularity and achievements display. In Proceedings of the 2019 IEEE 4th Advanced Information Technology, Electronic, and Automation Control Conference (IAEAC’19), Vol. 1. 511–518.
[401]
Yongdian Sun, Xiangpeng Liang, Hua Fan, Muhammad Imran, and Hadi Heidari. 2019. Visual hand tracking on depth image using 2-D matched filter. In Proceedings of 2019 UK/China Emerging Technologies (UCET’19). 1–4.
[402]
Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, et al. 2018. The limits and potentials of deep learning for robotics. International Journal of Robotics Research 37, 4–5 (2018), 405–420.
[403]
Loreto Susperregi, Jose Maria Martínez-Otzeta, Ander Ansuategui, Aitor Ibarguren, and Basilio Sierra. 2013. RGB-D, laser and thermal sensor fusion for people following in a mobile robot. International Journal of Advanced Robotic Systems 10, 6 (2013), 271.
[404]
Petr Svarny, Michael Tesar, Jan Kristof Behrens, and Matej Hoffmann. 2019. Safe physical HRI: Toward a unified treatment of speed and separation monitoring together with power and force limiting. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’19). 7580–7587.
[405]
A. R. Taheri, M. Alemi, A. Meghdari, H. R. Pouretemad, and N. M. Basiri. 2014. Social robots as assistants for autism therapy in Iran: Research in progress. In Proceedings of the 2014 2nd RSI/ISM International Conference on Robotics and Mechatronics (ICRoM’14). 760–766.
[406]
Z. Talebpour, I. Navarro, and A. Martinoli. 2016. On-board human-aware navigation for indoor resource-constrained robots: A case-study with the ranger. In Proceedings of the 2015 IEEE/SICE International Symposium on System Integration (SII’15). 63–68.
[407]
J. T. C. Tan, F. Duan, R. Kato, and T. Arai. 2010. Safety strategy for human-robot collaboration: Design and development in cellular manufacturing. Advanced Robotics 24, 5–6 (2010), 839–860.
[408]
C. Tao and G. Liu. 2013. A multilayer hidden Markov models-based method for human-robot interaction. Mathematical Problems in Engineering 2013 (2013), 384865.
[409]
S. Tarbouriech and W. Suleiman. 2020. Bi-objective motion planning approach for safe motions: Application to a collaborative robot. Journal of Intelligent and Robotic Systems: Theory and Applications 99, 1 (2020), 45–63.
[410]
Angelique Taylor, Darren M. Chan, and Laurel D. Riek. 2020. Robot-centric perception of human groups. ACM Transactions on Human-Robot Interaction 9, 3 (2020), 1–21.
[411]
Angelique Taylor and Laurel D. Riek. 2016. Robot perception of human groups in the real world: State of the art. In Proceedings of the 2016 AAAI Fall Symposium Series.
[412]
M. Terreran, E. Lamon, S. Michieletto, and E. Pagello. 2020. Low-cost scalable people tracking system for human-robot collaboration in industrial environment. Procedia Manufacturing 51 (2020), 116–124.
[413]
Dante Tezza and Marvin Andujar. 2019. The state-of-the-art of human–drone interaction: A survey. IEEE Access 7 (2019), 167438–167454.
[414]
Sam Thellman, Annika Silvervarg, and Tom Ziemke. 2020. Anthropocentric attribution bias in human prediction of robot behavior. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI’20). ACM, New York, NY, 476–478.
[415]
Leimin Tian and Sharon Oviatt. 2021. A taxonomy of social errors in human-robot interaction. ACM Transactions on Human-Robot Interaction 10, 2 (Feb. 2021), Article 13, 32 pages.
[416]
M. Tölgyessy, M. Dekan, F. Duchoň, J. Rodina, P. Hubinský, and L. Chovanec. 2017. Foundations of visual linear human-robot interaction via pointing gesture navigation. International Journal of Social Robotics 9, 4 (2017), 509–523.
[417]
Michael Tornow, Ayoub Al-Hamadi, and Vinzenz Borrmann. 2013. A multi-agent mobile robot system with environment perception and HMI capabilities. In Proceedings of the 2013 IEEE International Conference on Signal and Image Processing Applications. IEEE, Los Alamitos, CA, 252–257.
[418]
N. A. Torres, N. Clark, I. Ranatunga, and D. Popa. 2012. Implementation of interactive arm playback behaviors of social robot Zeno for autism spectrum disorder therapy. In Proceedings of the 5th International Conference on Pervasive Technologies Related to Assistive Environments (PETRA’12).
[419]
Bill Triggs and Jakob Verbeek. 2007. Scene segmentation with CRFs learned from partially labeled images. In Advances in Neural Information Processing Systems 20.
[420]
S.-H. Tseng, Y.-H. Hsu, Y.-S. Chiang, T.-Y. Wu, and L.-C. Fu. 2014. Multi-human spatial social pattern understanding for a multi-modal robot through nonverbal social signals. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’14). 531–536.
[421]
A. Tsiami, P. Koutras, N. Efthymiou, P. P. Filntisis, G. Potamianos, and P. Maragos. 2018. Multi3: Multi-sensory perception system for multi-modal child interaction with multiple robots. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA’18). 4585–4592.
[422]
Satoshi Ueno, Sei Naito, and Tsuhan Chen. 2014. An efficient method for human pointing estimation for robot interaction. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP’14). IEEE, Los Alamitos, CA, 1545–1549.
[423]
Alvaro Uribe, Silas Alves, João M. Rosário, Humberto Ferasoli Filho, and Byron Pérez-Gutiérrez. 2011. Mobile robotic teleoperation using gesture-based human interfaces. In Proceedings of the 2011 IEEE IX Latin American Robotics Symposium and IEEE Colombian Conference on Automatic Control. 1–6.
[424]
Sepehr Valipour, Camilo Perez, and Martin Jagersand. 2017. Incremental learning for robot perception through HRI. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’17). 2772–2777.
[425]
A. F. Valle, G. Alenyà, G. Chance, P. Caleb-Solly, S. Dogramadzi, and C. Torras. 2019. Personalized robot assistant for support in dressing. IEEE Transactions on Cognitive and Developmental Systems 11, 3 (2019), 363–374.
[426]
M. Van Den Bergh, D. Carton, R. De Nijs, N. Mitsou, C. Landsiedel, K. Kuehnlenz, D. Wollherr, L. Van Gool, and M. Buss. 2011. Real-time 3D hand gesture interaction with a robot for understanding directions from humans. In Proceedings of the IEEE International Workshop on Robot and Human Interactive Communication. 357–362.
[427]
M. K. Van Den Broek and T. B. Moeslund. 2020. Ergonomic adaptation of robotic movements in human-robot collaboration. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 499–501.
[428]
P. A. A. Vasconcelos, H. N. S. Pereira, D. G. Macharet, and E. R. Nascimento. 2016. Socially acceptable robot navigation in the presence of humans. In Proceedings of the 12th Latin American Robotics Symposium and the 3rd SBR Brazilian Robotics Symposium (LARS-SBR’15). 222–227.
[429]
Juan P. Vasconez, George A. Kantor, and Fernando A. Auat Cheein. 2019. Human-robot interaction in agriculture: A survey and current challenges. Biosystems Engineering 179 (2019), 35–48.
[430]
A. Vasquez, M. Kollmitz, A. Eitel, and W. Burgard. 2017. Deep detection of people and their mobility aids for a hospital robot. In Proceedings of the 2017 European Conference on Mobile Robots (ECMR’17).
[431]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.
[432]
D. Vaufreydaz, W. Johal, and C. Combe. 2016. Starting engagement detection towards a companion robot using multimodal features. Robotics and Autonomous Systems 75 (2016), 4–16.
[433]
Marynel Vázquez, Aaron Steinfeld, and Scott E. Hudson. 2015. Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’15). IEEE, Los Alamitos, CA, 3010–3017.
[434]
Marynel Vázquez, Aaron Steinfeld, and Scott E. Hudson. 2016. Maintaining awareness of the focus of attention of a conversation: A robot-centric reinforcement learning approach. In Proceedings of the 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’16). IEEE, Los Alamitos, CA, 36–43.
[435]
A. Vignolo, A. Sciutti, F. Rea, N. Noceti, F. Odone, and G. Sandini. 2017. Computational vision for social intelligence. In Proceedings of the 2017 AAAI Spring Symposium Series, Vols. SS-17-01 and SS-17-08. 647–651.
[436]
Valeria Villani, Fabio Pini, Francesco Leali, and Cristian Secchi. 2018. Survey on human-robot collaboration in industrial settings: Safety, intuitive interfaces and applications. Mechatronics 55 (2018), 248–266.
[437]
Paul Viola and Michael Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’01), Vol. 1. IEEE, Los Alamitos, CA.
[438]
Paul Viola and Michael J. Jones. 2004. Robust real-time face detection. International Journal of Computer Vision 57 (2004), 137–154.
[439]
M. Virčíkova and P. Sinčák. 2015. Teach your robot how you want it to express emotions: On the personalized affective human-humanoid interaction. Advances in Intelligent Systems and Computing 316 (2015), 81–92.
[440]
Emil-Ioan Voisan, Bogdan Paulis, Radu-Emil Precup, and Florin Dragan. 2015. ROS-based robot navigation and human interaction in indoor environment. In Proceedings of the 2015 IEEE 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics. 31–36.
[441]
Tuan-Hung Vu, Anton Osokin, and Ivan Laptev. 2015. Context-aware CNNs for person head detection. In Proceedings of the International Conference on Computer Vision (ICCV’15).
[442]
A. Vysocký, R. Pastor, and P. Novák. 2019. Interaction with collaborative robot using 2D and TOF camera. In Modelling and Simulation for Autonomous Systems. Lecture Notes in Computer Science, Vol. 11472. Springer, 477–489.
[443]
Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In Proceedings of the 2011 Conference on Computer Vision and Pattern Recognition (CVPR’11).
[444]
L. Wang, B. Schmidt, and A. Y. C. Nee. 2013. Vision-guided active collision avoidance for human-robot collaborations. Manufacturing Letters 1, 1 (2013), 5–8.
[445]
Y. Wang, G. Song, G. Qiao, Y. Zhang, J. Zhang, and W. Wang. 2013. Wheeled robot control based on gesture recognition using the Kinect sensor. In Proceedings of the 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO’13). 378–383.
[446]
Y. Wang, X. Ye, Y. Yang, and W. Zhang. 2017. Collision-free trajectory planning in human-robot interaction through hand movement prediction from vision. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots. 305–310.
[447]
T. B. Waskito, S. Sumaryo, and C. Setianingsih. 2020. Wheeled robot control with hand gesture based on image processing. In Proceedings of the 2020 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT’20). 48–54.
[448]
T. Weber, S. Triputen, M. Danner, S. Braun, K. Schreve, and M. Rätsch. 2018. Follow me: Real-time in the wild person tracking application for autonomous robotics. In RoboCup 2017: Robot World Cup XXI. Lecture Notes in Computer Science, Vol. 11175. Springer, 156–167.
[449]
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). IEEE, Los Alamitos, CA, 4724–4732.
[450]
C. Weinrich, M. Volkhardt, and H.-M. Gross. 2013. Appearance-based 3D upper-body pose estimation and person re-identification on mobile robots. In Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC’13). 4384–4390.
[451]
Astrid Weiss, Judith Igelsböck, Manfred Tscheligi, Andrea Bauer, Kolja Kühnlenz, Dirk Wollherr, and Martin Buss. 2010. Robots asking for directions: The willingness of passers-by to support robots. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI’10). IEEE, Los Alamitos, CA, 23–30.
[452]
F. Werner, D. Krainer, J. Oberzaucher, and K. Werner. 2013. Evaluation of the acceptance of a social assistive robot for physical training support together with older users and domain experts. Assistive Technology Research Series 33 (2013), 137–142.
[453]
B.-F. Wu, C.-L. Jen, T.-Y. Tsou, W.-F. Li, and P.-Y. Tseng. 2012. Accompanist detection and following for wheelchair robots with fuzzy controller. In Proceedings of the 2012 International Conference on Advanced Mechatronic Systems (ICAMechS’12). 638–643.
[454]
Y. Wu, Q. Yang, and X. Zhou. 2019. An improved method of optical flow using human body-following wheeled robot. International Journal of Modeling, Simulation, and Scientific Computing 10, 2 (2019), 1950003.
[455]
Z. Wu and S. Payandeh. 2020. Toward design of a drip-stand patient follower robot. Journal of Robotics 2020 (2020), 9080642.
[456]
Z. Xia, Q. Lei, Y. Yang, H. Zhang, Y. He, W. Wang, and M. Huang. 2019. Vision-based hand gesture recognition for human-robot collaboration: A survey. In Proceedings of the 2019 5th International Conference on Control, Automation, and Robotics (ICCAR’19). 198–205.
[457]
Y. Xiao, Z. Zhang, A. Beck, J. Yuan, and D. Thalmann. 2014. Human-robot interaction by understanding upper body gestures. Presence: Teleoperators and Virtual Environments 23, 2 (2014), 133–154.
[458]
D. Xu, X. Wu, Y.-L. Chen, and Y. Xu. 2014. Online dynamic gesture recognition for human robot interaction. Journal of Intelligent and Robotic Systems: Theory and Applications 77, 3-4 (2014), 583–596.
[459]
J. Xu, J. Li, S. Zhang, C. Xie, and J. Dong. 2020. Skeleton guided conflict-free hand gesture recognition for robot control. In Proceedings of the 2020 11th International Conference on Awareness Science and Technology (iCAST’20). 1–6.
[460]
Yuji Yamakawa, Yutaro Matsui, and Masatoshi Ishikawa. 2018. Human–robot collaborative manipulation using a high-speed robot hand and a high-speed camera. In Proceedings of the 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS’18). IEEE, Los Alamitos, CA, 426–429.
[461]
Takafumi Yamamoto, Yoji Yamada, Masaki Onishi, and Yoshihiro Nakabo. 2011. A 2D safety vision system for human-robot collaborative work environments based upon the safety preservation design policy. In Proceedings of the 2011 IEEE International Conference on Robotics and Biomimetics (ROBIO’11). IEEE, Los Alamitos, CA, 2049–2054.
[462]
Haibin Yan, Marcelo H. Ang, and Aun Neow Poo. 2014. A survey on perception methods for human–robot interaction in social robots. International Journal of Social Robotics 6, 1 (2014), 85–119.
[463]
Jihong Yan, Chao Chen, Zipeng Wang, Lizhong Zhao, and Dianguo Li. 2020. An optimization method for human-robot collaboration in a production unit in the context of intelligent manufacturing. In Proceedings of the 2020 8th International Conference on Information Technology: IoT and Smart City (ICIT’20). ACM, New York, NY, 244–250.
[464]
Holly A. Yanco and Jill Drury. 2004. Classifying human-robot interaction: An updated taxonomy. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3. IEEE, Los Alamitos, CA, 2841–2846.
[465]
C.-T. Yang, T. Zhang, L.-P. Chen, and L.-C. Fu. 2019. Socially-aware navigation of omnidirectional mobile robot with extended social force model in multi-human environment. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. 1963–1968.
[466]
Dongfang Yang, Haolin Zhang, Ekim Yurtsever, Keith A. Redmill, and Ümit Özgüner. 2022. Predicting pedestrian crossing intention with feature fusion and spatio-temporal attention. IEEE Transactions on Intelligent Vehicles 7, 2 (2022), 221–230.
[467]
N. Yang, F. Duan, Y. Wei, C. Liu, J. T. C. Tan, B. Xu, and J. Zhang. 2013. A study of the human-robot synchronous control system based on skeletal tracking technology. In Proceedings of the 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO’13). 2191–2196.
[468]
Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2016. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
[469]
Y. Yang, H. Yan, M. Dehghan, and M. H. Ang. 2015. Real-time human-robot interaction in complex environment using Kinect v2 image recognition. In Proceedings of the 2015 7th IEEE International Conference on Cybernetics and Intelligent Systems (CIS’15) and Robotics, Automation, and Mechatronics (RAM’15). 112–117.
[470]
N. Yao, E. Anaya, Q. Tao, S. Cho, H. Zheng, and F. Zhang. 2017. Monocular vision-based human following on miniature robotic blimp. In Proceedings of the IEEE International Conference on Robotics and Automation. 3244–3249.
[471]
B.-S. Yoo and J.-H. Kim. 2017. Gaze control of humanoid robot for learning from demonstration. Advances in Intelligent Systems and Computing 447 (2017), 263–270.
[472]
K. Yoshida, F. Hibino, Y. Takahashi, and Y. Maeda. 2011. Evaluation of pointing navigation interface for mobile robot with spherical vision system. In Proceedings of the IEEE International Conference on Fuzzy Systems. 721–726.
[473]
C. Yu and A. Tapus. 2019. Interactive robot learning for multimodal emotion recognition. In Social Robotics. Lecture Notes in Computer Science, Vol. 11876. Springer, 633–642.
[474]
J. Yu and W. Paik. 2019. Efficiency and learnability comparison of the gesture-based and the mouse-based telerobotic systems. Studies in Informatics and Control 28, 2 (2019), 213–220.
[475]
W. Yuan and Z. Li. 2018. Development of a human-friendly robot for socially aware human-robot interaction. In Proceedings of the 2017 2nd International Conference on Advanced Robotics and Mechatronics (ICARM’17). 76–81.
[476]
X. Yuan, S. Dai, and Y. Fang. 2020. A natural immersive closed-loop interaction method for human-robot “Rock-Paper-Scissors” game. In Recent Trends in Intelligent Computing, Communication and Devices. Advances in Intelligent Systems and Computing, Vol. 1006. Springer, 103–111.
[477]
Yufeng Yue, Xiangyu Liu, Yuanzhe Wang, Jun Zhang, and Danwei Wang. 2020. Human-robot teaming and coordination in day and night environments. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics, and Vision (ICARCV’20). 375–380.
[478]
S. Yun, C. G. Kim, M. Kim, and M.-T. Choi. 2010. Robust robot’s attention for human based on the multi-modal sensor and robot behavior. In Proceedings of the IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO’10). 117–122.
[479]
W.-H. Yun, Y.-J. Cho, D. Kim, J. Lee, H. Yoon, and J. Kim. 2013. Robotic person-tracking with modified multiple instance learning. In Proceedings of the IEEE International Workshop on Robot and Human Interactive Communication. 198–203.
[480]
Angeliki Zacharaki, Ioannis Kostavelis, Antonios Gasteratos, and Ioannis Dokas. 2020. Safety bounds in human robot interaction: A survey. Safety Science 127 (2020), 104667.
[481]
Martina Zambelli and Yiannis Demiris. 2016. Multimodal imitation using self-learned sensorimotor representations. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’16). 3953–3958.
[482]
D. Zardykhan, P. Svarny, M. Hoffmann, E. Shahriari, and S. Haddadin. 2019. Collision preventing phase-progress control for velocity adaptation in human-robot collaboration. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots. 266–273.
[483]
Ayberk Özgur, Stéphane Bonardi, Massimo Vespignani, Rico Möckel, and Auke J. Ijspeert. 2014. Natural user interface for Roombots. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication. 12–17.
[484]
Bo Zhang, Guanglong Du, Wenming Shen, and Fang Li. 2019. Gesture-based human-robot interface for dual-robot with hybrid sensors. Industrial Robot 46, 6 (Oct. 2019), 800–811.
[485]
Hao Zhang, Christopher Reardon, and Lynne E. Parker. 2013. Real-time multiple human perception with color-depth cameras on a mobile robot. IEEE Transactions on Cybernetics 43, 5 (2013), 1429–1441.
[486]
Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. 2019. A comprehensive survey of vision-based human action recognition methods. Sensors 19, 5 (2019), 1005.
[487]
J. Zhang, P. Li, T. Zhu, W.-A. Zhang, and S. Liu. 2020. Human motion capture based on Kinect and IMUs and its application to human-robot collaboration. In Proceedings of the 2020 5th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM’20). 392–397.
[488]
K. Zhang and L. Zhang. 2018. Indoor omni-directional mobile robot that track independently. Journal of Computers (Taiwan) 29, 2 (2018), 118–135.
[489]
L. Zhang, K. Mistry, M. Jiang, S. Chin Neoh, and M. A. Hossain. 2015. Adaptive facial point detection and emotion recognition for a humanoid robot. Computer Vision and Image Understanding 140 (2015), 93–114.
[490]
L. Zhang and R. Vaughan. 2016. Optimal robot selection by gaze direction in multi-human multi-robot interaction. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems. 5077–5083.
[491]
L. Zhang and K. Zhang. 2018. An interactive control system for mobile robot based on cloud services. Concurrency Computation 30, 24 (2018), e4983.
[492]
M. Zhang, X. Liu, D. Xu, Z. Cao, and J. Yu. 2019. Vision-based target-following guider for mobile robot. IEEE Transactions on Industrial Electronics 66, 12 (2019), 9360–9371.
[493]
Z. Zhang, Z. Chen, and W. Li. 2018. Automating robotic furniture with a collaborative vision-based sensing scheme. In Proceedings of the 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’18). 719–725.
[494]
C. Zhao, W. Pan, and H. Hu. 2013. Interactive indoor environment mapping through visual tracking of human skeleton. International Journal of Modelling, Identification and Control 20, 4 (2013), 319–328.
[495]
Lijun Zhao, Xiaoyu Li, Peidong Liang, Chenguang Yang, and Ruifeng Li. 2016. Intuitive robot teaching by hand guided demonstration. In Proceedings of the 2016 IEEE International Conference on Mechatronics and Automation. IEEE, Los Alamitos, CA, 1578–1583.
[496]
L. Zhao, Y. Liu, K. Wang, P. Liang, and R. Li. 2016. An intuitive human robot interface for tele-operation. In Proceedings of the 2016 IEEE International Conference on Real-Time Computing and Robotics (RCAR’16). 454–459.
[497]
Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15). IEEE, Los Alamitos, CA, 1116–1124.
[498]
M.-D. Zhu, L.-X. Xia, and J.-B. Su. 2016. Real-time imitation framework for humanoid robots based on posture classification. In Proceedings of the International Conference on Machine Learning and Cybernetics, Vol. 2. 489–494.
[499]
T. Zhu, Q. Zhao, W. Wan, and Z. Xia. 2017. Robust regression-based motion perception for online imitation on humanoid robot. International Journal of Social Robotics 9, 5 (2017), 705–725.

Published In

ACM Transactions on Human-Robot Interaction, Volume 12, Issue 1
March 2023
454 pages
EISSN: 2573-9522
DOI: 10.1145/3572831

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 February 2023
Online AM: 05 December 2022
Accepted: 22 September 2022
Revised: 18 May 2022
Received: 28 September 2021
Published in THRI Volume 12, Issue 1

Author Tags

  1. Robotic vision
  2. computer vision
  3. human-robot interaction
  4. gesture recognition
  5. robot movement in human spaces
  6. object handover
  7. collaborative actions
  8. learning from demonstration
  9. social communication

Qualifiers

  • Tutorial

Funding Sources

  • Australian Research Council
