1 Introduction
For more than five decades, researchers in the field of Human–Robot Interaction (HRI) have been building robots and studying how they can collaborate with humans, support them in their work, and assist them in their daily lives [
92,
101,
160]. For example, autonomous mobile robots work side by side with skilled human workers in factories and retail sectors [
165]. Social robots inform and guide passengers in large and busy airports [
187]. In both clinical and home settings, robots have been used to assist healthcare workers, clean rooms, ferry supplies, and support people with disabilities and older adults in rehabilitation and task assistance [
160].
There is emerging interest in using robotics technology to address key challenges in healthcare, particularly those related to the quality, safety, and cost of care delivery. However, there are several key contextual challenges to realizing this vision. One major concern is the rapidly increasing cost of healthcare. For example, in the United States, healthcare is expensive across a range of areas, including administration, pharmaceutical spending, individual services, and the salaries of highly trained healthcare workers [
12]. Another challenge is the dynamic nature of clinical environments with occupational hazards that put healthcare workers at risk of injury and disability [
93,
182,
Additionally, there is a global shortfall of professional healthcare workers with sufficient clinical education and skills [
203].
Providing healthcare systems with robots may help address these gaps. For example, robots can support the independence of people with disabilities by enabling transitions to home-based care. Robots can also help clinicians and caregivers with care tasks including physical, cognitive, and manipulation tasks [
20,
88,
95,
160,
198], as well as healthcare worker education (see Figure
1).
Robots can potentially enable healthcare workers to spend more time with patients and less time engaging in “non-value added” physical tasks, and reduce errors caused by being overburdened with such tasks [
88,
182]. These physical tasks include transportation, inventory, and time spent searching and waiting [
160]. For example, Tug robots [
14] are medical transportation robots that autonomously move through hospitals, delivering supplies, meals, and medication to patients.
Moreover, robots can assist in clinical learning. For example, humanoid patient simulators can mimic human function (physiology) or anatomy (biology). Some of these simulators are engineered systems that model information integration and flow to help clinical learners study human physiology. Others present models of human patient biology and cognition to provide clinicians with a platform to practice different skills including task execution, testing and validation, diagnosis and prognosis, training, and social and cognitive interaction.
Robotic patient simulators (RPSs), virtual patient simulators (VPSs), and augmented reality patient simulators (APSs) are three main technologies used to represent realistic, expressive patients within the context of clinical education. Clinical educators (CEs) can use them to convey realistic scenarios, and clinical learners (CLs) can practice different procedural and communication skills without harming real patients.
Although there are many benefits associated with using RPS, VPS, and APS systems, their designs suffer from a lack of facial expressions (FEs), which are both a key social function and a key clinical cue conveyed by real patients. While equipping RPS and VPS systems with an expressive face can address this gap, it also introduces a harder design problem: facial expressions are highly person-dependent and vary from person to person [
212]. It is challenging to analyze, model, and synthesize the FEs of a small subgroup of patients on simulators’ faces and then to develop generalized expressive simulator systems capable of representing a diverse group of patients (including, but not limited to, people of different ages, genders, and ethnicities who are affected by different diseases and conditions) [
212].
Another challenge is that exhibiting symptoms incorrectly (or not at all) on a simulator’s face may reinforce incorrect skills in CLs and could lead to future patient harm. Furthermore, developers may face physical limitations that prevent them from advancing the state of the art. For example, VPSs are limited by flat two-dimensional (2D) display media, making them unable to represent a physical 3D human-shaped volume that clinicians can palpate to perform clinical assessments. Other challenges include the simulator’s usability, controllability, high cost, and physical limitations, as well as the need to recruit experts with varied skills.
Tackling these technical challenges to advance the state of the art requires work on several fronts. These include the creation of capable and usable RPS and VPS systems, new techniques for recognizing and synthesizing facial expressions on simulators, novel computational methods for developing humanlike face models for them, and new means for evaluating these systems. Ultimately, addressing these gaps can provide healthcare education with realistic, expressive simulators capable of mimicking patientlike expressions. This has the potential to positively affect CLs’ retention and, eventually, revolutionize healthcare education.
In this review, we discuss research at the intersection of robotics, computer vision, and clinical education to enable socially interactive robots and virtual agents to simulate human-patient-like expressions and interact with real humans. In Section
2, we provide an overview of the root causes of preventable patient harm and contextualize clinical education as a means for addressing it. We describe common learning modalities, including VPS and RPS systems, and outline key opportunities to improve them. Sections
2.4–
5 discuss the importance of incorporating humanlike FEs in RPS and VPS systems and algorithmic approaches for doing so. In Section
6, we discuss our recent research on creating expressive VPS and RPS systems, with diverse appearances and features, which show promise as an important clinical education tool. Finally, Sections
7 and
9 explore open problems in the field and discuss new directions for future work.
3 Automatic Facial Expression Analysis
To build robots and virtual avatars that can replicate realistic, understandable, humanlike expressions, it is necessary to be able to recognize how people express FEs. This section discusses common methods for manually and automatically detecting, locating, and analyzing humanlike expressions in the presence of noise and clutter. First, we list a few key concepts.
Facial Landmarks (FLs), also known as facial feature points or facial fiducial points, are visually salient points in the facial area, mainly located around facial components and contours such as the eyes, mouth, nose, and chin.
Facial Action Units (AUs) are individual components of facial movement, each representing the action of one or several specific facial muscles in a facial region bounded by specific FLs [
16]. Researchers introduced 46 main facial AUs [
185], and others have added 8 head movement AUs and 4 eye movement AUs [
16]. Examples include AU6-Cheek Raiser, AU12-Lip Corner Puller, AU5-Upper Lid Raiser, and AU26-Jaw Drop. To display a specific facial expression, people activate a specific subset of AUs across different facial regions. For example, researchers have identified that AU6 and AU10 are associated with the expression of pain, and AU10 with the expression of disgust [
131].
Facial Action Coding System (FACS) is a system for manually describing facial actions according to their appearance, first published in 1978 and later updated in 2002 [
78]. The main focus of FACS is to recognize facial expression configuration, which refers to the combination of AUs; that is, the system maps facial expression changes to the set of facial AUs (out of the 46 uniquely defined AUs) that produce them. The system also characterizes the variation of AU
intensity, which represents the degree of difference between the current facial expression and the neutral face [
144]. FACS provides a five-point intensity scale (A–E) for representing AU intensity, where A is the weakest and E the strongest.
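As a concrete illustration of configuration and intensity coding, the following Python sketch parses FACS-style codes such as “6B+12C” into (AU, intensity) pairs and maps the A–E letters onto an ordinal 1–5 scale. The code format and helper names are illustrative assumptions rather than part of any standard toolkit.

    import re

    # Ordinal mapping for the FACS A-E intensity scale (A = weakest, E = strongest).
    INTENSITY_SCALE = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

    def parse_facs_code(code):
        """Parse a FACS-style configuration string, e.g., '6B+12C',
        into a list of (AU number, ordinal intensity) pairs."""
        pairs = []
        for token in code.split("+"):
            match = re.fullmatch(r"\s*(\d+)([A-E])?\s*", token)
            if match is None:
                raise ValueError(f"Unrecognized FACS token: {token!r}")
            au = int(match.group(1))
            # Default to the midpoint (C = 3) when no intensity letter is given.
            intensity = INTENSITY_SCALE[match.group(2)] if match.group(2) else 3
            pairs.append((au, intensity))
        return pairs

    # Example: AU6 (Cheek Raiser) at intensity B plus AU12 (Lip Corner Puller) at C.
    print(parse_facs_code("6B+12C"))  # [(6, 2), (12, 3)]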
Manual FACS coding is based on annotations by trained FACS coders, who manually recognize both the configuration and intensity of AUs in video recordings of an individual according to the AUs described by FACS [
78]. However, manual FACS rating requires extensive training and is subjective and time-consuming, making it impractical for real-time applications [
96].
Nowadays, many researchers work on automating FACS systems to analyze AUs [
90]. Using automatic FACS instead of a manual approach can be beneficial because training experts and manually scoring videos is time-consuming. Furthermore, studies suggest that using automatic FACS can enhance the reliability, accuracy, and temporal resolution of facial measurements [
125].
In developing these systems, in addition to configuration and intensity variation, researchers also analyze facial expression dynamics (i.e., the timing and duration of different AUs). Dynamics can be important for interpreting human facial movement [
90]. For example, facial expression dynamics can be beneficial for learning complex physiological behavioral states such as different types of pain [
200].
The rest of this section briefly describes the main stages involved in automatic
facial expression analysis (FEA), as suggested in a recent survey by Martinez et al. [
125], which include face detection and tracking, facial point detection and tracking, facial feature selection and extraction, AU classification based on extracted features, and new approaches for jointly estimating landmark locations and AU intensity. Finally, we include a list of facial expression analysis software packages used by the community.
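To make the overall flow of such a pipeline concrete, the sketch below strings the stages together as Python function stubs; the function names and the use of simple NumPy arrays are illustrative assumptions, not a reference implementation from the survey.

    import numpy as np

    def detect_faces(frame):
        """Stage 1: return bounding boxes (x, y, w, h) for faces in the frame."""
        raise NotImplementedError  # e.g., a CNN-based detector (Section 3.1)

    def detect_landmarks(frame, box):
        """Stage 2: return an (N, 2) array of facial landmark coordinates."""
        raise NotImplementedError  # e.g., cascaded/deep regression (Section 3.2)

    def extract_features(frame, landmarks):
        """Stage 3: return a feature vector (appearance and/or geometry)."""
        raise NotImplementedError  # e.g., CNN or autoencoder features (Section 3.3)

    def classify_aus(features):
        """Stage 4: map features to AU activations/intensities."""
        raise NotImplementedError  # e.g., SVM or end-to-end DNN (Sections 3.4-3.5)

    def analyze_frame(frame):
        """Run the full FEA pipeline on a single video frame."""
        results = []
        for box in detect_faces(frame):
            landmarks = detect_landmarks(frame, box)
            features = extract_features(frame, landmarks)
            results.append(classify_aus(features))
        return results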
3.1 Face Detection and Tracking
To engage in facial expression analysis, systems need to be able to engage in “face localization,” which Deng et al. define as including face detection, alignment, parsing, and dense face localization [
75].
Deng et al. introduced RetinaFace [
74,
75], “a robust, single-stage, multi-level face detector.” It performs face localization at different scales of the image plane using joint extra-supervised and self-supervised multi-task learning, and it is widely regarded as one of the most robust approaches to face detection. Others have made strides on related problems; for example, Hu et al. [
100] explored a new approach of training separate detectors for face images at different scales, which reduced error by a factor of two compared to prior state-of-the-art methods.
In general, most current methods for face detection employ deep learning techniques, including Cascade-Convolutional Neural Network (Cascade-CNN)–based models, region-based Convolutional Neural Network (R-CNN) and Faster Regions with Convolutional Neural Network Features (Faster R-CNN)–based models, Single Shot Detector models, and Feature Pyramid Network–based models; see Reference [
129] for a recent survey.
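As a minimal, runnable illustration of the face detection stage (using a classical Haar-cascade detector bundled with OpenCV rather than the deep detectors surveyed above), consider the following sketch; the image path is a placeholder assumption.

    import cv2

    # Classical Haar-cascade face detector shipped with OpenCV (not a deep model,
    # but sufficient to illustrate the detection stage of an FEA pipeline).
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("patient_frame.jpg")          # placeholder input image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Returns a list of (x, y, w, h) bounding boxes, one per detected face.
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detected_faces.jpg", image)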
3.2 Facial Feature Point Detection and Face Alignment
Facial feature point detection (FFPD), also known as landmark localization, generally refers to a supervised or semi-supervised process of detecting the locations of FLs. FFPD algorithms are sensitive to facial deformations, which can be rigid (e.g., scale, rotation, and translation) or non-rigid (e.g., facial expression variation), as well as to head pose, illumination, noise, clutter, and occlusion [
194]. Enabling FFPD methods to align faces in an input image can reduce the effect of changes in face scale and in-plane rotation.
Cascaded regression-based methods are one type of FFPD method that recognize either local patches or global facial appearance variations and directly learn a regression function to map facial appearance to the FL locations of the target image [
205]. These methods do not explicitly build a global shape model, but they may implicitly embed information about global shape constraints (i.e., they estimate the shape directly from the appearance without learning a separate shape or appearance model).
Deep learning regression-based methods combine deep learning models, such as CNN, with global shape models to enhance performance. Early work in this field employed Cascaded CNNs [
177], which predict landmarks in a cascaded way. Researchers then presented Multi-task CNNs [
208] to further benefit from multi-task learning and increase performance. Studies show that cascaded regression with deep learning (DL) outperforms plain cascaded regression, which in turn outperforms direct regression [
205].
In terms of facial feature point detection and face alignment, the
Face Alignment Network (FAN) proposed by Bulat and Tzimiropoulos [
58] is considered to be the state of the art. They constructed FAN by combining a landmark localization architecture with a residual block, trained the network on a 2D facial landmark dataset, and evaluated it in large-scale 2D and 3D face alignment experiments. Researchers have proposed follow-up methods to reduce the complexity of the original approach. For example, MobileNets are a class of efficient models that use lightweight deep neural networks (DNNs) to improve efficiency [
99].
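For a concrete sense of the landmark localization stage, the sketch below uses dlib’s classic 68-point shape predictor (a pre-deep-learning baseline, not FAN itself); the pre-trained model file must be downloaded separately, and the image path is a placeholder.

    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # The 68-point model (shape_predictor_68_face_landmarks.dat) is distributed
    # separately by dlib and must be downloaded before running this sketch.
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    image = dlib.load_rgb_image("patient_frame.jpg")   # placeholder input image

    for rect in detector(image, 1):                    # upsample once for small faces
        shape = predictor(image, rect)
        # Collect the 68 (x, y) landmark coordinates into an array.
        landmarks = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
        print(landmarks.shape)                         # (68, 2)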
3.3 Facial Feature Selection and Extraction
If the number of facial features becomes relatively large in comparison to the number of observations in a dataset, then some algorithms may not be able to train models effectively. High-dimensional vectors may cause two problems for classifiers: first, data become sparser in high-dimensional space, and second, too many extracted features may cause overfitting [
102].
Li and Deng [
117] provide a recent comprehensive survey on deep facial expression recognition and include discussion of feature learning and feature extraction techniques. A few examples are briefly discussed below.
CNNs have been widely employed for feature extraction due to their robustness to changes and variations in face location [
87]. For example, researchers in Reference [
176] used R-CNN to combine multi-modal texture features for facial expression recognition in the wild. Moreover, researchers [
116] proposed a Faster R-CNN technique that avoids an explicit feature extraction step by producing region proposals.
Deep autoencoders and their variations have also been used for feature extraction. For example, researchers [
114] used the
deep sparse autoencoder network (DSAE) on a large dataset of images to prune learned features and develop high-level feature detectors using unlabeled data. The proposed DSAE-based detector is robust to different transformations, including translation, scaling, and rotation. As another example, researchers [
162] employed a contractive autoencoder network that adds a penalty term to induce locally invariant features, leading to a more robust feature set.
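The following PyTorch sketch illustrates the general idea behind such autoencoder-based feature learning: an encoder compresses a flattened face crop into a low-dimensional code, and an L1 penalty on the code encourages sparsity. The architecture sizes and penalty weight are arbitrary assumptions, not the settings used in the cited works.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Minimal sparse autoencoder for flattened 64x64 grayscale face crops."""
        def __init__(self, input_dim=64 * 64, code_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                         nn.Linear(512, code_dim), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                         nn.Linear(512, input_dim), nn.Sigmoid())

        def forward(self, x):
            code = self.encoder(x)
            return self.decoder(code), code

    model = SparseAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    sparsity_weight = 1e-4                      # arbitrary illustrative value

    faces = torch.rand(32, 64 * 64)             # stand-in for a batch of face crops
    reconstruction, code = model(faces)
    loss = nn.functional.mse_loss(reconstruction, faces) \
           + sparsity_weight * code.abs().mean()   # L1 penalty encourages sparse codes
    loss.backward()
    optimizer.step()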
3.4 Facial Feature Classification
In the classification step, the classifier predicts expressions by assigning the extracted facial features to different categories. As with the feature extraction stage, classification performance directly affects the overall performance of the facial expression recognition system.
Early facial feature classification work used techniques such as Naive Bayes [
123,
179], multi-layer perceptrons [
55,
150], and SVMs [
157]; however, these have largely fallen out of favor in light of newer deep learning methods. While traditional facial expression analysis approaches usually perform the feature extraction and feature classification steps independently, deep facial expression analysis approaches can perform both steps in an end-to-end manner by adding a loss layer as the final layer of the DNN to adjust the error and then directly estimating the probability distribution over a set of classes [
117].
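The sketch below shows what such end-to-end training looks like in PyTorch: a small CNN maps face crops directly to a probability distribution over expression classes, with a cross-entropy loss as the final layer. The architecture and the seven-class assumption are illustrative, not taken from the cited works.

    import torch
    import torch.nn as nn

    class ExpressionCNN(nn.Module):
        """Small end-to-end CNN: 48x48 grayscale face crop -> 7 expression logits."""
        def __init__(self, num_classes=7):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            )
            self.classifier = nn.Linear(64 * 12 * 12, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    model = ExpressionCNN()
    criterion = nn.CrossEntropyLoss()            # the "loss layer" for end-to-end training
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    images = torch.rand(16, 1, 48, 48)           # stand-in batch of face crops
    labels = torch.randint(0, 7, (16,))          # stand-in expression labels
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()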
For this purpose, many researchers have adapted CNN techniques for expression detection and classification [
57,
119,
211]. The results of work done by Zeng et al. [
119] show that CNN classifiers trained faster and performed well. Another study indicates that CNN classifiers also provide better accuracy than other neural-network-based classifiers [
157]. One main challenge for some CNN classifiers is that they are sensitive to occlusion [
119].
In addition to using deep neural networks for end-to-end training, other researchers [
40,
76,
142,
170] have used DNNs for feature extraction and then added independent classifiers to the system for expression classification.
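As a contrast to end-to-end training, the sketch below extracts features with a pretrained CNN and feeds them to an independent SVM classifier, mirroring the two-stage approach described above. The choice of ResNet-18 and the scikit-learn SVM are illustrative assumptions, and the exact weights argument may vary across torchvision versions.

    import torch
    import torch.nn as nn
    import torchvision.models as models
    from sklearn.svm import SVC

    # Pretrained backbone used purely as a fixed feature extractor
    # (requires torchvision >= 0.13 and downloads ImageNet weights on first use).
    backbone = models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = nn.Identity()                  # drop the ImageNet classification head
    backbone.eval()

    def extract_features(batch):
        """Return 512-D ResNet-18 features for a batch of 3x224x224 face crops."""
        with torch.no_grad():
            return backbone(batch)

    # Stand-in data: 20 face crops with 7 expression labels.
    faces = torch.rand(20, 3, 224, 224)
    labels = torch.randint(0, 7, (20,)).numpy()

    features = extract_features(faces).numpy()
    clf = SVC(kernel="rbf").fit(features, labels)   # independent classifier
    print(clf.predict(features[:5]))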
3.5 Jointly Estimating Landmark Locations and Action Unit Intensity
Early FEA work often included a computationally intensive and laborious process (e.g., face and facial landmark detection, hand-crafted feature extraction, and limited classification methods). Nowadays, researchers benefit from having access to comprehensive, large-scale facial datasets, as well as advanced computing resources to develop more efficient facial analysis methods [
68,
84,
85,
110,
118,
120].
One line of research is work on jointly estimating landmark locations and action unit intensities. For example, Wu et al. [
204] proposed a constrained joint cascade regression framework to simultaneously perform landmark detection and AU intensity estimation. This method learns a constraint that models the correlation between AUs and face shapes. They then use the learned constraint within the proposed framework to estimate landmark locations and recognize AUs. The results of the study suggest that exploiting the connection between these two tasks can improve performance on both.
Furthermore, many researchers consider the work done by Ntinou et al. [
141] to be the state-of-the-art method for jointly estimating landmark locations and AU intensity. In this work, the researchers employed heatmap regression to model the existence of an AU at a specific location, using a transfer learning technique between the face alignment network and the AU network.
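To illustrate the core idea of intensity-scaled heatmap regression (a simplified reading of the cited approach, not its actual implementation), the sketch below renders an AU’s target heatmap as a 2D Gaussian centered on an associated landmark, with the peak amplitude set by the AU intensity.

    import numpy as np

    def au_heatmap(center_xy, intensity, size=64, sigma=2.0):
        """Render a size x size heatmap: a Gaussian at `center_xy` whose peak
        equals the normalized AU intensity (0 = absent, 1 = maximum)."""
        xs, ys = np.meshgrid(np.arange(size), np.arange(size))
        cx, cy = center_xy
        gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        return intensity * gauss

    # Example: AU12 (Lip Corner Puller) near a mouth-corner landmark at (40, 45),
    # with intensity C on the A-E scale mapped to 3/5 = 0.6.
    heatmap = au_heatmap(center_xy=(40, 45), intensity=0.6)
    print(heatmap.shape, heatmap.max())   # (64, 64) 0.6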
It is worth mentioning that the newer directions for estimating AU intensity seek learning models with little or no supervision, including work done by Sanchez et al. [
168], Wang and Peng [
196], Wang et al. [
195], and Zhang et al. [
210].
One of the applications for AU intensity estimation is to further analyze and synthesize facial expressions representing specific feelings, such as pain. Many researchers have already conducted studies that indicate there is a relationship between a combination of AUs and pain, including work done by Kaltwang et al. [
109] and Werner et al. [
199]. Furthermore, a fully functional automatic pain estimation system requires sufficient representative data; several pain datasets are publicly available for this purpose (cf. Reference [
122]).
3.6 Facial Expression Analysis Software
Dynamic FEA systems integrate automatic FACS to assess human expressions. Several commercial and open source FEA software packages are available, including iMotions, AFFDEX, FaceReader, IntraFace, and OpenFace 2.0.
iMotions developed a commercial tool for FEA that supports assessing FEs in combination with EEG, GSR, EMG, ECG, and eye tracking [
24]. This tool lets users record videos with a mobile phone camera or laptop webcam and then detects changes in FLs. The researcher can set the tool to apply either the AFFDEX algorithm by Affectiva Inc. [
80] or the
Computer Expression Recognition Toolbox (CERT) algorithm used by the FaceReader tool [
121] to classify expressions. Different classifier algorithms such as CERT and AFFDEX employ various facial datasets, FLs, and statistical models to train the ML system to perform the classification task [
24].
Affectiva’s AFFDEX software developer kit (SDK) [
128] is a commercially available real-time facial expression coding toolkit that can simultaneously recognize the expressions of several people and is available across different platforms (iOS, Windows, Android). The AFFDEX algorithm uses Viola-Jones [
192] for detecting a face and identifying 34 landmarks,
Histogram of Oriented Gradients (HOG) features to extract facial texture, and SVM classifiers to classify facial actions and, finally, to code seven facial expressions based on combinations of facial actions according to FACS [
24]. AffdexMe is the iOS-based AFFDEX SDK that enables developers to emotion-enable their own apps and digital experiences. The tests we performed on the trial version of this SDK show that the app can efficiently analyze and respond to seven basic emotions in real time.
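The classical detection-features-classifier pipeline described above can be sketched as follows with OpenCV, scikit-image, and scikit-learn; this is a generic illustration of that style of pipeline, not Affectiva’s implementation, and the training data shown are stand-ins.

    import cv2
    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import SVC

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")  # Viola-Jones

    def face_hog(gray_frame):
        """Detect the first face, crop and resize it, and return its HOG descriptor."""
        faces = detector.detectMultiScale(gray_frame, 1.1, 5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]
        crop = cv2.resize(gray_frame[y:y + h, x:x + w], (64, 64))
        return hog(crop, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

    # Train an SVM on pre-labeled crops (random stand-in data shown here).
    train_X = np.random.rand(50, 1764)            # 1764 = HOG length for these settings
    train_y = np.random.randint(0, 7, 50)         # seven expression classes
    clf = SVC(kernel="linear").fit(train_X, train_y)
    # At run time: descriptor = face_hog(gray_frame); clf.predict([descriptor])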
FaceReader [
23] is a commercially available automated expression analysis system developed by Noldus. It enables developers to integrate expression recognition software with eye tracking data and physiology data. This tool provides an assessment of seven expressions, head orientation, gaze direction, AUs, heart rate, valence and arousal, and person characteristics.
FaceReader’s algorithm uses the Viola-Jones algorithm [
192] to find a face and then builds a 3D face model from facial points and face texture. It then analyzes the face using DL methods and classifies the expressions using an artificial neural network (ANN). Studies show that FaceReader is more robust than AFFDEX [
173].
IntraFace is a software package developed by De la Torre et al. [
73] for automated facial feature tracking, head pose estimation, facial attribute recognition, and facial expression analysis. The package also includes an unsupervised technique for synchrony detection that supports discovering correlated facial behavior between two people.
IntraFace uses the Supervised Descent Method (SDM) to extract and track facial feature landmarks and to normalize the image with respect to scale and rotation [
73]. It then extracts HOG features at each landmark and applies a linear SVM to classify facial attributes. Finally, it uses the Selective Transfer Machine approach to classify facial expressions and AUs.
OpenFace 2.0 is an open source and cross-platform tool for facial behavior analysis released by the Multimodal Communication and Machine Learning Laboratory at Carnegie Mellon University in 2018 [
13]. OpenFace 2.0 is capable of performing facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation in real time [
46].
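OpenFace 2.0 writes its per-frame estimates to CSV; the short sketch below shows one way to pull AU intensity estimates out of that output with pandas. The column names (e.g., AU06_r for intensity, AU06_c for presence) follow the naming OpenFace has used in recent releases, and the file path is a placeholder, so treat the details as assumptions to check against your installed version.

    import pandas as pd

    # Placeholder path to a CSV produced by OpenFace 2.0's FeatureExtraction tool.
    df = pd.read_csv("processed/patient_video.csv")
    df.columns = df.columns.str.strip()        # older releases pad column names with spaces

    # "_r" columns hold AU intensity estimates (0-5); "_c" columns hold presence (0/1).
    au_intensity_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]

    # Mean intensity of AU06 (Cheek Raiser) and AU12 (Lip Corner Puller) across the video.
    print(df[["AU06_r", "AU12_r"]].mean())
    print(df[au_intensity_cols].max())         # peak intensity reached by each AU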
OpenFace 2.0 uses a newly developed Convolutional Experts Constrained Local Model [
206] and an optimized FFPD algorithm for facial landmark detection and tracking, which enables real-time performance [
46]. This approach also enables OpenFace 2.0 to cope with challenges such as non-frontal or occluded faces and low-illumination conditions. The tool can operate on recorded video files, image sequences, individual images, and real-time video from a webcam without any specialist hardware.
GANimation [
154,
155] is an anatomically aware facial synthesis method that automatically generates anatomically consistent facial expression movements from a single image. It allows users to control the magnitude of activation of each AU and to combine several AUs.
Latent-pose-reenactment [
61] uses latent pose descriptors for neural head reenactment. The system can take videos of an arbitrary person and map their expressions to generate realistic reenactments of arbitrary talking heads.
4 Facial Action Modeling for Synthesis
In many robotics and AI applications, in addition to recognizing FEs in people, we also need the ability to synthesize them on robotic and virtual characters. We discuss this further in Section
5; however, it is first important to discuss facial modeling.
Facial action modeling (FAM) builds a bridge between facial analysis (recognizing and tracking facial movements) and facial expression synthesis (translating modeled FEs onto an embodied face and animating its facial components) [
167]. Thus, technology developers need to incorporate two key ideas in the design of face models: (1) patterns that model the human face (e.g., shape, appearance), both in its neutral state and in the way facial movements (i.e., AUs) change to display different expressions, and (2) patterns of the temporal aspects of facial deformation (e.g., acceleration, peak, and amplitude).
The complexity of facial modeling can vary based on the
degrees of freedom (DoF) of the embodiment (e.g., a mechanical robot or virtual face). It is less complex to build face models for more machinelike robots with very simple faces, such as Jibo [
28], which has only a single eye with varying properties and details. The complexity of designing a face model increases as the face becomes more realistic and detailed. For robots with hyper-realistic faces (e.g., Charles [
161] and Geminoid HI-2 [
138]) and for humanlike computer-generated virtual faces (e.g., Furhat [
25]), developers need to design highly accurate models to engage in synthesis.
There are two groups of information processing strategies for face modeling: theory-driven modeling and data-driven modeling [
103].
4.1 Theory-driven Modeling Methods
Ekman and Friesen’s FACS theory [
78] describes facial movements by observing the effect of each facial muscle on facial appearance and decomposes the visible movements of the face into 46 AUs. Historically, many researchers adopted FACS theory for facial modeling and embedded FEs derived from this theory into their social robots [
56,
97]. In this approach, programmers selected a small set of (static) FEs (e.g., tightening and slightly raising the corner of the lip unilaterally to express contempt) [
79]. They then asked actors to contract \(k\) different combinations of muscle AUs to display the selected FEs, generating \(k\) different face images, and scored each face with FACS to verify the muscle AUs depicted in each image. Finally, they asked observers to select which image best mimics each specific FE, thereby identifying which combinations of muscle AUs signal each FE.
However, there are several challenges with the theory-driven modeling methods. For one, these models are based on FEs that precisely met criteria selected and specified by researchers [
79]. Moreover, since these models are based on static FEs, they lack dynamic data, including the temporal order of FE movements (e.g., acceleration, peak, and amplitude) [
103], resulting in less realistic facial models and, ultimately, less humanlike simulators. Furthermore, even in studies on cross-cultural FE analysis in which subjects pose culture-specific expressions, most subjects are identified as Westerners [
137], leading to less diverse face models. Finally, people with asymmetric facial expressions, such as those with facial paralysis or deformities, are rarely included, further limiting the diversity of facial models [
134]. As a result, expressive robotic and avatar faces developed using theory-driven modeling methods lack the ability to generate a wide range of FEs and are therefore unable to adequately communicate and interact with users.
4.2 Data-driven Modeling Methods
To address the gaps associated with theory-driven methods, researchers have proposed data-driven modeling methods (also called example-based deformation models) to computationally model (dynamic) FEs based on real data. Data-driven modeling methods usually consist of three main steps: data collection, facial expression and intensity data labeling, and facial expression model creation [
103].
4.2.1 Data Collection.
Data are generally collected in one of two ways: via recordings of human participants or through artificial data creation.
One way of collecting data is to capture videos of facial expressions of human subjects (e.g., via an actor or layperson performing facial movements, or use of existing datasets). In this method, a researcher can use any statistical analysis method or facial expression analysis software package (see Section
3.6) to derive a parametric representation of facial deformations and identify the AUs correlated to each frame of a video. For example, Wang et al. [
197] created a new FE dataset of over 200,000 images covering 119 people, 4 poses, and 54 expressions, which is large enough to evaluate the effects of unbalanced poses and expressions on the performance of FE tasks.
Another approach is to generate data artificially. In this method, developers usually use facial movement generators to randomly generate an enormous range of artificial dynamic facial expression videos. For example, Jack et al. use a facial movement generator that randomly selects a subset of AUs, assigns a random movement to each AU by setting random values for each temporal parameter, combines the randomly activated AUs, and finally projects them onto a robotic face to create random facial animation videos [
66].
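A simplified sketch of such a generator is shown below: it samples a random subset of AUs and random temporal parameters for each, producing a description that could drive a facial animation. The parameter names (onset, peak, offset, amplitude) and value ranges are illustrative assumptions, not those used in the cited work.

    import random

    ALL_AUS = [1, 2, 4, 5, 6, 7, 9, 10, 12, 15, 17, 20, 23, 25, 26]  # subset for illustration

    def random_facial_movement(max_aus=4, duration_s=2.0):
        """Sample a random dynamic facial movement: a few AUs, each with random
        temporal parameters describing when and how strongly it activates."""
        chosen = random.sample(ALL_AUS, k=random.randint(1, max_aus))
        movement = {}
        for au in chosen:
            onset = random.uniform(0.0, duration_s / 2)
            peak = random.uniform(onset, duration_s)
            offset = random.uniform(peak, duration_s)
            movement[au] = {
                "amplitude": random.uniform(0.2, 1.0),   # normalized AU intensity
                "onset_s": onset, "peak_s": peak, "offset_s": offset,
            }
        return movement

    # Each call yields one random animation description to render and show to observers.
    print(random_facial_movement())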
4.2.2 Facial Expression and Intensity Data Labeling.
Researchers have used different techniques for labeling the FEs and intensities correlated with each frame of video, including manual labeling by lay participants and by domain experts, and unsupervised data labeling via machine learning.
For instance, Jack et al. [
66] recruited participants to watch videos of facial expressions. If a projected video formed a pattern that correlated with the perceiver’s prior knowledge of one of six expressions, then the participant manually assigned a label identifying the expression and rated its intensity accordingly. Other researchers working on labeling FEs use domain experts (e.g., clinicians) to manually label data [
134]. Still other researchers develop facial expression datasets using semi-supervised or unsupervised techniques to label the data [
197].
4.2.3 Facial Expression Model Creation.
The next step is the learning phase, in which the system uses the shape and texture variations of many sample images in a dataset to build a face model and generate its appearance parameters. The parameters of the face model are reversible, meaning that they represent the shape and texture of all images in the dataset and can therefore regenerate realistic images similar to each of the learned samples. Thus, researchers can reverse-engineer specific dynamic FE patterns. This helps derive the unique patterns of correlated AUs activated over time that are associated with human perception of each expression. For example, Chen et al. [
66] developed their models by calculating, for each emotion, a 41-dimensional binary vector detailing which AUs are active, along with seven values detailing the temporal parameters of each AU.
Using these three steps, developers can learn and build mathematical models of the dynamic FEs within a video stream that make it possible to reconstruct these FEs on a robot or avatar’s face and animate them later [
66].
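A minimal sketch of how such a learned dynamic-expression model might be represented in code is given below, following the description above (a binary AU activation vector plus per-AU temporal parameters); the field names and the seven-parameter layout are illustrative assumptions rather than the exact representation in the cited work.

    from dataclasses import dataclass, field
    import numpy as np

    NUM_AUS = 41          # number of AUs tracked per emotion in the description above
    NUM_TEMPORAL = 7      # temporal parameters stored per AU (assumed layout)

    @dataclass
    class DynamicExpressionModel:
        """Learned model of one dynamic facial expression (e.g., 'pain')."""
        emotion: str
        au_active: np.ndarray = field(
            default_factory=lambda: np.zeros(NUM_AUS, dtype=bool))
        # One row of temporal parameters per AU, e.g., onset, apex, offset, amplitude, ...
        temporal_params: np.ndarray = field(
            default_factory=lambda: np.zeros((NUM_AUS, NUM_TEMPORAL)))

        def active_aus(self):
            """Indices of AUs that participate in this expression."""
            return list(np.flatnonzero(self.au_active))

    model = DynamicExpressionModel(emotion="pain")
    model.au_active[[5, 9]] = True            # illustrative: mark two AU slots as active
    print(model.active_aus())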
7 Future Research Directions
There are several opportunities to advance the state of the art of expressive RPS and VPS systems within the context of clinical learning, as well as in the broader context of robotics and HRI. These include technical advancements, such as new methods for FEA, FAM, and FSA, as well as socio-technical considerations, such as stakeholder-centered design and ethical questions. We briefly outline these below.
7.1 Advancing Expression Recognition and Synthesis Approaches
As discussed, there are many methods for recognizing and synthesizing facial expressions. However, they have drawbacks. Many commercially available systems are unable to perform all of the tasks necessary for FE analysis or synthesis (e.g., FaceReader does not provide head pose estimation). Furthermore, systems may lack state-of-the-art performance, rendering them impractical for clinical applications.
Thus, there are many opportunities to advance the state of the art. For example, some regression-based methods, such as CNNs, are successful for FL detection and tracking. Furthermore, Gabor features have shown promising results for feature extraction, and CNN and SVM methods have improved classification performance. Integrating these approaches into FEA and FSA systems may improve the analysis and/or synthesis of dynamic FEs in individuals with and without facial disorders.
7.2 Combining Domain Knowledge with Facial Model Development
As part of the design process, it is important to engage in stakeholder-centered design with CEs and CLs, as well as to conduct observations of live simulations. For example, neurologists can help validate whether the neurological impairment models created by the system are realistic and also ensure that the patient simulator’s appearance and expressiveness are well aligned with their clinical education goals.
7.3 Real-world, Spontaneous Data Collection
It is important for developers to release systems that are designed and built using enough real-world, spontaneous facial expression data [
94]. The number of facial expressions used to train and develop FEA, FAM, and FSA systems should be much higher to produce more realistic results. When only a small number of training images is available, it is challenging to choose the best approaches for enlarging the dataset during system development. Expressive robot developers also need to ensure the system includes a continuous adaptation process that learns each user’s expressions over time and adds them to its knowledge base [
94]. It is also important to pay close attention to the variability of the facial data by including subjects well represented across gender and ethnicity, as well as diversity in lighting, head position, and face resolution [
94]. Given that patient simulators are designed to mimic humans and are designed for use by humans, designs should also be informed by human sensory systems and behavioral outputs, as discussed earlier. Finally, it is important that datasets be labeled and analyzed in concert with domain experts, but to our knowledge little work has been done in this area. One potential solution is to create a large training set of photorealistic facial expressions generated using existing face generation platforms and labeled by human observers.
There are several existing facial expression datasets and Action Unit datasets that tackle some of the data collection challenges, including DISFA [
126], BP4D-spontaneous [
209], Aff-Wild 2 [
111], and SEWA DB [
113].
Furthermore, some of the recent facial expression synthesis methods, such as those mentioned in Section
5, are also intended to address these challenges. However, more work is needed in this field to tackle all of the aforementioned problems.
Moreover, newer directions seek learning models with little or no supervision, both for facial landmarks (unsupervised landmark detection) and for AUs, which can help address these challenges.
In terms of identifying databases of images or videos that reflect real facial expressions, it is important to consider the relationship between internal states and external facial cues. Work done by Benedek et al. [
50] indicates that people attend to the appearance of the face, especially the eyes of others, to understand both their external goals or actions and their internal thoughts and feelings. Voluntary facial expressions are sometimes made in the absence of corresponding internal states, and it is difficult to detect internal states when they are not expressed externally. Therefore, it is critical to identify datasets of real data to better infer external facial cues and more accurately interpret internal states.
It is also worth mentioning the potential for a pattern of confusions (false alarms and misses) in detected facial expressions. False alarms are errors in which a facial expression is reported as present when it was absent; misses are errors in which a facial expression is reported as absent when it was present. Studies indicate that this pattern of confusion worsens when other challenges occur at the same time, such as poor illumination or occlusion in an image [
153].
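For a binary “expression present/absent” detector, false alarms and misses can be read directly off a confusion matrix, as in the short sketch below (the example labels are made up).

    from sklearn.metrics import confusion_matrix

    # 1 = expression present, 0 = expression absent (made-up ground truth and predictions).
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    false_alarm_rate = fp / (fp + tn)   # predicted present when actually absent
    miss_rate = fn / (fn + tp)          # predicted absent when actually present
    print(f"false alarm rate = {false_alarm_rate:.2f}, miss rate = {miss_rate:.2f}")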
7.4 Cultural Considerations
Researchers have also explored the caveats associated with cultural variance in the way observers infer internal experiences from external displays of facial expressions. For example, Engelmann et al. [
82] argue that culture influences expression perception in several ways. For one, people from different cultures may perceive the intensity of external facial expressions differently; for example, American participants rated the intensity of the same expressions of happiness, sadness, and surprise higher than Japanese participants did. Moreover, depending on cultural context, people differ in how they infer internal states from external facial cues. For example, in one experiment, two groups of American and Japanese participants rated the intensity of the internal and external states of a person expressing certain emotions: American participants gave higher ratings to external facial cues of emotion, while Japanese participants gave higher ratings to internal emotional states. Therefore, it is important to consider these cross-cultural differences when inferring internal states from external expressions.
7.5 Generating Universal Models for Various Pathologies
To represent all patients with a specific pathology, one can create a universal model for each pathology that encompasses its predominant features. This can be done by leveraging our previous findings in Section
6.3 to further extend the FPM framework in two directions: (1) extend the framework to encompass the predominant features of a specific pathology (e.g., stroke) and (2) transform the framework from an individual mask generator into a universal model generator. This can be done by using enough source videos of people with the specific pathology, extracting its common features, and creating a general model (see Figure
9). By leveraging this work, CLs will have the potential to more accurately diagnose people from diverse backgrounds and to be better able to interact with them.
7.6 Sharing Autonomy between Users and Expressive Robots
Considering how to share autonomy between a human and a robot is an important aspect of ensuring effective HRI [
156]. It can help to reduce an operator’s workload and allow both inexperienced and professional operators to control the system [
86,
156].
As such, it is important to focus on interaction between the control system and human users in the context of expressive simulator systems. Thus, researchers can design and validate a customizable, shared autonomy system for expressive RPS systems to leverage the advantages of automation while also having users as “active supervisors.” For example, in our work, we are designing a shared autonomy system that can support a range of adjustable control modalities, including direct tele-operation (e.g., puppeteering), pre-recorded modes (e.g., hemifacial paralysis during a stroke), and reactive modes (e.g., wincing in pain given certain physiological signals) [
151]. Such a system can also help overcome common control challenges, including operator overload, high workload, and the lack of autonomy in robotic simulator systems. It can make robots adjustable to different control paradigms so that they reliably support CEs’ workload in dynamic, safety-critical settings and improve the operator’s ability to focus on educational goals rather than on robot control.
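As an illustration of what such adjustable control modalities might look like in software (a hypothetical sketch, not the system described in [151]), the code below defines the three modes named above and a controller that dispatches AU commands differently in each.

    from enum import Enum, auto

    class ControlMode(Enum):
        TELEOPERATION = auto()   # direct puppeteering by the operator
        PRERECORDED = auto()     # play back a stored sequence (e.g., hemifacial paralysis)
        REACTIVE = auto()        # respond to simulated physiological signals (e.g., pain)

    class ExpressionController:
        """Hypothetical shared-autonomy controller for an expressive RPS face."""
        def __init__(self, prerecorded_sequence, pain_threshold=0.7):
            self.mode = ControlMode.TELEOPERATION
            self.sequence = list(prerecorded_sequence)
            self.pain_threshold = pain_threshold

        def next_au_command(self, operator_input=None, physiology=None, t=0):
            """Return a dict of AU activations for the current time step."""
            if self.mode is ControlMode.TELEOPERATION and operator_input is not None:
                return operator_input                       # pass through operator's AUs
            if self.mode is ControlMode.PRERECORDED:
                return self.sequence[t % len(self.sequence)]
            if self.mode is ControlMode.REACTIVE and physiology is not None:
                if physiology.get("pain", 0.0) > self.pain_threshold:
                    return {4: 1.0, 6: 0.8, 9: 0.6}         # illustrative wince (AU4/6/9)
            return {}                                        # neutral face by default

    controller = ExpressionController(prerecorded_sequence=[{12: 0.5}, {12: 0.0}])
    controller.mode = ControlMode.REACTIVE
    print(controller.next_au_command(physiology={"pain": 0.9}))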