research-article

Open access

Multiple Roles of Multimodality Among Interacting Agents

Authors:

Erik Lagerstedt,

Serge ThillAuthors Info & Claims

ACM Transactions on Human-Robot Interaction, Volume 12, Issue 2

Article No.: 17, Pages 1 - 13

https://doi.org/10.1145/3549955

Published: 15 March 2023 Publication History

All formats PDF

Abstract

The term multimodality has come to take on several somewhat different meanings depending on the underlying theoretical paradigms and traditions along with the purpose and context of use. The term is closely related to embodiment, which, in turn, is also used in several different ways. In this article, we elaborate on this connection and propose that a pragmatic and pluralistic stance is appropriate for multimodality. We further propose a distinction between first- and second-order effects of multimodality—what is achieved by multiple modalities in isolation and the opportunities that emerge when several modalities are entangled. This highlights questions regarding ways to cluster or interchange different modalities, for example, through redundancy or degeneracy. Apart from discussing multimodality with respect to an individual agent, we further look to more distributed agents and situations in which social aspects become relevant.

In robotics, understanding the various uses and interpretations of these terms can prevent miscommunication when designing robots as well as increase awareness of the underlying theoretical concepts. Given the complexity of the different ways in which multimodality is relevant in social robotics, this can provide the basis for negotiating appropriate meanings of the term on a case-by-case basis.

1 Introduction

Machines designed to respond to the world around them—for example, social robots—need to access the world through at least one, but often multiple, modalities. This is sometimes referred to as the machines being structurally coupled with the environment [83]. Such a coupling between an agent and the environment is the minimal requirement for embodiment [83], bearing in mind that the term embodied is routinely used to refer to several different concepts [12, 65, 78, 83]. Since creating machines such as social robots is an interdisciplinary endeavour, discrepancies in terminology, assumptions, perspectives, and procedures are to be expected based on variations between fields. Although these differences are often a positive aspect of interdisciplinary collaboration since they can broaden perspectives, they can also interfere negatively if they go unnoticed. In this instance, for example, the term structural coupling, while originally proposed to denote a fundamental concept within enactivist theory of cognition [43], has since also been used in a broader sense in other branches of cognitive science [55] and has further been adapted for use in sociology [42]. Additionally, the same term is used for a different concept (a measure of the interdependence of different modules of programming code) in software engineering (e.g., [2]). Since it is realistic to expect that cognitive scientists, sociologists, and software engineers can collaborate in an interdisciplinary endeavour such as social robotics, it is plausible that situations arise in which a term such as structural coupling is used appropriately by the different experts yet is still (mis)understood as referring to different concepts or contexts. Sources of potential confusion arising from overloaded terms in social robotics have been pointed out by others as well, for instance, “autonomy” [66, 78, 84], or the aforementioned embodied.

In this article, we are particularly concerned with such discrepancies in the use of the term multimodal with respect to its uses in different theories of embodiment. Awareness of the different meanings of such terms does not only facilitate communication with, and within, design teams; it can also make underlying assumptions and theoretical roots more salient and explicit. This, in turn, can help in making fairer judgements regarding capabilities, facilitating smoother and more appropriate interactions [20, 40].

There continues to be significant progress in technical developments relevant to robots, including increasingly sophisticated sensors, actuators, and processors. Not only are the well-established modalities—such as vision—improved, but modalities that have previously received less attention—such as tactility and haptics—are increasingly common [63]. However, to make the most of the possibilities that these new sensors offer, it is necessary to also be clear about their purpose and how that purpose is achieved. Here, the cognitive sciences in general, and the studies of artificial cognitive systems in particular, can be helpful. For example, cognitive scientists such as Chemero [12] distinguish between embodied cognition—grounded in a computationalist perspective—and radically embodied cognition—grounded in ecological psychology. This difference in perspective leads to different purposes being attributed to the body [71]. Consequently, the relevance of the various aspects of a body will also vary depending on the perspective taken. How specific issues are understood depends on how the corresponding situation and its components are framed. This, in turn, affects the solution space to the specific problem at hand. It is the particular interest of this article that these perspectives on embodiment lead to different notions of what type of sensors to use and what their overall function with respect to the agent is and, thus, to different notions of what multimodal can be meaningfully taken to mean.

We develop this further in the rest of the article, starting with a discussion of embodiment in relation to multimodality in Section 2. We will then expand the argument to a multi-agent context, including social aspects, in Section 3. The overall take-home message is that multimodality is a very complex concept that can be grounded in many different theories and relevant in many domains. We also provide some indication of how these variations can take shape. Our stance is that the benefit of the different interpretations varies based on purpose and context. Therefore, being aware of different interpretations and evaluating them on a case-by-case basis would be the recommended approach to the concept.

2 Embodiment and Multimodality

Historically, computationalist perspectives have been prominent in cognitive science, and von Neumann architectures have commonly been used as a model for cognitive systems [3, ch. 7]. The symbolic manipulations in such a modular system have been convenient in their compatibility with reductionism and the pragmatic benefits of the possibility to create complex yet manageable systems. However, despite the progress made using this paradigm, an increasing awareness of its shortcomings resulted in a rise in popularity (see [38]) of so-called multi-e (sometimes also 4E) perspectives [48], which we will elaborate more on in Section 2.2. A crucial difference between these paradigms is that they each prioritise their own sets of fundamental challenges to be addressed, which, in turn, biases the concrete problems that will be tackled.

Given the diversity of contexts in which multimodality is relevant and the complexity that this often leads to, it also makes sense to approach related issues from a diverse range of perspectives [45]. Instead of attempting to identify “the one correct theory” and the “true” meaning of each term, it is more pragmatic to acknowledge the values of the various perspectives, and negotiate meaning for each particular case. We start by briefly introducing the general perspectives on cognition and their relations to (multi-)modalities for an agent, and then expand to a multi-agent context.

2.1 Cognitivism

The cognitivist perspective, strongly associated with computational functionalism, is based on the metaphor that the brain is like a computer in the sense that it is a centralised unit performing symbolic manipulations [78, ch. 2]. Given some input, it produces output. This view of cognition is naturally very compatible with work in robotics. One could, in fact, consider a human to be a “meat robot” whose purpose it is to maintain, protect, and transport the “meat computer”, that is, the brain [14]. Further, even though much work has been done on studying the “software of the mind”, there are also embodied perspectives that have received some attention.

From this perspective, to a large extent, the body is reduced to the function of vehicle and/or an interface between the mind and the surrounding world [12, 71]. It holds sensors that provide input to the computer that, in turn, computes appropriate signals to send to the actuators of the body. Although of limited use as a model of cognition, these architectures continue to have a practical relevance, for instance, when creating machines for controlled environments and limited interactions. An example of typical uses of such robots is along assembly lines in the manufacturing domain, where the machines are placed in cages to prevent interference. In these well-controlled situations, the reductionist and modular properties are particularly desirable. In general, though, it is relevant to consider frameworks that go beyond such paradigms both to facilitate interaction with humans and to facilitate informed decisions regarding appropriate design for the robots. In robotics, for example, this was a driver for a paradigm shift towards reactive robots and subsumption architectures in the early 1990s [9].

2.2 Multi-e Perspectives

Sensing and acting can be characterised as being fundamentally intertwined in a sensorimotor system [13, 27]. This is central for many of the multi-e perspectives; an umbrella term for perspectives that in various ways emphasise and combine embedded, embodied (in the radical sense, see [12]), extended, enactive, ecological, and similar perspectives of cognitive science.

The diversity in scope and method is vast, both between and within these perspectives, to the point at which it can be challenging, if at all possible, to reconcile them, although they are at least united in their opposition of cognitivism [1]. They also tend to have a stronger foundation in biology and a more holistic and situated view on the mind, body, and environment. Senses associated with the body (interoception) and senses associated with things in the environment (exteroception) are both central and fundamentally entangled when considering perception [65]. Since affects are generally associated with interoception, they are intrinsically part of cognition from the multi-e perspectives. The fundamental integration between agent and environment also means that the meaningful aspects of the environment are determined by the properties of the agent. Not only will the sensitivities of the various sensors determine what is accessible to perceive but so will the needs, moods, and other affects of the agent. This perceivable “filtered” world is called the Umwelt of the agent [80]. While the term Umwelt is sometimes simply used for the subset of the physical world that is accessible for an individual, it was originally intended to refer to the subjective experience of that subset of the world [22]. The simplified interpretation is very compatible with cognitivist perspectives since the physical attributes of the body and actuators, and the ranges in sensitivity of the sensors, would together select the aspects of the external world to be represented internally. The originally intended interpretation would require some of the more radical understandings of multi-e cognition, in which the agent is thoroughly and historically integrated in the world and in the process of mutual perturbation constructs its experienced world.

From this perspective, it is less obvious how to apply understanding of cognition to robots. Creating artificial cognitive systems of this kind is arguably possible but puts very large demands on the machine [85] as they would need to be organismically embodied [83]. Conventional robots, however, typically lack the thoroughly integrated metabolism and homeostasis necessary for autonomy¹ in the strong sense required for organismic embodiment. From an enactive point of view (see, e.g., [67, 77]), cognitive phenomena emerge in complex systems under the right conditions, making it difficult to design robots displaying such phenomena. Identifying the right kind of complexity is strongly associated with integrating the right modalities in the right way; modifying the set of modalities will fundamentally alter the agent’s Umwelt, and the consequences for the system will be difficult to predict. Combining this with other requirements (such as cost, reliability, or strength) makes it particularly difficult. On the other hand, these problems make robots an excellent tool for studying cognition synthetically [54].

2.3 Modalities

We suggest considering the contribution of individual modalities to perception in terms of first- and second-order effects. First-order effects refer to those of individual modalities: photoreceptors make light detectable and microphones make sounds detectable (though sensitive at specific ranges). Thus, multimodality in the sense of using several sensors that individually provide input to the system can be seen as a collection of first-order modalities. However, access to several first-order modalities might provide new opportunities for perception, which are second-order effects of the modalities. For example, the synthesis of different sources can lead to richer perception, similar to a kind of triangulation (see [17]) in which independent clues can provide sufficient information from different perspectives to allow for conclusions that would be out of reach when only using one source. Triangulation is traditionally used as a technique for navigation, where two known landmarks can be used to infer one’s position (this position will together with the two landmarks form the vertices of a triangle). However, in this metaphorical use, the vertices do not have to be locations in physical space. A different kind of opportunity afforded by multimodality is that one modality can be used to validate information from another. For example, this could be done in a Popperian way, in which one modality is used to form a hypothesis testable through another [82]. A hypothesis regarding the existence of objects might be formed based on observations from a visual modality but tested through tactile means. In a similar manner, but from a different theoretical perspective, the inference between several signals could result in the confirmation or rejection of a prediction [53].

This leads to questions regarding how, and under what circumstances, various modalities complement each other. For example, are there general rules regarding what modalities are interchangeable and/or is robustness a positive second-order effect in itself? Is it meaningful to consider different degrees of overlap between modalities and how does that change over time? Do certain classes of modalities typically result in certain synergies? These remain open questions and relate to the ubiquitous property of “degeneracy” in biological systems [19], which refers to structurally different elements fulfilling the same function. The point of this article is not to resolve these questions—this is out of scope for any single article—but rather to introduce the questions in relation to relevant theoretical perspectives to help make informed and explicit decisions when studying, using, and communicating about multimodality. Part of the reason it is challenging to address these questions lies in the large amount of potential modalities (leading to an enormous amount of potential combinations) and the large variety of potential functions the modalities might have. Simply having redundant modalities can be beneficial in the sense that there are backup sensors if any break, whereas degenerate modalities can be beneficial in the sense that some function can be fulfilled even if the circumstances change. For example, a robot detecting obstacles using two (redundant) cameras could still operate if one camera breaks but not if the light goes out. A robot relying on a camera and a LIDAR scanner (degenerate) would be able to handle both cases.

Another complicating factor is the general role of, and relations between, modalities—which is strongly influenced by how they are embodied. For instance, a fundamental strategy in biological systems is active perception (see, e.g., [26]), where top-down processes based on the animal’s goals and desires guide their attention. Therefore, the relations between the modalities are integral to this particular kind of embodiment. Gaze might be drawn to faces in a crowd to recognise familiar individuals. Another way to actively perceive is to cause perturbations to reveal information; pushing an object can reveal the object’s weight and structure, and squeezing a surface can reveal properties of the material. For these reasons, action and perception are inseparable when considering animals’ relations to their surroundings.

2.4 Sensorimotor Systems

The example of testing visual cues through a tactile modality can be considered a primitive foundation for a sensorimotor system: when an action is performed, a resulting sensation is expected to follow. For example, moving an arm would lead to a change in proprioceptive sensation but potentially also visual sensations (if moved in view), tactile sensations (if touching something), and so on (see, e.g., [27]). The way changes in sensations in the different modalities are produced by different motor actions are sometimes referred to as sensorimotor contingencies [49]. Through a history of interacting with the world, the agent will experience a continuous flow of such multimodal sensorimotor information without any obvious segmentation, which can provide the basis of expectations through a kind of Hebbian learning [29].

The expectations of such continuous and active cycles are not only shaped by the morphology of the perceiver, but aspects such as the agent’s past experiences, mood, and current needs determines the salience of parts of the Umwelt. This can be achieved through a hierarchical structure of the sensorimotor system, allowing for top-down and bottom-up phenomena [82]. Although this could be understood in a cognitivist context as filtering the input based on the value of some state variable, the more thoroughly integrated sensorimotor system of many multi-e perspectives allows for the deeper interpretation of Umwelt where the experienced world is actively constructed. Features noticeable in this sense—although, mainly in terms of what function the perceived aspect has for the perceiver—are called functional tones [80] and share some similarity with social affordances from ecological psychology (see, e.g., [76]). These strongly depend on the circumstances of each situation and are constantly changing. As such, there is a strong connection between what functional tones are perceived by an animal and how that animal interacts with the world: the history, experience, and expectation on interaction with objects are in a way projected onto the environment by the animal. For example, to a hermit crab deprived of either food or its shell, a sea anemone will assume a “feeding tone” or a “dwelling tone”, respectively [80]. As an example from robotics, it is possible to imagine a human (as perceived by the robot) shifting tone from something to assist to something that could guide the robot back to a charging station as the robot’s battery is getting depleted (see, e.g., [47]).

Since the sensorimotor system predicts the results of the actions as the actions are prepared and performed, it is thought to be able to serve as an internal and embodied predictive mechanism—or simulation—for evaluating whether specific actions are worth executing [46, 68]. These simulations would consist of reactivation of modality-specific information from which the simulated actions and perceptions can follow in chains of consequences [31]. At a neural level, the sensations and actions involve the same systems, and simulate actions without necessarily involving actual activation of muscles. Therefore, the experience of an interaction can be anticipated, not only the consequences.

3 Multi-agent Systems

When discussing perception and agents, two aspects are of particular relevance: how an agent perceives in general and how other agents fit into that model of perception. We have thus far mainly discussed perception and embodiment from an individual perspective, addressing the first aspect. However, when the environment partly consists of other agents, it is important to understand how they can be perceived and understood. These aspects are not only important for cognitive scientists or for people developing service robots, but have also recently gained attention in industrial robotics with the notion of “Industry 4.0”, according to which humans are to be integrated into the production line rather than being separated by physical barricades (see, e.g., [35]). We will focus on the simpler cases of monolithic agents, but will start by mentioning some alternatives that are increasingly relevant and interesting to study in their own right, as they blur the line between “self” and “other” to some extent. We will then briefly introduce social interactions, perception of acting agents, and attribution of properties. All of these aspects are important when attempting to understand an environment that includes other agents and, therefore, will affect how such a world is perceived.

3.1 Swarms and Multiembodiment

Before diving into issues related to interactions between agents, it is important to acknowledge situations in which an agent is arguably distributed in different physical bodies. There are several cases that could fall into this category, some of which have emerged with artificial agents (in which case they would often be discussed in terms of “cyber-physical systems”; see, e.g., [64]). Swarms and packs are examples of groups of individuals that can behave as if the individuals are extensions of a collective mind. Some aspects of the swarm can be explained to a large extent by a small number of simple rules from which the complex behaviours emerge without any explicit decision (e.g., the murmuration of starlings [11]), and there are situations in which decision-making is best understood as a distributed process throughout a network of specialists and artefacts (see, e.g., [33]). In these cases, the behaviours of the individuals based on their local understanding gives rise to global behaviours. This may be aided more or less by explicit communication. In the case of artificial agents, it is possible to imagine a computer controlling robots feeding back sensory information.

As an example, let us consider a case in which a computer, using a WiFi connection, is controlling two mobile units, one equipped with a camera and the other equipped with a microphone. Is the computer sensing the world mono-modally through the WiFi signal or is it more appropriate to consider the robots body parts of the computer that just happen to not be physically attached? What is the relation between the robots and their sensors? In addition to the questions regarding an agent with detached body parts (with more or less autonomy), and interaction with robot swarms (see, e.g., [36]), there are also questions related to what it means to have an agent moving between physical instances [81]. As an example of such multiembodiment, an Intelligent Personal Assistant that normally would constitute the interface of a smart home might be loaded into a service robot at an airport as a kind of personalisation.

Questions regarding what constitutes an agent and how to develop clear demarcations for this might seem to be mainly of philosophical interest at first glance, but the social aspects of interacting with a more distributed agent is relevant for a wider range of subjects. What constitutes an agent in a given situation and what information that agent has access to will determine how it and its behaviour is understood, which, in turn, will shape the interaction with the agent [70]. Conveying what modalities are available to a specific distributed agent might be challenging but also something critical to address when creating agents and systems to interact with or be integrated into. For the sake of brevity, we simply highlight these issues here and will focus on the more common (current) cases of interaction between monolithic agents in the remainder of this article.

3.2 Social Robots

Social robots are loosely defined as “socially responsive creatures that cooperate with us as partners” [8], where cooperative activity requires mutual responsiveness, commitment to the joint activity, and commitment to mutual support [7]. Cooperation can take different forms, for example, dyadic, triadic, and collaborative engagements [74]. Dyadic engagement simply consists of acting together (in a mutually responsive manner) with another agent, whereas triadic engagement also requires the cooperating agents to monitor each other to facilitate progress towards a shared goal. For collaborative engagements, a deeper cooperation and interaction is necessary, where the agents can reverse roles and assist each other based on the intentions and sub-goals of the other agent. From an enactive perspective, the interacting agent (individual dynamic systems) join together making a new dynamic system, even in the case of dyadic interaction, and social interaction can be found when such a system emerges and maintains itself without destroying the autonomy of the individual agents involved [15, 23]. What role the agents can play in the interaction—and, thus, the locus of responsibility for the interaction—will depend on the agents’ respective capabilities [39].

Some of the necessary abilities for (social) interaction can be understood in a hierarchical manner [32]. They range from basic adaptive feedback control to systems that can not only monitor and adapt their own behaviour, imitate others, and imagine different outcomes, but also imagine things despite counterfactual input. Humans are able to reach the highest levels of this hierarchy [32], resulting in such aptitude for social interactions that even various kinds of technological artefacts inspire social behaviour [56]. This ability can be seen as a willingness to suspend disbelief and act by ascribing social abilities that machines lack [86]. To facilitate this mechanism, the robots need to display rich social behaviour [8]. Displaying properties and abilities in ways that are compatible with the modalities of the intended perceiver is important to, in other words, become salient in the perceiver’s Umwelt.

The open questions raised in Section 2.3 regarding redundancy and degeneration through different modalities are also relevant in the social context, although social queues might have their own requirements. As an example, Torre et al. [75] found that the emotional arousal and valence of a smiling virtual avatar can be conveyed differently in different modalities. Both voice and facial expression can be used as a predictor for arousal. However, of the two, it was only facial expressions that reliably conveyed the appropriate valence. In the case of a mismatch, the visual channel of the facial expression took priority. This can be seen as first-order multimodality, in which the individual modalities function in isolation and one is prioritised in the case of a conflict. It can, however, be difficult to capture some of the contextual and dynamic aspects of interaction in shorter experiments. For example, expectations of abilities change based on experience, and interaction dynamics can change with more interaction [20]. This is true, in particular, for interactions with higher demands on coordination, such as triadic and collaborative engagements. In those cases, it is not enough to understand and anticipate the other agent in isolation. The interaction is necessarily understood in relation to something beyond the interacting agents. When interacting with a robot, the human not only anticipates the abilities and impact of oneself, the robot, and the situation but also anticipates intersubjective phenomena (see, e.g., [15]). Therefore, simply considering interactions as isolated events that can either succeed or fail might have only limited use. It is important to be careful when drawing general conclusions regarding the role of various modalities based on studies in which modalities are artificially separated. This distinction between benefits of controlled reductionism and situated holism mirrors that of the perspectives on cognitive science. These events are happening in a particular context; thus, understanding factors such as how the agents attempt to recover from breaches of trust is also necessary [28]. To address unexpected events, alternative explanations can be searched for, and having alternative modalities to use for facilitation of dynamics could be seen as a useful second-order multimodality. An example of when mismatching monomodal signals are used intentionally in social interaction is the use of irony, which has also been studied with human interaction with robots [58]. In the case of irony, the specific combination of mismatching signals in different modalities creates a new meaning that cannot be understood simply by identifying the signal in the correct modality, making it a case of second-order multimodality.

3.3 Perceiving Acting Agents

In social interaction, even the most basic dyadic encounters require mutual responsiveness. Thus, it is not sufficient to simply identify the presence of others; their actions also need to be recognised. Expanding simulations of the sensorimotor system to not only consider one’s own actions could provide a solution for that problem [68] and potentially even help with deriving intentions, which is necessary for triadic and collaborative engagements. In this context, mirror neurons [59] have historically received some interest since they not only fire when an action is performed by the owner of the cells but also when perceiving someone else performing the action [18], thus, linking embodied simulations and other agents [24]. Mirror neurons have been found to activate when perceiving several kinds of agents performing actions, among them humans, monkeys, dogs, and robots [10, 25], which is of particular relevance in social interaction with robots [4, 61, 62]. The ability to experience the other’s situation is quite useful when attempting to understand someone else’s intentions [34]. However, the activation of the mirror neurons is independent of the context of the action, meaning that the mirror neurons are not by themselves sufficient for reading intention or mind reading [6].

Since the social environment is a fundamental part of the Umwelt, other agents can fulfil social purposes for the perceiver and, thus, assume corresponding functional tones based on experiences, needs, and affects of the perceiver. Another agent can be seen in terms of what the agent can and will do or in terms of how the agent can further the perceiver’s goal. Perception in terms of how other agents can help is related to tool use in a more specific problem-solving context and could arguably contain both dyadic and triadic interactions. The way a robot stands out to a human perceiver depends on what the human is attempting to achieve or solve. If the human has a specific need to fulfil, the robot might assume a “tool” tone, whereas it might be seen more as a social partner in more interactive cases [73]. Since the functional tones are based on expectations, there are no guarantees that agents perform according to the functional tones they assume. Social tones of agents or the environment might fade either because the source of the tone satisfied the needs of the perceiver or because it failed to live up to the expectations. This could lead to some interesting dynamic second-order effects, in which agents’ tones fade in some modalities and emerge in different modalities based on changes in needs and expectations of the perceiver.

Social engagements can be seen as a kind of entanglement of the interacting agents’ Umwelten, in which they are mutually acting upon and perceiving each other [52]. This can be mediated by more or less intentional or directed signalling (e.g., [57]). Understanding the modalities of such signals is important for designing or using sociable robots in general and necessary for explainable robots (e.g., [51]). The modalities used by the interacting agents need to be compatible to make it possible for them to be part of each other’s Umwelten.

3.4 Ascribed or Actual Abilities

There are not any absolute requirements for being perceived as an agent since the “truth” is in the eye of the beholder in this case. The ascription of mental states to machines, for example, has been argued to be “useful when the ascription helps us understand the structure of the machine, its past or future behavior, or how to repair or improve it” [44]. Ascribing intentions to robots has been studied in particular as facilitating social interaction (e.g., [50, 69]). Such ascribed or interpreted abilities are not necessarily in line with the actual abilities of agents. For example, Heider and Simmel [30] have shown that even basic animated geometrical shapes can be ascribed to mental states. More recently, Thill et al. [72] showed that drivers of automated vehicles adapt their behaviour based on the perceived, rather than the actual, abilities of the vehicle. Similar results have been found for interaction with computers [5] and humanoid robots [79].

Since the creation of robots is a fairly recent phenomenon, there is not much experience for most perceivers of such tools to fall back on. It has been proposed that there are three different stances that humans can assume when attempting to understand the behaviours of entities in the environment: the physical stance, the design stance, and the intentional stance [16]. With the physical stance, the behaviours are understood through the physical laws that govern the system (e.g., the swinging of a pendulum). With the design stance, the behaviour can be understood through knowledge of the purpose of a designed system. However, some systems are too complex for those strategies to help. In those cases, it might be appropriate to assume the intentional stance—in which attitudes, beliefs, and intentions are ascribed to the system—and, in a sense, anthropomorphise it. In this context, anthropomorphism—“when a person relies on egocentric or anthropocentric knowledge to guide reasoning about nonhuman agents” [21]—indicates that we will fall back on perception mechanisms developed over evolutionary time scales when interacting with robots. However, this will change as humans acquire more experience with robots (see, e.g., [20, 40]). To facilitate social interaction with a robot, the human can be encouraged to anticipate social interaction to be possible. However, if a robot cannot fulfil expectations and lacks the ability to renegotiate a “broken” relationship with a user, the robot might fall into disuse [28]. The issue of disuse (or the related problem of misuse) is currently a topical subject in the human-machine interaction literature (e.g., [20, 60]). Interestingly, it appears that expectations regarding the function of a robot are of greater importance than expectations related to social attributes [41] in this context. It is proposed that this is a result of robots often being portrayed to have cognitive capabilities comparable with humans but much lower social capabilities [37], which then transfers to everyday expectations.

4 Conclusions

In this article, we have argued that multimodality is a term that can mean different things in different fields, but also that it is a term often used in multidisciplinary and interdisciplinary contexts. That combination can lead to misunderstandings. We argue that a pragmatic and pluralistic approach is appropriate in which the meaning of terms are explicitly negotiated among collaborators for each specific case while acknowledging the value of differences in definitions in different fields.

We have also identified how different perspectives on embodiment are closely related to what multimodality can be taken to mean. For instance, different paradigms within cognitive science have different perspectives on a body’s role in cognition, which, in turn, affect the views on sensors and actuators in relation to cognition. In addition, we indicated how the distinction between a monolithic body, a distributed body, and different collaborating agents can sometimes be unclear, blurring the line between a sensorimotor system of an agent and communication among agents.

A third aspect influencing the meaning and role of multimodality is the particular instantiation. In relation to this aspect, we have lifted some open questions to explore further in the future. One set of such questions is regarding redundancy and degeneracy, that is, how do different modalities intersect or complement each other? In this regard, it is particularly relevant to consider both first- and second-order effects of multimodality: are the different modalities working in isolation or are they interacting in ways that allow for more complex perception?

Different purposes and contexts provide different demands on the modalities and embodiment of the involved. For some robots, it is not necessary to possess advanced cognitive abilities; simpler structural couplings might be sufficient. However, social interactions have particularly high demands, including dynamic aspects, and consequently require a stronger coupling than can be provided by first-order multimodality effects.

During the design of a robot, sensory design decisions may explicitly need to be conceived as primarily a first- or second-order problem. This determines how the sensory design is addressed: a first-order problem can be dealt with by adding more sensors; a second-order problem will require new synergies between sensors. Thus, interactions between modalities are also relevant when evaluating the sufficiency and necessity of a sensorimotor setup with respect to its intended purpose. In addition, it is important to consider how the robot is perceived by others to facilitate that it assumes appropriate functional tones. While these remain unresolved issues, we have argued in this article that a starting point consists of characterising different types of multimodality, thereby understanding the open design problems in an appropriate context.

Footnote

Note that, in robotics, the term autonomy is also used in a weaker sense to refer to the ability of a robot to perform its tasks without direct control or supervision by a human. In this sense, the concept is closely related to automation and often desired. The stronger sense in which the term is used would require the robot to have the ability of refusing instructions based on some self agenda, which, in many circumstances, might be a less desirable property.

References

[1]

Ken Aizawa. 2018. Critical note: So, what again is 4E cognition? In The Oxford Handbook of 4E Cognition, Albert Newen, Leon De Brun, and Shaun Gallager (Eds.). Oxford University Press Oxford, Oxford, UK, Chapter 6, 117–1126.

Abstract

1 Introduction

2 Embodiment and Multimodality

2.1 Cognitivism

2.2 Multi-e Perspectives

2.3 Modalities

2.4 Sensorimotor Systems

3 Multi-agent Systems

3.1 Swarms and Multiembodiment

3.2 Social Robots

3.3 Perceiving Acting Agents

3.4 Ascribed or Actual Abilities

4 Conclusions

Footnote

References

Index Terms

Recommendations

Embodiment and multimodality

A Methodology for Evaluating Multimodal Referring Expression Generation for Embodied Virtual Agents

Click if You Want to Speak: Reframing CA for Research into Multimodal Conversations in Online Learning

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

HTML Format

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations