1 Introduction
Does a person who observes or interacts with a robot think it is making its own decisions, or that it has been programmed for that exact situation? These are questions of perceived agency, and they have implications for design [46], law [4], interaction studies [77], philosophy [10], morality [5], and social psychology [32]. While agency has been well studied, there are many overlapping concepts—animacy, mind perception, anthropomorphism, intentionality, and others—that are similar to but distinct from perceived agency. Animacy in robotics focuses on making the robot lifelike, frequently through how the robot moves [6]. Anthropomorphism concerns attributing human characteristics or behavior to a robot or other non-human entity [42]. Mind perception is concerned with how people conceptualize others’ minds—the number of dimensions and what those dimensions are [31, 55, 90]. Intentionality is how deliberately and goal-directed a robot acts and is frequently associated with perceived agency [75]. In this report, we focus on perceived agency.
In their seminal work from 2007, Gray et al. [31] explored how people think about other people’s minds. Specifically, they were interested in how many dimensions people thought others’ minds consisted of. They found, contrary to the prevailing beliefs of the time, that people conceptualized others’ minds along two dimensions: experience (the extent to which an entity is capable of being hungry, feeling rage, desire, pleasure, pain, etc.) and agency (the extent to which an entity is capable of recognizing emotions, having self-control, planning, communication, morality, and thought).
Gray et al. [31] used Principal Component Analysis (PCA) to analyze their data. PCA and factor analysis are statistical approaches that reduce the dimensionality of large datasets. For example, Gray et al. found that when individuals answered questions like “Which one do you think is more capable of feeling hungry, a robot or a 5-year-old girl?” on a 5-point Likert scale, some capabilities were answered similarly (e.g., feeling hungry and feeling pain received comparable scores across a range of entities). Some sets of capabilities were associated with one another more strongly than with other sets. Each set of questions that correlated strongly with each other but less with the remaining questions can be considered a dimension or factor. These dimensions are latent—not directly observed but inferred from the combination of associated questions.
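To make this dimensionality reduction concrete, the following minimal Python sketch (not code from Gray et al. [31]; the item structure, weights, and values are hypothetical) generates two clusters of correlated Likert-style items from two simulated latent scores and recovers them as two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_respondents = 300

def likert_items(latent, weight, n_items=3):
    # Three 5-point items driven by one latent score plus noise (illustrative only)
    raw = weight * latent[:, None] + rng.normal(scale=0.5, size=(len(latent), n_items))
    return np.clip(np.round(raw + 3), 1, 5)

experience = rng.normal(size=n_respondents)   # hypothetical latent "experience" scores
agency = rng.normal(size=n_respondents)       # hypothetical latent "agency" scores
X = np.hstack([likert_items(experience, 1.0), likert_items(agency, 0.7)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # two components account for most of the variance
print(pca.components_.round(2))        # each component is dominated by one item cluster
```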
A decade later, in 2017, Weisman et al. [90] changed the original methodology and suggested that there were three dimensions of mind perception rather than two. Weisman et al. argued that the three dimensions are body (e.g., getting hungry, experiencing pain, feeling tired), heart (e.g., feeling love, having a personality), and mind (e.g., remembering things, detecting sounds). Both the experience and agency concepts from Gray et al. [31] were scattered across the body, heart, and mind dimensions of Weisman et al. [90].
In 2019, Malle [55] used a methodology similar to that of Weisman et al. [90] but started from different initial items. Interestingly, Malle also found three dimensions, although they differed slightly from those of Weisman et al. [90], and in some cases five dimensions emerged. Malle found affect (positive and negative emotions and feelings), moral (e.g., telling right from wrong)/social cognition (planning and theory of mind), and reality interaction (verbal communication and moving through the environment). Again, the experience and agency concepts loaded on different dimensions. All three of these studies used a strongly bottom-up approach to search for items associated with mind perception.
Because neither Weisman et al. [90] in 2017 nor Malle [55] in 2019 found evidence that agency is one of the core dimensions of mind perception, other researchers have been understandably uncertain about the status of perceived agency and how to measure it. We believe that while agency is not a core dimension of mind perception, it can be measured as a component of how people perceive other entities, much as the Robotic Social Attribute Scale (RoSAS) measures the warmth, competence, and discomfort of robots [11] or the Multi-Dimensional Measure of Trust (MDMT) measures different dimensions of trust [87].
Previous experimental work on perceived agency has focused on determining whether non-organic entities (i.e., robots, Artificial Intelligence (AI) characters) can be perceived as having agency and what cues lead people to judge whether an entity has agency. Multiple researchers have shown that people do, in fact, ascribe agency to non-organic entities. For example, in 1944, Heider and Simmel [36] constructed an animation of geometric shapes and noticed that people frequently ascribed goals, emotions, and agency to the shapes [75]. This work launched an entire subfield investigating how adults and children perceive intentional motion and its relationship to goal-directed cognition and perceived agency [26, 75, 76].
Researchers originally hypothesized that when an entity looks or acts “like a person,” the entity is more likely to be perceived as having agency [42, 57]. Later researchers, however, have attempted to find better and more specific cues than the general “like a person” hypothesis. For example, in 2010, Short et al. [77] used a clever experimental paradigm in which a robot played rock paper scissors with participants to examine perceived agency. In one condition, the robot played in a standard way throughout multiple rounds with a participant. In another condition, the robot seemed to make a mistake when calling out who won or lost. In a final condition, the robot actively cheated by changing its throw after both the robot and the participant had completed the round. The cheating robot was perceived as having more agency than the other two robots.
Another group of researchers has shown that adherence to social norms may be a cue that leads people to believe that a robot has agency. For example, in 2019, Korman et al. [45] found that robots that follow social norms are perceived as having more agency than robots that disregard social norms or that seem to make a mistake. In 2020, Yasuda et al. [94] refined this hypothesis and found that a robot that cheated was perceived as more agentic than robots committing other types of social norm violations (cursing or insulting), suggesting that cheating itself may be one of the features that encourages people to think of robots as having agency.
1.1 Measuring Perceived Agency
It is clear from this brief review that a great deal of research has already been done on perceived agency, and a large number of claims have been made about it. However, this review also masks a serious problem: we do not have a reliable, robust, theoretically meaningful method of measuring perceived agency. This problem can be seen in how a number of influential papers from the past few years have measured perceived agency. Some researchers have measured perceived agency through qualitative coding of written comments [77, 94]. Another group of researchers has used the overlapping concepts of animacy or anthropomorphism to make claims about perceived agency [6, 91]. A different group of researchers has used idiosyncratic measures of perceived agency, creating measures for their specific study or using incomplete scales from other sources (e.g., [33, 48, 68]); unfortunately, these idiosyncratic measures all differ from one another, yet each paper makes strong claims about perceived agency. Finally, some of these measures show inconsistent results across experiments [54, 77, 94].
This lack of a good measurement tool inhibits not only our theoretical understanding of what perceived agency is, but also our understanding of how it impacts other constructs (and vice versa). Because the measurement of perceived agency differs so much across studies, the conclusions that can be drawn and the opportunities for replication are limited. All of these reasons suggest that a reliable method of measuring perceived agency (PA) is needed to advance the theory and practice of Human-Robot Interaction (HRI). Our goal in this article is to construct a method for measuring perceived agency in entities of all types.
3 Generating Measurement Instruments
The most common method of generating and validating a scale is factor analysis [25, 69]. The general approach to constructing a validated instrument from factor analysis is described in detail by others (e.g., [8]), and the factor analytic approach has had success in HRI [11, 63, 87]. The factor analytic approach to survey construction typically consists of generating a large number of possible items that relate to the dimension of interest. Participants then use those items to rate an entity, an interaction, or themselves. Factor analysis provides loadings that describe how strongly each item relates to different dimensions (factors). Items that load highly on a specific factor can be considered consistent with that factor. Different factors are usually considered different aspects of the primary area of interest.
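As a hedged sketch of this item-selection step (not code used by any of the cited HRI scales), a factor analysis of simulated item responses yields the loadings used to decide which items to retain; the loading matrix and the 0.4 retention cutoff below are assumptions made purely for the example:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n_respondents = 300
latent = rng.normal(size=(n_respondents, 2))            # two hypothetical latent traits
true_loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                          [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
X = latent @ true_loadings.T + rng.normal(scale=0.5, size=(n_respondents, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = fa.components_.T                             # items x factors
keep = np.abs(loadings).max(axis=1) > 0.4               # hypothetical retention cutoff
print(loadings.round(2))
print(keep)                                             # items that load strongly on some factor
```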
In the factor analytic approach, items are selected to maximize reliability, which leads to items that are similar in terms of endorsability [22, 78]. This is an excellent approach when the researcher is attempting to understand the many dimensions and nuances of a construct (e.g., the mind perception work described earlier). Indeed, in 2002, Smith [80] suggested using factor analysis when the data have multiple uncorrelated factors.
The factor analytic approach has at least two disadvantages when attempting to create a measurement scale. First, because factor analysis identifies how close an item is to the underlying latent variable, it can be difficult to select items that cover a wide range of the latent variable. This, in turn, makes it harder to differentiate levels or amounts of the specific dimension the scale is measuring.
Second, the ideal in scale development is to measure a single dimension, or latent factor [17, 60, 71]. A single dimension is desirable because it makes interpretation easier and more straightforward. When a construct has multiple factors, difficulties in interpretation can arise because the analyst needs to show how the multiple factors create a general factor [71]; this situation arises more commonly when the goal is understanding the factor structure and less often when the focus is scale creation. Factor analysis can measure and show unidimensional constructs, but the number of dimensions in factor analysis is still a hotly debated topic [69].
A final feature of factor analysis is that it measures a latent construct of the person and does not account for an external target. Usually this is not a concern—an individual’s attitudes or opinions can be measured well. However, when the target is an external event or entity, factor analysis has no way to account for differences among those external targets.
Because our intention is to construct a unidimensional scale of perceived agency about external entities (e.g., robots or AIs), we will be using a Rasch analysis.
The Rasch model is a mathematical formulation that describes the relationship between raters, items, and entities and is part of the item response theory framework. All three components are measured on the same latent scale, with the logit as the unit of measurement. Rasch analysis models the fact that raters, entities, and items can all vary along the latent variable: in our case, some raters will have a (pre-)disposition to believe an entity has more or less perceived agency; an item may be easier or more difficult to agree with; and an entity may have more or less perceived agency. Because the Rasch model puts all three components on the same measurement scale, it is straightforward to determine the location of each rater, item, or entity.
Rasch models have measurement invariance [9, 21, 23]: when a set of observations fits a Rasch model, entity measures are invariant across different sets of items or raters, and item and rater measures are invariant across different entities. Measurement invariance implies that test scores are sufficient statistics for estimating rater measures. Measurement invariance is tested with fit statistics [79]; unidimensionality and reliability can be assessed as well.
3.1 Rasch Analysis
Our approach will be to have raters (participants) answer items (survey questions) on different entities that will be judged. Each of these “facets” is a separate source of information and bias, and each can be measured along the same dimension. Rasch analysis can construct measurements for each element in each facet.
Items in a Rasch analysis perform best if there is a range where some items are easier to agree with and some are more difficult to agree with. Each item will be expected to measure some aspect of the latent trait (perceived agency) that we are interested in.
Entities in our case will consist of a variety of videos that show a robot, AI character, or person performing some task. Like items, entities will be expected to have a range of perceived agency.
Raters are people who watch a video and answer items about the entity. A rater who is more likely to agree that many entities have some amount of perceived agency would be considered to have more of the latent value; a rater who feels that very few entities have much perceived agency would be considered to have less of it. These latent values for each rater can be considered a (pre-)disposition toward believing an entity has perceived agency.
We can operationalize these intuitions using a Rasch rating scale model, which can be defined as
\[
\log\!\left(\frac{P_{eirc}}{P_{eir(c-1)}}\right) = \theta_e - \beta_i - \alpha_r - \tau_c, \qquad (1)
\]
where
— c is the category of the rating scale, or the Likert value (in our case, 1–5),
— \(P_{eirc}\) is the probability of entity e receiving a rating of category c on item i from rater r,
— \(P_{eir(c-1)}\) is the probability of entity e receiving a rating of category \(c-1\) on item i from rater r,
— \(\theta_e\) is the amount of perceived agency of entity e,
— \(\beta_i\) is how difficult item i is to agree with,
— \(\alpha_r\) is the severity or (pre-)disposition of rater r, and
— \(\tau_c\) is the difficulty of receiving a rating of category c relative to a rating of category \(c-1\).
The category value \(\tau_c\) is the location where adjacent categories c and \(c-1\) are equally probable to be observed, also known as the Rasch-Andrich threshold [51].
The Rasch model is an additive linear model based on a logistic transformation of ratings to a logit scale. Critically, all facets (entities e, raters r, items i) are on the same logit scale, and all can influence the final rating. Conceptually, this means that the logit scale represents the latent value or dimension—the amount of perceived agency.
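A minimal numeric sketch of Equation (1), assuming the standard rating scale parameterization and entirely hypothetical parameter values, may help make the roles of the three facets concrete:

```python
import numpy as np

def rating_probabilities(theta, beta, alpha, tau):
    """Category probabilities under the rating scale model in Equation (1).

    tau holds the Rasch-Andrich thresholds for the steps between adjacent
    categories (four values for a 1-5 Likert scale); theta, beta, and alpha are
    the entity, item, and rater parameters, all in logits."""
    steps = theta - beta - alpha - np.asarray(tau, dtype=float)   # log-odds of each step up
    log_num = np.concatenate(([0.0], np.cumsum(steps)))           # cumulative log-numerators for categories 1..5
    probs = np.exp(log_num - log_num.max())                       # subtract max for numerical stability
    return probs / probs.sum()

# A high-agency entity, an easy item, a neutral rater, and evenly spaced thresholds
print(rating_probabilities(theta=1.0, beta=-0.5, alpha=0.0, tau=[-1.5, -0.5, 0.5, 1.5]))
```

Raising \(\theta\) (or lowering \(\beta\) or \(\alpha\)) shifts the probability mass toward the higher rating categories, which is the intuition behind placing all three facets on one scale.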
The Rasch model makes some basic assumptions about measurement [9, 16, 74, 93]. For example, if a rater with a high (pre-)disposition toward perceived agency (\(\alpha\)) gives an entity with high perceived agency (\(\theta\)) an especially low score on an item (\(\beta\)), that item may have a measurement problem, or mis-fit. Rasch models allow us to find and inspect these mis-fits; entities, items, or raters that consistently show large mis-fits suggest a concern: the video may be misleading, an item may be confusing, or a rater may be answering randomly. Rasch models have several strengths: they generalize across entities and raters (e.g., different robots or AIs can be measured accurately by different raters), they produce measurements on an interval scale (not an ordinal one) that allows parametric statistical analysis, they can identify items or entities that do not behave as expected, and they produce an ordered set of items and entities.
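As an illustration of how such mis-fits can be quantified (the exact fit statistics reported by Rasch software vary, so this is a simplified sketch rather than our analysis code), an outfit mean-square can be computed from squared standardized residuals, reusing the rating_probabilities helper above:

```python
import numpy as np

def outfit_meansq(observed, thetas, betas, alphas, tau):
    """Average squared standardized residual over a set of observations.

    Values near 1.0 are expected under the model; values well above 1.0 flag
    potential mis-fit for the rater, item, or entity being inspected."""
    z_squared = []
    for x, theta, beta, alpha in zip(observed, thetas, betas, alphas):
        p = rating_probabilities(theta, beta, alpha, tau)   # helper from the earlier sketch
        categories = np.arange(1, len(p) + 1)
        expected = np.sum(categories * p)                   # model-expected rating
        variance = np.sum((categories - expected) ** 2 * p) # model variance of the rating
        z_squared.append((x - expected) ** 2 / variance)
    return float(np.mean(z_squared))
```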
5 Experiment 2: Scale Evaluation
The goal of experiment 2 was to evaluate the scale constructed in experiment 1 and compare it to other survey methods of measuring perceived agency. As suggested in Section 1, there are two other existing survey approaches that researchers have used to measure perceived agency. The most common is the work by Gray et al. [31]; these items, or a subset of them, have been used by others to measure or explore perceived agency [33, 56, 84, 90]. A greatly reduced set of items to measure perceived agency was created by Korman et al. [45]. The items generated by Korman et al. were not psychometrically validated, but they represent one of the typical approaches [33, 48, 68] used to measure a latent construct: scour the literature and create a “reasonable subset” of items.
Finally, both the averaged raw items and the calibrated items from experiment 1 will be used. There are therefore four measures of perceived agency that experiment 2 evaluates on novel entities: (1) the agency dimension from Gray et al. [31], (2) the agency items from Korman et al. [45], (3) the average of all 13 items from experiment 1, and (4) the logit scale from experiment 1.
5.1 Method
All studies, including this one, were approved by the NRL IRB. All participants consented to participate.
5.1.1 Participants.
A Monte Carlo simulation based on experiment 1 effect sizes showed that 70 participants were needed to have an 80% chance of showing a significant ordinal relationship between different entities. A total of 75 participants were recruited through Cloud Research and paid $12 for participation in the study; 10 participants were removed because they missed an attention check (“has a face”), leaving 65 participants. The average age of participants was 39 (SD = 9) years. A total of 32 participants were women, 32 participants were men, and 1 participant was unreported. The study took 47 minutes on average. No participants took part in experiment 1.
5.1.2 Materials (Videos).
Seven new videos were selected and collected using methods similar to those in experiment 1. None of the videos in experiment 2 were used in experiment 1.
Table 5 provides a label, a brief description, the morphology of the entity, and a citation of the source. The citation of each video is either a YouTube location or a paper or website describing the video.
5.1.3 Materials (Survey Items).
There were three sets of items. One set was developed in experiment 1 and is shown in Table 3; these are the PA items. Another set of items came directly from the agency dimension of Gray et al. [31]; these are the GGW items and are shown in Table 6. Finally, the items from Korman et al. [45] and Frazier et al. [24] (under review) are the Korman items and are also shown in Table 6. The PA and GGW items used a Likert scale ranging from 1 to 5, whereas the Korman items used a Likert scale ranging from 1 to 7.
5.1.4 Procedure.
The procedure for experiment 2 was identical to that of experiment 1 except for three differences. First, because there were three different scales, we kept each set of survey items together, but the order of the blocks was randomly assigned following a Latin square design.
The second difference from experiment 1 was that after all videos had been watched and all items were answered for each video, a ranking screen was displayed. Participants were provided the preceding definition of perceived agency and asked to rank all videos from least to most by dragging a thumbnail of each video to the desired rank. They were able to watch any video again if they desired. When this task was completed, they pushed a submit button.
The third difference from experiment 1 was that participants performed a calibration task for three of the entity videos from experiment 1 by answering the PA items from Table 2. The calibration videos selected were “service” (\(\theta = .85\)), “cheating” (\(\theta = .26\)), and “feeder” (\(\theta = -.73\)); these were selected because they covered a range of perceived agency without being at the extremes, although in principle any number of the original videos could have been used. The data from the calibration videos were used only for the PA-R (Perceived Agency–Rasch) scale and did not impact any of the other scales, since the calibration task came at the end of the experiment.
5.2 Results
5.2.1 Calculating Scale Values.
For the GGW, Korman, and PA scales, the respective items were averaged to give a single score for each rater for each entity. Because of the way Rasch analysis calculates the logit score, only total measures are calculated; reliability cannot be calculated for the Rasch measure. However, it is possible to calculate reliability for the raw scales; reliability was calculated using \(\alpha\) and \(\omega_{total}\) from the psych package [70]. For PA, \(\omega_{total}\) was .96 and Cronbach’s \(\alpha\) was .95; for Korman, \(\omega_{total}\) was .90 and Cronbach’s \(\alpha\) was .90; for GGW, \(\omega_{total}\) was .96 and Cronbach’s \(\alpha\) was .95.
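The reliability estimates above come from the R psych package [70]; for readers working in another environment, Cronbach’s \(\alpha\) is straightforward to compute directly, as in this minimal sketch (\(\omega_{total}\) requires a factor model and is not shown):

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for a respondents x items matrix of ratings on one scale."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                                 # number of items
    item_variances = X.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_variance = X.sum(axis=1).var(ddof=1)     # variance of the total score
    return k / (k - 1) * (1 - item_variances / total_variance)
```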
For the PA-R measure, the calibration videos were used to calculate each rater’s \(\alpha\), or (pre-)disposition toward perceived agency. Each rater’s \(\alpha\) was then used with their item ratings to calculate a logit value on an interval scale for each entity [50].
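The logit values themselves were produced by the estimation procedure cited above [50]; purely as an illustrative sketch of the underlying idea (not the exact algorithm used), an entity’s logit can be found by maximizing the likelihood of its ratings over a grid while the calibrated item, rater, and threshold parameters are held fixed, reusing the rating_probabilities helper from Section 3:

```python
import numpy as np

def estimate_theta(ratings, betas, alphas, tau, grid=np.linspace(-4, 4, 801)):
    """Grid-search maximum-likelihood estimate of an entity's perceived agency (theta).

    ratings, betas, and alphas are aligned: ratings[j] is the 1-5 response given by
    the rater with (pre-)disposition alphas[j] to the item with difficulty betas[j]."""
    best_theta, best_loglik = grid[0], -np.inf
    for theta in grid:
        loglik = 0.0
        for rating, beta, alpha in zip(ratings, betas, alphas):
            p = rating_probabilities(theta, beta, alpha, tau)   # helper from the earlier sketch
            loglik += np.log(p[int(rating) - 1])
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta
```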
5.2.2 Comparing Scales to Each Other.
It is traditional when generating and comparing scales to show the correlations between the different scales. We expect the scales to have moderate to high correlations with each other, since they all attempt to measure the same underlying construct. Indeed, as Table 7 shows, all correlations are moderate to high.
5.2.3 Comparing Each Scale to Empirical Ranking.
Our overall goal was to determine which scale best predicts the rank ordering of the entities, which participants ranked from least to most perceived agency. An ordinal regression is the most appropriate analysis for ordered data: the outcome variable is ordinal (e.g., rank orderings), whereas the predictors can be of any type (categorical, ordinal, interval, etc.). Four ordinal regression models were created, one for each scale.
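As a hedged sketch of this analysis (with entirely hypothetical data, since the exact data layout and software are not the point here), an ordinal (proportional-odds) regression of ranks on scale scores can be fit with statsmodels and its AIC extracted for the model comparisons reported below:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical long-format data: one row per rater x entity, holding the empirical
# rank (1-7) the rater gave the entity and that rater's scale score for the entity.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "rank": np.tile(np.arange(1, 8), 10),
    "score": rng.normal(size=70),
})
df["rank"] = pd.Categorical(df["rank"], ordered=True)

model = OrderedModel(df["rank"], df[["score"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.aic)   # lower relative AIC indicates a better-fitting model
```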
Figure 1 shows a graphical representation of the ordinal model fits and the empirical data. Model fits were calculated by using each rater’s scores on each scale to predict a model rank for each entity with the respective ordinal regression model.
There are several aspects of Figure 1 that should be highlighted. First, notice that all four models fit the empirical data quite well. Even the Korman model [45], which was not constructed or validated in a psychometrically rigorous manner, fits the overall pattern well. The GGW model [31], which is based on a PCA-derived agency dimension and is currently the most common method of measuring perceived agency, does a very good job of capturing the trends in ordering, although it has the most difficulty in the middle range (i.e., RSR1 and Bargaining receive nearly identical model scores but differ considerably empirically). The PA model, which consists of the average of the items from experiment 1, does an excellent job of predicting the empirical ordering. The PA-R model, which uses the items and three calibration videos from experiment 1 to convert ratings into a logit score via Equation (1), also shows an excellent fit to the empirical data. The difference between the PA and PA-R models is relatively small.
We can empirically compare each of the models shown in Figure 1 to determine which best predicts rater rank orderings. Most importantly, all four models perform significantly better than chance (\(p \lt 0.05\)). We can evaluate how well each model fits the data using the Akaike Information Criterion (AIC); a lower relative score is better. The AIC statistics derived from the ordinal regression model fits are shown in Table 8. These AIC scores show that the GGW model is significantly preferred over the Korman model, the PA model is significantly preferred over the GGW model, and the PA-R model is significantly preferred over the PA model [89].
5.3 Discussion: Experiment 2
Experiment 2 collected data from a new set of raters on a novel group of seven entities that spanned a large range of perceived agency. These raters answered items from three different surveys on perceived agency, and the results from each of those scales were compared to the raters’ ranking of the entities.
The ordering of the different entities by the PA and PA-R scales presents a nuanced picture of perceived agency. In 2022, Li et al. [48] found that humans were rated as having more perceived agency than robots, but our results suggest that not all robots have the same amount of perceived agency. It is not impossible that some robots could be perceived as having more agency than some humans, and our PA-R scale has the potential to reveal these more nuanced differences.
Experiment 2 found that all evaluated surveys were acceptable measures of perceived agency, and all performed better than chance. However, the two best surveys were those developed in experiment 1: PA and PA-R. Both PA and PA-R were substantially and significantly better than the other two methods of measuring perceived agency.
5.4 Rasch Analysis
Rasch analysis allowed us to construct a measure of perceived agency in which all three important facets (entities, items, and raters) are on the same logit scale. Critically, this allows us to examine the hierarchical order of the items: the item \(\beta\)s are estimates of how difficult it is for a rater to agree with each item. This hierarchy allows us to make some important inferences about how people conceptualize perceived agency. First, the items that raters are most likely to rate highly focus on goals—“acts with purpose” and “has goals”; this is not surprising, since an entity without goals can hardly have any perceived agency. In contrast, the items that are the most difficult for raters to rate highly are the integrative items (the two scenarios) and the emotional items (“can show emotions to other people” and “can change their behavior based on how people treat them”). The integrative items highlight that raters will think an entity has a high degree of perceived agency when that entity apparently behaves according to its thoughts and feelings rather than purely responding to the environment. In addition, when an entity responds based on its apparent internal feelings, it is more likely to be rated highly on perceived agency. This analysis suggests that robots and other entities that seem to behave according to their internal feelings are likely to be perceived as having agency.
For the remainder of this article, we will use PA-R.
7 General Discussion
The goal of this research report was to generate a scale to reliably measure perceived agency and to use that scale in a predictive, productive manner. To accomplish this goal, we began with a definition of perceived agency. From our definition, we constructed a set of items based on each aspect of the definition of perceived agency; this is in contrast to some of the more bottom-up approaches (e.g., [31]). Experiment 1 used a Rasch analysis and showed that the scale items were well fitting and that the overall scale had high reliability across all three facets (entities, items, and raters). Experiment 2 used the scale developed in experiment 1 along with two other scales that have been used to measure perceived agency; experiment 2 showed that the scale developed in experiment 1 captured the empirical data better than the two other current measures of perceived agency.
The PA scale has been developed and tested on a wide variety of entities: videos of humans (3), videos of robots of dramatically different morphologies (15), and videos of AI characters (3). There were also static images of people (2), animals (2), robots (2), and artwork (2). Note that while the majority of entities were humanoid, we also included non-humanoid animals (dogs), robotic arms, industrial robots, and robots with widely varying humanoid features (wheels, no ears, large eyes, etc.). The successful use of our scale across this range of entities is encouraging for its application to other entity types.
We should note that these experiments have at least two possible concerns. First, to capture a wide range of morphologies, we used videos instead of in-person interactions or observations. Second, the videos were relatively short—less than 3 minutes—and longer interactions may impact the results. We believe, however, that the strength of our approach will overcome these possible weaknesses.
One benefit of measuring how people perceive agency is that we can examine previous work in HRI with our new understanding. We want to emphasize that the success of the created measures supports our definition of perceived agency. We can also provide insight into one of the most influential pieces of work on perceived agency in HRI—the research of Short et al. [77], who showed that a robot that cheated had more perceived agency than a robot that did not cheat. First, all conditions had “easy” cues of perceived agency: the robots had goals and could communicate with others, and because they could communicate and move around, they could perform many different types of tasks and could do well in other environments. However, the cheating robot, in order to cheat, needed to treat others as if they had minds (thoughts), needed to create novel goals (thoughts), needed to want to perform the cheating action (feelings), and needed to adapt to different situations (losing; environment). These differences are subtle, but note that they map onto each of the three definitional components.
It is our hope that this scale of perceived agency will enable other researchers to accurately measure perceived agency, improve our understanding of how people conceptualize robots’ minds, and build robots that have different levels of perceived agency.