We analyze the task measures and survey responses from our user study to draw conclusions about the effect of using UHTP on each participant's individual assembly strategy and, consequently, on the performance of the human-robot team in our collaborative drill assembly task.
7.2 Results
We now present the results and data analysis from our study. For each metric, we test for statistical significance using a two-sided Wilcoxon signed-rank test (a non-parametric analysis) [8] with a significance level of \(\alpha = 0.05\) and a sample size of 30 participants. We expect \(S_{UHTP}\) to score lower than \(S_{fixed}\) across all metrics, supporting hypotheses H2-A to H2-D.
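For illustration, a paired test of this form can be computed as in the following minimal Python sketch; this is not our analysis code, and the function and variable names are placeholders. The inputs are assumed to be paired per-participant scores, one value per participant per scenario.

```python
# Minimal sketch of the per-metric significance test: a two-sided
# Wilcoxon signed-rank test on paired per-participant scores.
# Variable names and data handling are placeholders, not the study code.
from scipy.stats import wilcoxon

ALPHA = 0.05  # significance level used throughout the analysis

def compare_scenarios(scores_uhtp, scores_fixed, alpha=ALPHA):
    """Paired comparison of a single metric across the two scenarios.

    scores_uhtp, scores_fixed: equal-length sequences with one
    observation per participant (e.g., makespan in seconds).
    Returns the test statistic, p-value, and a significance flag.
    """
    stat, p = wilcoxon(scores_uhtp, scores_fixed, alternative="two-sided")
    return stat, p, p < alpha
```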
H2-A: The mean \(X_{makespan}\) for \(S_{UHTP}\) (M = 392.6, SD = 53.60) is lower than that of \(S_{fixed}\) (M = 435.6, SD = 62.59), as shown in Figure 9. This difference is statistically significant (\(p < 0.001\)), supporting hypothesis H2-A that planning with UHTP leads to quicker task completion times. Furthermore, the range of \(X_{makespan}\) values for \(S_{fixed}\) (Range = 268.4 seconds) is greater than that for \(S_{UHTP}\) (Range = 168.7 seconds).
To explain the difference in \(X_{makespan}\) values, consider the examples shown in Figure 10, in which the user begins assembling either a blue drill (Row \(H_{blue}\)) or a yellow drill (Row \(H_{yellow}\)). \(R_{fixed}\) is programmed to deliver parts assuming the human always begins with a blue drill. In example \(H_{blue}-R_{fixed}\), the participant's choice to build a blue drill first matches the robot's assumption about the human, leading to a low makespan of 390 seconds. The opposite occurs in example \(H_{yellow}-R_{fixed}\): the human has to wait for the robot to bring parts for a yellow drill before they can even begin assembling, which produces undesirable human idle time and an increased assembly makespan of 420 seconds. In contrast, in the second column, robot \(R_{UHTP}\) infers the user's choice of ordering in real time and accordingly executes a sequence of actions that complements the user's decision and reduces human idle time. As a result, in examples \(H_{blue}-R_{UHTP}\) and \(H_{yellow}-R_{UHTP}\), the color of the parts brought by \(R_{UHTP}\) matches the drill color chosen by the user. The makespans of both these examples also match the lowest makespan from example \(H_{blue}-R_{fixed}\), which is 390 seconds. Consequently, the range of makespans for \(S_{fixed}\) is larger than that of \(S_{UHTP}\).
H2-B: Participants ranked \(R_{UHTP}\) higher than \(R_{fixed}\) in terms of how frequently the robot brought parts that the participant needed at the right time. This is evident from the lower median \(X_{fail}\) and \(X_{fill}^*\) scores for \(S_{UHTP}\) (Median \(X_{fail}\) = Never, Median \(X_{fill}^*\) = Never) compared to the median scores for \(S_{fixed}\) (Median \(X_{fail}\) = Sometimes, Median \(X_{fill}^*\) = Most of the time), as shown in Figure 11. Statistical analysis shows that participants responded to \(S_{UHTP}\) with significantly different \(X_{fill}^*\) scores (\(p < 0.001\)) and \(X_{fail}\) scores (\(p < 0.001\)) than to \(S_{fixed}\). These results support hypothesis H2-B that \(R_{UHTP}\) anticipates and fulfills the human partner's requirements better than \(R_{fixed}\).
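Because these survey metrics are ordinal rather than continuous, the verbal frequency responses must be mapped onto an ordered numeric scale before applying the paired test above. A minimal sketch of this encoding follows; the exact scale labels and numeric codes shown are illustrative assumptions, not our survey instrument.

```python
# Sketch of encoding ordinal survey responses so the paired Wilcoxon
# test can be applied to Likert-style metrics such as X_fail.
from scipy.stats import wilcoxon

# Assumed 5-point frequency scale; lower codes are better responses.
LIKERT = {"Never": 0, "Rarely": 1, "Sometimes": 2,
          "Most of the time": 3, "Always": 4}

def encode(responses):
    """Map verbal responses (one per participant) to ordinal codes."""
    return [LIKERT[r] for r in responses]

# Toy paired data for one metric (placeholder values, not study data):
x_fail_uhtp = encode(["Never", "Rarely", "Never", "Sometimes"])
x_fail_fixed = encode(["Sometimes", "Always", "Rarely", "Most of the time"])
stat, p = wilcoxon(x_fail_uhtp, x_fail_fixed)
print(stat, p)
```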
H2-C: Figure 11 shows that participants responded to \(S_{UHTP}\) with a lower median \(X_{alter}\) value (Median = Never) than to \(S_{fixed}\) (Median = Sometimes). While both scenarios received the same median \(X_{delay}\) value (Median = Sometimes), \(S_{fixed}\) has more participant responses to the right of 0 (i.e., toward more frequent delays) than \(S_{UHTP}\). Statistical analysis shows that the \(X_{delay}\) and \(X_{alter}\) values for \(S_{UHTP}\) are significantly different from those of \(S_{fixed}\) (\(p < 0.001\) for both metrics), supporting hypothesis H2-C. These results indicate that participants had to pause and/or change their construction more often during \(S_{fixed}\) than during \(S_{UHTP}\) because the robot provided parts in a sub-optimal order. Thus, by adapting to participant decisions, UHTP enables the robot to avoid undesirable interruptions to the participant's assembly, such as idle time and alterations to the build.
H2-D: The average \(X_{workload}\) score for \(S_{fixed}\) (M = 11.9, SD = 2.54) is larger than the average score for \(S_{UHTP}\) (M = 11.1, SD = 1.90), as shown in Figure 12. This difference is statistically significant (\(p < 0.01\)), supporting hypothesis H2-D that participants in \(S_{fixed}\) experienced a higher workload than in \(S_{UHTP}\). We posit, however, that the overall workload is not substantial in either scenario because the task performed during this study is short (15 minutes per scenario) and fairly straightforward. We anticipate that the difference in workload will be more substantial when measured on more complex assembly tasks that are repeated over an extended period of time.
Post-scenario Responses: Figure 13(a) shows participants' responses to yes/no questions about whether they felt the robot was monitoring their activity. Participants responded more positively to robot \(R_{UHTP}\) than to \(R_{fixed}\) in terms of both the robot's ability to track the color of the drill being built (percentage of Yes responses: \(S_{UHTP}\) = 90.0%, \(S_{fixed}\) = 20.0%) and the robot's adaptability to participant actions (percentage of Yes responses: \(S_{UHTP}\) = 63.3%, \(S_{fixed}\) = 30.0%). McNemar's test shows a significant difference in participant responses between scenarios for these two questions (\(p\)-values from left to right in Figure 13(a): \(p < 0.001\), \(p = 0.006\)).
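For completeness, a minimal sketch of McNemar's test on paired yes/no responses is shown below, using the statsmodels implementation. The 2×2 cell counts are hypothetical placeholders; only the marginal Yes percentages above are reported in our data.

```python
# Sketch of McNemar's test for paired binary (yes/no) responses.
# Cell counts are hypothetical, not the study's contingency table.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: response in S_UHTP (yes, no); columns: response in S_fixed.
table = [[5, 22],   # yes in S_UHTP: [yes in S_fixed, no in S_fixed]
         [1, 2]]    # no in S_UHTP:  [yes in S_fixed, no in S_fixed]

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(result.statistic, result.pvalue)
```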
Comparative Responses: Figure 13(b) shows the responses to our post-experiment questionnaire, in which we asked participants to compare the two robot behaviors based on specific qualities. For each question, participants responded by selecting one of four options: the UHTP scenario (\(S_{UHTP}\)), the fixed-policy scenario (\(S_{fixed}\)), both scenarios equally, or neither scenario. The majority of participants preferred the interaction in scenario \(S_{UHTP}\) over \(S_{fixed}\) (percentage of responses: \(S_{UHTP}\) = 86.7%, \(S_{fixed}\) = 10.0%). Most participants also chose \(R_{UHTP}\) as better at tracking the color of the drill being constructed (percentage of responses: \(S_{UHTP}\) = 80.0%, \(S_{fixed}\) = 6.7%) and at minimizing participant idle time (percentage of responses: \(S_{UHTP}\) = 63.3%, \(S_{fixed}\) = 10.0%). Additionally, a moderate number of participants responded in favor of both scenarios equally across all three questions. This result matches our earlier observation of a wider range of responses for \(S_{fixed}\), caused by some participants choosing a color order that meets the optimal-ordering assumption of the fixed-policy robot's action sequence. Participants whose choice matched the action sequence of \(R_{fixed}\) would be unable to differentiate between the two scenarios, as both policies would behave optimally.
7.3 Discussion
The above results validate two central claims we make about UHTP under operating conditions with real human collaborators: (1) the ability to infer a human user's intent in real time and complement it during a collaborative task, and (2) the versatility to accommodate a variety of human behaviors without sacrificing task performance. Participant responses about interacting with a UHTP-controlled robot show that, compared to a fixed policy, UHTP also improves the human user's experience of the collaboration in a number of ways, such as reducing human idle time and avoiding execution paths that interrupt the user's construction. Furthermore, participants are able to identify that \(R_{UHTP}\) is adapting to their choices. As a result, participants report a reduced mental workload and distinctly prefer the UHTP-controlled robot as a collaborative partner over a fixed-policy robot.
A common observation is that the data obtained for \(S_{fixed}\) cover a wider range of values than those of \(S_{UHTP}\) across all six metrics. This complements our findings for the makespan of task \(Q_{chair}\) in Section 5, where the standard deviation of the makespan for \(P_{fixed}\) was higher than that of UHTP. We again attribute this trend to a split among participants: those whose choice of color ordering matched the fixed-policy robot's sequence of actions and those whose choice did not. This split does not arise for \(S_{UHTP}\) due to its adaptive properties.
Comparing Participants' Responses for Different Orderings of Scenarios: We further analyze potential ordering effects in participants' self-reported responses by plotting the responses across the two scenario orderings in Figure 14. The left sides of Figures 14(a) and 14(b) show the responses of participants who interacted with \(S_{fixed}\) first (ordering \(O_{fixed}\)), and the right sides show the responses of participants who interacted with \(S_{UHTP}\) first (ordering \(O_{UHTP}\)). We observe from Figure 14(a) that more participants from \(O_{UHTP}\) responded to \(S_{fixed}\) with answers such as 'Always' and 'Most of the time' than participants from \(O_{fixed}\). Since a lower response is better across all four metrics, this implies that \(O_{UHTP}\) participants penalized \(S_{fixed}\) more severely than \(O_{fixed}\) participants did. \(O_{UHTP}\) participants whose choice of color order did not match the fixed policy in \(S_{fixed}\) saw the robot become a less efficient partner, while the opposite occurred for \(O_{fixed}\) participants. We believe that interacting with a robot partner whose performance degraded across conditions elicited a stronger reaction than interacting with one whose performance improved, resulting in stronger responses from \(O_{UHTP}\) participants. This trend is consistent with the concept of loss aversion from human psychology, which states that an individual experiences a loss more severely than an equivalent gain [30]. This analysis shows a measurable effect of scenario ordering on participants' responses, although the ordering does not affect participant workload (see Figure 14(b)). We note that this disparity in participant responses does not negate the significant differences observed between \(S_{fixed}\) and \(S_{UHTP}\); rather, we include this observation to highlight the importance of counterbalancing the order of presented scenarios to reduce order effects in our study.
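As an illustration of this ordering-effect breakdown, the following sketch splits encoded responses by the scenario a participant experienced first and compares the groups; the column names and values are hypothetical placeholders, not our study data.

```python
# Sketch of the ordering-effect breakdown: group paired responses by
# which scenario a participant experienced first, then compare groups.
import pandas as pd

# Hypothetical encoded responses to S_fixed, one row per participant.
df = pd.DataFrame({
    "participant":  [1, 2, 3, 4],
    "ordering":     ["O_fixed", "O_UHTP", "O_fixed", "O_UHTP"],
    "x_fail_fixed": [2, 4, 1, 3],  # ordinal codes (0 = Never ... 4 = Always)
})

# Median response to S_fixed within each ordering group, as in Figure 14(a).
print(df.groupby("ordering")["x_fail_fixed"].median())
```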