1 Introduction
Many videos remain inaccessible to blind and low-vision (BLV) people [4, 42]. For videos to be accessible to BLV people, they need audio descriptions (AD)—audible explanations of the visual events in a video. Yet only a small number of videos have AD. For example, about 2,760 out of 75 million videos (0.004%) on Amazon Prime Video come with AD [43]. The problem is partly attributed to the cost and time of hiring professional audio describers—the gold standard for generating AD. Indeed, AD is costly (i.e., US$12 to US$75 per video minute), and the turn-around time can be a week for a single video [52].
Prior work has explored ways to reduce the cost and turn-around time of generating AD by (semi-)automating the AD authoring process [31, 54, 57, 58]. For example, Wang et al. built a tool that automatically generates AD by combining computer vision and natural language processing techniques [54]. However, in their evaluation, they reported that the automatically generated AD lacked elements, such as characters' actions and location information, that are useful for blind people. Given that fully automated AD authoring is not yet ready, we ask:
"How can we incorporate automation into the process of authoring AD to increase production efficiency while keeping quality high?"
We develop an AD authoring user interface that combines manual AD authoring with automated review and feedback generation based on computer vision and natural language processing. The appearance of the user interface follows the conventional design of collaborative tools (like [37, 38]) for writing and peer reviewing scene descriptions (SD)—textual descriptions of scenes in a video that are converted into audio descriptions through text-to-speech (TTS). Using this system, a describer can write SD to augment each video scene with its verbal description. We design a novel automatic feedback mechanism: as a describer writes SD, the system displays an interactive word cloud of representative labels. We generate representative labels by combining the output of an off-the-shelf scene recognition tool (Amazon Rekognition) with a custom algorithm that selects which labels to show to users. As the describer mentions terms similar to those that appear in the word cloud, the system detects those similarities and highlights the closest matching labels in real time to indicate that those visual elements are now captured.
We used the authoring system and three types of video stimuli (advertisement, explainer, and instructional) to evaluate whether real-time feedback could help novice SD describers compose high-quality descriptions cost-effectively. We conducted a between-subjects study with 60 novice SD describers, comparing control, automatic-feedback, and human-feedback conditions. We found that automatic feedback could help participants write SD that is descriptive, objective, and good at conveying the video's intended message (learning). We also found that human feedback can help participants describe scenes sufficiently, but it may negatively affect the objectiveness of SD. Although human review remains the most effective feedback for improving quality, we show that automated feedback can reduce the time needed for SD production by 45% compared to a workflow that requires manual review. In summary, this work contributes:
•
The design and development of an AD authoring system that combines manual SD authoring and novel real-time automated feedback based on computer vision and natural language processing.
•
Empirical results with 60 participants demonstrating the feasibility, value, and limitations of real-time automated feedback.
•
Design implications, elicited through the user study, for future AD authoring tools with automated feedback.
3 Automatic Feedback Mechanism for Scene Descriptions
Informed by the design of existing automated technologies that support writing tasks, we develop a scene description (SD) authoring tool. We envision a tool that provides immediate feedback to SD authors (describers), offers an option to incorporate it, and lets describers improve their writing quality. The tool should be like Grammarly, an AI-powered writing tool that provides writers with immediate quality feedback, but instead of grammatical corrections, our tool should provide information helpful for writing SD. The tool should help reduce SD authoring time and improve SD quality—two key factors in making SD available for more videos.
The design of our SD authoring interface follows the conventions of existing similar tools (e.g., [38, 44, 56]) (Fig. 1). The video pane is on the left. On the right, we present the scene's start time, closed captions (CC), scene descriptions (SD), and feedback as table columns. CC and SD bars are both included at the bottom of the video pane, allowing a sighted describer to quickly grasp where speech is present in the video and how succinct an SD is; the SD bar's fill color turns red if the authored SD overlaps with CC, nudging the user to shorten the SD.
In a typical authoring interaction, the user: (i) watches a scene of a given video, (ii) writes a description for that scene, (iii) receives feedback, and (iv) revises as necessary. In the workflow introduced in prior work [37, 38], there is a break between steps (ii) and (iii): receiving feedback involves another human reviewer providing written comments in the feedback column. Though human review is detailed and useful for improving the quality of SD, it incurs additional costs and long turn-around times.
To mitigate the need for a human reviewer, we propose an automatic feedback mechanism: for each video scene, our system creates an interactive word cloud of noun, verb, and adjective labels deemed relevant to the scene (Fig. 1). The labels aim to support novice describers by suggesting what to mention in their SD.
3.1 Generating Labels via Computer Vision for Scene Understanding
We use Amazon Rekognition, an off-the-shelf object detection and scene recognition service [5] (Fig. 2.a). The service automatically detects entities (e.g., car, person) in video scenes, recognizes actions that these entities may perform, and retrieves adjectives that may describe them. The Rekognition output consists of entity, action, and adjective labels, the labels' confidence scores, and trees representing the relationships between labels. For example, in a short scene where a person is walking down a street, we get a tree of labels like Road → Tarmac → Zebra Crossing.
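For illustration, the following is a minimal sketch of how such labels, confidence scores, and parent relationships can be retrieved with boto3; whether the image-level or video-level Rekognition API was used, and how frames were sampled, is not stated in the paper, so those choices are assumptions here.

```python
# Sketch: retrieving labels, confidences, and parent (tree) relationships from
# Amazon Rekognition via boto3. Frame sampling and the image-level endpoint are
# assumptions; the paper does not specify the exact API used.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def labels_for_frame(jpeg_bytes: bytes, min_confidence: float = 55.0):
    """Return (name, confidence, parent names) for each label in one frame."""
    response = rekognition.detect_labels(
        Image={"Bytes": jpeg_bytes},
        MinConfidence=min_confidence,
    )
    return [
        (label["Name"], label["Confidence"],
         [parent["Name"] for parent in label["Parents"]])
        for label in response["Labels"]
    ]

# Example usage with one sampled frame of a scene:
# with open("scene_01_frame.jpg", "rb") as f:
#     print(labels_for_frame(f.read()))
```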
Overall, for our video stimuli described in Fig. 4, we obtained 20 to 43 labels per scene (mean = 28 labels). These labels were too many to be informative and were often repetitive (e.g., including both car and automobile) or even irrelevant (e.g., Final Fantasy and Grand Theft Auto were detected in a scene depicting a car driving along a cliff). Thus, as a follow-up step, we selected representative labels that captured the scene well via a custom clustering and label selection algorithm described in Fig. 2.b-c.
Pruning representative labels is a two-step process. First, because some disjoint label trees from the same scene are semantically related, we group such trees using DBSCAN [16], an oft-used unsupervised clustering algorithm (Fig. 2.b). DBSCAN requires a distance metric and a threshold for clustering; we use semantic similarity computed with sematch [59] and set the threshold at 0.66. Second, we select one representative label per cluster of trees by leveraging the label confidence scores from the Rekognition output and the tree structure to find a label that balances specificity, generality, and confidence (Fig. 2.c). We include more details on our algorithm in the supplementary material (Appendix 2).
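The following is a minimal sketch of the grouping step, assuming a word_similarity(a, b) helper that returns a 0–1 knowledge-graph similarity (e.g., backed by sematch); picking the highest-confidence label per cluster only approximates the selection rule detailed in Appendix 2.

```python
# Sketch: grouping semantically related labels with DBSCAN over a precomputed
# semantic-distance matrix. `word_similarity` is a stand-in for an assumed
# knowledge-graph similarity helper; highest-confidence-per-cluster is only an
# approximation of the authors' representative-selection algorithm.
import numpy as np
from sklearn.cluster import DBSCAN

SIMILARITY_THRESHOLD = 0.66  # value reported above

def representative_labels(labels, confidences, word_similarity):
    n = len(labels)
    distance = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - word_similarity(labels[i], labels[j])
            distance[i, j] = distance[j, i] = d

    clustering = DBSCAN(
        eps=1.0 - SIMILARITY_THRESHOLD,  # similarity >= 0.66 -> same neighborhood
        min_samples=1,                   # keep singleton labels as their own cluster
        metric="precomputed",
    ).fit(distance)

    representatives = []
    for cluster_id in sorted(set(clustering.labels_)):
        members = [i for i, c in enumerate(clustering.labels_) if c == cluster_id]
        best = max(members, key=lambda i: confidences[i])
        representatives.append(labels[best])
    return representatives
```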
3.2 Supporting Real-time Interactivity via Automatic Matching
As the user writes a scene description (Fig. 3), our interface presents a word cloud of representative labels with the message, "Would it be better if these items are described?" Given the error-prone nature of AI-infused systems, the accompanying message is intentionally not forceful in demanding that people include the labels in their description. All labels are gray initially. As the user writes an SD, the system highlights, in blue, representative labels that match the user-provided SD, indicating that the user has successfully included those labels in their SD. This is achieved via exact or semantic matching. For the exact match, which rarely occurs, we use NLTK [10] to tokenize the user-provided SD and seek exact matches between these tokens and the representative labels. For semantic matching, we form pairs between the tokens and representative labels, then compute the semantic similarity of each pair using sematch [59], a Python library that utilizes a knowledge graph to compute how similar two words or phrases are. We set the similarity threshold at 0.66 through trial and error. We use the knowledge graph-based similarity metric over other methods (e.g., a combination of word embeddings and cosine similarity) because it performed better in our application. The user can delete a label by clicking the 'x' button on the label if they deem it irrelevant.
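A minimal sketch of this matching step follows, again abstracting the knowledge-graph similarity behind an assumed word_similarity helper; the NLTK tokenizer requires the punkt model.

```python
# Sketch of the exact-or-semantic label matching. `word_similarity` stands in
# for an assumed knowledge-graph similarity helper (e.g., backed by sematch);
# nltk.download("punkt") is required for the tokenizer.
from nltk import word_tokenize

SIMILARITY_THRESHOLD = 0.66  # value reported above

def matched_labels(sd_text, representative_labels, word_similarity):
    """Return the representative labels covered by the user-provided SD."""
    tokens = [t.lower() for t in word_tokenize(sd_text)]
    matched = set()
    for label in representative_labels:
        if label.lower() in tokens:  # exact match (rare)
            matched.add(label)
            continue
        if any(word_similarity(token, label.lower()) >= SIMILARITY_THRESHOLD
               for token in tokens):  # semantic match
            matched.add(label)
    return matched
```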
3.3 Video
In the subsequent sections, we use the three videos from [38] to evaluate both the labels and our user interface. The three videos vary in the level of visual information and span several domains: (i) an advertisement video, (ii) an explainer video, and (iii) an instructional video. The
advertisement video is a 1-minute Subaru car commercial. It shows a group of people exploring a scenic area in a Subaru Outback. The explainer video explains Web accessibility guidance on color contrast; it comes from the W3C website and lasts 1 minute 38 seconds. The instructional video, selected to serve as a particularly challenging case, is an origami tutorial on how to make a bookmark corner. Each instructional step is shown visually without any verbal narration, which makes the video impossible for blind people to follow without AD. Many blind people enjoy the art of origami due to its tactile nature, but many origami videos only play music and do not provide verbal instructions. This video lasts 1 minute 42 seconds. All three videos are accompanied by professionally authored scene descriptions, which we treat as "ground truth" in the subsequent sections (see Appendix 1 for the ground truth scene descriptions).
3.4 Evaluating the Quality of Labels
We used Amazon Rekognition and the representative label selection algorithm described above to generate feedback for the scenes in the three videos. Because the quality of the representative labels would affect the instructive quality of the automatic interactive word cloud feedback that users see, we assessed their quality by comparing them with the text of the ground truth scene descriptions.
Amazon Rekognition detected 247 labels per video on average (advertisement: 311, explainer: 229, instructional: 201). This is equivalent to 28 labels per scene on average (34.56 labels per scene in the advertisement video, 32.71 in the explainer video, and 16.75 in the instructional video). Most of the labels detected by Rekognition were nouns (177 distinct nouns, 19 distinct verbs, 14 distinct adjectives). Displaying so many labels as a word cloud was impractical. Thus, after selectively reducing the number of labels (Section 3.1), we concealed 162 (69.6%) labels per video on average. We kept 62 labels for the advertisement video, 84 for the explainer video, and 78 for the instructional video; the per-scene label counts for the three videos were 9, 9, and 7, respectively.
We evaluated the quality of the final labels by comparing them with tokens extracted from the ground truth SD. Because we could not directly compare a sentence with a set of labels, we first extracted noun, verb, and adjective tokens from each ground truth SD using NLTK's pos_tag function. This allowed us to compare a set of tokens with a set of representative labels (Fig. 5). Treating the extracted tokens as the ground truth, we measured the precision and recall of the representative labels over the videos. Precision indicates what portion of the representative labels were relevant to the scene; recall indicates what portion of the ground truth tokens were covered by the representative labels.
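Concretely, with R denoting the representative labels of a scene and T the content tokens of its ground truth SD, precision = |R ∩ T| / |R| and recall = |R ∩ T| / |T|. A small sketch follows, using exact set intersection for simplicity; the paper's exact matching criterion is an assumption here.

```python
# Sketch: precision/recall of representative labels against ground-truth SD
# tokens. Requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger"). Exact set intersection is used
# for simplicity; the paper's matching criterion may differ.
from nltk import pos_tag, word_tokenize

CONTENT_TAGS = ("NN", "VB", "JJ")  # noun, verb, and adjective tag prefixes

def content_tokens(ground_truth_sd):
    tagged = pos_tag(word_tokenize(ground_truth_sd.lower()))
    return {token for token, tag in tagged if tag.startswith(CONTENT_TAGS)}

def precision_recall(representative_labels, tokens):
    hits = {label for label in representative_labels if label.lower() in tokens}
    precision = len(hits) / len(representative_labels) if representative_labels else 0.0
    recall = len(hits) / len(tokens) if tokens else 0.0
    return precision, recall
```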
The overall precision and recall values for the advertisement, explainer, and instructional videos were (0.05, 0.13, 0.06) and (0.08, 0.17, 0.03), respectively (Table 1). We also measured precision and recall for each part of speech—noun, verb, and adjective. Both precision and recall for verb and adjective labels were low; we expected this, as the output of Amazon Rekognition's scene recognition was noun-heavy. Across all parts of speech, the recall scores were higher than the precision scores. The low precision suggests that the system retained redundant labels even after pruning.
The system's scene recognition performance was better for the explainer video than for the other two videos, especially in terms of noun recall. In fact, in some scenes of the explainer video, the scene-level recall was 1.0, indicating that the representative labels included all noun tokens mentioned in the ground truth. Both precision and recall were low for the instructional video, which captured only the hands of a person folding origami. This resulted in our system producing representative labels like paper and manicure, while the ground truth tokens included unfold and crease, capturing fine-grained actions.
Objective similarity metrics between the ground truth tokens and the representative labels were low, though recall for noun tokens was relatively high. This showed that the accuracy of the interactive word cloud feedback was not perfect. However, some noun labels proved useful in a pilot study: they helped describers ideate what to mention in SD even when they did not match the ground truth tokens semantically. For example, the label word was presented to the user as feedback in the last scene of the explainer video; the feedback guided the participant to describe the textual information shown in that scene. Having seen the potential value of the automatically generated feedback, error-prone as it may be, we decided to move on to the main study. We could have removed the instructional video from the subsequent study given its particularly low precision and recall scores, but we would then have missed the opportunity to observe the effects of error-prone feedback on the quality of scene descriptions, a situation that may reflect real-world scenarios. Thus, we kept this video for the following study.
4 Study Method
To evaluate the quality and cost of authoring SD with automatic feedback, we recruited novice audio describers to write SD using our system. We conducted the study remotely over Zoom due to COVID-19 restrictions. The study had feedback type as a between-subjects factor with three levels: control, automatic-feedback, and human-feedback:
•
Control: Participants in this group went through two sessions of writing SD. In session 1, we asked them to author SD while watching the videos. We invited them back for session 2, at least one day later, to revise the SD they wrote in session 1 without any feedback.
•
Automatic-feedback: As participants watched a video and wrote SD, the interface displayed the word cloud of representative labels described in Section 3. Participants in this condition took part in only one SD authoring session because they wrote and revised their SD simultaneously with the automatically generated feedback.
•
Human-feedback: Like the control condition, participants in the human-feedback condition went through two sessions. In session 1, they authored SD without receiving any feedback. After session 1, a sighted reviewer wrote comments on how to improve the SD quality along the quality dimensions described below. After this manual review, the same describers were invited back for session 2, where they revised their SD by addressing the reviewer's comments.
4.1 Scene Description Quality Dimensions
The following set of quality dimensions, taken from prior work [38], was used by the sighted reviewer to give feedback to the describers and by the blind evaluator to assess the perceived quality of SD:
•
Descriptive: SD should provide pictorial and kinetic descriptions of objects, people, settings, and how people act in the scene.
•
Sufficient: SD should capture all the scenes and provide sufficient information for the audience to comprehend the content of a video while not being overly descriptive.
•
Clarity: SD should use plain language, writing short, uncomplicated sentences and leaving out anything unnecessary.
•
Objective: SD should illustrate objects, people, and the relationship between them in an unbiased manner.
•
Succinct: SD should fit in a gap without dialogue or a natural pause in the video.
•
Accurate: SD should provide correct information about what is shown in the scene.
•
Referable: SD should use language that is accessible to everyone with different disabilities.
•
Learning: SD should convey the video's intended message to the audience.
•
Interest: SD should make the video interesting to the audience through a cohesive narrative.
4.2 Sighted Participants
We recruited sixty participants as novice describers via a university mailing list and word of mouth, and randomly assigned them uniformly to the three conditions. We recruited participants without prior experience in writing SD to evaluate whether automatic feedback could help novice describers produce good SD. All participants watched and wrote SD for the three videos described above. Our system recorded key interactions (e.g., task_start and task_end timestamps, SD edit logs), which allowed us to calculate how much time describers and reviewers took to complete their tasks. We did not inform participants which condition they were assigned to; thus, participants in the control and human-feedback conditions did not know which condition they were in until they returned for session 2.
Of the 60 participants, 29 were female. Their ages ranged from 18 to 42 years (median = 23, mean = 24.2, std = 4.78). Because participants' English proficiency could affect the quality of the SD they write, we asked them to self-report their English proficiency on a 4-point scale (elementary, intermediate, advanced, native) for four skills—writing, reading, speaking, and listening—for post hoc analysis. We were particularly interested in writing proficiency. N = 28 participants rated themselves as native writers; N = 20 and N = 12 reported advanced and intermediate writing proficiency, respectively. Nobody self-reported as elementary. We grouped the advanced and intermediate English writers together and classified them as non-native (leaving us with N = 32 non-native English writers). Native English writers were distributed roughly evenly across conditions: N = 11 in the control condition, N = 9 in the automatic-feedback condition, and N = 8 in the human-feedback condition.
4.3 Sighted Reviewer for Human-feedback
We recruited one sighted research staff member, who is not a co-author of this paper, to be the reviewer for the human-feedback condition. The reviewer's task was to give feedback on the participants' SD after the participants completed session 1. We recruited a reviewer who had prior experience working with us—someone whom we could trust to be a dedicated and motivated reviewer—to control the quality of the reviews. We instructed the reviewer on the quality dimensions that explain what makes good SD. Neither the describers nor the reviewer was exposed to the ground truth SD that accompanied the three videos.
4.4 Blind Evaluator
Using the quality dimensions, a fully blind evaluator from our research team assessed the quality of the SD that our participants authored. The blind evaluator watched the videos augmented with participant-authored SD. If the evaluator approved a quality facet after finishing a video, he gave the SD a '1'; if he rejected the quality, he gave it a '0'. The blind evaluator used eight of the nine codes (descriptive, sufficient, clarity, objective, succinct, referable, learning, and interest). We omitted the accurate dimension because the blind evaluator could not judge it without visually inspecting the video. The blind evaluator also assessed SD quality at the scene level, but we refrain from reporting that result in this paper due to the word-count limit. We randomized the sequence of videos to be evaluated to minimize potential bias across review conditions and sessions. In total, the blind evaluator checked the quality of 300 sets of SD (20 participants × 3 videos × 5 sets of SD from different feedback types and sessions). We did not expose the evaluator to the ground truth SD before he evaluated the participant-provided SD because we did not want to influence his perception of what constitutes good SD.
Though we used the codebook to let the evaluator assess SD quality objectively, having only one blind evaluator might introduce subjectivity and individual bias into the quality evaluation, which might reduce the reliability of the result. To gauge the level of individual bias of the blind evaluator, we recruited an additional blind evaluator via a local accessibility organization. Following the same procedure as the first evaluator, the additional evaluator classified SD as "approve" or "reject" for each quality. Instead of evaluating all 300 sets of SD, however, the additional evaluator assessed 90 SD—30% of what the first evaluator assessed. We then calculated the agreement between the two evaluators for these 90 SD.
To measure the consistency between the two evaluators, we computed Cohen's kappa (κ) for the evaluation results of the eight quality dimensions [13]. The results were: (descriptive, sufficient, clarity, objective, succinct, referable, learning, interest) = (0.71, 0.42, 0.07, 0.67, 0.79, 0.52, 0.67, 0.05). We observed substantial agreement (κ > 0.6) for descriptive, objective, succinct, and learning, and moderate agreement (κ > 0.4) for sufficient and referable, showing that the two evaluators assessed these qualities consistently with minimal individual bias. However, the agreement for clarity and interest was low, suggesting these qualities are susceptible to individual bias.
These results suggest that the first evaluator's evaluation of the 300 SD is not overly subjective for most quality dimensions. We discuss the qualities with low reliability carefully at the end of the paper.
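For reference, per-dimension agreement can be computed directly from the two evaluators' binary approve/reject vectors, for example with scikit-learn (the vectors below are illustrative, not the study data):

```python
# Sketch: inter-evaluator agreement per quality dimension using Cohen's kappa.
# The approval vectors are illustrative, not the study data.
from sklearn.metrics import cohen_kappa_score

evaluator_1 = {"descriptive": [1, 1, 0, 1, 0, 1], "clarity": [1, 0, 1, 1, 1, 0]}
evaluator_2 = {"descriptive": [1, 1, 0, 0, 0, 1], "clarity": [0, 1, 1, 1, 0, 0]}

for dimension in evaluator_1:
    kappa = cohen_kappa_score(evaluator_1[dimension], evaluator_2[dimension])
    print(f"{dimension}: kappa = {kappa:.2f}")
```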
4.5 Procedure
At the beginning of the study, we explained to the sighted participants the basic concept of audio descriptions and SD (e.g., how they can support people who cannot see videos), the motivation of the study, and the tasks. We explained how to use our system, the topic of each video, and the quality dimensions of SD. We asked participants in the automatic-feedback condition to author and revise their scene descriptions simultaneously as they received the interactive word cloud feedback. We asked participants in the control and human-feedback conditions to write SD in session 1; after they completed session 1, we scheduled them for session 2. Between the two sessions, the sighted reviewer wrote comments on the SD written by participants in the human-feedback condition. The participants then returned for session 2 and revised their SD. We did not disclose the reviewer's identity to the participants to minimize bias in the perceived importance of the comments. Each participant authored SD for all three videos, and we counterbalanced the order of the videos to minimize learning effects. We ended the study with a semi-structured interview to gauge the participants' perceptions of the interface.
6 Statistical Analysis of SD Quality
To further investigate the effects of automatic-feedback and human-feedback on SD quality, we performed a confirmatory statistical analysis. We compared SD quality between conditions in which describers received feedback (automatic-feedback and human-feedback (session 2)) and those in which they did not (control (session 1) and human-feedback (session 1)). We formed the no-feedback group by pooling the control and human-feedback conditions' session 1 data; this was deemed appropriate because the approval data from both conditions represent the quality of SD that describers wrote without any feedback. We disregarded the control (session 2) data in the confirmatory analysis because we observed minimal difference between session 1 and session 2 data in the control group.
The dichotomous nature of our response variable (approve vs. reject) prevented us from using statistical tools such as the t-test and ANOVA that are commonly used in the HCI literature: these tests assume an outcome drawn from a near-normal distribution ranging from negative to positive infinity, which was not the case for our data. Moreover, the strong random effect of video type that we observed in the exploratory data analysis led us to use statistical tools that can handle not only a fixed effect but also a random effect of the video stimuli. Incorporating the random effect separates the videos' effect from the main effect, ruling out its potential confounding effect.

Potentially suitable alternatives are generalized linear mixed models (GLMM) and Bayesian hierarchical models. Both techniques can capture a wide variety of outcome variables (e.g., count data, binary outcomes). For example, a binary response of approve vs. reject can be modeled using (Bayesian) logistic regression with feedback as the main effect and video type as a random effect. The latter, however, also allows us to interpret the outcome using posterior distributions. Increasingly, researchers both within HCI [27] and outside it [30] recommend posterior distribution-based analysis over discussion based solely on p-values. Bayesian hierarchical models not only allow us to compute a p-value-equivalent statistic, such as the probability of superiority [32], but also let us observe the practical differences between conditions. Thus, we use Bayesian hierarchical models for the confirmatory analysis. The probability of superiority is the probability that a data point drawn at random from one model is larger than one drawn at random from another model; the higher (or lower) the value, the more likely it is that the first model's estimated parameter is larger (or smaller) than the second one's.
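As a worked example, the probability of superiority can be estimated directly from two sets of posterior samples (the draws below are illustrative, not the study's posteriors):

```python
# Sketch: probability of superiority from posterior samples.
# The two sample arrays are illustrative, not the study's posteriors.
import numpy as np

rng = np.random.default_rng(0)
posterior_a = rng.normal(loc=0.8, scale=0.3, size=4000)  # e.g., a feedback effect
posterior_b = rng.normal(loc=0.2, scale=0.3, size=4000)  # e.g., a baseline effect

# Probability that a random draw from A exceeds a random draw from B.
prob_superiority = np.mean(posterior_a[:, None] > posterior_b[None, :])
print(f"P(A > B) = {prob_superiority:.2f}")
```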
6.1 The Effects of Feedback on SD Qualities
Because the perceived quality of SD for the instructional video is much lower than the perceived quality of SD for the other two, naively pooling the data from all videos could prevent us from understanding the true effects of feedback condition on perceived SD quality. Thus, we created two models to analyze the effects of feedback: (i) the first model used data only from the advertisement and explainer videos, and (ii) the second model combined data from all three videos.
To model the approvals and rejections by the blind evaluator, we constructed logistic regression models in which the outcome follows a Bernoulli distribution—a discrete probability distribution used to model dichotomous outcomes. We had three feedback conditions as the main effect: no-feedback, automatic-feedback, and human-feedback. We treated video type as a random effect; that is, the advertisement and explainer videos serve as the random effect levels in the first model, and all three videos in the second model. We present the plate diagrams of the models in Appendix 4.
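A minimal sketch of such a model, written here with PyMC; the priors and parameterization are our assumptions, and the paper's exact specification is given in its Appendix 4.

```python
# Sketch of a Bayesian hierarchical logistic regression: feedback condition as
# the main (fixed) effect, video type as a random intercept. Priors and
# parameterization are assumptions; the paper's exact model is in its appendix.
import numpy as np
import pymc as pm

# Illustrative data: 0/1 approvals, condition index (0=no, 1=automatic, 2=human),
# and video index (0=advertisement, 1=explainer, 2=instructional).
approved = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
condition = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 1, 2, 0])
video = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0.0, sigma=2.5, shape=3)        # condition effects
    sigma_video = pm.HalfNormal("sigma_video", sigma=1.0)
    video_intercept = pm.Normal("video_intercept", mu=0.0,
                                sigma=sigma_video, shape=3)      # random effect
    logit_p = beta[condition] + video_intercept[video]
    pm.Bernoulli("approved", logit_p=logit_p, observed=approved)
    trace = pm.sample(2000, tune=1000, target_accept=0.9)
```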
Consistent with the visual exploratory data analysis, our analysis excluding the instructional video found that both automatic-feedback and human-feedback improved descriptiveness and learning. This suggests that the additional information describers obtained from reviews (whether automatic or human feedback) helped them improve the details of their scene descriptions. When we included the data from the instructional video, the evidence became weaker: the probability of superiority for automatic-feedback remained high (p.s. = 0.93), but the value shrank for descriptiveness (p.s. = 0.86). This is likely because the uninformative representative labels for the instructional video prevented our participants from gaining information useful for improving SD.
Human-feedback improved sufficiency, but it seems to have negatively impacted the objectiveness of SD. We suspect that these two qualities were in a trade-off relationship: the SD's objectiveness suffered in the human-feedback condition as describers used more terms and phrases to increase sufficiency. In contrast, the high probability of superiority for objectiveness (p.s. = 0.93) suggests that automatic-feedback improved the objectiveness of SD, although we did not see any evidence of automatic-feedback improving sufficiency. As we show in the linguistic analysis below, automatic-feedback encouraged describers to write syntactically simpler SD, which may have reduced the chance of the SD being perceived as subjective.
Our results showed that clarity, succinctness, referability, and interest did not improve with either human-feedback or automatic-feedback. Fig. 8 shows that succinctness, referability, and interest were relatively high across conditions (including no-feedback). Since our participants wrote SD that were already good in terms of these qualities, there was not much room for improvement. We also note that succinctness was influenced by the succinctness visualization, which gave feedback to participants in all conditions. Clarity did not improve; we speculate that this quality and sufficiency were bounded by the limited space participants had for writing SD (i.e., the duration between dialogues in a video available for describing a scene).
6.2 The Effects of Video Types
We used the hierarchical model built on the data from all three video stimuli for this analysis. We visualized the posterior distributions of the random effects on the model's intercepts (Fig. 9) and computed probabilities of superiority by comparing the posterior distributions of the videos' random effects.
The results suggest that authoring descriptive and clear SD was easier for the advertisement and explainer videos. Compared against the instructional video, the probabilities of superiority were 0.92 and 0.89 for descriptiveness, and 0.90 and 0.85 for clarity. Though the results are not "statistically significant," the shift in intercept could have affected the approve/reject outcome with practical significance. Somewhat surprisingly, the random effects of video were weaker for qualities such as referability, learning, and interest for the instructional video; statistically speaking, video type did not affect these qualities much.
The instructional video had positive random effects on the objectiveness and succinctness of the SD compared to the advertisement and explainer videos. This result is consistent with the exploratory analysis of the effects of video, in which we observed that the perceived quality for these dimensions was higher for the instructional video. We speculate that this is because the instructional video's scenes consisted of precise origami-folding steps, leaving less room for subjective interpretation.
The results show that the difference between the advertisement video and the explainer video is marginal, which is congruent with the exploratory data analysis. But the results reify the need to control for the effects of video type, especially for the instructional video that we used.
7 Analysis of Linguistic Characteristics
The statistical analysis suggested that automatic-feedback and human-feedback affected the perceived quality of SD. But the quantitative nature of the analysis prevented us from understanding which linguistic characteristics affected the perceived quality. In this section, we analyze the effects of feedback on the linguistic characteristics of SD through a two-step process: (1) linguistic profiling of text using Profiling-UD [12], and (2) dimensionality reduction using Principal Component Analysis (PCA) [3].
Manually identifying linguistic properties of written documents is subject to systematic bias and is error-prone. Instead, we identify the linguistic characteristics of SD through a method called linguistic profiling [7]. The process extracts quantitative linguistic features, such as lexical variety, syntactic relations, and morphological information, from written text. The method has been used to characterize differences between sets of documents [14, 34] and to track the evolution of written language [35]. Here, we use it to assess differences in the linguistic properties of SD across feedback conditions.
For each video, we concatenate the SD from all scenes that one person authored to form an SD document. We then extract the linguistic features of the SD document using Profiling-UD. Profiling-UD extracts 119-dimensional linguistic features that quantify seven types of linguistic profiles from a given document: raw text properties (4 dimensions), lexical variety (4 dimensions), morphosyntactic information (37 dimensions), verbal predicate structure (10 dimensions), global and local parse tree structures (15 dimensions), syntactic relations (41 dimensions), and use of subordination (8 dimensions)—see the original paper [12] and the table in Appendix 5 for more details.
Inspecting 119-dimensional feature values for each document is tedious and uninformative. Instead, we aggregate the values into seven scalar values, each summarizing one type of linguistic characteristic, such as lexical variety or verbal predicate structure. We flatten the multi-dimensional feature values for each linguistic characteristic to a scalar using PCA [3], a statistical procedure that converts a multi-dimensional vector into low-dimensional principal components that best explain the variability of the data.
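A minimal sketch of this aggregation, assuming the 119 Profiling-UD features have been loaded into a data frame and mapped to their seven groups (the column names in the example mapping are illustrative):

```python
# Sketch: collapsing each group of Profiling-UD features to one scalar per
# document with a 1-component PCA. Column names and grouping are illustrative.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def summarize_feature_groups(features: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """features: documents x 119 Profiling-UD features.
    groups: e.g., {"lexical_variety": ["ttr_form", "ttr_lemma", ...], ...}."""
    summary = {}
    for group_name, columns in groups.items():
        scaled = StandardScaler().fit_transform(features[columns])
        summary[group_name] = PCA(n_components=1).fit_transform(scaled).ravel()
    return pd.DataFrame(summary, index=features.index)
```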
We show strip plots of the distributions of the linguistic characteristics in Fig. 10. The results suggest that SD in the automatic-feedback condition differed in characteristics from those in the no-feedback condition. We observed strong differences in raw text properties, lexical variety, verbal predicate structure, parse tree structure, and use of subordination, which we detail below.
Raw text property. The raw text properties represent document length, sentence and word length, and the number of characters per token. SD in the automatic-feedback condition scored higher than SD in the human-feedback and no-feedback conditions, indicating that documents written by participants in the automatic-feedback condition were longer than those written by participants in the no-feedback condition. On average, across the three videos, SD in the automatic-feedback condition had 137 words and SD in the no-feedback condition had 120 words (human-feedback: 131 words).
Lexical variety. Lexical variety represents the number of lexical types and the number of tokens within the text. The SD in the automatic-feedback condition had higher lexical variety than those in the no-feedback and human-feedback conditions, showing that participants in the automatic-feedback condition used a greater variety of unique words to describe scenes.
Verbal Predicate Structure. The verbal predicate structure represents the distribution of verbal heads and roots in a sentence. The SD in the automatic-feedback condition scored higher on verbal predicate structure than those in the no-feedback and human-feedback conditions. This result suggests that SD in the automatic-feedback condition were more likely to use base words than those in the no-feedback and human-feedback conditions.
Global and Local Parse Tree Structure. This characteristic represents the structure of the syntactic parse tree, such as its depth; the higher the value, the deeper and more complex the parse tree. The SD in the no-feedback condition had lower values than the SD in both the automatic-feedback and human-feedback conditions, indicating that participants wrote SD with simpler sentence structures in the no-feedback condition. For example, one participant in the automatic-feedback condition wrote, "A woman walks along the streets in London, looking at street signs," while another participant in the no-feedback condition wrote, "A road sign is displayed" (the latter's syntactic structure is simpler, with a shallower parse tree).
Use of Subordination. Use of subordination represents the proportion of subordinate clauses relative to main clauses in the text. SD in the automatic-feedback condition made less use of subordination than those in the other conditions. For example, one participant in the human-feedback condition frequently used sentences with subordinate clauses, writing, "... With clear contrast on her phone, she understands where she is heading..." (subordinate clause underscored). In contrast, a participant in the automatic-feedback condition used shorter chunks of main clauses ("...The lady takes out her mobile phone for navigation. She continues to her destination using her mobile for navigation..."). We discuss the implications of these results in the Discussion section.
8 Thematic Analysis of the Interview Result
We conducted a short semi-structured interview with the participants who received either human or automated feedback while authoring SD. To understand their subjective impressions of the feedback mechanisms, we performed a thematic analysis of the interviews: we transcribed them, and one member of our research team iteratively coded the transcripts. Human feedback was preferred by more participants, but some found the automatic feedback useful.
8.1 Human Feedback
Nine out of twenty participants explicitly mentioned that human feedback benefits SD authoring. Participants repeatedly mentioned that it pointed out scenes they had missed describing and inaccurate information. For instance, H9 noted:
“I am not an experienced describer to begin with. I am not sure what I should describe. Like, should I describe everything I see? [...] However, the feedback guides me on what I should include and what not [to describe].” (H9)
While no participants made explicitly negative comments about the human feedback, three mentioned the challenge of incorporating the feedback within the limited space. One noted:
“I think having feedback is good because it helps me to spot things I missed in my previous iteration. However, I am not sure if the reviewer knows that we have a very limited space to incorporate those comments in our descriptions. When the feedback makes the description becomes too long, then the feedback becomes not helpful anymore.” (H5)
8.2 Automated Feedback
Six participants found the automatic feedback useful. For example, three noted that, with the labels provided in the feedback, they included missing items that they had not initially thought of. A8 noted:
“When I write down the SD initially, I tend to look at the feedback to see if I miss anything in the scene. It gives me some ideas on what I should see in the scene and if I see some label might be relevant and I miss it, I will include them in my description.” (A8)
Three participants had a negative impression of the automatic feedback. They noted that it did not help them author SD because the feedback contained too many labels that were generic, repetitive, or irrelevant. For instance, A14 criticized the automatic feedback for mostly covering objects and found it unhelpful:
“To be honest, I didn’t rely on the feedback too much because the feedback is all about the objects and many of them are duplicated, which made the suggested labels to be irrelevant to the focus of the scene. I think the most important thing is the action in the scene, but they are very limited in the feedback.” (A14)
9 Discussion
9.1 Effects of Feedback
Our analysis with the blind evaluator suggested that automatic feedback, enabled by combining scene recognition and natural language processing, could support novice describers in improving their SD after controlling for the effects of the videos. Specifically, we observed a significant improvement in the descriptiveness and learning aspects of SD and a strong positive trend in objectiveness. We also showed that automatic feedback reduces the time cost of producing SD by 45% by removing the manual review process. However, human feedback had a stronger effect on improving SD quality: except for objectiveness, the SD authored by those who received human feedback had qualities equal to or better than those authored with automatic feedback. Thus, the choice between employing human reviewers and relying on automatic feedback remains a trade-off. This insight suggests exploring a hybrid approach between human-feedback and automatic-feedback: for example, first train novices with human feedback, then slowly transition to automatic feedback once they are more familiar with the scene description authoring task.
The analysis of the video types' random effects showed that novice describers found writing SD for our chosen instructional video the hardest. This is perhaps not a surprise, as the video was carefully selected as a challenging case for our automatic feedback. However, the wide range in quality evaluation responses across video types calls for a mechanism to classify video types a priori. Such a mechanism would help us make an informed decision about, say, whether to hire professional audio describers or to use an automatic approach like ours for composing AD.
We showed the benefit of automatic feedback in reducing SD authoring time with a research staff member as the reviewer. But we also considered recruiting a professional audio describer as a reviewer, since such a person might be faster at reviewing SD. To see the effect of having a professional describer as a reviewer, we recruited one audio describer through a personal connection and asked her to review six sets of SD (2 authors × 3 videos). On average, she took 57 minutes per SD (39.06 minutes per video minute) to review, compared to an average of 11.65 minutes per SD (8.71 minutes per video minute) for the research staff member. She wrote reviews of 423 words per SD, compared to the research staff member's 198 words per SD. The result suggests that expertise in composing audio descriptions does not make one quick at reviewing. The longer reviews might further improve SD quality, but they would not be a realistic upper baseline due to cost; moreover, even the research staff member's feedback helped improve SD quality.
Our linguistic profiling captured differences in lexical and syntactic patterns between SD in the no-feedback and automatic-feedback conditions. The captured differences may explain why the blind evaluator perceived the SD the way he did. For example, the linguistic profiling suggested that SD in the automatic-feedback condition tended to be longer, more informative (e.g., higher lexical variety), and more complex (e.g., deeper parse trees), and the blind evaluation showed that SD in the automatic-feedback condition tended to be more descriptive. Thus, an SD's length and its lexical and syntactic complexity and variety may explain its perceived descriptiveness. While our analysis only allows us to discuss a potential association, future work could investigate a causal relationship.
9.2 Future Directions for Automation
Though we showed the benefits of our method, it is just one way of automating SD authoring. A natural question to ask is, "In what other ways can we automate the process so that we can generate AD more efficiently?" One direction is to rely on more advanced (future) computer vision technology for scene recognition to produce SD automatically. We tried to generate SD in a fully automated way using MART, a deep neural network model for video scene understanding [31], but a manual inspection of the generated descriptions suggested that the output was incoherent, so we turned to the authoring method described in this paper. Automatic video scene understanding is a quickly advancing field, however; a recent report suggests that a newer model significantly outperforms prior work on video description tasks [40]. Future work should investigate ways to incorporate such cutting-edge video understanding technology with or without humans in the loop.
Another area where automation could improve the approach described in this paper is the real-time processing of user-provided SD and the selection of representative labels. Our system used knowledge graph-based text matching and label pruning, and the method is not perfect. For example, there was a cluster in which we had to choose a label from road, sidewalk, and zebra crossing; our algorithm chose zebra crossing as the representative label, but in such a case we might want to choose sidewalk or mention all three labels. A better algorithm could improve the selection of labels and thereby increase the usability of our system.
9.3 UI Design Implications
The way we presented the feedback was governed by the structure of the output we obtained from Amazon Rekognition. We could improve the way we give feedback (e.g., by signifying important representative labels), but we could also consider other ways of conveying feedback. Commercial products like Grammarly provide inline feedback that is pleasant to read [22]. Or we could adopt prior work that has proposed more interactive ways of dynamically giving feedback in language translation (e.g., [23]); similar to Yuksel et al., we could present automatically written SD as a placeholder and let the user post-edit it. Future research could investigate which feedback mechanisms work best for writing SD. We are also curious whether providing more generic review comments to describers can have a positive effect: in addition to showing a word cloud, we could compose a series of short tips that imitate human feedback.
While our study design prevented us from combining human feedback and automatic feedback, we can imagine a system that presents the describer with feedback from both a human and an automatic system. The describer could get quick feedback from the automated system and more detailed human feedback when necessary. The describer could then weigh the reviews to balance objectiveness, where automation excelled, against the other qualities, where human review was better.
Another observation surfacing from the study is that the majority of the SD were not deemed sufficient. We suspect this is likely due to the lack of space (the time available between speaking scenes) to provide long, detailed descriptions. While interpretation of the sufficient quality requires caution due to its moderate reliability, the interview results also support this. One intriguing future direction is the design of user interfaces that put the audience—the blind user—in control of the AD, not just the describers. For example, an interface that allows the user to customize and decide the preferred length of AD dynamically at runtime might benefit them; the system could then adjust the amount of information in the SD based on the audience's preferred length. Future work could study blind and low-vision people's preferences at a larger scale to inform the design of such interfaces and their views toward personalized and customizable AD.
10 Limitations
Our automatic feedback mechanism relies on computer vision output of subpar accuracy, and the black-box computer vision API makes it hard to assess what made scene understanding difficult. Yet, the participants still managed to improve the quality of their scene descriptions with the automatic feedback.
The sighted reviewer was a research staff member and not a professional audio describer. However, she was well versed in giving feedback, having studied the codebook that we used. In an exploratory comparison, we also saw that the research staff member was faster at giving reviews than a professional describer.
We had only one blind evaluator who evaluated the full set of SD, but we observed high agreement between two blind evaluators on a subset of the SD. Even so, we caution that the results should be interpreted carefully, particularly for the quality dimensions clarity and interest, where we observed low agreement between the two evaluators. To mitigate the potential influence of individual bias in evaluation, we triangulated our evaluation using multiple evaluation methods; thus, we believe the current work remains informative.