1 Introduction
Many videos remain inaccessible to blind and low-vision (BLV) people [4, 42]. For videos to be accessible to BLV people, they need audio descriptions (AD)—audible explanations of the visual events in a video. Yet only a small number of videos have AD. For example, about 2,760 out of 75 million videos (0.004%) on Amazon Prime Video come with AD [43]. The problem is partly attributed to the cost and time of hiring professional audio describers—the gold standard for generating AD. Indeed, AD is costly (i.e., US$12 to US$75 per video minute), and the turn-around time can be a week for a single video [52].
Prior work has explored ways to reduce the cost and turn-around time of generating AD by (semi-)automating the AD authoring process [31, 54, 57, 58]. For example, Wang et al. built a tool that automatically generates AD by combining computer vision and natural language processing techniques [54]. However, in their evaluation, they reported that the automatically generated AD lacked elements, such as characters' actions and location information, that are useful for blind people. Given that fully automated AD authoring is not yet ready, we ask:
"How can we incorporate automation into the process of authoring AD to increase production efficiency while keeping quality high?"
We develop an AD authoring user interface that combines manual AD authoring with automated review and feedback generation based on computer vision and natural language processing. The appearance of the user interface follows the conventional design of collaborative tools (like [37, 38]) for writing and peer reviewing scene descriptions (SD)—textual descriptions of scenes in a video that are converted into audio descriptions through text-to-speech (TTS). Using this system, a describer can write SD to augment each video scene with its verbal description. We design a novel automatic feedback mechanism: as a describer writes SD, the system displays an interactive word cloud of representative labels. We generate representative labels by combining the output of an off-the-shelf scene recognition tool (Amazon Rekognition) with a custom algorithm that selects which labels to show to users. As the describer mentions terms similar to those that appear in the word cloud, the system detects those similarities and highlights the closest matching labels in real time to indicate that those visual elements are now captured.
We used the authoring system and three types of video stimuli (advertisement, explainer, and instructional) to evaluate whether real-time feedback could help novice SD describers compose high-quality descriptions cost-effectively. We conducted a between-subjects study with 60 novice SD describers, comparing control, automatic-feedback, and human-feedback conditions. We found that automatic feedback could help participants write SD that is descriptive, objective, and good at conveying the video's intended message (learning). We also found that human feedback can help participants describe scenes sufficiently, but it may negatively affect the objectiveness of SD. Although human review remains the most effective feedback for improving quality, we show that automated feedback can reduce the time needed for SD production by 45% compared to a workflow that requires manual review. In summary, this work contributes:
•
The design and development of an AD authoring system that combines manual SD authoring and novel real-time automated feedback based on computer vision and natural language processing.
•
Empirical results with 60 participants demonstrating the feasibility, value, and limitations of real-time automated feedback.
•
Design implications, elicited through the user study, for future AD authoring tools with automated feedback.
3 Automatic Feedback Mechanism for Scene Descriptions
Informed by the design of existing automated technologies that support writing tasks, we develop a scene description (SD) authoring tool. We envision a tool that provides immediate feedback to SD authors (describers), offers an option to incorporate it, and lets describers improve their writing quality. The tool should be like Grammarly, an AI-powered writing tool that provides writers with immediate quality feedback, but instead of grammatical corrections, our tool should provide information helpful for writing SD. The tool should help reduce SD authoring time and improve SD quality—two key factors in making SD available for more videos.
The design of our SD authoring interface follows the conventions of existing similar tools (e.g., [38, 44, 56]) (Fig. 1). The video pane is on the left. On the right, we present the scene's start time, closed captions (CC), scene descriptions (SD), and feedback as table columns. CC and SD bars are both included at the bottom of the video pane, allowing a sighted describer to quickly grasp where speech is present in the video and how succinct an SD is; the SD bar's fill color turns red if the authored SD overlaps with CC, nudging the user to shorten the SD.
In a typical authoring interaction, the user: (i) watches a scene of a given video, (ii) writes a description for that scene, (iii) receives feedback, and (iv) revises as necessary. In the workflow introduced in prior work [37, 38], there is a break between steps (ii) and (iii): receiving feedback involves another human reviewer providing written comments in the feedback column. Though human review is detailed and useful for improving the quality of SD, it incurs additional costs and long turn-around times.
To mitigate the need for a human reviewer, we propose an automatic feedback mechanism: for each video scene, our system creates an interactive word cloud of noun, verb, and adjective labels deemed relevant to the scene (Fig. 1). The labels aim to support novice describers by suggesting what to mention in their SD.
3.1 Generating Labels via Computer Vision for Scene Understanding
We use Amazon Rekognition, an off-the-shelf object detection and scene recognition service [5] (Fig. 2.a). The service automatically detects entities (e.g., car, person) in video scenes, recognizes actions that these entities may perform, and retrieves adjectives that may describe them. The Rekognition output consists of entity, action, and adjective labels, the labels' confidence scores, and trees representing the relationships between labels. For example, in a short scene where a person is walking down a street, we get a tree of labels like Road → Tarmac → Zebra Crossing.
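For illustration, the following is a minimal sketch of how such labels, confidence scores, and parent relationships can be retrieved with boto3; whether the image-level or video-level Rekognition API was used, and how frames were sampled, is not stated in the paper, so those choices are assumptions here.

```python
# Sketch: retrieving labels, confidences, and parent (tree) relationships from
# Amazon Rekognition via boto3. Frame sampling and the image-level endpoint are
# assumptions; the paper does not specify the exact API used.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def labels_for_frame(jpeg_bytes: bytes, min_confidence: float = 55.0):
    """Return (name, confidence, parent names) for each label in one frame."""
    response = rekognition.detect_labels(
        Image={"Bytes": jpeg_bytes},
        MinConfidence=min_confidence,
    )
    return [
        (label["Name"], label["Confidence"],
         [parent["Name"] for parent in label["Parents"]])
        for label in response["Labels"]
    ]

# Example usage with one sampled frame of a scene:
# with open("scene_01_frame.jpg", "rb") as f:
#     print(labels_for_frame(f.read()))
```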
Overall, for our video stimuli described in Fig. 4, we obtained 20 to 43 labels per scene (mean = 28 labels). These labels were too many to be informative and were often repetitive (e.g., including both car and automobile) or even irrelevant (e.g., Final Fantasy and Grand Theft Auto were detected in a scene depicting a car driving along a cliff). Thus, as a follow-up step, we selected representative labels that captured the scene well via a custom clustering and label selection algorithm described in Fig. 2.b-c.
Pruning representative labels is a two-step process. First, because some disjoint label trees from the same scene are semantically related, we group such trees using DBSCAN [16], an oft-used unsupervised clustering algorithm (Fig. 2.b). DBSCAN requires a distance metric and a threshold for clustering; we use semantic similarity computed with sematch [59] and set the threshold at 0.66. Second, we select one representative label per cluster of trees by leveraging the label confidence scores from the Rekognition output and the tree structure to find a label that balances specificity, generality, and confidence (Fig. 2.c). We include more details on our algorithm in the supplementary material (Appendix 2).
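The following is a minimal sketch of the grouping step, assuming a word_similarity(a, b) helper that returns a 0–1 knowledge-graph similarity (e.g., backed by sematch); picking the highest-confidence label per cluster only approximates the selection rule detailed in Appendix 2.

```python
# Sketch: grouping semantically related labels with DBSCAN over a precomputed
# semantic-distance matrix. `word_similarity` is a stand-in for an assumed
# knowledge-graph similarity helper; highest-confidence-per-cluster is only an
# approximation of the authors' representative-selection algorithm.
import numpy as np
from sklearn.cluster import DBSCAN

SIMILARITY_THRESHOLD = 0.66  # value reported above

def representative_labels(labels, confidences, word_similarity):
    n = len(labels)
    distance = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - word_similarity(labels[i], labels[j])
            distance[i, j] = distance[j, i] = d

    clustering = DBSCAN(
        eps=1.0 - SIMILARITY_THRESHOLD,  # similarity >= 0.66 -> same neighborhood
        min_samples=1,                   # keep singleton labels as their own cluster
        metric="precomputed",
    ).fit(distance)

    representatives = []
    for cluster_id in sorted(set(clustering.labels_)):
        members = [i for i, c in enumerate(clustering.labels_) if c == cluster_id]
        best = max(members, key=lambda i: confidences[i])
        representatives.append(labels[best])
    return representatives
```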
3.2 Supporting Real-time Interactivity via Automatic Matching
As the user writes a scene description (Fig. 3), our interface presents a word cloud of representative labels with the message, "Would it be better if these items are described?" Given the error-prone nature of AI-infused systems, the accompanying message is intentionally not forceful in demanding that people include the labels in their description. All labels are gray initially. As the user writes an SD, the system highlights, in blue, representative labels that match the user-provided SD, indicating that the user has successfully included those labels in their SD. This is achieved via exact or semantic matching. For the exact match, which rarely occurs, we use NLTK [10] to tokenize the user-provided SD and seek exact matches between these tokens and the representative labels. For semantic matching, we form pairs between the tokens and representative labels, then compute the semantic similarity of each pair using sematch [59], a Python library that utilizes a knowledge graph to compute how similar two words or phrases are. We set the similarity threshold at 0.66 through trial and error. We use the knowledge graph-based similarity metric over other methods (e.g., a combination of word embeddings and cosine similarity) because it performed better in our application. The user can delete a label by clicking the 'x' button on the label if they deem it irrelevant.
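A minimal sketch of this matching step follows, again abstracting the knowledge-graph similarity behind an assumed word_similarity helper; the NLTK tokenizer requires the punkt model.

```python
# Sketch of the exact-or-semantic label matching. `word_similarity` stands in
# for an assumed knowledge-graph similarity helper (e.g., backed by sematch);
# nltk.download("punkt") is required for the tokenizer.
from nltk import word_tokenize

SIMILARITY_THRESHOLD = 0.66  # value reported above

def matched_labels(sd_text, representative_labels, word_similarity):
    """Return the representative labels covered by the user-provided SD."""
    tokens = [t.lower() for t in word_tokenize(sd_text)]
    matched = set()
    for label in representative_labels:
        if label.lower() in tokens:  # exact match (rare)
            matched.add(label)
            continue
        if any(word_similarity(token, label.lower()) >= SIMILARITY_THRESHOLD
               for token in tokens):  # semantic match
            matched.add(label)
    return matched
```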
3.3 Video
In the subsequent sections, we use the three videos from [38] to evaluate both the labels and our user interface. The three videos vary in the level of visual information and span several domains: (i) an advertisement video, (ii) an explainer video, and (iii) an instructional video. The
advertisement video is a 1-minute Subaru car commercial. It shows a group of people exploring a scenic area in a Subaru Outback. The explainer video explains Web accessibility guidance on color contrast; it comes from the W3C website and lasts 1 minute 38 seconds. The instructional video, selected to serve as a particularly challenging case, is an origami tutorial on how to make a bookmark corner. Each instructional step is shown visually without any verbal narration, which makes the video impossible for blind people to follow without AD. Many blind people enjoy the art of origami due to its tactile nature, but many origami videos only play music and do not provide verbal instructions. This video lasts 1 minute 42 seconds. All three videos are accompanied by professionally authored scene descriptions, which we treat as "ground truth" in the subsequent sections (see Appendix 1 for the ground truth scene descriptions).
3.4 Evaluating the Quality of Labels
We used Amazon Rekognition and the representative label selection algorithm described above to generate feedback for the scenes in the three videos. Because the quality of the representative labels would affect the instructive quality of the automatic interactive word cloud feedback that users see, we assessed their quality by comparing them with the text of the ground truth scene descriptions.
Amazon Rekognition detected 247 labels per video on average (advertisement: 311, explainer: 229, instructional: 201). This is equivalent to 28 labels per scene on average (34.56 labels per scene in the advertisement video, 32.71 in the explainer video, and 16.75 in the instructional video). Most of the labels detected by Rekognition were nouns (177 distinct nouns, 19 distinct verbs, 14 distinct adjectives). Displaying so many labels as a word cloud was impractical. Thus, after selectively reducing the number of labels (Section 3.1), we concealed 162 (69.6%) labels per video on average. We kept 62 labels for the advertisement video, 84 for the explainer video, and 78 for the instructional video; the per-scene label counts for the three videos were 9, 9, and 7, respectively.
We evaluated the quality of the final labels by comparing them with tokens extracted from the ground truth SD. Because we could not directly compare a sentence with a set of labels, we first extracted noun, verb, and adjective tokens from each ground truth SD using NLTK's pos_tag function. This allowed us to compare a set of tokens with a set of representative labels (Fig. 5). Treating the extracted tokens as the ground truth, we measured the precision and recall of the representative labels over the videos. Precision indicates what portion of the representative labels were relevant to the scene; recall indicates what portion of the ground truth tokens were covered by the representative labels.
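Concretely, with R denoting the representative labels of a scene and T the content tokens of its ground truth SD, precision = |R ∩ T| / |R| and recall = |R ∩ T| / |T|. A small sketch follows, using exact set intersection for simplicity; the paper's exact matching criterion is an assumption here.

```python
# Sketch: precision/recall of representative labels against ground-truth SD
# tokens. Requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger"). Exact set intersection is used
# for simplicity; the paper's matching criterion may differ.
from nltk import pos_tag, word_tokenize

CONTENT_TAGS = ("NN", "VB", "JJ")  # noun, verb, and adjective tag prefixes

def content_tokens(ground_truth_sd):
    tagged = pos_tag(word_tokenize(ground_truth_sd.lower()))
    return {token for token, tag in tagged if tag.startswith(CONTENT_TAGS)}

def precision_recall(representative_labels, tokens):
    hits = {label for label in representative_labels if label.lower() in tokens}
    precision = len(hits) / len(representative_labels) if representative_labels else 0.0
    recall = len(hits) / len(tokens) if tokens else 0.0
    return precision, recall
```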
The overall precision and recall values for the advertisement, explainer, and instructional videos were (0.05, 0.13, 0.06) and (0.08, 0.17, 0.03), respectively (Table 1). We also measured precision and recall for each part of speech—noun, verb, and adjective. Both precision and recall for verb and adjective labels were low; we expected this, as the output of Amazon Rekognition's scene recognition was noun-heavy. Across all parts of speech, the recall scores were higher than the precision scores. The low precision suggests that the system retained redundant labels even after pruning.
The system's scene recognition performance was better for the explainer video than for the other two videos, especially in terms of noun recall. In fact, in some scenes of the explainer video, the scene-level recall was 1.0, indicating that the representative labels included all noun tokens mentioned in the ground truth. Both precision and recall were low for the instructional video, which captured only the hands of a person folding origami. This resulted in our system producing representative labels like paper and manicure, while the ground truth tokens included unfold and crease, capturing fine-grained actions.
Objective similarity metrics between the ground truth tokens and the representative labels were low, though recall for noun tokens was relatively high. This showed that the accuracy of the interactive word cloud feedback was not perfect. However, some noun labels proved useful in a pilot study: they helped describers ideate what to mention in SD even when they did not match the ground truth tokens semantically. For example, the label word was presented to the user as feedback in the last scene of the explainer video; the feedback guided the participant to describe the textual information shown in that scene. Having seen the potential value of the automatically generated feedback, error-prone as it may be, we decided to move on to the main study. We could have removed the instructional video from the subsequent study given its particularly low precision and recall scores, but we would then have missed the opportunity to observe the effects of error-prone feedback on the quality of scene descriptions, a situation that may reflect real-world scenarios. Thus, we kept this video for the following study.
4 Study Method
To evaluate the quality and cost of authoring SD with automatic feedback, we recruited novice audio describers to write SD using our system. We conducted the study remotely over Zoom due to COVID-19 restrictions. The study had feedback type as a between-subjects factor with three levels: control, automatic-feedback, and human-feedback:
•
Control: Participants in this group went through two sessions of writing SD. In session 1, we asked them to author SD while watching the videos. We invited them back for session 2, at least one day later, to revise the SD they wrote in session 1 without any feedback.
•
Automatic-feedback: As participants watched a video and wrote SD, the interface displayed the word cloud of representative labels described in Section 3. Participants in this condition took part in only one SD authoring session because they wrote and revised their SD simultaneously with the automatically generated feedback.
•
Human-feedback: Like the control condition, participants in the human-feedback condition went through two sessions. In session 1, they authored SD without receiving any feedback. After session 1, a sighted reviewer wrote comments on how to improve the SD quality along the quality dimensions described below. After this manual review, the same describers were invited back for session 2, where they revised their SD by addressing the reviewer's comments.
4.1 Scene Description Quality Dimensions
The following set of quality dimensions, taken from prior work [38], was used by the sighted reviewer to give feedback to the describers and by the blind evaluator to assess the perceived quality of SD:
•
Descriptive: SD should provide pictorial and kinetic descriptions of objects, people, settings, and how people act in the scene.
•
Sufficient: SD should capture all the scenes and provide sufficient information for the audience to comprehend the content of a video while not being overly descriptive.
•
Clarity: SD should use plain language, writing short, uncomplicated sentences and leaving out anything unnecessary.
•
Objective: SD should illustrate objects, people, and the relationship between them in an unbiased manner.
•
Succinct: SD should fit in a gap without dialogue or a natural pause in the video.
•
Accurate: SD should provide correct information about what is shown in the scene.
•
Referable: SD should use language that is accessible to everyone with different disabilities.
•
Learning: SD should convey the video's intended message to the audience.
•
Interest: SD should make the video interesting to the audience through a cohesive narrative.
4.2 Sighted Participants
We recruited sixty participants as novice describers via a university mailing list and word of mouth, and randomly assigned them uniformly to the three conditions. We recruited participants without prior experience in writing SD to evaluate whether automatic feedback could help novice describers produce good SD. All participants watched and wrote SD for the three videos described above. Our system recorded key interactions (e.g., task_start and task_end timestamps, SD edit logs), which allowed us to calculate how much time describers and reviewers took to complete their tasks. We did not inform participants which condition they were assigned to; thus, participants in the control and human-feedback conditions did not know which condition they were in until they returned for session 2.
Of the 60 participants, 29 were female. Their ages ranged from 18 to 42 years (median = 23, mean = 24.2, std = 4.78). Because participants' English proficiency could affect the quality of the SD they write, we asked them to self-report their English proficiency on a 4-point scale (elementary, intermediate, advanced, native) for four skills—writing, reading, speaking, and listening—for post hoc analysis. We were particularly interested in writing proficiency. N = 28 participants rated themselves as native writers; N = 20 and N = 12 reported advanced and intermediate writing proficiency, respectively. Nobody self-reported as elementary. We grouped the advanced and intermediate English writers together and classified them as non-native (leaving us with N = 32 non-native English writers). Native English writers were distributed roughly evenly across conditions: N = 11 in the control condition, N = 9 in the automatic-feedback condition, and N = 8 in the human-feedback condition.
4.3 Sighted Reviewer for Human-feedback
We recruited one sighted research staff member, who is not a co-author of this paper, to be the reviewer for the human-feedback condition. The reviewer's task was to give feedback on the participants' SD after the participants completed session 1. We recruited a reviewer who had prior experience working with us—someone whom we could trust to be a dedicated and motivated reviewer—to control the quality of the reviews. We instructed the reviewer on the quality dimensions that explain what makes good SD. Neither the describers nor the reviewer was exposed to the ground truth SD that accompanied the three videos.
4.4 Blind Evaluator
Using the quality dimensions, a fully blind evaluator from our research team assessed the quality of the SD that our participants authored. The blind evaluator watched the videos augmented with participant-authored SD. If the evaluator approved a quality facet after finishing a video, he gave the SD a '1'; if he rejected the quality, he gave it a '0'. The blind evaluator used eight of the nine codes (descriptive, sufficient, clarity, objective, succinct, referable, learning, and interest). We omitted the accurate dimension because the blind evaluator could not judge it without visually inspecting the video. The blind evaluator also assessed SD quality at the scene level, but we refrain from reporting that result in this paper due to the word-count limit. We randomized the sequence of videos to be evaluated to minimize potential bias across review conditions and sessions. In total, the blind evaluator checked the quality of 300 sets of SD (20 participants × 3 videos × 5 sets of SD from different feedback types and sessions). We did not expose the evaluator to the ground truth SD before he evaluated the participant-provided SD because we did not want to influence his perception of what constitutes good SD.
Though we used the codebook to let the evaluator assess SD quality objectively, having only one blind evaluator might introduce subjectivity and individual bias into the quality evaluation, which might reduce the reliability of the result. To gauge the level of individual bias of the blind evaluator, we recruited an additional blind evaluator via a local accessibility organization. Following the same procedure as the first evaluator, the additional evaluator classified SD as "approve" or "reject" for each quality. Instead of evaluating all 300 sets of SD, however, the additional evaluator assessed 90 SD—30% of what the first evaluator assessed. We then calculated the agreement between the two evaluators for these 90 SD.
To measure the consistency between the two evaluators, we computed Cohen's kappa (κ) for the evaluation results of the eight quality dimensions [13]. The results were: (descriptive, sufficient, clarity, objective, succinct, referable, learning, interest) = (0.71, 0.42, 0.07, 0.67, 0.79, 0.52, 0.67, 0.05). We observed substantial agreement (κ > 0.6) for descriptive, objective, succinct, and learning, and moderate agreement (κ > 0.4) for sufficient and referable, showing that the two evaluators assessed these qualities consistently with minimal individual bias. However, the agreement for clarity and interest was low, suggesting these qualities are susceptible to individual bias.
These results suggest that the first evaluator's evaluation of the 300 SD is not overly subjective for most quality dimensions. We discuss the qualities with low reliability carefully at the end of the paper.
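For reference, per-dimension agreement can be computed directly from the two evaluators' binary approve/reject vectors, for example with scikit-learn (the vectors below are illustrative, not the study data):

```python
# Sketch: inter-evaluator agreement per quality dimension using Cohen's kappa.
# The approval vectors are illustrative, not the study data.
from sklearn.metrics import cohen_kappa_score

evaluator_1 = {"descriptive": [1, 1, 0, 1, 0, 1], "clarity": [1, 0, 1, 1, 1, 0]}
evaluator_2 = {"descriptive": [1, 1, 0, 0, 0, 1], "clarity": [0, 1, 1, 1, 0, 0]}

for dimension in evaluator_1:
    kappa = cohen_kappa_score(evaluator_1[dimension], evaluator_2[dimension])
    print(f"{dimension}: kappa = {kappa:.2f}")
```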
4.5 Procedure
At the beginning of the study, we explained to the sighted participants the basic concept of audio descriptions and SD (e.g., how they can support people who cannot see videos), the motivation of the study, and the tasks. We explained how to use our system, the topic of each video, and the quality dimensions of SD. We asked participants in the automatic-feedback condition to author and revise their scene descriptions simultaneously as they received the interactive word cloud feedback. We asked participants in the control and human-feedback conditions to write SD in session 1; after they completed session 1, we scheduled them for session 2. Between the two sessions, the sighted reviewer wrote comments on the SD written by participants in the human-feedback condition. The participants then returned for session 2 and revised their SD. We did not disclose the reviewer's identity to the participants to minimize bias in the perceived importance of the comments. Each participant authored SD for all three videos, and we counterbalanced the order of the videos to minimize learning effects. We ended the study with a semi-structured interview to gauge the participants' perceptions of the interface.
6 Statistical Analysis of SD Quality
To further investigate the effects of automatic-feedback and human-feedback on SD quality, we performed a confirmatory statistical analysis. We compared SD quality between conditions in which describers received feedback (automatic-feedback and human-feedback (session 2)) and those in which they did not (control (session 1) and human-feedback (session 1)). We formed the no-feedback group by pooling the control and human-feedback conditions' session 1 data; this was deemed appropriate because the approval data from both conditions represent the quality of SD that describers wrote without any feedback. We disregarded the control (session 2) data in the confirmatory analysis because we observed minimal difference between session 1 and session 2 data in the control group.
The dichotomous nature of our response variable (approve vs. reject) prevented us from using statistical tools such as the t-test and ANOVA that are commonly used in the HCI literature: these tests assume an outcome drawn from a near-normal distribution ranging from negative to positive infinity, which was not the case for our data. Moreover, the strong random effect of video type that we observed in the exploratory data analysis led us to use statistical tools that can handle not only a fixed effect but also a random effect of the video stimuli. Incorporating the random effect separates the videos' effect from the main effect, ruling out its potential confounding effect.

Potentially suitable alternatives are generalized linear mixed models (GLMM) and Bayesian hierarchical models. Both techniques can capture a wide variety of outcome variables (e.g., count data, binary outcomes). For example, a binary response of approve vs. reject can be modeled using (Bayesian) logistic regression with feedback as the main effect and video type as a random effect. The latter, however, also allows us to interpret the outcome using posterior distributions. Increasingly, researchers both within HCI [27] and outside it [30] recommend posterior distribution-based analysis over discussion based solely on p-values. Bayesian hierarchical models not only allow us to compute a p-value-equivalent statistic, such as the probability of superiority [32], but also let us observe the practical differences between conditions. Thus, we use Bayesian hierarchical models for the confirmatory analysis. The probability of superiority is the probability that a data point drawn at random from one model is larger than one drawn at random from another model; the higher (or lower) the value, the more likely it is that the first model's estimated parameter is larger (or smaller) than the second one's.
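As a worked example, the probability of superiority can be estimated directly from two sets of posterior samples (the draws below are illustrative, not the study's posteriors):

```python
# Sketch: probability of superiority from posterior samples.
# The two sample arrays are illustrative, not the study's posteriors.
import numpy as np

rng = np.random.default_rng(0)
posterior_a = rng.normal(loc=0.8, scale=0.3, size=4000)  # e.g., a feedback effect
posterior_b = rng.normal(loc=0.2, scale=0.3, size=4000)  # e.g., a baseline effect

# Probability that a random draw from A exceeds a random draw from B.
prob_superiority = np.mean(posterior_a[:, None] > posterior_b[None, :])
print(f"P(A > B) = {prob_superiority:.2f}")
```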
6.1 The Effects of Feedback on SD Qualities
Because the perceived quality of SD for the instructional video is much lower than the perceived quality of SD for the other two, naively pooling the data from all videos could prevent us from understanding the true effects of feedback condition on perceived SD quality. Thus, we created two models to analyze the effects of feedback: (i) the first model used data only from the advertisement and explainer videos, and (ii) the second model combined data from all three videos.
To model the approvals and rejections by the blind evaluator, we constructed logistic regression models in which the outcome follows a Bernoulli distribution—a discrete probability distribution used to model dichotomous outcomes. We had three feedback conditions as the main effect: no-feedback, automatic-feedback, and human-feedback. We treated video type as a random effect; that is, the advertisement and explainer videos serve as the random effect levels in the first model, and all three videos in the second model. We present the plate diagrams of the models in Appendix 4.
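A minimal sketch of such a model, written here with PyMC; the priors and parameterization are our assumptions, and the paper's exact specification is given in its Appendix 4.

```python
# Sketch of a Bayesian hierarchical logistic regression: feedback condition as
# the main (fixed) effect, video type as a random intercept. Priors and
# parameterization are assumptions; the paper's exact model is in its appendix.
import numpy as np
import pymc as pm

# Illustrative data: 0/1 approvals, condition index (0=no, 1=automatic, 2=human),
# and video index (0=advertisement, 1=explainer, 2=instructional).
approved = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
condition = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 1, 2, 0])
video = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0.0, sigma=2.5, shape=3)        # condition effects
    sigma_video = pm.HalfNormal("sigma_video", sigma=1.0)
    video_intercept = pm.Normal("video_intercept", mu=0.0,
                                sigma=sigma_video, shape=3)      # random effect
    logit_p = beta[condition] + video_intercept[video]
    pm.Bernoulli("approved", logit_p=logit_p, observed=approved)
    trace = pm.sample(2000, tune=1000, target_accept=0.9)
```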
Consistent with the visual exploratory data analysis, our analysis excluding the instructional video found that both automatic-feedback and human-feedback improved descriptiveness and learning. This suggests that the additional information describers obtained from reviews (whether automatic or human feedback) helped them improve the details of their scene descriptions. When we included the data from the instructional video, the evidence became weaker: the probability of superiority for automatic-feedback remained high (p.s. = 0.93), but the value shrank for descriptiveness (p.s. = 0.86). This is likely because the uninformative representative labels for the instructional video prevented our participants from gaining information useful for improving SD.
Human-feedback improved sufficiency, but it seems to have negatively impacted the objectiveness of SD. We suspect that these two qualities were in a trade-off relationship: the SD's objectiveness suffered in the human-feedback condition as describers used more terms and phrases to increase sufficiency. In contrast, the high probability of superiority for objectiveness (p.s. = 0.93) suggests that automatic-feedback improved the objectiveness of SD, although we did not see any evidence of automatic-feedback improving sufficiency. As we show in the linguistic analysis below, automatic-feedback encouraged describers to write syntactically simpler SD, which may have reduced the chance of the SD being perceived as subjective.
Our results showed that clarity, succinctness, referability, and interest did not improve with either human-feedback or automatic-feedback. Fig. 8 shows that succinctness, referability, and interest were relatively high across conditions (including no-feedback). Since our participants wrote SD that were already good in terms of these qualities, there was not much room for improvement. We also note that succinctness was influenced by the succinctness visualization, which gave feedback to participants in all conditions. Clarity did not improve; we speculate that this quality and sufficiency were bounded by the limited space participants had for writing SD (i.e., the duration between dialogues in a video available for describing a scene).
6.2 The Effects of Video Types
We used the hierarchical model built on the data from all three video stimuli for this analysis. We visualized the posterior distributions of the random effects on the model's intercepts (Fig. 9) and computed probabilities of superiority by comparing the posterior distributions of the videos' random effects.
The results suggest that authoring descriptive and clear SD was easier for the advertisement and explainer videos. Compared against the instructional video, the probabilities of superiority were 0.92 and 0.89 for descriptiveness, and 0.90 and 0.85 for clarity. Though the results are not "statistically significant," the shift in intercept could have affected the approve/reject outcome with practical significance. Somewhat surprisingly, the random effects of video were weaker for qualities such as referability, learning, and interest for the instructional video; statistically speaking, video type did not affect these qualities much.
The instructional video had positive random effects on the objectiveness and succinctness of the SD compared to the advertisement and explainer videos. This result is consistent with the exploratory analysis of the effects of video, in which we observed that the perceived quality for these dimensions was higher for the instructional video. We speculate that this is because the instructional video's scenes consisted of precise origami-folding steps, leaving less room for subjective interpretation.
The results show that the difference between the advertisement video and the explainer video is marginal, which is congruent with the exploratory data analysis. But the results reify the need to control for the effects of video type, especially for the instructional video that we used.
7 Analysis of Linguistic Characteristics
The statistical analysis suggested that automatic-feedback and human-feedback affected the perceived quality of SD. But the quantitative nature of the analysis prevented us from understanding which linguistic characteristics affected the perceived quality. In this section, we analyze the effects of feedback on the linguistic characteristics of SD through a two-step process: (1) linguistic profiling of text using Profiling-UD [12], and (2) dimensionality reduction using Principal Component Analysis (PCA) [3].
Manually identifying linguistic properties of written documents is subject to systematic bias and is error-prone. Instead, we identify the linguistic characteristics of SD through a method called linguistic profiling [7]. The process extracts quantitative linguistic features, such as lexical variety, syntactic relations, and morphological information, from written text. The method has been used to characterize differences between sets of documents [14, 34] and to track the evolution of written language [35]. Here, we use it to assess differences in the linguistic properties of SD across feedback conditions.
For each video, we concatenate the SD from all scenes that one person authored to form an SD document. We then extract the linguistic features of the SD document using Profiling-UD. Profiling-UD extracts 119-dimensional linguistic features that quantify seven types of linguistic profiles from a given document: raw text properties (4 dimensions), lexical variety (4 dimensions), morphosyntactic information (37 dimensions), verbal predicate structure (10 dimensions), global and local parse tree structures (15 dimensions), syntactic relations (41 dimensions), and use of subordination (8 dimensions)—see the original paper [12] and the table in Appendix 5 for more details.
Inspecting 119-dimensional feature values for each document is tedious and uninformative. Instead, we aggregate the values into seven scalar values, each summarizing one type of linguistic characteristic, such as lexical variety or verbal predicate structure. We flatten the multi-dimensional feature values for each linguistic characteristic to a scalar using PCA [3], a statistical procedure that converts a multi-dimensional vector into low-dimensional principal components that best explain the variability of the data.
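A minimal sketch of this aggregation, assuming the 119 Profiling-UD features have been loaded into a data frame and mapped to their seven groups (the column names in the example mapping are illustrative):

```python
# Sketch: collapsing each group of Profiling-UD features to one scalar per
# document with a 1-component PCA. Column names and grouping are illustrative.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def summarize_feature_groups(features: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """features: documents x 119 Profiling-UD features.
    groups: e.g., {"lexical_variety": ["ttr_form", "ttr_lemma", ...], ...}."""
    summary = {}
    for group_name, columns in groups.items():
        scaled = StandardScaler().fit_transform(features[columns])
        summary[group_name] = PCA(n_components=1).fit_transform(scaled).ravel()
    return pd.DataFrame(summary, index=features.index)
```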
We show strip plots of the distributions of the linguistic characteristics in Fig. 10. The results suggest that SD in the automatic-feedback condition differed in characteristics from those in the no-feedback condition. We observed strong differences in raw text properties, lexical variety, verbal predicate structure, parse tree structure, and use of subordination, which we detail below.
Raw text property. The raw text properties represent document length, sentence and word length, and the number of characters per token. SD in the automatic-feedback condition scored higher than SD in the human-feedback and no-feedback conditions, indicating that documents written by participants in the automatic-feedback condition were longer than those written by participants in the no-feedback condition. On average, across the three videos, SD in the automatic-feedback condition had 137 words and SD in the no-feedback condition had 120 words (human-feedback: 131 words).
Lexical variety. Lexical variety represents the number of lexical types and the number of tokens within the text. The SD in the automatic-feedback condition had higher lexical variety than those in the no-feedback and human-feedback conditions, showing that participants in the automatic-feedback condition used a greater variety of unique words to describe scenes.
Verbal Predicate Structure. The verbal predicate structure represents the distribution of verbal heads and roots in a sentence. The SD in the automatic-feedback condition scored higher on verbal predicate structure than those in the no-feedback and human-feedback conditions. This result suggests that SD in the automatic-feedback condition were more likely to use base words than those in the no-feedback and human-feedback conditions.
Global and Local Parse Tree Structure. This characteristic represents the structure of the syntactic parse tree, such as its depth; the higher the value, the deeper and more complex the parse tree. The SD in the no-feedback condition had lower values than the SD in both the automatic-feedback and human-feedback conditions, indicating that participants wrote SD with simpler sentence structures in the no-feedback condition. For example, one participant in the automatic-feedback condition wrote, "A woman walks along the streets in London, looking at street signs," while another participant in the no-feedback condition wrote, "A road sign is displayed" (the latter's syntactic structure is simpler, with a shallower parse tree).
Use of Subordination. Use of subordination represents the proportion of subordinate clauses relative to main clauses in the text. SD in the automatic-feedback condition made less use of subordination than those in the other conditions. For example, one participant in the human-feedback condition frequently used sentences with subordinate clauses, writing, "... With clear contrast on her phone, she understands where she is heading..." (subordinate clause underscored). In contrast, a participant in the automatic-feedback condition used shorter chunks of main clauses ("...The lady takes out her mobile phone for navigation. She continues to her destination using her mobile for navigation..."). We discuss the implications of these results in the Discussion section.
8 Thematic Analysis of the Interview Result
We conducted a short semi-structured interview with the participants who received either human or automated feedback while authoring SD. To understand their subjective impressions of the feedback mechanisms, we performed a thematic analysis of the interviews: we transcribed them, and one member of our research team iteratively coded the transcripts. Human feedback was preferred by more participants, but some found the automatic feedback useful.
8.1 Human Feedback
Nine out of twenty participants explicitly mentioned that human feedback benefits SD authoring. Participants repeatedly mentioned that it pointed out scenes they had missed describing and inaccurate information. For instance, H9 noted:
“I am not an experienced describer to begin with. I am not sure what I should describe. Like, should I describe everything I see? [...] However, the feedback guides me on what I should include and what not [to describe].” (H9)
While no participants made explicitly negative comments about the human feedback, three mentioned the challenge of incorporating the feedback within the limited space. One noted:
“I think having feedback is good because it helps me to spot things I missed in my previous iteration. However, I am not sure if the reviewer knows that we have a very limited space to incorporate those comments in our descriptions. When the feedback makes the description becomes too long, then the feedback becomes not helpful anymore.” (H5)
8.2 Automated Feedback
Six participants found the automatic feedback useful. For example, three noted that, with the labels provided in the feedback, they included missing items that they had not initially thought of. A8 noted:
“When I write down the SD initially, I tend to look at the feedback to see if I miss anything in the scene. It gives me some ideas on what I should see in the scene and if I see some label might be relevant and I miss it, I will include them in my description.” (A8)
Three participants had a negative impression of the automatic feedback. They noted that it did not help them author SD because the feedback contained too many labels that were generic, repetitive, or irrelevant. For instance, A14 criticized the automatic feedback for mostly covering objects and found it unhelpful:
“To be honest, I didn’t rely on the feedback too much because the feedback is all about the objects and many of them are duplicated, which made the suggested labels to be irrelevant to the focus of the scene. I think the most important thing is the action in the scene, but they are very limited in the feedback.” (A14)
9 Discussion
9.1 Effects of Feedback
Our analysis with the blind evaluator suggested that automatic feedback, enabled by combining scene recognition and natural language processing, could support novice describers in improving their SD after controlling for the effects of the videos. Specifically, we observed a significant improvement in the descriptiveness and learning aspects of SD and a strong positive trend in objectiveness. We also showed that automatic feedback reduces the time cost of producing SD by 45% by removing the manual review process. However, human feedback had a stronger effect on improving SD quality: except for objectiveness, the SD authored by those who received human feedback had qualities equal to or better than those authored with automatic feedback. Thus, the choice between employing human reviewers and relying on automatic feedback remains a trade-off. This insight suggests exploring a hybrid approach between human-feedback and automatic-feedback: for example, first train novices with human feedback, then slowly transition to automatic feedback once they are more familiar with the scene description authoring task.
The analysis of the video types' random effects showed that novice describers found writing SD for our chosen instructional video the hardest. This is perhaps not a surprise, as the video was carefully selected as a challenging case for our automatic feedback. However, the wide range in quality evaluation responses across video types calls for a mechanism to classify video types a priori. Such a mechanism would help us make an informed decision about, say, whether to hire professional audio describers or to use an automatic approach like ours for composing AD.
We showed the benefit of automatic feedback in reducing SD authoring time with a research staff member as the reviewer. But we also considered recruiting a professional audio describer as a reviewer, since such a person might be faster at reviewing SD. To see the effect of having a professional describer as a reviewer, we recruited one audio describer through a personal connection and asked her to review six sets of SD (2 authors × 3 videos). On average, she took 57 minutes per SD (39.06 minutes per video minute) to review, compared to an average of 11.65 minutes per SD (8.71 minutes per video minute) for the research staff member. She wrote reviews of 423 words per SD, compared to the research staff member's 198 words per SD. The result suggests that expertise in composing audio descriptions does not make one quick at reviewing. The longer reviews might further improve SD quality, but they would not be a realistic upper baseline due to cost; moreover, even the research staff member's feedback helped improve SD quality.
Our linguistic profiling captured differences in lexical and syntactic patterns between SD in the no-feedback and automatic-feedback conditions. The captured differences may explain why the blind evaluator perceived the SD the way he did. For example, the linguistic profiling suggested that SD in the automatic-feedback condition tended to be longer, more informative (e.g., higher lexical variety), and more complex (e.g., deeper parse trees), and the blind evaluation showed that SD in the automatic-feedback condition tended to be more descriptive. Thus, an SD's length and its lexical and syntactic complexity and variety may explain its perceived descriptiveness. While our analysis only allows us to discuss a potential association, future work could investigate a causal relationship.
9.2 Future Directions for Automation
Though we showed the benefits of our method, it is just one way of automating SD authoring. A natural question to ask is, "In what other ways can we automate the process so that we can generate AD more efficiently?" One direction is to rely on more advanced (future) computer vision technology for scene recognition to produce SD automatically. We tried to generate SD in a fully automated way using MART, a deep neural network model for video scene understanding [31], but a manual inspection of the generated descriptions suggested that the output was incoherent, so we turned to the authoring method described in this paper. Automatic video scene understanding is a quickly advancing field, however; a recent report suggests that a newer model significantly outperforms prior work on video description tasks [40]. Future work should investigate ways to incorporate such cutting-edge video understanding technology with or without humans in the loop.
Another area where automation could improve the approach described in this paper is the real-time processing of user-provided SD and the selection of representative labels. Our system used knowledge graph-based text matching and label pruning, and the method is not perfect. For example, there was a cluster in which we had to choose a label from road, sidewalk, and zebra crossing; our algorithm chose zebra crossing as the representative label, but in such a case we might want to choose sidewalk or mention all three labels. A better algorithm could improve the selection of labels and thereby increase the usability of our system.
9.3 UI Design Implications
The way we presented the feedback was governed by the structure of the output we obtained from Amazon Rekognition. We could improve the way we give feedback (e.g., by signifying important representative labels), but we could also consider other ways of conveying feedback. Commercial products like Grammarly provide inline feedback that is pleasant to read [22]. Or we could adopt prior work that has proposed more interactive ways of dynamically giving feedback in language translation (e.g., [23]); similar to Yuksel et al., we could present automatically written SD as a placeholder and let the user post-edit it. Future research could investigate which feedback mechanisms work best for writing SD. We are also curious whether providing more generic review comments to describers can have a positive effect: in addition to showing a word cloud, we could compose a series of short tips that imitate human feedback.
While our study design prevented us from combining human feedback and automatic feedback, we can imagine a system that presents the describer with feedback from both a human and an automatic system. The describer could get quick feedback from the automated system and more detailed human feedback when necessary. The describer could then weigh the reviews to balance objectiveness, where automation excelled, against the other qualities, where human review was better.
Another observation surfacing from the study is that the majority of the SD were not deemed sufficient. We suspect this is likely due to the lack of space (the time available between speaking scenes) to provide long, detailed descriptions. While interpretation of the sufficient quality requires caution due to its moderate reliability, the interview results also support this. One intriguing future direction is the design of user interfaces that put the audience—the blind user—in control of the AD, not just the describers. For example, an interface that allows the user to customize and decide the preferred length of AD dynamically at runtime might benefit them; the system could then adjust the amount of information in the SD based on the audience's preferred length. Future work could study blind and low-vision people's preferences at a larger scale to inform the design of such interfaces and their views toward personalized and customizable AD.
10 Limitations
Our automatic feedback mechanism relies on computer vision output of subpar accuracy, and the black-box computer vision API makes it hard to assess what made scene understanding difficult. Yet, the participants still managed to improve the quality of their scene descriptions with the automatic feedback.
The sighted reviewer was a research staff member and not a professional audio describer. However, she was well versed in giving feedback, having studied the codebook that we used. In an exploratory comparison, we also saw that the research staff member was faster at giving reviews than a professional describer.
We had only one blind evaluator who evaluated the full set of SD, but we observed high agreement between two blind evaluators on a subset of the SD. Even so, we caution that the results should be interpreted carefully, particularly for the quality dimensions clarity and interest, where we observed low agreement between the two evaluators. To mitigate the potential influence of individual bias in evaluation, we triangulated our evaluation using multiple evaluation methods; thus, we believe the current work remains informative.