In this section we present baseline performance for both endpointer and ASR models on our dataset, as well as results from the three interventions that address issues raised in the survey: (1) endpointer tuning to reduce how often PWS are cut off by VAs, (2) ASR decoder tuning to improve recognition of stuttered speech, and (3) refining the ASR model's transcribed output to remove dysfluencies and improve dictation experiences for PWS. We also place these findings in historical context by investigating how ASR performance on speech from PWS has changed over the past five years, using archived consumer-grade ASR models that were publicly available from 2017 to 2022.
The baseline models are from the Apple Speech framework [7], which uses a hybrid deep neural network architecture for the ASR system. See [38] for ASR model details; at a glance, the system is composed of an acoustic model, a language model, and a beam search decoder. The acoustic model maps audio to phone- or word-like intermediate representations, the language model encodes the probability of word sequences and acts as a prior over what words or phrases someone may have said, and the beam search decoder efficiently computes candidate transcriptions.
4.2.1 Endpointer Model Performance.
An endpointer model identifies when the user stops speaking, and it must balance the desire for a low truncation rate (i.e., the percentage of utterances that are cut off too early) against the desire for a minimal delay after speech (i.e., the time from the end of the utterance to when the VA stops listening). Our base model is trained on completed utterances from the general population and predicts the end of a query using both auditory cues, such as how long the user has been silent, and ASR cues, such as the likelihood that a given word is the final word of an utterance. For each input frame (time window), the model outputs the likelihood that the utterance is complete. Once the output exceeds a defined threshold, the system stops listening and moves on to the next phase of processing.
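To make the frame-by-frame decision concrete, below is a minimal sketch of this thresholding logic; the function name, frame scores, and threshold value are illustrative and not drawn from the production system.

```python
from typing import Iterable, Optional

def endpoint_frame(frame_scores: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first frame whose end-of-utterance likelihood
    exceeds the threshold (i.e., where the system would stop listening),
    or None if the endpointer never fires."""
    for i, p_end in enumerate(frame_scores):
        if p_end > threshold:
            return i
    return None

# Illustrative per-frame likelihoods; values rise as the post-speech pause grows.
scores = [0.01, 0.02, 0.05, 0.40, 0.75, 0.93, 0.98]
print(endpoint_frame(scores, threshold=0.90))  # -> 5
```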
Baseline endpointer. The baseline likelihood threshold for the model we used was set such that 97% of utterances from general population data are endpointed correctly (i.e., a truncation rate of 3%). When evaluated on our data from PWS, which is likely poorly represented in general population data, a much higher portion of utterances is truncated early. Across the 41 Phase 2 participants, the baseline endpointer model truncates on average 23.8% of utterances (SD=19.7, Median=16.8, IQR=29.0); see Figure 4(a). Truncation rates also vary substantially per person: for 7 of 41 participants, over 50% of utterances are cut off early. This high truncation rate reflects the survey findings, where a majority of participants reported that early truncation was a key issue.
Our hypothesis, based on the literature and our understanding of stuttering, is that blocks, which are often expressed as inaudible gasps, cause most endpointing errors. We validated this using the dysfluency annotations by computing Spearman's rank correlations, and found that truncation rates are significantly positively correlated with rates of blocks (r(39) = .64, p < .001), part-word repetitions (r(39) = .44, p = .004), and interjections (r(39) = .48, p = .002). Correlations with prolongations (r(39) = .17, p = .301) and whole word repetitions (r(39) = .18, p = .271) are not statistically significant.
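As an illustration of this analysis, the sketch below computes one such correlation with SciPy; the per-participant rates shown are placeholders, not our data.

```python
from scipy.stats import spearmanr

# Placeholder per-participant values: fraction of utterances truncated early,
# and rate of blocks per utterance, for the same participants in the same order.
truncation_rates = [0.17, 0.52, 0.08, 0.31, 0.12]
block_rates = [0.20, 0.61, 0.05, 0.35, 0.10]

rho, p_value = spearmanr(truncation_rates, block_rates)
# Degrees of freedom in the reported r(df) notation is n - 2.
print(f"r({len(truncation_rates) - 2}) = {rho:.2f}, p = {p_value:.3f}")
```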
Tuned endpointer threshold (Intervention #1). To improve the VA experience for PWS, we investigate the performance of three new thresholds that could reduce truncations for people with different rates of dysfluent speech. Specifically, using the Phase 1 data, we compute three new, higher threshold values that target an average 3% truncation rate for participants with different levels of A&H severity: a mild threshold based on the 25 Phase 1 participants with mild ratings, a moderate threshold based on the 18 moderate participants, and a severe threshold based on the 7 severe participants. By definition, increasing the threshold will always reduce, or at worst maintain, the truncation rate relative to a lower threshold, at the cost of a longer delay before the system responds. We evaluate these thresholds on the Phase 2 participant data.
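One plausible way to calibrate such a threshold from annotated data is to look, for each utterance, at the highest end-of-utterance likelihood the model produced before speech actually ended (the score that would cause a premature cut-off), and then place the threshold at roughly the 97th percentile of those maxima so that about 3% of utterances would still be truncated. The sketch below follows that idea; it is an assumption-laden illustration, not the exact tuning procedure used.

```python
import numpy as np

def tune_threshold(pre_end_max_scores, target_truncation_rate=0.03):
    """Choose a threshold so that roughly `target_truncation_rate` of utterances
    would be cut off early.

    `pre_end_max_scores[i]` is the maximum end-of-utterance likelihood the
    endpointer produced for utterance i *before* its annotated end of speech;
    that utterance is truncated whenever the threshold falls below this maximum.
    """
    return float(np.quantile(pre_end_max_scores, 1.0 - target_truncation_rate))

# e.g., separate thresholds per A&H severity group (placeholder variable names):
# mild_threshold = tune_threshold(mild_pre_end_max_scores)
# moderate_threshold = tune_threshold(moderate_pre_end_max_scores)
# severe_threshold = tune_threshold(severe_pre_end_max_scores)
```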
As shown in Figure 4(a), the new thresholds substantially reduce the truncation rate for PWS compared to the baseline. Even the smallest threshold increase, "mild", reduces the truncation rate to a per-participant average of 4.8% (SD=6.4, Median=1.5, IQR=6.5), while the moderate threshold achieves our goal of under 3% on average (M=2.5%, SD=3.6, Median=0.8, IQR=3.1). Table 2 shows these results averaged across all Phase 2 utterances and includes metrics that capture how much delay occurs after the user finishes speaking. Here, P50 and P95 refer to the 50th (median) and 95th percentile delay between when a speech utterance ends and when the system stops listening. This analysis shows that the improvements in truncation rate come with a modest median delay of an additional 1.2 seconds over the baseline in the mild case and a 1.7-second increase in the moderate case. While the severe threshold is successful for 99.2% of utterances, it causes a median delay of over 3 seconds. We return to these tradeoffs in the Discussion.
4.2.2 Baseline & Improved ASR Model Performance.
Next, we evaluated baseline ASR performance and compared the results to an approach that tunes the ASR decoder using dysfluent speech. For this analysis, we use the Apple Speech framework [7] and report results from an ASR model trained on voice assistant tasks and speech from the general population. For completeness, we also examined baseline performance with a model trained on dictation tasks; the pattern of results was similar, so those results are omitted for clarity.
Our primary evaluation metric is Word Error Rate (WER), which is widely used within the speech recognition community. Word Error Rate is computed by counting the number of substitutions, insertions, and deletions in the transcribed text and dividing by the total number of intended words. For example, the intended phrase "Add apples to my grocery list" may be spoken "A-(dd) a-(dd) add apples to my grocery list" and be recognized as "had had balls to my grocery list". Against the six intended words, the minimum-edit alignment yields one insertion and two misrecognized (substituted) words, for a WER of 3/6 = 50.0%.
For some experiments, we also look at Thresholded WER, as used in work by Project Euphonia [33, 65], to assess VA performance for people with speech disabilities. This is computed as the percentage of utterances, per person, with a WER below 10% or 15% (as specified); these values have been suggested as potential minimums for VAs to be useful, depending on the domain. Lastly, we look at Intent Error Rate (IER), which captures whether the VA carries out the correct action in response to an utterance. To compute Intent Error Rate, we run the ASR output through the NLU model from Chen et al. [16], which also relied on models used by the Apple Speech framework.
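For reference, a minimal sketch of the WER computation via word-level edit distance is shown below; running it on the grocery-list example above reproduces the 50% figure, and Thresholded WER then simply counts, per person, the fraction of utterances whose WER falls below the chosen cutoff.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed via a word-level Levenshtein alignment."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)

def thresholded_wer(wers, cutoff=0.10):
    """Fraction of a person's utterances whose WER falls below the cutoff."""
    return sum(w < cutoff for w in wers) / len(wers)

print(word_error_rate("Add apples to my grocery list",
                      "had had balls to my grocery list"))  # -> 0.5
```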
Baseline ASR. Table 3 shows baseline ASR performance. Across participants in both phases, the average baseline WER is 19.8%, which is much higher than the ∼5% reported for consumer VA systems [62, 68, 69]. The WER distribution is also highly skewed, with many participants having low WERs but a long tail of participants with much higher WERs. For people with mild A&H severity, the average WER was 4.8%, which is similar to what is expected for people who do not stutter, whereas the moderate and severe groups had average WERs of 13.6% and 49.2%, respectively. Moreover, 84.0% of participants with mild severity had a WER of less than 10% (i.e., Thresholded WER), and the average Intent Error Rate for this group was 4.9%, suggesting that many people with mild A&H severity ratings would likely be able to use off-the-shelf VA systems; this echoes the VA usage reported in the survey and presented in Figure 3 (left). For the moderate and severe grades, Intent Error Rates are 7.3% and 18.4%, respectively. Overall, this analysis both confirms the ASR accuracy difficulties described in the survey and reflects the varied experiences that survey participants reported in how well VAs understood them (Figure 2).
To understand how different types of dysfluencies affect WER, we examined the Phase 2 data, which includes detailed dysfluency annotations (see Section 4.1 for rates of each dysfluency type). Note that Phase 2 only included participants with moderate and severe A&H ratings, so its average WER (25.4%) is somewhat higher than that of the full set of participants (bottom row of Table 3). For these Phase 2 participants, we found high Spearman's rank correlations between WER and part-word repetitions (r(39) = .85, p < .001) and between WER and whole word repetitions (r(39) = .60, p < .001). The correlations were not significant for prolongations (r(39) = .30, p = .056), blocks (r(39) = .21, p = .181), or interjections (r(39) = .15, p = .347). This indicates that part-word and whole word repetitions tend to increase word error rates, and that blocks, prolongations, and interjections have less of an impact even when they are frequent.
Among all errors the ASR system made on Phase 2 data, 80.9% were word insertions, 17.5% substitutions, and only 1.6% deletions. The rate of word insertions is strongly correlated with part-word repetitions (r(39) = .73, p < .001), followed by whole word repetitions (r(39) = .60, p < .001). This echoes survey reports that part-word or whole word repetitions can lead to misrecognized words being inserted into the transcription. A trained speech-language pathologist characterized insertion errors arising from part-word repetitions and found that, in many cases, insertions come from individual syllables being recognized as whole words (e.g., the first syllable in "become", vocalized as "be-(come) be-(come) become"). In contrast, a sound repetition on /b/ in "become" is less likely to lead to word insertions. Furthermore, some people produced part-word repetitions between syllables of multi-syllabic words; for example, the word "vocabulary" may result in spurious insertions such as "vocab cab Cavaleri."
ASR Decoder Tuning (Intervention #2). Consumer ASR systems are commonly trained on thousands of hours of speech from the general public, an amount far larger than can likely be obtained from PWS. However, an initial investigation by Mitra et al. [51] on 18 PWS has shown that it may be possible to tune a small number of ASR decoder parameters to improve performance for PWS. Here, we validate this approach on our larger dataset and find even greater gains when tuning on our 50-participant Phase 1 subset. While we defer to that paper for details, in brief, the approach increases the importance of the language model relative to the acoustic component in the decoder and increases the penalty for word insertions. These changes reduce the likelihood of predicting extraneous low-confidence words, which are often caused by part-word repetitions, and bias the model towards more likely voice assistant queries. We used Phase 1 data to tune these parameters and report results on Phase 2.
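To illustrate the kind of parameters involved, the sketch below shows a generic beam-search scoring function with a language model weight and a word insertion penalty; the names and values are illustrative and do not reflect the Apple Speech framework's actual parameters.

```python
def hypothesis_score(acoustic_logprob: float,
                     lm_logprob: float,
                     num_words: int,
                     lm_weight: float = 1.0,
                     word_insertion_penalty: float = 0.0) -> float:
    """Score a candidate transcription during beam search.

    Raising `lm_weight` biases decoding toward word sequences the language
    model finds likely; raising `word_insertion_penalty` discourages adding
    extra low-confidence words, such as spurious insertions triggered by
    part-word repetitions.
    """
    return acoustic_logprob + lm_weight * lm_logprob - word_insertion_penalty * num_words

# Purely illustrative comparison of baseline vs. tuned decoder settings:
baseline_score = hypothesis_score(-42.0, -12.0, num_words=7)
tuned_score = hypothesis_score(-42.0, -12.0, num_words=7,
                               lm_weight=1.5, word_insertion_penalty=0.8)
```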
With the tuned decoder, the average WER for Phase 2 participants drops from the baseline of 25.4% to 12.4% (SD=12.3, Median=6.1, IQR=13.1), a relative improvement of 51.2%. A Wilcoxon signed rank test shows that this improvement is statistically significant. See Table 4 for more metrics and for a breakdown of how WER improves as a function of dysfluency type. For example, the WER lowers (improves) in 43.2% of utterances with part-word repetitions and increases (worsens) in 3.9% of them. For ASR tuning, WER improves the most in utterances that contain whole word repetitions, part-word repetitions, and interjections, and least for those with prolongations or blocks.
4.2.3 Dysfluency Refinement (Intervention #3).
According to our survey, PWS are often displeased when they see repeated words, phrases, and filler words in their dictated notes and texts. To address this issue, we refine the transcribed text using two strategies. First, we look at filler words such as "um," "eh," "ah," "uh," and minor variations. Many of these fillers are not explicitly defined in the language model (i.e., by design "eh" is never predicted); in practice, however, short fillers are frequently transcribed as the word "oh." As part of our approach, we remove "oh" from predictions unless it is used to represent the number zero. We considered other fillers such as "like" and "you know," which are common in conversational speech, but they did not appear as fillers in our dataset, likely because the utterances tend to be short and more defined than free-form speech. Second, we remove repeated words and phrases. This is more challenging because words may naturally be repeated (e.g., "We had had many discussions"). In our refinement approach, we take all adjacent repeated words or phrases in a transcript and compute the statistical likelihood that they would appear consecutively in text, using an n-gram language model similar to [34]. If the probability is below a threshold, i.e., \(\mathrm{P}(\textsc{substring}_1, \textsc{substring}_2) < \tau\), then we remove the duplicate. Note that these two strategies (interjection removal and repetition removal) can be applied to any ASR model output; we therefore evaluated our dysfluency refinement approach in combination with both ASR models from Section 4.2.2: the baseline model and the model with the tuned decoder.
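A minimal sketch of the two refinement strategies is shown below, restricted to single-word repetitions for brevity; `pair_logprob` is a hypothetical callable standing in for the n-gram language model, and the "oh"-as-zero check is reduced to a crude digit-context heuristic.

```python
from typing import Callable, List

DIGIT_WORDS = {"oh", "zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}

def remove_fillers(words: List[str]) -> List[str]:
    """Drop 'oh' unless it plausibly stands for the digit zero
    (approximated here as being adjacent to another digit word)."""
    kept = []
    for i, w in enumerate(words):
        if w.lower() == "oh":
            near_digit = ((i > 0 and words[i - 1].lower() in DIGIT_WORDS) or
                          (i + 1 < len(words) and words[i + 1].lower() in DIGIT_WORDS))
            if not near_digit:
                continue
        kept.append(w)
    return kept

def remove_repetitions(words: List[str],
                       pair_logprob: Callable[[str, str], float],
                       tau: float = -8.0) -> List[str]:
    """Drop an adjacent duplicate word unless the language model deems the
    repetition likely enough ("had had" survives, a dysfluent "to to" does not)."""
    kept: List[str] = []
    for w in words:
        if kept and kept[-1].lower() == w.lower() and pair_logprob(kept[-1], w) < tau:
            continue  # treat as a dysfluent repetition and remove the duplicate
        kept.append(w)
    return kept
```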
As shown in Figure 4(b), dysfluency refinement reduces WER for the Phase 2 data compared to both the baseline and tuned ASR models. The average WER drops from the baseline of 25.4% to 18.1% (SD=19.2, Median=6.1, IQR=22.8) after dysfluency refinement, a 28.7% relative improvement on average. This improvement is significant under a Wilcoxon signed rank test (W=839, Z=5.43, p<.001). As shown in Table 4, across the entire Phase 2 dataset, 64.7% of utterances that contain whole word repetitions see a WER improvement and only 0.3% regress. For utterances with interjections, WER improves in 30.9% of cases and regresses in none.
Applying dysfluency refinement to the output from the tuned ASR model further reduces WER, from an average of 12.5% with the tuned model alone to 9.9% (SD=8.8, Median=5.3, IQR=10.3) for the tuned model plus dysfluency refinement. Compared to the baseline ASR model, this combination yields an average 61.2% improvement across participants. Wilcoxon signed rank tests show that these improvements are statistically significant, both when refinement is applied to the baseline ASR model output (W=741, Z=5.37, p<.001) and when it is applied to the tuned ASR model output (W=595, Z=5.08, p<.001). The percentage of participants with WER < 10% increases from 48.8% to 65.9%, and the average Intent Error Rate improves from 10.4% (SD=10.8, Median=5.3, IQR=13.4) to 5.4% (SD=5.8, Median=3.1, IQR=7.2). Such improvements may make a VA usable for many of our participants when the baseline is not; for example, P2-21's WER improves from 40.4% (baseline) to 13.0% (tuned ASR + dysfluency refinement), and their IER improves from 19.1% (baseline) to 4.6%.
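The significance claims above come from paired, per-participant comparisons; a minimal sketch of such a test with SciPy is below, using placeholder WER values (SciPy reports the W statistic and p-value; the Z value would be obtained separately from the normal approximation).

```python
from scipy.stats import wilcoxon

# Placeholder paired per-participant WERs: baseline vs. tuned ASR + refinement.
baseline_wer = [0.404, 0.120, 0.310, 0.052, 0.270, 0.180]
refined_wer = [0.130, 0.090, 0.180, 0.040, 0.110, 0.150]

w_stat, p_value = wilcoxon(baseline_wer, refined_wer)
print(f"W = {w_stat:.0f}, p = {p_value:.3f}")
```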
4.2.4 ASR Results Over Time.
The above analyses assess the effectiveness of three changes to an existing speech recognition system that require little data from PWS to implement. At the same time, there has been substantial progress in general speech recognition models in recent years, with some accounts claiming WERs of 5% or less for consumer voice assistants [69]. In theory, these general improvements could also translate into improvements for PWS.
To understand how these changes over time affect performance on dysfluent speech, we use the Apple Speech framework to run all Phase 1 and 2 data through archived ASR models that had been publicly available between Fall 2017 and Spring 2022; these experiments were conducted in Summer 2022. Figure 5 shows WER across all utterances in the dataset at six-month intervals over the five years. The utterance-weighted WER was 29.5% for the Fall 2017 model, fell consistently over the following eight time periods, and ended at 19.9%, a 32.5% relative reduction in WER from the start to the end of the five-year span. Differences at each time point may be attributed to the data used in training, the convolutional architecture, and/or the language model.
For Phase 2 utterances, we further examined how specific types of dysfluencies manifest in WER performance between the Fall 2017 and Spring 2022 models. WER improved for 43.5% of utterances with part-word repetitions (and worsened for 12.2%), for 46.8% of those with whole word repetitions (worsened for 14.5%), 36.9% with prolongations (worsened for 10.3%), 36.0% with blocks (worsened for 9.4%), and 65.8% with interjections (worsened for 10.4%). The improvements on part-word repetitions are especially interesting, because these errors are more challenging to correct using our strategies. See the Discussion (Section 5) for further implications of these findings.