1 Introduction
Speech input, an auditory-based language processing technique that transcribes acoustic signals into text, stands out as one of the most intuitive and efficient means of engaging with mobile devices. It holds the potential to enhance user comfort and productivity, particularly when conventional input methods like touchscreens and physical keyboards prove inefficient, cumbersome, or inconvenient [89]. Moreover, it serves as a crucial accessibility feature, empowering individuals with limited motor skills to seamlessly interact with mobile technology without reliance on manual dexterity. This functionality also proves invaluable for those experiencing situationally-induced impairments and disabilities (SIID), a category encompassing instances where hand use is restricted due to concurrent tasks, glove-wearing, or minor injuries [92]. However, it is important to acknowledge that while speech input excels in numerous interaction scenarios, its suitability may be compromised in situations characterized by high ambient noise levels, privacy and security considerations, or pre-existing speech impairments [26, 27].
Silent speech input, an image-based language processing method that translates users’ lip movements into textual content, presents a promising solution to many of these challenges [81]. Its independence from acoustic cues allows for versatile application, thriving even in noisy or sensitive environments like libraries or museums. Moreover, it significantly bolsters privacy and security, given the limited number of individuals skilled in lip reading. In fact, studies indicate that even skilled lip readers can typically comprehend only about 30-45% of spoken English [65], further underscoring the privacy and confidentiality advantages of silent speech. Silent speech further promotes inclusivity by accommodating individuals who are unable to vocalize or have speech disorders, thereby making communication with computers more accessible.
In pursuit of optimal silent speech recognition, researchers have explored various sensor-based techniques, achieving high accuracy in speech transcription [73, 78, 83, 88, 107]. However, these approaches often entail invasive, unwieldy, and non-portable setups, rendering them impractical in real-world scenarios. Recent endeavors have aimed to harness video-based recognition, commonly referred to as digital lip reading, to facilitate silent speech communication [3, 9, 16, 18]. Yet, many of these models are primarily tailored for high-performance computing devices, like desktop computers [3, 9, 77]. Even with ample computational resources, these models exhibit sluggish response times, susceptibility to errors, and a lack of real-time functionality, making them unsuitable for mobile devices. Additionally, existing models tend to support only a limited, pre-determined vocabulary, hampering their applicability in everyday conversational interactions.
A well-known challenge in developing deep learning models is the demand for substantial training data, particularly datasets tailored to specific vocabularies, which are time-consuming and arduous to build. Thus, the imperative arises to design models robust enough to operate effectively with modest data quantities, without compromising performance. Furthermore, the exploration of interface and feedback mechanisms tailored for silent speech interaction on mobile devices has remained largely unexplored in the literature. A mobile-optimized interface is of paramount importance, as it directly impacts user experience and usability. A swifter, more accurate, real-time silent speech recognition system optimized for mobile devices thus holds the potential to serve as a versatile medium for input and interaction tasks, seamlessly integrating into daily routines.
This paper presents MELDER, a Mobile Lip Reader optimized for performance and usability on mobile devices. The contribution of the work is five-fold. First, it develops a new real-time silent speech recognizer that improves recognition performance on mobile devices by splitting the input video into smaller temporal segments, then processing them individually. Second, it introduces a transfer learning approach aimed at enhancing the performance of silent speech recognition models in everyday conversational contexts. Through a study, we validate the applicability of this approach, demonstrating its effectiveness not only with MELDER but also with other pre-trained models. Third, it presents a comparative evaluation of MELDER against two state-of-the-art silent speech recognition models, assessing their performance in both stationary (seated position) and mobile (while walking) settings. Fourth, it introduces two visual feedback methods designed for silent speech recognition systems to keep users informed about the ongoing recognition process. These methods are compared with the feedback method of Google Assistant in a qualitative study. Fifth, the dataset, the source code, and other material produced in this study are freely available to download for research and development, encouraging replication and further investigations in the area. Table 1 provides a summary of all the experiments conducted in this work, detailing aspects such as the conditions under which each experiment was carried out, the number of phrases used, the sample size of participants, and the total number of videos utilized in these experiments.
3 MELDER: A Mobile Lip Reader
MELDER leverages LipType as its foundational model. LipType [77], an established end-to-end sentence-level model, translates a variable-length sequence of video frames into text. It achieves this through the integration of a shallow 3D-CNN (1-layer) with a deep 2D-CNN (34-layer ResNet [37]), enhanced by squeeze and excitation (SE) blocks (SE-ResNet). This configuration effectively captures both spatial and temporal information. The choice of SE-ResNet is strategic, as it adaptively recalibrates channel-wise feature responses by explicitly modeling the inter-dependencies between channels, thereby refining the quality of feature representations. Moreover, SE-ResNet is notable for its computational efficiency, adding only a minimal increase in model complexity and computational demands. For additional details, please refer to Section 4.1. MELDER enhances this model by introducing transcriber and reviewer channels that run in parallel. This structure not only enables real-time processing but also provides users with continuous visual feedback during the recognition of silent speech.
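To make the squeeze-and-excitation (SE) recalibration described above concrete, the following is a minimal PyTorch sketch of an SE block of the kind used in SE-ResNet; the class name, channel count, and reduction ratio are illustrative assumptions rather than details of the LipType implementation.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-excitation: recalibrates channel-wise feature responses (illustrative sketch).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # "squeeze": global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                          # per-channel gates in [0, 1]
        )

    def forward(self, x):                          # x: (batch, channels, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # "excitation": reweight channels by their gates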
3.1 The Transcriber Channel
The proposed transcriber channel consists of three sub-modules: a windowing frontend that splits the input video into smaller temporal segments, a spatiotemporal feature extraction module that takes a sequence of frames and outputs one feature vector per frame, and a sequence modeling module that inputs the sequence of per-frame feature vectors and outputs a sentence character by character. The model appends each sliced clip to a buffer for parallel processing. This cycle continues until the end of the video clip is detected. We must emphasize that the transcriber channel operates on a server, as modern smartphones do not possess the necessary storage and processing capacity to work with large datasets. Consequently, the results presented in this study may not be directly comparable to models that were tested exclusively on smartphones. Additionally, we acknowledge the concerns some users might have regarding the security of sending video clips to a server. Nonetheless, it is pertinent to point out that nearly all sophisticated real-time recognition systems, including Google Lens, Google Speech, Google Home, and Amazon Alexa, employ a similar server-based approach for processing significant volumes of data [33].
3.1.1 Windowing.
The channel slices the video input into smaller segments. In order to determine the best windowing function in the defined context, we studied two linear (y = x + 5, y = 2x + 5) and two non-linear (y = x^3, y = 2^x) windowing functions, where x is the window start frame and y is the window end frame. Each function has an overlapping window of two frames. This was decided in lab trials with an existing silent speech recognition model [77], where we compared speed-accuracy trade-offs for overlaps of 1–4 frames. We did not examine more than four frames because the average time per phoneme with silent speech is 176 ms [80], which corresponds to four frames (at a video frame rate of 25 frames per second). Since the model was already slicing a video into small chunks (∼5 frames), reprocessing a large overlapped window increased the processing time without improving accuracy. However, using two frames as an overlapped window improved accuracy without substantially slowing the processing time (Table 3). The overlap between the clips assures that any phonemes lost due to the slicing process are recovered using the information in the overlap frames.
We selected the windowing functions based on certain assumptions. We chose linear functions because they have constant window sizes, possibly resulting in faster computations. For instance, y = x + 5 has a fixed length, thus is likely to have a faster processing time, but the accuracy can suffer due to limited context. Larger window sizes, such as those used in y = 2x + 5, may increase accuracy but may lead to extended processing times. Alternatively, for non-linear functions, the window size increases gradually rather than being constant. They may initially have a faster processing time with a lower accuracy. However, as the window size increases, the processing time slows down and the accuracy is likely to rise. For this work, we selected the non-linear functions based on their window interval size. For instance, y = 2^x has a gradual increase in the window size, while y = x^3 has a steeper increase in the window size. Because the optimal windowing function for real-time processing within this context is unclear, we validated our choice in an experiment described in Section 4.
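To illustrate the slicing step, the Python sketch below generates window boundaries for the linear function y = x + 5 with a two-frame overlap; the helper name and the handling of the final partial window are assumptions made for illustration, not details of the MELDER implementation.

def linear_windows(num_frames, span=5, overlap=2):
    # Yield (start, end) frame indices for y = x + span windows with a fixed overlap.
    # Boundary handling for the last partial window is an assumption.
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + span, num_frames)   # y = x + 5 gives a constant window length
        windows.append((start, end))
        if end == num_frames:
            break
        start = end - overlap                 # re-process `overlap` frames to recover cut phonemes
    return windows

# Example for a one-second clip at 25 fps:
# linear_windows(25) -> [(0, 5), (3, 8), (6, 11), (9, 14), (12, 17), (15, 20), (18, 23), (21, 25)]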
3.1.2 Spatiotemporal Feature Extraction.
This module takes the sliced video chunk and extracts a mouth-centred cropped image of size H:100 × W:50 pixels per video frame. For this, videos are first pre-processed using the DLib face detector [58] and the iBug face landmark predictor [91] with 68 facial landmarks, combined with Kalman filtering. Then, a mouth-centred cropped image is extracted by applying affine transformations. The sequence of T mouth-cropped frames is then passed to the 3D-CNN, with a kernel dimension of T:5 × W:7 × H:7, followed by Batch Normalization (BN) [46] and Rectified Linear Units (ReLU) [5]. The extracted feature maps are then passed through a 34-layer 2D SE-ResNet that gradually decreases the spatial dimensions with depth, until the feature becomes a single-dimensional tensor per time step.
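For illustration, a minimal PyTorch sketch of this frontend stage (a 3D convolution with a T:5 × 7 × 7 kernel followed by BN and ReLU) is given below; the channel count, stride, and padding are illustrative assumptions, and the 2D SE-ResNet trunk is omitted.

import torch
import torch.nn as nn

# Shallow 3D-CNN frontend over a window of mouth crops; output channels, stride,
# and padding below are assumptions, not the exact LipType/MELDER configuration.
frontend = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64,
              kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
)

clip = torch.randn(1, 3, 5, 100, 50)   # (batch, RGB, T=5 frames, H=100, W=50)
features = frontend(clip)              # -> (1, 64, 5, 50, 25): one spatial feature map per time step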
3.1.3 Sequence Modeling.
The extracted features are processed by two Bidirectional Gated Recurrent Units (Bi-GRUs) [15]. Each time-step of the GRU output is processed by a linear layer, followed by a softmax layer over the vocabulary; the end-to-end model is trained with connectionist temporal classification (CTC) loss [35]. The softmax output is then decoded with a left-to-right beam search [20] using the Stanford-CTC decoder [68] to recognize the spoken utterance. The model appends each recognized character to a buffer for post-processing. This cycle continues until the end of a phrase is detected, which the model predicts when a newline character is encountered.
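The PyTorch sketch below outlines this stage: a two-layer bidirectional GRU over per-frame features, a linear projection, a softmax over the character vocabulary, and CTC loss; the feature size, hidden size, vocabulary size, and toy target are illustrative assumptions, and beam-search decoding is omitted.

import torch
import torch.nn as nn

T, B, F, H, V = 5, 1, 512, 256, 28          # frames, batch, feature dim, hidden dim, characters (incl. CTC blank)
gru = nn.GRU(input_size=F, hidden_size=H, num_layers=2, bidirectional=True)
proj = nn.Linear(2 * H, V)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(T, B, F)                # per-frame feature vectors from the frontend
out, _ = gru(feats)                         # (T, B, 2H)
log_probs = proj(out).log_softmax(dim=-1)   # (T, B, V): per-frame character distributions

targets = torch.tensor([[3, 8, 15]])        # toy character indices for the ground-truth string
loss = ctc(log_probs, targets,
           torch.tensor([T]),               # input length per batch item
           torch.tensor([3]))               # target length per batch item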
3.2 The Reviewer Channel
The proposed reviewer channel corrects both character-level and word-level errors and provides real-time feedback by displaying the most probable candidate words and phrases for auto-completion. The process comprises the following two steps.
3.2.1 Character-Level Corrector.
The character-level model enables real-time word completion based on the sequence of characters, or prefix string, obtained from the transcriber channel. As soon as the transcriber channel recognizes a new character, the model auto-completes the current string \(S\) with its most probable completion \(\hat{S}\). The conditional probability can be formulated as:
\[ \hat{S} = \arg\max P(\hat{S} \mid S_{1:m}) \quad (1) \]
Consider \(S_{1:m}\) as the first \(m\) characters in string \(S\); all completions must contain the prefix exactly, i.e.,
\[ \hat{S}_{1:m} = S_{1:m} \quad (2) \]
where \(n\) is the total length of a completion \(\hat{S} = \hat{S}_{1:n}\). As probabilities in the sequence domain contain exponentially many candidate strings, we simplified the model by calculating conditional probabilities recursively:
\[ P(\hat{S}_{m+1:n} \mid S_{1:m}) = \prod_{t=m}^{n-1} P(\hat{S}_{t+1} \mid \hat{S}_{1:t}) \quad (3) \]
This requires modelling only \(P(\hat{S}_{t+1} \mid \hat{S}_{1:t})\), which is the probability of the next character under the current prefix. For this, the model computes \(\arg\max P(\hat{S}_{t+1} \mid \hat{S}_{1:t})\) using the prefix tree (Trie) data structure. Upon finding the most probable completion for the current prefix, the model automatically displays the auto-completion.
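As a simplified illustration of the prefix-tree lookup, the Python sketch below stores word probabilities in a trie and returns the most probable completion of a prefix; the class names and the use of plain unigram word probabilities are assumptions made for brevity.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.prob = 0.0                        # > 0 only at nodes that end a word

class PrefixCompleter:
    # Illustrative trie-based completer: returns the most probable word for a prefix.
    def __init__(self, word_probs):
        self.root = TrieNode()
        for word, p in word_probs.items():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.prob = p

    def complete(self, prefix):
        node = self.root
        for ch in prefix:                      # walk down to the prefix node
            if ch not in node.children:
                return prefix                  # unknown prefix: leave the text unchanged
            node = node.children[ch]
        best, stack = (0.0, prefix), [(node, prefix)]
        while stack:                           # search every completion under the prefix
            node, s = stack.pop()
            if node.prob > best[0]:
                best = (node.prob, s)
            stack.extend((child, s + ch) for ch, child in node.children.items())
        return best[1]

# PrefixCompleter({"hello": 0.6, "help": 0.3, "hero": 0.1}).complete("he") -> "hello"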
3.2.2 Word-Level Corrector.
The module is activated only when a space character is detected. Upon detection, the sequence recognized so far is passed to the word-level n-gram language model (LM). First, it extracts the last word W from the recognized text, calculates the edit distance [63] between W and each dictionary word d, then replaces W with the word that has the minimum edit distance. Second, it auto-completes the sentence by modelling the joint probability distribution of the given words and future words.
Formally, we consider a given string of \(t\) words, \(W = W_1, W_2, \ldots, W_t\), and our goal is to predict the future word sequence \((W_{t+1}, W_{t+2}, \ldots, W_{t+T})\). The conditional probability can be formulated as:
\[ P(W_{t+1}, W_{t+2}, \ldots, W_{t+T} \mid W_1, W_2, \ldots, W_t) \quad (4) \]
This model uses bidirectional n-grams to account for both forward and reverse directions. The combined probability of a sentence, thus, is computed by multiplying the forward and backward n-gram probability of each word:
\[ P(W_1, \ldots, W_t) = \prod_{i=1}^{t} P_f(W_i) \, P_b(W_i) \quad (5) \]
In a forward n-gram, the conditional probability is estimated depending on the preceding words:
\[ P_f(W_i) = P(W_i \mid W_{i-n+1}, \ldots, W_{i-1}) \quad (6) \]
In contrast, in a backward n-gram, the probability of each word is estimated depending on the succeeding words:
\[ P_b(W_i) = P(W_i \mid W_{i+1}, \ldots, W_{i+n-1}) \quad (7) \]
Applying the values from Eq. 6 and Eq. 7, we get:
\[ P(W_1, \ldots, W_t) = \prod_{i=1}^{t} P(W_i \mid W_{i-n+1}, \ldots, W_{i-1}) \, P(W_i \mid W_{i+1}, \ldots, W_{i+n-1}) \quad (8) \]
Finally, the model predicts the most probable auto-completion of the given words and automatically adds it to the input text. We used the COCA corpus [23], one of the largest publicly available and genre-balanced corpora of English, to train the reviewer modules. The dataset contains approximately 1 billion words; however, we extracted the top 200,000 sentences as the vocabulary to reduce the computation time. The average perplexity score for the model is 42.6, indicating that it is well-trained and can anticipate words accurately.
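The Python sketch below illustrates the two word-level steps: minimum-edit-distance replacement against a dictionary, and a combined forward/backward n-gram score (shown here with bigrams); the function names, the fallback probability, and the bigram tables are illustrative assumptions.

def edit_distance(a, b):
    # Standard Levenshtein distance via a single-row dynamic program.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct_word(word, dictionary):
    # Replace a recognized word with the dictionary word at minimum edit distance.
    return min(dictionary, key=lambda d: edit_distance(word, d))

def sentence_score(words, fwd_bigram, bwd_bigram, fallback=1e-9):
    # Multiply the forward and backward bigram probability of each word (cf. Eqs. 5-8).
    score = 1.0
    for i in range(1, len(words)):
        score *= fwd_bigram.get((words[i - 1], words[i]), fallback)   # forward n-gram term
    for i in range(len(words) - 1):
        score *= bwd_bigram.get((words[i + 1], words[i]), fallback)   # backward n-gram term
    return score

# correct_word("helo", ["hello", "help", "world"]) -> "hello"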
5 Adopting A Transfer Learning Strategy
Most lip reading datasets contain limited vocabulary and do not support vocabulary relevant to everyday conversation. A model trained on a dataset with a specific vocabulary performs poorly when applied to a dataset other than the training vocabulary. Furthermore, training a deep learning model requires an enormous amount of data. Developing large-scale datasets tailored to particular vocabularies is extremely challenging, expensive, and time-consuming. To overcome this, we leverage the effectiveness of transfer learning, which exploits existing features (or knowledge) from a model trained on a high-resource vocabulary (the source model) and generalizes them to a new low-resource vocabulary (the target model) [76, 121].
Generally, features transition from general to specific characteristics by the last layer of the network, but this transition has not been extensively investigated in the context of lip reading. Research in deep learning has shown that standard features learned in the first layer appear regardless of the dataset and the task [96, 115], and thus are called general features. In contrast, features calculated by the last layer of a trained network are highly dependent on the dataset and the task. It is unclear, however, how this transition generalizes to lip reading, that is, to what extent features within a network can be generalized and used for transfer learning. Towards this, we investigated three strategies for transfer learning (Fig. 4). Consider a source model composed of N layers, with V layers representing the visual_frontend and S layers representing the sequence_processing (a code sketch illustrating these strategies follows the list below).
(1) Finetune_Last: The network is first initialized with the weights from the source model, then the top layers (N − 1) are frozen, and only the last layer is allowed to modify its weights. The model is then trained to fine-tune the last layer for the target vocabulary. During the training process, only the weights associated with the last layer are changed until they converge. With this method, only the final layer needs to be fine-tuned to adapt to the target dataset, while the features learned from the source model are reused.
(2) Finetune_Visual_Frontend: The network is first initialized with the weights from the source model, then the sequence_processing layers (N − V) are frozen and only the visual_frontend layers are allowed to modify their weights. Afterwards, the model is trained to fine-tune the visual_frontend for the target vocabulary. During the training process, only the weights associated with the visual_frontend are changed until they converge.
(3) Finetune_Sequence: The network is first initialized with the weights from the source model, then the visual_frontend layers (V) are frozen and only the sequence_processing layers are allowed to modify their weights. Afterwards, the model is trained to fine-tune the sequence_processing for the target vocabulary. During the training process, only the weights associated with the sequence_processing are changed until they converge.
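The PyTorch-style sketch below expresses the three strategies as parameter freezing; the attribute names visual_frontend, sequence_processing, and classifier are placeholders for the corresponding layer groups, not the actual identifiers used in the source models.

def apply_strategy(model, strategy):
    # Freeze every parameter, then unfreeze only the group named by the strategy.
    for p in model.parameters():
        p.requires_grad = False
    if strategy == "Finetune_Last":
        trainable = model.classifier              # only the last layer adapts to the target vocabulary
    elif strategy == "Finetune_Visual_Frontend":
        trainable = model.visual_frontend         # only the visual frontend adapts
    elif strategy == "Finetune_Sequence":
        trainable = model.sequence_processing     # only the sequence-processing layers adapt
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for p in trainable.parameters():
        p.requires_grad = True
    return model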
7 Experiment 3: Stationary Performance
We conducted a user study to compare MELDER with two state-of-the-art, pre-trained silent speech models, LipNet [9] and Transformer [3], with unseen data (data that has not been used to train the models) in a stationary setting (in a seated position). Since these models do not work in real-time (they compute one phrase at a time), we equipped them with the y = x + 5 windowing function and the Finetune_Sequence transfer learning strategy, as in MELDER, both for a fair comparison between the models (henceforth referred to as RT-LipNet and RT-Transformer) and to demonstrate that these approaches can be used independently with other silent speech models to make them real-time. We also disabled the visual feedback component of the reviewer channel (described in Section 3.2) in the study to eliminate a confounding factor (to remove any potential effects of feedback on performance).
7.1 Experimental Dataset
We used the Enron Mobile Email dataset [110] in this study. It contains genuine mobile emails, and thus is better suited to evaluate mobile text entry. First, we filtered the dataset using the following rules: 1) exclude phrases with lengths less than three or greater than ten, 2) exclude phrases containing common nouns, such as general names, places, and things, and 3) exclude phrases containing contractions or numeric values. After filtering, we randomly selected thirty phrases, removed all punctuation and non-alphanumeric tokens, and replaced all uppercase letters with lowercase letters. The selected phrases are presented in Appendix A.
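A minimal Python sketch of the filtering and normalization steps is shown below; it assumes phrase length is measured in words and detects contractions by apostrophes, and it omits the noun-based rule, which would additionally require a part-of-speech tagger.

import re

def keep_phrase(phrase):
    # Length rule (assumed to be in words) and the contraction/numeric rule.
    words = phrase.split()
    if not (3 <= len(words) <= 10):
        return False
    if any(ch.isdigit() for ch in phrase) or "'" in phrase:
        return False
    return True

def normalize(phrase):
    # Strip punctuation and non-alphanumeric tokens, then lowercase everything.
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", " ", phrase)
    return " ".join(cleaned.lower().split())

# keep_phrase("see you at 5") -> False (contains a numeric value)
# normalize("Sounds good, thanks!") -> "sounds good thanks"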
7.2 Participants
Twenty volunteers took part in the study (Fig. 6b). Their age ranged from 18 to 41 years (M = 25.55 years, SD = 6.2). Ten of them identified as women, nine as men, and one as non-binary. They all had owned a smartphone for at least five years (M = 8.4 years, SD = 2.1). Sixteen of them were frequent users of a voice assistant system on their smartphones (M = 3 years, SD = 2.3), while the remaining four were infrequent or non-users. They all received U.S. $10 for volunteering.
7.3 Apparatus
We developed a custom application for smartphones running Android OS using the default Google Android API (Fig. 6a). The application enabled users to record videos of themselves silently speaking the presented phrases using the front camera of a smartphone. In the study, participants recorded videos using the front cameras of their own smartphones to increase the variability of the dataset.
7.4 Design
The study used a within-subjects design with one independent variable “model” with three levels: RT-LipNet, RT-Transformer, and MELDER. In total, we collected (20 participants × 30 phrases) = 600 samples. The dependent variables were the same word error rate and words per minute performance metrics as described in Section 4.2. However, regarding computation time, this experiment measured the average time required by the model to process a phrase.
7.5 Procedure
The data collection process occurred remotely. We explained the purpose of the study and scheduled individual Zoom video calls with each participant ahead of time. We instructed them to join the call from a quiet room to avoid any interruptions during the study. First, we demonstrated the application and collected their consent and demographics using electronic forms. We then shared the application (APK file) with them and guided them through the installation process on their smartphones.
Participants were instructed to sit at a desk during the study. The application displayed one phrase at a time. Participants pressed the “Record/Stop” toggle button, silently spoke the phrase (uttered the phrase without vocalizing sound), then pressed the same button to see the next phrase. We did not instruct them on how to hold the device, but most of them held the device with their non-dominant hand and pressed the button with their dominant hand. Upon completion of the study, participants shared the logged data with us by uploading it to cloud storage under our supervision. For evaluation, we passed the recorded videos through the transcriber channel to obtain recognitions, then post-processed the recognized text through the reviewer channel to auto-correct errors and present the most probable auto-completion of the text at both the word and phrase level.
7.6 Results
A Martinez-Iglewicz test revealed that the response variable residuals were normally distributed. A Mauchly’s test indicated that the variances of populations were equal. Therefore, we used a repeated-measures ANOVA and post-hoc Tukey-Kramer multiple-comparison tests for all analyses. We also report effect sizes in eta-squared (η2) for all statistically significant results.
7.6.1 Word Error Rate.
An ANOVA identified a significant effect of model on word error rate (F2, 19 = 3632.67, p < .00001, η2 = 0.94). On average, RT-LipNet, RT-Transformer, and MELDER yielded 20.95% (SD = 0.8), 28.1% (SD = 1.1), and 19.75% (SD = 0.9) word error rates, respectively. A Tukey-Kramer test revealed that RT-Transformer was significantly more error prone than RT-LipNet and MELDER. Fig. 7a illustrates this.
7.6.2 Words per Minute.
An ANOVA identified a significant effect of model on words per minute (F2, 19 = 557.08, p < .00001, η2 = 0.89). On average, RT-LipNet, RT-Transformer, and MELDER yielded 4.96 wpm (SD = 0.3), 4.21 wpm (SD = 0.2), and 5.59 wpm (SD = 0.1), respectively. A Tukey-Kramer test revealed that RT-Transformer was significantly slower than RT-LipNet and MELDER. Fig. 7b illustrates this.
7.6.3 Computation Time.
An ANOVA identified a significant effect of model on computation time (F2, 19 = 11085.33, p < .00001, η2 = 0.99). On average, RT-LipNet, RT-Transformer, and MELDER required 12.93 s (SD = 0.4), 13.55 s (SD = 0.2), and 6.51 s (SD = 0.2) to compute a phrase, respectively. A Tukey-Kramer test revealed that MELDER was significantly faster in computing the phrases than RT-LipNet and RT-Transformer. Fig. 7c illustrates this.
7.7 Discussion
MELDER outperformed RT-LipNet and RT-Transformer in terms of both speed and accuracy. MELDER took 50% less time than RT-LipNet and 52% less time than RT-Transformer to compute a phrase. These effects were statistically significant, and translated into 13% and 33% faster text entry speeds than RT-LipNet and RT-Transformer, respectively, with the latter difference being statistically significant. MELDER was also the most accurate. It committed 6% fewer errors than RT-LipNet and 30% fewer errors than RT-Transformer, with the latter difference again being statistically significant. These statistically significant differences, accompanied by large effect sizes (η2 ≥ 0.1 constitutes a large effect size [7, 19]), indicate their potential generalizability to a broader population. These results strengthen our argument that MELDER is better suited for use on mobile devices than existing models.
We also compared the original LipNet and Transformer models with RT-LipNet and RT-Transformer in an ablation study. In the study, LipNet yielded a 97.3% word error rate, 4.6 wpm entry speed, and 14.2 s computation time. The addition of the windowing and transfer learning approaches reduced the word error rate by 78%, improved entry speed by 7%, and reduced computation time by 9%. Transformer also demonstrated substantial improvements in performance when empowered with the proposed windowing and transfer learning approaches. The original Transformer yielded an 81.2% word error rate, 5.2 wpm entry speed, and 14.7 s computation time. RT-Transformer, in turn, demonstrated a 65% reduction in word error rate, a 19% improvement in entry speed, and a 14% reduction in computation time. These findings validate that the suggested windowing and transfer learning methods can be employed separately with existing silent speech recognizers, not only enabling real-time capabilities but also enhancing their overall performance.
9 Experiment 5: Visual Feedback
We conducted a final study to compare the visual feedback methods of MELDER with the visual feedback method of Google Assistant. Note that the feedback methods were not included in Experiments 3 and 4 to eliminate a potential confounding factor. This study focuses on assessing the perceived performance of the visual feedback methods in MELDER and Google Assistant, rather than directly comparing their actual performance. Such a comparison would be unfair due to the inherent differences between the two systems: MELDER is an image-based silent speech recognizer, while Google Assistant’s speech-to-text relies on audio processing. These disparities stem from the distinct data types they handle (visual data for MELDER, auditory data for Google Assistant), resulting in varying complexities in operations and feature extraction.
9.1 Apparatus
We developed a custom Web application with HTML5, CSS, PHP, JavaScript, and Node.js. We hosted the application on GitHub. The application was loaded in the Chrome web browser v71.0.3578.98 running on a Motorola Moto G5 Plus smartphone (150.2 × 74 × 7.7 mm, 155 g). Its built-in front camera (12 megapixel with 1080 × 1920 pixel resolution) was used to track lip movements. Through an IP webcam Android application [54], we connected the smartphone’s camera to the server, which ran the silent speech recognition model. The server ran on a MacBook Pro 16" laptop with a 2.6 GHz Intel Core i7 processor, 16 GB RAM, and a 3072 × 1920 display at 226 ppi. The laptop and the smartphone were connected to a fast and reliable Wi-Fi network. There were no network dropouts during the study.
9.2 Participants
Twelve volunteers participated in the user study (Fig. 9). Their age ranged from 21 to 41 years (M = 27.8 years, SD = 5). Eight of them identified as women and four as men. They all had owned a smartphone for at least five years (M = 8.2 years, SD = 2.2). Eleven of them were frequent users of a voice assistant system on their smartphones (M = 3 years, SD = 2.4), while one was an infrequent user. They all received U.S. $15 for volunteering.
9.3 Design
We used a within-subjects design for the user study with one independent variable “feedback” with three levels: Google, word-level MELDER, and phrase-level MELDER. In each condition, participants entered thirty short English phrases from a subset of the Enron Mobile Email corpus, presented in Appendix A. In summary, the design was 12 participants × 3 conditions × 30 phrases = 1,080 input tasks in total. The dependent variables were the eight items on a questionnaire. The study gathered qualitative data through a custom questionnaire inspired by the System Usability Scale (SUS) [12]. The questionnaire asked participants to rate eight statements on the examined methods’ speed (“The technique was fast”), accuracy (“The technique was accurate”), effectiveness (“The feedback method used in the technique was effective and useful”), willingness-to-use (“I think that I would like to use this system frequently”), ease-of-use (“I thought the system was easy to use”), learnability (“I would imagine that most people would learn to use this system very quickly”), confidence (“I felt very confident using the system”), and privacy and security (“I think the system will be private and secure when using in public places”) on a 5-point Likert scale.
9.4 Feedback Approaches
We created two real-time visual feedback methods for silent speech recognition models, drawing inspiration from Google Assistant’s feedback approach. In Google Assistant, the system starts displaying likely letters and words as soon as it detects speech, refining the output as the speaker continues. These initial predictions are presented in a greyed-out font (Fig. 10a) to signify their potential for correction as more information becomes available. Unlike suggestions on a virtual keyboard, these predictions in Google Assistant are automatically managed by the system and cannot be manually selected, discarded, or updated by users. Additionally, the system offers feedback for sound detection, resembling oscilloscope traces or sound waves, presented as four colored vertical lines (Fig. 10a, bottom of the display). These lines dynamically change in height to indicate when the system detects sound and come to a halt when sound detection ceases.
MELDER also offers feedback on lip detection and speech recognition. When the front camera detects the user’s lips, it displays a red blinking circle, similar to the video recording indicator on mobile devices. The red circle ceases blinking and changes to grey when the lips are no longer visible (Fig. 10b). To keep users informed about the speech recognition process, we developed two feedback methods:
• Word-level feedback: This method offers real-time feedback on a word-by-word basis. It presents the most probable word based on the recognized input. The text remains gray until the confidence level of the word surpasses a specified threshold (empirically set at 0.75). Once this condition is met, the word turns black, signifying that it is considered finalized and will not be corrected (Fig. 10c).
• Phrase-level feedback: In this approach, real-time feedback is provided by displaying the most likely phrase based on the recognized prefix string. Each word within the phrase starts in gray and transitions to black when its confidence level exceeds a specific threshold (empirically set at 0.87). This change to black indicates that the word is considered fixed and will not undergo further correction (Fig. 10d).
The threshold values were determined empirically through multiple lab trials. During these trials, we tested thresholds ranging from 0.65 to 1.0 for both feedback methods. We selected the threshold values that proved most effective in delivering real-time feedback based on the experimental results. Similar to Google Assistant, neither of these feedback methods allowed users to proactively select, dismiss, or modify the suggestions; they were merely provided to inform users about the recognition process.
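The Python sketch below illustrates the thresholding logic: each word is rendered as tentative (greyed out) until its confidence exceeds the threshold, after which it is treated as final; the function name and the example confidences are illustrative assumptions.

WORD_THRESHOLD = 0.75      # word-level feedback threshold reported above

def render_feedback(words_with_conf, threshold=WORD_THRESHOLD):
    # Return (word, state) pairs: "final" (black) once confidence exceeds the
    # threshold, otherwise "tentative" (greyed out, may still be corrected).
    return [(w, "final" if conf >= threshold else "tentative")
            for w, conf in words_with_conf]

# render_feedback([("send", 0.91), ("the", 0.80), ("report", 0.42)])
# -> [("send", "final"), ("the", "final"), ("report", "tentative")]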
9.5 Procedure
The study was conducted in a quiet computer laboratory. First, we provided the participants with a brief overview of the functioning principles behind both speech and silent speech recognition. Subsequently, we offered practical demonstrations of the three distinct feedback methods employed in the study. We then collected their informed consent forms and let them practice with the three methods for about five minutes. They could extend the practice by an extra two minutes upon request.
The main study started after that. In the study, participants entered thirty short English phrases from the Enron set [110] by either speaking or silently speaking on a smartphone. All participants were seated at a desk. The three conditions (Google Assistant, MELDER with word-level feedback, and MELDER with phrase-level feedback) were counterbalanced to eliminate any potential effect of practice. As each phrase was recognized, the application automatically displayed the next phrase, continuing in this manner until all phrases within the given condition had been successfully completed. Participants were not required to re-speak a phrase in the event that it was not accurately recognized by the system.
Upon the completion of all conditions, participants completed a questionnaire that asked them to rate the three methods’ speed, accuracy, effectiveness, willingness-to-use, ease-of-use, learnability, confidence, and privacy and security on a five-point Likert scale (Section 9.3). Finally, we concluded the study with a debrief session, where participants were given a chance to share their thoughts and comments regarding their responses to the questionnaire.
9.6 Speed and Accuracy
As discussed in Section 9, the primary aim of this qualitative study was not to conduct a direct comparison of the actual speed and accuracy of the models. However, it is noteworthy that we did carry out a separate study comparing Google Assistant and MELDER. In this between-subjects study, 24 participants (average age = 26.25 years, SD = 5.9, comprising 12 women, 11 men, and 1 non-binary person) were evenly distributed into two groups: one using Google Assistant and the other using MELDER. Each group employed their designated input method in a seated position. A between-subjects ANOVA revealed a statistically significant effect of the input method on both entry speed (F1, 22 = 1083.35, p < .00001, η2 = 0.98) and accuracy (F1, 22 = 1219.38, p < .00001, η2 = 0.99).
As expected, participants using Google Assistant achieved an average entry speed of 30.54 wpm (SD = 2.6) and a remarkably low word error rate of 2.01% (SD = 0.3). In contrast, those using MELDER exhibited significantly slower input speeds, averaging 5.62 wpm (SD = 0.1), along with a much higher word error rate of 19.86% (SD = 1.0). Fig. 11 summarizes these findings. It is important to highlight that both the word-level and phrase-level versions of MELDER utilize the same recognition model and do not require users to actively choose suggestions from the feedback. Consequently, they are indistinguishable in terms of actual speed and accuracy.
9.7 Results
We used a Friedman test and post-hoc Games-Howell multiple-comparison tests for analysing all non-parametric study data. We also report effect sizes in Kendall’s W for all statistically significant results. We interpret Kendall’s W using Cohen’s guidelines [19]: W < 0.3 as a small, 0.3 ≤ W < 0.5 as a medium, and W ≥ 0.5 as a large effect size. Fig. 12 summarizes the findings of the study.
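For reference, the Python sketch below shows how such an analysis could be run with SciPy: a Friedman test across the three feedback conditions, with Kendall's W derived as χ2 / (N(k − 1)); the ratings in the example are made-up placeholders, not data from this study.

import numpy as np
from scipy.stats import friedmanchisquare

# Placeholder 5-point ratings from 12 participants under the three feedback conditions.
google = np.array([5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 4, 5])
word   = np.array([3, 3, 4, 2, 3, 4, 3, 3, 2, 3, 4, 3])
phrase = np.array([4, 3, 4, 3, 4, 4, 3, 4, 3, 4, 4, 4])

chi2, p = friedmanchisquare(google, word, phrase)   # Friedman test across the 3 conditions
n, k = len(google), 3
kendalls_w = chi2 / (n * (k - 1))                   # effect size: Kendall's W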
9.7.1 Perceived Speed and Accuracy.
A Friedman test identified a significant effect of feedback on perceived speed (χ2 = 9.83, df = 2, p < .01, W = 0.4) and accuracy (χ2 = 6.4, df = 2, p < .05, W = 0.3). A Games-Howell test revealed that participants found Google Assistant to be significantly faster than both word-level and phrase-level MELDER. But interestingly, the pairwise test was unable to identify any significant difference between the three methods in terms of accuracy.
9.7.2 Effectiveness.
A Friedman test failed to identify a significant effect of feedback on effectiveness (χ2 = 4.69, df = 2, p = .09). Additionally, a Games-Howell test confirmed that participants perceived all three examined feedback approaches to be relatively equally effective.
9.7.3 Willingness-to-Use.
A Friedman test identified a significant effect of feedback on willingness-to-use (χ2 = 7.0, df = 2, p < .05, W = 0.3). An analysis using the Games-Howell test demonstrated that participants expressed a significantly stronger preference for phrase-level feedback over word-level feedback. However, there was no statistically significant difference in their preference between these feedback types and Google Assistant.
9.7.4 Ease-of-Use and Learnability.
A Friedman test failed to identify a significant effect of feedback on either ease-of-use (χ2 = 6.0, df = 2, p = .05) or learnability (χ2 = 6.0, df = 2, p = .05). A Games-Howell test also confirmed that participants found the three examined methods relatively comparable in terms of ease-of-use and learnability.
9.7.5 Confidence.
A Friedman test identified a significant effect of feedback on confidence (χ2 = 12.56, df = 2, p < .01, W = 0.5). A Games-Howell test indicated that participants exhibited a notably higher level of confidence when utilizing Google Assistant compared to both word-level and phrase-level MELDER. Their confidence levels in using the two variations of MELDER appeared to be relatively similar.
9.7.6 Privacy and Security.
A Friedman test identified a significant effect of feedback on privacy and security (χ2 = 24.0, df = 2, p < .0001, W = 1.0). A Games-Howell test revealed that participants found both word-level and phrase-level MELDER to be significantly more secure and private than Google Assistant.
9.8 Discussion
MELDER was notably slower and displayed a higher error rate compared to Google Assistant. The discrepancy in text entry speed between the two methods was readily observed by all participants. They universally perceived MELDER, regardless of the feedback method, to be slower than Google Assistant. This notably influenced participants’ confidence: they reported feeling significantly more confident when using Google Assistant compared to both word-level and phrase-level MELDER. One participant (male, 26 years) commented, “I think silent speech is slower, and speed is really important in some cases. Apart from this, I think it is going to be an extremely cool piece of technology.”
Interestingly, participants found MELDER with phrase-level feedback to be relatively faster than MELDER with word-level feedback, even though both variants used the same underlying model. The majority of participants agreed with the statement that MELDER with phrase-level feedback is fast (N = 8), while a few remained neutral (N = 3), and only one participant disagreed with the statement. These results indicate that phrase-level feedback enhanced users’ perception of the method’s speed, despite the actual performance being similar.
Participants’ perception of the accuracy of the examined methods yielded surprising results. Despite the fact that both variants of MELDER, with either word-level or phrase-level feedback, displayed significantly higher error rates compared to Google Assistant, participants did not perceive them as notably error-prone. In fact, the vast majority of participants agreed with the statement that the method is accurate (N = 11), with only one participant expressing a neutral opinion on the matter. It is important to note that while a Friedman test identified a statistically significant effect of feedback on perceived accuracy, the post-hoc multiple-comparison analysis did not confirm this significance. This suggests that participants’ perceptions of accuracy may not align with the quantitative error rates, highlighting an interesting aspect of user perception in human-computer interaction studies.
Participants’ perception of the performance of MELDER with phrase-level feedback had a clear impact on their willingness to use the different methods. They expressed a significantly higher willingness to use both Google Assistant and MELDER with phrase-level feedback compared to MELDER with word-level feedback. This observation underscores the potential effectiveness of the proposed methods and the feedback approaches employed in the study. Participants’ willingness to use MELDER with phrase-level feedback was also positively influenced by their perception of the method’s security and privacy features. They viewed both variants of MELDER as significantly more private and secure compared to Google Assistant, primarily because bystanders could not overhear their interactions. Some participants even indicated that they would consider using the method primarily for its privacy and security benefits. For instance, one participant (female, 21 years) stated, “Due to its privacy benefits, it is extremely useful.” These findings align with prior research on the perceived privacy and security advantages of speech and silent speech-based input methods [81].
The results showed that participants found both Google Assistant and the two variations of MELDER to be relatively comparable in terms of effectiveness, ease of use, and learnability. While there were slight variations in the ratings for these three methods, a Friedman test did not detect any statistically significant differences in these aspects. Furthermore, participants expressed that both variants of MELDER were easy to use, and they believed that their performance would improve with practice. As one participant (female, 21 years) noted, “Adapting to silent speech was challenging at first, but became easier as I progressed.” This feedback suggests that users may require some time to acclimate to silent speech input but can become more proficient with practice.
10 Conclusion
In this comprehensive work, we have successfully developed a real-time silent speech recognition system tailored for mobile devices. Our approach involves breaking down the input video into smaller temporal segments, processing them individually, and utilizing advanced language models to auto-correct output at both character and word-levels. Additionally, our system offers users valuable feedback on the silent speech recognition process.
The work began with an experiment where we explored four different windowing functions for segmenting lip videos, ultimately determining that a linear function (y = x + 5) yielded the best performance. Building upon this, we introduced a transfer learning approach aimed at enhancing the capabilities of silent speech recognition models for everyday conversational contexts. We investigated three strategies for transferring learning with three existing silent speech models, with the Finetune_Sequence strategy emerging as the most effective, showcasing its potential for improving the performance of existing pre-trained models. Equipped with the linear slicing function and the Finetune_Sequence transfer learning approach, we compared our system, MELDER, with two state-of-the-art silent speech models in two user studies: one in a stationary setting (seated position) and another in a mobile setting (while walking). The results demonstrated that MELDER outperformed both methods, establishing its feasibility for mobile device use. Furthermore, we conducted a qualitative study comparing our proposed word-level and phrase-level visual feedback methods with Google Assistant’s feedback mechanism. Interestingly, the study revealed that users’ perceived performance did not always align with actual performance. Notably, the phrase-level feedback significantly enhanced users’ perception of the silent speech model.
In conclusion, this work firmly establishes silent speech as a viable and effective method for interacting with mobile devices. As part of our commitment to advancing research in this field, we have made the dataset, source code, and other materials generated during this study freely available for download. We hope that this will encourage further investigations and replication efforts in this promising area of study.
11 Future Work
In future work, we plan to investigate various manual error correction strategies, empowering users to effectively correct recognition errors. Additionally, our aim is to further optimize the algorithm, enhancing its speed, accuracy, and adaptability, especially for individuals with diverse speech disorders. We also intend to conduct more in-depth studies to thoroughly examine the usability, adaptiveness, and robustness of the model. Moreover, testing the method in varied settings, such as under different lighting conditions and noise levels, is also part of our future research agenda.