1 Introduction
Speech input, an auditory-based language processing technique that transcribes acoustic signals into text, stands out as one of the most intuitive and efficient means of engaging with mobile devices. It holds the potential to enhance user comfort and productivity, particularly when conventional input methods like touchscreens and physical keyboards prove inefficient, cumbersome, or inconvenient [89]. Moreover, it serves as a crucial accessibility feature, empowering individuals with limited motor skills to seamlessly interact with mobile technology without reliance on manual dexterity. This functionality also proves invaluable for those experiencing situationally-induced impairments and disabilities (SIID), a category encompassing instances where hand use is restricted due to concurrent tasks, glove-wearing, or minor injuries [92]. However, it is important to acknowledge that while speech input excels in numerous interaction scenarios, its suitability may be compromised in situations characterized by high ambient noise levels, privacy and security considerations, or pre-existing speech impairments [26, 27].
Silent speech input, an image-based language processing method that translates users’ lip movements into textual content, presents a promising solution to many of these challenges [81]. Its independence from acoustic cues allows for versatile application, thriving even in noisy or sensitive environments like libraries or museums. Moreover, it significantly bolsters privacy and security, given the limited number of individuals skilled in lip reading. In fact, studies indicate that even skilled lip readers can typically comprehend only about 30-45% of spoken English [65], further underscoring the privacy and confidentiality advantages of silent speech. Silent speech further promotes inclusivity by accommodating individuals who are unable to vocalize or have speech disorders, thereby making communication with computers more accessible.
In pursuit of optimal silent speech recognition, researchers have explored various sensor-based techniques, achieving high accuracy in speech transcription [73, 78, 83, 88, 107]. However, these approaches often entail invasive, unwieldy, and non-portable setups, rendering them impractical in real-world scenarios. Recent endeavors have aimed to harness video-based recognition, commonly referred to as digital lip reading, to facilitate silent speech communication [3, 9, 16, 18]. Yet, many of these models are primarily tailored for high-performance computing devices, like desktop computers [3, 9, 77]. Even with ample computational resources, these models exhibit sluggish response times, susceptibility to errors, and a lack of real-time functionality, making them unsuitable for mobile devices. Additionally, existing models tend to support only a limited, pre-determined vocabulary, hampering their applicability in everyday conversational interactions.
A well-known challenge in developing deep learning models is the demand for substantial training data, particularly datasets tailored to specific vocabularies, which are time-consuming and arduous to build. Thus, the imperative arises to design models robust enough to operate effectively with modest data quantities, without compromising performance. Furthermore, the exploration of interface and feedback mechanisms tailored for silent speech interaction on mobile devices has remained largely unexplored in the literature. A mobile-optimized interface is of paramount importance, as it directly impacts user experience and usability. A swifter, more accurate, real-time silent speech recognition system optimized for mobile devices thus holds the potential to serve as a versatile medium for input and interaction tasks, seamlessly integrating into daily routines.
This paper presents MELDER, a Mobile Lip Reader optimized for performance and usability on mobile devices. The contribution of the work is five-fold. First, it develops a new real-time silent speech recognizer that improves recognition performance on mobile devices by splitting the input video into smaller temporal segments, then processing them individually. Second, it introduces a transfer learning approach aimed at enhancing the performance of silent speech recognition models in everyday conversational contexts. Through a study, we validate the applicability of this approach, demonstrating its effectiveness not only with MELDER but also with other pre-trained models. Third, it presents a comparative evaluation of MELDER against two state-of-the-art silent speech recognition models, assessing their performance in both stationary (seated position) and mobile (while walking) settings. Fourth, it introduces two visual feedback methods designed for silent speech recognition systems to keep users informed about the ongoing recognition process. These methods are compared with the feedback method of Google Assistant in a qualitative study. Fifth, the dataset, the source code, and other material produced in this study are freely available to download for research and development, encouraging replication and further investigations in the area. Table 1 provides a summary of all the experiments conducted in this work, detailing aspects such as the conditions under which each experiment was carried out, the number of phrases used, the sample size of participants, and the total number of videos utilized in these experiments.
3 MELDER: A Mobile Lip Reader
MELDER leverages LipType as its foundational model. LipType [77], an established end-to-end sentence-level model, translates a variable-length sequence of video frames into text. It achieves this through the integration of a shallow 3D-CNN (1-layer) with a deep 2D-CNN (34-layer ResNet [37]), enhanced by squeeze and excitation (SE) blocks (SE-ResNet). This configuration effectively captures both spatial and temporal information. The choice of SE-ResNet is strategic, as it adaptively recalibrates channel-wise feature responses by explicitly modeling the inter-dependencies between channels, thereby refining the quality of feature representations. Moreover, SE-ResNet is notable for its computational efficiency, adding only a minimal increase in model complexity and computational demands. For additional details, please refer to Section 4.1. MELDER enhances this model by introducing transcriber and reviewer channels that run in parallel. This structure not only enables real-time processing but also provides users with continuous visual feedback during the recognition of silent speech.
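To make the squeeze-and-excitation (SE) recalibration described above concrete, the following is a minimal PyTorch sketch of an SE block of the kind used in SE-ResNet; the class name, channel count, and reduction ratio are illustrative assumptions rather than details of the LipType implementation.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-excitation: recalibrates channel-wise feature responses (illustrative sketch).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # "squeeze": global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                          # per-channel gates in [0, 1]
        )

    def forward(self, x):                          # x: (batch, channels, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # "excitation": reweight channels by their gates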
3.1 The Transcriber Channel
The proposed transcriber channel consists of three sub-modules: a windowing frontend that splits the input video into smaller temporal segments, a spatiotemporal feature extraction module that takes a sequence of frames and outputs one feature vector per frame, and a sequence modeling module that inputs the sequence of per-frame feature vectors and outputs a sentence character by character. The model appends each sliced clip to a buffer for parallel processing. This cycle continues until the end of the video clip is detected. We must emphasize that the transcriber channel operates on a server, as modern smartphones do not possess the necessary storage and processing capacity to work with large datasets. Consequently, the results presented in this study may not be directly comparable to models that were tested exclusively on smartphones. Additionally, we acknowledge the concerns some users might have regarding the security of sending video clips to a server. Nonetheless, it is pertinent to point out that nearly all sophisticated real-time recognition systems, including Google Lens, Google Speech, Google Home, and Amazon Alexa, employ a similar server-based approach for processing significant volumes of data [33].
3.1.1 Windowing.
The channel slices the video input into smaller segments. In order to determine the best windowing function in the defined context, we studied two linear (y = x + 5, y = 2x + 5) and two non-linear (y = x^3, y = 2^x) windowing functions, where x is the window start frame and y is the window end frame. Each function has an overlapping window of two frames. This was decided in lab trials with an existing silent speech recognition model [77], where we compared speed-accuracy trade-offs for overlaps of 1–4 frames. We did not examine more than four frames because the average time per phoneme with silent speech is 176 ms [80], which corresponds to four frames (at a video frame rate of 25 frames per second). Since the model was already slicing a video into small chunks (∼5 frames), reprocessing a large overlapped window increased the processing time without improving accuracy. However, using two frames as an overlapped window improved accuracy without substantially slowing the processing time (Table 3). The overlap between the clips assures that any phonemes lost due to the slicing process are recovered using the information in the overlap frames.
We selected the windowing functions based on certain assumptions. We chose linear functions because they have constant window sizes, possibly resulting in faster computations. For instance, y = x + 5 has a fixed length, thus is likely to have a faster processing time, but the accuracy can suffer due to limited context. Larger window sizes, such as those used in y = 2x + 5, may increase accuracy but may lead to extended processing times. Alternatively, for non-linear functions, the window size increases gradually rather than being constant. They may initially have a faster processing time with a lower accuracy. However, as the window size increases, the processing time slows down and the accuracy is likely to rise. For this work, we selected the non-linear functions based on their window interval size. For instance, y = 2^x has a gradual increase in the window size, while y = x^3 has a steeper increase in the window size. Because the optimal windowing function for real-time processing within this context is unclear, we validated our choice in an experiment described in Section 4.
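To illustrate the slicing step, the Python sketch below generates window boundaries for the linear function y = x + 5 with a two-frame overlap; the helper name and the handling of the final partial window are assumptions made for illustration, not details of the MELDER implementation.

def linear_windows(num_frames, span=5, overlap=2):
    # Yield (start, end) frame indices for y = x + span windows with a fixed overlap.
    # Boundary handling for the last partial window is an assumption.
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + span, num_frames)   # y = x + 5 gives a constant window length
        windows.append((start, end))
        if end == num_frames:
            break
        start = end - overlap                 # re-process `overlap` frames to recover cut phonemes
    return windows

# Example for a one-second clip at 25 fps:
# linear_windows(25) -> [(0, 5), (3, 8), (6, 11), (9, 14), (12, 17), (15, 20), (18, 23), (21, 25)]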
3.1.2 Spatiotemporal Feature Extraction.
This module takes the sliced video chunk and extracts a mouth-centred cropped image of size H:100 × W:50 pixels per video frame. For this, videos are first pre-processed using the DLib face detector [58] and the iBug face landmark predictor [91] with 68 facial landmarks, combined with Kalman filtering. Then, a mouth-centred cropped image is extracted by applying affine transformations. The sequence of T mouth-cropped frames is then passed to the 3D-CNN, with a kernel dimension of T:5 × W:7 × H:7, followed by Batch Normalization (BN) [46] and Rectified Linear Units (ReLU) [5]. The extracted feature maps are then passed through a 34-layer 2D SE-ResNet that gradually decreases the spatial dimensions with depth, until the feature becomes a single-dimensional tensor per time step.
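For illustration, a minimal PyTorch sketch of this frontend stage (a 3D convolution with a T:5 × 7 × 7 kernel followed by BN and ReLU) is given below; the channel count, stride, and padding are illustrative assumptions, and the 2D SE-ResNet trunk is omitted.

import torch
import torch.nn as nn

# Shallow 3D-CNN frontend over a window of mouth crops; output channels, stride,
# and padding below are assumptions, not the exact LipType/MELDER configuration.
frontend = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64,
              kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
)

clip = torch.randn(1, 3, 5, 100, 50)   # (batch, RGB, T=5 frames, H=100, W=50)
features = frontend(clip)              # -> (1, 64, 5, 50, 25): one spatial feature map per time step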
3.1.3 Sequence Modeling.
The extracted features are processed by two Bidirectional Gated Recurrent Units (Bi-GRUs) [15]. Each time-step of the GRU output is processed by a linear layer, followed by a softmax layer over the vocabulary; the end-to-end model is trained with connectionist temporal classification (CTC) loss [35]. The softmax output is then decoded with a left-to-right beam search [20] using the Stanford-CTC decoder [68] to recognize the spoken utterance. The model appends each recognized character to a buffer for post-processing. This cycle continues until the end of a phrase is detected, which the model predicts when a newline character is encountered.
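The PyTorch sketch below outlines this stage: a two-layer bidirectional GRU over per-frame features, a linear projection, a softmax over the character vocabulary, and CTC loss; the feature size, hidden size, vocabulary size, and toy target are illustrative assumptions, and beam-search decoding is omitted.

import torch
import torch.nn as nn

T, B, F, H, V = 5, 1, 512, 256, 28          # frames, batch, feature dim, hidden dim, characters (incl. CTC blank)
gru = nn.GRU(input_size=F, hidden_size=H, num_layers=2, bidirectional=True)
proj = nn.Linear(2 * H, V)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(T, B, F)                # per-frame feature vectors from the frontend
out, _ = gru(feats)                         # (T, B, 2H)
log_probs = proj(out).log_softmax(dim=-1)   # (T, B, V): per-frame character distributions

targets = torch.tensor([[3, 8, 15]])        # toy character indices for the ground-truth string
loss = ctc(log_probs, targets,
           torch.tensor([T]),               # input length per batch item
           torch.tensor([3]))               # target length per batch item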
3.2 The Reviewer Channel
The proposed reviewer channel corrects both character-level and word-level errors and provides real-time feedback by displaying the most probable candidate words and phrases for auto-completion. The process comprises the following two steps.
3.2.1 Character-Level Corrector.
The character-level model enables real-time word completion based on the sequence of characters, or prefix string, obtained from the transcriber channel. As soon as the transcriber channel recognizes a new character, the model auto-completes the current string \(S\) with its most probable completion \(\hat{S}\). The conditional probability can be formulated as:
\[ \hat{S} = \arg\max P(\hat{S} \mid S_{1:m}) \quad (1) \]
Consider \(S_{1:m}\) as the first \(m\) characters in string \(S\); all completions must contain the prefix exactly, i.e.,
\[ \hat{S}_{1:m} = S_{1:m} \quad (2) \]
where \(n\) is the total length of a completion \(\hat{S} = \hat{S}_{1:n}\). As probabilities in the sequence domain contain exponentially many candidate strings, we simplified the model by calculating conditional probabilities recursively:
\[ P(\hat{S}_{m+1:n} \mid S_{1:m}) = \prod_{t=m}^{n-1} P(\hat{S}_{t+1} \mid \hat{S}_{1:t}) \quad (3) \]
This requires modelling only \(P(\hat{S}_{t+1} \mid \hat{S}_{1:t})\), which is the probability of the next character under the current prefix. For this, the model computes \(\arg\max P(\hat{S}_{t+1} \mid \hat{S}_{1:t})\) using the prefix tree (Trie) data structure. Upon finding the most probable completion for the current prefix, the model automatically displays the auto-completion.
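As a simplified illustration of the prefix-tree lookup, the Python sketch below stores word probabilities in a trie and returns the most probable completion of a prefix; the class names and the use of plain unigram word probabilities are assumptions made for brevity.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.prob = 0.0                        # > 0 only at nodes that end a word

class PrefixCompleter:
    # Illustrative trie-based completer: returns the most probable word for a prefix.
    def __init__(self, word_probs):
        self.root = TrieNode()
        for word, p in word_probs.items():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.prob = p

    def complete(self, prefix):
        node = self.root
        for ch in prefix:                      # walk down to the prefix node
            if ch not in node.children:
                return prefix                  # unknown prefix: leave the text unchanged
            node = node.children[ch]
        best, stack = (0.0, prefix), [(node, prefix)]
        while stack:                           # search every completion under the prefix
            node, s = stack.pop()
            if node.prob > best[0]:
                best = (node.prob, s)
            stack.extend((child, s + ch) for ch, child in node.children.items())
        return best[1]

# PrefixCompleter({"hello": 0.6, "help": 0.3, "hero": 0.1}).complete("he") -> "hello"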
3.2.2 Word-Level Corrector.
The module is activated only when a space character is detected. Upon detection, the sequence recognized so far is passed to the word-level n-gram language model (LM). First, it extracts the last word W from the recognized text, calculates the edit distance [63] between W and each dictionary word d, then replaces W with the word that has the minimum edit distance. Second, it auto-completes the sentence by modelling the joint probability distribution of the given words and future words.
Formally, we consider a given string of \(t\) words, \(W = W_1, W_2, \ldots, W_t\), and our goal is to predict the future word sequence \((W_{t+1}, W_{t+2}, \ldots, W_{t+T})\). The conditional probability can be formulated as:
\[ P(W_{t+1}, W_{t+2}, \ldots, W_{t+T} \mid W_1, W_2, \ldots, W_t) \quad (4) \]
This model uses bidirectional n-grams to account for both forward and reverse directions. The combined probability of a sentence, thus, is computed by multiplying the forward and backward n-gram probability of each word:
\[ P(W_1, \ldots, W_t) = \prod_{i=1}^{t} P_f(W_i) \, P_b(W_i) \quad (5) \]
In a forward n-gram, the conditional probability is estimated depending on the preceding words:
\[ P_f(W_i) = P(W_i \mid W_{i-n+1}, \ldots, W_{i-1}) \quad (6) \]
In contrast, in a backward n-gram, the probability of each word is estimated depending on the succeeding words:
\[ P_b(W_i) = P(W_i \mid W_{i+1}, \ldots, W_{i+n-1}) \quad (7) \]
Applying the values from Eq. 6 and Eq. 7, we get:
\[ P(W_1, \ldots, W_t) = \prod_{i=1}^{t} P(W_i \mid W_{i-n+1}, \ldots, W_{i-1}) \, P(W_i \mid W_{i+1}, \ldots, W_{i+n-1}) \quad (8) \]
Finally, the model predicts the most probable auto-completion of the given words and automatically adds it to the input text. We used the COCA corpus [23], one of the largest publicly available and genre-balanced corpora of English, to train the reviewer modules. The dataset contains approximately 1 billion words; however, we extracted the top 200,000 sentences as the vocabulary to reduce the computation time. The average perplexity score for the model is 42.6, indicating that it is well-trained and can anticipate words accurately.
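The Python sketch below illustrates the two word-level steps: minimum-edit-distance replacement against a dictionary, and a combined forward/backward n-gram score (shown here with bigrams); the function names, the fallback probability, and the bigram tables are illustrative assumptions.

def edit_distance(a, b):
    # Standard Levenshtein distance via a single-row dynamic program.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct_word(word, dictionary):
    # Replace a recognized word with the dictionary word at minimum edit distance.
    return min(dictionary, key=lambda d: edit_distance(word, d))

def sentence_score(words, fwd_bigram, bwd_bigram, fallback=1e-9):
    # Multiply the forward and backward bigram probability of each word (cf. Eqs. 5-8).
    score = 1.0
    for i in range(1, len(words)):
        score *= fwd_bigram.get((words[i - 1], words[i]), fallback)   # forward n-gram term
    for i in range(len(words) - 1):
        score *= bwd_bigram.get((words[i + 1], words[i]), fallback)   # backward n-gram term
    return score

# correct_word("helo", ["hello", "help", "world"]) -> "hello"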
5 Adopting A Transfer Learning Strategy
Most lip reading datasets contain limited vocabulary and do not support vocabulary relevant to everyday conversation. A model trained on a dataset with a specific vocabulary performs poorly when applied to a dataset other than the training vocabulary. Furthermore, training a deep learning model requires an enormous amount of data. Developing large-scale datasets tailored to particular vocabularies is extremely challenging, expensive, and time-consuming. To overcome this, we leverage the effectiveness of transfer learning, which exploits existing features (or knowledge) from a model trained on a high-resource vocabulary (the source model) and generalizes them to a new low-resource vocabulary (the target model) [76, 121].
Generally, features transition from general to specific characteristics by the last layer of the network, but this transition has not been extensively investigated in the context of lip reading. Research in deep learning has shown that standard features learned in the first layer appear regardless of the dataset and the task [96, 115], and thus are called general features. In contrast, features calculated by the last layer of a trained network are highly dependent on the dataset and the task. It is unclear, however, how this transition generalizes to lip reading, that is, to what extent features within a network can be generalized and used for transfer learning. Towards this, we investigated three strategies for transfer learning (Fig. 4). Consider a source model composed of N layers, with V layers representing the visual_frontend and S layers representing the sequence_processing (a code sketch illustrating these strategies follows the list below).
(1) Finetune_Last: The network is first initialized with the weights from the source model, then the top layers (N − 1) are frozen, and only the last layer is allowed to modify its weights. The model is then trained to fine-tune the last layer for the target vocabulary. During the training process, only the weights associated with the last layer are changed until they converge. With this method, only the final layer needs to be fine-tuned to adapt to the target dataset, while the features learned from the source model are reused.
(2) Finetune_Visual_Frontend: The network is first initialized with the weights from the source model, then the sequence_processing layers (N − V) are frozen and only the visual_frontend layers are allowed to modify their weights. Afterwards, the model is trained to fine-tune the visual_frontend for the target vocabulary. During the training process, only the weights associated with the visual_frontend are changed until they converge.
(3) Finetune_Sequence: The network is first initialized with the weights from the source model, then the visual_frontend layers (V) are frozen and only the sequence_processing layers are allowed to modify their weights. Afterwards, the model is trained to fine-tune the sequence_processing for the target vocabulary. During the training process, only the weights associated with the sequence_processing are changed until they converge.
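The PyTorch-style sketch below expresses the three strategies as parameter freezing; the attribute names visual_frontend, sequence_processing, and classifier are placeholders for the corresponding layer groups, not the actual identifiers used in the source models.

def apply_strategy(model, strategy):
    # Freeze every parameter, then unfreeze only the group named by the strategy.
    for p in model.parameters():
        p.requires_grad = False
    if strategy == "Finetune_Last":
        trainable = model.classifier              # only the last layer adapts to the target vocabulary
    elif strategy == "Finetune_Visual_Frontend":
        trainable = model.visual_frontend         # only the visual frontend adapts
    elif strategy == "Finetune_Sequence":
        trainable = model.sequence_processing     # only the sequence-processing layers adapt
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for p in trainable.parameters():
        p.requires_grad = True
    return model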
7 Experiment 3: Stationary Performance
We conducted a user study to compare MELDER with two state-of-the-art, pre-trained silent speech models, LipNet [9] and Transformer [3], with unseen data (data that has not been used to train the models) in a stationary setting (in a seated position). Since these models do not work in real-time (they compute one phrase at a time), we equipped them with the y = x + 5 windowing function and the Finetune_Sequence transfer learning strategy, as in MELDER, both for a fair comparison between the models (henceforth referred to as RT-LipNet and RT-Transformer) and to demonstrate that these approaches can be used independently with other silent speech models to make them real-time. We also disabled the visual feedback component of the reviewer channel (described in Section 3.2) in the study to eliminate a confounding factor (to remove any potential effects of feedback on performance).
7.1 Experimental Dataset
We used the Enron Mobile Email dataset [110] in this study. It contains genuine mobile emails, and thus is better suited to evaluate mobile text entry. First, we filtered the dataset using the following rules: 1) exclude phrases with lengths less than three or greater than ten, 2) exclude phrases containing common nouns, such as general names, places, and things, and 3) exclude phrases containing contractions or numeric values. After filtering, we randomly selected thirty phrases, removed all punctuation and non-alphanumeric tokens, and replaced all uppercase letters with lowercase letters. The selected phrases are presented in Appendix A.
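A minimal Python sketch of the filtering and normalization steps is shown below; it assumes phrase length is measured in words and detects contractions by apostrophes, and it omits the noun-based rule, which would additionally require a part-of-speech tagger.

import re

def keep_phrase(phrase):
    # Length rule (assumed to be in words) and the contraction/numeric rule.
    words = phrase.split()
    if not (3 <= len(words) <= 10):
        return False
    if any(ch.isdigit() for ch in phrase) or "'" in phrase:
        return False
    return True

def normalize(phrase):
    # Strip punctuation and non-alphanumeric tokens, then lowercase everything.
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", " ", phrase)
    return " ".join(cleaned.lower().split())

# keep_phrase("see you at 5") -> False (contains a numeric value)
# normalize("Sounds good, thanks!") -> "sounds good thanks"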
7.2 Participants
Twenty volunteers took part in the study (Fig. 6b). Their age ranged from 18 to 41 years (M = 25.55 years, SD = 6.2). Ten of them identified as women, nine as men, and one as non-binary. They all had owned a smartphone for at least five years (M = 8.4 years, SD = 2.1). Sixteen of them were frequent users of a voice assistant system on their smartphones (M = 3 years, SD = 2.3), while the remaining four were infrequent or non-users. They all received U.S. $10 for volunteering.
7.3 Apparatus
We developed a custom application for smartphones running Android OS using the default Google Android API (Fig. 6a). The application enabled users to record videos of themselves silently speaking the presented phrases using the front camera of a smartphone. In the study, participants recorded videos using the front cameras of their own smartphones to increase the variability of the dataset.
7.4 Design
The study used a within-subjects design with one independent variable “model” with three levels: RT-LipNet, RT-Transformer, and MELDER. In total, we collected (20 participants × 30 phrases) = 600 samples. The dependent variables were the same word error rate and words per minute performance metrics as described in Section 4.2. However, regarding computation time, this experiment measured the average time required by the model to process a phrase.
7.5 Procedure
The data collection process occurred remotely. We explained the purpose of the study and scheduled individual Zoom video calls with each participant ahead of time. We instructed them to join the call from a quiet room to avoid any interruptions during the study. First, we demonstrated the application and collected their consent and demographics using electronic forms. We then shared the application (APK file) with them and guided them through the installation process on their smartphones.
Participants were instructed to sit at a desk during the study. The application displayed one phrase at a time. Participants pressed the “Record/Stop” toggle button, silently spoke the phrase (uttered the phrase without vocalizing sound), then pressed the same button to see the next phrase. We did not instruct them on how to hold the device, but most of them held the device with their non-dominant hand and pressed the button with their dominant hand. Upon completion of the study, participants shared the logged data with us by uploading it to cloud storage under our supervision. For evaluation, we passed the recorded videos through the transcriber channel to obtain recognitions, then post-processed the recognized text through the reviewer channel to auto-correct errors and present the most probable auto-completion of the text at both the word and phrase level.
7.6 Results
A Martinez-Iglewicz test revealed that the response variable residuals were normally distributed. A Mauchly’s test indicated that the variances of populations were equal. Therefore, we used a repeated-measures ANOVA and post-hoc Tukey-Kramer multiple-comparison tests for all analyses. We also report effect sizes in eta-squared (η2) for all statistically significant results.
7.6.1 Word Error Rate.
An ANOVA identified a significant effect of model on word error rate (F2, 19 = 3632.67, p < .00001, η2 = 0.94). On average, RT-LipNet, RT-Transformer, and MELDER yielded 20.95% (SD = 0.8), 28.1% (SD = 1.1), and 19.75% (SD = 0.9) word error rates, respectively. A Tukey-Kramer test revealed that RT-Transformer was significantly more error prone than RT-LipNet and MELDER. Fig. 7a illustrates this.
7.6.2 Words per Minute.
An ANOVA identified a significant effect of model on words per minute (F2, 19 = 557.08, p < .00001, η2 = 0.89). On average, RT-LipNet, RT-Transformer, and MELDER yielded 4.96 wpm (SD = 0.3), 4.21 wpm (SD = 0.2), and 5.59 wpm (SD = 0.1), respectively. A Tukey-Kramer test revealed that RT-Transformer was significantly slower than RT-LipNet and MELDER. Fig. 7b illustrates this.
7.6.3 Computation Time.
An ANOVA identified a significant effect of model on computation time (F2, 19 = 11085.33, p < .00001, η2 = 0.99). On average, RT-LipNet, RT-Transformer, and MELDER required 12.93 s (SD = 0.4), 13.55 s (SD = 0.2), and 6.51 s (SD = 0.2) to compute a phrase, respectively. A Tukey-Kramer test revealed that MELDER was significantly faster in computing the phrases than RT-LipNet and RT-Transformer. Fig. 7c illustrates this.
7.7 Discussion
MELDER outperformed RT-LipNet and RT-Transformer in terms of both speed and accuracy. MELDER took 50% less time than RT-LipNet and 52% less time than RT-Transformer to compute a phrase. These effects were statistically significant, and translated into 13% and 33% faster text entry speeds than RT-LipNet and RT-Transformer, respectively, with the latter difference being statistically significant. MELDER was also the most accurate. It committed 6% fewer errors than RT-LipNet and 30% fewer errors than RT-Transformer, with the latter difference again being statistically significant. These statistically significant differences, accompanied by large effect sizes (η2 ≥ 0.1 constitutes a large effect size [7, 19]), indicate their potential generalizability to a broader population. These results strengthen our argument that MELDER is better suited for use on mobile devices than existing models.
We also compared the original LipNet and Transformer models with RT-LipNet and RT-Transformer in an ablation study. In the study, LipNet yielded a 97.3% word error rate, 4.6 wpm entry speed, and 14.2 s computation time. The addition of the windowing and transfer learning approaches reduced the word error rate by 78%, improved entry speed by 7%, and reduced computation time by 9%. Transformer also demonstrated substantial improvements in performance when empowered with the proposed windowing and transfer learning approaches. The original Transformer yielded an 81.2% word error rate, 5.2 wpm entry speed, and 14.7 s computation time. RT-Transformer, in turn, demonstrated a 65% reduction in word error rate, a 19% improvement in entry speed, and a 14% reduction in computation time. These findings validate that the suggested windowing and transfer learning methods can be employed separately with existing silent speech recognizers, not only enabling real-time capabilities but also enhancing their overall performance.
9 Experiment 5: Visual Feedback
We conducted a final study to compare the visual feedback methods of MELDER with the visual feedback method of Google Assistant. Note that the feedback methods were not included in Experiments 3 and 4 to eliminate a potential confounding factor. This study focuses on assessing the perceived performance of the visual feedback methods in MELDER and Google Assistant, rather than directly comparing their actual performance. Such a comparison would be unfair due to the inherent differences between the two systems: MELDER is an image-based silent speech recognizer, while Google Assistant’s speech-to-text relies on audio processing. These disparities stem from the distinct data types they handle (visual data for MELDER, auditory data for Google Assistant), resulting in varying complexities in operations and feature extraction.
9.1 Apparatus
We developed a custom Web application with HTML5, CSS, PHP, JavaScript, and Node.js. We hosted the application on GitHub. The application was loaded in the Chrome web browser v71.0.3578.98 running on a Motorola Moto G5 Plus smartphone (150.2 × 74 × 7.7 mm, 155 g). Its built-in front camera (12 megapixel with 1080 × 1920 pixel resolution) was used to track lip movements. Through an IP webcam Android application [54], we connected the smartphone’s camera to the server, which ran the silent speech recognition model. The server ran on a MacBook Pro 16" laptop with a 2.6 GHz Intel Core i7 processor, 16 GB RAM, and a 3072 × 1920 display at 226 ppi. The laptop and the smartphone were connected to a fast and reliable Wi-Fi network. There were no network dropouts during the study.
9.2 Participants
Twelve volunteers participated in the user study (Fig. 9). Their age ranged from 21 to 41 years (M = 27.8 years, SD = 5). Eight of them identified as women and four as men. They all had owned a smartphone for at least five years (M = 8.2 years, SD = 2.2). Eleven of them were frequent users of a voice assistant system on their smartphones (M = 3 years, SD = 2.4), while one was an infrequent user. They all received U.S. $15 for volunteering.
9.3 Design
We used a within-subjects design for the user study with one independent variable “feedback” with three levels: Google, word-level MELDER, and phrase-level MELDER. In each condition, participants entered thirty short English phrases from a subset of the Enron Mobile Email corpus, presented in Appendix A. In summary, the design was 12 participants × 3 conditions × 30 phrases = 1,080 input tasks in total. The dependent variables were the eight items on a questionnaire. The study gathered qualitative data through a custom questionnaire inspired by the System Usability Scale (SUS) [12]. The questionnaire asked participants to rate eight statements on the examined methods’ speed (“The technique was fast”), accuracy (“The technique was accurate”), effectiveness (“The feedback method used in the technique was effective and useful”), willingness-to-use (“I think that I would like to use this system frequently”), ease-of-use (“I thought the system was easy to use”), learnability (“I would imagine that most people would learn to use this system very quickly”), confidence (“I felt very confident using the system”), and privacy and security (“I think the system will be private and secure when using in public places”) on a 5-point Likert scale.
9.4 Feedback Approaches
We created two real-time visual feedback methods for silent speech recognition models, drawing inspiration from Google Assistant’s feedback approach. In Google Assistant, the system starts displaying likely letters and words as soon as it detects speech, refining the output as the speaker continues. These initial predictions are presented in a greyed-out font (Fig. 10a) to signify their potential for correction as more information becomes available. Unlike suggestions on a virtual keyboard, these predictions in Google Assistant are automatically managed by the system and cannot be manually selected, discarded, or updated by users. Additionally, the system offers feedback for sound detection, resembling oscilloscope traces or sound waves, presented as four colored vertical lines (Fig. 10a, bottom of the display). These lines dynamically change in height to indicate when the system detects sound and come to a halt when sound detection ceases.
MELDER also offers feedback on lip detection and speech recognition. When the front camera detects the user’s lips, it displays a red blinking circle, similar to the video recording indicator on mobile devices. The red circle ceases blinking and changes to grey when the lips are no longer visible (Fig. 10b). To keep users informed about the speech recognition process, we developed two feedback methods:
• Word-level feedback: This method offers real-time feedback on a word-by-word basis. It presents the most probable word based on the recognized input. The text remains gray until the confidence level of the word surpasses a specified threshold (empirically set at 0.75). Once this condition is met, the word turns black, signifying that it is considered finalized and will not be corrected (Fig. 10c).
• Phrase-level feedback: In this approach, real-time feedback is provided by displaying the most likely phrase based on the recognized prefix string. Each word within the phrase starts in gray and transitions to black when its confidence level exceeds a specific threshold (empirically set at 0.87). This change to black indicates that the word is considered fixed and will not undergo further correction (Fig. 10d).
The threshold values were determined empirically through multiple lab trials. During these trials, we tested thresholds ranging from 0.65 to 1.0 for both feedback methods. We selected the threshold values that proved most effective in delivering real-time feedback based on the experimental results. Similar to Google Assistant, neither of these feedback methods allowed users to proactively select, dismiss, or modify the suggestions; they were merely provided to inform users about the recognition process.
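The Python sketch below illustrates the thresholding logic: each word is rendered as tentative (greyed out) until its confidence exceeds the threshold, after which it is treated as final; the function name and the example confidences are illustrative assumptions.

WORD_THRESHOLD = 0.75      # word-level feedback threshold reported above

def render_feedback(words_with_conf, threshold=WORD_THRESHOLD):
    # Return (word, state) pairs: "final" (black) once confidence exceeds the
    # threshold, otherwise "tentative" (greyed out, may still be corrected).
    return [(w, "final" if conf >= threshold else "tentative")
            for w, conf in words_with_conf]

# render_feedback([("send", 0.91), ("the", 0.80), ("report", 0.42)])
# -> [("send", "final"), ("the", "final"), ("report", "tentative")]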
9.5 Procedure
The study was conducted in a quiet computer laboratory. First, we provided the participants with a brief overview of the functioning principles behind both speech and silent speech recognition. Subsequently, we offered practical demonstrations of the three distinct feedback methods employed in the study. We then collected their informed consent forms and let them practice with the three methods for about five minutes. They could extend the practice by an extra two minutes upon request.
The main study started after that. In the study, participants entered thirty short English phrases from the Enron set [110] by either speaking or silently speaking on a smartphone. All participants were seated at a desk. The three conditions (Google Assistant, MELDER with word-level feedback, and MELDER with phrase-level feedback) were counterbalanced to eliminate any potential effect of practice. As each phrase was recognized, the application automatically displayed the next phrase, continuing in this manner until all phrases within the given condition had been successfully completed. Participants were not required to re-speak a phrase in the event that it was not accurately recognized by the system.
Upon the completion of all conditions, participants completed a questionnaire that asked them to rate the three methods’ speed, accuracy, effectiveness, willingness-to-use, ease-of-use, learnability, confidence, and privacy and security on a five-point Likert scale (Section 9.3). Finally, we concluded the study with a debrief session, where participants were given a chance to share their thoughts and comments regarding their responses to the questionnaire.
9.6 Speed and Accuracy
As discussed in Section 9, the primary aim of this qualitative study was not to conduct a direct comparison of the actual speed and accuracy of the models. However, it is noteworthy that we did carry out a separate study comparing Google Assistant and MELDER. In this between-subjects study, 24 participants (average age = 26.25 years, SD = 5.9, comprising 12 women, 11 men, and 1 non-binary person) were evenly distributed into two groups: one using Google Assistant and the other using MELDER. Each group employed their designated input method in a seated position. A between-subjects ANOVA revealed a statistically significant effect of the input method on both entry speed (F1, 22 = 1083.35, p < .00001, η2 = 0.98) and accuracy (F1, 22 = 1219.38, p < .00001, η2 = 0.99).
As expected, participants using Google Assistant achieved an average entry speed of 30.54 wpm (SD = 2.6) and a remarkably low word error rate of 2.01% (SD = 0.3). In contrast, those using MELDER exhibited significantly slower input speeds, averaging 5.62 wpm (SD = 0.1), along with a much higher word error rate of 19.86% (SD = 1.0). Fig. 11 summarizes these findings. It is important to highlight that both the word-level and phrase-level versions of MELDER utilize the same recognition model and do not require users to actively choose suggestions from the feedback. Consequently, they are indistinguishable in terms of actual speed and accuracy.
9.7 Results
We used a Friedman test and post-hoc Games-Howell multiple-comparison tests for analysing all non-parametric study data. We also report effect sizes in Kendall’s W for all statistically significant results. We interpret Kendall’s W using Cohen’s guidelines [19]: W < 0.3 as a small, 0.3 ≤ W < 0.5 as a medium, and W ≥ 0.5 as a large effect size. Fig. 12 summarizes the findings of the study.
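For reference, the Python sketch below shows how such an analysis could be run with SciPy: a Friedman test across the three feedback conditions, with Kendall's W derived as χ2 / (N(k − 1)); the ratings in the example are made-up placeholders, not data from this study.

import numpy as np
from scipy.stats import friedmanchisquare

# Placeholder 5-point ratings from 12 participants under the three feedback conditions.
google = np.array([5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 4, 5])
word   = np.array([3, 3, 4, 2, 3, 4, 3, 3, 2, 3, 4, 3])
phrase = np.array([4, 3, 4, 3, 4, 4, 3, 4, 3, 4, 4, 4])

chi2, p = friedmanchisquare(google, word, phrase)   # Friedman test across the 3 conditions
n, k = len(google), 3
kendalls_w = chi2 / (n * (k - 1))                   # effect size: Kendall's W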
9.7.1 Perceived Speed and Accuracy.
A Friedman test identified a significant effect of feedback on perceived speed (χ2 = 9.83, df = 2, p < .01, W = 0.4) and accuracy (χ2 = 6.4, df = 2, p < .05, W = 0.3). A Games-Howell test revealed that participants found Google Assistant to be significantly faster than both word-level and phrase-level MELDER. But interestingly, the pairwise test was unable to identify any significant difference between the three methods in terms of accuracy.
9.7.2 Effectiveness.
A Friedman test failed to identify a significant effect of feedback on effectiveness (χ2 = 4.69, df = 2, p = .09). Additionally, a Games-Howell test confirmed that participants perceived all three examined feedback approaches to be relatively equally effective.
9.7.3 Willingness-to-Use.
A Friedman test identified a significant effect of feedback on willingness-to-use (χ2 = 7.0, df = 2, p < .05, W = 0.3). An analysis using the Games-Howell test demonstrated that participants expressed a significantly stronger preference for phrase-level feedback over word-level feedback. However, there was no statistically significant difference in their preference between these feedback types and Google Assistant.
9.7.4 Ease-of-Use and Learnability.
A Friedman test failed to identify a significant effect of feedback on either ease-of-use (χ2 = 6.0, df = 2, p = .05) or learnability (χ2 = 6.0, df = 2, p = .05). A Games-Howell test also confirmed that participants found the three examined methods relatively comparable in terms of ease-of-use and learnability.
9.7.5 Confidence.
A Friedman test identified a significant effect of feedback on confidence (χ2 = 12.56, df = 2, p < .01, W = 0.5). A Games-Howell test indicated that participants exhibited a notably higher level of confidence when utilizing Google Assistant compared to both word-level and phrase-level MELDER. Their confidence levels in using the two variations of MELDER appeared to be relatively similar.
9.7.6 Privacy and Security.
A Friedman test identified a significant effect of feedback on privacy and security (χ2 = 24.0, df = 2, p < .0001, W = 1.0). A Games-Howell test revealed that participants found both word-level and phrase-level MELDER to be significantly more secure and private than Google Assistant.
9.8 Discussion
MELDER was notably slower and displayed a higher error rate compared to Google Assistant. The discrepancy in text entry speed between the two methods was readily observed by all participants. They universally perceived MELDER, regardless of the feedback method, to be slower than Google Assistant. This notably influenced participants’ confidence: they reported feeling significantly more confident when using Google Assistant compared to both word-level and phrase-level MELDER. One participant (male, 26 years) commented, “I think silent speech is slower, and speed is really important in some cases. Apart from this, I think it is going to be an extremely cool piece of technology.”
Interestingly, participants found MELDER with phrase-level feedback to be relatively faster than MELDER with word-level feedback, even though both variants used the same underlying model. The majority of participants agreed with the statement that MELDER with phrase-level feedback is fast (N = 8), while a few remained neutral (N = 3), and only one participant disagreed with the statement. These results indicate that phrase-level feedback enhanced users’ perception of the method’s speed, despite the actual performance being similar.
Participants’ perception of the accuracy of the examined methods yielded surprising results. Despite the fact that both variants of MELDER, with either word-level or phrase-level feedback, displayed significantly higher error rates compared to Google Assistant, participants did not perceive them as notably error-prone. In fact, the vast majority of participants agreed with the statement that the method is accurate (N = 11), with only one participant expressing a neutral opinion on the matter. It is important to note that while a Friedman test identified a statistically significant effect of feedback on perceived accuracy, the post-hoc multiple-comparison analysis did not confirm this significance. This suggests that participants’ perceptions of accuracy may not align with the quantitative error rates, highlighting an interesting aspect of user perception in human-computer interaction studies.
Participants’ perception of the performance of MELDER with phrase-level feedback had a clear impact on their willingness to use the different methods. They expressed a significantly higher willingness to use both Google Assistant and MELDER with phrase-level feedback compared to MELDER with word-level feedback. This observation underscores the potential effectiveness of the proposed methods and the feedback approaches employed in the study. Participants’ willingness to use MELDER with phrase-level feedback was also positively influenced by their perception of the method’s security and privacy features. They viewed both variants of MELDER as significantly more private and secure compared to Google Assistant, primarily because bystanders could not overhear their interactions. Some participants even indicated that they would consider using the method primarily for its privacy and security benefits. For instance, one participant (female, 21 years) stated, “Due to its privacy benefits, it is extremely useful.” These findings align with prior research on the perceived privacy and security advantages of speech and silent speech-based input methods [81].
The results showed that participants found both Google Assistant and the two variations of MELDER to be relatively comparable in terms of effectiveness, ease of use, and learnability. While there were slight variations in the ratings for these three methods, a Friedman test did not detect any statistically significant differences in these aspects. Furthermore, participants expressed that both variants of MELDER were easy to use, and they believed that their performance would improve with practice. As one participant (female, 21 years) noted, “Adapting to silent speech was challenging at first, but became easier as I progressed.” This feedback suggests that users may require some time to acclimate to silent speech input but can become more proficient with practice.
10 Conclusion
In this comprehensive work, we have successfully developed a real-time silent speech recognition system tailored for mobile devices. Our approach involves breaking down the input video into smaller temporal segments, processing them individually, and utilizing advanced language models to auto-correct output at both character and word-levels. Additionally, our system offers users valuable feedback on the silent speech recognition process.
The work began with an experiment where we explored four different windowing functions for segmenting lip videos, ultimately determining that a linear function (y = x + 5) yielded the best performance. Building upon this, we introduced a transfer learning approach aimed at enhancing the capabilities of silent speech recognition models for everyday conversational contexts. We investigated three strategies for transferring learning with three existing silent speech models, with the Finetune_Sequence strategy emerging as the most effective, showcasing its potential for improving the performance of existing pre-trained models. Equipped with the linear slicing function and the Finetune_Sequence transfer learning approach, we compared our system, MELDER, with two state-of-the-art silent speech models in two user studies: one in a stationary setting (seated position) and another in a mobile setting (while walking). The results demonstrated that MELDER outperformed both methods, establishing its feasibility for mobile device use. Furthermore, we conducted a qualitative study comparing our proposed word-level and phrase-level visual feedback methods with Google Assistant’s feedback mechanism. Interestingly, the study revealed that users’ perceived performance did not always align with actual performance. Notably, the phrase-level feedback significantly enhanced users’ perception of the silent speech model.
In conclusion, this work firmly establishes silent speech as a viable and effective method for interacting with mobile devices. As part of our commitment to advancing research in this field, we have made the dataset, source code, and other materials generated during this study freely available for download. We hope that this will encourage further investigations and replication efforts in this promising area of study.
11 Future Work
In future work, we plan to investigate various manual error correction strategies, empowering users to effectively correct recognition errors. Additionally, our aim is to further optimize the algorithm, enhancing its speed, accuracy, and adaptability, especially for individuals with diverse speech disorders. We also intend to conduct more in-depth studies to thoroughly examine the usability, adaptiveness, and robustness of the model. Moreover, testing the method in varied settings, such as under different lighting conditions and noise levels, is also part of our future research agenda.