Data Descriptor

BELMASK—An Audiovisual Dataset of Adversely Produced Speech for Auditory Cognition Research

by Cleopatra Christina Moshona 1,*, Frederic Rudawski 2, André Fiebig 1 and Ennes Sarradj 1

1 Engineering Acoustics, Institute of Fluid Dynamics and Technical Acoustics, Technische Universität Berlin, Einsteinufer 25, 10587 Berlin, Germany
2 Lighting Technology, Institute of Energy and Automation Technology, Technische Universität Berlin, Einsteinufer 19, 10587 Berlin, Germany
* Author to whom correspondence should be addressed.
Submission received: 20 June 2024 / Revised: 21 July 2024 / Accepted: 22 July 2024 / Published: 24 July 2024
Figure 1. Lexical frequency of the unique words (n = 399) contained in the BELMASK matrix sentences, based on the 7-point logarithmic frequency scale (0: rare–6: frequent) of the German digital dictionary “Digitales Wörterbuch der deutschen Sprache” (DWDS) (https://www.dwds.de/d/api), accessed on 21 July 2024. Note: cumulative calculation of frequencies for inflected/uninflected word forms.
Figure 2. Phonemic distribution of the BELMASK matrix sentences (blue) and the Oldenburg sentence test (green), digitized and extracted from [49], compared to the average phoneme distribution for written German (red), as reported in [55] (see Table “100,000 sound count”), and conversational German (yellow), based on the extended phone monogram statistics for the Verbmobil 1+2, SmartKom and RVG1 databases (https://www.bas.uni-muenchen.de/forschung/Bas/BasPHONSTATeng.html) [56], accessed on 21 July 2024.
Figure 3. Relationship between pseudo-log-likelihood (PLL) scores of the BELMASK matrix sentences and sentence length (number of tokens), including correlation analysis (Pearson’s r = −0.77). The shaded area of the regression line corresponds to the 95% confidence interval. Each dot represents a sentence. The red dot represents the highly predictable reference sentence “The rocket flies into space”, not contained in the BELMASK set.
Figure 4. Relationship between pseudo-log-likelihood (PLL) scores of the BELMASK matrix words and DWDS word log frequency, including correlation analysis (Pearson’s r = 0.56). The shaded area of the regression line corresponds to the 95% confidence interval. Each dot represents a unique word.
Figure 5. Frequency response of the face mask used during recordings with subsequent 1/12 octave band smoothing, measured reciprocally using a 3D-printed head [63].
Figure 6. Experimental setup of the recording sessions. Display of keywords on screen in speaker booth not depicted.
Figure 7. Example of annotation layers in the Praat TextGrid object as a result of the G2P→MAUS→PHO2SYL pipeline.

Abstract

In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions, and describe the development of the related speech task. The dataset contains in total 128 min of audio and video recordings of 10 German native speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years), uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with a Filtering Facepiece P2 (FFP2) mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture, and was originally conceptualized to be used for the administration of a working memory task. The dataset is stored in a restricted-access Zenodo repository and is available for academic research in the area of speech communication, acoustics, psychology and related disciplines upon request, after signing an End User License Agreement (EULA).
Dataset License: End User License Agreement (EULA).

1. Introduction

The ubiquitous use of medical-grade face masks during the Coronavirus Disease 2019 (COVID-19) Pandemic highlighted the challenges face masks pose for both human interlocutors and various technologies, such as automatic speech/speaker recognition (ASR) systems [1,2] and biometric authentication algorithms [3], which rely on unhindered speech and visual signals. In that sense, face masks can be regarded as an adverse condition and a hindrance to communication. Even after the pandemic, medical-grade face masks continue to be used by vulnerable populations and in various scenarios, including medical settings, industrial environments and regions with high levels of air pollution.

1.1. Summary of Research Findings on Acoustic and Perceptual Effects of Face Masks

There is a consensus in the literature that face masks affect the acoustic properties of speech by attenuating high-frequency components and thereby altering the spectral characteristics of the spoken signal; see [4] for a review. They also impede voice transmission and radiation due to the physical obstruction they pose [5]. This can lead to reduced intelligibility and speech recognition accuracy in both human listeners and machine-based systems, especially in noisy backgrounds and with increasing spatial distance. However, the extent of these effects varies, depending on the face mask type and on whether the speaker employs adjustment strategies.
For human listeners, the drop in intelligibility performance ranges from 2.8% to 15.4% [6,7,8,9,10], but the effects are not always significant [11] or consistently negative [12]. For machine-based systems, speech recognition accuracy was reported to be, on average, 10% lower with face masks in the presence of a Signal-to-Noise Ratio (SNR) of +3 dB [2]. In quiet conditions, the impact of face masks on speech recognition accuracy seems to be negligible for both human listeners and machine-based systems. However, automatic speaker recognition and classification of mask/no mask speech were found to be highly variable [1], suggesting that face masks may hamper speaker identification in critical practical scenarios that require high precision, e.g., in forensic contexts. Overall, filtering facepiece masks are generally more disruptive than surgical masks due to their material characteristics [13] and efficacy [14].
Face masks also obscure visual cues such as lip movements, which are crucial for audio–visual speech processing and for individuals who rely on lip-reading, such as hearing-impaired listeners, cochlear implant (CI) users [15,16] and non-native speakers [9]. As a consequence, face masks can lead to increased listening effort [7,8,17,18], even when no competing sounds are present. This finding is consistent across age groups. Listening effort is defined as “the deliberate allocation of mental resources to overcome obstacles in goal pursuit when carrying out a [listening] task” [19].
Simultaneously, face masks can induce vocal fatigue [20,21,22], especially when speakers adapt their speaking style to compensate for the physical constriction [23]. This is the case when clear speech mechanisms are triggered, e.g., by hyperarticulating or speaking more slowly. Such adaptation strategies also include Lombard speech [24], a speaking style in which speakers increase their volume and pitch to become more understandable in noisy settings [25]. While clear speech has been shown to quite efficiently counteract the barriers imposed by a face mask, improving discrimination scores to a level similar to, or even higher than, those without face masks [9,26,27], upholding such speaking styles can be effortful for speakers and impacts voice quality, leading to hoarseness, volume instability and strain [28]. The implications of voice quality in the context of cognition have received less attention than other acoustic factors, but findings confirm that deviations from typical modal phonation can increase listeners’ reaction times, listening effort and perceived annoyance, impair recall of spoken content and influence attitude towards speakers, as summarized by [29]. In our own earlier work, we observed that Lombard speech negatively impacted recall performance despite a face mask’s attenuation, which we hypothesized may be attributed to increased listener annoyance [30].
Besides immediate processing, face masks also appear to affect memory. Studies indicate that wearing face masks can reduce recall performance for audio-visually presented spoken material [9,31,32]. The reasons for the drop in recall performance are not yet fully understood, but the aforementioned studies postulate that the effect is likely attributable to increased processing demands caused by signal deterioration, which in turn reduces the resources available for encoding speech in memory.
Face masks also seem to disrupt metacognitive processes, affecting the accuracy of confidence judgments [18]. Metacognition refers to the ability to monitor and assess one’s own cognitive processes while engaged in a task [33]; an imperative skill during social interactions. Face masks furthermore influence the speed and accuracy of age, gender, identity and emotion categorization [34,35]. The diminished confidence monitoring and the heightened difficulties in identifying emotions are explained by the absence of visual cues, which are crucial for holistic processing. Given that compensatory strategies in adverse conditions are triggered by the subject’s ability to correctly assess the quality of the communicative exchange, a disruption of these skills may be particularly problematic.

1.2. Datasets of Face-Masked Speech

To facilitate the study of face mask effects on acoustics, cognition and perception, extend the current knowledge base and address the described challenges, it is essential to develop and utilize datasets that include speech recorded with face masks, i.e., face-masked speech, preferably in diverse contexts and languages. Though several of the studies referenced in Section 1 of this article have used human recordings of face-masked speech, only two of them have released their datasets for further exploration. In addition, the overwhelming body of literature on the subject of face-masked speech employs English samples, which raises the problem of over-reliance on a single model language, as explained by, e.g., [36,37]. One of the few datasets available for the German language is the Mask Augsburg Speech Corpus (MASC) [38], which was originally collected as part of the Mask Sub-Challenge (MSC) mask/nomask classification task of the INTERSPEECH 2020 Computational Paralinguistics Challenge (ComParE) [39]. A second dataset recorded for similar feature extraction and classification purposes is the MASCFLICHT Corpus [40].
While these datasets encompass a variety of guided, read and free speech tasks with and without a face mask, they were not originally conceived for research in the field of auditory cognition and therefore have a few shortcomings when applied to this domain. Both the MASC and MASCFLICHT corpora are restricted to audio-only recordings, which limits their use for research questions aimed at exploring the role of auditory and visual cues, as well as their interactions. In addition, the MASC corpus only contains samples from participants wearing surgical masks, possibly due to the pre-pandemic data collection period. Given that FFP2-type face masks are considered the gold standard and have been shown to impact speech acoustics more heavily, the need to extend datasets to include this mask type arises. The MASCFLICHT corpus was recorded with a smartphone microphone. The suitability of mobile communication device (MCD) recordings for acoustic voice analysis is an ongoing subject of debate. While some studies report comparable results between high-quality recording systems and MCDs [41], others indicate that the robustness of some voice measures is compromised due to the limited dynamic range and uneven frequency responses of the inbuilt microphones [42,43]. Lastly, the principal limitation of both datasets for cognitive auditory research is the lack of controlled and standardized test material, e.g., matrix sentences, which is a necessary prerequisite for many research questions in this field.
Regarding Lombard speech, although prominent corpora exist for a variety of languages, e. g., the Audio–Visual Lombard Grid Speech Corpus [44] for native English or the Dutch English Lombard Native Non-Native (DELNN) Corpus [45], the authors have only been able to identify two readily available datasets for German. These are the Bavarian Archive for Speech Signals (BAS) Siemens Hörgeräte Corpus (HOESI) [46] and the Lombard Speech Database for German Language [47]. However, neither dataset comprises standardized test material and both datasets are limited to audio. The HOESI corpus features spontaneous, casual dialogues in diverse noisy environments, whereas the Lombard Speech Database for the German Language contains a collection of read sentences. As evidenced by [48], the Lombard effect is a multimodal phenomenon, characterized by increased face kinematics. Audiovisual datasets of Lombard speech are therefore particularly useful to further explore these aspects.

2. Data Description

The Berlin Dataset of Lombard and Masked Speech (BELMASK) was collected to facilitate research in the field of auditory cognition and to extend the resources available for the German language. It allows for the analysis of the effects of face masks on specific cognitive domains, such as memory or selective auditory attention, while considering related voice quality changes, i.e., Lombard speech, which commonly occurs when wearing a face mask, especially in ecologically valid and therefore inherently noisy settings. Given the nature of the dataset, the effects of face-masked and Lombard speech can also be studied independently.
The Berlin Dataset of Lombard and Masked Speech (BELMASK) is a phonetically controlled, multimodal dataset, containing, in total, 128 min of audio and video recordings of 10 German native speakers, uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with an FFP2 mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture. Due to the nature of the dataset and in accordance with existing regulations, it is stored in a restricted-access Zenodo repository under an academic, non-commercial license, which requires signing an End User License Agreement (EULA). The dataset is summarized in Table 1. The matrix sentence test material development and the data collection process are described in the following sections. All abbreviations used in this article are explained in the corresponding section at the end of the article.

3. Methods

3.1. Construction of the Test Material

The test material used in the BELMASK dataset is modeled after established matrix tests for the German language, such as the Oldenburg sentence test (OLSA) [49]. Matrix sentence tests were originally developed for adaptive speech-in-noise tests and are primarily used in the context of audiological diagnostics, e.g., to determine speech reception thresholds (SRT), but are also broadly employed in speech intelligibility experiments. They usually consist of a basic inventory (matrix) of words and have a fixed grammatical structure. Candidate words within each word group are interchangeable between sentences, allowing for random combinations. Due to the limited alternatives per word group, words are eventually repeated between sentences. While this is unproblematic, or even desirable, in certain settings, it limits the use of matrix tests for certain memory tasks with multiple testing conditions due to potential learning effects. This motivated the development of novel matrix sentences to be used for the administration of a cued serial recall task.
The BELMASK test material consists of 96 semantically coherent German sentences. Considering that the memorization of words is influenced by their lexical frequency, with high-frequency words being easier to remember [50], the construction of sentences took lexical frequency into account. This was done to equalize the level of difficulty, with mainly average and high lexical frequency words being used, see Figure 1. To avoid context-based and linguistic structure bonuses in recall performance [51,52,53], the sentences are designed to not be highly predictable. Predictability was validated using a Large Language Model (LLM) and an optimized version of the pseudo-log-likelihood metric, see Section 3.2 for details. We opted for this type of validation instead of one with human subjects, due to its cost-effectiveness and reproducibility, while also circumventing the multitude of confounders typically encountered with human participants.
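For orientation, the DWDS frequency class of a candidate word can also be retrieved programmatically. The sketch below is a minimal example; the endpoint path and JSON field names are assumptions based on the DWDS API documentation page linked in Figure 1 and should be verified before use. The example words are taken from the matrix sentences in Appendix A.

# Minimal sketch: look up the DWDS logarithmic frequency class (0 = rare ... 6 = frequent)
# for candidate words. Assumption: the "frequency" endpoint and the "frequency" field
# follow the public DWDS API documentation (https://www.dwds.de/d/api); verify before use.
import requests

def dwds_frequency_class(word: str) -> int:
    resp = requests.get("https://www.dwds.de/api/frequency/", params={"q": word}, timeout=10)
    resp.raise_for_status()
    return int(resp.json()["frequency"])

for candidate in ["Fahnen", "Muscheln", "Zucchinis"]:
    print(candidate, dwds_frequency_class(candidate))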
All sentences are syntactically identical and consist of 5–6 words, beginning with a subject, followed by a verb, a numeral, an adjective and an object, e.g., “Timo besitzt neun rosa Fahnen” (“Timo owns nine pink flags”). Subjects are either common German names or a noun with its respective article, accounting for the inconsistency in the number of words per sentence. The latter two words are always in plural form and serve as the keywords to be recalled. They consist of a total of four or five syllables to balance out the difficulty and prevent word length effects [54]. We opted for two keywords instead of one to facilitate mnemonic processing, encouraging strategies such as visualization or association. Each word, except for numerals, appears once within the test material.
To ensure that the test material is representative of the speech sounds contained in everyday communication, the sentences exhibit a phonemic distribution that aligns with the average phoneme distribution found in the German language and is comparable to other matrix tests; see Figure 2. Subjects contain the tense German vowels /a:/, /e:/, /i:/, /o:/, /u:/, equally distributed among the test material. Given that the sentences were initially not conceived to be used as separate test lists, the consonant distribution is not balanced between subsets of sentences. In future versions, we intend to optimize subsets with regard to equal phoneme distribution for all phoneme classes and optimize the sentences for equal mean intelligibility, equal degree of familiarity and equal number of syllables throughout all words. For the complete list of sentences refer to Appendix A.
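As an illustration of how such a distribution can be checked, the following sketch counts phoneme occurrences in a set of space-separated X-SAMPA transcriptions and converts them to relative frequencies. The transcriptions shown are illustrative placeholders, not the actual BELMASK material, and the comparison to reference statistics (Figure 2) is not reproduced here.

# Minimal sketch: relative phoneme frequencies from canonical X-SAMPA transcriptions
# (one string per sentence, phones separated by spaces). The example transcriptions
# are placeholders, not taken from the BELMASK material.
from collections import Counter

transcriptions = [
    "t i: m o b @ z I t s t",   # placeholder transcription
    "z a: r a f E 6 k aU f t",  # placeholder transcription
]

counts = Counter(phone for trans in transcriptions for phone in trans.split())
total = sum(counts.values())
for phone, n in counts.most_common():
    print(f"{phone}\t{n / total:.3f}")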

3.2. Validation of Predictability

The predictability of the matrix sentences was validated using Bidirectional Encoder Representations from Transformers (BERT), a language model, which is freely available and can be used out-of-the-box without further fine-tuning [57]. We used a variant of BERT, trained on 16 GB of German text data [58]. Masked Language Models (MLMs) like BERT are designed to predict semantic content, such as omitted words within a sentence, based on the context provided by the surrounding words. The process involves masking certain words within the input text and then training the model to anticipate the masked words using contextual clues from the unmasked words. To do so, the model accesses information bidirectionally from preceding and subsequent tokens. Tokens represent smaller units of words, segmented using the WordPiece subword tokenization method outlined in [59].
Perplexity is a metric used to score the performance of a model in this task, i.e., how surprised it is when predicting the next word in a sequence. High perplexity values indicate worse model performance and therefore lower predictability. Perplexity for bidirectional models such as BERT can be calculated by means of the pseudo-log-likelihood (PLL) sentence scoring approach, proposed by [60]. To score the BELMASK matrix sentences, we used an optimized metric (PLL-whole-word) proposed by [61], which adapts the scorer module of the minicons library [62] and takes into account whether preceding and future tokens belong to the same word. The score of a sentence “is obtained as the sum of the log probabilities of each of the |w| tokens in each of the |S| words in [a sentence] S given the token’s context”:

\mathrm{PLL}_{ww}(S) := \sum_{w=1}^{|S|} \sum_{t=1}^{|w|} \log P_{\mathrm{MLM}}\left( s_{w_t} \mid S_{\setminus s_w} \right)
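The dataset’s scores were computed with the adapted minicons scorer described above. As an illustration of the underlying idea only, the sketch below computes the original token-level PLL of [60] with the German BERT model, masking one token at a time; it does not implement the whole-word adjustment of [61], so its values will differ from those reported here.

# Minimal sketch of pseudo-log-likelihood (PLL) scoring in the sense of [60]:
# mask each token in turn and sum the log probability of the original token.
# Illustration only; the BELMASK scores use the PLL-whole-word variant of [61].
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "dbmdz/bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def pll(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    score = 0.0
    with torch.no_grad():
        # Skip the [CLS] (first) and [SEP] (last) special tokens.
        for pos in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone().unsqueeze(0)
            masked[0, pos] = tokenizer.mask_token_id
            log_probs = torch.log_softmax(model(masked).logits[0, pos], dim=-1)
            score += log_probs[input_ids[pos]].item()
    return score

print(pll("Die Rakete fliegt ins All."))       # highly predictable reference sentence
print(pll("Timo besitzt neun rosa Fahnen."))   # example BELMASK matrix sentence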
The resulting analysis demonstrates that the BELMASK matrix sentences have strongly negative PLL scores, i.e., scores that are high in absolute value, which reflects model surprisal. Compared to a highly predictable reference sentence, “Die Rakete fliegt ins All” (“The rocket flies into space”), with a PLL score of −9.72, all other sentences exhibit PLL scores in the range of −30 to −92.6, see Figure 3. This verifies that the content of the BELMASK matrix sentences is not highly predictable and that they are thereby suited for memory tasks. The figure also shows that PLL scores decrease with sentence length, i.e., longer sentences exhibit higher perplexity. Figure 4 furthermore demonstrates a positive correlation between word-level PLL scores and word frequency, with higher-frequency words resulting in less perplexity. The variation in word frequency and PLL scores also allows for an evaluation of how these metrics correlate with human retention.

3.3. Speakers

A total of ten German native speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years) were recruited for the recording sessions of the matrix sentences. The sample consisted of university students and academic staff. All speakers reported normal hearing and vision and no known reading or speaking impairments. Table 2 summarizes the demographics for each subject. Written consent was obtained from all speakers to process, store and publish collected sociodemographics, audio and video data. Compensation was offered in the form of trial participant credit.

3.4. Audio and Video Recordings

The recording sessions were conducted in a sound-attenuated speaker booth under constant lighting conditions. The experimental setup was a 2 × 2 (physical obstruction × sound background) factorial setting, aimed at simulating adverse speaking conditions. Physical obstruction was either a face mask or no face mask, and the sound background was either quiet (“silence”) or babble noise played back over circumaural, acoustically closed Beyerdynamic DT 1770 Pro headphones. Headphones were worn by the speakers throughout the individual sessions to enable communication with the experimenter seated outside the booth. This resulted in the following four recording conditions: nomask_sil, mask_sil, nomask_noise, mask_noise. The face mask used was an unvalved class 2 filtering facepiece (FFP2), type 3M 9320+. This type was chosen because it is certified, unrestrictedly available and is frequently used in the field of occupational safety. The transmission properties of the mask are shown in Figure 5, evidencing a dampening effect in the 2–8 kHz frequency band.
Audio was recorded in stereo at a sampling rate of 48 kHz in Audacity, combining inputs from two separate channels: the microphone of the speaker inside the booth and that of the experimenter outside the booth. The latter channel was utilized for documentation during post-processing and was subsequently split from the main channel, resulting in mono audio. The recording microphone was a Sennheiser MD421-II cardioid studio microphone positioned at a 15 cm distance and 45° angle from the speaker’s mouth. The bass roll-off filter was activated at its lowest setting (M + 1) to reduce potential proximity effects. The nomask_noise condition was used to adjust the microphone gain level at the beginning of each recording session; see Table 3. This adjustment was necessary to avoid clipping artifacts due to the varying speech volume of the speakers.
Videos were captured by means of a Razer Kiyo Pro HD camera, mounted on a monitor and positioned at a distance of 80 cm in front of the subjects, using the inbuilt Windows 10 camera application. Speakers were instructed to look into the camera while producing the sentences. Audio recordings were exported in .wav format and encoded as signed 24-bit PCM. Prior to exporting the audio files, the communication stream with the experimenter was separated from the main stream. Video recordings were exported in .mp4 format in full HD.
The sound pressure level (SPL) of the produced speech was tracked with an NTi Audio XL2 sound level meter, positioned next to the microphone. All audio recording and playback devices were routed via an RME Fireface UCX-II audio interface. Babble noise was looped and mixed into the same audio channel used to communicate with the experimenter but was only audible for the speakers, who had to selectively focus their attention on the experimenter’s voice when receiving instructions. The TotalMix FX software (version 1.95, RME Audio, Haimhausen, Germany) was used to control recording settings. The experimental setup is depicted in Figure 6.
The noise used during two of the recording conditions consisted of mixed-gender, six-talker babble, which was created by superimposing concatenated, read sentences of six individual speakers from the KGGS corpus [64]. The content of these sentences was unrelated to that of the test material. Prior to superimposition, the chains of sentences were trimmed or padded with silence so that they all had the same length and were then normalized using the Linear Mapping filter in ArtemiS SUITE. This was performed to minimize the effect of single voices standing out and distracting the speaker during sentence production. Playback level for the babble noise was calibrated at ∼67.5 dB(A), using an HMS III HEAD acoustics artificial head. This level was deemed optimal to naturally trigger Lombard speech, while at the same time avoiding leakage during recordings.
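A minimal sketch of this superimposition step is given below, assuming six equal-sample-rate mono .wav files of concatenated read sentences. The file names are placeholders, and the padding and simple peak normalization shown are simplified stand-ins for the ArtemiS SUITE processing described above.

# Minimal sketch: build multi-talker babble by superimposing six talker tracks.
# Assumes mono files with identical sample rates; file names are placeholders,
# and the peak normalization stands in for the Linear Mapping filter in ArtemiS SUITE.
import numpy as np
import soundfile as sf

files = [f"talker_{i:02d}.wav" for i in range(1, 7)]  # hypothetical file names
tracks, rate = [], None
for f in files:
    data, sr = sf.read(f)
    if rate is None:
        rate = sr
    tracks.append(data)

length = max(len(t) for t in tracks)
tracks = [np.pad(t, (0, length - len(t))) for t in tracks]  # pad shorter tracks with silence

babble = np.sum(tracks, axis=0)
babble /= np.max(np.abs(babble))  # avoid clipping before export
sf.write("babble_6talker.wav", babble, rate)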
Speakers were not given any instructions regarding articulation speed or speech style in order to preserve naturalness and maintain interspeaker variability. The aim was to accurately record any changes in the speech signal caused by the face mask’s physical obstruction and exposure to noise, without biasing speakers towards any particular adaptation strategy. Condition order was randomized for each speaker to account for order and carryover effects, see Table 3. To ensure ecological validity while maintaining a controlled laboratory environment, sentences were not read out. Instead, speakers were cued by seeing the last three words of each sentence in their uninflected form via slides that were duplicated on a screen inside the booth underneath the question meant to trigger the full sentence, for instance, “What does Timo own?” (“9, pink, flag”). In the context of a memory task, this was implemented to minimize potential recall boosts produced through read speech, which is characterized by reduced speech rate and clearer articulation [65]. To avoid hesitations, speakers were asked to first mentally form the sentences and then speak them out loud. The slides were controlled by the experimenter, who monitored the correctness of spoken sentences and the speakers’ eye contact with the camera, asking participants to repeat sentences in case of errors.

4. Post-Processing

During post-processing, it became evident that videos were recorded with a variable frame rate. They were therefore resampled with a fixed frame rate of 30 fps for further processing, using Kdenlive (version 23.08.04). Given that audio and video recordings did not start simultaneously, the offset was first manually detected to synchronize the two modalities. The audio recordings underwent visual and auditory inspection for each speaker and condition to localize the best utterances, which were then manually segmented using tier boundaries within the PRAAT software (version 6.4.14) [66]. The start and end times of each sentence were extracted automatically to enable cutting. Utterances were deemed optimal if they lacked any disfluencies or slips of the tongue and exhibited predominantly neutral intonation. Additionally, instances of blinking and averted gazes were taken into account during the selection process.
Audio and video cutting was performed with an automated Matlab (version R2022b, MathWorks, Natick, MA, USA) script. Using the determined start and end times and the offset, individual sentences for each speaker and condition were extracted from the audio and video files. Original video audio was removed and replaced with the high-quality audio recording of the MD421-II microphone. Individual sentences were exported as single .wav and combined audio-video .mkv files. The wave file bit depth was reduced to 16 bit for further processing within the Bavarian Archive for Speech Signals (BAS) Web Service framework [67]. The video files were exported using ffmpeg (version 4.4.2) within Matlab, encoded with the H.264 codec.
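The actual cutting was done with a Matlab script, as described above; the sketch below illustrates the same per-sentence cut-and-remux step using ffmpeg called from Python, under the assumption of one session video and one high-quality session wav per speaker and condition. File names, sentence times and the offset value are placeholders.

# Minimal sketch of the per-sentence cutting step: cut one utterance from the
# high-quality audio, cut the corresponding video span, drop the camera audio
# and mux in the high-quality cut, re-encoding the video with H.264.
# File names, times and offset are placeholders; the dataset was cut in Matlab.
import subprocess

video_in = "VP01_mask_noise.mp4"   # hypothetical session video
audio_in = "VP01_mask_noise.wav"   # hypothetical high-quality session audio
start, end = 12.34, 15.01          # sentence boundaries from the Praat TextGrid (placeholders)
offset = 0.82                      # manually determined audio/video offset (placeholder)
dur = end - start

# Cut the high-quality audio for this sentence and reduce it to 16 bit.
subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t", str(dur),
                "-i", audio_in, "-c:a", "pcm_s16le", "s01.wav"], check=True)

# Cut the video span (shifted by the offset), keep only its video stream and
# mux in the high-quality audio cut.
subprocess.run(["ffmpeg", "-y", "-ss", str(start + offset), "-t", str(dur),
                "-i", video_in, "-i", "s01.wav",
                "-map", "0:v:0", "-map", "1:a:0",
                "-c:v", "libx264", "-c:a", "copy", "s01.mkv"], check=True)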
The high-quality audio recordings of each sentence with their corresponding orthographic transcriptions were uploaded to the BAS Web Services within the Matlab script, using the provided RESTful API to enable automatic segmentation and annotation using the G2P→MAUS→PHO2SYL pipeline (see Note 2). Given that the orthographic transcriptions were readily available as .txt files and for data protection reasons, we opted for the pipeline without ASR. The output format was specified as ‘Praat (TextGrid)’ and output encoding as X-SAMPA (ASCII) with all other options set to their default settings. The ‘G2P’ module converts the orthographic input transcript into a phonemic one, the ‘MAUS’ service segments the input into words and phones, and the ‘PHO2SYL’ service generates phonetic syllable segments based on the phonetic segmentation.
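A sketch of how such a request can look from Python is given below. The endpoint and parameter names (SIGNAL, TEXT, LANGUAGE, PIPE, OUTFORMAT) are assumptions based on the public BAS WebServices help pages (see Note 1) and should be checked against the current API documentation; the file names are placeholders.

# Minimal sketch of a BAS WebServices pipeline request (G2P -> MAUS -> PHO2SYL).
# Endpoint URL and parameter names are assumptions taken from the public
# BAS WebServices help pages; verify against the current documentation.
import requests

URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPipeline"

with open("s01.wav", "rb") as wav, open("s01.txt", "rb") as txt:
    response = requests.post(
        URL,
        files={"SIGNAL": wav, "TEXT": txt},
        data={
            "LANGUAGE": "deu-DE",          # German
            "PIPE": "G2P_MAUS_PHO2SYL",    # pipeline without ASR
            "OUTFORMAT": "TextGrid",       # Praat TextGrid output
        },
        timeout=120,
    )
response.raise_for_status()
print(response.text)  # response is expected to contain a download link for the result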
The resulting .textgrid file has five annotation layers; see Figure 7. The ORT-MAU layer contains the sentence in its orthographic form, segmented and tokenized into single words. Non-vocalized segments and pauses are denoted as <p:>. The KAN-MAU layer is the phonemic, i.e., canonical, transcription of the same chain, with the KAS-MAU layer additionally showing syllable boundaries, denoted as a dot. In contrast, the bottom two layers MAU and MAS contain the phonetic transcript, which deviates from standard, canonical transcription and mirrors what the speakers actually said, e.g., fYmf instead of fYnf. The MAU layer contains the individual phoneme segmentation, while the MAS layer contains the syllabified chain. The Munich Automatic Segmentation System (MAUS) algorithm relies on Viterbi-alignment, which is the process of aligning speech features to the most probable sequence of states, using a set of continuous Hidden Markov Models (HMMs), which take acoustic probabilities into account [68]. Due to this probabilistic framework, some boundaries and annotations may have to be manually adjusted if the deviations from the norm are not automatically detected.
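For working with these annotation layers, the tiers can be read programmatically. The sketch below uses the third-party Python package textgrid as one possible parser; this is an assumption, not the tool used by the authors, and any TextGrid reader (or Praat itself) can be used instead.

# Minimal sketch: read the word (ORT-MAU) and phone (MAU) tiers of one annotated
# sentence. Uses the third-party "textgrid" package as one possible parser; this
# is an assumption, not the authors' tool. The file name is a placeholder.
import textgrid

tg = textgrid.TextGrid.fromFile("s01.TextGrid")
for tier in tg.tiers:
    if tier.name in ("ORT-MAU", "MAU"):
        print(f"--- {tier.name} ---")
        for interval in tier:
            if interval.mark and interval.mark != "<p:>":  # skip pauses
                print(f"{interval.minTime:.3f} {interval.maxTime:.3f} {interval.mark}")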

4.1. Linear Mapping

The audio recordings are provided in their raw format and have not been normalized. This preserves the information on absolute and relative level differences between individual speakers and conditions. To reflect actual, absolute level values, which is important for the computation of psychoacoustic measures and to determine the range of the Lombard effect, linear mapping has to be applied to the wave files after reading, corresponding to the actual SPL measured. The average measured SPL for the nomask_sil condition of VP02 was 72.6 dB(A). This value includes pauses between sentences. The measured SPL for single sentences without pauses was 79 dB(A) on average. This latter value can be used as a reference to calculate the mapping factor when reading the provided single-sentence audio files. Given that the microphone gain levels were adjusted for every speaker, the gain factors provided in Table 3 have to be considered as well to accurately represent the resulting SPL. They therefore have to be multiplied by the derived linear mapping factors. The relative level differences between sentences and conditions remain intact.
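A minimal sketch of this calibration is given below, assuming the single-sentence files of VP02’s nomask_sil condition as reference. As a simplification, an unweighted RMS level is used although the 79 dB(A) reference in the text is A-weighted; the file names and the gain factor value are placeholders for the actual files and the corresponding entry in Table 3.

# Minimal sketch of the linear mapping described above: scale raw wave samples so
# that the reference condition (VP02, nomask_sil, single sentences) matches the
# measured 79 dB(A). Simplification: unweighted RMS instead of A-weighting.
# File names and the gain factor are placeholders.
import numpy as np
import soundfile as sf

P0 = 20e-6         # reference sound pressure in Pa
REF_SPL = 79.0     # average measured SPL of the reference sentences, dB(A)

ref, _ = sf.read("VP02_nomask_sil_s01.wav")         # hypothetical reference file
rms_digital = np.sqrt(np.mean(ref**2))
target_rms_pa = P0 * 10 ** (REF_SPL / 20)
mapping_factor = target_rms_pa / rms_digital        # digital units -> Pa

# For other speakers/conditions, the relative microphone gain factor from Table 3
# has to be multiplied in as well (placeholder value below).
gain_factor = 1.0
signal, _ = sf.read("VP05_mask_noise_s12.wav")      # hypothetical file name
signal_pa = signal * mapping_factor * gain_factor   # calibrated pressure signal
print(f"Calibrated level: {20 * np.log10(np.sqrt(np.mean(signal_pa**2)) / P0):.1f} dB")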

4.2. Corrections and Errors

The documentation process revealed that the test material used in the recording sessions included the verb ‘erzählt’ (‘told’) and the noun ‘Fische’ (‘fish’) on two occasions. The verb in sentence s40 has therefore been corrected to ‘erwähnt’ (‘mentioned’) and the object in sentence s29 to ‘Tische’ (‘tables’) in the provided BELMASK matrix sentences, see Appendix A. The video and audio recordings however contain the original doubles. The phonemic distribution in Figure 2 takes these corrections into account. Slips of the tongue during recordings are summarized in Table 4. Notably, these errors were almost exclusively made during the noisy conditions and consist mostly of misplacements (speakers uttered words they remembered from previous sentences) or arithmetic errors.

4.3. File Structure

The following file structure tree demonstrates how the provided files were organized, exemplified for speaker VP01. The complete dataset requires 18 GB storage capacity.
[File structure tree for speaker VP01 not reproduced here.]

5. Conclusions

In this article, we present the Berlin Dataset of Lombard and Masked Speech (BELMASK), a comprehensive dataset of speech produced in adverse conditions. The dataset contains audiovisual recordings of ten German native speakers uttering matrix sentences with and without an FFP2 face mask in a quiet setting and while being subjected to noise over headphones, triggering the Lombard effect. The article outlines the test material development, as well as the data collection and post-processing.
In contrast to previous datasets, the main advantage of the BELMASK dataset is that it contains recordings of phonetically controlled and standardized test material, which has been optimized in terms of lexical frequency and predictability. To the authors’ best knowledge, it is also the first audiovisual dataset of face-masked and Lombard speech for the German language. The speech tasks were neither guided nor read, but instead cued, which allows for fairly natural recordings while maintaining the controlled and high-quality setting of a laboratory environment. Given the multimodal nature of the dataset, it is furthermore possible to explore the role of visual versus auditory cues and potential interactions. Lastly, through the provision of linear mapping and gain factors, both absolute and relative information about level differences has been preserved across various conditions and speakers. These considerations are important when administering cognitive tasks and deriving psychoacoustic metrics. A limitation of the BELMASK dataset is the relatively small sample of speakers, which may restrict its applicability for certain classification tasks that require large datasets. However, the dataset has been fully annotated and several data formats are provided, which allows for its out-of-the-box use.
The dataset aims to facilitate research in the field of auditory cognition by contributing to a deeper understanding of how cognitive processes are affected by adverse speaking conditions. It furthermore enables the training and evaluation of speech processing models under realistic and varied conditions, provides data to enhance the robustness of ASR systems, improve speaker identification and verification accuracy, and refine assistive technologies for the hearing impaired. Additionally, it can be used as a resource to broaden audio–visual-dependent research, e.g., in the field of computer vision and simulation or biometric authentication. By providing this dataset, we aim to extend auditory cognitive research and support the development of more resilient speech-processing technologies that can adapt to the ongoing and future needs of masked communication in diverse settings.

Author Contributions

Conceptualization, C.C.M.; methodology, C.C.M.; formal analysis, C.C.M.; investigation, C.C.M.; resources, C.C.M., A.F. and E.S.; data collection and curation, C.C.M. and F.R.; writing—original draft preparation, C.C.M.; writing—review and editing, C.C.M. and F.R.; visualization, C.C.M.; supervision, A.F. and E.S.; project administration, C.C.M. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge support by the Open Access Publication Fund of TU Berlin.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Department of Psychology and Ergonomics of the TU Berlin (2310186/30 April 2024) as part of the research proposal “ADVAUD—Investigating the impact of adversely produced speech on auditory memory in diverse contexts”.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the participants to publish this paper.

Data Availability Statement

The collected dataset and related test materials are available over Zenodo upon request after agreeing to and signing an End User License Agreement (EULA).

Acknowledgments

The authors express their thanks to all speakers who volunteered to participate in the data collection process.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR	Automatic Speech/Speaker Recognition
BAS	Bavarian Archive for Speech Signals
BELMASK	Berlin Dataset of Lombard and Masked Speech
BERT	Bidirectional Encoder Representations from Transformers
CI	Cochlear Implant
ComParE	Computational Paralinguistics Challenge
COVID-19	Coronavirus Disease 2019
DWDS	Digitales Wörterbuch der deutschen Sprache
EULA	End User License Agreement
FFP2	Filtering Facepiece P2
HMM	Hidden Markov Model
LLM	Large Language Model
MASC	Mask Augsburg Speech Corpus
MCD	Mobile Communication Device
MLM	Masked Language Model
MSC	Mask Sub-Challenge
OLSA	Oldenburg sentence test
PLL	Pseudo Log Likelihood
SD	Standard Deviation
SNR	Signal-to-Noise Ratio
SPL	Sound Pressure Level
SRT	Speech Reception Threshold

Appendix A

Table A1. The Berlin Dataset of Lombard and Masked Speech (BELMASK) matrix sentences.
Nr.  Sentence  |  Nr.  Sentence
s01  Coco zählt dreizehn grüne Wände.  |  s49  Die Hofbesitzerin schnippelt acht exotische Kiwis.
s02  Liane trägt drei bunte Tücher.  |  s50  Die Cousine verkostet neun krosse Zwiebeln.
s03  Oma beobachtet zwei spielende Kinder.  |  s51  Peter verputzt zwei volle Schüsseln.
s04  Ludwig tritt fünf kaputte Dosen.  |  s52  Rudolf trifft acht freudige Gruppen.
s05  Die Familie betrachtet sieben nasse Tauben.  |  s53  Beate bewundert sechs reife Pflaumen.
s06  Lisa überwindet sechs krachende Wellen.  |  s54  Der Schuster bearbeitet zehn braune Holzbretter.
s07  Sarah verkauft acht rote Smoothies.  |  s55  Joseph pflückt drei gute Papayas.
s08  Timo besitzt neun rosa Fahnen.  |  s56  Ruth kaut zwei zähe Stangen.
s09  Die Dame hält zehn spitze Steine.  |  s57  Alex erschreckt vier schlafende Ziegen.
s10  Udo sieht elf gelbe Muscheln.  |  s58  Der Geologe trimmt fünf dornige Büsche.
s11  Elisa registriert zwölf dunkle Biere.  |  s59  Emilia fährt sieben klapprige Taxis.
s12  Verena findet zwei plumpe Gänse.  |  s60  Andreas schlemmt dreizehn fettige Bretzeln.
s13  Opa hört drei sanfte Bässe.  |  s61  Adam feuert zwölf imaginäre Schüsse.
s14  Vater genießt vier kalte Erdbeeren.  |  s62  Georg isst drei knackige Rüben.
s15  Leon produziert fünf schrille Töne.  |  s63  Gisela schneidet fünf trockene Tulpen.
s16  Susi bekommt sechs prima Forellen.  |  s64  Ramona erntet dreizehn breite Zucchinis.
s17  Dieter besucht acht große Boote.  |  s65  Der Erzieher füttert sechs einsame Spatzen.
s18  Theodor prüft sieben salzige Käse.  |  s66  Hugo wäscht zwölf türkise Hemden.
s19  Ole studiert neun tolle Zeitungen.  |  s67  Der Hofjunge pflanzt vier seltene Kräuter.
s20  Tamara bestellt zehn süße Weine.  |  s68  Der Ladenbesitzer erhält zehn schimmelnde Datteln.
s21  Das Mädchen klaut elf glasierte Muffins.  |  s69  Der Lieferant hebt acht strahlende Segel.
s22  Biene nimmt zwölf delikate Puten.  |  s70  Gertrud entwirft sieben edle Gewänder.
s23  Egon bemerkt zwei schöne Fische.  |  s71  Margarete befestigt elf hölzerne Schilder.
s24  Ute gibt drei lustige Tipps.  |  s72  Doro frühstückt vier kleine Gurken.
s25  Der Badegast sammelt fünf weiße Perlen.  |  s73  Der Lehrling bäckt elf frische Pasteten.
s26  Maria bastelt vier schimmernde Ketten.  |  s74  Simon bringt zwei wichtige Seiten.
s27  Die Urlauberin holt sechs riesige Donuts.  |  s75  Helene mietet elf alte Fahrräder.
s28  Doris bezahlt acht sprudelnde Brausen.  |  s76  Der Apotheker verschreibt vier günstige Wickel.
s29  Gerald fängt dreizehn graue Tische.  |  s77  Mara präsentiert neun wertvolle Spiele.
s30  Olaf schwimmt neun flotte Runden.  |  s78  Das Schulkind lernt zehn komplexe Fächer.
s31  Der Kurgast reserviert zehn sonnige Liegen.  |  s79  Der Seemann verleiht sieben eigene Bücher.
s32  Renate entsorgt elf dreckige Teller.  |  s80  Der Musiker überzeugt drei kranke Touristen.
s33  Der Bote empfängt zwölf zerstörte Kartons.  |  s81  Lola singt sechs blöde Balladen.
s34  Ulrike betreut zwei blaue Delfine.  |  s82  Der Fotograf beleuchtet dreizehn antike Städte.
s35  Der Ehemann spendiert drei feurige Schnäpse.  |  s83  Lina radelt sechs hügelige Strecken.
s36  Rosa baut vier sandige Schlösser.  |  s84  Der Lehrer verabschiedet vier jodelnde Burschen.
s37  Lena wirft fünf japanische Vasen.  |  s85  Der Blumenhändler züchtet zehn feine Rosen.
s38  Samuel malt sieben grelle Pfaue.  |  s86  Der Kioskbesitzer überfliegt neun witzige Titel.
s39  Der Ureinwohner macht dreizehn abstrakte Bilder.  |  s87  Nena näht sieben pinke Jacken.
s40  Fine erwähnt acht schlaue Witze.  |  s88  Die Baronin verspielt dreizehn goldene Bänder.
s41  Der Bube ruft neun laute Sprüche.  |  s89  Martha ergattert acht dünne Hosen.
s42  Der Maler zeichnet zehn tiefe Gewässer.  |  s90  Der Poet erzählt elf spannende Sagen.
s43  Ina liefert neun teure Gläser.  |  s91  Uwe schleppt neun flauschige Katzen.
s44  Der Sohn riecht zwölf saure Pfirsiche.  |  s92  Die Diva verteilt zwei gewonnene Chips.
s45  Frau Huber ermahnt fünf tobende Punks.  |  s93  Thomas unterschreibt fünf zerkratzte Platten.
s46  Ariane erwirbt sieben duftende Gräser.  |  s94  Der Pfadfinder zeigt sechs fixe Schritte.
s47  Der Kumpane jagt drei wilde Schafe.  |  s95  Der Studi verinnerlicht zwölf schwere Gedichte.
s48  Die Ärztin verfolgt dreizehn blasse Ferkel.  |  s96  Lara transportiert zwölf köstliche Äpfel.

Notes

1. For all annotation and pipeline abbreviations consult the BAS Webmaus documentation available at: https://clarin.phonetik.uni-muenchen.de/BASWebServices/help/tutorial, accessed on 21 July 2024.
2. “Uploaded material is automatically deleted [from the BAS servers] after 24 h. Uploaded data are not forwarded to third parties, except in the case of the service ‘ASR’, which forwards user data to a third-party, commercial webservice provider”, see: https://clarin.phonetik.uni-muenchen.de/BASWebServices/help, accessed on 21 July 2024.

References

  1. Geng, P.; Lu, Q.; Guo, H.; Zeng, J. The effects of face mask on speech production and its implication for forensic speaker identification-A cross-linguistic study. PLoS ONE 2023, 18, e0283724. [Google Scholar] [CrossRef] [PubMed]
  2. Li, X.; Ni, K.; Huang, Y. Effect of Face Masks on Automatic Speech Recognition Accuracy for Mandarin. Appl. Sci. 2024, 14, 3273. [Google Scholar] [CrossRef]
  3. Ritchie, K.; Carragher, D.; Davis, J.; Read, K.; Jenkins, R.E.; Noyes, E.; Gray, K.L.H.; Hancock, P.J.B. Face masks and fake masks: The effect of real and superimposed masks on face matching with super-recognisers, typical observers, and algorithms. Cogn. Res. 2024, 9, 5. [Google Scholar] [CrossRef] [PubMed]
  4. Badh, G.; Knowles, T. Acoustic and perceptual impact of face masks on speech: A scoping review. PLoS ONE 2023, 18, e0285009. [Google Scholar] [CrossRef] [PubMed]
  5. Pörschmann, C.; Lübeck, T.; Arend, J.M. Impact of face masks on voice radiation. J. Acoust. Soc. Am. 2020, 148, 3663–3670. [Google Scholar] [CrossRef] [PubMed]
  6. Bandaru, S.V.; Augustine, A.M.; Lepcha, A.; Sebastian, S.; Gowri, M.; Philip, A.; Mammen, M.D. The effects of N95 mask and face shield on speech perception among healthcare workers in the coronavirus disease 2019 pandemic scenario. J. Laryngol. Otol. 2020, 134, 895–898. [Google Scholar] [CrossRef] [PubMed]
  7. Bottalico, P.; Murgia, S.; Puglisi, G.E.; Astolfi, A.; Kirk, K.I. Effect of masks on speech intelligibility in auralized classrooms. J. Acoust. Soc. Am. 2020, 148, 2878–2884. [Google Scholar] [CrossRef] [PubMed]
  8. Brown, V.A.; Van Engen, K.J.; Peelle, J.E. Face mask type affects audiovisual speech intelligibility and subjective listening effort in young and older adults. Cogn. Res. Princ. Implic. 2021, 6, 49. [Google Scholar] [CrossRef] [PubMed]
  9. Smiljanic, R.; Keerstock, S.; Meemann, K.; Ransom, S.M. Face masks and speaking style affect audio-visual word recognition and memory of native and non-native speech. J. Acoust. Soc. Am. 2021, 149, 4013–4023. [Google Scholar] [CrossRef] [PubMed]
  10. Toscano, J.C.; Toscano, C.M. Effects of face masks on speech recognition in multi-talker babble noise. PLoS ONE 2021, 16, e0246842. [Google Scholar] [CrossRef]
  11. Mendel, L.L.; Gardino, J.A.; Atcherson, S.R. Speech Understanding Using Surgical Masks: A Problem in Health Care? J. Am. Acad. Audiol. 2008, 19, 686–695. [Google Scholar] [CrossRef] [PubMed]
  12. Magee, M.; Lewis, C.; Noffs, G.; Reece, H.; Chan, J.C.S.; Zaga, C.J.; Paynter, C.; Birchall, O.; Rojas Azocar, S.; Ediriweera, A.; et al. Effects of face masks on acoustic analysis and speech perception: Implications for peri-pandemic protocols. J. Acoust. Soc. Am. 2020, 148, 3562–3568. [Google Scholar] [CrossRef] [PubMed]
  13. Das, S.; Sarkar, S.; Das, A.; Das, S.; Chakraborty, P.; Sarkar, J. A comprehensive review of various categories of face masks resistant to COVID-19. Clin. Epidemiol. Glob. Health. 2021, 12, 100835. [Google Scholar] [CrossRef] [PubMed]
  14. Martarelli, M.; Montalto, L.; Chiariotti, P.; Simoni, S.; Castellini, P.; Battista, G.; Paone, N. Acoustic Attenuation of COVID-19 Face Masks: Correlation to Fibrous Material Porosity, Mask Breathability and Bacterial Filtration Efficiency. Acoustics 2022, 4, 123–138. [Google Scholar] [CrossRef]
  15. Atcherson, S.R.; Mendel, L.L.; Baltimore, W.J.; Patro, C.; Lee, S.; Pousson, M.; Spann, M.J. The Effect of Conventional and Transparent Surgical Masks on Speech Understanding in Individuals with and without Hearing Loss. J. Am. Acad. Audiol. 2017, 28, 58–67. [Google Scholar] [CrossRef]
  16. Sönnichsen, R.; Tó, G.L.; Hohmann, V.; Hochmuth, S.; Radeloff, A. Challenging Times for Cochlear Implant Users—Effect of Face Masks on Audiovisual Speech Understanding during the COVID-19 Pandemic. Trends Hear. 2022, 26, 23312165221134378. [Google Scholar] [CrossRef] [PubMed]
  17. Rahne, T.; Fröhlich, L.; Plontke, S.; Wagner, L. Influence of surgical and N95 face masks on speech perception and listening effort in noise. PLoS ONE 2021, 16, e0253874. [Google Scholar] [CrossRef] [PubMed]
  18. Giovanelli, E.; Valzolgher, C.; Gessa, E.; Todeschini, M.; Pavani, F. Unmasking the Difficulty of Listening to Talkers With Masks: Lessons from the COVID-19 pandemic. i-Perception 2021, 12, 204166952199839. [Google Scholar] [CrossRef]
  19. Pichora-Fuller, M.K.; Kramer, S.E.; Eckert, M.A.; Edwards, B.; Hornsby, B.W.Y.; Humes, L.E.; Lemke, U.; Lunner, T.; Matthen, M.; Mackersie, C.L.; et al. Hearing Impairment and Cognitive Energy: The Framework for Understanding Effortful Listening (FUEL). Ear Hear. 2016, 37, 5S. [Google Scholar] [CrossRef]
  20. Ribeiro, V.V.; Dassie-Leite, A.P.; Pereira, E.C.; Santos, A.D.N.; Martins, P.; Irineu, R.d.A. Effect of Wearing a Face Mask on Vocal Self-Perception during a Pandemic. J. Voice 2020, 37, 878. [Google Scholar] [CrossRef] [PubMed]
  21. Gama, R.; Castro, M.E.; van Lith-Bijl, J.T.; Desuter, G. Does the wearing of masks change voice and speech parameters? Eur. Arch. Oto-Rhino 2021, 279, 1701–1708. [Google Scholar] [CrossRef] [PubMed]
  22. McKenna, V.S.; Patel, T.H.; Kendall, C.L.; Howell, R.J.; Gustin, R.L. Voice Acoustics and Vocal Effort in Mask-Wearing Healthcare Professionals: A Comparison Pre- and Post-Workday. J. Voice 2021, 37, 802.e15–802.e23. [Google Scholar] [CrossRef] [PubMed]
  23. Gutz, S.E.; Rowe, H.P.; Tilton-Bolowsky, V.E.; Green, J.R. Speaking with a KN95 face mask: A within-subjects study on speaker adaptation and strategies to improve intelligibility. Cogn. Res. Princ. Implic. 2022, 7, 73. [Google Scholar] [CrossRef] [PubMed]
  24. Lombard, E. Le signe de l’élévation de la voix [The sign of raising the voice]. Ann. Mal. Oreille Larynx Nez Pharynx 1911, 37, 101–119. [Google Scholar]
  25. Bottalico, P.; Passione, I.I.; Graetzer, S.; Hunter, E.J. Evaluation of the starting point of the Lombard Effect. Acta Acust. United Acust. 2017, 103, 169–172. [Google Scholar] [CrossRef] [PubMed]
  26. Hampton, T.; Crunkhorn, R.; Lowe, N.; Bhat, J.; Hogg, E.; Afifi, W.; De, S.; Street, I.; Sharma, R.; Krishnan, M.; et al. The negative impact of wearing personal protective equipment on communication during coronavirus disease 2019. J. Laryngol. Otol. 2020, 134, 577–581. [Google Scholar] [CrossRef] [PubMed]
  27. Cohn, M.; Pycha, A.; Zellou, G. Intelligibility of face-masked speech depends on speaking style: Comparing casual, clear, and emotional speech. Cognition 2021, 210, 104570. [Google Scholar] [CrossRef]
  28. Karagkouni, O. The Effects of the Use of Protective Face Mask on the Voice and Its Relation to Self-Perceived Voice Changes. J. Voice 2021, 37, 802.e1–802.e14. [Google Scholar] [CrossRef] [PubMed]
  29. Schiller, I.S.; Aspöck, L.; Schlittmeier, S.J. The impact of a speaker’s voice quality on auditory perception and cognition: A behavioral and subjective approach. Front. Psychol. 2023, 14, 1243249. [Google Scholar] [CrossRef] [PubMed]
  30. Moshona, C.; Fiebig, A. Effects of face-masked speech on short-term memory. In Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, Turin, Italy, 11–15 September 2023; pp. 4697–4704. [Google Scholar] [CrossRef]
  31. Truong, T.L.; Beck, S.D.; Weber, A. The impact of face masks on the recall of spoken sentences. J. Acoust. Soc. Am. 2021, 149, 142–144. [Google Scholar] [CrossRef] [PubMed]
  32. Truong, T.L.; Weber, A. Intelligibility and recall of sentences spoken by adult and child talkers wearing face masks. J. Acoust. Soc. Am. 2021, 150, 1674–1681. [Google Scholar] [CrossRef] [PubMed]
  33. Son, L.K.; Schwartz, B.L. The relation between metacognitive monitoring and control. In Applied Metacognition; Perfect, T.J., Schwartz, B.L., Eds.; Cambridge University Press: Cambridge, UK, 2002; pp. 15–38. [Google Scholar]
  34. Carbon, C.C. Wearing Face Masks Strongly Confuses Counterparts in Reading Emotions. Front. Psychol. 2020, 11, 566886. [Google Scholar] [CrossRef]
  35. Fitousi, D.; Rotschild, N.; Pnini, C.; Azizi, O. Understanding the Impact of Face Masks on the Processing of Facial Identity, Emotion, Age, and Gender. Front. Psychol. 2021, 12, 743793. [Google Scholar] [CrossRef] [PubMed]
  36. Vitevitch, M.S.; Chan, K.Y.; Goldstein, R. Using English as a ‘Model Language’ to Understand Language Processing. In Motor Speech Disorders: A Cross-Language Perspective; Miller, N., Lowit, A., Eds.; Multilingual Matters: Bristol, UK, 2014; pp. 58–73. [Google Scholar] [CrossRef]
  37. Blasi, D.E.; Henrich, J.; Adamou, E.; Kemmerer, D.; Majid, A. Over-reliance on English hinders cognitive science. Trends Cogn. Sci. 2022, 26, 1153–1170. [Google Scholar] [CrossRef] [PubMed]
  38. Mohamed, M.M.; Nessiem, M.A.; Batliner, A.; Bergler, C.; Hantke, S.; Schmitt, M.; Baird, A.; Mallol-Ragolta, A.; Karas, V.; Amiriparian, S.; et al. Face mask recognition from audio: The MASC database and an overview on the mask challenge. Pattern Recognit. 2022, 122, 108361. [Google Scholar] [CrossRef] [PubMed]
  39. Schuller, B.W.; Batliner, A.; Bergler, C.; Messner, E.M.; Hamilton, A.; Amiriparian, S.; Baird, A.; Rizos, G.; Schmitt, M.; Stappen, L.; et al. The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing and Masks. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2042–2046. [Google Scholar] [CrossRef]
  40. Mallol-Ragolta, A.; Urbach, N.; Liu, S.; Batliner, A.; Schuller, B.W. The MASCFLICHT Corpus: Face Mask Type and Coverage Area Recognition from Speech. In Proceedings of the INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 2358–2362. [Google Scholar] [CrossRef]
  41. Awan, S.N.; Shaikh, M.A.; Awan, J.A.; Abdalla, I.; Lim, K.O.; Misono, S. Smartphone Recordings are Comparable to “Gold Standard” Recordings for Acoustic Measurements of Voice. J. Voice 2023, in press. [CrossRef]
  42. Maryn, Y.; Ysenbaert, F.; Zarowski, A.; Vanspauwen, R. Mobile Communication Devices, Ambient Noise, and Acoustic Voice Measures. J. Voice 2017, 31, 248.e11–248.e23. [Google Scholar] [CrossRef]
  43. Jannetts, S.; Schaeffler, F.; Beck, J.; Cowen, S. Assessing voice health using smartphones: Bias and random error of acoustic voice parameters captured by different smartphone types. Int. J. Lang. Commun. Disord. 2019, 54, 292–305. [Google Scholar] [CrossRef] [PubMed]
  44. Alghamdi, N.; Maddock, S.; Marxer, R.; Barker, J.; Brown, G.J. A corpus of audio-visual Lombard speech with frontal and profile views. J. Acoust. Soc. Am. 2018, 143, EL523–EL529. [Google Scholar] [CrossRef]
  45. Marcoux, K.; Ernestus, M. Acoustic characteristics of non-native Lombard speech in the DELNN corpus. J. Phon. 2024, 102, 101281. [Google Scholar] [CrossRef]
  46. Folk, L.; Schiel, F. The Lombard Effect in Spontaneous Dialog Speech. In Proceedings of the INTERSPEECH 2011, Florence, Italy, 27–31 August 2011; pp. 2701–2704. [Google Scholar]
  47. Sołoducha, M.; Raake, A.; Kettler, F.; Voigt, P. Lombard speech database for German language. In Proceedings of the 42nd Annual Conference on Acoustics—DAGA 2016, Aachen, Germany, 14–17 March 2016. [Google Scholar]
  48. Trujillo, J.; Özyürek, A.; Holler, J.; Drijvers, P. Speakers exhibit a multimodal Lombard effect in noise. Sci. Rep. 2021, 11, 16721. [Google Scholar] [CrossRef] [PubMed]
  49. Wagener, K.; Kühnel, V.; Kollmeier, B. Entwicklung und Evaluation eines Satztests für die deutsche Sprache I: Design des Oldenburger Satztests [Development and evaluation of a sentence test for the German language I: Design of the Oldenburg sentence test]. Z. Für Audiol. 1999, 38, 1–32. [Google Scholar]
  50. Poirier, M.; Saint-Aubin, J. Word frequency effects in immediate serial recall: Item familiarity and item co-occurence have the same effect. Memory 2005, 13, 325–332. [Google Scholar] [CrossRef] [PubMed]
  51. Hunter, C.R. Dual-task accuracy and response time index effects of spoken sentence predictability and cognitive load on listening effort. Trends Hear. 2021, 25, 1–15. [Google Scholar] [CrossRef] [PubMed]
  52. Roverud, E.; Bradlow, A.; Kidd, G.J. Examining the sentence superiority effect for sentences presented and reported in forwards or backwards order. Appl. Psycholinguist. 2020, 41, 381–400. [Google Scholar] [CrossRef] [PubMed]
  53. Kowialiewski, B.; Krasnoff, J.; Mizrak, E.; Oberauer, K. The semantic relatedness effect in serial recall: Deconfounding encoding and recall order. J. Mem. Lang. 2022, 127, 104377. [Google Scholar] [CrossRef]
  54. Baddeley, A.D.; Thomson, N.; Buchanan, M. Word length and the structure of short-term memory. J. Verb. Learn. Verb. Behav. 1975, 14, 575–589. [Google Scholar] [CrossRef]
  55. Best, K.H. Laut- und Phonemhäufigkeiten im Deutschen [Sound and phoneme frequencies in German]. Göttinger Beiträge Zur Sprachwiss. 2005, 10/11, 21–32. [Google Scholar]
  56. Schiel, F. BAStat: New statistical resources at the Bavarian Archive for speech signals. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, 17–23 May 2010. [Google Scholar]
  57. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  58. MDZ Digital Library Team (dbmdz) at the Bavarian State Library. Bert-Base-German-Dbmdz-Cased. Available online: https://huggingface.co/dbmdz/bert-base-german-cased (accessed on 21 July 2024).
  59. Schuster, M.; Nakajima, K. Japanese and Korean voice search. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5149–5152. [Google Scholar] [CrossRef]
  60. Salazar, J.; Liang, D.; Nguyen, T.Q.; Kirchhoff, K. Masked Language Model Scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar] [CrossRef]
  61. Kauf, C.; Ivanova, A. A Better Way to Do Masked Language Model Scoring. arXiv 2023, arXiv:2305.10588. [Google Scholar]
  62. Misra, K. minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models. arXiv 2022, arXiv:2203.13112. [Google Scholar]
  63. Moshona, C.; Hofmann, J.; Fiebig, A.; Sarradj, E. Bestimmung des Übertragungsverlustes von Atemschutzmasken mittels eines 3D-Kopfmodells unter Berücksichtigung des Ansatzrohres [Determination of the transmission loss of respiratory masks using a 3D head model considering the vocal tract]. In Proceedings of the 49th Annual Conference on Acoustics—DAGA 2023, Hamburg, Germany, 6–9 March 2023; pp. 178–181. [Google Scholar]
  64. Mooshammer, C. Korpus Gelesener Geschlechtergerechter Sprache (KGGS) [Corpus of Read Gender-Inclusive Language (KGGS)]. 2020. Available online: https://rs.cms.hu-berlin.de/phon (accessed on 21 July 2024).
  65. Nakamura, M.; Iwano, K.; Furui, S. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance. Comput. Speech Lang. 2008, 22, 171–184. [Google Scholar] [CrossRef]
  66. Boersma, P.; Weenink, D. Praat, a system for doing phonetics by computer. Glot Int. 2001, 5, 341–345. [Google Scholar]
  67. Kisler, T.; Reichel, U.; Schiel, F. Multilingual processing of speech via web services. Comput. Speech Lang. 2017, 45, 326–347. [Google Scholar] [CrossRef]
  68. Schiel, F. A Statistical Model for Predicting Pronunciation. In Proceedings of the 18th International Congress of Phonetic Sciences, ICPhS 2015, Glasgow, UK, 10–14 August 2015; p. 195. [Google Scholar]
Figure 1. Lexical frequency of the unique words (n = 399) contained in the BELMASK matrix sentences, based on the 7-point logarithmic frequency scale (0: rare–6: frequent) of the German digital dictionary “Digitales Wörterbuch der deutschen Sprache” (DWDS) (https://www.dwds.de/d/api), accessed on 21 July 2024. Note: frequencies are accumulated over inflected and uninflected word forms.
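The frequency classes shown in Figure 1 can also be retrieved programmatically. The following minimal Python sketch queries the public DWDS API for a single word; the endpoint path and the JSON field name are assumptions based on the API documentation linked above and should be verified there before use.

# Hypothetical lookup of the DWDS 0-6 log-frequency class of a word.
# Endpoint path and JSON field name are assumed from https://www.dwds.de/d/api.
import requests

def dwds_frequency_class(word: str) -> int:
    response = requests.get("https://www.dwds.de/api/frequency/",
                            params={"q": word}, timeout=10)
    response.raise_for_status()
    return response.json()["frequency"]  # assumed field: class 0 (rare) to 6 (frequent)

if __name__ == "__main__":
    print(dwds_frequency_class("Rakete"))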
Figure 2. Phonemic distribution of the BELMASK matrix sentences (blue) and the Oldenburg sentence test (green), digitized and extracted from [49], compared to the average phoneme distribution for written German (red), as reported in [55] (see Table “100.000 sound count”), and for conversational German (yellow), based on the extended phone monogram statistics of the Bavarian Archive for Speech Signals (https://www.bas.uni-muenchen.de/forschung/Bas/BasPHONSTATeng.html) [56] for the Verbmobil 1+2, SmartKom and RVG1 databases, accessed on 21 July 2024.
Figure 3. Relationship between pseudo log likelihood (PLL) scores of the BELMASK matrix sentences and sentence length (number of tokens), including correlation analysis (Pearson’s r = −0.77). The shaded area around the regression line corresponds to the 95% confidence interval. Each dot represents a sentence. The red dot represents the highly predictable reference sentence “The rocket flies into space”, which is not contained in the BELMASK set.
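As a pointer for readers who wish to reproduce scores of the kind plotted in Figures 3 and 4, the sketch below computes a sentence-level PLL with the minicons scorer [62] and the dbmdz German BERT model [58]. The summed-log-probability reduction and the example sentence are illustrative assumptions, not the verified configuration used for the figures.

# Minimal sketch: sentence-level pseudo log likelihood (PLL) with minicons [62]
# and the dbmdz German BERT model [58]. The reduction choice is an assumption.
from minicons import scorer

mlm = scorer.MaskedLMScorer("dbmdz/bert-base-german-cased", "cpu")

# Corrected form of sentence s45 (cf. Table 4), used here only as an example.
sentences = ["Frau Huber ermahnt fünf tobende Punks."]

# Summing the token log probabilities yields the sentence PLL; newer minicons
# releases also expose a PLL_metric argument implementing the correction of [61].
pll = mlm.sequence_score(sentences, reduction=lambda x: x.sum(0).item())
print(pll)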
Figure 4. Relationship between pseudo log likelihood (PLL) scores of the BELMASK matrix words and DWDS word log frequency, including correlation analysis (Pearson’s r = 0.56). The shaded area around the regression line corresponds to the 95% confidence interval. Each dot represents a unique word.
Figure 5. Frequency response of the face mask used during the recordings, measured reciprocally using a 3D-printed head [63] and subsequently smoothed in 1/12 octave bands.
Figure 6. Experimental setup of the recording sessions. The screen displaying the keywords inside the speaker booth is not depicted.
Figure 7. Example of annotation layers in the Praat TextGrid object as a result of the G2P → MAUS → PHO2SYL pipeline.
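A TextGrid such as the one shown in Figure 7 can be inspected with any Praat-compatible reader. The sketch below uses the third-party Python package textgrid; the package choice and the file name are assumptions made purely for illustration.

# Sketch: listing the annotation tiers (ORT-MAU, KAN-MAU, KAS-MAU, MAU, MAS)
# of one BELMASK TextGrid. The file name is a hypothetical placeholder.
import textgrid

tg = textgrid.TextGrid.fromFile("VP01_nomask_sil.TextGrid")
for tier in tg:
    print(tier.name, len(tier))

# Print the phonetic segments (X-SAMPA) of the MAU tier with their time stamps.
for interval in tg.getFirst("MAU"):
    if interval.mark:
        print(f"{interval.minTime:.3f}-{interval.maxTime:.3f} s  {interval.mark}")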
Table 1. Summary of the BELMASK dataset.
Name: Berlin Dataset of Lombard and Masked Speech
Abbreviation: BELMASK
Version: 1.0
License: End User License Agreement (EULA)
Speech material (total): 3840 matrix sentences of ∼2 s each
Speech material (per speaker): 96 matrix sentences of ∼2 s each in 4 conditions
Speech conditions: nomask_sil, mask_sil, nomask_noise, mask_noise
Total duration: 128 min (∼12.8 min per speaker)
Register: Cued, uninstructed speech
Speakers: 10 (4 female, 6 male)
Language: German
Modality: Audio, video
Annotation mode: Automated (webMAUS; pipeline: G2P → MAUS → PHO2SYL)
Annotation layers: Tokenized word segmentation based on MAUS (ORT-MAU);
  canonical pronunciation encoded in X-SAMPA (KAN-MAU);
  phonological syllable segmentation based on G2P (KAS-MAU);
  phonetic segmentation in X-SAMPA (MAU);
  phonetic syllable segmentation based on MAUS (MAS)
Available data formats: .mkv, .wav, .txt, .TextGrid
Audio bit depth/sampling rate: 16 bit, 48 kHz
Video compression codec: H.264
Size: 18 GB (compressed: 4.9 GB)
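When working with the data, the audio format stated above can be checked quickly; the following sketch uses the soundfile package, again with a hypothetical file name.

# Sketch: verifying that a recording matches the stated format (16 bit PCM, 48 kHz).
import soundfile as sf

info = sf.info("VP01_nomask_sil.wav")  # hypothetical file name
assert info.samplerate == 48000
assert info.subtype == "PCM_16"
print(f"{info.duration:.1f} s, {info.channels} channel(s), {info.format}")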
Table 2. Sociodemographic data (L1 = first language, P1/P2 = parents’ country/state of origin).
ID     Gender  Age  L1  P1     P2
VP01   m       38   DE  DE-BE  DE-BE
VP02   f       38   DE  GR     GR
VP03   f       36   DE  DE-BE  DE-BE
VP04   m       34   DE  DE-BB  DE-SN
VP05   m       24   DE  CN     CN
VP06   m       27   DE  DE-NW  DE-HE
VP07   f       26   DE  DE-BY  DE-BY
VP08   m       33   DE  DE-ST  DE-ST
VP09   f       21   DE  RU     DE-NW
VP10   m       25   DE  DE-HE  DE-BW
Table 3. The recording settings and condition orders for individual speakers. L_x is the gain level in dB and x the corresponding linear factor, with L_x = 0 dB and x_0 = 1 as reference values; the relation between L_x and x is sketched below the table.
VP     L_x [dB]  Factor x  Condition order
VP01   −6 dB     0.501     nm_s, m_s, m_n, nm_n
VP02    0 dB     1.000     m_n, nm_n, m_s, nm_s
VP03   −3 dB     0.708     m_s, nm_n, nm_s, m_n
VP04   −1 dB     1.122     nm_n, m_n, m_s, nm_s
VP05   −8 dB     0.398     m_n, nm_s, nm_n, m_s
VP06   −8 dB     0.398     nm_s, m_s, m_n, nm_n
VP07   −3 dB     0.708     m_s, nm_s, nm_n, m_n
VP08    0 dB     1.000     nm_n, m_n, nm_s, m_s
VP09    0 dB     1.000     nm_s, m_s, m_n, nm_n
VP10   −1 dB     1.122     m_n, nm_s, nm_n, m_s
nm_s = no mask/silence, m_s = mask/silence, m_n = mask/noise, nm_n = no mask/noise.
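For clarity, the linear factor corresponding to a gain level L_x in dB follows, under the usual amplitude convention assumed here, the relation

x = 10^{L_x / 20}, e.g., L_x = −6 dB ⇒ x = 10^{−6/20} ≈ 0.501 (cf. the VP01 row).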
Table 4. Utterance errors contained in the recorded audio and video files.
VP     Condition       Sentence
VP03   nomask_noise    s45: Frau Huber ermahnt neun (richtig: fünf) tobende Punks.
                       (Mrs. Huber warns nine [correct: five] raging punks.)
VP03   nomask_sil      s58: Der Geologe trimmt vier (richtig: fünf) dornige Büsche.
                       (The geologist trims four [correct: five] thorny bushes.)
VP04   mask_noise      s70: Gertrud entwirft sieben strahlende (richtig: edle) Gewänder.
                       (Gertrud designs seven radiant [correct: noble] robes.)
VP06   nomask_noise    s89: Martha ergattert acht dünne Rosen (richtig: Hosen).
                       (Martha snags eight thin roses [correct: pants].)
VP08   mask_noise      s05: Die Familie beobachtet (richtig: betrachtet) sieben nasse Tauben.
                       (The family observes [correct: views] seven wet pigeons.)
