1 Introduction

English is one of the most widely spoken languages in the world, and many people learn English as a second language (ESL). In addition to conventional in-class English learning, computer-aided language learning (CALL) is becoming increasingly important. ESL learning comprises four skills, namely listening, reading, speaking, and writing. The current research focuses on English pronunciation assessment, which plays an important role in the speaking component of second language learning.

Several previous studies have addressed the problem of automatic pronunciation assessment using different features and grading approaches [4, 7, 11]. However, the majority of the reported studies require accurate native acoustic models trained on a large amount of data, reference speech from teachers, and are usually text-dependent (i.e., the text of the uttered speech is known). Such automatic assessments are only useful for shadowing-based pronunciation learning. In contrast, text-independent automatic pronunciation assessment that does not rely on a teacher's reference speech can expand the range of possible CALL applications.

2 Methods

2.1 Overview of Data Collection

The proposed method considers automatic pronunciation assessment without reference native speech and without knowledge of the text of the uttered speech. Because of the simplicity of collecting the speech data, a speech shadowing framework and shadowing materials were used in the current study.

To evaluate the effectiveness of the proposed method for automatic pronunciation assessment, speech data covering various materials and a large number of speakers were collected. Following the data collection, human raters were employed to annotate the collected speech samples. In this section, the data collection procedure and the data annotation are described.

Speaking Materials and Collected Speech Data. As speaking material, 3,388 shadowing sentences extracted from daily conversations were used. The materials were classified into five subsets reflecting English proficiency level. In the current study, the TOEIC listening and reading test score was used to define the difficulty of the materials.

The materials also included native reference speech samples; therefore, the speakers could listen to the native speech as a reference before producing the desired speech sample. This made it easier for speakers to produce difficult sentences and reduced the need for dictionaries. The speakers who participated in the data collection were Japanese students (45.53%), Japanese teachers of English (11.24%), and native English teachers (43.23%).

In total, 924 speakers produced speech samples for a subset of the shadowing materials. Details of the collected speech data are shown in Table 1.

Table 1. Details of collected speech samples.

Annotation by Manual Pronunciation Evaluation. A subset of the collected speech data, consisting of 96,993 speech samples, was evaluated by human raters using the four criteria shown in Table 2.

Table 2. Criteria of subjective evaluation.

Each speech sample was evaluated by two different native English raters using a 5-rank scale for each criterion. The two scores were averaged to give the final score. Table 3 shows the annotation results for the overall criterion.

Table 3. Subjective evaluation results in overall criterion.

2.2 Experiments and Results

The data set for the preliminary DNN experiments was created from a subset of the data shown in Table 3. For pronunciation assessment, a 3-level scale was used, namely below average (ranks 1 and 2), average (rank 3), and above average (ranks 4 and 5), obtained by merging the corresponding ranks. In the experiments reported in the current study, 935 speech samples per class were used for training the DNN [6]. Another 924 speech samples per class were used for the DNN evaluation.
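As an illustration, the merging of the averaged 5-rank scores into the three classes can be sketched as follows; the handling of non-integer averaged scores (e.g., 2.5) is our assumption, since the paper does not specify it:

```python
# Map an averaged rater score (1.0-5.0) to one of the three assessment
# classes: below average (ranks 1-2), average (rank 3), above average
# (ranks 4-5). Boundary handling for half-ranks is an assumption.
def merge_rank(avg_score: float) -> int:
    if avg_score < 3.0:
        return 0  # below average
    if avg_score == 3.0:
        return 1  # average
    return 2      # above average
```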

For the evaluation, the recall of each class, the unweighted average recall (UAR) (i.e., the mean of the per-class recalls), and the Pearson correlation coefficient were used as metrics. Mel-frequency cepstral coefficients (MFCCs) [8] concatenated with shifted delta cepstral (SDC) coefficients [1, 9] were extracted from the speech signal every 10 ms using a 20 ms time window. The MFCC and SDC features were used to construct the i-vectors [3] used for training and evaluation.
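A sketch of this feature extraction step is given below, assuming 16 kHz audio and a 13-1-3-7 SDC configuration; the SDC parameters are our assumption, as the paper only states that SDC coefficients were concatenated with the MFCCs:

```python
import numpy as np
import librosa

def mfcc_sdc(path, sr=16000, n_mfcc=13, d=1, p=3, k=7):
    """Extract MFCCs (20 ms window, 10 ms shift) with appended SDC features."""
    y, sr = librosa.load(path, sr=sr)
    # Frame settings from the paper: 20 ms window, 10 ms shift.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr),
                                hop_length=int(0.010 * sr)).T  # (frames, n_mfcc)
    frames = mfcc.shape[0]
    sdc = np.zeros((frames, n_mfcc * k))
    for t in range(frames):
        for i in range(k):
            # Delta at frame t + i*p: c(t+ip+d) - c(t+ip-d), clamped at edges.
            a = min(max(t + i * p - d, 0), frames - 1)
            b = min(max(t + i * p + d, 0), frames - 1)
            sdc[t, i * n_mfcc:(i + 1) * n_mfcc] = mfcc[b] - mfcc[a]
    return np.hstack([mfcc, sdc])  # (frames, n_mfcc * (1 + k))
```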

Gaussian mixture model (GMM) supervectors are widely used in speaker recognition. A GMM supervector is obtained by concatenating the means of an adapted GMM. The main disadvantage of supervectors is their high dimensionality, which imposes high computational and memory costs. To overcome these problems, i-vectors were introduced; they represent the whole utterance by a small number of factors that also explain the variability of speaker, language, emotion, and channel. In the current method, the i-vectors are used as features for pronunciation scoring. Following i-vector extraction, linear discriminant analysis (LDA) [5] was applied to further improve class discrimination.
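Since full i-vector extraction requires a trained universal background model and total variability matrix, the sketch below assumes the per-utterance i-vectors have already been extracted (e.g., with a toolkit such as Kaldi) and only illustrates the LDA step with scikit-learn; the i-vector dimension and the random stand-in data are ours:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Stand-ins for real per-utterance i-vectors (400-dim is an assumption)
# and their 3-level class labels; 2805 = 935 samples x 3 classes.
ivectors_train = rng.normal(size=(2805, 400))
labels_train = rng.integers(0, 3, size=2805)

lda = LinearDiscriminantAnalysis()
lda.fit(ivectors_train, labels_train)

# With 3 classes, LDA yields at most 2 discriminant directions, so the
# projected features used by the classifier are at most 2-dimensional.
train_feats = lda.transform(ivectors_train)
```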

The classification experiments were based on DNNs. A DNN is a feed-forward neural network with multiple hidden layers. The main advantage of DNNs over shallow networks is their better feature expression and their ability to perform complex mappings. In the current study, four hidden layers with 64 units each and the ReLU activation function were used. On top, a fully connected softmax layer was added. The batch size was set to 512, and 500 epochs were used.
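A minimal Keras sketch of this classifier is shown below; the input dimension, optimizer, and loss are our assumptions, while the layer sizes, activation, batch size, and number of epochs follow the description above:

```python
import tensorflow as tf

def build_dnn(input_dim: int, n_classes: int = 3) -> tf.keras.Model:
    # Four hidden layers of 64 ReLU units, softmax output over 3 classes.
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_dim,))])
    for _ in range(4):
        model.add(tf.keras.layers.Dense(64, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",  # optimizer is an assumption
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings from the paper: batch size 512, 500 epochs.
# model = build_dnn(input_dim=train_feats.shape[1])
# model.fit(train_feats, labels_train, batch_size=512, epochs=500)
```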

Table 4 shows the achieved results. As shown, significant improvements were obtained when LDA was used. With MFCC and SDC features combined with LDA, a UAR of 64.4% and a correlation of 0.48 were achieved. These results are comparable or even superior to those of similar state-of-the-art approaches [2, 10].

Table 4. Individual recalls for the three classes.
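For completeness, the evaluation metrics can be computed as follows (a minimal sketch; the array names are placeholders, and the Pearson correlation is taken between true and predicted class indices, which we assume matches the paper's usage):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import recall_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray):
    # Per-class recalls, as reported in Table 4.
    per_class = recall_score(y_true, y_pred, average=None)
    # UAR is the unweighted mean of the per-class recalls, i.e., the
    # macro-averaged recall.
    uar = recall_score(y_true, y_pred, average="macro")
    corr, _ = pearsonr(y_true, y_pred)
    return per_class, uar, corr
```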

3 Conclusions

In the current paper, a method for automatic pronunciation assessment of second language learners was presented. The method is based on DNNs, and the obtained results are very promising. A data collection involving a large number of speakers was also introduced. Evaluation of the method using a larger amount of non-native speech data is currently in progress.