1 Introduction

English is one of the most widely spoken languages in the world, and many people learn English as a second language (ESL). In addition to conventional in-class English learning, computer-aided language learning (CALL) is becoming increasingly important. ESL learning comprises four skills, namely listening, reading, speaking, and writing. The current research focuses on English pronunciation assessment, which plays an important role in the speaking component of second language learning.

Several previous studies have addressed the problem of automatic pronunciation assessment using different features and grading approaches [4, 7, 11]. However, the majority of the reported studies require accurate native acoustic models trained on a large amount of data, reference speech from teachers, and are usually text-dependent (i.e., the text of the uttered speech is known). Such automatic assessments are only useful for shadowing-based pronunciation learning. In contrast, text-independent automatic pronunciation assessment that does not rely on a teacher's reference speech can expand the range of possible CALL applications.

2 Methods

2.1 Overview of Data Collection

The proposed method considers automatic pronunciation assessment without reference native speech and without knowledge of the text of the uttered speech. Because of the simplicity of collecting the speech data, a speech shadowing framework and shadowing materials were used in the current study.

To evaluate the effectiveness of the proposed method for automatic pronunciation assessment, speech data covering various materials and a large number of speakers were collected. Following the data collection, human raters were employed to annotate the collected speech samples. In this section, the data collection procedure and the data annotation are described.

Speaking Materials and Collected Speech Data. As speaking material, 3,388 shadowing sentences extracted from daily conversations were used. The materials were classified into five subsets reflecting English proficiency level. In the current study, the TOEIC listening and reading test score was used to define the difficulty of the materials.

The materials also included native reference speech samples; therefore, the speakers could listen to the native speech as a reference before producing the desired speech sample. This made it easier for speakers to produce difficult sentences and reduced the need for dictionaries. The speakers who participated in the data collection were Japanese students (45.53%), Japanese teachers of English (11.24%), and native English teachers (43.23%).

In total, 924 speakers produced speech samples for a subset of the shadowing materials. Details of the collected speech data are shown in Table 1.

Table 1. Details of collected speech samples.

Annotation by Manual Pronunciation Evaluation. A subset of the collected speech data, consisting of 96,993 speech samples, was evaluated by human raters using the four criteria shown in Table 2.

Table 2. Criteria of subjective evaluation.

Each speech sample was evaluated by two different native English raters using a 5-rank scale for each criterion. The two scores were averaged to give the final score. Table 3 shows the annotation results for the overall criterion.

Table 3. Subjective evaluation results in overall criterion.

2.2 Experiments and Results

The data set for the preliminary DNN experiments was created from a subset of the data shown in Table 3. For pronunciation assessment, a 3-level scale was used, namely below average (ranks 1 and 2), average (rank 3), and above average (ranks 4 and 5), obtained by merging the corresponding ranks. In the experiments reported in the current study, 935 speech samples per class were used for training the DNN [6]. Another 924 speech samples per class were used for the DNN evaluation.
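As an illustration, the merging of the averaged 5-rank scores into the three classes can be sketched as follows; the handling of non-integer averaged scores (e.g., 2.5) is our assumption, since the paper does not specify it:

```python
# Map an averaged rater score (1.0-5.0) to one of the three assessment
# classes: below average (ranks 1-2), average (rank 3), above average
# (ranks 4-5). Boundary handling for half-ranks is an assumption.
def merge_rank(avg_score: float) -> int:
    if avg_score < 3.0:
        return 0  # below average
    if avg_score == 3.0:
        return 1  # average
    return 2      # above average
```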

For the evaluation, the recall of each class, the unweighted average recall (UAR) (i.e., the mean of the per-class recalls), and the Pearson correlation coefficient were used as metrics. Mel-frequency cepstral coefficients (MFCCs) [8] concatenated with shifted delta cepstral (SDC) coefficients [1, 9] were extracted from the speech signal every 10 ms using a 20 ms time window. The MFCC and SDC features were used to construct the i-vectors [3] used for training and evaluation.
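A sketch of this feature extraction step is given below, assuming 16 kHz audio and a 13-1-3-7 SDC configuration; the SDC parameters are our assumption, as the paper only states that SDC coefficients were concatenated with the MFCCs:

```python
import numpy as np
import librosa

def mfcc_sdc(path, sr=16000, n_mfcc=13, d=1, p=3, k=7):
    """Extract MFCCs (20 ms window, 10 ms shift) with appended SDC features."""
    y, sr = librosa.load(path, sr=sr)
    # Frame settings from the paper: 20 ms window, 10 ms shift.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr),
                                hop_length=int(0.010 * sr)).T  # (frames, n_mfcc)
    frames = mfcc.shape[0]
    sdc = np.zeros((frames, n_mfcc * k))
    for t in range(frames):
        for i in range(k):
            # Delta at frame t + i*p: c(t+ip+d) - c(t+ip-d), clamped at edges.
            a = min(max(t + i * p - d, 0), frames - 1)
            b = min(max(t + i * p + d, 0), frames - 1)
            sdc[t, i * n_mfcc:(i + 1) * n_mfcc] = mfcc[b] - mfcc[a]
    return np.hstack([mfcc, sdc])  # (frames, n_mfcc * (1 + k))
```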

Gaussian mixture model (GMM) supervectors are widely used in speaker recognition. A GMM supervector is obtained by concatenating the means of an adapted GMM. The main disadvantage of supervectors is their high dimensionality, which imposes high computational and memory costs. To overcome these problems, i-vectors were introduced; they represent the whole utterance by a small number of factors that also explain the variability of speaker, language, emotion, and channel. In the current method, the i-vectors are used as features for pronunciation scoring. Following i-vector extraction, linear discriminant analysis (LDA) [5] was applied to further improve class discrimination.
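Since full i-vector extraction requires a trained universal background model and total variability matrix, the sketch below assumes the per-utterance i-vectors have already been extracted (e.g., with a toolkit such as Kaldi) and only illustrates the LDA step with scikit-learn; the i-vector dimension and the random stand-in data are ours:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Stand-ins for real per-utterance i-vectors (400-dim is an assumption)
# and their 3-level class labels; 2805 = 935 samples x 3 classes.
ivectors_train = rng.normal(size=(2805, 400))
labels_train = rng.integers(0, 3, size=2805)

lda = LinearDiscriminantAnalysis()
lda.fit(ivectors_train, labels_train)

# With 3 classes, LDA yields at most 2 discriminant directions, so the
# projected features used by the classifier are at most 2-dimensional.
train_feats = lda.transform(ivectors_train)
```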

The classification experiments were based on DNNs. A DNN is a feed-forward neural network with multiple hidden layers. The main advantage of DNNs over shallow networks is their better feature expression and their ability to perform complex mappings. In the current study, four hidden layers with 64 units each and the ReLU activation function were used. On top, a fully connected softmax layer was added. The batch size was set to 512, and 500 epochs were used.
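A minimal Keras sketch of this classifier is shown below; the input dimension, optimizer, and loss are our assumptions, while the layer sizes, activation, batch size, and number of epochs follow the description above:

```python
import tensorflow as tf

def build_dnn(input_dim: int, n_classes: int = 3) -> tf.keras.Model:
    # Four hidden layers of 64 ReLU units, softmax output over 3 classes.
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_dim,))])
    for _ in range(4):
        model.add(tf.keras.layers.Dense(64, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",  # optimizer is an assumption
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings from the paper: batch size 512, 500 epochs.
# model = build_dnn(input_dim=train_feats.shape[1])
# model.fit(train_feats, labels_train, batch_size=512, epochs=500)
```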

Table 4 shows the achieved results. As shown, significant improvements were obtained when LDA was used. With MFCC and SDC features combined with LDA, a UAR of 64.4% and a correlation of 0.48 were achieved. These results are comparable or even superior to those of similar state-of-the-art approaches [2, 10].

Table 4. Individual recalls for the three classes.
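For completeness, the evaluation metrics can be computed as follows (a minimal sketch; the array names are placeholders, and the Pearson correlation is taken between true and predicted class indices, which we assume matches the paper's usage):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import recall_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray):
    # Per-class recalls, as reported in Table 4.
    per_class = recall_score(y_true, y_pred, average=None)
    # UAR is the unweighted mean of the per-class recalls, i.e., the
    # macro-averaged recall.
    uar = recall_score(y_true, y_pred, average="macro")
    corr, _ = pearsonr(y_true, y_pred)
    return per_class, uar, corr
```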

3 Conclusions

In the current paper, a method for automatic pronunciation assessment of second language learners was presented. The method is based on DNNs, and the obtained results are very promising. A data collection involving a large number of speakers was also introduced. Evaluation of the method using a larger amount of non-native speech data is currently in progress.