PLDA in the i-supervector space for text-independent speaker verification
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 29 (2014)
Abstract
In this paper, we advocate the use of the uncompressed form of the i-vector and rely on subspace modeling using probabilistic linear discriminant analysis (PLDA) to handle the speaker and session (or channel) variability. An i-vector is a low-dimensional vector containing both speaker and channel information acquired from a speech segment. When PLDA is used on an i-vector, dimension reduction is performed twice: first in the i-vector extraction process and second in the PLDA model. Keeping the full dimensionality of the i-vector in the i-supervector space for PLDA modeling and scoring avoids this unnecessary loss of information. We refer to the uncompressed i-vector as the i-supervector. The drawback of using the i-supervector with PLDA is the inversion of large matrices in the estimation of the full posterior distribution, which we show can be solved rather efficiently by partitioning large matrices into smaller blocks. We also introduce the Gaussianized rank-norm, as an alternative to whitening, for feature normalization prior to PLDA modeling. We found that the i-supervector performs better than the i-vector when no normalization is applied, and that a better performance is obtained by combining the i-supervector and i-vector at the score level. Furthermore, we analyze the computational complexity of the i-supervector system, compared with that of the i-vector, at four different stages: loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring.
1 Introduction
Recent research in text-independent speaker verification has focused on the problem of compensating for the mismatch between training and test speech segments. Such mismatch is, for the most part, due to variations induced by the transmission channel. There are two fundamental approaches to tackling this problem. The first operates at the front-end via the exploration of discriminative information in speech in the form of features (e.g., voice source, spectro-temporal, prosodic, high-level) [1]-[6]. The second relies on effective modeling of speaker characteristics in the classifier design (e.g., GMM-UBM, GMM-SVM, JFA, i-vector, PLDA) [4],[7]-[15]. In this paper, we focus on speaker modeling.
Over the past few years, many approaches based on the use of Gaussian mixture models (GMM) in a GMM universal background model (GMM-UBM) framework [7] have been proposed to improve the performance of speaker verification systems. The GMM-UBM is a generative model in which a speaker model is trained only on data from the same speaker. New criteria have since been developed that allow discriminative learning of generative models. The support vector machine (SVM) is acknowledged as one of the pre-eminent discriminative approaches [16]-[18], and it has been successfully combined with GMM, as in the GMM-SVM [8],[9],[19]-[21]. Nevertheless, approaches based on GMM-SVM are unable to cope well with channel effects [22],[23]. To compensate for the channel effects, it was shown using the joint factor analysis (JFA) technique that speaker and channel variability can be confined to two disjoint subspaces in the GMM parameter space [12],[24]. The word ‘joint’ refers to the fact that not only the speaker but also the channel variability is treated in a single JFA model. However, it has been reported that the channel space obtained by JFA does contain some residual speaker information [25].
Inspired by the JFA approach, it was shown in [13] that speaker and session variability can be represented by a single subspace referred to as the total variability space. The major motivation for defining such a subspace is to extract a low-dimensional identity vector (i.e., the so-called i-vector) from the feature sequence of a speech segment. The advantage of the i-vector is that it represents a speech segment as a fixed-length vector instead of a variable-length sequence of acoustic features. This greatly simplifies the modeling and scoring processes in speaker verification. For instance, we can assume that the i-vector is generated from a single Gaussian density [13] instead of the mixture of Gaussian densities usually assumed for acoustic features [7]. In this regard, linear discriminant analysis (LDA) [13],[26],[27], nuisance attribute projection (NAP) [8],[13],[28], within-class covariance normalization (WCCN) [13],[29],[30], probabilistic LDA (PLDA) [10],[31], and the heavy-tailed PLDA [32] have been shown to be effective for such fixed-length data. In this paper, we focus on PLDA with a Gaussian prior instead of a heavy-tailed prior. It was recently shown in [33] that the advantage of the heavy-tailed assumption diminishes with a simple length normalization of the i-vector before PLDA modeling.
Because the total variability matrix is always a low-rank rectangular matrix, a dimension reduction process is also imposed by the i-vector extractor [12]. In this study, we advocate the use of the uncompressed form of the i-vector. Similar to that in [13], our extractor converts a speech sequence into a fixed-length vector but retains its dimensionality in the full supervector space. Modeling of speaker and session variability is then carried out using PLDA, which has been shown to be effective in handling high-dimensional data. By doing so, we avoid reducing the dimensionality of the i-vector twice: first in the extraction process and second in the PLDA model. Any dimension reduction procedure will unavoidably discard information. Our intention is therefore to keep the full dimensionality until the scoring stage with PLDA and to investigate the performance of PLDA in the i-supervector space. We refer to the uncompressed form of the i-vector as the i-supervector, or identity supervector, following the nomenclature in [13],[29]. Similar to the i-vector, the i-supervector is computed as the posterior mean of a latent variable, but with a much higher dimensionality.
The downside of using the i-supervector with PLDA is that we have to deal with the inversion of large matrices. The size of the matrices becomes enormous when more sessions are available for each speaker in the development data^a. One option is to estimate the subspaces in a decoupled manner, which might lead to a suboptimal solution [12],[24]. In [34], we showed that the joint estimation of subspaces can be accomplished by partitioning large matrices into smaller blocks, thereby making the inversion and the joint estimation feasible. In this study, we present the same approach in more detail and with further refinement. We also look into various normalization methods and introduce the use of the Gaussianized rank-norm for PLDA. In the experiments, we compare the performance of the i-vector and i-supervector under no normalization and under various normalization conditions. A fusion system that combines the i-vector and i-supervector at the score level is presented as well. In addition, we provide an analysis of the computational complexity associated with the i-vector and i-supervector at four different stages: loading matrix estimation, i-vector and i-supervector extraction, PLDA model training, and verification score calculation.
The paper is organized as follows. In Section 2, we introduce the i-vector paradigm, including the formulation of the i-vector and the i-supervector and their relationship to classical maximum a posteriori (MAP) estimation. Section 3 introduces probabilistic LDA, where we show that the inversion of a large matrix in PLDA can be solved by exploiting the inherent structure of the precision matrix. Section 4 deals with PLDA scoring and introduces the Gaussianized rank-norm. We present experimental results in Section 5 and conclude the paper in Section 6.
2 I-vector paradigm
2.1 I-vector extraction
The purpose of i-vector extraction is to represent variable-length utterances with fixed-length, low-dimensional vectors. The fundamental assumption is that the feature vector sequence O = {o_1, o_2, …, o_T} was generated from a session-specific GMM. Furthermore, the mean supervector m, obtained by stacking the means of all mixtures, is constrained to lie in a low-dimensional subspace with origin ℳ as follows:

m = ℳ + Tx,     (1)
where m and ℳ are the mean supervectors of the speaker (and session)-dependent GMM and the UBM, respectively. The subspace spanned by the columns of the matrix T captures the speaker and session variability, and hence the name total variability [13]. The weighted combination of the columns of T, as determined by the latent variable x, gives rise to the mean supervector m with ℳ as an additive factor.
The i-vector extraction process was formulated as a MAP estimation problem in [13],[35]. Notice that (1) is concerned with the construction of the mean supervector m from the model parameters and the latent variable x. The variable x is unobserved (or latent), as is the supervector m. The optimal value of x is determined by the observed sequence O and is given by the mode (equivalent to the mean in the current case) of the posterior distribution of the latent variable x:

ϕ_x = arg max_x p(x | O).     (2)
The first point to note is that the latent variable is assumed to follow a standard normal prior. The parameters ℳ_c and Φ_c denote the mean vector and covariance matrix of the c-th mixture of the UBM, while N_c indicates the number of frames o_t aligned to each of the C mixtures. Also, we decompose the total variability matrix into its component matrices T_c, one associated with each Gaussian. Given an observation sequence O, its i-vector representation is given by (2), the solution [13] of which is

ϕ_x = cov(x, x) Σ_c T_c^T Φ_c^{-1} f_c,     (3)

where

cov(x, x) = (I + Σ_c N_c T_c^T Φ_c^{-1} T_c)^{-1}     (4)

is the posterior covariance, f_c = Σ_t γ_{c,t} o_t − N_c ℳ_c is the centralized first-order statistic [35] for the c-th Gaussian, and γ_{c,t} denotes the occupancy of the vector o_t in the c-th Gaussian. Since T is always a low-rank rectangular matrix, the dimension D of the i-vector is much smaller than that of the supervector, i.e., D ≪ C⋅F, where F is the dimensionality of the acoustic features.
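As a concrete illustration, the following numpy sketch computes an i-vector and its posterior covariance from the zero- and first-order statistics {N_c, f_c}, assuming the standard closed form in (3) and (4). The function and variable names are our own, and the per-mixture loop is written for clarity rather than speed.

```python
import numpy as np

def extract_ivector(T_blocks, Phi_blocks, N, f_blocks):
    """I-vector as the posterior mean of x, plus its posterior covariance.

    T_blocks   : list of C arrays of shape (F, D), the per-mixture rows T_c of T
    Phi_blocks : list of C arrays of shape (F, F), the UBM covariances Phi_c
    N          : length-C array of occupancy counts N_c
    f_blocks   : list of C arrays of shape (F,), centralized first-order stats f_c
    """
    D = T_blocks[0].shape[1]
    precision = np.eye(D)              # I + sum_c N_c T_c^T Phi_c^{-1} T_c, cf. (4)
    proj = np.zeros(D)                 # sum_c T_c^T Phi_c^{-1} f_c
    for T_c, Phi_c, N_c, f_c in zip(T_blocks, Phi_blocks, N, f_blocks):
        TtPinv = T_c.T @ np.linalg.inv(Phi_c)
        precision += N_c * (TtPinv @ T_c)
        proj += TtPinv @ f_c
    cov = np.linalg.inv(precision)     # posterior covariance
    phi_x = cov @ proj                 # posterior mean = the i-vector, cf. (3)
    return phi_x, cov
```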
2.2 I-supervector extraction
Consider the case where the latent variable is allowed to grow to the full supervector space, for which D = C⋅F. One straightforward approach to achieving this is to use a CF-by-CF full matrix for T in (1). However, the number of parameters would be enormous, causing difficulty in training. Another option is to impose a diagonal constraint on the loading matrix as follows:

m = ℳ + Dz,     (5)

where D is now a CF-by-CF diagonal matrix and the latent variable z has the same dimensionality as the mean supervector m. Similar to the variable x in (1), the variable z is unobserved. Given an observed sequence O, we estimate the mode of its posterior distribution as follows:

ϕ_z = arg max_z p(z | O).     (6)
Here, z_c is the sub-vector of z and D_c is the F-by-F sub-matrix corresponding to mixture c; such notation is necessary as the likelihood in (6) is computed over the acoustic vectors o_t. Following the procedure in [35], it can be shown that the solution to (6) is the CF-by-1 supervector

ϕ_z = cov(z, z) D^T Φ^{-1} f,     (7)

where cov(z, z) is a CF-by-CF diagonal matrix given by

cov(z, z) = (I + D^T Φ^{-1} N D)^{-1}.     (8)

In (7), f is the CF-by-1 supervector obtained by concatenating the f_c from all mixtures (see Figure 1). In (8), N is the CF-by-CF diagonal matrix whose diagonal blocks are N_c I, and Φ is a block diagonal matrix with Φ_c on its diagonal. Recall that N_c and f_c are the occupancy count and the centralized first-order statistics extracted using the UBM.
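Because D and N are diagonal and the UBM covariances are typically diagonal as well, the posterior in (7) and (8) reduces to element-wise operations. The sketch below illustrates this; it assumes diagonal Φ_c and uses our own variable names.

```python
import numpy as np

def extract_isupervector(d, phi, N_counts, f, F):
    """Posterior mean (i-supervector) and diagonal posterior covariance of z.

    d        : CF vector, diagonal of the loading matrix D
    phi      : CF vector, diagonal of the UBM covariances (diagonal Phi assumed)
    N_counts : C vector of occupancy counts N_c
    f        : CF vector of centralized first-order statistics
    F        : acoustic feature dimensionality, so CF = C * F
    """
    N = np.repeat(N_counts, F)           # diagonal of the CF x CF matrix N
    post_prec = 1.0 + d * (N / phi) * d  # diagonal of I + D^T Phi^{-1} N D, cf. (8)
    post_cov = 1.0 / post_prec           # diagonal posterior covariance
    phi_z = post_cov * (d / phi) * f     # posterior mean, cf. (7)
    return phi_z, post_cov
```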
We refer to ϕ_z as the i-supervector, analogous to the i-vector, since ϕ_z is computed as the posterior mean of a latent variable, as in i-vector extraction, but with a much higher dimensionality. It is worth noting that there exist some subtle differences between i-supervector extraction and the classical MAP estimation of a GMM [36]. In particular, the so-called relevance MAP widely used in the GMM-UBM [7] can be formulated in similar notation, with the mean supervector of the adapted GMM given by

m = ℳ + (N + τI)^{-1} f.     (9)

One could deduce (9) from (7) and (8) by setting D^T D = τ^{-1}Σ and using the result in (5). The parameter τ is referred to as the relevance factor, which is set empirically in the range between 8 and 16 [7]. This differs from (7), where the matrix D is trained from a dataset using the EM algorithm, in a manner similar to the matrix T for the i-vector. Secondly, the i-supervector is taken as the posterior mean of the latent variable z, which is absent in the relevance MAP formulation.
The i-supervector extractor can be implemented by adopting the diagonal modeling part of the JFA [12],[24] with a slight modification: the diagonal model D is trained on a per-utterance rather than a per-speaker basis in order to capture both speaker and session variability. Figure 2 summarizes the EM steps. The diag(·) operator sets the off-diagonal elements to zero; only the diagonal elements are computed in our implementation. Notice that the sufficient statistics {f, N} are session dependent; we omit the session index for simplicity.
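To make the training of D concrete, here is a rough sketch of one possible EM recipe: the E-step reuses the element-wise posterior above, and the M-step re-estimates each diagonal element of D from accumulated statistics. This is our simplified rendering under the assumption of diagonal UBM covariances, not necessarily the exact recipe of Figure 2.

```python
import numpy as np

def em_train_D(stats, phi, F, n_iter=10, seed=0):
    """EM training sketch for the diagonal loading matrix D (per-utterance model).

    stats : list of (N_counts, f) pairs, one per utterance, where N_counts is a
            C vector of occupancies and f a CF vector of centralized statistics
    phi   : CF vector, diagonal of the UBM covariances
    """
    rng = np.random.default_rng(seed)
    CF = stats[0][1].shape[0]
    d = 0.1 * rng.standard_normal(CF)                    # initial diagonal of D
    for _ in range(n_iter):
        num = np.zeros(CF)
        den = np.zeros(CF)
        for N_counts, f in stats:
            N = np.repeat(N_counts, F)
            post_cov = 1.0 / (1.0 + d * (N / phi) * d)   # E-step: posterior covariance
            Ez = post_cov * (d / phi) * f                # E-step: posterior mean E[z]
            Ezz = post_cov + Ez ** 2                     # second moment E[z^2]
            num += f * Ez                                # M-step accumulators
            den += N * Ezz
        d = num / np.maximum(den, 1e-12)                 # M-step: update diag(D)
    return d
```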
2.3 From i-vector to i-supervector
The i-vector extraction is formulated in probabilistic terms based on a latent variable model, as in (2); the same holds for the i-supervector in (6). One obvious benefit is that, in addition to obtaining the i-vector as the posterior mean ϕ_x of the latent variable x, we can also compute the posterior covariance (4), which quantifies the uncertainty of the estimate, and fold this information into subsequent modeling [37]. Nevertheless, any form of dimension reduction unavoidably discards information. Following the same latent variable modeling paradigm, we propose the i-supervector as an uncompressed form of the i-vector representation.
Figure 1 compares the i-vector and i-supervector approaches from the extraction process to the subsequent PLDA modeling (recall that the parameter C denotes the number of mixtures in the UBM, F is the size of the acoustic feature vectors, and D is the length of the i-vector, while the i-supervector has a much higher dimensionality of C·F). The biggest difference is that two rounds of dimension reduction occur in the i-vector PLDA system, whereas there is only one in the i-supervector PLDA system. In this paper, our motivation is to keep the full dimensionality of the supervector as the input to the PLDA model, which has been shown to be an efficient model for high-dimensional data [10]. We envisage that more information is preserved via the use of the i-supervector, which can then be exploited by the PLDA.
3 PLDA modeling in i-supervector space
3.1 Probabilistic LDA
The i-vector and the i-supervector both represent a speech segment as a fixed-length vector instead of a variable-length sequence of vectors. Taking the fixed-length vector ϕ_ij as input, PLDA assumes that it is generated from a Gaussian density as follows:

p(ϕ_ij) = N(ϕ_ij | μ, Γ),     (10)

where μ denotes the global mean and Γ = FF^T + GG^T + Σ is the covariance matrix. Here, ϕ_ij is the i-supervector (or i-vector) representing the j-th session of the i-th speaker. We use ϕ to refer to either the i-vector ϕ_x or the i-supervector ϕ_z in the subsequent discussion.
The strength of PLDA lies in modeling the covariance Γ in the structural form FF^T + GG^T + Σ. To see this, we rewrite (10) as a marginal density:

p(ϕ_ij) = ∫∫ p(ϕ_ij | h_i, w_ij) p(h_i) p(w_ij) dh_i dw_ij,     (11)

where the conditional density is given by

p(ϕ_ij | h_i, w_ij) = N(ϕ_ij | μ + Fh_i + Gw_ij, Σ).     (12)

In the above equations, h_i is the speaker-specific latent variable pertaining to the i-th speaker, while w_ij is the session-specific latent variable corresponding to the j-th session of the i-th speaker. Both latent variables are assumed to follow a standard Gaussian prior. The low-rank matrices F and G model the subspaces corresponding to speaker and session variability (we denote their ranks as N_F and N_G, respectively), while the diagonal matrix Σ covers the remaining variation. From (12), the mean vector of the conditional distribution is given by

μ + Fh_i + Gw_ij.     (13)
Comparing (1) and (13), we see that both the i-vector extraction process and the PLDA model involve dimension reduction via a similar form of subspace modeling. This observation motivates us to explore the use of PLDA on the i-supervector. The extraction process serves as the front-end, which converts a variable-length sequence into a fixed-length vector without reducing the dimension. Speaker modeling and channel compensation are then carried out in the original supervector space.
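To make the generative view concrete, the toy sketch below draws several sessions of a single speaker from the PLDA model ϕ = μ + Fh + Gw + ε of (10) to (13); all dimensions and parameter values here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, NF, NG = 1000, 300, 200            # illustrative sizes only

mu = rng.standard_normal(dim)           # global mean
F = rng.standard_normal((dim, NF))      # speaker subspace, rank N_F
G = rng.standard_normal((dim, NG))      # channel subspace, rank N_G
sigma = 0.1 * np.ones(dim)              # diagonal residual covariance Sigma

def sample_sessions(n_sessions):
    """Draw n_sessions vectors of one speaker: phi = mu + F h + G w + eps."""
    h = rng.standard_normal(NF)                    # speaker latent, shared by all sessions
    W = rng.standard_normal((n_sessions, NG))      # one channel latent per session
    eps = np.sqrt(sigma) * rng.standard_normal((n_sessions, dim))
    return mu + h @ F.T + W @ G.T + eps

phi = sample_sessions(5)                # five sessions of the same speaker
```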
The downside of using i-supervector with PLDA is that we have to deal with large matrices as illustrated in the lower panel of Figure 1. The size of the matrices becomes enormous when more sessions are available for each speaker in the development data. This is typically the case for speaker recognition where the number of utterances per speaker is usually in the range from ten to over a hundred [38],[39]. In the following, we estimate the parameters of the PLDA model using the expectation maximization (EM) algorithm. We show how large matrices could be partitioned into sub-matrices, thereby making the matrix inversion and EM steps feasible.
3.2 E-step: joint estimation of posterior means
We assume that our development set consists of speech samples from N speakers each having J sessions, though the number of sessions J could be different for each speaker. All the J observations from the i-th speaker are collated to form the compound system [10]:
Each row in (14) says that each observation ϕ_ij = μ + Fh_i + Gw_ij + ε_ij consists of a speaker-dependent component μ + Fh_i and a session-dependent component Gw_ij + ε_ij, where ε_ij accounts for the residual variation in (10). In the E-step, we infer the posterior mean of the compound latent variable as follows:
where the noise covariance of the compound system is a block diagonal matrix whose diagonal blocks are Σ, and L^{-1} is the posterior covariance given by
The posterior inference involves the inversion of the matrix L. Following the notation in (14), we can express the matrix inversion as
The matrix L is large because we consider the joint inference of the latent variables representing a speaker and all sessions from the same speaker. The size of the matrix increases with the number of sessions J, while more sessions are always desirable for more robust parameter estimation. Direct inversion of the matrix therefore becomes intractable.
The precision matrix L possesses a unique structure since all sessions from the same speaker are tied to one speaker-specific latent variable. As depicted in (17) and (18), the matrix L can be partitioned into four sub-matrices: A, B, B^T, and C. Using the partitioned inverse formula [40], the inverse of the matrix L can be obtained as
where
The matrix M^{-1} is known as the Schur complement of L with respect to C [18]. Using these formulae, there are still two matrices to be inverted. The first is C^{-1} on the left-hand side of (18) and the second is M in (19). The inversion of C is simple as C is block diagonal; its inverse requires only Q = (G^TΣ^{-1}G + I)^{-1}, which is computed directly from an N_G-by-N_G matrix. Using the notations in (14) and (18), M is given by
where J is obtained via the matrix inversion lemma:
Using (18) in (15), it can be shown that the posterior mean of the speaker-specific latent variable h_i is given by
while the session-specific posterior mean of w_ij can be inferred as
One interesting point to note from (23) is that the i-supervector ϕ_ij is first centralized to the global mean μ and the speaker mean F⋅E{h_i} before being projected onto the session variability space.
From a computational perspective, the matrices Q = (G^TΣ^{-1}G + I)^{-1}, Λ = Q⋅G^TΣ^{-1}F, and J can be pre-computed and reused for all sessions and speakers in the E-step. The matrix M depends on the number of sessions J per speaker. In the event that J differs from speaker to speaker (which is usually the case), we compute
where F^TJF = VEV^T is obtained via eigenvalue decomposition, in which V is the square matrix of eigenvectors and E is the diagonal matrix of eigenvalues.
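The E-step can thus be carried out without ever forming, let alone directly inverting, the full precision matrix L. The numpy sketch below follows this recipe for one speaker with J sessions: it pre-computes Q, forms the product JF in factored form via the matrix inversion lemma, builds the Schur complement, and then recovers E{h_i} and E{w_ij}. The notation is our own, and a diagonal Σ is assumed.

```python
import numpy as np

def estep_posteriors(Phi_spk, mu, F, G, sigma):
    """Posterior means E[h] and E[w_j] for the J sessions of one speaker,
    exploiting the block structure of the precision matrix.

    Phi_spk : (J, dim) array of i-supervectors (or i-vectors) of one speaker
    sigma   : (dim,) vector, diagonal of Sigma
    """
    J_sessions, dim = Phi_spk.shape
    Sinv = 1.0 / sigma                                   # diagonal Sigma^{-1}
    SinvG = Sinv[:, None] * G
    Q = np.linalg.inv(G.T @ SinvG + np.eye(G.shape[1]))  # (G^T Sigma^{-1} G + I)^{-1}
    # J F = (G G^T + Sigma)^{-1} F via the matrix inversion lemma, kept in factored form
    JF = Sinv[:, None] * F - SinvG @ (Q @ (G.T @ (Sinv[:, None] * F)))
    centered = Phi_spk - mu                              # centralize to the global mean
    M = J_sessions * (F.T @ JF) + np.eye(F.shape[1])     # Schur complement
    Eh = np.linalg.solve(M, JF.T @ centered.sum(axis=0)) # speaker factor E[h_i]
    # session factors: centralize to mu and F E[h_i], then project onto the G subspace
    resid = centered - Eh @ F.T
    Ew = (resid * Sinv) @ G @ Q
    return Eh, Ew
```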
3.3 M-step: model estimation
The M-step can also be formulated in terms of sub-matrices. Let the compound latent variable be formed by appending h_i to each session-specific variable w_ij of the same speaker. We update the loading matrices F and G jointly as follows:
where the posterior mean of the compound variable is obtained by concatenating the results from (22) and (23). The second moment is computed for each individual session and speaker as follows:
The covariance matrix of the PLDA model could then be updated as
where the operator diag(⋅) diagonalizes a matrix by setting its off-diagonal elements to zero.
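A correspondingly simplified M-step is sketched below. It consumes the posterior means returned by the E-step sketch above and, for brevity, approximates the second moment of the compound latent variable by the outer product of its posterior mean, omitting the posterior covariance term that the full update accumulates.

```python
import numpy as np

def mstep_update(data, posteriors, mu):
    """Simplified joint M-step for [F | G] and the diagonal Sigma.

    data       : list of (J_i, dim) arrays, one per speaker
    posteriors : list of (Eh, Ew) pairs as returned by estep_posteriors
    """
    dim = data[0].shape[1]
    NF = posteriors[0][0].shape[0]
    NG = posteriors[0][1].shape[1]
    acc_xy = np.zeros((dim, NF + NG))       # sum over sessions of (phi - mu) E[y]^T
    acc_yy = np.zeros((NF + NG, NF + NG))   # sum of E[y] E[y]^T (posterior cov omitted)
    acc_res = np.zeros(dim)
    n = 0
    for Phi_spk, (Eh, Ew) in zip(data, posteriors):
        centered = Phi_spk - mu
        for c, w in zip(centered, Ew):
            y = np.concatenate([Eh, w])     # compound latent variable [h_i; w_ij]
            acc_xy += np.outer(c, y)
            acc_yy += np.outer(y, y)
            acc_res += c * c
            n += 1
    FG = acc_xy @ np.linalg.inv(acc_yy)     # joint update of [F | G]
    F_new, G_new = FG[:, :NF], FG[:, NF:]
    sigma_new = (acc_res - np.einsum('ij,ij->i', FG, acc_xy)) / n   # diag(Sigma)
    return F_new, G_new, sigma_new
```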
4 Likelihood ratio computation
4.1 Model comparison
Speaker verification is a binary classification problem, where a decision has to be made between two hypotheses with respect to a decision threshold. The null hypothesis H0 says that the test segment is from the target speaker, while the alternative H1 hypothesizes the opposite. Using the latent variable modeling approach with PLDA, H0 and H1 correspond to the models shown in Figure 3. In the model H0, {ϕ_1, ϕ_2} belong to the same speaker and hence share the same speaker-specific latent variable h_{1,2}. On the other hand, in the model H1, {ϕ_1, ϕ_2} belong to different speakers and hence have separate latent variables, h_1 and h_2. The verification score is calculated as the log-likelihood ratio between the two models:
where the likelihood terms are evaluated using (10) (we give more details in the next section). One key feature of the PLDA scoring function in (28) is that no speaker model is built or trained. The verification scores are computed by comparing the likelihoods of two different models, which describe the relationship between the training and test i-supervectors (or i-vectors) through the PLDA model.
4.2 PLDA verification score
To evaluate (28), we first recognize from Figure 3 that the generative equation for the model H0 is given by
Using the compound form of (29) in (10), we compute the log-likelihood of the model H0 by
where α = C⋅F for the case of i-supervector while α = D for the case of i-vector. To evaluate the log-likelihood function, we have to solve for the inversion and log-determinant of the following covariance matrix:
The inversion of the above matrix can be obtained by applying the matrix inversion lemma twice. In particular, we first compute J = (GG^T + Σ)^{-1}, the result of which is given by (21), and then apply the matrix inversion lemma again to the right-hand side of (31), which leads to
where M_2 is computed using the solution in (24) by setting J = 2. Now, to solve for the log-determinant of the same matrix in (31), we apply the matrix determinant lemma twice, in much the same way as the matrix inversion. Taking the log of the result leads to
Using (32) and (33) in (30), we arrive at
For the alternative hypothesis H1, we form the following compound equation:
The first thing to note is that the first and second rows of the system are decoupled and therefore could be treated separately. The log-likelihood of the alternative hypothesis H1 is therefore given by the following sum of log-likelihoods:
Using a similar approach as for the case of the null hypothesis, it can be shown that the solution to (36) is given by
Using (34) and (37) in (28), canceling out common terms, we arrive at the following log-likelihood ratio score for the verification task:
For brevity of notation, we let
One way to look at (39) is that it centralizes the vector ϕ_l and projects it onto the subspace F, where speaker information co-varies the most (i.e., dimension reduction), while de-emphasizing the subspace pertaining to channel variability. In (38), K = log|M_2|/2 − log|M_1| is constant for a given set of model parameters. Although K cancels out when score normalization is applied, the two log-determinant terms can in any case be calculated easily using an eigenvalue decomposition. In particular, we compute log|M_2| as Σ_n log(2λ_n + 1) and log|M_1| as Σ_n log(λ_n + 1), where {λ_n : n = 1, 2,…, N_F} are the eigenvalues of the matrix F^TJF (cf. (24)).
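Putting the pieces together, the sketch below evaluates the verification score of (38) using only N_F-by-N_F matrices, so that no CF-by-CF matrix is ever formed. It is our reconstruction from the quantities defined in this section (J, M_1, M_2, and the centralized, projected vectors), and it assumes a diagonal Σ.

```python
import numpy as np

def plda_llr(phi1, phi2, mu, F, G, sigma):
    """Log-likelihood ratio of 'same speaker' versus 'different speakers'."""
    Sinv = 1.0 / sigma
    SinvG = Sinv[:, None] * G
    Q = np.linalg.inv(G.T @ SinvG + np.eye(G.shape[1]))
    # J F = (G G^T + Sigma)^{-1} F via the matrix inversion lemma
    JF = Sinv[:, None] * F - SinvG @ (Q @ (G.T @ (Sinv[:, None] * F)))
    FtJF = F.T @ JF
    NF = F.shape[1]
    M1 = FtJF + np.eye(NF)               # M_J with J = 1 session
    M2 = 2.0 * FtJF + np.eye(NF)         # M_J with J = 2 sessions
    u1 = JF.T @ (phi1 - mu)              # centralize and project onto the F subspace
    u2 = JF.T @ (phi2 - mu)
    quad = ((u1 + u2) @ np.linalg.solve(M2, u1 + u2)
            - u1 @ np.linalg.solve(M1, u1)
            - u2 @ np.linalg.solve(M1, u2))
    K = 0.5 * np.linalg.slogdet(M2)[1] - np.linalg.slogdet(M1)[1]   # constant term
    return 0.5 * quad - K
```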
4.3 I-supervector pre-conditioning
Another prerequisite for good performance with PLDA is that the i-supervectors have to follow a normal distribution, as assumed in (10). It has been shown in [33], for the case of the i-vector, that whitening followed by length normalization helps toward this goal. However, whitening is hardly feasible for the i-supervector due to data scarcity: a full CF-by-CF covariance matrix cannot be estimated reliably from the available development data. To this end, we advocate the use of a Gaussianized version of the rank norm [34],[41]. The i-supervector is processed element-wise with warping functions that map each dimension to a standard Gaussian distribution (instead of a uniform distribution as in the rank norm). To put it mathematically, let ϕ_l(m), for m = 1, 2,…, CF, denote the elements of the i-supervector ϕ_l. We first obtain the normalized rank of ϕ_l(m) with respect to a background set B_m as follows:
where |⋅| denotes the cardinality of a set. The Gaussianized value is then obtained by applying the inverse cumulative distribution function (CDF) of a standard Gaussian distribution (i.e., the probit function) as follows:
where erf^{-1}(⋅) denotes the inverse error function. This can then be followed by length normalization prior to PLDA modeling.
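A compact sketch of the Gaussianized rank-norm followed by length normalization is given below. The strict-inequality rank and the clipping of extreme ranks are our own choices; the probit is implemented through the inverse error function erf^{-1}.

```python
import numpy as np
from scipy.special import erfinv

def gaussianized_rank_norm(phi, background):
    """Element-wise Gaussianized rank-norm followed by length normalization.

    phi        : CF vector to be normalized
    background : (n, CF) matrix holding the background set B_m for each dimension m
    """
    n = background.shape[0]
    # normalized rank of phi(m) within the background values of dimension m
    rank = (background < phi).sum(axis=0) / n
    rank = np.clip(rank, 0.5 / n, 1.0 - 0.5 / n)        # keep the probit finite
    gauss = np.sqrt(2.0) * erfinv(2.0 * rank - 1.0)     # inverse CDF of N(0, 1)
    return gauss / np.linalg.norm(gauss)                # length normalization
```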
5 Experiment
5.1 Experimental setup
Experiments were carried out on the core task (short2-short3) of NIST SRE08 [42]. We use two well-known metrics to evaluate the performance, namely, the equal error rate (EER) and the minimum detection cost (MinDCF). Two gender-dependent UBMs consisting of 512 Gaussians each were trained using data drawn from SRE04. Speech parameters were represented by 54-dimensional vectors of mel-frequency cepstral coefficients (MFCC) with first and second derivatives appended.
The loading matrices T in (1) and D in (5) were both trained with similar sets of data drawn from Switchboard, SRE04, and SRE05. We used 500 factors for T, while D is a diagonal matrix by definition. The dimensionality of the i-vector was therefore 500, while the i-supervector has a dimensionality of CF = 27,648. The ranks of the matrices F and G in the PLDA model were set to 300 and 200, respectively, for the case of the i-supervector. For the i-vector, the best result was found with the rank of F set to 300 and a full matrix used for Σ, in which case G was no longer required. This observation is consistent with that reported in [32]. Table 1 summarizes all the corpora used to train the UBM, the loading matrices T and D, the PLDA model, the whitening transformation matrix, the Gaussianized rank-norm, and the cohort data for s-norm [32].
5.2 Feature and score normalization
Experiments were performed on the det1 (int-int), det4 (int-tel), det5 (tel-mic), and det6 (tel-tel) common conditions as defined in the NIST SRE08 short2-short3 core task. The term int refers to interview-style speech recorded over a microphone channel. For the det1 common condition, both the training and test utterances are interview-style speech; similar definitions apply to the other common conditions. The first set of experiments aimed at verifying the effectiveness of the PLDA model in the i-supervector space without normalization (raw). Table 2 shows the results. It is evident that the i-supervector system performed much better than the i-vector system in all four common conditions for both male and female trials. For the particular case of female trials, the EER of the i-supervector system was lower by 10.27%, 15.46%, 28.42%, and 16.58% in det1, det4, det5, and det6, respectively, compared to that of the i-vector system. One possible reason is that the Gaussian assumption in (10) is better fulfilled in the higher-dimensional i-supervector space than in the i-vector space.
The second set of experiments investigated the effectiveness of different normalization methods applied to the i-supervector prior to PLDA modeling (i.e., length normalization, whitening, and Gaussianized rank-norm) and also the effect of score normalization (we used the s-norm as reported in [32]). For simplicity, we used only telephone data and report the results on det6 (i.e., the tel-tel common condition) in Table 3; we observed similar behavior for the other common conditions. From Table 3, it is clear that length normalization (len) always outperforms raw for both i-vector and i-supervector. Notice that the i-vector gains a huge improvement from length normalization: for the male subset, we observed 20.0% and 6.5% relative improvements in EER when length normalization was applied to the i-vector and i-supervector, respectively. Whitening followed by length normalization (white + len) further improves the performance of the i-vector. For the i-supervector, we used Gaussianized rank-norm followed by length normalization (grank + len) to cope with the high dimensionality. Finally, we also noticed that s-norm gives a consistent improvement for both i-vector and i-supervector.
5.3 Channel factors in i-supervector space
The low-rank matrix G models the subspace corresponding to channel variability, as described in Section 3.1. We evaluated the performance of the i-supervector system for different numbers of channel factors, N_G. Table 4 shows the results for the det6 common condition. We can see that with N_G = 0, which corresponds to a fully diagonal PLDA model, the EER and MinDCF for both male and female trials were very poor. Increasing the number of channel factors N_G to just 10 reduces the EER by 47% and 46% for the male and female sets, respectively. Further increases in N_G reduce the EER gradually until it levels off at N_G = 200, after which no further improvement could be attained. We set N_G = 200 for the i-supervector PLDA system in subsequent experiments.
5.3.1 Performance comparison
In this section, we compared the performance of i-supervector and i-vector under different train-test channel conditions. The PLDA models used for i-vector and i-supervector were the same as described in Section 5.2. In addition, we included microphone data (drawn from SRE05 and SRE06) for the whitening transform, Gaussianized rank-norm, and s-norm to better handle the interview (int) and microphone (mic) channel conditions.
Table 5 shows the results when full normalization (i.e., white + len + snorm for the i-vector, grank + len + snorm for the i-supervector) was applied. Here, we consider the EER and MinDCF obtained by pooling the male and female scores. The DET curves under the four common conditions are plotted in Figure 4. Similar to the observation in Section 5.2, the i-vector gives better performance than the i-supervector when full normalization is applied, except in det5, where the i-supervector gives a much lower EER though a slightly worse MinDCF. This again shows that the current normalization strategy (Gaussianized rank-norm followed by length normalization), though effective, has to be improved further. Also shown in Table 5 and Figure 4 are the results obtained by fusing the i-vector and i-supervector systems. The fusion of the two systems gives competitive performance, with slightly lower EER and MinDCF across all four common conditions. The two systems were fused at the score level as follows:
where s_1 and s_2 are the i-vector and i-supervector scores, respectively. The fusion weight β was set to 0.5, 0.5, 0.3, and 0.4 for det1, det4, det5, and det6, respectively.
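A minimal sketch of this score-level fusion is given below; we assume that β weights the i-vector score s_1 and (1 − β) the i-supervector score s_2, which is one plausible reading of the fusion rule.

```python
def fuse_scores(s1, s2, beta):
    """Score-level fusion of the i-vector score s1 and the i-supervector score s2.
    Weighting s1 by beta is our assumption; e.g., beta = 0.3 for det5 would then
    place more weight on the i-supervector score."""
    return beta * s1 + (1.0 - beta) * s2
```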
5.4 Computation complexity comparison
The experiments were carried out on the following hardware configuration: a CentOS 6.4 system with an Intel Xeon E5-2687W processor (8 cores, 3.1 GHz per core) and 128 GB of memory. We compared the total time and the real-time factor of the i-supervector and i-vector systems at four different stages, namely, loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring. The total variability matrix T in (1) and the diagonal matrix D in (5) were both trained using a similar set of data drawn from Switchboard, SRE04, and SRE05. Table 6 lists the time taken to train T and D with ten EM iterations. We can see that it takes about 16.75 h to train T using 348 h of speech, which implies a real-time factor of 0.048. In contrast, it took only 380 s to train D: because D is a diagonal matrix, simple element-wise vector multiplications can be used in place of large matrix multiplications.
After training the total variability space, we extracted i-vectors and i-supervectors for all utterances. Table 6 shows the time required for extracting the i-vectors and i-supervectors from the entire SRE04 dataset. The result shows that i-vector extraction consumes much more time than i-supervector extraction. PLDA models were then trained on the i-vectors and i-supervectors drawn from Switchboard, SRE04, and SRE05. We can see that training a PLDA model on the i-vector takes much less time than on the i-supervector. Finally, we compared the computational requirements of PLDA scoring on the NIST SRE08 short2-short3 core task with 98,776 trials. It can be seen that i-supervector scoring took more time than i-vector scoring, mainly due to its comparatively high dimensionality. In summary, the i-supervector system requires less computation at the front-end, while the i-vector system is faster at the PLDA back-end.
6 Conclusions
We have introduced the use of the uncompressed form of the i-vector (i.e., the i-supervector) for PLDA-based speaker verification. Similar to an i-vector, an i-supervector represents a variable-length speech utterance as a fixed-length vector. Different from the i-vector, however, the total variability space is kept at the same dimensionality as the original supervector space. To this end, we showed how the manipulation of high-dimensional matrices can be done efficiently in training and scoring with the PLDA model. We also introduced the use of the Gaussianized rank-norm for feature normalization prior to PLDA modeling.
Compared to the i-vector, we found that the i-supervector performs better when no normalization (of either features or scores) is applied. This suggests that the Gaussian assumption imposed by PLDA becomes less stringent and easier to fulfill in the higher-dimensional i-supervector space. However, the performance advantage brought by the high dimensionality diminishes when full normalization is applied. As such, the current normalization strategy, though effective, has to be improved for better performance; this is a point for future work. We also showed that the fusion system gives competitive performance compared to either the i-vector or the i-supervector alone. Furthermore, we analyzed the computational complexity of the i-supervector system, compared to that of the i-vector, at four different stages, namely, loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring. The results showed that the i-supervector system takes much less time than the i-vector system for loading matrix estimation and posterior extraction.
Endnote
^a The number of sessions is usually limited in face recognition, for which PLDA was originally proposed in [10].
References
G Doddington, Speaker recognition based on idiolectal differences between speakers, in Proc. 7th European Conference on Speech Communication and Technology (Eurospeech) (Scandinavia, 2001), pp. 2521–2524
C Espy-Wilson, S Manocha, S Vishnubhotla, A new set of features for text-independent speaker identification, in Proc. Interspeech (Pittsburgh, PA, USA, 2006), pp. 1475–1478
T Kinnunen, KA Lee, H Li, Dimension reduction of the modulation spectrogram for speaker verification, in The Speaker and Language Recognition Workshop (Stellenbosch, South Africa, 2008)
Kinnunen T, Li HZ: An overview of text-independent speaker recognition: from features to supervectors. Speech Comm. 2010, 52(1):12-40. 10.1016/j.specom.2009.08.009
L Wang, K Minami, K Yamamoto, S Nakagawa, Speaker identification by combining MFCC and phase information in noisy environments, in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Dallas, TX, USA, 2010), pp. 4502–4505
Nakagawa S, Wang L, Ohtsuka S: Speaker identification and verification by combining MFCC and phase information. IEEE Trans. Audio Speech Lang. Process. 2012, 20(4):1085-1095. 10.1109/TASL.2011.2172422
Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 2000, 10(1):19-41. 10.1006/dspr.1999.0361
WM Campbell, DE Sturim, DA Reynolds, A Solomonoff, SVM based speaker verification using a GMM supervector kernel and NAP variability compensation, in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Philadelphia, USA, 2005), pp. 97–100
Campbell WM, Sturim DE, Reynolds DA: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 2006, 13(5):308-311. 10.1109/LSP.2006.870086
SJD Prince, JH Elder, Probabilistic linear discriminant analysis for inferences about identity, in Proc. International Conference on Computer Vision (Rio De Janeiro, Brazil, 2007), pp. 1–8
Wang L, Kitaoka N, Nakagawa S: Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM. Speech Comm. 2007, 9(6):501-513. 10.1016/j.specom.2007.04.004
Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: A study of inter-speaker variability in speaker verification. IEEE Trans. Audio. Speech Lang. Process. 2008, 16(5):980-988. 10.1109/TASL.2008.925147
Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P: Front-end factor analysis for speaker verification. IEEE Trans. Audio. Speech Lang. Process. 2011, 19(4):788-798. 10.1109/TASL.2010.2064307
Kua JMK, Epps J, Ambikairajah E: i-Vector with sparse representation classification for speaker verification. Speech Comm. 2013, 55(5):707-720. 10.1016/j.specom.2013.01.005
Kelly F, Drygajlo A, Harte N: Speaker verification in score-ageing-quality classification space. Comput. Speech Lang. 2013, 27(5):1068-1084. 10.1016/j.csl.2012.12.005
Wan V, Campbell WM: Support vector machines for speaker verification and identification. IEEE Workshop Neural Netw. Signal Process. 2000, 2: 77-784.
Campbell WM, Campbell JP, Reynolds DA: Support vector machines for speaker and language recognition. Comp. Speech Lang. 2006, 20: 210-229. 10.1016/j.csl.2005.06.003
Bishop C: Pattern Recognition and Machine Learning. Springer Science & Business Media, New York; 2006.
KA Lee, C You, H Li, T Kinnunen, A GMM-based probabilistic sequence kernel for speaker recognition, in Proc. Interspeech (Antwerp, Belgium, 2007), pp. 294–297
You CH, Lee KA, Li H: GMM-SVM kernel with a Bhattacharyya-based distance for speaker recognition. IEEE Trans. Audio Speech Lang. Process. 2010, 18(6):1300-1312. 10.1109/TASL.2009.2032950
Dong X, Zhaohui W: Speaker recognition using continuous density support vector machines. Electron. Lett. 2001, 37(17):1099-1101. 10.1049/el:20010741
Wan V, Renals S: Speaker verification using sequence discriminant support vector machines. IEEE Trans. Speech Audio Process. 2005, 13(2):203-210. 10.1109/TSA.2004.841042
N Dehak, G Chollet, Support vector GMMs for speaker verification, in Proc. IEEE Odyssey: The Speaker and Language Recognition Workshop (San Juan, Puerto Rico, 2006)
Kenny P, Boulianne G, Ouellet P, Dumouchel P: Speaker and session variability in GMM-Based speaker verification. IEEE Trans. Audio Speech Lang. Process. 2007, 15(4):1448-1460. 10.1109/TASL.2007.894527
N Dehak, Discriminative and generative approaches for long- and short-term speaker characteristics modeling: application to speaker verification, in Ph.D. thesis (École de Technologie Supérieure, Université du Québec, 2009)
A Kanagasundaram, D Dean, R Vogt, M McLaren, S Sridharan, M Mason, Weighted LDA techniques for i-vector based speaker verification, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (Kyoto, Japan, 2012), pp. 4781–4794
Kanagasundaram A, Dean D, Sridharan S, McLaren M, Vogt R: I-vector based speaker recognition using advanced channel compensation techniques. Comput. Speech Lang. 2014, 28(1):121-140. 10.1016/j.csl.2013.04.002
BGB Fauve, D Matrouf, N Scheffer, J-F Bonastre, JSD Mason, State-of-the-art performance in text-independent speaker verification through open-source software, in IEEE International Conference on Acoustics, Speech, and Signal Processing (Honolulu, USA, 2007), pp. 1960–1968
M Senoussaoui, P Kenny, N Dehak, P Dumouchel, An i-vector extractor suitable for speaker recognition with both microphone and telephone speech, in Proc. Odyssey: The Speaker and Language Recognition Workshop (Brno, Czech, 2010)
A Kanagasundaram, R Vogt, D Dean, S Sridharan, M Mason, I-vector based speaker recognition on short utterances, in Proc. Interspeech (Florence, 2011), pp. 2341–2344
L Machlica, Z Zajic, An efficient implementation of probabilistic linear discriminant analysis, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (Vancouver, Canada, 2013), pp. 7678–7682
P Kenny, Bayesian speaker verification with heavy-tailed priors, in Proc. Odyssey: Speaker and Language Recognition Workshop (Brno, Czech, 2010)
D Garcia-Romero, CY Espy-Wilson, Analysis of i-vector length normalization in speaker recognition systems, in Proc. Interspeech (Florence, Italy, 2011), pp. 249–252
Y Jiang, KA Lee, Z Tang, B Ma, A Larcher, H Li, PLDA modeling in i-vector and supervector space for speaker verification, in Proc. Interspeech (Portland, USA, 2012)
Kenny P, Boulianne G, Dumouchel P: Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 2005, 13(3):345-354. 10.1109/TSA.2004.840940
Gauvain J, Lee C-H: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chain. IEEE Trans. Speech Audio Process. 1994, 2(2):291-298. 10.1109/89.279278
P Kenny, T Stafylakis, P Ouellet, MJ Alam, P Dumouchel, PLDA for speaker verification with utterances of arbitrary duration, in Proc. IEEE ICASSP (Vancouver, Canada, 2013), pp. 7649–7653
H Li, B Ma, KA Lee, CH You, H Sun, A Larcher, IIR system description for the NIST 2012 speaker recognition evaluation, in NIST SRE'12 Workshop (Orlando, 2012)
R Saeidi, KA Lee, T Kinnunen, T Hasan, B Fauve, P-M Bousquet, E Khoury, PL Sordo Martinez, JMK Kua, CH You, H Sun, A Larcher, P Rajan, V Hautamaki, C Hanilci, B Braithwaite, R Gonzalez-Hautamaki, SO Sadjadi, G Liu, H Boril, N Shokouhi, D Matrouf, L El Shafey, P Mowlaee, J Epps, T Thiruvaran, DA van Leeuwen, B Ma, H Li, JHL Hansen et al., I4U submission to NIST SRE2012: a large-scale collaborative effort for noise-robust speaker verification, in Proc. Interspeech (Lyon, France, 2013), pp. 1986–1990
Murphy KP: Machine Learning-A Probabilistic Perspective. MIT Press, Massachusetts; 2012.
A Stolcke, S Kajarekar, L Ferrer, Nonparametric feature normalization for SVM-based speaker verification, in Proc. ICASSP (Ohio, USA, 2008), pp. 1577–1580
NIST, The NIST year 2008 speaker recognition evaluation plan. http://www.itl.nist.gov/iad/mig/tests/sre/2008/
Acknowledgements
This work was partially supported by a research grant from the Tateisi Science and Technology Foundation.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
Cite this article
Jiang, Y., Lee, K.A. & Wang, L. PLDA in the i-supervector space for text-independent speaker verification. J AUDIO SPEECH MUSIC PROC. 2014, 29 (2014). https://doi.org/10.1186/s13636-014-0029-2
DOI: https://doi.org/10.1186/s13636-014-0029-2