CN113160795B - Language feature extraction model training method, device, equipment and storage medium - Google Patents
- Publication number
- CN113160795B CN202110467103.9A
- Authority
- CN
- China
- Prior art keywords
- feature
- language
- feature extraction
- extraction model
- examples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Machine Translation (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The application relates to the technical field of artificial intelligence and discloses a training method, device, equipment and storage medium for language feature extraction models, wherein the method comprises the following steps: performing dimension reduction processing on the feature vector of the voice sample in the dimension reduction layer to obtain a dimension reduction feature vector; determining context characteristics according to the dimension reduction characteristic vector; redefining positive examples and negative examples of the voice samples, and predicting the positive examples and the negative examples included in each voice sample according to the context characteristics; calculating errors of prediction results of the positive examples and the negative examples through a loss function of a preset feature extraction model; and updating model parameters of the language feature extraction model according to the errors. The method realizes that the context contrast predictive coding is used for extracting the language features, and the language features are represented by the feature vector mean value of the voice sample, so that the features irrelevant to the language are diluted, and the efficiency and the accuracy of training the language feature extraction model are improved.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method, device, equipment and storage medium for language feature extraction models.
Background
Contrastive predictive coding is a contrastive learning method in deep learning. Because it adopts a contrastive learning scheme, it can effectively find the differences between samples and is therefore widely applied in many fields. For example, in the speech domain, the pronunciation features that follow the current sample can be predicted from the sample's preceding context, which works well on speech-related tasks such as speaker verification and phoneme classification.
However, in the training of language feature models, the contrastive learning mechanism is not effectively connected to the language discrimination task. As a result, conventional language feature models cannot apply contrastive predictive coding to language recognition, and they take into account language-irrelevant information such as speaking rate, volume and gender, which degrades the language recognition result.
Disclosure of Invention
The application provides a language feature extraction model training method, device, equipment and storage medium, which can realize that context contrast predictive coding is used for extracting language features, the language features are represented by feature vector means of a voice sample, the features irrelevant to the language are diluted in the training process of the language feature extraction model, and the training efficiency and accuracy of the language feature extraction model can be improved.
In a first aspect, the present application provides a language feature extraction model training method, where the method includes:
performing dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of a preset language feature extraction model to obtain a dimension reduction feature vector;
inputting the dimension-reducing feature vector into a time sequence model to obtain the above feature and the following feature;
combining the above feature and the following feature to obtain a context feature;
redefining positive examples and negative examples of the voice samples, and predicting each voice sample as a positive example or a negative example according to the contextual characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples which are the same as the language of the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples which are different from the language of the voice sample;
and determining errors of the predicted positive examples and the predicted negative examples through a loss function of the preset feature extraction model, and updating model parameters of the language feature extraction model according to the errors.
In a second aspect, the present application further provides a training device for language feature extraction model, including:
the processing module is used for carrying out dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of the preset language feature extraction model to obtain a dimension reduction feature vector;
the acquisition module is used for inputting the dimension reduction feature vector into a time sequence model to acquire the above features and the following features;
the obtaining module is used for combining the above feature and the following feature to obtain the context feature;
the prediction module is used for redefining positive examples and negative examples of the voice samples, and predicting each voice sample to be the positive example or the negative example according to the contextual characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples which are the same as the language of the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples which are different from the language of the voice sample;
and the updating module is used for determining the errors of the predicted positive examples and the predicted negative examples through a loss function of the preset feature extraction model, and updating model parameters of the language feature extraction model according to the errors.
In a third aspect, the present application further provides a language feature extraction model training apparatus, including:
a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the steps of the language feature extraction model training method according to the first aspect when the computer program is executed.
In a fourth aspect, the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor causes the processor to implement the steps of the language feature extraction model training method according to the first aspect above.
The application discloses a language feature extraction model training method, device, equipment and storage medium. First, dimension reduction processing is performed on the feature vectors of a voice sample to obtain dimension-reduced feature vectors; the context feature of the language of the voice sample is determined according to the dimension-reduced feature vectors; and the positive examples and negative examples of the voice samples are redefined, so that the positive examples and negative examples included in each frame of voice samples are predicted according to the context feature. In this way, contextual contrastive predictive coding is used to extract language features, the language features are represented by the mean of the feature vectors of the voice samples, and features irrelevant to the language are diluted. The error of the prediction results for the positive examples and negative examples is then calculated through the loss function of the preset feature extraction model, and the model parameters of the preset language feature extraction model are updated according to the error, which improves the efficiency and accuracy of training the language feature extraction model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a training method for language feature extraction model provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature encoder provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language feature extraction model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a training device for language feature extraction model according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a language feature extraction model training apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the application provides a language feature extraction model training method, device, equipment and storage medium. In the language feature extraction model training method provided by the embodiment of the application, dimension reduction processing is performed on the feature vectors of a voice sample to obtain dimension-reduced feature vectors; the context feature of the language of the voice sample is determined according to the dimension-reduced feature vectors; and the positive examples and negative examples of the voice samples are redefined, so that the positive examples and negative examples included in each frame of voice samples are predicted according to the context feature. In this way, contextual contrastive predictive coding is used to extract language features, the language features are represented by the mean of the feature vectors of the voice samples, and features irrelevant to the language are diluted. The error of the prediction results for the positive examples and negative examples is then calculated through the loss function of the preset feature extraction model, and the model parameters of the preset language feature extraction model are updated according to the error, which improves the efficiency and accuracy of training the language feature extraction model.
For example, the language feature extraction model training method provided by the embodiment of the application can be applied to a terminal or a server. Contextual contrastive predictive coding is used to extract language features, the language features are represented by the mean of the feature vectors of the voice samples, and features irrelevant to the language are diluted, which improves the efficiency and accuracy of training the language feature extraction model.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a training method for language feature extraction model according to an embodiment of the present application. The language feature extraction model training method can be realized by a terminal or a server, wherein the terminal can be a handheld terminal, a notebook computer, a wearable device or a robot and the like; the server may be a single server or a cluster of servers.
As shown in fig. 1, the language feature extraction model training method provided in this embodiment specifically includes: step S101 to step S105. The details are as follows:
S101, performing dimension reduction processing on feature vectors of the voice samples in a dimension reduction layer of a preset language feature extraction model to obtain dimension reduction feature vectors.
The preset language feature extraction model comprises a feature encoder. The feature encoder may be a neural network with a preset number of convolutional layers that samples the input voice samples to obtain the feature vectors of the voice samples. Specifically, the input voice sample may be an audio file made up of pulse code modulation (PCM) sampling points. The audio file is input into a neural network with the preset number of convolutional layers (assumed here to be five), and after the sampling process of the neural network, a feature vector is output for each frame of the voice sample (each frame includes 160 PCM sampling points).
In an embodiment, performing a dimension reduction process on feature vectors of a speech sample in a dimension reduction layer of a preset language feature extraction model to obtain dimension-reduced feature vectors, including: inputting the voice samples into a feature encoder to obtain respective corresponding feature vectors of each frame of voice samples; and respectively carrying out dimension reduction processing on each feature vector through a dimension reduction layer to obtain dimension reduction feature vectors corresponding to each frame of voice sample. For example, assuming that each frame of speech sample is a 512-dimensional feature vector, in this embodiment, the 512-dimensional feature vector is subjected to feature dimension transformation by the dimension reduction layer, and is mapped into a 40-dimensional feature vector through a 512×40 linear transformation. The 40-dimensional feature vector is a dimension-reduction feature vector corresponding to the voice sample of the current frame. The dimension conversion is carried out on the feature vector through the dimension reduction layer, so that the parameter quantity of a later time sequence model can be reduced, and more compact feature expression can be obtained while feature comparison is facilitated.
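The 512-to-40 mapping described above can be sketched as a single matrix multiplication. This is a minimal illustration in plain Python, assuming a randomly initialised weight matrix; the names `make_projection` and `reduce_dim` are hypothetical and not taken from the application, and in the actual model the weights would be learned during training.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an in_dim x out_dim weight matrix with small random values (untrained)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(out_dim)] for _ in range(in_dim)]


def reduce_dim(frame_vec, weight):
    """Map one frame's feature vector to the reduced dimension: y = x W."""
    out_dim = len(weight[0])
    return [sum(x * row[j] for x, row in zip(frame_vec, weight)) for j in range(out_dim)]


# Three frames of 512-dimensional encoder output (random stand-ins for real features).
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(512)] for _ in range(3)]

W = make_projection(512, 40)                 # the 512 x 40 linear transformation
reduced = [reduce_dim(f, W) for f in frames]

print(len(reduced), len(reduced[0]))         # 3 40
```

The layer holds 512 × 40 = 20480 weights, and every later stage then operates on 40-dimensional inputs, which is what keeps the parameter count of the time sequence model small.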
Illustratively, as shown in fig. 2, fig. 2 is a schematic structural diagram of a feature encoder provided in an embodiment of the present application. As can be seen from fig. 2, in the present embodiment, the input of the feature encoder 200 is a speech sample 201 in units of frames, and the output is a feature vector 202 corresponding to each frame of speech sample. It should be noted that, the feature encoder 200 shown in fig. 2 is a convolutional neural network including 5 convolutional layers, which does not constitute a limitation of the feature encoder 200, and the feature encoder 200 may be other types of neural networks. After each frame of speech samples passes through the feature encoder 200, a corresponding feature vector 202 is obtained, and in this embodiment, as can be seen from fig. 2, the feature vector 202 corresponding to each frame of speech samples is a 512-dimensional feature vector.
S102, inputting the dimension reduction feature vector into a time sequence model to acquire the above features and the following features.
The time sequence model comprises an autoregressive model and a reverse autoregressive model. For example, the autoregressive model includes a gated recurrent unit (GRU) and the reverse autoregressive model includes a reverse gated recurrent unit. The gated recurrent unit analyzes the dimension-reduced feature vectors corresponding to the first t frames of voice samples, and the information obtained after the t-th frame of voice samples has been encoded and decoded by the gated recurrent unit is taken as the above feature. The reverse gated recurrent unit analyzes the dimension-reduced feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame, and the information obtained after the (t+1)-th frame of voice samples has been encoded and decoded by the reverse gated recurrent unit is taken as the following feature.
In one embodiment, inputting the dimension-reduced feature vector into the time sequence model to obtain the above feature and the following feature comprises: inputting the dimension-reduced feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature; and inputting the dimension-reduced feature vectors corresponding to the voice samples from the last frame to the (t+1)-th frame into the reverse autoregressive model to obtain the following feature.
The autoregressive model includes an encoder-decoder. When both the input data and the learning objective are variable-length sequences, two coupled context-based RNNs are used as the encoder and the decoder, respectively, as in the seq2seq RNN framework used in language models. In operation, the encoder processes the input sequence (in this embodiment, the dimension-reduced feature vectors of the voice sample) and outputs the encoded vector to the decoder, which generates a new sequence from the output of the encoder. In this embodiment, the new sequence output by the decoder of the autoregressive model for the dimension-reduced feature vector of the t-th frame of voice samples is the above feature, and the new sequence output by the decoder of the reverse autoregressive model for the dimension-reduced feature vector of the (t+1)-th frame of voice samples is the following feature. By acquiring the above feature and the following feature through two time sequence models, this embodiment applies contextual contrastive predictive coding to the extraction of language features.
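The two-directional pass can be sketched as follows. This is a simplified illustration, not the encoder-decoder arrangement of the embodiment: it assumes a single bias-free GRU cell per direction with random, untrained weights and takes the final hidden state as the feature. The class `TinyGRU`, the hidden size of 128 and the frame count are assumptions made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Bias-free GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, in_dim, hid_dim, seed):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hid_dim = hid_dim
        self.Wz, self.Wr, self.Wn = mat(hid_dim, in_dim), mat(hid_dim, in_dim), mat(hid_dim, in_dim)
        self.Uz, self.Ur, self.Un = mat(hid_dim, hid_dim), mat(hid_dim, hid_dim), mat(hid_dim, hid_dim)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]  # update gate
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]  # reset gate
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, rh))]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, sequence):
        """Consume the frames in the given order and return the final hidden state."""
        h = [0.0] * self.hid_dim
        for x in sequence:
            h = self.step(x, h)
        return h


rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(10)]  # 10 reduced frames
t = 6

forward_gru = TinyGRU(40, 128, seed=1)
backward_gru = TinyGRU(40, 128, seed=2)

above_feature = forward_gru.run(frames[:t])                  # frames 1..t, in order
following_feature = backward_gru.run(reversed(frames[t:]))   # last frame back to frame t+1

print(len(above_feature), len(following_feature))            # 128 128
```

Running the second unit over the reversed tail of the sequence means its final state summarises everything from the last frame back to frame t+1, mirroring how the forward unit summarises frames 1 to t.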
S103, combining the above feature and the following feature to obtain the context feature.
In an embodiment, the above feature and the following feature are combined; in particular, the last feature of the above feature and the first feature of the following feature may be spliced together to obtain the context feature. For example, the above feature is a 128-dimensional feature vector and the following feature is also a 128-dimensional feature vector, so that after the above feature and the following feature are combined, the obtained context feature is a 256-dimensional feature vector.
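The splicing step amounts to a vector concatenation. The sketch below assumes each time sequence model returns a list of 128-dimensional states in frame order; the function name `combine_context` and the constant-valued stand-in vectors are hypothetical.

```python
def combine_context(above_states, following_states):
    """Splice the last above state with the first following state."""
    return above_states[-1] + following_states[0]  # list concatenation


# Stand-in states: two 128-dimensional vectors per direction.
above_states = [[0.1] * 128, [0.2] * 128]
following_states = [[0.3] * 128, [0.4] * 128]

context = combine_context(above_states, following_states)
print(len(context))  # 256
```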
S104, redefining positive examples and negative examples of the voice samples, and predicting each voice sample as the positive example or the negative example according to the context characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples with the same language as the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples with different languages from the voice sample.
In one embodiment, redefining the positive examples and negative examples of the voice samples includes: determining a target language of the voice sample; defining, in each batch of voice samples, the voice samples whose language is the same as the target language as positive examples; and defining, in each batch of voice samples, the voice samples whose language differs from the target language as negative examples. For example, assume that a batch (mini-batch) of voice samples contains 10 Chinese samples and 10 English samples. If the target language of the voice sample is determined to be Chinese, the Chinese samples are taken as positive examples and the English samples as negative examples. Specifically, in the embodiment of the present application, the features of a positive example sample are replaced by the feature mean of the language corresponding to that positive example sample, giving the feature vector of the positive example. Similarly, the features of a negative example sample are replaced by the feature mean of the language corresponding to that negative example sample, giving the feature vector of the negative example. It should be noted that each language different from the target language may constitute one group of negative examples, and all negative examples together constitute a negative example set, which includes the negative examples corresponding to every voice sample whose language differs from the target language of the voice sample. In each group of negative examples, the number of negative example samples is the same as the number of all voice samples of the language corresponding to the group, and the feature vector of each group of negative example samples is the mean of the feature vectors of all voice samples of the language corresponding to that group.
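The language-mean construction can be sketched on a toy batch. The two-dimensional features, the language tags `"zh"` and `"en"`, and the helper names `language_means` and `positive_and_negatives` are assumptions made for illustration; the real feature vectors would be the per-sample vectors produced by the model.

```python
from collections import defaultdict


def language_means(features, languages):
    """Average the feature vectors of all samples that share a language."""
    sums = {}
    counts = defaultdict(int)
    for vec, lang in zip(features, languages):
        if lang not in sums:
            sums[lang] = [0.0] * len(vec)
        sums[lang] = [s + v for s, v in zip(sums[lang], vec)]
        counts[lang] += 1
    return {lang: [s / counts[lang] for s in total] for lang, total in sums.items()}


def positive_and_negatives(target_language, means):
    """Positive example: mean of the target language. Negatives: one mean per other language."""
    positive = means[target_language]
    negatives = {lang: m for lang, m in means.items() if lang != target_language}
    return positive, negatives


# Toy mini-batch: two Chinese samples and two English samples.
features = [[1.0, 3.0], [3.0, 5.0], [10.0, 0.0], [20.0, 2.0]]
languages = ["zh", "zh", "en", "en"]

means = language_means(features, languages)
positive, negatives = positive_and_negatives("zh", means)

print(positive)   # [2.0, 4.0]
print(negatives)  # {'en': [15.0, 1.0]}
```

Averaging over every sample of a language is what dilutes speaker-specific attributes: properties that vary from sample to sample tend to cancel in the mean, while what the samples share, the language, remains.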
In one embodiment, predicting positive examples and negative examples included in each frame of speech samples based on the contextual characteristics includes: and calculating the inner product of the context feature and the dimension reduction feature vector of each frame of voice sample, and predicting each frame of voice sample as a positive example or a negative example according to the calculated inner product result and the preset correlation degree.
Specifically, the calculated inner product is taken as the degree of correlation between the context feature and each frame of voice samples. If the inner product of the voice sample of the current frame and the context feature is greater than the preset correlation degree, the voice sample of the current frame is determined to be highly correlated with the context feature and is predicted to be a positive example; if the inner product of the voice sample of the current frame and the context feature is smaller than or equal to the preset correlation degree, the correlation between the voice sample of the current frame and the context feature is determined to be low, and the voice sample of the current frame is predicted to be a negative example.
Before the inner product of the context feature and the dimension-reduced feature vector of each frame of voice samples is calculated, the context feature needs to be transformed by a matrix transformation into a vector with the same dimension as the dimension-reduced feature vector. For the matrix dimension transformation itself, reference may be made to the existing process of linear transformation of vector dimensions, which is not described in detail herein.
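The projection, inner product and threshold comparison can be sketched together. The example uses toy sizes (a 4-dimensional context feature projected to 2 dimensions) rather than the 256 and 40 dimensions of the embodiment; the weight matrix, the threshold of 0.5 and the function names are hypothetical values chosen so the arithmetic is easy to follow.

```python
def project(context, weight):
    """Linear map from the context dimension to the reduced dimension: c W."""
    out_dim = len(weight[0])
    return [sum(c * row[j] for c, row in zip(context, weight)) for j in range(out_dim)]


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict(context, frame_vecs, weight, threshold):
    """Label a frame positive when its correlation with the context exceeds the threshold."""
    projected = project(context, weight)
    return ["positive" if inner(projected, f) > threshold else "negative" for f in frame_vecs]


context = [0.5, 0.0, 0.5, 0.0]                                 # toy context feature
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]      # 4 x 2 projection
frame_vecs = [[2.0, 0.0], [0.0, 3.0], [0.2, 0.0]]              # reduced frame vectors

labels = predict(context, frame_vecs, weight, threshold=0.5)
print(labels)  # ['positive', 'negative', 'negative']
```

Here the projected context is [1.0, 0.0], so the three inner products are 2.0, 0.0 and 0.2, and only the first exceeds the assumed threshold.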
S105, determining errors of the predicted positive examples and the predicted negative examples through a loss function of the preset feature extraction model, and updating model parameters of the language feature extraction model according to the errors.
The loss function of the preset feature extraction model comprises an anti-noise loss function. The purpose of this loss function is to fit the generated sample distribution to the real sample distribution as closely as possible. In the embodiments of the present application, the purpose of the loss function is to fit the distribution of the predicted positive and negative examples to the distribution of the actual positive and negative examples as closely as possible. Specifically, in the present embodiment, the value of the loss function is expressed as the divergence between the distribution of the predicted positive and negative examples and the distribution of the actual positive and negative examples. The closer the value of the loss function is to 0, the closer the distribution of the predicted positive and negative examples is to that of the actual positive and negative examples, and the smaller the error of the predicted positive and negative examples; conversely, the closer the value of the loss function is to 1, the further the distribution of the predicted positive and negative examples is from that of the actual positive and negative examples, and the larger the error of the predicted positive and negative examples.
Illustratively, determining the error of the predicted positive example and the negative example by a loss function of a preset feature extraction model comprises: and fitting the first distribution of the predicted positive examples and negative examples with the second distribution of the actual positive examples and negative examples by the anti-noise loss function to obtain the errors of the predicted positive examples and negative examples.
Specifically, the anti-noise loss function can be expressed as:
wherein J is (D) (θ D ,θ G ) Representing the degree of fitting of a first distribution of predicted positive and negative examples to a second distribution of actual positive and negative examples (also referred to as the divergence of the first distribution and the second distribution), representing the error of the predicted positive and negative examples, θ D A first distribution, θ, representing predicted positive and negative examples G A second distribution representing actual positive and negative examples,distribution function representing predicted positive examples, +.>The distribution function representing the predicted counterexample, D (X) represents the discriminator of the language feature extraction model, and is used for carrying out true and false discrimination on the training sample X.
It should be noted that, when the errors of the predicted positive and negative examples are determined according to the anti-noise loss function, other samples in the same batch that have the same label (language) as the positive example may be treated as neither positive nor negative examples, that is, they may be excluded from the calculation of the loss function. In addition, if a batch contains no positive example for a category, the samples of that category do not participate in the calculation of the loss function. This can effectively improve the efficiency of calculating the loss function.
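The exclusion rule above can be sketched as follows. This is only one possible reading, under the assumption that each anchor sample is paired with a single chosen positive; the helper names `loss_participants` and `anchors_with_positive` and the language codes are hypothetical.

```python
def loss_participants(languages, anchor, positive):
    """Return indices of batch samples that take part in the loss for one anchor.

    Assumed reading: the chosen positive and every sample of a different
    language participate; other samples sharing the anchor's language are skipped.
    """
    target = languages[anchor]
    keep = []
    for i, lang in enumerate(languages):
        if i == anchor:
            continue
        if i == positive or lang != target:
            keep.append(i)
    return keep


def anchors_with_positive(languages):
    """Indices of samples whose language occurs at least twice in the batch."""
    counts = {}
    for lang in languages:
        counts[lang] = counts.get(lang, 0) + 1
    return [i for i, lang in enumerate(languages) if counts[lang] > 1]


batch = ["zh", "en", "zh", "zh", "fr"]
print(loss_participants(batch, anchor=0, positive=2))  # [1, 2, 4]
print(anchors_with_positive(batch))                    # [0, 2, 3]
```

Sample 3 shares the anchor's language but is not the chosen positive, so it is left out; the single "en" and "fr" samples have no positive in the batch and are therefore skipped as anchors.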
In one embodiment, updating the model parameters of the preset language feature extraction model according to the error includes: updating the model parameters of the language feature extraction model through back propagation according to the error.
Specifically, the loss function value of the feature extraction model corresponds to the errors of the predicted positive and negative examples. After these errors are obtained, a gradient descent algorithm is used to reduce the error value step by step. During this process, the parameters of the language feature extraction model are updated layer by layer from back to front until the error value reaches a minimum and stabilizes, at which point the update of the model parameters of the language feature extraction model is complete.
The parameter updating process of the language feature extraction model is the process of optimizing the parameters of the discriminator $D(X)$ of the language feature extraction model. Specifically, the parameter $\theta_d$ of the discriminator $D(X)$ is updated by the Adam gradient descent algorithm. Illustratively, the process of updating the discriminator parameter $\theta_d$ by the Adam gradient descent algorithm can be expressed by the following formula:
where $J^{(D)}$ denotes the cost function of the discriminator $D(X)$, and the value of the cost function represents the error in judging the authenticity of the input sample.
In the embodiment of the present application, the cost function of the discriminator $D(X)$ is the loss function of the preset language feature extraction model, specifically the anti-noise loss function. In this embodiment, with the Adam gradient descent algorithm, as the value of $J^{(D)}$ gradually decreases and stabilizes, the value of $\theta_d$ that changes with the value of $J^{(D)}$ can be calculated according to the formula, and the parameter $\theta_d$ is continuously updated based on the calculated value.
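For readers unfamiliar with Adam, the following minimal sketch shows the standard Adam update rule applied to a toy cost function. It is not the patent's formula: the quadratic cost, the hyperparameter values, and the name `adam_step` are assumptions used only to illustrate how a parameter vector is updated as the cost decreases and stabilizes.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a list of parameters; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment (mean) estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias correction, t starts at 1
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# Toy cost J(theta) = sum(theta_i ** 2), whose gradient is 2 * theta_i.
theta = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 2001):
    grad = [2 * p for p in theta]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)

print(theta)  # both parameters end up close to the minimiser 0
```

In a real model the gradient would come from back propagation of the loss through the network rather than from a closed-form expression.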
Illustratively, fig. 3 is a schematic structural diagram of the language feature extraction model provided in an embodiment of the present application. As can be seen from fig. 3, the language feature extraction model 300 includes a feature encoder 200 and a timing model 301. For details of the feature encoder 200 and the timing model 301, refer to the foregoing description of the embodiments of the present application; they are not repeated here.
As the above analysis shows, the language feature extraction model training method provided by the embodiments of the present application obtains a dimension-reduction feature vector by performing dimension reduction processing on the feature vector of the voice sample; determines the context feature of the language of the voice sample according to the dimension-reduction feature vector; and redefines the positive and negative examples of the voice samples, then predicts the positive and negative examples included in each frame of voice samples according to the context feature. The method thus uses contextual contrastive predictive coding to extract language features and represents the language features by the mean of the feature vectors of the voice samples, which dilutes features unrelated to the language. The errors of the predicted positive and negative examples are then calculated through the loss function of the preset feature extraction model, and the model parameters of the language feature extraction model are updated according to the errors. This improves the efficiency and accuracy of training the language feature extraction model.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a training device for language feature extraction model according to an embodiment of the present application, where the training device for language feature extraction model is used for executing the training method for language feature extraction model shown in fig. 1. The language feature extraction model training device can be electronic equipment such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, wearable equipment and the like.
As shown in fig. 4, the language feature extraction model training apparatus 400 includes:
the processing module 401 is configured to perform dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of the preset language feature extraction model to obtain a dimension reduction feature vector;
an acquisition module 402, configured to input the dimension-reduction feature vector into a time sequence model and acquire the above feature and the following feature;
an obtaining module 403, configured to combine the above feature and the following feature to obtain a context feature;
a prediction module 404, configured to redefine a positive example and a negative example of a speech sample, and predict each speech sample as a positive example or a negative example according to the contextual feature, where a feature vector of the positive example is an average value of feature vectors of all speech samples that are the same as the language of the speech sample, and a feature vector of the negative example is an average value of feature vectors of all speech samples that are different from the language of the speech sample;
and the updating module 405 is configured to determine errors of the predicted positive example and the predicted negative example according to a loss function of the preset feature extraction model, and update model parameters of the language feature extraction model according to the errors.
In one embodiment, the predetermined language feature extraction model includes a feature encoder, and the processing module 401 includes:
the obtaining unit is used for inputting the voice samples into the feature encoder to obtain the feature vectors corresponding to each frame of voice samples;
and the processing unit is used for respectively carrying out dimension reduction processing on each feature vector through the dimension reduction layer to obtain the dimension reduction feature vector corresponding to each frame of voice sample.
In an embodiment, the timing model includes an autoregressive model and a reverse autoregressive model; an acquisition module 402, comprising:
the first acquisition unit is used for inputting the dimension-reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to acquire the above feature;
and the second acquisition unit is used for inputting the dimension-reduction feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame into the reverse autoregressive model to acquire the following feature.
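The split between the two models can be sketched as below. The exponential-average function `toy_autoregressive` is only a stand-in for a trained autoregressive model, and combining the two features by concatenation is an assumption; the names `context_feature` and `frames` are likewise hypothetical.

```python
def toy_autoregressive(frames, decay=0.5):
    """Stand-in for an autoregressive model: folds frames, in order, into one state vector."""
    state = [0.0] * len(frames[0])
    for frame in frames:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, frame)]
    return state


def context_feature(frames, t):
    """Split the sequence at frame t (0 < t < len(frames)) and summarise both sides."""
    above = toy_autoregressive(frames[:t])             # frames 1..t, in forward order
    following = toy_autoregressive(frames[t:][::-1])   # last frame back to frame t+1
    return above + following                           # combined by concatenation (assumed)


# Four frames of 2-dimensional dimension-reduction feature vectors (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
print(context_feature(frames, t=2))  # [0.25, 0.5, 2.0, 1.0]
```

The reverse model reads its frames from the end of the utterance towards frame t+1, so the frame closest to the split point is processed last on both sides.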
In an embodiment, the redefining the positive and negative examples of the speech samples includes:
determining a target language of the voice sample;
defining the voice samples with the same languages as the target language in each batch of voice samples as positive examples;
defining, in each batch of voice samples, the voice samples whose languages are different from the target language as negative examples.
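A minimal sketch of this redefinition, combined with the mean-vector representation of positive and negative examples described for the prediction module, might look as follows. The function names, the two-dimensional feature values, and the language codes are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def positive_negative_vectors(features, languages, target):
    """Mean feature vector of same-language samples and of different-language samples."""
    same = [f for f, lang in zip(features, languages) if lang == target]
    different = [f for f, lang in zip(features, languages) if lang != target]
    if not same or not different:
        return None  # the batch lacks positives or negatives for this target language
    return mean_vector(same), mean_vector(different)


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 0.0]]
languages = ["zh", "en", "zh", "fr"]
positive, negative = positive_negative_vectors(features, languages, "zh")
print(positive, negative)  # [3.0, 4.0] [5.0, 2.0]
```

Averaging over all samples of a language is what dilutes speaker- or content-specific variation in the sketch, leaving a vector that mostly reflects the language itself.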
In an embodiment, predicting the positive examples and the negative examples included in each voice sample according to the context features includes:
calculating an inner product of the context feature and a dimension-reduction feature vector of each frame of voice sample;
and predicting each frame of voice sample as a positive example or a negative example according to the inner product result obtained by calculation and the preset correlation degree.
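The two prediction steps can be illustrated with the sketch below, which assumes that the preset correlation degree acts as a threshold on the inner product. The names `inner_product`, `predict_frames`, and `threshold`, and all numeric values, are hypothetical.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict_frames(context, frame_vectors, threshold):
    """Label a frame positive when its inner product with the context reaches the threshold."""
    return [
        "positive" if inner_product(context, frame) >= threshold else "negative"
        for frame in frame_vectors
    ]


context = [1.0, 0.5]
frame_vectors = [[2.0, 2.0], [-1.0, 0.0], [0.5, 1.0]]
print(predict_frames(context, frame_vectors, threshold=1.0))
# ['positive', 'negative', 'positive']
```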
In an embodiment, the loss function of the preset feature extraction model includes an anti-noise loss function, and determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes:
and fitting the predicted positive examples and negative examples with the actual positive examples and negative examples through the anti-noise loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
In an embodiment, the updating the model parameters of the preset language feature extraction model according to the error includes:
and updating model parameters of the preset language feature extraction model through back propagation according to the error.
It should be noted that, for convenience and brevity of description, specific working processes of the terminal and each module described above may refer to corresponding processes in the embodiment of the language feature extraction model training method described in fig. 1, and are not described herein again.
The language feature extraction model training method described above may be implemented in the form of a computer program that can be run on the apparatus as shown in fig. 4.
Referring to fig. 5, fig. 5 is a schematic block diagram of a language feature extraction model training apparatus according to an embodiment of the present application. The language feature extraction model training apparatus includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any one of the language feature extraction model training methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the language feature extraction model training methods.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the terminal to which the present application is applied, and that a particular terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
performing dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of a preset language feature extraction model to obtain a dimension reduction feature vector;
inputting the dimension-reducing feature vector into a time sequence model to obtain the above feature and the following feature;
combining the above feature and the following feature to obtain a context feature;
redefining positive examples and negative examples of the voice samples, and predicting each voice sample as a positive example or a negative example according to the contextual characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples which are the same as the language of the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples which are different from the language of the voice sample;
and determining errors of the predicted positive examples and the predicted negative examples through the loss function of the preset feature extraction model, and updating model parameters of the preset language feature extraction model according to the errors.
In an embodiment, the preset language feature extraction model includes a feature encoder, and the dimension reduction processing is performed on feature vectors of the voice samples at a dimension reduction layer of the preset language feature extraction model to obtain dimension reduction feature vectors, including:
inputting the voice samples into the feature encoder to obtain the feature vectors corresponding to each frame of voice samples;
and respectively carrying out dimension reduction processing on each feature vector through the dimension reduction layer to obtain the dimension reduction feature vector corresponding to each frame of voice sample.
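As a rough sketch of per-frame dimension reduction, the code below applies a linear projection to each frame's feature vector. Treating the dimension reduction layer as a linear layer is an assumption, and the weights, bias, and encoder outputs are made-up values; `reduce_dimension` is a hypothetical helper.

```python
def reduce_dimension(vector, weights, bias):
    """Project a vector to len(weights) dimensions: y = W x + b."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical 4-dimensional encoder outputs for three frames.
frame_features = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
]
weights = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # 2 x 4 projection matrix
bias = [0.0, 1.0]

reduced = [reduce_dimension(f, weights, bias) for f in frame_features]
print(reduced)  # [[1.5, 4.5], [0.5, 1.5], [1.0, 2.0]]
```

Each frame is reduced independently, so the output keeps one vector per frame while the vector length drops from 4 to 2.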
In an embodiment, the timing model includes an autoregressive model and a reverse autoregressive model; the step of inputting the dimension reduction feature vector into a time sequence model to obtain the above feature and the following feature comprises the following steps:
inputting the dimension-reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature;
and inputting the dimension-reduction feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame into the reverse autoregressive model to obtain the following feature.
In an embodiment, the redefining the positive and negative examples of the speech samples includes:
determining a target language of the voice sample;
defining the voice samples with the same languages as the target language in each batch of voice samples as positive examples;
defining, in each batch of voice samples, the voice samples whose languages are different from the target language as negative examples.
In an embodiment, predicting the positive examples and the negative examples included in each voice sample according to the context features includes:
calculating an inner product of the context feature and a dimension-reduction feature vector of each frame of voice sample;
and predicting each frame of voice sample as a positive example or a negative example according to the inner product result obtained by calculation and the preset correlation degree.
In an embodiment, the loss function of the preset feature extraction model includes an anti-noise loss function, and determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes:
and fitting the predicted positive examples and negative examples with the actual positive examples and negative examples through the anti-noise loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
In an embodiment, the updating the model parameters of the preset language feature extraction model according to the error includes:
and updating model parameters of the preset language feature extraction model through back propagation according to the error.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and the processor executes the program instructions to realize the language feature extraction model training method provided by the embodiment shown in fig. 1 of the application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various equivalent changes and substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (6)
1. A language feature extraction model training method is characterized by comprising the following steps:
performing dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of a preset language feature extraction model to obtain a dimension reduction feature vector;
inputting the dimension-reduction feature vector into a time sequence model to obtain the above feature and the following feature; wherein the time sequence model includes an autoregressive model and a reverse autoregressive model; the inputting the dimension-reduction feature vector into a time sequence model to obtain the above feature and the following feature comprises: inputting the dimension-reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature; inputting the dimension-reduction feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame into the reverse autoregressive model to obtain the following feature; combining the above feature and the following feature to obtain a context feature;
redefining positive examples and negative examples of the voice samples, and predicting each voice sample as a positive example or a negative example according to the contextual characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples which are the same as the language of the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples which are different from the language of the voice sample; the redefining of the positive examples and the negative examples of the voice samples comprises: determining a target language of the voice sample; defining the voice samples with the same languages as the target language in each batch of voice samples as positive examples; defining voice samples with languages different from the target language in each batch of voice samples as counterexamples; the predicting, according to the context feature, positive examples and negative examples included in each voice sample includes: calculating an inner product of the context feature and a dimension-reduction feature vector of each frame of voice sample; predicting each frame of voice sample as a positive example or a negative example according to the inner product result obtained by calculation and a preset correlation degree;
determining errors of the predicted positive examples and the predicted negative examples through a loss function of a preset feature extraction model, and updating model parameters of the language feature extraction model according to the errors; the loss function of the preset feature extraction model includes an anti-noise loss function, and the determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes: and fitting the predicted positive examples and negative examples with the actual positive examples and negative examples through the anti-noise loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
2. The training method of language feature extraction model according to claim 1, wherein the predetermined language feature extraction model includes a feature encoder, the dimension reduction processing is performed on feature vectors of the voice sample in a dimension reduction layer of the predetermined language feature extraction model to obtain dimension-reduced feature vectors, and the method includes:
inputting the voice samples into the feature encoder to obtain the feature vectors corresponding to each frame of voice samples;
and respectively carrying out dimension reduction processing on each feature vector through the dimension reduction layer to obtain the dimension reduction feature vector corresponding to each frame of voice sample.
3. The language feature extraction model training method of claim 1, wherein updating model parameters of the predetermined language feature extraction model according to the error comprises:
and updating model parameters of the preset language feature extraction model through back propagation according to the error.
4. A language feature extraction model training apparatus, characterized by comprising:
the processing module is used for carrying out dimension reduction processing on the feature vector of the voice sample in a dimension reduction layer of the preset language feature extraction model to obtain a dimension reduction feature vector;
the acquisition module is used for inputting the dimension-reduction feature vector into a time sequence model to acquire the above feature and the following feature; wherein the time sequence model includes an autoregressive model and a reverse autoregressive model; the inputting the dimension-reduction feature vector into a time sequence model to obtain the above feature and the following feature comprises: inputting the dimension-reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature; inputting the dimension-reduction feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame into the reverse autoregressive model to obtain the following feature;
the obtaining module is used for combining the above feature and the following feature to obtain the context feature;
the prediction module is used for redefining positive examples and negative examples of the voice samples, and predicting each voice sample to be the positive example or the negative example according to the contextual characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all voice samples which are the same as the language of the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all voice samples which are different from the language of the voice sample; the redefining of the positive examples and the negative examples of the voice samples comprises: determining a target language of the voice sample; defining the voice samples with the same languages as the target language in each batch of voice samples as positive examples; defining voice samples with languages different from the target language in each batch of voice samples as counterexamples; the predicting, according to the context feature, positive examples and negative examples included in each voice sample includes: calculating an inner product of the context feature and a dimension-reduction feature vector of each frame of voice sample; predicting each frame of voice sample as a positive example or a negative example according to the inner product result obtained by calculation and a preset correlation degree;
the updating module is used for determining errors of the predicted positive examples and the predicted negative examples through a loss function of the preset feature extraction model, and updating model parameters of the language feature extraction model according to the errors; the loss function of the preset feature extraction model includes an anti-noise loss function, and the determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes: and fitting the predicted positive examples and negative examples with the actual positive examples and negative examples through the anti-noise loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
5. A language feature extraction model training apparatus, comprising:
a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the steps of the language feature extraction model training method according to any one of claims 1 to 3 when the computer program is executed.
6. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the steps of the language feature extraction model training method according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110467103.9A CN113160795B (en) | 2021-04-28 | 2021-04-28 | Language feature extraction model training method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160795A (en) | 2021-07-23 |
CN113160795B (en) | 2024-03-05 |
Family
ID=76871880
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110467103.9A Active CN113160795B (en) | 2021-04-28 | 2021-04-28 | Language feature extraction model training method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160795B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN113160795A (en) | 2021-07-23 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Kong et al. | On fast sampling of diffusion probabilistic models | |
CN111951805B (en) | Text data processing method and device | |
CN112699991A (en) | Method, electronic device, and computer-readable medium for accelerating information processing for neural network training | |
US20180158449A1 (en) | Method and device for waking up via speech based on artificial intelligence | |
CN110444203B (en) | Voice recognition method and device and electronic equipment | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN110288980A (en) | Audio recognition method, the training method of model, device, equipment and storage medium | |
CN111368993A (en) | Data processing method and related equipment | |
CN112509600A (en) | Model training method and device, voice conversion method and device and storage medium | |
CN113436620B (en) | Training method of voice recognition model, voice recognition method, device, medium and equipment | |
CN112509555B (en) | Dialect voice recognition method, device, medium and electronic equipment | |
CN112466314A (en) | Emotion voice data conversion method and device, computer equipment and storage medium | |
WO2023134067A1 (en) | Speech classification model training method and apparatus, device, and storage medium | |
CN112084752B (en) | Sentence marking method, device, equipment and storage medium based on natural language | |
WO2023065635A1 (en) | Named entity recognition method and apparatus, storage medium and terminal device | |
CN111339308B (en) | Training method and device of basic classification model and electronic equipment | |
WO2022257454A1 (en) | Speech synthesis method, apparatus and terminal, and storage medium | |
CN111653275A (en) | Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method | |
CN111027681A (en) | Time sequence data processing model training method, data processing device and storage medium | |
CN115081616A (en) | Data denoising method and related equipment | |
CN116684330A (en) | Traffic prediction method, device, equipment and storage medium based on artificial intelligence | |
CN113239702A (en) | Intention recognition method and device and electronic equipment | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN115687934A (en) | Intention recognition method and device, computer equipment and storage medium | |
CN113723077A (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||