CN113160795A - Language feature extraction model training method, device, equipment and storage medium - Google Patents
Classifications
- G10L15/005: Speech recognition; Language recognition
- G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16: Speech classification or search using artificial neural networks
Abstract
The application relates to the technical field of artificial intelligence and discloses a language feature extraction model training method, device, equipment and storage medium. The method comprises the following steps: performing dimensionality reduction processing on the feature vector of a voice sample at a dimensionality reduction layer to obtain a dimensionality reduction feature vector; determining a context feature according to the dimensionality reduction feature vector; redefining the positive examples and negative examples of the voice samples, and predicting whether each voice sample is a positive example or a negative example according to the context feature; calculating the error of the positive-example and negative-example predictions through the loss function of the preset feature extraction model; and updating the model parameters of the language feature extraction model according to the error. Contextual contrastive predictive coding is thereby used for extracting language features, the language features are represented by the feature-vector mean of the voice samples, features irrelevant to language are diluted, and the efficiency and accuracy of training the language feature extraction model are improved.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a language feature extraction model training method, device, equipment and storage medium.
Background
Contrastive predictive coding is a contrastive learning method in deep learning. Because it adopts a contrastive learning scheme, it can effectively discover the differences between samples, and it is therefore widely applied in many fields. For example, in the speech domain, the acoustic characteristics that follow the current sample can be predicted from the preceding information of the sample, which works well on speech-related tasks such as speaker identification and phoneme classification.
However, in the training process of existing language feature models, an effective contrastive learning mechanism has not been linked to the language discrimination task, so existing language feature models cannot apply contrastive predictive coding to language identification, and information irrelevant to language, such as speaking rate, loudness and speaker gender, is taken into account, which impairs the language identification effect.
Disclosure of Invention
The application provides a language feature extraction model training method, device, equipment and storage medium, in which contextual contrastive predictive coding is used for extracting language features, the language features are represented by the feature-vector mean of the voice samples, and features irrelevant to language are diluted during training of the language feature extraction model, so that the efficiency and accuracy of training the language feature extraction model can be improved.
In a first aspect, the present application provides a language feature extraction model training method, where the method includes:
performing dimensionality reduction processing on the feature vector of the voice sample at a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector;
inputting the dimensionality reduction feature vector into a time sequence model to obtain an above feature and a below feature;
combining the above feature and the below feature to obtain a context feature;
redefining a positive example and a negative example of the voice sample, and predicting each voice sample to be the positive example or the negative example according to the context features, wherein the feature vector of the positive example is the average value of the feature vectors of all the voice samples with the same language as the voice sample, and the feature vector of the negative example is the average value of the feature vectors of all the voice samples with different languages from the voice sample;
and determining errors of the predicted positive examples and negative examples through a preset loss function of the feature extraction model, and updating model parameters of the language feature extraction model according to the errors.
In a second aspect, the present application further provides a language feature extraction model training device, including:
the processing module is used for carrying out dimensionality reduction processing on the feature vector of the voice sample at a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector;
the acquisition module is used for inputting the dimensionality reduction feature vector into a time sequence model to acquire an above feature and a below feature;
an obtaining module, configured to combine the above feature and the below feature to obtain a context feature;
the prediction module is used for redefining the positive examples and the negative examples of the voice samples and predicting each voice sample to be a positive example or a negative example according to the context characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all the voice samples with the same language as the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all the voice samples with different languages from the voice sample;
and the updating module is used for determining the predicted errors of the positive examples and the negative examples through a preset loss function of the feature extraction model, and updating the model parameters of the language feature extraction model according to the errors.
In a third aspect, the present application further provides a language feature extraction model training device, including:
a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the steps of the language feature extraction model training method according to the first aspect when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the processor implements the steps of the language feature extraction model training method according to the first aspect.
The application discloses a language feature extraction model training method, device, equipment and storage medium. Firstly, a dimensionality reduction feature vector is obtained by performing dimensionality reduction processing on the feature vector of a voice sample; the context feature of the language of the voice sample is determined according to the dimensionality reduction feature vector; the positive examples and negative examples of the voice samples are redefined, and whether each frame of voice sample is a positive example or a negative example is predicted according to the context feature. Contextual contrastive predictive coding is used for extracting the language features, the language features are represented by the feature-vector mean of the voice samples, and features irrelevant to language are diluted; the error of the positive-example and negative-example predictions is calculated through the loss function of the preset feature extraction model, and the model parameters of the preset language feature extraction model are updated according to the error. The efficiency and accuracy of training the language feature extraction model are thereby improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a language feature extraction model training method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a feature encoder provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language feature extraction model provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a language feature extraction model training apparatus according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a structure of a language feature extraction model training device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a language feature extraction model training method, device, equipment and storage medium. The language feature extraction model training method provided by the embodiment of the application performs dimensionality reduction processing on the feature vector of a voice sample to obtain a dimensionality reduction feature vector; determines the context feature of the language of the voice sample according to the dimensionality reduction feature vector; and redefines the positive examples and negative examples of the voice samples, predicting whether each frame of voice sample is a positive example or a negative example according to the context feature. Contextual contrastive predictive coding is used for extracting language features, the language features are represented by the feature-vector mean of the voice samples, features irrelevant to language are diluted, the error of the positive-example and negative-example predictions is calculated through the loss function of the preset feature extraction model, and the model parameters of the preset language feature extraction model are updated according to the error. The efficiency and accuracy of training the language feature extraction model are improved.
For example, the language feature extraction model training method provided by the embodiment of the present application may be applied to a terminal or a server: contextual contrastive predictive coding is used for extracting the language features, and the language features are characterized by the feature-vector mean of the voice samples, so as to dilute features irrelevant to language and thereby improve the efficiency and accuracy of training the language feature extraction model.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart of a language feature extraction model training method according to an embodiment of the present application. The language feature extraction model training method can be realized by a terminal or a server, wherein the terminal can be a handheld terminal, a notebook computer, wearable equipment or a robot and the like; the server may be a single server or a cluster of servers.
As shown in fig. 1, the language feature extraction model training method provided in this embodiment specifically includes: step S101 to step S105. The details are as follows:
s101, performing dimensionality reduction processing on the feature vector of the voice sample in a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector.
The preset language feature extraction model comprises a feature encoder. The feature encoder may be a neural network comprising a preset number of convolution layers, and is configured to sample an input voice sample to obtain the feature vectors of the voice sample. Specifically, the input voice sample may be an audio file whose frames consist of Pulse Code Modulation (PCM) samples; the audio file is input into a neural network with a preset number of convolution layers (assume five layers), and after sampling by the neural network, the feature vector corresponding to each frame of the voice sample (each frame comprising 160 PCM samples) is output.
In one embodiment, performing dimension reduction processing on a feature vector of a speech sample at a dimension reduction layer of a preset language feature extraction model to obtain a dimension reduction feature vector, includes: inputting the voice samples into a feature encoder to obtain feature vectors corresponding to each frame of voice samples; and respectively carrying out dimensionality reduction processing on each feature vector through a dimensionality reduction layer to obtain the dimensionality reduction feature vector corresponding to each frame of voice sample. For example, assuming that each frame of voice sample is a 512-dimensional feature vector, in this embodiment, the 512-dimensional feature vector is subjected to feature dimension transformation by a dimension reduction layer, and is mapped into a 40-dimensional feature vector through a 512 × 40 linear transformation. The 40-dimensional feature vector is a dimension-reduced feature vector corresponding to the speech sample of the current frame. Dimension conversion is carried out on the feature vectors through the dimension reduction layer, so that the parameter quantity of a subsequent time sequence model can be reduced, feature comparison is facilitated, and meanwhile more compact feature expression can be obtained.
Exemplarily, as shown in fig. 2, fig. 2 is a schematic structural diagram of a feature encoder provided in an embodiment of the present application. As can be seen from fig. 2, in the present embodiment, the feature encoder 200 inputs the speech samples 201 in units of frames, and outputs the feature vectors 202 corresponding to the speech samples of each frame. It should be noted that the feature encoder 200 shown in fig. 2 is a convolutional neural network including 5 convolutional layers, which does not limit the feature encoder 200, and the feature encoder 200 may be another type of neural network. After each frame of voice sample passes through the feature encoder 200, the feature vector 202 corresponding to each frame of voice sample is obtained, and in this embodiment, as can be seen from fig. 2, the feature vector 202 corresponding to each frame of voice sample is a 512-dimensional feature vector.
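Illustratively, the feature encoder and the dimensionality reduction layer described above can be sketched in PyTorch as follows. This is a minimal sketch, not the patent's exact configuration: the kernel sizes, strides and activation are assumptions chosen so that the five convolution layers cover 160 PCM samples per output frame, matching the description above.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Minimal sketch of the five-layer convolutional feature encoder."""
    def __init__(self, out_dim=512):
        super().__init__()
        # Five 1-D conv layers; cumulative stride 5*4*2*2*2 = 160, so each
        # output step corresponds to one 160-PCM-sample frame (an assumption
        # consistent with the text, not the patent's stated configuration).
        specs = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]  # (kernel, stride)
        layers, in_ch = [], 1
        for k, s in specs:
            layers += [nn.Conv1d(in_ch, out_dim, kernel_size=k, stride=s), nn.ReLU()]
            in_ch = out_dim
        self.net = nn.Sequential(*layers)

    def forward(self, pcm):              # pcm: (batch, 1, num_samples)
        z = self.net(pcm)                # (batch, 512, num_frames)
        return z.transpose(1, 2)         # (batch, num_frames, 512)

dim_reduce = nn.Linear(512, 40)          # the 512 x 40 linear transformation

pcm = torch.randn(2, 1, 16000)           # two one-second PCM waveforms
frames = FeatureEncoder()(pcm)           # per-frame 512-dim feature vectors
reduced = dim_reduce(frames)             # per-frame 40-dim dimensionality reduction feature vectors
```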
S102, inputting the dimensionality reduction feature vector into a time sequence model, and acquiring the above feature and the below feature.
The time sequence model comprises an autoregressive model and an inverse autoregressive model. For example, the autoregressive model comprises a gated recurrent unit, and the inverse autoregressive model comprises a reverse gated recurrent unit. The dimensionality reduction feature vectors corresponding to the first t frames of voice samples are analyzed by the gated recurrent unit, and the information obtained after the gated recurrent unit encodes and decodes these t frames of voice samples is taken as the above feature; the dimensionality reduction feature vectors corresponding to the voice samples from the last frame back to the (t+1)-th frame are analyzed by the reverse gated recurrent unit, and the information obtained after the reverse gated recurrent unit encodes and decodes the (t+1)-th frame of voice sample is taken as the below feature.
In one embodiment, inputting the dimensionality reduction feature vector into the time sequence model and acquiring the above feature and the below feature comprises the following steps: inputting the dimensionality reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature; and inputting the dimensionality reduction feature vectors corresponding to the voice samples from the last frame to the (t+1)-th frame into the inverse autoregressive model to obtain the below feature.
The autoregressive model comprises an encoder-decoder: when the input data and the learning target are both sequences of variable length, two coupled context-based RNNs are used as the encoder and the decoder respectively, as in the seq2seq RNN framework used in language modeling. During operation, the encoder processes the input (in this embodiment, the dimensionality reduction feature vectors of the voice samples) and outputs the encoded vector to the decoder, which generates a new sequence from the encoder's output. In this embodiment, by acquiring the above feature and the below feature through the two time sequence models respectively, contextual contrastive predictive coding is used for extracting the language features.
S103, combining the above feature and the below feature to obtain the context feature.
In an embodiment, the above feature and the following feature are combined, and specifically, the last feature of the above feature and the first feature of the following feature may be spliced together to obtain the context feature. For example, the above feature is a 128-dimensional feature vector, the below feature is also a 128-dimensional feature vector, and after the above feature and the below feature are combined, the obtained context feature is a 256-dimensional feature vector.
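Illustratively, the time sequence model and the combination of the above feature and the below feature can be sketched as follows; the single-layer GRUs with hidden size 128 are assumptions matching the 128 + 128 = 256 example above.

```python
import torch
import torch.nn as nn

# Forward GRU as the autoregressive model, reverse GRU as the inverse
# autoregressive model (single layers with hidden size 128 are assumptions).
gru_above = nn.GRU(input_size=40, hidden_size=128, batch_first=True)
gru_below = nn.GRU(input_size=40, hidden_size=128, batch_first=True)

reduced = torch.randn(2, 98, 40)     # dimensionality reduction feature vectors
t = 50                               # split point between above and below

above_seq, _ = gru_above(reduced[:, :t])                    # first t frames
below_seq, _ = gru_below(torch.flip(reduced[:, t:], [1]))   # last frame back to frame t+1

above = above_seq[:, -1]             # last above feature, shape (2, 128)
below = below_seq[:, -1]             # below feature for frame t+1, shape (2, 128)
context = torch.cat([above, below], dim=-1)   # 256-dim context feature
```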
S104, redefining a positive example and a negative example of the voice sample, and predicting each voice sample to be the positive example or the negative example according to the context features, wherein the feature vector of the positive example is the average value of the feature vectors of all the voice samples with the same language as the voice sample, and the feature vector of the negative example is the average value of the feature vectors of all the voice samples with different languages from the voice sample.
In one embodiment, redefining the positive examples and negative examples of the voice samples includes: determining the target language of the voice samples; defining the voice samples whose language is the same as the target language in each batch of voice samples as positive examples; and defining the voice samples whose language differs from the target language in each batch of voice samples as counterexamples. For example, assume that a batch (mini-batch) of voice samples contains 10 Chinese samples and 10 English samples. If the target language of the voice samples is determined to be Chinese, the Chinese samples are taken as positive examples and the English samples as counterexamples. Specifically, in the embodiment of the present application, the feature of each positive example sample is replaced with the feature mean of the language corresponding to the positive example samples, so as to obtain the feature vector corresponding to the positive example. Similarly, the feature of each counterexample sample is replaced with the feature mean of the language corresponding to that counterexample sample, so as to obtain the feature vector corresponding to the counterexample. It should be noted that each language different from the target language constitutes one group of counterexamples, and all such groups constitute the counterexample set, which thus contains one group for each language that differs from the target language of the voice sample. In each group of counterexamples, the number of counterexample samples equals the number of voice samples of the language corresponding to that group, and the feature vector of each counterexample sample in the group is the average of the feature vectors of all voice samples of that language.
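Illustratively, the per-language feature-mean replacement that defines the positive example and the counterexample set within a mini-batch can be sketched as follows; the batch composition, feature dimension and variable names are illustrative assumptions.

```python
import torch

# Mini-batch of 20 voice-sample features: 10 Chinese (label 0) and
# 10 English (label 1); 40-dim features are an assumption.
feats = torch.randn(20, 40)
langs = torch.tensor([0] * 10 + [1] * 10)
target_lang = 0

# Replace each example's feature with the mean feature of its language.
lang_means = {int(l): feats[langs == l].mean(dim=0) for l in langs.unique()}

positive_feat = lang_means[target_lang]   # feature vector of the positive example
negative_feats = [m for l, m in lang_means.items() if l != target_lang]  # counterexample set
```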
In one embodiment, predicting whether each frame of voice sample is a positive example or a negative example according to the context feature comprises: calculating the inner product of the context feature and the dimensionality reduction feature vector of each frame of voice sample, and predicting each frame of voice sample as a positive example or a negative example according to the calculated inner product result and a preset correlation.
Specifically, the calculated inner product result is used as the correlation between the context feature and each frame of voice sample. If the inner product of the voice sample of the current frame with the context feature is greater than the preset correlation, the voice sample of the current frame is determined to be highly correlated with the context feature and is predicted to be a positive example; if the inner product of the voice sample of the current frame with the context feature is less than or equal to the preset correlation, the correlation between the voice sample of the current frame and the context feature is determined to be low, and the voice sample of the current frame is predicted to be a counterexample.
Before calculating the inner product of the context feature and the dimensionality reduction feature vector of each frame of voice sample, the context feature needs to be transformed, through a matrix transformation, into a vector with the same dimension as the dimensionality reduction feature vector. The process of matrix dimension transformation may follow the existing process of linear dimension transformation of vectors and is not detailed here.
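Illustratively, the projection of the context feature and the inner-product prediction can be sketched as follows; the 256-to-40 projection matches the dimensions quoted above, while the threshold value of the preset correlation is an assumption.

```python
import torch
import torch.nn as nn

project = nn.Linear(256, 40, bias=False)   # matrix transformation into the 40-dim space
context = torch.randn(2, 256)              # context features
frame_feats = torch.randn(2, 98, 40)       # per-frame dimensionality reduction feature vectors

ctx = project(context)                                  # (2, 40)
scores = torch.einsum('bd,btd->bt', ctx, frame_feats)   # inner product per frame

preset_correlation = 0.0                   # preset correlation threshold (assumption)
predicted_positive = scores > preset_correlation   # True: positive example; False: counterexample
```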
S105, determining the errors of the predicted positive examples and negative examples through the loss function of the preset feature extraction model, and updating the model parameters of the language feature extraction model according to the errors.
The loss function of the preset feature extraction model comprises a countering noise loss function. The purpose of this loss function is to fit the generated sample distribution to the true sample distribution as closely as possible; in the embodiment of the present application, it is used to fit the predicted distribution of positive and negative examples to the true distribution of positive and negative examples as closely as possible. Specifically, in this embodiment, the value of the loss function is represented by the divergence between the distribution of the predicted positive and negative examples and the distribution of the true positive and negative examples: the closer the value of the loss function is to 0, the closer the predicted distribution is to the true positive and negative examples and the smaller the prediction error; conversely, the closer the value of the loss function is to 1, the further the predicted distribution departs from the true positive and negative examples and the larger the prediction error.
Illustratively, the error of the predicted positive example and negative example is determined by a loss function of a preset feature extraction model, and the method comprises the following steps: and fitting the first distribution of the predicted positive examples and the predicted negative examples with the second distribution of the actual positive examples and the actual negative examples by using a noise loss resisting function to obtain errors of the predicted positive examples and the predicted negative examples.
Specifically, the countering noise loss function can be expressed as:

$$J^{(D)}(\theta_D,\theta_G) = -\frac{1}{2}\,\mathbb{E}_{X\sim p_{\mathrm{pos}}}\big[\log D(X)\big] - \frac{1}{2}\,\mathbb{E}_{X\sim p_{\mathrm{neg}}}\big[\log\big(1-D(X)\big)\big]$$

wherein J^{(D)}(θ_D, θ_G) represents the degree of fit of the first distribution of the predicted positive and negative examples to the second distribution of the actual positive and negative examples (also referred to as the divergence of the first distribution from the second distribution), and hence represents the error of the predicted positive and negative examples; θ_D represents the first distribution of the predicted positive and negative examples; θ_G represents the second distribution of the actual positive and negative examples; p_pos represents the distribution function of the predicted positive examples; p_neg represents the distribution function of the predicted negative examples; and D(X) is the discriminator of the language feature extraction model, which performs authenticity discrimination on the training sample X.
It should be noted that, in the process of determining the prediction errors of the positive and negative examples with the countering noise loss function, other samples in the same batch that carry the same label (language) as the positive or negative examples are considered neither negative examples nor positive examples and therefore do not participate in the calculation of the loss function; in addition, if a class of samples in the batch contains no positive example, that class of samples does not participate in the calculation of the loss function either, which effectively improves the calculation efficiency of the loss function.
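Illustratively, a minimal sketch of the countering noise loss is given below. Representing J^{(D)} as a two-term binary cross-entropy over sigmoid discriminator scores is an assumption consistent with the formula above; samples excluded from the calculation are simply not passed in.

```python
import torch
import torch.nn.functional as F

def countering_noise_loss(pos_scores, neg_scores):
    """Binary form of J(D): push discriminator outputs for positive examples
    toward 1 and for negative examples toward 0. Same-label others and
    classes without positives are excluded by not passing them in."""
    pos = torch.sigmoid(pos_scores)
    neg = torch.sigmoid(neg_scores)
    loss_pos = F.binary_cross_entropy(pos, torch.ones_like(pos))
    loss_neg = F.binary_cross_entropy(neg, torch.zeros_like(neg))
    return 0.5 * (loss_pos + loss_neg)   # matches the two-term form of J(D)

# Example: scores for 4 positive and 6 negative predictions.
loss = countering_noise_loss(torch.randn(4), torch.randn(6))
```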
In an embodiment, updating the model parameters of the preset language feature extraction model according to the error includes: and updating the model parameters of the language feature extraction model through back propagation according to the errors.
Specifically, the loss function value of the feature extraction model corresponds to the prediction errors of the positive and negative examples. After the prediction errors of the positive and negative examples are obtained, the error value is gradually reduced by a gradient descent algorithm, and in the process of gradually reducing the error value, the parameters of the language feature extraction model are continuously updated layer by layer from back to front until the error value reaches a minimum and stabilizes, thereby completing the update of the model parameters of the language feature extraction model.
The parameter updating process for the language feature extraction model is a process of optimizing the parameters of the discriminator D(X) of the language feature extraction model. In particular, the parameter θ_d of the discriminator D(X) is updated by the Adam gradient descent algorithm. Illustratively, updating the parameter θ_d by the Adam gradient descent algorithm can be expressed by the following formula:

$$\theta_d \leftarrow \theta_d - \eta\,\nabla_{\theta_d} J^{(D)}$$

wherein J^{(D)} represents the cost function of the discriminator D(X), whose value represents the error of the authenticity discrimination of the input samples, and η is the step size that Adam adapts from moment estimates of the gradient.

In the embodiment of the present application, the cost function of the discriminator D(X) is the loss function of the preset language feature extraction model, specifically the countering noise loss function. In this example, as the Adam gradient descent algorithm gradually reduces and stabilizes the value of J^{(D)}, the gradient ∇_{θ_d} J^{(D)} corresponding to the constantly changing value of J^{(D)} can be calculated according to the above formula, and the parameter θ_d is continuously updated on the basis of the calculated gradient.
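Illustratively, one Adam update step of the discriminator parameters θ_d can be sketched as follows, continuing the countering_noise_loss sketch above; the placeholder discriminator and learning rate are assumptions.

```python
import torch

# Placeholder discriminator D(X) and Adam optimizer (both assumptions).
discriminator = torch.nn.Linear(40, 1)
optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

features = torch.randn(10, 40)
pos_scores = discriminator(features[:4]).squeeze(-1)
neg_scores = discriminator(features[4:]).squeeze(-1)

loss = countering_noise_loss(pos_scores, neg_scores)   # J(D), the error value
optimizer.zero_grad()
loss.backward()        # back propagation of the error, layer by layer from back to front
optimizer.step()       # Adam gradient-descent update of theta_d
```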
Exemplarily, as shown in fig. 3, fig. 3 is a schematic structural diagram of a language feature extraction model provided in an embodiment of the present application. As can be seen from fig. 3, the language feature extraction model 300 includes a feature encoder 200 and a time series model 301. Specifically, the specific explanation of the feature encoder 200 and the timing model 301 can refer to the foregoing description of the embodiments of the present application, and will not be described herein again.
As can be seen from the above analysis, the language feature extraction model training method provided by the embodiment of the application obtains a dimensionality reduction feature vector by performing dimensionality reduction processing on the feature vector of a voice sample; determines the context feature of the language of the voice sample according to the dimensionality reduction feature vector; and redefines the positive examples and negative examples of the voice samples, predicting whether each frame of voice sample is a positive example or a negative example according to the context feature. Contextual contrastive predictive coding is used for extracting the language features, the language features are represented by the feature-vector mean of the voice samples, and features irrelevant to language are diluted; the errors of the positive-example and negative-example predictions are calculated through the loss function of the preset feature extraction model, and the model parameters of the language feature extraction model are updated according to the errors. The efficiency and accuracy of training the language feature extraction model are thereby improved.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a language feature extraction model training device according to an embodiment of the present application, where the language feature extraction model training device is used to execute the language feature extraction model training method shown in fig. 1. The language feature extraction model training device can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
As shown in fig. 4, the language feature extraction model training apparatus 400 includes:
the processing module 401 is configured to perform dimensionality reduction processing on the feature vector of the speech sample at a dimensionality reduction layer of the preset language feature extraction model to obtain a dimensionality reduction feature vector;
an obtaining module 402, configured to input the dimensionality reduction feature vector into a time sequence model and obtain an above feature and a below feature;
a obtaining module 403, configured to combine the above feature and the below feature to obtain a context feature;
a prediction module 404, configured to redefine a positive example and a negative example of a speech sample, and predict, according to the context feature, each speech sample as the positive example or the negative example, where a feature vector of the positive example is an average value of feature vectors of all speech samples in a same language as the speech sample, and a feature vector of the negative example is an average value of feature vectors of all speech samples in a different language from the speech sample;
and an updating module 405, configured to determine predicted errors of the positive examples and the negative examples through a preset loss function of the feature extraction model, and update the model parameters of the language feature extraction model according to the errors.
In an embodiment, the preset language feature extraction model includes a feature encoder, and the processing module 401 includes:
the obtaining unit is used for inputting the voice samples into the feature encoder and obtaining the feature vectors corresponding to the voice samples of each frame;
and the processing unit is used for respectively carrying out dimensionality reduction processing on each feature vector through the dimensionality reduction layer to obtain the dimensionality reduction feature vector corresponding to each frame of voice sample.
In one embodiment, the time sequence model includes an autoregressive model and an inverse autoregressive model; the obtaining module 402 comprises:
the first obtaining unit is used for inputting the dimensionality reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature;
and the second obtaining unit is used for inputting the dimensionality reduction feature vectors corresponding to the voice samples from the last frame to the (t+1)-th frame into the inverse autoregressive model to obtain the below feature.
In one embodiment, the redefining of the positive and negative examples of the speech sample includes:
determining a target language of a voice sample;
defining a voice sample with the same language as the target language in each batch of voice samples as a positive example;
and defining the voice sample with the language different from the target language in each batch of voice samples as a counterexample.
In an embodiment, the predicting the positive and negative examples included in each of the speech samples according to the context feature comprises:
calculating the inner product of the context feature and the dimensionality reduction feature vector of each frame of voice sample;
and predicting each frame of voice sample to be a positive example or a negative example according to the inner product result obtained by calculation and a preset correlation.
In one embodiment, the loss function of the preset feature extraction model includes a countering noise loss function, and the determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes:
and fitting the predicted positive examples and negative examples with actual positive examples and negative examples through the noise-resisting loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
In an embodiment, the updating the model parameter of the preset language feature extraction model according to the error includes:
and updating the model parameters of the preset language feature extraction model through back propagation according to the errors.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working processes of the terminal and each module described above may refer to the corresponding processes in the embodiment of the language feature extraction model training method described in fig. 1, and are not described herein again.
The above-described language feature extraction model training method may be implemented in the form of a computer program that can be run on a device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram illustrating a structure of a language feature extraction model training device according to an embodiment of the present application. The language feature extraction model training device comprises a processor, a memory and a network interface which are connected through a system bus, wherein the memory can comprise a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the language feature extraction model training methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of the computer program on the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any one of the language feature extraction model training methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation on the terminal to which the present application is applied, and that a particular terminal may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
performing dimensionality reduction processing on the feature vector of the voice sample at a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector;
inputting the dimensionality reduction feature vector into a time sequence model to obtain an above feature and a below feature;
combining the above feature and the below feature to obtain a context feature;
redefining a positive example and a negative example of the voice sample, and predicting each voice sample to be the positive example or the negative example according to the context features, wherein the feature vector of the positive example is the average value of the feature vectors of all the voice samples with the same language as the voice sample, and the feature vector of the negative example is the average value of the feature vectors of all the voice samples with different languages from the voice sample;
and determining errors of the predicted positive examples and negative examples through the loss function of the preset feature extraction model, and updating the model parameters of the preset language feature extraction model according to the errors.
In an embodiment, the preset language feature extraction model includes a feature encoder, and performing dimensionality reduction processing on a feature vector of a speech sample at a dimensionality reduction layer of the preset language feature extraction model to obtain a dimensionality reduction feature vector, including:
inputting the voice samples into the feature encoder to obtain the feature vectors corresponding to the voice samples of each frame;
and respectively carrying out dimensionality reduction processing on each feature vector through the dimensionality reduction layer to obtain the dimensionality reduction feature vector corresponding to each frame of voice sample.
In one embodiment, the time sequence model includes an autoregressive model and an inverse autoregressive model, and inputting the dimensionality reduction feature vector into the time sequence model to obtain the above feature and the below feature comprises the following steps:
inputting the dimensionality reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature;
and inputting the dimensionality reduction feature vectors corresponding to the voice samples from the last frame to the (t+1)-th frame into the inverse autoregressive model to obtain the below feature.
In one embodiment, the redefining of the positive and negative examples of the speech sample includes:
determining a target language of a voice sample;
defining a voice sample with the same language as the target language in each batch of voice samples as a positive example;
and defining the voice sample with the language different from the target language in each batch of voice samples as a counterexample.
In an embodiment, the predicting the positive and negative examples included in each of the speech samples according to the context feature comprises:
calculating the inner product of the context feature and the dimensionality reduction feature vector of each frame of voice sample;
and predicting each frame of voice sample to be a positive example or a negative example according to the inner product result obtained by calculation and a preset correlation.
In one embodiment, the loss function of the preset feature extraction model includes a countering noise loss function, and the determining the error of the predicted positive example and the predicted negative example through the loss function of the preset feature extraction model includes:
and fitting the predicted positive examples and negative examples with actual positive examples and negative examples through the noise-resisting loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
In an embodiment, the updating the model parameter of the preset language feature extraction model according to the error includes:
and updating the model parameters of the preset language feature extraction model through back propagation according to the errors.
In an embodiment of the present application, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and a processor executes the program instructions to implement the language feature extraction model training method provided in the embodiment shown in fig. 1 of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A language feature extraction model training method is characterized by comprising the following steps:
performing dimensionality reduction processing on the feature vector of the voice sample at a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector;
inputting the dimensionality reduction feature vector into a time sequence model to obtain an above feature and a below feature;
combining the above feature and the below feature to obtain a context feature;
redefining a positive example and a negative example of the voice sample, and predicting each voice sample to be the positive example or the negative example according to the context features, wherein the feature vector of the positive example is the average value of the feature vectors of all the voice samples with the same language as the voice sample, and the feature vector of the negative example is the average value of the feature vectors of all the voice samples with different languages from the voice sample;
and determining errors of the predicted positive examples and negative examples through a preset loss function of the feature extraction model, and updating model parameters of the language feature extraction model according to the errors.
2. The method according to claim 1, wherein the predetermined language feature extraction model comprises a feature encoder, and the performing dimensionality reduction on the feature vector of the speech sample at a dimensionality reduction layer of the predetermined language feature extraction model to obtain a dimensionality reduction feature vector comprises:
inputting the voice samples into the feature encoder to obtain the feature vectors corresponding to the voice samples of each frame;
and respectively carrying out dimensionality reduction processing on each feature vector through the dimensionality reduction layer to obtain the dimensionality reduction feature vector corresponding to each frame of voice sample.
3. The language feature extraction model training method according to claim 1 or 2, wherein said time sequence model comprises an autoregressive model and an inverse autoregressive model, and inputting the dimensionality reduction feature vector into the time sequence model to obtain the above feature and the below feature comprises the following steps:
inputting the dimensionality reduction feature vectors corresponding to the first t frames of voice samples into the autoregressive model to obtain the above feature;
and inputting the dimensionality reduction feature vectors corresponding to the voice samples from the last frame to the (t+1)-th frame into the inverse autoregressive model to obtain the below feature.
4. The method according to claim 3, wherein said redefining the positive and negative examples of speech samples comprises:
determining a target language of a voice sample;
defining a voice sample with the same language as the target language in each batch of voice samples as a positive example;
and defining the voice sample with the language different from the target language in each batch of voice samples as a counterexample.
5. The method for training a linguistic feature extraction model according to claim 4, wherein the predicting, according to the context feature, the positive and negative examples included in each of the speech samples comprises:
calculating the inner product of the context feature and the dimensionality reduction feature vector of each frame of voice sample;
and predicting each frame of voice sample to be a positive example or a negative example according to the inner product result obtained by calculation and a preset correlation.
6. The method according to claim 5, wherein said loss function of said predetermined feature extraction model comprises a counternoise loss function, and said determining errors of said positive and negative examples of prediction by said loss function of said predetermined feature extraction model comprises:
and fitting the predicted positive examples and negative examples with actual positive examples and negative examples through the noise-resisting loss function, and determining the errors of the predicted positive examples and negative examples through the fitting result.
7. The method for training the language feature extraction model according to claim 5 or 6, wherein the updating the model parameters of the preset language feature extraction model according to the error comprises:
and updating the model parameters of the preset language feature extraction model through back propagation according to the errors.
8. A language feature extraction model training device is characterized by comprising:
the processing module is used for carrying out dimensionality reduction processing on the feature vector of the voice sample at a dimensionality reduction layer of a preset language feature extraction model to obtain a dimensionality reduction feature vector;
the acquisition module is used for inputting the dimensionality reduction feature vector into a time sequence model to acquire an upper feature and a lower feature;
an obtaining module, configured to combine the above feature and the below feature to obtain a context feature;
the prediction module is used for redefining the positive examples and the negative examples of the voice samples and predicting each voice sample to be a positive example or a negative example according to the context characteristics, wherein the characteristic vector of the positive example is the average value of the characteristic vectors of all the voice samples with the same language as the voice sample, and the characteristic vector of the negative example is the average value of the characteristic vectors of all the voice samples with different languages from the voice sample;
and the updating module is used for determining the predicted errors of the positive examples and the negative examples through a preset loss function of the feature extraction model, and updating the model parameters of the language feature extraction model according to the errors.
9. A language feature extraction model training device is characterized by comprising:
a memory and a processor;
the memory is used for storing a computer program;
the processor, configured to execute the computer program and to implement the steps of the language feature extraction model training method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to carry out the steps of the language feature extraction model training method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110467103.9A CN113160795B (en) | 2021-04-28 | 2021-04-28 | Language feature extraction model training method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160795A (en) | 2021-07-23
CN113160795B CN113160795B (en) | 2024-03-05 |
Family
ID=76871880
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110467103.9A (Active; granted as CN113160795B) | 2021-04-28 | 2021-04-28 | Language feature extraction model training method, device, equipment and storage medium
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160795B (en) |
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140358547A1 (en) * | 2013-05-28 | 2014-12-04 | International Business Machines Corporation | Hybrid predictive model for enhancing prosodic expressiveness |
CN104538036A (en) * | 2015-01-20 | 2015-04-22 | 浙江大学 | Speaker recognition method based on semantic cell mixing model |
CN109344395A (en) * | 2018-08-30 | 2019-02-15 | 腾讯科技(深圳)有限公司 | Data processing method, device, server and storage medium |
CN111048062A (en) * | 2018-10-10 | 2020-04-21 | 华为技术有限公司 | Speech synthesis method and apparatus |
CN111210805A (en) * | 2018-11-05 | 2020-05-29 | 北京嘀嘀无限科技发展有限公司 | Language identification model training method and device and language identification method and device |
CN109684640A (en) * | 2018-12-26 | 2019-04-26 | 科大讯飞股份有限公司 | Semantic extraction method and device |
CN110263349A (en) * | 2019-03-08 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Corpus assessment models training method, device, storage medium and computer equipment |
CN111599344A (en) * | 2020-03-31 | 2020-08-28 | 因诺微科技(天津)有限公司 | Language identification method based on splicing characteristics |
CN111429887A (en) * | 2020-04-20 | 2020-07-17 | 合肥讯飞数码科技有限公司 | End-to-end-based speech keyword recognition method, device and equipment |
CN111640419A (en) * | 2020-05-26 | 2020-09-08 | 合肥讯飞数码科技有限公司 | Language identification method, system, electronic equipment and storage medium |
CN112489626A (en) * | 2020-11-18 | 2021-03-12 | 华为技术有限公司 | Information identification method and device and storage medium |
CN112489651A (en) * | 2020-11-30 | 2021-03-12 | 科大讯飞股份有限公司 | Voice recognition method, electronic device and storage device |
CN112634867A (en) * | 2020-12-11 | 2021-04-09 | 平安科技(深圳)有限公司 | Model training method, dialect recognition method, device, server and storage medium |
CN112561060A (en) * | 2020-12-15 | 2021-03-26 | 北京百度网讯科技有限公司 | Neural network training method and device, image recognition method and device and equipment |
CN112669841A (en) * | 2020-12-18 | 2021-04-16 | 平安科技(深圳)有限公司 | Training method and device for multilingual speech generation model and computer equipment |
CN112635050A (en) * | 2020-12-23 | 2021-04-09 | 安徽科大讯飞医疗信息技术有限公司 | Diagnosis recommendation method, electronic equipment and storage device |
Also Published As
Publication number | Publication date |
---|---|
CN113160795B (en) | 2024-03-05 |
Similar Documents
Publication | Publication Date | Title
---|---|---
Kong et al. | | On fast sampling of diffusion probabilistic models
CN111951805B (en) | | Text data processing method and device
US10332507B2 (en) | | Method and device for waking up via speech based on artificial intelligence
US10380236B1 (en) | | Machine learning system for annotating unstructured text
WO2019169719A1 (en) | | Automatic abstract extraction method and apparatus, and computer device and storage medium
CN110288980A (en) | | Speech recognition method, model training method, device, equipment and storage medium
CN112509555B (en) | | Dialect voice recognition method, device, medium and electronic equipment
CN112466314A (en) | | Emotion voice data conversion method and device, computer equipment and storage medium
CN112509600A (en) | | Model training method and device, voice conversion method and device and storage medium
CN110162766B (en) | | Word vector updating method and device
WO2023065635A1 (en) | | Named entity recognition method and apparatus, storage medium and terminal device
WO2023134067A1 (en) | | Speech classification model training method and apparatus, device, and storage medium
CN112084752B (en) | | Sentence marking method, device, equipment and storage medium based on natural language
WO2022257454A1 (en) | | Speech synthesis method, apparatus and terminal, and storage medium
CN111563161A (en) | | Sentence recognition method, sentence recognition device and intelligent equipment
CN116684330A (en) | | Traffic prediction method, device, equipment and storage medium based on artificial intelligence
CN113239702A (en) | | Intention recognition method and device and electronic equipment
CN114694255B (en) | | Sentence-level lip language recognition method based on channel attention and time convolution network
CN113223502A (en) | | Speech recognition system optimization method, device, equipment and readable storage medium
CN113011532A (en) | | Classification model training method and device, computing equipment and storage medium
Schwier et al. | | Zero knowledge hidden Markov model inference
CN115687934A (en) | | Intention recognition method and device, computer equipment and storage medium
CN116306612A (en) | | Word and sentence generation method and related equipment
CN113220828A (en) | | Intention recognition model processing method and device, computer equipment and storage medium
CN113160795B (en) | 2024-03-05 | Language feature extraction model training method, device, equipment and storage medium
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |