CN113327586B - Voice recognition method, device, electronic equipment and storage medium
- Publication number
- CN113327586B (application number CN202110610069.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio signal
- phoneme sequence
- training
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/26: Speech to text systems
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The application relates to a voice recognition method, which comprises the following steps: acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data; performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal; and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data. In addition, the application also provides a voice recognition device, an electronic device and a computer readable storage medium. The application can improve the accuracy of voice recognition.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, an electronic device, and a computer readable storage medium for voice recognition.
Background
In recent years, machine learning has developed rapidly, and speech recognition has made great breakthroughs in the context of deep learning. Although traditional speech recognition frameworks can achieve stable, industrialized recognition, with the introduction of deep learning, users in the era of intelligent big data are no longer satisfied with limited model precision and expect speech recognition to handle more complex data.
At present, speech recognition is usually implemented with a speech recognition model based on an attention mechanism, which places extremely high requirements on the quality of the speech to be recognized. In actual service scenarios, however, the speech to be recognized is produced in varied noise environments, such as accented or dialect speech and noisy or far-field scenes. Such data degrades the recognition capability of an attention-based model and therefore the accuracy of speech recognition.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides a voice recognition method, a device, an electronic apparatus, and a computer readable storage medium, which can improve the accuracy of voice recognition.
In a first aspect, the present application provides a speech recognition method, including:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the application, first, the spectral analysis of the audio data extracts the characteristic data of the audio data, which reduces the complexity of the audio data and improves the accuracy of subsequent analysis. Second, feature extraction and phoneme recognition on the mel-spectrogram of the audio data are performed by a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data is adopted, which strengthens the robustness of the audio recognition model against complex audio data and further improves the analysis accuracy. Therefore, compared with the prior art, the application enhances the model's resistance to interference in the audio data and improves the accuracy of voice recognition.
In a possible implementation manner of the first aspect, the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data includes:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
and performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
In a possible implementation manner of the first aspect, before the feature extraction of the mel-spectrogram by using a pre-trained audio recognition model, the method further includes:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
In a possible implementation manner of the first aspect, the inputting the model training data into the convolution module of the audio recognition model to output the second characteristic audio signal of the model training data includes:
performing convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to linearly adjust the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
In a possible implementation manner of the first aspect, the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal includes:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In a possible implementation manner of the first aspect, the calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence includes:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In a possible implementation manner of the first aspect, the text extraction of the phoneme sequence includes:
calculating a text generation probability from the phoneme sequence;
and identifying the text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
In a second aspect, the present application provides a speech recognition apparatus comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
In a third aspect, the present application provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method as described in any one of the first aspects above.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to any one of the first aspects.
The advantages of the second to fourth aspects may be found in the relevant description of the first aspect, and are not described here.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a detailed flowchart of a voice recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating one of the steps of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another step of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 4 is a detailed flowchart illustrating a further step of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a speech recognition device according to an embodiment of the present application;
fig. 6 is a schematic diagram of an internal structure of an electronic device for implementing a voice recognition method according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
A speech recognition method according to an embodiment of the present application is described with reference to a flowchart shown in fig. 1.
The method depicted in fig. 1 comprises the following steps:
s1, acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data.
In an embodiment of the present application, the audio data is digital sound data, including speech, music and other sounds, such as speech uttered by a user, music played on a piano, and other sounds produced by colliding objects.
As one embodiment of the present application, referring to fig. 2, the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data includes:
S20, preprocessing the audio data, and performing a short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
S21, performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
S22, performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
In one embodiment of the present application, the preprocessing of the audio data includes framing the audio data and windowing the framed audio data. Framing divides the audio data signal into individual frames, splitting a long audio signal into short audio signals, usually with 10-30 ms per frame; windowing eliminates the discontinuities at both ends of each framed signal.
In one embodiment of the present application, the short-time Fourier transform performs a Fourier transform on each short-time signal and transforms the audio data from the time domain to the frequency domain, so that the signal variation of the audio data can be analyzed. Optionally, the short-time Fourier transform is performed on the preprocessed audio data with the following formula:

F(ω) = ∫ f(t) e^(-jωt) dt

wherein F(ω) represents the spectrogram, f(t) represents the preprocessed audio data, and e represents the natural constant, an infinite non-repeating decimal.
In one embodiment of the present application, the mel-spectrum filtering masks the components of the spectrogram that fall outside a preset frequency range, yielding a spectrogram that matches the hearing characteristics of the human ear; the cepstrum analysis performs a second spectral analysis on the spectrogram of the audio data to extract its contour information and thereby obtain the characteristic data of the audio data. Optionally, the mel-spectrum filtering of the spectrogram is performed by a mel filter bank, the preset frequency range is 200 Hz-500 Hz, and the cepstrum analysis is realized by taking the logarithm of the mel-filtered spectrogram.
In one embodiment of the present application, the discrete transform compresses the initial mel cepstrum as an image, reducing its dimensionality and thereby speeding up subsequent image processing.
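To make the pipeline above concrete, the following is a minimal sketch of steps S20-S22 in Python. It is an illustration under stated assumptions, not the patented implementation: the sample rate, the 400-sample frames with Hann windowing, the 80-band mel filter bank and the 13 coefficients kept after the discrete cosine transform are illustrative choices, and librosa/scipy are used merely as convenient stand-ins.

```python
# Sketch of the spectral analysis: preprocessing (framing + windowing),
# short-time Fourier transform, mel-spectrum filtering, cepstrum analysis
# (logarithm) and a discrete (cosine) transform for dimension reduction.
import numpy as np
import librosa
from scipy.fftpack import dct

def mel_cepstrogram(path, sr=16000, n_fft=400, hop=160, n_mels=80, n_keep=13):
    y, sr = librosa.load(path, sr=sr)                 # acquire the audio data
    # Framing and Hann windowing happen inside the short-time Fourier transform
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                               window="hann")) ** 2   # spectrogram
    # Mel-spectrum filtering of the spectrogram with a mel filter bank
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    # Cepstrum analysis: take the logarithm of the mel-filtered spectrogram
    log_mel = np.log(mel + 1e-10)                     # initial mel cepstrum
    # Discrete transform (here a DCT) compresses and reduces the dimension
    return dct(log_mel, axis=0, norm="ortho")[:n_keep]  # (n_keep, num_frames)
```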
S2, performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal.
In the embodiment of the application, the audio recognition model comprises a convolution module and a phoneme recognition module, wherein the convolution module is used for extracting the characteristic audio signal from the mel-spectrogram, and the phoneme recognition module is used for recognizing the phoneme sequence of the extracted characteristic audio signal.
According to one embodiment of the application, the convolution module is built on a convolutional neural network and comprises a convolution layer, a linear rectification layer, a pooling layer and a fully connected layer. The convolution layer extracts different characteristic audio signals from the input mel-spectrogram, such as low-level signals corresponding to edges and line angles. The linear rectification layer, which uses linear rectification as the activation function, strengthens the decision function and the nonlinear characteristics of the whole neural network, thereby speeding up its training. The pooling layer reduces the dimensionality of the characteristic audio signals extracted by the convolution layer: it cuts the extracted features into several regions and takes their maximum or average values, obtaining new characteristic audio signals of smaller dimension. The fully connected layer combines all local features into global features and outputs the characteristic audio signal.
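As an illustration only, a convolution module with this layer layout could look as follows in PyTorch; the channel count, kernel size, pooling factor and 128-dimensional output are assumptions not fixed by the application.

```python
# Sketch of the convolution module: convolution layer -> linear
# rectification (ReLU) -> pooling (dimension reduction) -> fully
# connected layer that combines local features into global features.
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # feature extraction
        self.relu = nn.ReLU()                                   # linear rectification
        self.pool = nn.MaxPool2d(2)                             # dimension reduction
        self.fc = nn.LazyLinear(feat_dim)                       # global features

    def forward(self, mel):                   # mel: (batch, 1, n_mels, frames)
        x = self.pool(self.relu(self.conv(mel)))
        x = x.permute(0, 3, 1, 2).flatten(2)  # per-frame local features
        return self.fc(x)                     # characteristic audio signal per frame
```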
In one embodiment of the present application, the phoneme recognition module is built on a time-delay neural network and comprises an input layer, a hidden layer and an output layer. The input layer receives the characteristic audio signals transmitted by the convolution module; the hidden layer, by setting delays, learns the weights between the input characteristic audio signals so as to extract the phoneme sequence matching the characteristic audio signals; and the output layer outputs the phoneme sequence extracted by the hidden layer.
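A time-delay neural network of this shape is commonly realized with dilated 1-D convolutions over the frame axis, so the following sketch uses that formulation; the context offsets (the delays), hidden width and phoneme-set size are assumptions.

```python
# Sketch of the phoneme recognition module: an input layer, hidden
# layers whose dilated convolutions implement the time delays, and an
# output layer producing per-frame phoneme scores.
import torch.nn as nn

class TDNNPhonemeModule(nn.Module):
    def __init__(self, in_dim=128, hidden=256, num_phonemes=60):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden)             # input layer
        self.hid = nn.Sequential(                        # hidden layers with delays
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, num_phonemes)       # output layer

    def forward(self, feats):                # feats: (batch, frames, in_dim)
        x = self.inp(feats).transpose(1, 2)  # -> (batch, hidden, frames)
        x = self.hid(x).transpose(1, 2)      # weighted delayed context
        return self.out(x)                   # per-frame phoneme sequence scores
```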
Further, in an embodiment of the present application, before the feature extraction of the mel-spectrogram using the pre-trained audio recognition model, the method further includes training the audio recognition model.
Specifically, referring to fig. 3, the training of the audio recognition model includes:
S30, acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
S31, performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
s32, inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
s33, calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
s34, if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and S35, if the training loss meets the preset condition, obtaining a trained audio recognition model.
In an optional embodiment, the first characteristic audio signal serves as the ground-truth label for the characteristic audio signal produced in subsequent model training, and the first phoneme sequence serves as the ground-truth label for the phoneme sequence produced in subsequent model training; based on the first characteristic audio signal and the first phoneme sequence, the learning of the subsequent model can be supervised, improving the overall recognition capability of the model.
In an optional embodiment, the spectrum enhancement increases the number of training cepstra, enlarging the training data for subsequent model training and thereby improving the overall robustness of the model.
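The application does not prescribe a particular enhancement technique; one common choice that fits this description is frequency/time masking in the spirit of SpecAugment, sketched below with illustrative mask widths.

```python
# Hypothetical spectrum-enhancement step: mask a random band of
# cepstral coefficients and a random run of frames, producing an
# augmented copy used alongside the original training cepstrum.
import numpy as np

def spec_augment(cep, max_f=8, max_t=20, rng=None):
    rng = rng or np.random.default_rng()
    aug = cep.copy()                               # cep: (n_bands, num_frames)
    f = int(rng.integers(0, max_f + 1))            # frequency-mask width
    f0 = int(rng.integers(0, max(1, cep.shape[0] - f)))
    aug[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))            # time-mask width
    t0 = int(rng.integers(0, max(1, cep.shape[1] - t)))
    aug[:, t0:t0 + t] = 0.0
    return aug
```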
In an alternative embodiment, the inputting the model training data into the convolution module of the audio recognition model to output the second characteristic audio signal of the model training data includes: performing a convolution operation on the model training data with the convolution layer in the convolution module to obtain an initial characteristic audio signal, linearly adjusting the initial characteristic audio signal with the linear rectification layer in the convolution module, reducing the dimension of the linearly adjusted initial characteristic audio signal with the pooling layer in the convolution module, and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
In an alternative embodiment, the recognizing, by the phoneme recognition module of the audio recognition model, the second phoneme sequence of the second characteristic audio signal includes: receiving the second characteristic audio signal with the input layer in the phoneme recognition module and setting delay data of the second characteristic audio signal, extracting the phoneme sequence of the second characteristic audio signal with the hidden layer in the phoneme recognition module according to the delay data, and outputting the extracted phoneme sequence with the output layer in the phoneme recognition module to obtain the second phoneme sequence.
In an alternative embodiment, referring to fig. 4, the step S33 includes:
s40, calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
s41, calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
s42, calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In an alternative embodiment of the application, the first training loss is calculated according to the following formula:

LC = m_g log(m_p) + (1 - m_g) log(1 - m_p)

wherein LC represents the first training loss, m_g represents the first characteristic audio signal, and m_p represents the second characteristic audio signal.

In an alternative embodiment of the application, the second training loss is calculated according to the following formula:

L1 = |α_p - α_g|

wherein L1 represents the second training loss, α_g represents the first phoneme sequence, and α_p represents the second phoneme sequence.

In an alternative embodiment of the present application, the first training loss and the second training loss are added to obtain the training loss of the audio recognition model, i.e. L = L1 + LC.
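Translated directly into code, the two losses and their sum could be computed as below. This follows the formulas as stated (note that the familiar binary cross-entropy is the negative of LC as written); the epsilon guard and the averaging over elements are implementation assumptions.

```python
# Sketch of the training-loss computation: LC from the characteristic
# audio signals, L1 from the phoneme sequences, and L = L1 + LC.
import torch

def training_loss(m_g, m_p, alpha_g, alpha_p, eps=1e-8):
    # First training loss: LC = m_g*log(m_p) + (1 - m_g)*log(1 - m_p)
    lc = (m_g * torch.log(m_p + eps)
          + (1 - m_g) * torch.log(1 - m_p + eps)).mean()
    # Second training loss: L1 = |alpha_p - alpha_g|
    l1 = (alpha_p - alpha_g).abs().mean()
    return l1 + lc  # total training loss L
```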
In an alternative embodiment of the application, the preset condition includes the training loss being less than a loss threshold. That is, when the training loss is smaller than the loss threshold, the training loss satisfies the preset condition; when the training loss is greater than or equal to the loss threshold, it does not. The loss threshold may be set to 0.1, or set according to the actual scene. Further, the parameter adjustment of the audio recognition model may be implemented by a gradient descent algorithm, such as stochastic gradient descent.
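Putting the preset condition and the parameter adjustment together, a training loop consistent with steps S32-S35 might look as follows; the optimizer settings, the epoch cap and the compute_loss helper (a hypothetical stand-in for the forward pass plus the loss above) are assumptions.

```python
# Sketch of the training loop: adjust model parameters by stochastic
# gradient descent until the training loss falls below the loss threshold.
import torch

def train(model, loader, loss_threshold=0.1, lr=1e-3, max_epochs=100):
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    for _ in range(max_epochs):
        for batch in loader:
            loss = model.compute_loss(batch)  # hypothetical helper returning L
            opt.zero_grad()
            loss.backward()                   # parameter adjustment
            opt.step()
            if loss.item() < loss_threshold:  # preset condition satisfied
                return model                  # trained audio recognition model
    return model
```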
And S3, performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the embodiment of the application, the voice recognition result of the audio data is obtained by performing text extraction on the phoneme sequence. In one embodiment of the method, a language model is used to extract text from the phoneme sequence; the language model is an abstract mathematical model of language, built from objective linguistic facts, and is used to identify the text information relations corresponding to the phoneme sequence.
In detail, the text extraction of the phoneme sequence with the language model includes: calculating text generation probabilities for the phoneme sequences with the language model, identifying the text information relations between the phoneme sequences according to the text generation probabilities, and generating the corresponding text according to the text information relations. The text generation probability refers to the distribution probability of the text generated by a phoneme sequence, and the text information relation refers to the relation by which any two or more phoneme sequences can form a word.
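As a toy illustration of this step, the sketch below scores candidate words by their generation probability combined with a bigram relation to the previous word; the two probability tables are hypothetical stand-ins for a trained language model.

```python
# Hypothetical text extraction from phoneme sequences with a language
# model: pick, for each phoneme group, the word that maximizes the
# generation probability combined with the text-information relation.
def extract_text(phoneme_groups, gen_prob, relation_prob):
    # gen_prob[p]          -> {word: P(word | phoneme group p)}
    # relation_prob[(a,b)] -> P(word b follows word a)
    text, prev = [], None
    for p in phoneme_groups:
        candidates = gen_prob.get(p, {})
        if not candidates:
            continue
        best = max(candidates, key=lambda w: candidates[w]
                   * relation_prob.get((prev, w), 1e-6))
        text.append(best)
        prev = best
    return "".join(text)
```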
Based on the spectral analysis of the audio data, the application can extract the characteristic data of the audio data, reducing the complexity of the audio data and improving the accuracy of subsequent analysis. In addition, feature extraction and phoneme recognition on the mel-spectrogram of the audio data are performed by a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data is adopted, which strengthens the robustness of the audio recognition model against complex audio data and further improves the analysis accuracy.
Fig. 5 shows a functional block diagram of the speech recognition apparatus of the present application.
The speech recognition apparatus 500 of the present application may be installed in an electronic device. Depending on the implemented functions, the speech recognition apparatus may comprise a spectrum analysis module 501, a phoneme sequence recognition module 502 and a text extraction module 503. A module of the present application may also be referred to as a unit, meaning a series of computer program segments that are stored in the memory of the electronic device and can be executed by its processor to perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the spectrum analysis module 501 is configured to obtain audio data, perform spectrum analysis on the audio data, and generate a mel-spectrogram of the audio data;
the phoneme sequence recognition module 502 is configured to perform feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and to recognize a phoneme sequence of the characteristic audio signal;
the text extraction module 503 is configured to perform text extraction on the phoneme sequence and take the text extraction result as the recognition result of the audio data.
In detail, the modules in the voice recognition device 500 in the embodiment of the present application use the same technical means as the voice recognition method described in fig. 1 to 4 and can produce the same technical effects, which are not described herein.
Fig. 6 is a schematic structural diagram of an electronic device for implementing the voice recognition method according to the present application.
The electronic device 6 may comprise a processor 60, a memory 61 and a bus, and may further comprise a computer program, such as a speech recognition program 62, stored in the memory 61 and executable on the processor 60.
The memory 61 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 61 may in some embodiments be an internal storage unit of the electronic device 6, such as a hard disk of the electronic device 6. The memory 61 may also, in other embodiments, be an external storage device of the electronic device 6, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the electronic device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the electronic device 6. The memory 61 may be used not only for storing application software installed in the electronic device 6 and various types of data, such as the code of the voice recognition program 62, but also for temporarily storing data that has been or will be output.
The processor 60 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The processor 60 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 6 and processes data by running or executing programs or modules stored in the memory 61 (e.g., executing a voice recognition program 62, etc.), and calling data stored in the memory 61.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. The bus is arranged to enable the connection and communication between the memory 61 and the at least one processor 60, among others.
Fig. 6 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 6 is not limiting of the electronic device 6 and may include fewer or more components than shown, or some components in combination, or a different arrangement of components.
For example, although not shown, the electronic device 6 may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor 60 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 6 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described in detail herein.
Further, the electronic device 6 may also comprise a network interface, optionally comprising a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 6 and other electronic devices.
The electronic device 6 may optionally further comprise a user interface, which may be a display or an input unit such as a keyboard (Keyboard), and may use a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to, where appropriate, as a display screen or display unit, for displaying the information processed in the electronic device 6 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The speech recognition program 62 stored in the memory 61 of the electronic device 6 is a combination of multiple computer programs which, when run on the processor 60, can implement:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In particular, the specific implementation method of the processor 60 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the integrated modules/units of the electronic device 6 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present application also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
It should be noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A method of speech recognition, the method comprising:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data;
before the feature extraction is performed on the mel-spectrogram by using the pre-trained audio recognition model, the method further comprises:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
2. The method of claim 1, wherein the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data comprises:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
and performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
3. The method of claim 1, wherein the inputting the model training data into the convolution module of the audio recognition model to output a second characteristic audio signal of the model training data comprises:
performing convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to linearly adjust the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
4. The method of claim 1, wherein the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal comprises:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
5. The method of claim 1, wherein the calculating training loss of the audio recognition model based on the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence, and the second phoneme sequence comprises:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
6. The speech recognition method of any one of claims 1 to 5, wherein the text extraction of the phoneme sequence comprises:
calculating a text generation probability from the phoneme sequence;
and identifying the text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
7. A speech recognition apparatus, comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data;
the process for acquiring the trained audio recognition model comprises the following steps of: acquiring a training cepstrum graph and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking the training cepstrum after spectrum enhancement and the training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
8. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method according to any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the speech recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110610069.6A CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110610069.6A CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113327586A CN113327586A (en) | 2021-08-31 |
CN113327586B true CN113327586B (en) | 2023-11-28 |
Family
ID=77423260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110610069.6A Active CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327586B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763952B (en) * | 2021-09-03 | 2022-07-26 | 深圳市北科瑞声科技股份有限公司 | Dynamic voice recognition method and device, electronic equipment and storage medium |
CN113808577A (en) * | 2021-09-18 | 2021-12-17 | 平安银行股份有限公司 | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium |
CN114743554A (en) * | 2022-06-09 | 2022-07-12 | 武汉工商学院 | Intelligent household interaction method and device based on Internet of things |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20050014183A (en) * | 2003-07-30 | 2005-02-07 | 주식회사 팬택 | Method for modificating state |
CN101447185A (en) * | 2008-12-08 | 2009-06-03 | 深圳市北科瑞声科技有限公司 | Audio frequency rapid classification method based on content |
CN106486119A (en) * | 2016-10-20 | 2017-03-08 | 海信集团有限公司 | A kind of method and apparatus of identification voice messaging |
JP2019020598A (en) * | 2017-07-18 | 2019-02-07 | 国立研究開発法人情報通信研究機構 | Learning method of neural network |
KR20190110939A (en) * | 2018-03-21 | 2019-10-01 | 한국과학기술원 | Environment sound recognition method based on convolutional neural networks, and system thereof |
CN110827801A (en) * | 2020-01-09 | 2020-02-21 | 成都无糖信息技术有限公司 | Automatic voice recognition method and system based on artificial intelligence |
CN111261146A (en) * | 2020-01-16 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN111489745A (en) * | 2019-01-28 | 2020-08-04 | 上海菲碧文化传媒有限公司 | Chinese speech recognition system applied to artificial intelligence |
CN111681637A (en) * | 2020-04-28 | 2020-09-18 | 平安科技(深圳)有限公司 | Song synthesis method, device, equipment and storage medium |
CN111862962A (en) * | 2020-07-20 | 2020-10-30 | 汪秀英 | Voice recognition method and system |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN112133289A (en) * | 2020-11-24 | 2020-12-25 | 北京远鉴信息技术有限公司 | Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium |
CN112466279A (en) * | 2021-02-02 | 2021-03-09 | 深圳市阿卡索资讯股份有限公司 | Automatic correction method and device for spoken English pronunciation |
CN112735371A (en) * | 2020-12-28 | 2021-04-30 | 出门问问(苏州)信息科技有限公司 | Method and device for generating speaker video based on text information |
CN112767958A (en) * | 2021-02-26 | 2021-05-07 | 华南理工大学 | Zero-learning-based cross-language tone conversion system and method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8660842B2 (en) * | 2010-03-09 | 2014-02-25 | Honda Motor Co., Ltd. | Enhancing speech recognition using visual information |
US20190130896A1 (en) * | 2017-10-26 | 2019-05-02 | Salesforce.Com, Inc. | Regularization Techniques for End-To-End Speech Recognition |
US11107463B2 (en) * | 2018-08-01 | 2021-08-31 | Google Llc | Minimum word error rate training for attention-based sequence-to-sequence models |
US11170761B2 (en) * | 2018-12-04 | 2021-11-09 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
- 2021-06-01: CN application CN202110610069.6A, granted as patent CN113327586B (Active)
Also Published As
Publication number | Publication date |
---|---|
CN113327586A (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111179975B (en) | Voice endpoint detection method for emotion recognition, electronic device and storage medium | |
CN113327586B (en) | Voice recognition method, device, electronic equipment and storage medium | |
CN105976812B (en) | A kind of audio recognition method and its equipment | |
KR20130133858A (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
Muckenhirn et al. | Understanding and Visualizing Raw Waveform-Based CNNs. | |
CN113420556B (en) | Emotion recognition method, device, equipment and storage medium based on multi-mode signals | |
CN111243569B (en) | Emotional voice automatic generation method and device based on generation type confrontation network | |
CN108320734A (en) | Audio signal processing method and device, storage medium, electronic equipment | |
US20160099003A1 (en) | Digital watermark embedding device, digital watermark embedding method, and computer-readable recording medium | |
CN110738998A (en) | Voice-based personal credit evaluation method, device, terminal and storage medium | |
CN116665669A (en) | Voice interaction method and system based on artificial intelligence | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
CN113555003B (en) | Speech synthesis method, device, electronic equipment and storage medium | |
CN114842880A (en) | Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium | |
CN115104151A (en) | Offline voice recognition method and device, electronic equipment and readable storage medium | |
CN113436621B (en) | GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium | |
CN115240696B (en) | Speech recognition method and readable storage medium | |
Shah et al. | Speaker recognition for pashto speakers based on isolated digits recognition using accent and dialect approach | |
US20230081543A1 (en) | Method for synthetizing speech and electronic device | |
CN116543797A (en) | Emotion recognition method and device based on voice, electronic equipment and storage medium | |
CN113555026B (en) | Voice conversion method, device, electronic equipment and medium | |
CN113053409B (en) | Audio evaluation method and device | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
CN113808577A (en) | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |