CN113327586B - Voice recognition method, device, electronic equipment and storage medium
- Publication number
- CN113327586B (application number CN202110610069.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio signal
- phoneme sequence
- training
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/26: Speech to text systems
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The application relates to a voice recognition method, which comprises the following steps: acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data; performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal; and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data. In addition, the application also provides a voice recognition device, an electronic device and a computer readable storage medium. The application can improve the accuracy of voice recognition.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, an electronic device, and a computer readable storage medium for voice recognition.
Background
In recent years, machine learning has developed rapidly, and speech recognition has made great breakthroughs in the context of deep learning. Although traditional speech recognition frameworks can achieve stable, industrialized recognition, with the introduction of deep learning, users in the era of intelligent big data are no longer satisfied with limited model precision and expect speech recognition to handle more complex data.
At present, speech recognition is usually implemented with a speech recognition model based on an attention mechanism, which places extremely high requirements on the quality of the speech to be recognized. In actual service scenarios, however, the speech to be recognized is produced in varied noise environments, such as accented or dialect speech and noisy or far-field scenes. Such data degrades the recognition capability of an attention-based model and therefore the accuracy of speech recognition.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides a voice recognition method, a device, an electronic apparatus, and a computer readable storage medium, which can improve the accuracy of voice recognition.
In a first aspect, the present application provides a speech recognition method, including:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the application, first, the spectral analysis of the audio data extracts the characteristic data of the audio data, which reduces the complexity of the audio data and improves the accuracy of subsequent analysis. Second, feature extraction and phoneme recognition on the mel-spectrogram of the audio data are performed by a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data is adopted, which strengthens the robustness of the audio recognition model against complex audio data and further improves the analysis accuracy. Therefore, compared with the prior art, the application enhances the model's resistance to interference in the audio data and improves the accuracy of voice recognition.
In a possible implementation manner of the first aspect, the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data includes:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
and performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
In a possible implementation manner of the first aspect, before the feature extraction of the mel-spectrogram by using a pre-trained audio recognition model, the method further includes:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
In a possible implementation manner of the first aspect, the inputting the model training data into the convolution module of the audio recognition model to output the second characteristic audio signal of the model training data includes:
performing convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to linearly adjust the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
In a possible implementation manner of the first aspect, the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal includes:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In a possible implementation manner of the first aspect, the calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence includes:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In a possible implementation manner of the first aspect, the text extraction of the phoneme sequence includes:
calculating a text generation probability from the phoneme sequence;
and identifying the text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
In a second aspect, the present application provides a speech recognition apparatus comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
In a third aspect, the present application provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method as described in any one of the first aspects above.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to any one of the first aspects.
The advantages of the second to fourth aspects may be found in the relevant description of the first aspect, and are not described here.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a detailed flowchart of a voice recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating one of the steps of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another step of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 4 is a detailed flowchart illustrating a further step of the voice recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a speech recognition device according to an embodiment of the present application;
fig. 6 is a schematic diagram of an internal structure of an electronic device for implementing a voice recognition method according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
A speech recognition method according to an embodiment of the present application is described with reference to a flowchart shown in fig. 1.
The method depicted in fig. 1 comprises the following steps:
s1, acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data.
In an embodiment of the present application, the audio data is digital sound data, including speech, music and other sounds, such as speech uttered by a user, music played on a piano, and other sounds produced by colliding objects.
As one embodiment of the present application, referring to fig. 2, the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data includes:
S20, preprocessing the audio data, and performing a short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
S21, performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
S22, performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
In one embodiment of the present application, the preprocessing of the audio data includes framing the audio data and windowing the framed audio data. Framing divides the audio data signal into individual frames, splitting a long audio signal into short audio signals, usually with 10-30 ms per frame; windowing eliminates the discontinuities at both ends of each framed signal.
In one embodiment of the present application, the short-time Fourier transform performs a Fourier transform on each short-time signal and transforms the audio data from the time domain to the frequency domain, so that the signal variation of the audio data can be analyzed. Optionally, the short-time Fourier transform is performed on the preprocessed audio data with the following formula:

F(ω) = ∫ f(t) e^(-jωt) dt

wherein F(ω) represents the spectrogram, f(t) represents the preprocessed audio data, and e represents the natural constant, an infinite non-repeating decimal.
In one embodiment of the present application, the mel-spectrum filtering masks the components of the spectrogram that fall outside a preset frequency range, yielding a spectrogram that matches the hearing characteristics of the human ear; the cepstrum analysis performs a second spectral analysis on the spectrogram of the audio data to extract its contour information and thereby obtain the characteristic data of the audio data. Optionally, the mel-spectrum filtering of the spectrogram is performed by a mel filter bank, the preset frequency range is 200 Hz-500 Hz, and the cepstrum analysis is realized by taking the logarithm of the mel-filtered spectrogram.
In one embodiment of the present application, the discrete transform compresses the initial mel cepstrum as an image, reducing its dimensionality and thereby speeding up subsequent image processing.
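To make the pipeline above concrete, the following is a minimal sketch of steps S20-S22 in Python. It is an illustration under stated assumptions, not the patented implementation: the sample rate, the 400-sample frames with Hann windowing, the 80-band mel filter bank and the 13 coefficients kept after the discrete cosine transform are illustrative choices, and librosa/scipy are used merely as convenient stand-ins.

```python
# Sketch of the spectral analysis: preprocessing (framing + windowing),
# short-time Fourier transform, mel-spectrum filtering, cepstrum analysis
# (logarithm) and a discrete (cosine) transform for dimension reduction.
import numpy as np
import librosa
from scipy.fftpack import dct

def mel_cepstrogram(path, sr=16000, n_fft=400, hop=160, n_mels=80, n_keep=13):
    y, sr = librosa.load(path, sr=sr)                 # acquire the audio data
    # Framing and Hann windowing happen inside the short-time Fourier transform
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                               window="hann")) ** 2   # spectrogram
    # Mel-spectrum filtering of the spectrogram with a mel filter bank
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    # Cepstrum analysis: take the logarithm of the mel-filtered spectrogram
    log_mel = np.log(mel + 1e-10)                     # initial mel cepstrum
    # Discrete transform (here a DCT) compresses and reduces the dimension
    return dct(log_mel, axis=0, norm="ortho")[:n_keep]  # (n_keep, num_frames)
```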
S2, performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal.
In the embodiment of the application, the audio recognition model comprises a convolution module and a phoneme recognition module, wherein the convolution module is used for extracting the characteristic audio signal from the mel-spectrogram, and the phoneme recognition module is used for recognizing the phoneme sequence of the extracted characteristic audio signal.
According to one embodiment of the application, the convolution module is built on a convolutional neural network and comprises a convolution layer, a linear rectification layer, a pooling layer and a fully connected layer. The convolution layer extracts different characteristic audio signals from the input mel-spectrogram, such as low-level signals corresponding to edges and line angles. The linear rectification layer, which uses linear rectification as the activation function, strengthens the decision function and the nonlinear characteristics of the whole neural network, thereby speeding up its training. The pooling layer reduces the dimensionality of the characteristic audio signals extracted by the convolution layer: it cuts the extracted features into several regions and takes their maximum or average values, obtaining new characteristic audio signals of smaller dimension. The fully connected layer combines all local features into global features and outputs the characteristic audio signal.
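As an illustration only, a convolution module with this layer layout could look as follows in PyTorch; the channel count, kernel size, pooling factor and 128-dimensional output are assumptions not fixed by the application.

```python
# Sketch of the convolution module: convolution layer -> linear
# rectification (ReLU) -> pooling (dimension reduction) -> fully
# connected layer that combines local features into global features.
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # feature extraction
        self.relu = nn.ReLU()                                   # linear rectification
        self.pool = nn.MaxPool2d(2)                             # dimension reduction
        self.fc = nn.LazyLinear(feat_dim)                       # global features

    def forward(self, mel):                   # mel: (batch, 1, n_mels, frames)
        x = self.pool(self.relu(self.conv(mel)))
        x = x.permute(0, 3, 1, 2).flatten(2)  # per-frame local features
        return self.fc(x)                     # characteristic audio signal per frame
```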
In one embodiment of the present application, the phoneme recognition module is built on a time-delay neural network and comprises an input layer, a hidden layer and an output layer. The input layer receives the characteristic audio signals transmitted by the convolution module; the hidden layer, by setting delays, learns the weights between the input characteristic audio signals so as to extract the phoneme sequence matching the characteristic audio signals; and the output layer outputs the phoneme sequence extracted by the hidden layer.
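A time-delay neural network of this shape is commonly realized with dilated 1-D convolutions over the frame axis, so the following sketch uses that formulation; the context offsets (the delays), hidden width and phoneme-set size are assumptions.

```python
# Sketch of the phoneme recognition module: an input layer, hidden
# layers whose dilated convolutions implement the time delays, and an
# output layer producing per-frame phoneme scores.
import torch.nn as nn

class TDNNPhonemeModule(nn.Module):
    def __init__(self, in_dim=128, hidden=256, num_phonemes=60):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden)             # input layer
        self.hid = nn.Sequential(                        # hidden layers with delays
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, num_phonemes)       # output layer

    def forward(self, feats):                # feats: (batch, frames, in_dim)
        x = self.inp(feats).transpose(1, 2)  # -> (batch, hidden, frames)
        x = self.hid(x).transpose(1, 2)      # weighted delayed context
        return self.out(x)                   # per-frame phoneme sequence scores
```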
Further, in an embodiment of the present application, before the feature extraction of the mel-spectrogram using the pre-trained audio recognition model, the method further includes training the audio recognition model.
Specifically, referring to fig. 3, the training of the audio recognition model includes:
S30, acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
S31, performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
s32, inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
s33, calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
s34, if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and S35, if the training loss meets the preset condition, obtaining a trained audio recognition model.
In an optional embodiment, the first characteristic audio signal serves as the ground-truth label for the characteristic audio signal produced in subsequent model training, and the first phoneme sequence serves as the ground-truth label for the phoneme sequence produced in subsequent model training; based on the first characteristic audio signal and the first phoneme sequence, the learning of the subsequent model can be supervised, improving the overall recognition capability of the model.
In an optional embodiment, the spectrum enhancement increases the number of training cepstra, enlarging the training data for subsequent model training and thereby improving the overall robustness of the model.
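The application does not prescribe a particular enhancement technique; one common choice that fits this description is frequency/time masking in the spirit of SpecAugment, sketched below with illustrative mask widths.

```python
# Hypothetical spectrum-enhancement step: mask a random band of
# cepstral coefficients and a random run of frames, producing an
# augmented copy used alongside the original training cepstrum.
import numpy as np

def spec_augment(cep, max_f=8, max_t=20, rng=None):
    rng = rng or np.random.default_rng()
    aug = cep.copy()                               # cep: (n_bands, num_frames)
    f = int(rng.integers(0, max_f + 1))            # frequency-mask width
    f0 = int(rng.integers(0, max(1, cep.shape[0] - f)))
    aug[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))            # time-mask width
    t0 = int(rng.integers(0, max(1, cep.shape[1] - t)))
    aug[:, t0:t0 + t] = 0.0
    return aug
```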
In an alternative embodiment, the inputting the model training data into the convolution module of the audio recognition model to output the second characteristic audio signal of the model training data includes: performing a convolution operation on the model training data with the convolution layer in the convolution module to obtain an initial characteristic audio signal, linearly adjusting the initial characteristic audio signal with the linear rectification layer in the convolution module, reducing the dimension of the linearly adjusted initial characteristic audio signal with the pooling layer in the convolution module, and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
In an alternative embodiment, the recognizing, by the phoneme recognition module of the audio recognition model, the second phoneme sequence of the second characteristic audio signal includes: receiving the second characteristic audio signal with the input layer in the phoneme recognition module and setting delay data of the second characteristic audio signal, extracting the phoneme sequence of the second characteristic audio signal with the hidden layer in the phoneme recognition module according to the delay data, and outputting the extracted phoneme sequence with the output layer in the phoneme recognition module to obtain the second phoneme sequence.
In an alternative embodiment, referring to fig. 4, the step S33 includes:
s40, calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
s41, calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
s42, calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In an alternative embodiment of the application, the first training loss is calculated according to the following formula:

LC = m_g log(m_p) + (1 - m_g) log(1 - m_p)

wherein LC represents the first training loss, m_g represents the first characteristic audio signal, and m_p represents the second characteristic audio signal.

In an alternative embodiment of the application, the second training loss is calculated according to the following formula:

L1 = |α_p - α_g|

wherein L1 represents the second training loss, α_g represents the first phoneme sequence, and α_p represents the second phoneme sequence.

In an alternative embodiment of the present application, the first training loss and the second training loss are added to obtain the training loss of the audio recognition model, i.e. L = L1 + LC.
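Translated directly into code, the two losses and their sum could be computed as below. This follows the formulas as stated (note that the familiar binary cross-entropy is the negative of LC as written); the epsilon guard and the averaging over elements are implementation assumptions.

```python
# Sketch of the training-loss computation: LC from the characteristic
# audio signals, L1 from the phoneme sequences, and L = L1 + LC.
import torch

def training_loss(m_g, m_p, alpha_g, alpha_p, eps=1e-8):
    # First training loss: LC = m_g*log(m_p) + (1 - m_g)*log(1 - m_p)
    lc = (m_g * torch.log(m_p + eps)
          + (1 - m_g) * torch.log(1 - m_p + eps)).mean()
    # Second training loss: L1 = |alpha_p - alpha_g|
    l1 = (alpha_p - alpha_g).abs().mean()
    return l1 + lc  # total training loss L
```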
In an alternative embodiment of the application, the preset condition includes the training loss being less than a loss threshold. That is, when the training loss is smaller than the loss threshold, the training loss satisfies the preset condition; when the training loss is greater than or equal to the loss threshold, it does not. The loss threshold may be set to 0.1, or set according to the actual scene. Further, the parameter adjustment of the audio recognition model may be implemented by a gradient descent algorithm, such as stochastic gradient descent.
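Putting the preset condition and the parameter adjustment together, a training loop consistent with steps S32-S35 might look as follows; the optimizer settings, the epoch cap and the compute_loss helper (a hypothetical stand-in for the forward pass plus the loss above) are assumptions.

```python
# Sketch of the training loop: adjust model parameters by stochastic
# gradient descent until the training loss falls below the loss threshold.
import torch

def train(model, loader, loss_threshold=0.1, lr=1e-3, max_epochs=100):
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    for _ in range(max_epochs):
        for batch in loader:
            loss = model.compute_loss(batch)  # hypothetical helper returning L
            opt.zero_grad()
            loss.backward()                   # parameter adjustment
            opt.step()
            if loss.item() < loss_threshold:  # preset condition satisfied
                return model                  # trained audio recognition model
    return model
```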
And S3, performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the embodiment of the application, the voice recognition result of the audio data is obtained by performing text extraction on the phoneme sequence. In one embodiment of the method, a language model is used to extract text from the phoneme sequence; the language model is an abstract mathematical model of language, built from objective linguistic facts, and is used to identify the text information relations corresponding to the phoneme sequence.
In detail, the text extraction of the phoneme sequence with the language model includes: calculating text generation probabilities for the phoneme sequences with the language model, identifying the text information relations between the phoneme sequences according to the text generation probabilities, and generating the corresponding text according to the text information relations. The text generation probability refers to the distribution probability of the text generated by a phoneme sequence, and the text information relation refers to the relation by which any two or more phoneme sequences can form a word.
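As a toy illustration of this step, the sketch below scores candidate words by their generation probability combined with a bigram relation to the previous word; the two probability tables are hypothetical stand-ins for a trained language model.

```python
# Hypothetical text extraction from phoneme sequences with a language
# model: pick, for each phoneme group, the word that maximizes the
# generation probability combined with the text-information relation.
def extract_text(phoneme_groups, gen_prob, relation_prob):
    # gen_prob[p]          -> {word: P(word | phoneme group p)}
    # relation_prob[(a,b)] -> P(word b follows word a)
    text, prev = [], None
    for p in phoneme_groups:
        candidates = gen_prob.get(p, {})
        if not candidates:
            continue
        best = max(candidates, key=lambda w: candidates[w]
                   * relation_prob.get((prev, w), 1e-6))
        text.append(best)
        prev = best
    return "".join(text)
```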
Based on the spectral analysis of the audio data, the application can extract the characteristic data of the audio data, reducing the complexity of the audio data and improving the accuracy of subsequent analysis. In addition, feature extraction and phoneme recognition on the mel-spectrogram of the audio data are performed by a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data is adopted, which strengthens the robustness of the audio recognition model against complex audio data and further improves the analysis accuracy.
Fig. 5 shows a functional block diagram of the speech recognition apparatus of the present application.
The speech recognition apparatus 500 of the present application may be installed in an electronic device. Depending on the implemented functions, the speech recognition apparatus may comprise a spectrum analysis module 501, a phoneme sequence recognition module 502 and a text extraction module 503. A module of the present application may also be referred to as a unit, meaning a series of computer program segments that are stored in the memory of the electronic device and can be executed by its processor to perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the spectrum analysis module 501 is configured to obtain audio data, perform spectrum analysis on the audio data, and generate a mel-spectrogram of the audio data;
the phoneme sequence recognition module 502 is configured to perform feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and to recognize a phoneme sequence of the characteristic audio signal;
the text extraction module 503 is configured to perform text extraction on the phoneme sequence and take the text extraction result as the recognition result of the audio data.
In detail, the modules in the voice recognition device 500 in the embodiment of the present application use the same technical means as the voice recognition method described in fig. 1 to 4 and can produce the same technical effects, which are not described herein.
Fig. 6 is a schematic structural diagram of an electronic device for implementing the voice recognition method according to the present application.
The electronic device 6 may comprise a processor 60, a memory 61 and a bus, and may further comprise a computer program, such as a speech recognition program 62, stored in the memory 61 and executable on the processor 60.
The memory 61 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 61 may in some embodiments be an internal storage unit of the electronic device 6, such as a hard disk of the electronic device 6. The memory 61 may also, in other embodiments, be an external storage device of the electronic device 6, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the electronic device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the electronic device 6. The memory 61 may be used not only for storing application software installed in the electronic device 6 and various types of data, such as the code of the voice recognition program 62, but also for temporarily storing data that has been or will be output.
The processor 60 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The processor 60 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 6 and processes data by running or executing programs or modules stored in the memory 61 (e.g., executing a voice recognition program 62, etc.), and calling data stored in the memory 61.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. The bus is arranged to enable the connection and communication between the memory 61 and the at least one processor 60, among others.
Fig. 6 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 6 is not limiting of the electronic device 6 and may include fewer or more components than shown, or some components in combination, or a different arrangement of components.
For example, although not shown, the electronic device 6 may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor 60 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 6 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described in detail herein.
Further, the electronic device 6 may also comprise a network interface, optionally comprising a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 6 and other electronic devices.
The electronic device 6 may optionally further comprise a user interface, which may be a display or an input unit such as a keyboard (Keyboard), and may use a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to, where appropriate, as a display screen or display unit, for displaying the information processed in the electronic device 6 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The speech recognition program 62 stored in the memory 61 of the electronic device 6 is a combination of multiple computer programs which, when run on the processor 60, can implement:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In particular, the specific implementation method of the processor 60 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the integrated modules/units of the electronic device 6 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present application also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
It should be noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A method of speech recognition, the method comprising:
acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data;
before the feature extraction is performed on the mel-spectrogram by using the pre-trained audio recognition model, the method further comprises:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and performing phoneme sequence extraction on the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
2. The method of claim 1, wherein the performing spectral analysis on the audio data to generate a mel-spectrogram of the audio data comprises:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing mel-spectrum filtering on the spectrogram, and performing cepstrum analysis on the mel-filtered spectrogram to obtain an initial mel cepstrum of the audio data;
and performing a discrete transform on the initial mel cepstrum to obtain the mel-spectrogram of the audio data.
3. The method of claim 1, wherein the inputting the model training data into the convolution module of the audio recognition model to output a second characteristic audio signal of the model training data comprises:
performing convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to linearly adjust the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the dimension-reduced initial characteristic audio signal through the fully connected layer in the convolution module to obtain the second characteristic audio signal.
4. The method of claim 1, wherein the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal comprises:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
5. The method of claim 1, wherein the calculating training loss of the audio recognition model based on the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence, and the second phoneme sequence comprises:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
6. The speech recognition method of any one of claims 1 to 5, wherein the text extraction of the phoneme sequence comprises:
calculating a text generation probability from the phoneme sequence;
and identifying the text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
7. A speech recognition apparatus, comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a mel-spectrogram of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the mel-spectrogram by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data;
the process for acquiring the trained audio recognition model comprises the following steps of: acquiring a training cepstrum graph and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking the training cepstrum after spectrum enhancement and the training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets the preset condition, obtaining a trained audio recognition model.
8. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method according to any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the speech recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110610069.6A CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110610069.6A CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113327586A CN113327586A (en) | 2021-08-31 |
CN113327586B true CN113327586B (en) | 2023-11-28 |
Family
ID=77423260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110610069.6A Active CN113327586B (en) | 2021-06-01 | 2021-06-01 | Voice recognition method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327586B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763952B (en) * | 2021-09-03 | 2022-07-26 | 深圳市北科瑞声科技股份有限公司 | Dynamic voice recognition method and device, electronic equipment and storage medium |
CN113808577A (en) * | 2021-09-18 | 2021-12-17 | 平安银行股份有限公司 | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium |
CN114743554A (en) * | 2022-06-09 | 2022-07-12 | 武汉工商学院 | Intelligent household interaction method and device based on Internet of things |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20050014183A (en) * | 2003-07-30 | 2005-02-07 | 주식회사 팬택 | Method for modificating state |
CN101447185A (en) * | 2008-12-08 | 2009-06-03 | 深圳市北科瑞声科技有限公司 | Audio frequency rapid classification method based on content |
CN106486119A (en) * | 2016-10-20 | 2017-03-08 | 海信集团有限公司 | A kind of method and apparatus of identification voice messaging |
JP2019020598A (en) * | 2017-07-18 | 2019-02-07 | 国立研究開発法人情報通信研究機構 | Learning method of neural network |
KR20190110939A (en) * | 2018-03-21 | 2019-10-01 | 한국과학기술원 | Environment sound recognition method based on convolutional neural networks, and system thereof |
CN110827801A (en) * | 2020-01-09 | 2020-02-21 | 成都无糖信息技术有限公司 | Automatic voice recognition method and system based on artificial intelligence |
CN111261146A (en) * | 2020-01-16 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN111489745A (en) * | 2019-01-28 | 2020-08-04 | 上海菲碧文化传媒有限公司 | Chinese speech recognition system applied to artificial intelligence |
CN111681637A (en) * | 2020-04-28 | 2020-09-18 | 平安科技(深圳)有限公司 | Song synthesis method, device, equipment and storage medium |
CN111862962A (en) * | 2020-07-20 | 2020-10-30 | 汪秀英 | Voice recognition method and system |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN112133289A (en) * | 2020-11-24 | 2020-12-25 | 北京远鉴信息技术有限公司 | Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium |
CN112466279A (en) * | 2021-02-02 | 2021-03-09 | 深圳市阿卡索资讯股份有限公司 | Automatic correction method and device for spoken English pronunciation |
CN112735371A (en) * | 2020-12-28 | 2021-04-30 | 出门问问(苏州)信息科技有限公司 | Method and device for generating speaker video based on text information |
CN112767958A (en) * | 2021-02-26 | 2021-05-07 | 华南理工大学 | Zero-learning-based cross-language tone conversion system and method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8660842B2 (en) * | 2010-03-09 | 2014-02-25 | Honda Motor Co., Ltd. | Enhancing speech recognition using visual information |
US20190130896A1 (en) * | 2017-10-26 | 2019-05-02 | Salesforce.Com, Inc. | Regularization Techniques for End-To-End Speech Recognition |
US11107463B2 (en) * | 2018-08-01 | 2021-08-31 | Google Llc | Minimum word error rate training for attention-based sequence-to-sequence models |
US11170761B2 (en) * | 2018-12-04 | 2021-11-09 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
- 2021-06-01: CN application CN202110610069.6A, granted as patent CN113327586B (Active)
Also Published As
Publication number | Publication date |
---|---|
CN113327586A (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111179975B (en) | Voice endpoint detection method for emotion recognition, electronic device and storage medium | |
CN113327586B (en) | Voice recognition method, device, electronic equipment and storage medium | |
CN105976812B (en) | A kind of audio recognition method and its equipment | |
KR20130133858A (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
Muckenhirn et al. | Understanding and Visualizing Raw Waveform-Based CNNs. | |
CN113420556B (en) | Emotion recognition method, device, equipment and storage medium based on multi-mode signals | |
CN111243569B (en) | Emotional voice automatic generation method and device based on generation type confrontation network | |
CN108320734A (en) | Audio signal processing method and device, storage medium, electronic equipment | |
US20160099003A1 (en) | Digital watermark embedding device, digital watermark embedding method, and computer-readable recording medium | |
CN110738998A (en) | Voice-based personal credit evaluation method, device, terminal and storage medium | |
CN116665669A (en) | Voice interaction method and system based on artificial intelligence | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
CN113555003B (en) | Speech synthesis method, device, electronic equipment and storage medium | |
CN114842880A (en) | Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium | |
CN115104151A (en) | Offline voice recognition method and device, electronic equipment and readable storage medium | |
CN113436621B (en) | GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium | |
CN115240696B (en) | Speech recognition method and readable storage medium | |
Shah et al. | Speaker recognition for pashto speakers based on isolated digits recognition using accent and dialect approach | |
US20230081543A1 (en) | Method for synthetizing speech and electronic device | |
CN116543797A (en) | Emotion recognition method and device based on voice, electronic equipment and storage medium | |
CN113555026B (en) | Voice conversion method, device, electronic equipment and medium | |
CN113053409B (en) | Audio evaluation method and device | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
CN113808577A (en) | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |