CN118248154B - Speech processing method, device, electronic equipment, medium and program product
- Publication number: CN118248154B
- Application number: CN202410670266.0A
- Authority: CN (China)
- Legal status: Active
Classifications
- G10L19/07 — Line spectrum pair [LSP] vocoders
- G10L19/032 — Quantisation or dequantisation of spectral components
- G10L19/167 — Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
- G10L21/0232 — Processing in the frequency domain
- G10L21/0264 — Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/93 — Discriminating between voiced and unvoiced parts of speech signals
Abstract
The present disclosure provides a speech processing method, apparatus, electronic device, medium, and program product, and relates to the technical field of communications. The method includes: acquiring first white noise, linear predictive coding (LPC) parameters, and an unvoiced error value; performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected; and correcting the unvoiced speech frame to be corrected based on the unvoiced error value to obtain a target unvoiced speech frame. By correcting the unvoiced speech frame to be corrected with the unvoiced error value, the embodiments of the present disclosure improve the quality of unvoiced frames.
Description
Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to a voice processing method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product.
Background
In the field of communication technology, vocoders are used to encode and decode sound. Its main function is to convert the sound signal into digital form (encoding) for transmission, storage and processing, and then to decode it back into the sound signal (decoding).
In the related art, the speech frames of speech include unvoiced frames and voiced frames. The vocoder uses white noise as the excitation source for unvoiced frames and a periodic sequence as the excitation source for voiced frames, and synthesizes speech from these excitation sources. However, the quality of the unvoiced-frame speech synthesized by the vocoder is poor.
How to improve the voice quality of unvoiced frames is a problem to be solved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a speech processing method, apparatus, electronic device, medium, and program product that improve, at least to some extent, the speech quality of unvoiced frames.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a speech processing method applied to a decoding end, including: acquiring first white noise, linear predictive coding (LPC) parameters, and an unvoiced error value; performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected; and correcting the unvoiced speech frame to be corrected based on the unvoiced error value to obtain a target unvoiced speech frame.
In some exemplary embodiments of the present disclosure, before acquiring the first white noise, the linear predictive coding LPC parameters, and the unvoiced error value, the method further includes: receiving line spectrum pair (LSP) quantized bit information and unvoiced error bit information transmitted by the encoding end; decoding the unvoiced error bit information to obtain the unvoiced error value; decoding the LSP quantized bit information to obtain LSP parameters; and transforming the LSP parameters to obtain the LPC parameters.
In some exemplary embodiments of the present disclosure, before performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected, the method further includes: obtaining a frame number of an original unvoiced speech frame corresponding to the LPC parameters; and generating the first white noise according to the frame number, wherein the first white noise is generated by a white noise generator of the decoding end.
In some exemplary embodiments of the present disclosure, the unvoiced error value is the difference between the original unvoiced speech frame and a synthesized unvoiced speech frame, the synthesized unvoiced speech frame being an unvoiced frame synthesized by the encoding end; correcting the unvoiced speech frame to be corrected based on the unvoiced error value to obtain a target unvoiced speech frame includes: calculating the sum of the unvoiced speech frame to be corrected and the unvoiced error value to obtain the target unvoiced speech frame.
According to another aspect of the present disclosure, there is provided a speech processing method applied to an encoding end, including: acquiring an original unvoiced speech frame in original speech; performing parameter extraction on the original unvoiced speech frame to obtain linear predictive coding (LPC) parameters; transforming the LPC parameters to obtain line spectrum pair (LSP) parameters; performing vector quantization coding on the LSP parameters to obtain LSP quantized bit information; obtaining an unvoiced error value based on the LSP quantized bit information and the original unvoiced speech frame; performing vector quantization coding on the unvoiced error value to obtain unvoiced error bit information; and transmitting the unvoiced error bit information and the LSP quantized bit information to a decoding end.
In some exemplary embodiments of the present disclosure, the unvoiced error bit information is transmitted to the decoding end in the idle coding bits of the unvoiced frame.
In some exemplary embodiments of the present disclosure, deriving unvoiced sound error values based on the LSP quantization bit information and the original unvoiced sound frame includes: decoding the LSP quantized bit information to obtain LSP parameters; performing transformation processing on the LSP parameters to obtain LPC parameters; acquiring second white noise; performing LPC synthesis according to the second white noise and the LPC parameters to obtain a synthesized unvoiced speech frame; and calculating the difference between the original unvoiced sound frame and the synthesized unvoiced sound frame to obtain the unvoiced sound error value.
In some exemplary embodiments of the present disclosure, before acquiring the second white noise, the method further comprises: acquiring a frame number of an original unvoiced speech frame corresponding to the LPC parameter; and generating the second white noise according to the frame number, wherein the second white noise is generated by a white noise generator of the encoding end and is identical to the first white noise generated by the white noise generator of the decoding end according to the frame number.
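As a minimal illustration of this shared frame-number convention (a sketch only; Python/NumPy, the function name, and the 160-sample frame length are assumptions, not part of the disclosure), seeding a pseudo-random generator with the frame number makes the encoder's second white noise and the decoder's first white noise bit-identical:

```python
import numpy as np

def frame_white_noise(frame_number: int, frame_len: int = 160) -> np.ndarray:
    # Seed the generator with the frame number so that the encoding end
    # and the decoding end produce identical excitation for the same frame.
    rng = np.random.default_rng(frame_number)
    return rng.standard_normal(frame_len)

# Same frame number -> same noise at both ends.
assert np.array_equal(frame_white_noise(42), frame_white_noise(42))
```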
According to still another aspect of the present disclosure, there is provided a voice processing apparatus applied to a decoding side, including: the first acquisition module is used for acquiring first white noise, linear Predictive Coding (LPC) parameters and unvoiced sound error values; the voice frame synthesis module is used for performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced voice frame to be corrected; and the correction module is used for correcting the unvoiced sound frame to be corrected based on the unvoiced sound error value to obtain a target unvoiced sound frame.
According to still another aspect of the present disclosure, there is provided a speech processing apparatus applied to an encoding end, including: a second acquisition module used for acquiring an original unvoiced speech frame in the original speech; a parameter extraction module used for performing parameter extraction on the original unvoiced speech frame to obtain linear predictive coding (LPC) parameters; a conversion module used for transforming the LPC parameters to obtain line spectrum pair (LSP) parameters; an encoding module used for performing vector quantization coding on the LSP parameters to obtain LSP quantized bit information; an error value generation module used for obtaining an unvoiced error value based on the LSP quantized bit information and the original unvoiced speech frame; the encoding module further used for performing vector quantization coding on the unvoiced error value to obtain unvoiced error bit information; and a sending module used for sending the unvoiced error bit information and the LSP quantized bit information to a decoding end.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the above-described speech processing methods via execution of the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described speech processing methods.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program or computer instructions loaded and executed by a processor to cause the computer to implement any of the above-described speech processing methods.
The embodiments of the disclosure provide a speech processing method, apparatus, electronic device, medium, and program product, wherein the decoding end of a vocoder obtains first white noise, LPC parameters, and an unvoiced error value, and performs LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected. The unvoiced speech frame to be corrected is corrected based on the unvoiced error value to obtain a target unvoiced speech frame. By correcting the unvoiced speech frame to be corrected with the unvoiced error value, the embodiments of the disclosure improve the quality of unvoiced frames.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 shows a schematic diagram of a speech processing system architecture in an embodiment of the present disclosure.
Fig. 2 shows a flowchart of a speech processing method in an embodiment of the present disclosure.
Fig. 3 shows a flowchart of a speech processing method in another embodiment of the present disclosure.
Fig. 4 shows a block diagram of an encoding-side speech processing procedure in an embodiment of the present disclosure.
Fig. 5 shows a block diagram of a decoding-side speech processing procedure in an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of a speech processing device in an embodiment of the disclosure.
Fig. 7 shows a schematic diagram of a speech processing device in another embodiment of the present disclosure.
Fig. 8 shows a block diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "a" is to be understood as "one or more" unless the context clearly indicates otherwise.
For ease of understanding, the terms involved in this disclosure are first explained as follows:
Linear predictive coding (LPC) is a very important coding method. In principle, LPC analyzes the speech waveform to derive parameters of the vocal tract excitation and transfer function, so that coding the sound waveform is converted into coding these parameters, which greatly reduces the data volume of the sound. At the receiving end, the speech is reconstructed by speech synthesis using the parameters obtained from LPC analysis (LPC parameters for short).
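As a rough sketch of how LPC parameters can be estimated from one speech frame (the disclosure does not prescribe an estimation method; the Levinson-Durbin recursion below, the function names, and the default order of 10 are assumptions for illustration):

```python
import numpy as np

def lpc_analysis(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Estimate LPC coefficients a_1..a_N via the Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order)
    err = r[0]  # prediction error energy
    for i in range(order):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err  # reflection coefficient
        a[:i] -= k * a[i - 1::-1][:i]             # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k
    # Convention here: A(z) = 1 - sum_n a[n-1] z^(-n), synthesis filter 1/A(z)
    return a
```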
A line spectrum pair (Line Spectral Pair, LSP) is a direct mathematical transformation of the linear prediction coefficients, i.e., an equivalent characterization of them. LSPs have good quantization characteristics and an efficient representation, and are therefore widely used in speech coding.
The LSP may be obtained by converting LPC parameters of the speech signal. LPC is a speech signal analysis method for estimating the vocal tract characteristics of a speech signal. LSP parameters can be obtained by converting LPC parameters.
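A sketch of one common LPC-to-LSP conversion (assumed here, not specified by the disclosure): form the symmetric and antisymmetric polynomials P(z) and Q(z) from A(z) and take the angles of their unit-circle roots as line spectral frequencies. It assumes an even LPC order (e.g., 10) and the coefficient convention of the earlier `lpc_analysis` sketch.

```python
import numpy as np

def lpc_to_lsf(a: np.ndarray) -> np.ndarray:
    """Line spectral frequencies (radians in (0, pi)) from LPC coefficients."""
    # A(z) = 1 - sum_n a_n z^{-n}
    A = np.concatenate(([1.0], -np.asarray(a)))
    # P(z) = A(z) + z^-(N+1) A(1/z), Q(z) = A(z) - z^-(N+1) A(1/z)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    angles = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    # Keep one angle per conjugate pair; drop the trivial roots at z = +/-1.
    return np.sort(angles[(angles > 0) & (angles < np.pi)])
```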
At the encoding end, the vocoder extracts and encodes parameters of the speech signal. A vocoder based on the linear predictive coding model must first determine whether the current speech frame is an unvoiced frame or a voiced frame, simulate a different excitation source for each category of frame, and synthesize speech through the model.
It should be noted that a speech frame may be an unvoiced frame or a voiced frame, depending on the characteristics of the speech signal. Speech signals are divided into unvoiced and voiced sounds: voiced sounds have pronounced periodicity, whereas unvoiced sounds are noise-like and exhibit no periodicity. Although most speech frames are a mixture of both, if a frame of speech consists mostly of noise-like components and exhibits no periodicity, the frame can be considered an unvoiced frame.
The inventors have found that the vocoder uses white noise as the excitation source for unvoiced frames and a periodic sequence as the excitation source for voiced frames, and synthesizes speech from these excitation sources; however, the quality of the unvoiced-frame speech synthesized by the vocoder is poor. In addition, the inventors have also found that parameter extraction for voiced frames is complex (covering the pitch period, spectral amplitudes, and so on) and requires a large number of coding and transmission bits, whereas unvoiced frames lack the parameter bit information of voiced frames and need fewer coding bits. The idle coding bits of an unvoiced frame can therefore carry enhancement bit information for that frame (the unvoiced error bit information below), thereby improving the speech quality of unvoiced frames. It should be noted that, for a vocoder of the linear predictive coding (LPC) model, the total number of bits used for transmitting a voiced frame and for transmitting an unvoiced frame is the same; because an unvoiced frame needs fewer bits than a voiced frame, some of its transmission bits are idle, and transmitting the enhancement bit information in these idle bits effectively improves the speech quality of unvoiced frames without increasing the transmission code rate. That is, voiced frames must transmit more parameters than unvoiced frames, and the bit positions of the voiced-only parameter information can be reused by unvoiced frames to carry the unvoiced error bit information, as illustrated by the sketch below.
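A toy bit-budget illustration of this reuse (all numbers are assumptions; the patent does not disclose actual bit allocations):

```python
# Every frame, voiced or unvoiced, occupies the same fixed budget.
FRAME_BITS = 54            # total bits per frame (assumed)
VOICED_ONLY_BITS = 20      # pitch period, spectral amplitudes, etc. (assumed)
UNVOICED_CORE_BITS = FRAME_BITS - VOICED_ONLY_BITS  # LSP + gain fields

# Bits an unvoiced frame leaves idle -- reused for the unvoiced error index.
IDLE_BITS = FRAME_BITS - UNVOICED_CORE_BITS
print(IDLE_BITS)  # -> 20, carried without raising the code rate
```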
Based on this, the embodiments of the present disclosure provide a speech processing method, apparatus, electronic device, medium, and program product, which can be applied to high-orbit satellite scenarios and can effectively improve speech quality without increasing the transmission code rate. The method can also be used in ground base station scenarios to guarantee the quality of voice calls, and in other communication scenarios without particular limitation. The decoding end of the vocoder acquires first white noise generated by its white noise generator, LPC parameters, and an unvoiced error value, and performs LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected. The unvoiced speech frame to be corrected is then corrected based on the unvoiced error value to obtain a target unvoiced speech frame. By correcting the unvoiced speech frame to be corrected with the unvoiced error value, the embodiments of the present disclosure improve the quality of unvoiced frames.
It is noted that embodiments of the present disclosure and technical features in the embodiments may be combined with each other without conflict.
The following detailed description of embodiments of the present disclosure refers to the accompanying drawings.
Fig. 1 is a schematic diagram showing the structure of a speech processing system in an embodiment of the present disclosure, to which the speech processing method or the speech processing apparatus in various embodiments of the present disclosure can be applied.
As shown in fig. 1, the speech processing system 100 may include a vocoder. A vocoder is a speech analysis-synthesis system based on a speech signal model. It analyzes the speech signal to extract its characteristic parameters (such as LPC parameters) and encodes these parameters to suit the transmission channel. At the receiving end, the vocoder restores the original speech waveform from the received characteristic parameters. The vocoder therefore needs only the model parameters during transmission and reception to encode and decode, achieving compression and transmission of the speech signal.
The vocoder may include an encoding end 101 and a decoding end 102, and the encoding end 101 and the decoding end 102 may be located on the same electronic device or on two different electronic devices connected for communication, which is not particularly limited in the present disclosure. When the encoding end 101 and the decoding end 102 are located on the same electronic device, the method can be used for storing and playing speech. When they are located on different electronic devices, voice communication or voice transmission between the two devices can be achieved, with the communication connection established through a network, which may be a wired network or a wireless network (such as a communication connection established via a high-orbit satellite).
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The electronic device has a speech processing function or is provided with a vocoder, and includes, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The two electronic devices may be directly or indirectly connected by wired or wireless communication, which is not limited herein.
Those skilled in the art will appreciate that the number of encoding and decoding ends in fig. 1 is merely illustrative, and that any number of encoding and decoding ends may be provided as desired. The embodiments of the present disclosure are not limited in this regard.
The present exemplary embodiment will be described in detail below with reference to the accompanying drawings and examples.
First, in the embodiments of the present disclosure, a speech processing method is provided, which may be performed by any electronic device having computing processing capabilities.
Fig. 2 shows a flowchart of a speech processing method according to an embodiment of the present disclosure, and as shown in fig. 2, the speech processing method provided in the embodiment of the present disclosure is applied to a decoding end, and includes the following S201 to S203.
S201, acquiring first white noise, linear predictive coding (LPC) parameters, and an unvoiced error value.
In the embodiments of the disclosure, the first white noise is the white noise used for synthesizing the unvoiced speech frame to be corrected. It should be noted that the decoding end includes a noise excitation generator (also called a white noise generator), and the white noise excitation generated by the white noise generator is related to the frame number of the original unvoiced speech frame. The original unvoiced speech frame is an unvoiced frame among the received speech frames. For the same frame number, the white noise used by the decoding end and the encoding end to synthesize the speech frame is the same (for example, the first white noise and the second white noise corresponding to the same frame number are identical), which ensures the quality of the synthesized speech frame. The first white noise obtained at the decoding end is generated by the white noise generator of the decoding end.
Illustratively, before performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected, the method may further include: obtaining a frame number of an original unvoiced speech frame corresponding to the LPC parameter; and generating first white noise according to the frame number, wherein the first white noise is generated by a white noise generator at a decoding end.
In the embodiments of the disclosure, the frame number of the original unvoiced frame may be obtained from the header packet of the original unvoiced frame; the disclosure does not limit how the frame number is obtained, as long as it can be obtained. Ensuring that the encoding end and the decoding end generate the same white noise for the same frame number helps further improve the speech quality of unvoiced frames.
In the embodiments of the present disclosure, the unvoiced error value is a parameter used for correcting the unvoiced speech frame to be corrected, and the embodiments of the present disclosure do not limit how the unvoiced error value is determined. For example, the unvoiced error value may be obtained from a machine model trained with original unvoiced frames and synthesized unvoiced frames as training data. For another example, the encoding end compares the original unvoiced frame with the synthesized unvoiced frame to obtain the unvoiced error value, performs vector quantization encoding on it, and transmits it to the decoding end. The embodiments of the present disclosure are not limited thereto.
S202, performing LPC synthesis according to the first white noise and the LPC parameters to obtain the unvoiced speech frame to be corrected.
In the embodiment of the disclosure, the LPC parameters may be vocal tract model parameters, and the LPC parameters of the vocal tract model are extracted according to the original unvoiced speech frame.
The embodiments of the present disclosure do not limit how the unvoiced speech frame to be corrected is synthesized from the first white noise and the LPC parameters. For example, the first white noise and the LPC parameters are synthesized into the unvoiced speech frame to be corrected using the vocal tract model function, as sketched below.
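A minimal sketch of such an LPC synthesis (assuming Python with SciPy, the coefficient convention of the earlier `lpc_analysis` sketch, and formula 1 from the encoding-side description below; none of these names come from the disclosure):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a: np.ndarray, excitation: np.ndarray) -> np.ndarray:
    # All-pole vocal tract filter H(z) = 1 / (1 - sum_n a_n z^(-n)):
    # filtering the white-noise excitation through H(z) is LPC synthesis.
    denominator = np.concatenate(([1.0], -np.asarray(a)))
    return lfilter([1.0], denominator, excitation)
```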
S203, correcting the unvoiced sound frame to be corrected based on the unvoiced sound error value to obtain a target unvoiced sound frame.
In one embodiment, the unvoiced sound error value is the difference between an original unvoiced sound frame and a synthesized unvoiced sound frame, and the synthesized unvoiced sound frame is the unvoiced sound frame synthesized by the encoding end (described in detail below). The method for correcting the unvoiced sound frame to be corrected based on the unvoiced sound error value to obtain a target unvoiced sound frame may include: and calculating the sum of the unvoiced sound frame to be corrected and the unvoiced sound error value to obtain the target unvoiced sound frame.
In the embodiments of the disclosure, the target unvoiced frame is the unvoiced frame after correction by the unvoiced error value; compared with the frame before correction (the unvoiced speech frame to be corrected), the quality of the unvoiced frame is improved.
According to the embodiment of the disclosure, the unvoiced speech frame to be corrected is corrected through the unvoiced error value, so that the quality of the unvoiced speech frame is improved.
The disclosure is further illustrated by the following two exemplary embodiments.
In an exemplary embodiment, the method may further include the following steps A1 to A4 before the first white noise, the linear predictive coding LPC parameters, and the unvoiced sound error value are acquired.
Step A1, receiving the line spectrum pair (LSP) quantized bit information and the unvoiced error bit information transmitted by the encoding end.
In the embodiments of the disclosure, the LSP quantized bit information and the unvoiced error bit information generated by the encoding end are transmitted to the decoding end within the coding bits of the unvoiced frame. It should be noted that unvoiced frames have fewer parameters than voiced frames, so some coding bits of an unvoiced frame are idle; transmitting the unvoiced error bit information in these idle coding bits effectively improves the speech quality of unvoiced frames without increasing the transmission code rate.
Step A2, decoding the unvoiced error bit information to obtain the unvoiced error value.
In the embodiments of the present disclosure, there is no limitation on how the unvoiced error bit information is decoded. For example, the unvoiced error bit information is decoded through a codebook library (codebook) to obtain the unvoiced error value; codebook libraries may be used for acoustic parameter reconstruction during decoding.
It should be noted that the codebook library used for decoding the unvoiced error bit information may be a specifically designed vector quantization codebook library.
Step A3, decoding the LSP quantized bit information to obtain LSP parameters.
In the embodiments of the disclosure, the LSP quantized bit information is decoded through a codebook library to obtain the LSP parameters. It should be noted that the codebook libraries in step A2 and step A3 may be the same or different; the embodiments of the present disclosure are not limited in this regard.
Step A4, transforming the LSP parameters to obtain the LPC parameters.
In the embodiments of the present disclosure, there are many ways to convert LSP parameters into LPC parameters. LSP parameters are typically used to represent spectral information, while LPC parameters represent vocal tract information. The conversion may be achieved as follows, although the embodiments of the present disclosure are not limited thereto. (1) LSP to frequency-domain amplitude response: the LSP parameters are converted into the corresponding frequency-domain amplitude response, which can be achieved using an inverse FFT (Inverse Fast Fourier Transform) and a root-locus method. (2) Frequency-domain amplitude response to cepstral coefficients: the amplitude response is converted to cepstral coefficients by taking its logarithm and then performing an FFT. (3) Cepstral coefficients to LPC parameters: the cepstral coefficients are converted to LPC parameters using the Durbin recursion (an algorithm for converting autocorrelation coefficients into linear predictive coding coefficients).
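As an alternative to the cepstral chain above, a common direct reconstruction (a sketch under the same assumptions as the earlier `lpc_to_lsf` sketch: even order, angular LSFs with odd-indexed frequencies belonging to P(z)) rebuilds P(z) and Q(z) from the LSF angles and averages them:

```python
import numpy as np

def lsf_to_lpc(lsf: np.ndarray) -> np.ndarray:
    """Rebuild LPC coefficients from line spectral frequencies (even order)."""
    p_roots = np.exp(1j * lsf[::2])   # roots of P(z) in the upper half plane
    q_roots = np.exp(1j * lsf[1::2])  # roots of Q(z) in the upper half plane
    # Restore conjugate pairs plus the trivial roots at z = -1 and z = +1.
    P = np.real(np.poly(np.concatenate((p_roots, p_roots.conj(), [-1.0]))))
    Q = np.real(np.poly(np.concatenate((q_roots, q_roots.conj(), [1.0]))))
    A = (P + Q) / 2.0   # A(z) = (P(z) + Q(z)) / 2 = 1 - sum_n a_n z^(-n)
    return -A[1:-1]     # a_1..a_N (A[0] = 1 and the last coefficient is 0)
```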
According to the embodiment of the disclosure, LSP parameters and unvoiced sound error values are obtained by decoding LSP quantized bit information and unvoiced sound error bit information, then the LSP parameters are converted into LPC parameters, an unvoiced sound frame to be corrected can be synthesized by using the LPC parameters, and the unvoiced sound frame to be corrected is corrected by using the unvoiced sound error values, so that the quality of the unvoiced sound frame is improved.
In another exemplary embodiment, before performing the conversion processing on the LSP parameters to obtain the LPC parameters, the method may further include: judging whether the LSP parameters are unvoiced frame parameters or not; when the LSP parameter is an unvoiced frame parameter, performing conversion processing on the LSP parameter to obtain an LPC parameter, wherein the LSP parameter carries a frame number corresponding to an original unvoiced speech frame, and the first white noise is generated by a white noise generator at a decoding end according to the frame number.
The embodiments of the present disclosure do not limit how to determine whether the LSP parameters are unvoiced frame parameters. For example, the determination is made from the frame number carried in the LSP parameters. For another example, whether the parameters belong to an unvoiced frame is determined from the energy and the average zero-crossing rate. Note that when the LSP parameters correspond to an unvoiced frame, they are unvoiced frame parameters; conversely, when the LSP parameters correspond to a voiced frame, they are voiced frame parameters.
Note that the manner of making the unvoiced/voiced determination need not depend on the LSP parameters. For example, the header overhead packet of the voice packet received at the decoding end may carry information for the voiced/unvoiced determination. Header overhead information is added to Internet-style data packets at different layers and is generated automatically by the related protocols, which is not described in detail herein.
In the embodiment of the disclosure, the voiced frame parameter and the unvoiced parameter are separately processed through the unvoiced/voiced judgment, and the unvoiced error value is utilized to correct the synthesized unvoiced speech frame to be corrected, so that the quality of the unvoiced frame is improved, and the speech synthesis efficiency is improved.
Based on the same inventive concept, a voice processing method is also provided in the embodiments of the present disclosure, such as the following embodiments. Since the principle of solving the problem of this method embodiment is similar to that of the above method embodiment, the implementation of this method embodiment may refer to the implementation of the above method embodiment, and the repetition is not repeated.
Fig. 3 shows a flowchart of a speech processing method in another embodiment of the present disclosure, and as shown in fig. 3, the speech processing method provided in the embodiment of the present disclosure is applied to an encoding end, and includes the following S301 to S307.
S301, acquiring an original unvoiced speech frame in original speech.
In the embodiments of the disclosure, the original speech is the speech received by the encoding end of the vocoder. After receiving the original speech, the encoding end performs an unvoiced/voiced decision on it to obtain the original unvoiced speech frames. The embodiments of the present disclosure do not limit the manner of the unvoiced/voiced decision; for example, it may use the energy and the average zero-crossing rate. High-energy speech frames are typically judged to be original voiced speech frames, while low-energy frames are judged to be original unvoiced speech frames. Likewise, a speech frame with many zero crossings is considered an original unvoiced speech frame, while a frame with few zero crossings is considered an original voiced speech frame, as sketched below.
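A toy unvoiced/voiced decision along these lines (a sketch only: the thresholds and their combination are arbitrary assumptions, not values from the disclosure):

```python
import numpy as np

def is_unvoiced(frame: np.ndarray,
                energy_thresh: float = 1e-3,
                zcr_thresh: float = 0.25) -> bool:
    energy = float(np.mean(frame ** 2))           # short-time energy
    signs = np.signbit(frame).astype(int)
    zcr = float(np.mean(np.abs(np.diff(signs))))  # zero-crossing rate
    # Low energy or frequent zero crossings -> noise-like, hence unvoiced.
    return energy < energy_thresh or zcr > zcr_thresh
```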
S302, extracting parameters of the original unvoiced speech frame to obtain Linear Predictive Coding (LPC) parameters.
S303, transforming the LPC parameters to obtain line spectrum pair (LSP) parameters.
S304, vector quantization coding is carried out on LSP parameters, and LSP quantized bit information is obtained.
In the embodiment of the disclosure, LPC parameters of a sound channel model are extracted according to an original unvoiced speech frame, the LPC parameters are converted into LSP parameters, and vector quantization coding is performed through a codebook library to obtain LSP quantization bit information to be transmitted.
S305, obtaining an unvoiced sound error value based on the LSP quantized bit information and the original unvoiced sound frame.
In an embodiment, the unvoiced sound error value is obtained based on the LSP quantized bit information and the original unvoiced sound frame, which may include the following S3051 to S3055.
And S3051, decoding the LSP quantized bit information to obtain LSP parameters.
S3052, performing conversion processing on the LSP parameters to obtain LPC parameters.
Before the transformation processing, a filtering operation may be performed on the LSP parameters to further improve the speech quality.
In the embodiment of the disclosure, the LSP quantized bit information is decoded through a codebook library to obtain decoded LSP parameters, and the decoded LSP parameters are converted into decoded LPC parameters.
And S3053, acquiring second white noise.
In the disclosed embodiment, the encoding end includes a white noise excitation source generator (also called white noise generator), and the second white noise generated by the white noise generator is related to the frame number of the speech frame. Illustratively, the second white noise is generated by a white noise generator at the encoding end according to the frame number, and the second white noise is the same as the first white noise generated by a white noise generator at the decoding end according to the frame number.
In the embodiment of the present disclosure, the frame number may be automatically generated in the header packet of the voice frame by the related protocol, which is not described herein.
Illustratively, prior to acquiring the second white noise, the method further comprises: acquiring a frame number of an original unvoiced speech frame; and generating second white noise according to the frame number, wherein the second white noise is generated by a white noise generator at the encoding end and is identical to the first white noise generated by a white noise generator at the decoding end according to the frame number.
The white noise generator at the decoding end is identical to the white noise generator at the encoding end; that is, for the same frame, the white noise generated at the decoding end and at the encoding end is identical. In other words, the second white noise is the same as the first white noise generated by the white noise generator of the decoding end according to the frame number. Using the same white noise to synthesize the same speech frame at the decoding end and the encoding end improves the quality of the synthesized speech frame.
And S3054, performing LPC synthesis according to the second white noise and the LPC parameters to obtain a synthesized unvoiced speech frame.
Before the LPC synthesis, the filtering operation may be performed on the second white noise, so as to further improve the speech quality.
S3055, calculating the difference between the original unvoiced speech frame and the synthesized unvoiced speech frame to obtain an unvoiced error value.
In the embodiments of the disclosure, the synthesized unvoiced speech frame synthesized by the encoding end is the same as the unvoiced speech frame to be corrected synthesized by the decoding end. The correction parameter (such as the unvoiced error value) between the original unvoiced speech frame and the synthesized unvoiced speech frame is determined at the encoding end and sent to the decoding end to correct the unvoiced speech frame to be corrected, so that the corrected target unvoiced speech frame is closer to the original unvoiced speech frame.
S306, vector quantization coding is carried out on the unvoiced sound error value, and unvoiced sound error bit information is obtained.
In the embodiments of the disclosure, the unvoiced error value may be vector-quantization encoded according to a pre-designed codebook library, as sketched below; the disclosure is not limited thereto.
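A minimal vector-quantization sketch (assuming a pre-built codebook array; the function names and codebook shape are illustrative, not from the disclosure): the transmitted unvoiced error bit information is simply the index of the nearest codeword.

```python
import numpy as np

def vq_encode(error: np.ndarray, codebook: np.ndarray) -> int:
    # Nearest-codeword search over a (num_codewords, frame_len) codebook;
    # the returned index is the "unvoiced error bit information".
    return int(np.argmin(np.sum((codebook - error) ** 2, axis=1)))

def vq_decode(index: int, codebook: np.ndarray) -> np.ndarray:
    # The decoding end holds the same codebook and looks the codeword up.
    return codebook[index]
```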
S307, the unvoiced error bit information and LSP quantized bit information are sent to the decoding end.
In one embodiment, the unvoiced error bit information is transmitted to the decoding end in the idle coding bits of the unvoiced frame.
According to the embodiments of the disclosure, unvoiced frame error-correction information is computed at the encoding end, and the unvoiced speech synthesized at the decoding end is corrected with this error-correction information (such as the unvoiced error value), which effectively enhances the unvoiced speech quality of the vocoder.
In addition, the embodiments of the disclosure enhance the speech quality of the vocoder's unvoiced frames without increasing the transmitted bit information; for scenarios with high capacity requirements, such as high-orbit satellite scenarios, the speech quality can be effectively improved without increasing the transmission code rate.
The following describes the processing of speech by the encoding side and the decoding side, respectively, by two specific embodiments.
In an embodiment, as shown in fig. 4, a speech processing method provided in the present disclosure is applied to an encoding end and includes S401 to S409.
S401, judging whether a speech frame of the received speech is an original unvoiced speech frame: if it is an unvoiced frame, S402 is executed; if it is a voiced frame, S403 is executed.
S402, LPC analysis: extracting parameters of the original unvoiced speech frame to obtain linear predictive coding (LPC) parameters, then performing S404.
In an embodiment of the present disclosure, the LPC parameters of the vocal tract model are extracted. Illustratively, the encoding end includes a white noise excitation source generator; the white noise excitation source generated by the generator is related to the frame number of the speech frame and is denoted $v_i$, where $i$ is the frame number of the speech frame.
The LPC parameters of the vocal tract model, $a_n$ ($n = 1, \dots, N$), are extracted from the original speech frame, where $N$ is the order of the LPC parameters, and the vocal tract model function is represented by an all-pole model as shown in the following formula 1:
$$H(z) = \frac{1}{1 - \sum_{n=1}^{N} a_n z^{-n}} \tag{1}$$
where $H(z)$ is the vocal tract model, $a_n$ are the LPC parameters, and $z$ is the complex variable.
S403, extracting and quantizing the voiced sound frame parameters. It should be noted that S403 is a conventional processing manner of the vocoder of the LPC model, and will not be described herein.
S404, converting the LPC parameters into LSP parameters.
S405, vector quantization of the LSP parameters: the LSP parameters are vector-quantized through a codebook library to obtain the LSP quantized bit information to be transmitted, and the LSP quantized bit information is decoded through the codebook library to obtain the decoded LSP parameters.
S406, converting LSP parameters into LPC parameters.
Illustratively, the decoded LSP parameters are converted into the decoded LPC parameters.
S407, synthesizing unvoiced speech frames. Unvoiced frame speech information is synthesized from the white noise excitation source generated according to the current speech frame number and the vocal tract model function of the decoded LPC parameters, as shown in the following formula 2:
$$\hat{x}_i = h_i * v_i \tag{2}$$
where $\hat{x}_i$ is the synthesized unvoiced speech frame with frame number $i$, $h_i$ is the unit impulse response of the vocal tract model function $H(z)$, $*$ denotes the convolution operation, and $v_i$ is the white noise excitation source.
It should be noted that, the frame number may be transmitted to the decoding end together with LSP quantization bit information and unvoiced error bit information. For example, the frame number may be located in the header packet of the data packet sent to the decoding end.
S408, error extraction. The unvoiced error value is calculated from the original unvoiced speech frame $x_i$ and the synthesized unvoiced speech frame $\hat{x}_i$, as shown in formula 3:
$$m_i = x_i - \hat{x}_i \tag{3}$$
where $m_i$ is the unvoiced error value of the unvoiced frame with frame number $i$, $x_i$ is the original unvoiced speech frame, and $\hat{x}_i$ is the synthesized unvoiced speech frame.
S409, vector quantization of the unvoiced error value: the unvoiced error value is vector-quantization encoded according to the designed unvoiced frame error codebook library to obtain the unvoiced error bit information, which is transmitted in the idle transmission positions of the unvoiced frame.
In another embodiment, as shown in fig. 5, a speech processing method provided in the present disclosure is applied to a decoding end and includes S501 to S506.
S501, unvoiced/voiced decision. Before S501, the decoding end decodes the bit information received from the encoding end; if the frame is determined to be unvoiced, S502 is executed, and if it is determined to be voiced, S503 is executed.
S502, converting LSP parameters into LPC parameters, and executing S504.
S503, synthesizing voice by using the voiced sound frame parameters.
S504, LPC synthesis: performing LPC synthesis according to the white noise generated by the white noise generator and the LPC parameters to obtain the unvoiced speech frame to be corrected.
Illustratively, the white noise generator generates the white noise excitation source $v_i$ according to the frame number of the current speech frame; the white noise generator at the decoding end should be consistent with the white noise generator at the encoding end, so that the white noise generated for the same speech frame number at the decoding end and the encoding end is consistent. The decoded LSP parameters are converted into LPC parameters, and LPC synthesis is performed with the white noise excitation source to obtain the unvoiced speech frame to be corrected (which is equivalent to the synthesized unvoiced speech frame at the encoding end). The unvoiced speech frame to be corrected is calculated by the following formula 4:
$$\tilde{x}_i = h_i * v_i \tag{4}$$
where $\tilde{x}_i$ is the unvoiced speech frame to be corrected with frame number $i$, $h_i$ is the unit impulse response of the vocal tract model function $H(z)$, $*$ denotes the convolution operation, and $v_i$ is the white noise excitation source.
S505, the unvoiced sound error bit information is decoded to obtain an unvoiced sound error value.
S506, correcting the unvoiced speech frame to be corrected based on the unvoiced error value to obtain a target unvoiced speech frame.
Illustratively, the LPC-synthesized unvoiced frame is corrected based on the decoded unvoiced frame error to arrive at the final unvoiced frame (the target unvoiced speech frame). The correction formula is shown in the following formula 5:
$$x_i = \tilde{x}_i + m_i \tag{5}$$
where $x_i$ is the target unvoiced speech frame with frame number $i$, $\tilde{x}_i$ is the unvoiced speech frame to be corrected with frame number $i$, and $m_i$ is the unvoiced error value with frame number $i$.
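Tying formulas 1-5 together, a minimal round-trip sketch (reusing `frame_white_noise`, `lpc_analysis`, and `lpc_synthesize` from the earlier sketches; all names, the frame length, and the omission of quantization are assumptions for illustration):

```python
import numpy as np

i, N = 42, 10                               # frame number and LPC order (assumed)
x_i = frame_white_noise(7, 160) * 0.3       # stand-in for an original unvoiced frame

# Encoding end: formulas 1-3.
a = lpc_analysis(x_i, N)                    # vocal tract LPC parameters
x_hat = lpc_synthesize(a, frame_white_noise(i, len(x_i)))    # formula 2
m_i = x_i - x_hat                           # unvoiced error value, formula 3

# Decoding end: formulas 4-5 (same frame number -> same white noise).
x_tilde = lpc_synthesize(a, frame_white_noise(i, len(x_i)))  # formula 4
x_out = x_tilde + m_i                       # target unvoiced frame, formula 5

# Exact here because m_i is left unquantized in this sketch.
np.testing.assert_allclose(x_out, x_i)
```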
Based on the same inventive concept, a voice processing device is also provided in the embodiments of the present disclosure, as described in the following embodiments. Since the principle of solving the problem of the embodiment of the device is similar to that of the embodiment of the method, the implementation of the embodiment of the device can be referred to the implementation of the embodiment of the method, and the repetition is omitted.
Fig. 6 shows a schematic diagram of a speech processing device in an embodiment of the disclosure, as shown in fig. 6, applied to a decoding end, where the speech processing device may include a first acquisition module 601, a speech frame synthesis module 602, and a modification module 603. The first obtaining module 601 may be configured to obtain a first white noise, a linear prediction coding LPC parameter, and an unvoiced sound error value; the speech frame synthesis module 602 may be configured to perform LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected; the correction module 603 may be configured to correct the unvoiced speech frame to be corrected based on the unvoiced error value, to obtain the target unvoiced speech frame.
In an embodiment, before acquiring the first white noise, the linear predictive coding LPC parameters, and the unvoiced error value, the first acquisition module 601 may be further configured to receive the line spectrum pair (LSP) quantized bit information and the unvoiced error bit information sent by the encoding end; decode the unvoiced error bit information to obtain the unvoiced error value; decode the LSP quantized bit information to obtain LSP parameters; and transform the LSP parameters to obtain the LPC parameters.
In an embodiment, before performing LPC synthesis according to the first white noise and the LPC parameters to obtain the unvoiced speech frame to be corrected, the first acquisition module 601 may be further configured to acquire the frame number of the original unvoiced speech frame corresponding to the LPC parameters, and generate the first white noise according to the frame number, where the first white noise is generated by a white noise generator at the decoding end.
In one embodiment, the unvoiced error value is the difference between an original unvoiced speech frame and a synthesized unvoiced speech frame, where the synthesized unvoiced speech frame is the unvoiced speech frame synthesized by the encoding end; the correction module 603 may be further configured to calculate the sum of the unvoiced speech frame to be corrected and the unvoiced error value to obtain the target unvoiced speech frame.
The speech processing device of this embodiment of the disclosure corrects the unvoiced speech frame to be corrected using the unvoiced error value, thereby improving the quality of unvoiced frames.
Based on the same inventive concept, another speech processing device is also provided in the embodiments of the present disclosure, as described in the following embodiments. Since the device embodiment solves the problem on the same principle as the method embodiment, its implementation can refer to the implementation of the method embodiment; repeated description is omitted.
Fig. 7 shows a schematic diagram of a speech processing device in another embodiment of the present disclosure. As shown in fig. 7, applied to an encoding end, the speech processing device includes a second acquisition module 701, a parameter extraction module 702, a conversion module 703, an encoding module 704, an error value generation module 705, and a sending module 706. The second acquisition module 701 may be configured to acquire an original unvoiced speech frame in original speech; the parameter extraction module 702 may be configured to perform parameter extraction on the original unvoiced speech frame to obtain linear predictive coding LPC parameters; the conversion module 703 may be configured to transform the LPC parameters to obtain line spectrum pair LSP parameters; the encoding module 704 may be configured to perform vector quantization encoding on the LSP parameters to obtain LSP quantization bit information; the error value generation module 705 may be configured to obtain an unvoiced error value based on the LSP quantization bit information and the original unvoiced speech frame; the encoding module 704 may be further configured to perform vector quantization encoding on the unvoiced error value to obtain unvoiced error bit information; and the sending module 706 may be configured to send the unvoiced error bit information and the LSP quantization bit information to a decoding end.
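The parameter extraction performed by module 702 is classically realized with the autocorrelation method and the Levinson-Durbin recursion. The sketch below assumes a Hamming analysis window and prediction order 10, neither of which is specified by the disclosure.

```python
import numpy as np

def lpc_from_frame(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Autocorrelation-method LPC analysis via the Levinson-Durbin recursion;
    returns [1, a_1, ..., a_p] for A(z). Assumes a non-silent frame."""
    x = frame * np.hamming(len(frame))   # analysis window (an assumption)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])   # forward prediction error term
        k = -acc / err                               # reflection coefficient
        a_prev = a[1:i].copy()
        a[1:i] += k * a_prev[::-1]                   # a_j += k * a_{i-j}
        a[i] = k
        err *= 1.0 - k * k                           # residual energy update
    return a
```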
In one embodiment, the unvoiced error bit information is transmitted to the decoding end in the idle coded bits of the unvoiced frame.
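As an illustration of carrying the error bits in otherwise idle bits of the coded unvoiced frame, a hypothetical packing helper is sketched below; the frame layout and the location of the idle region are assumptions, not the disclosed bitstream format.

```python
def pack_idle_bits(frame_bits: list, error_bits: list, idle_start: int) -> list:
    """Overwrite the idle region of a coded unvoiced frame with the unvoiced
    error bits; the frame size, and hence the code rate, is unchanged."""
    assert idle_start + len(error_bits) <= len(frame_bits)
    packed = list(frame_bits)
    packed[idle_start:idle_start + len(error_bits)] = error_bits
    return packed
```

Because only bits that would otherwise go unused are overwritten, no additional transmission rate is required.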
In an embodiment, the error value generation module 705 may be further configured to decode the LSP quantization bit information to obtain LSP parameters; convert the LSP parameters to obtain LPC parameters; acquire second white noise; perform LPC synthesis according to the second white noise and the LPC parameters to obtain a synthesized unvoiced speech frame; and calculate the difference between the original unvoiced speech frame and the synthesized unvoiced speech frame to obtain the unvoiced error value.
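Putting these encoder-side steps together, and reusing the hypothetical lsf_to_lpc, white_noise_for_frame, and lpc_synthesize helpers from the sketches above (the dequantized LSP input stands in for the quantization round trip):

```python
import numpy as np

def unvoiced_error(original: np.ndarray, lsf_q: np.ndarray, frame_no: int) -> np.ndarray:
    """Unvoiced error value m_i = x_i - x_hat_i, computed with the same LPC
    parameters and white noise the decoder will reconstruct."""
    a = lsf_to_lpc(lsf_q)                    # LPC parameters from dequantized LSPs
    synth = lpc_synthesize(white_noise_for_frame(frame_no, len(original)), a)
    return original - synth                  # original minus synthesized frame
```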
In an embodiment, before the second white noise is acquired, the second acquisition module 701 is further configured to acquire the frame number of the original unvoiced speech frame, and generate the second white noise according to the frame number, where the second white noise is generated by the white noise generator at the encoding end and is identical to the first white noise generated by the white noise generator at the decoding end from the same frame number.
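With the seeded generator from the sketch above, the required consistency between the two ends can be checked directly:

```python
enc_noise = white_noise_for_frame(frame_no=42)   # encoding end
dec_noise = white_noise_for_frame(frame_no=42)   # decoding end
assert np.array_equal(enc_noise, dec_noise)      # identical for the same frame number
```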
The speech processing device of this embodiment of the disclosure can enhance the speech quality of the unvoiced frames of a vocoder without increasing the amount of transmitted bit information; for scenarios with tight capacity requirements, such as high-orbit satellite links, it can effectively improve speech quality without increasing the transmission code rate.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to such an embodiment of the present disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 8, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 connecting the various system components, including the memory unit 820 and the processing unit 810.
The storage unit stores program code executable by the processing unit 810, such that the processing unit 810 performs the steps according to various exemplary embodiments of the present disclosure described above in this specification. For example, the processing unit 810 may perform the following steps of the method embodiment described above: acquiring first white noise, linear predictive coding LPC parameters, and an unvoiced error value; performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected; and correcting the unvoiced speech frame to be corrected based on the unvoiced error value to obtain a target unvoiced speech frame.
For another example, the processing unit 810 may perform the following steps of the method embodiment described above: acquiring an original unvoiced speech frame in original speech; extracting parameters of the original unvoiced speech frame to obtain linear predictive coding LPC parameters; transforming the LPC parameters to obtain line spectrum pair LSP parameters; performing vector quantization coding on the LSP parameters to obtain LSP quantized bit information; acquiring an unvoiced error value based on the LSP quantized bit information and the original unvoiced speech frame; performing vector quantization coding on the unvoiced error value to obtain unvoiced error bit information; and transmitting the unvoiced error bit information and the LSP quantized bit information to a decoding end.
The storage unit 820 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 8201 and/or cache memory 8202, and may further include Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 840 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 800, and/or any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 850. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. As shown, network adapter 860 communicates with other modules of electronic device 800 over bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, which may be a readable signal medium or a readable storage medium, and on which a program product capable of implementing the above-described method of the present disclosure is stored.
In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the detailed description section of the disclosure, when the program product is run on the terminal device.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, the program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech processing methods provided in the various alternatives in any of the embodiments of the disclosure.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope of the disclosure being indicated by the following claims.
Claims (12)
1. A speech processing method applied to a decoding end, comprising:
receiving line spectrum pair LSP quantized bit information and unvoiced sound error bit information sent by an encoding end;
decoding the unvoiced sound error bit information to obtain an unvoiced sound error value;
decoding the LSP quantized bit information to obtain LSP parameters, and performing transformation processing on the LSP parameters to obtain Linear Predictive Coding (LPC) parameters;
acquiring first white noise, the LPC parameters, and the unvoiced sound error value;
performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected;
and correcting the unvoiced speech frame to be corrected based on the unvoiced sound error value to obtain a target unvoiced speech frame.
2. The method of claim 1, wherein before performing LPC synthesis according to the first white noise and the LPC parameters to obtain the unvoiced speech frame to be corrected, the method further comprises:
obtaining a frame number of an original unvoiced speech frame corresponding to the LPC parameter;
and generating the first white noise according to the frame number, wherein the first white noise is generated by a white noise generator of the decoding end.
3. The method of claim 1, wherein the unvoiced sound error value is the difference between an original unvoiced speech frame and a synthesized unvoiced speech frame, and the synthesized unvoiced speech frame is an unvoiced speech frame synthesized by the encoding end;
wherein correcting the unvoiced speech frame to be corrected based on the unvoiced sound error value to obtain the target unvoiced speech frame comprises:
calculating the sum of the unvoiced speech frame to be corrected and the unvoiced sound error value to obtain the target unvoiced speech frame.
4. A speech processing method applied to an encoding end, comprising:
acquiring an original unvoiced speech frame in original speech;
extracting parameters of the original unvoiced speech frame to obtain Linear Predictive Coding (LPC) parameters;
transforming the LPC parameters to obtain line spectrum pair LSP parameters;
performing vector quantization coding on the LSP parameters to obtain LSP quantized bit information;
acquiring an unvoiced sound error value based on the LSP quantized bit information and the original unvoiced speech frame;
performing vector quantization coding on the unvoiced sound error value to obtain unvoiced sound error bit information;
and transmitting the unvoiced sound error bit information and the LSP quantized bit information to a decoding end.
5. The method of claim 4, wherein the unvoiced sound error bit information is transmitted to the decoding end in idle coded bits of the unvoiced frame.
6. The method of claim 4, wherein acquiring the unvoiced sound error value based on the LSP quantized bit information and the original unvoiced speech frame comprises:
decoding the LSP quantized bit information to obtain LSP parameters;
performing transformation processing on the LSP parameters to obtain LPC parameters;
acquiring second white noise;
performing LPC synthesis according to the second white noise and the LPC parameters to obtain a synthesized unvoiced speech frame;
and calculating the difference between the original unvoiced speech frame and the synthesized unvoiced speech frame to obtain the unvoiced sound error value.
7. The method of claim 6, wherein prior to acquiring the second white noise, the method further comprises:
acquiring a frame number of the original unvoiced speech frame;
And generating the second white noise according to the frame number, wherein the second white noise is generated by a white noise generator of the encoding end and is identical to the first white noise generated by the white noise generator of the decoding end according to the frame number.
8. A speech processing device applied to a decoding end, comprising:
The first acquisition module is used for receiving line spectrum pair LSP quantized bit information and unvoiced sound error bit information sent by an encoding end; decoding the unvoiced sound error bit information to obtain an unvoiced sound error value; decoding the LSP quantized bit information to obtain LSP parameters, and performing transformation processing on the LSP parameters to obtain Linear Predictive Coding (LPC) parameters; and acquiring first white noise, the LPC parameters, and the unvoiced sound error value;
The speech frame synthesis module is used for performing LPC synthesis according to the first white noise and the LPC parameters to obtain an unvoiced speech frame to be corrected;
And the correction module is used for correcting the unvoiced speech frame to be corrected based on the unvoiced sound error value to obtain a target unvoiced speech frame.
9. A speech processing apparatus for use at an encoding end, comprising:
the second acquisition module is used for acquiring an original unvoiced speech frame in original speech;
the parameter extraction module is used for extracting parameters of the original unvoiced speech frame to obtain Linear Predictive Coding (LPC) parameters;
the conversion module is used for transforming the LPC parameters to obtain line spectrum pair LSP parameters;
The coding module is used for carrying out vector quantization coding on the LSP parameters to obtain LSP quantized bit information;
The error value generation module is used for acquiring an unvoiced sound error value based on the LSP quantized bit information and the original unvoiced speech frame;
the coding module is further used for carrying out vector quantization coding on the unvoiced sound error value to obtain unvoiced sound error bit information;
And the sending module is used for sending the unvoiced error bit information and the LSP quantized bit information to a decoding end.
10. An electronic device, comprising:
A processor; and
A memory for storing executable instructions of the processor;
wherein the processor is configured to perform the speech processing method of any of claims 1-7 via execution of the executable instructions.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the speech processing method of any of claims 1-7.
12. A computer program product comprising a computer program or computer instructions, characterized in that the computer program or the computer instructions are loaded and executed by a processor to cause a computer to implement the speech processing method according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202410670266.0A CN118248154B (en) | 2024-05-28 | 2024-05-28 | Speech processing method, device, electronic equipment, medium and program product |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN118248154A (en) | 2024-06-25 |
| CN118248154B (en) | 2024-08-06 |
Family
ID=91556869
Family Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202410670266.0A CN118248154B (en), Active | 2024-05-28 | 2024-05-28 | Speech processing method, device, electronic equipment, medium and program product |
Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN118248154B (en) |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| JPH05297897A (en) * | 1992-04-15 | 1993-11-12 | Sony Corp | Voiced sound deciding method |
| US5473727A (en) * | 1992-10-31 | 1995-12-05 | Sony Corporation | Voice encoding method and voice decoding method |
Family Cites Families (3)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| JPH0612098A (en) * | 1992-03-16 | 1994-01-21 | Sanyo Electric Co Ltd | Voice encoding device |
| US5765127A (en) * | 1992-03-18 | 1998-06-09 | Sony Corp | High efficiency encoding method |
| US6963833B1 (en) * | 1999-10-26 | 2005-11-08 | Sasken Communication Technologies Limited | Modifications in the multi-band excitation (MBE) model for generating high quality speech at low bit rates |
Also Published As

| Publication Number | Publication Date |
| --- | --- |
| CN118248154A (en) | 2024-06-25 |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |