WO2010079167A1 - Speech coding - Google Patents
- Publication number
- WO2010079167A1 (PCT/EP2010/050057)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal
- remaining
- speech
- version
- ltp
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/09—Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
- G10L19/10—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
- G10L19/12—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
Definitions
- the present invention relates to the encoding of speech for transmission over a transmission medium, such as by means of an electronic signal over a wired connection or electro-magnetic signal over a wireless connection.
- a source-filter model of speech is illustrated schematically in Figure 1a.
- speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104.
- the source signal represents the immediate vibration of the vocal cords.
- the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue.
- the effect of the filter is to alter the frequency profile of the source signal so as to emphasise or diminish certain frequencies.
- speech encoding works by representing the speech using parameters of a source-filter model.
- the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108.
- speech may be sampled at 16kHz and processed in frames of 20ms, with some of the processing done in subframes of 5ms (four subframes per frame).
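The framing arithmetic in the example above can be checked with a short sketch. The constant and function names below are illustrative assumptions, not terms from the patent; only the 16 kHz, 20 ms and 5 ms figures come from the text.

```python
# Illustrative framing arithmetic; names are assumptions, values are from the text.
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20
SUBFRAME_MS = 5


def samples_for(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples covering duration_ms at the given sample rate."""
    return sample_rate_hz * duration_ms // 1000


frame_len = samples_for(FRAME_MS)                # samples per 20 ms frame
subframe_len = samples_for(SUBFRAME_MS)          # samples per 5 ms subframe
subframes_per_frame = frame_len // subframe_len  # subframes in one frame


def split_into_subframes(frame):
    """Split one frame of samples into consecutive, equal-length subframes."""
    if len(frame) != frame_len:
        raise ValueError("expected exactly one frame of samples")
    return [frame[i:i + subframe_len] for i in range(0, frame_len, subframe_len)]
```

With these figures a frame holds 320 samples and each of its four subframes holds 80.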
- Each frame comprises a flag 107 by which it is classed according to its respective type.
- Each frame is thus classed at least as either "voiced" or "unvoiced", and unvoiced frames are encoded differently than voiced frames.
- Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
- For voiced sounds, the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice.
- the source signal can be modelled as comprising a quasi-periodic signal, with each period corresponding to a respective "pitch pulse" comprising a series of peaks of differing amplitudes.
- the source signal is said to be "quasi" periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant; but over many subframes or frames then the period and form of the signal may change.
- the approximated period at any given point may be referred to as the pitch lag.
- An example of a modelled source signal 202 is shown schematically in Figure 2a with a gradually varying period P₁, P₂, P₃, etc., each comprising a pitch pulse of four peaks which may vary gradually in form and amplitude from one period to the next.
- a short-term filter is used to separate out the speech signal into two separate components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal.
- the signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage.
- Figure 2b shows a schematic example of a sequence of spectral envelopes 204₁, 204₂, 204₃, etc. varying over time.
- the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in Figure 2a.
- the short-term filter works by removing short-term correlations (i.e. short term compared to the pitch period), leading to an LPC residual with less energy than the speech signal.
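As a rough illustration of how a short-term (LPC) analysis filter lowers the signal energy, the sketch below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion, then computes the LPC residual. This is a generic textbook approach and not necessarily the estimation method used in the patent; the synthetic test signal, the order of 2 and all function names are assumptions.

```python
import random


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x_hat[n] = sum_k a[k] * x[n - k]."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, coeffs):
    """Subtract the short-term prediction from each sample (zero past history)."""
    residual = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual


def energy(x):
    return sum(v * v for v in x)


# Assumed test signal: noise shaped by a fixed two-pole filter (a stand-in for speech).
rng = random.Random(0)
speech = []
for n in range(320):
    prev1 = speech[-1] if n >= 1 else 0.0
    prev2 = speech[-2] if n >= 2 else 0.0
    speech.append(1.3 * prev1 - 0.6 * prev2 + rng.gauss(0.0, 1.0))

coeffs = levinson_durbin(autocorrelation(speech, 2), 2)
residual = lpc_residual(speech, coeffs)
```

Comparing `energy(residual)` with `energy(speech)` shows the reduction that the text attributes to removing short-term correlations.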
- each subframe 108 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) a set of parameters representing the pulses of the source signal 202.
- LTP long-term prediction
- correlation being a statistical measure of a degree of relationship between groups of data, in this case the degree of repetition between portions of a signal.
- the source signal can be said to be "quasi" periodic in that on a timescale of at least one correlation calculation it can be taken to have a meaningful period which is approximately (but not exactly) constant; but over many such calculations then the period and form of the source signal may change more significantly.
- a set of parameters derived from this correlation is determined to at least partially represent the source signal for each subframe.
- LTP residual signal representing the source signal with the effect of the correlation between pitch periods removed.
- LTP vectors and LTP residual signal are encoded separately for transmission.
- the sets of LPC parameters, the LTP vectors and the LTP residual signal are each quantised prior to transmission (quantisation being the process of converting a continuous range of values into a set of discrete values, or a larger approximately continuous set of discrete values into a smaller set of discrete values).
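A minimal sketch of quantisation in the sense defined above, assuming a uniform scalar quantiser with a fixed step size. The patent does not prescribe this particular quantiser, and the function names and step size are illustrative.

```python
def quantise(value, step):
    """Map a continuous value to the index of the nearest multiple of step."""
    return round(value / step)


def dequantise(index, step):
    """Recover the discrete value represented by a quantisation index."""
    return index * step


step = 0.25  # assumed step size
values = [0.03, -0.61, 1.49, 2.0]
indices = [quantise(v, step) for v in values]
reconstructed = [dequantise(i, step) for i in indices]
```

Only the integer indices would need to be transmitted; the reconstruction error per value is at most half a step.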
- each subframe 108 would comprise: (i) a quantised set of LPC parameters representing the spectral envelope, (ii)(a) a quantised LTP vector related to the correlation between pitch periods in the source signal, and (ii)(b) a quantised LTP residual signal representative of the source signal with the effects of this inter-period correlation removed.
- Figure 3a shows a diagram of a linear predictive speech encoder 300 comprising an LPC synthesis filter 306 having a short-term predictor 308 and an LTP synthesis filter 304 having a long-term predictor 310.
- the output of the short-term predictor 308 is subtracted from the speech input signal to produce an LPC residual signal.
- the output of the long-term predictor 310 is subtracted from the LPC residual signal to create an LTP residual signal.
- the LTP residual signal is quantized by a quantizer 302 to produce an excitation signal, and to produce corresponding quantisation indices for transmission to a decoder to allow it to recreate the excitation signal.
- the quantizer 302 can be a scalar quantizer, a vector quantizer, an algebraic codebook quantizer, or any other suitable quantizer.
- the output of a long term predictor 310 in the LTP synthesis filter 304 is added to the excitation signal, which creates the LPC excitation signal.
- the LPC excitation signal is input to the long-term predictor 310, which is a strictly causal moving average (MA) filter controlled by the pitch lag and quantized LTP coefficients.
- MA moving average
- the output of a short term predictor 308 in the LPC synthesis filter 306 is added to the LPC excitation signal, which creates the quantized output signal that is fed back for subtraction from the input.
- the quantized output signal is input to the short-term predictor 308, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
- Figure 3b shows a linear predictive speech decoder 350.
- Quantization indices are input to an excitation generator 352 which generates an excitation signal.
- the output of a long term predictor 360 in a LTP synthesis filter 354 is added to the excitation signal, which creates the LPC excitation signal.
- the LPC excitation signal is input to the long-term predictor 360, which is a strictly causal MA filter controlled by the pitch lag and quantized LTP coefficients.
- the output of a short term predictor 358 in a short-term synthesis filter 356 is added to the LPC excitation signal, which creates the quantized output signal.
- the quantized output signal is input to the short-term predictor 358, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
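The signal flow of the encoder of Figure 3a and the decoder of Figure 3b can be sketched sample by sample as follows. This is a simplified illustration under stated assumptions: a single LTP coefficient, fixed LPC coefficients, a uniform scalar quantizer, and zero initial filter states. All function names and parameter values are hypothetical.

```python
import math


def short_term_prediction(output, lpc_coeffs):
    """Strictly causal prediction from past quantized output samples."""
    return sum(a * output[-1 - k] for k, a in enumerate(lpc_coeffs) if k < len(output))


def long_term_prediction(lpc_excitation, pitch_lag, ltp_coeff):
    """Strictly causal prediction from the LPC excitation one pitch lag ago."""
    if len(lpc_excitation) < pitch_lag:
        return 0.0
    return ltp_coeff * lpc_excitation[-pitch_lag]


def encode(speech, lpc_coeffs, pitch_lag, ltp_coeff, step):
    indices, lpc_excitation, output = [], [], []
    for x in speech:
        st = short_term_prediction(output, lpc_coeffs)
        lt = long_term_prediction(lpc_excitation, pitch_lag, ltp_coeff)
        ltp_residual = (x - st) - lt         # LPC residual minus long-term prediction
        index = round(ltp_residual / step)   # quantization index to transmit
        excitation = index * step            # quantized excitation signal
        indices.append(index)
        lpc_excitation.append(excitation + lt)
        output.append(lpc_excitation[-1] + st)  # quantized output, fed back
    return indices, output


def decode(indices, lpc_coeffs, pitch_lag, ltp_coeff, step):
    lpc_excitation, output = [], []
    for index in indices:
        st = short_term_prediction(output, lpc_coeffs)
        lt = long_term_prediction(lpc_excitation, pitch_lag, ltp_coeff)
        excitation = index * step            # excitation generator
        lpc_excitation.append(excitation + lt)
        output.append(lpc_excitation[-1] + st)
    return output


# Assumed test input and parameters.
speech = [math.sin(2 * math.pi * n / 40) + 0.3 * math.sin(2 * math.pi * n / 7)
          for n in range(320)]
params = dict(lpc_coeffs=[0.9], pitch_lag=40, ltp_coeff=0.5, step=0.05)
indices, encoder_output = encode(speech, **params)
decoder_output = decode(indices, **params)
```

Because the encoder runs the same synthesis filters as the decoder inside its feedback loop, the decoder reproduces the encoder's quantized output exactly from the indices, provided both start from the same filter states.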
- the encoder 300 works by using an LTP analysis (not shown) to determine a correlation between successive received pitch pulses in the LPC residual signal, then passing coefficients of that correlation to the LTP synthesis filter where they are used to predict a version of the later of those pitch pulses from a stored version of the earlier of those pitch pulses based on the correlation.
- the predicted version of the later pitch pulse is fed back to the input where it is subtracted from the corresponding portion in the actual LPC residual signal, thus removing the effect of the periodicity and thereby deriving an LTP residual signal.
- a correlation is determined between the pulses of periods P₁ and P₂ and then used to predict the pulse of P₂; the predicted version of P₂ is then subtracted from the actual version to leave a residual which represents the degree to which P₁ was not correlated with P₂, and so the degree to which the LPC residual signal was not entirely periodic.
- the LTP synthesis filter uses a long-term prediction to effectively remove or reduce the pitch pulses from the LPC residual signal, leaving an LTP residual signal having lower energy than the LPC residual.
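The energy reduction described above can be illustrated with a single-tap long-term predictor applied to a synthetic quasi-periodic LPC residual. The pulse shape, amplitudes, fixed pitch lag and least-squares gain estimate are all assumptions made for the sketch; the patent refers more generally to one or more pitch lags and LTP coefficients.

```python
PITCH_LAG = 40                                # assumed pitch period in samples
PULSE = [1.0, -0.6, 0.3, -0.1]                # assumed pitch pulse shape (four peaks)
AMPLITUDES = [1.0, 0.9, 1.1, 1.0, 0.8, 0.9]   # slowly varying pulse amplitude

# Build a quasi-periodic LPC residual: one pulse at the start of each period.
lpc_residual = []
for amp in AMPLITUDES:
    period = [0.0] * PITCH_LAG
    for i, p in enumerate(PULSE):
        period[i] = amp * p
    lpc_residual.extend(period)


def ltp_gain(signal, lag):
    """Least-squares gain for predicting signal[n] from signal[n - lag]."""
    num = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
    den = sum(signal[n - lag] ** 2 for n in range(lag, len(signal)))
    return num / den if den > 0.0 else 0.0


def ltp_residual(signal, lag, gain):
    """Subtract the prediction made from the previous pitch period."""
    return [signal[n] - gain * signal[n - lag] for n in range(lag, len(signal))]


def energy(x):
    return sum(v * v for v in x)


gain = ltp_gain(lpc_residual, PITCH_LAG)
residual = ltp_residual(lpc_residual, PITCH_LAG, gain)
target_energy = energy(lpc_residual[PITCH_LAG:])  # energy before long-term prediction
```

Here `energy(residual)` is much smaller than `target_energy`, because each pulse closely resembles the one a pitch lag earlier.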
- a method of encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter comprising: receiving a speech signal; from the speech signal, deriving a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; deriving a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and transmitting an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; wherein the method further comprises, once every number of said intervals, transforming the stored version of the
- parameters used to derive the first remaining signal may be updated between deriving the respective earlier portion and generating the predicted version of the respective later portion; and said transformation may be performed at said one or more intervals and may comprise updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
- the encoding may be performed over a plurality of frames each comprising a plurality of subframes, and each of said intervals may be a subframe; said deriving of the second remaining signal may be performed once per subframe whilst parameters used to derive the first remaining signal may be updated once per frame, hence at one subframe per frame then the predicted version of the later portion may be generated from the earlier portion as derived using a previous frame's parameters but used to remove said effect of periodicity from the first remaining signal as derived using a current frame's parameters; and said transformation of the stored version of the earlier portion may be performed at said one subframe per frame and may comprise updating the stored version of the respective earlier portion of the first residual signal using the current frame's parameters.
- the method may comprise determining said correlations using at least one of an open-loop pitch analysis and a long-term prediction analysis, and at least one of those analyses may be based on a version of the first remaining signal derived using said updated parameters for both the previous and current frames.
- Said transformation may be so as to result in a greater reduction in overall energy of the second remaining signal relative to the first remaining signal than without said transformation.
- Said transformation may comprise re-whitening the stored version of the earlier portion.
- the encoded signal may be transmitted as a plurality of packets each encoding a plurality of said intervals, and said transformation of the stored version of the earlier portion may be performed once per packet so as to reduce error propagation caused by potential packet loss in the transmission. Said transformation may be performed for the first interval of each packet.
- Said transformation may be based on information about the packet loss in a channel used for said transmission.
- Said transformation may comprise scaling down the stored version of the earlier portion by a scaling factor.
- the scaling factor may be selected from one of a plurality of specified factors.
- Said specified factors may have substantially the values of 0.5, 0.7 and 0.95.
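One way the selection among the specified factors might look is sketched below. The factor values 0.5, 0.7 and 0.95 come from the text; the packet-loss thresholds and all names are purely hypothetical, since the text does not state how the loss information maps to a factor.

```python
# Factor values are from the text; the ordering and thresholds are assumptions.
SCALING_FACTORS = (0.95, 0.7, 0.5)


def select_scaling_index(packet_loss_rate):
    """Pick a stronger down-scaling as the reported packet loss grows (hypothetical rule)."""
    if packet_loss_rate < 0.02:
        return 0
    if packet_loss_rate < 0.10:
        return 1
    return 2


def scale_ltp_state(ltp_state, index):
    """Scale the stored earlier portion by the factor that the index selects."""
    factor = SCALING_FACTORS[index]
    return [factor * s for s in ltp_state]


index = select_scaling_index(0.15)
scaled = scale_ltp_state([2.0, -4.0], index)
```

Transmitting the index rather than the factor itself lets the decoder apply the same scaling as the encoder.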
- Said periodicity may correspond to a perceived pitch of the speech signal.
- the derivation of said spectral envelope signal may be by linear predictive coding (LPC) such that said first remaining signal is an LPC residual signal.
- LPC linear predictive coding
- Said stored versions of the earlier portions may be stored in the form of a quantized excitation corresponding to respective portions of said LPC residual signal.
- Said derivation of the second remaining signal may be by long-term prediction (LTP) such that said second remaining signal is an LTP residual signal.
- LTP long-term prediction
- Each of said stored versions of the earlier portions may comprise an LTP state.
- a method of decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the method comprising: receiving an encoded signal over a communication medium; from the encoded signal, determining a spectral envelope signal representative of the modelled filter; from the encoded signal, determining a second remaining signal; deriving a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and generating a decoded speech signal based on the
- an encoder for encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter
- the encoder comprising: an input arranged to receive a speech signal; a first signal processing module configured to derive, from the speech signal, a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; a second signal processing module configured to derive a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and an output arranged to transmit an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second
- a decoder for decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter
- the decoder comprising: an input arranged to receive an encoded signal; a first signal processing module configured to determine, from the encoded signal, a spectral envelope signal representative of the modelled filter; and a second signal processing module configured to determine, from the encoded signal, a second remaining signal; wherein the second signal processing module is further configured to derive a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the
- a communication system comprising a plurality of end-user terminals each comprising a corresponding encoder and/or decoder.
- Figure 1a is a schematic representation of a source-filter model of speech
- Figure 1b is a schematic representation of a frame
- Figure 2a is a schematic representation of a source signal
- Figure 2b is a schematic representation of variations in a spectral envelope
- Figure 3a is a schematic block diagram of an encoder
- Figure 3b is a schematic block diagram of a decoder
- Figure 4a shows graphs of an LPC residual, LTP state and LTP residual
- Figure 4b shows further graphs of an LPC residual, LTP state and LTP residual
- Figure 4c shows graphs illustrating error propagation
- Figure 4d shows graphs of an LTP residual according to a number of methods
- Figure 4e is another schematic representation of a frame
- Figure 5 is another schematic block diagram of an encoder
- Figure 6 is a schematic block diagram of a noise shaping quantizer
- Figure 7 is another schematic block diagram of a decoder
- Figure 8 shows schematically an LTP state and LTP synthesis filter output
- Figure 9 shows graphs illustrating resynchronisation of the LTP state.
- long-term prediction is a known technique in speech coding whereby correlations between pitch pulses are exploited to improve coding efficiency.
- a long term prediction filter uses one or more pitch lags and one or more LTP coefficients to compute an LTP residual signal from an LPC residual.
- the LTP residual has smaller variance and can thus be encoded more efficiently than the LPC residual.
- the pitch lags and LTP coefficients are sent to the decoder together with the coded LTP residual, or excitation.
- the excitation is used to reconstruct the LPC excitation signal, using an LTP synthesis filter.
- the LTP state is sometimes called the adaptive codebook.
- the long-term prediction process can itself introduce problems such as difficulties in the prediction or propagation of errors.
- the present invention overcomes such problems by providing a method for modifying the LTP state in predictive speech coding.
- the LTP state is the stored LPC residual or LPC excitation signal from the previous pitch period, from which the following pitch period is to be predicted in order to remove it from the current LPC residual signal and thus derive the LTP residual signal.
- the invention can apply to any situation where a first signal is derived to represent the source signal of a source-filter model of speech, and a second signal is then derived by calculating correlation between earlier and later portions of the first signal which have a degree of repetition therebetween. By transforming the stored earlier portion, it is possible to compensate for changes in the filter part of the source-filter model that would lead to an LTP residual signal having a greater energy than it should.
- the long-term prediction filter removes or reduces the pitch pulses in the LPC residual signal by predicting one pitch pulse based on the previous one.
- the LPC residual of one frame is typically computed based on quantized LPC coefficients that may differ from those used to compute the LPC residual of the next frame.
- Reasons for the difference in quantized LPC coefficients may be the natural evolution of the spectral envelope in speech, and numerical fluctuations in the estimation and quantization of the LPC coefficients.
- the difference in quantized LPC coefficients may cause the last pitch pulse of one frame to have a significantly different shape than the first pitch pulse of the next frame. Consequently, the last pitch pulse of a frame may be a poor basis for predicting the first pitch pulse of the next frame.
- the LPC coefficients are typically updated at the start of a frame, and sometimes also during a frame, for example when interpolated LPC coefficients are used.
- the long-term predictor outputs a signal that is based on the LPC coefficients before the update of the LPC coefficients.
- the long-term predictor output is subtracted, with the aim of minimizing an LPC residual signal based on the LPC coefficients after the update of the LPC coefficients. This creates an inconsistency, where the LTP state is not perfectly suited for minimizing the LTP residual.
- the shape of the pitch pulse in the LPC residual signal is influenced by the LPC coefficients, and after updating the LPC coefficients the LTP state may contain a pitch pulse with a different shape than the pitch pulse in the LPC residual.
- the long-term predictor is not able to create a long-term prediction signal that efficiently minimizes the LTP residual.
- problems of this kind can be solved by modifying the LTP state synchronously in encoder and decoder when LPC coefficients are updated. This improves the long-term prediction performance and thus improves coding efficiency.
- a new LTP state is created by whitening the quantized speech signal with an LPC analysis filter controlled by the LPC coefficients after the update of the LPC coefficients.
- the LTP state is updated in this manner in encoder and decoder synchronously.
- a preferred modification is to "re-whiten" the LTP state using the same whitening (LPC) coefficients as were used for generating the LPC residual ("whitening" is to equalize the LTP state to flatten its spectral density, so that its energy is more evenly spread across its frequency spectrum). Note that this modification does not necessarily make the LTP state whiter: the preferred modification is to whiten the stored LTP state using the same, updated whitening (LPC) coefficients as used for generating the current LPC residual.
- LPC whitening
- the whitening operation is done by the LPC analysis, and the LPC synthesis performs the reverse of a whitening operation. Therefore the LTP state was already whitened, but using the LPC coefficients of the previous frame. To compensate for this, the preferred modification is therefore to:
- the modification may be referred to as a "re-whitening" (i.e. re-applying a whitening LPC analysis filter), rather than a whitening per se.
- Figure 4a shows the effect of different LPC coefficients used for generating the LTP state and LPC residual.
- the top graph shows an LPC residual for one frame of 20 milliseconds, containing six similarly shaped pitch pulses.
- the similarity allows a long term predictor to create an LTP residual with significantly less energy than the LPC residual.
- the long term predictor uses a state containing the previous pitch pulse.
- the long-term predictor uses an LTP state depending on the LPC coefficients from the previous frame.
- the LTP state is shown in the second graph, and contains a pitch pulse with different shape.
- the LTP residual shown in the bottom graph has more energy for the first pitch pulse than for the subsequent pitch pulses.
- Figure 4b shows the same LPC residual in the top graph, but now the LTP state has been modified by first inputting it into an LPC synthesis filter controlled by the LPC coefficients from the previous frame, and then whitening the output from the LPC synthesis filter with an LPC analysis filter controlled by the LPC coefficients of the current frame.
- the result is an LTP state containing a pitch pulse that matches the first pitch pulse of the LPC residual better. Consequently, the LTP residual in the bottom graph contains less energy for the first pitch pulse than with the unmodified LTP state.
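The two-step modification described for Figure 4b (LPC synthesis with the previous frame's coefficients, followed by LPC analysis with the current frame's coefficients) can be sketched as below. The coefficient values, the test signal, the function names and the assumption of zero initial filter memories are all illustrative; a real codec would carry filter memories across frames.

```python
import math


def lpc_analysis(signal, coeffs):
    """Whitening filter: subtract the short-term prediction from each sample."""
    out = []
    for n, x in enumerate(signal):
        pred = sum(a * signal[n - 1 - k] for k, a in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(x - pred)
    return out


def lpc_synthesis(residual, coeffs):
    """Inverse of the whitening filter: add the prediction from past outputs."""
    out = []
    for e in residual:
        pred = sum(a * out[-1 - k] for k, a in enumerate(coeffs) if k < len(out))
        out.append(e + pred)
    return out


def rewhiten_ltp_state(ltp_state, old_coeffs, new_coeffs):
    """Undo the previous frame's whitening, then whiten with the current frame's coefficients."""
    quantized_speech = lpc_synthesis(ltp_state, old_coeffs)
    return lpc_analysis(quantized_speech, new_coeffs)


old_coeffs = [1.2, -0.5]   # assumed previous-frame LPC coefficients
new_coeffs = [0.9, -0.3]   # assumed current-frame LPC coefficients
speech = [math.sin(2 * math.pi * n / 40) * math.exp(-n / 60) for n in range(80)]

ltp_state = lpc_analysis(speech, old_coeffs)   # state as left by the previous frame
new_state = rewhiten_ltp_state(ltp_state, old_coeffs, new_coeffs)
```

In this sketch the re-whitened state matches what direct analysis of `speech` with `new_coeffs` would give, which is the consistency between LTP state and current LPC residual that the modification aims for.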
- the LTP residual can be coded more efficiently, thereby reducing the bitrate of the codec.
- One approach to reduce the error propagation is to set constraints on the LTP coefficients so as to shorten the impulse response of the LTP synthesis filter. Doing this lowers the coding gain from LTP, so a higher bit rate is needed to maintain the output speech quality in loss-free conditions.
- the problem of error propagation can be reduced by down-scaling the LTP state synchronously in encoder and decoder.
- error propagation is controlled by scaling down the LTP filter state in both encoder and decoder at the start of each new packet. This gives a better trade off between LTP prediction gain and packet loss error propagation, which translates to a better trade off between bitrate and packet loss sensitivity.
- Figure 4d illustrates different methods for limiting error propagation.
- the top graph shows one 20ms frame of LPC residual.
- the other three graphs show one corresponding frame of LTP residual for three different methods.
- the second graph shows the LTP residual for unconstrained LTP coefficients.
- the LTP residual can be seen to be much reduced compared to the LPC residual.
- the third graph shows the LTP residual for LTP coefficients scaled by 0.5.
- the LTP residual is less reduced than for the unconstrained method.
- the last graph shows the LTP residual for an LTP state scaled by 0.5, with the optimal LTP coefficients used unaltered.
- the first pitch pulse is less reduced, similar to the LTP residual signal with scaled LTP coefficients. However, the remainder of the LTP residual is as much reduced as with the unconstrained method.
- the method of scaling the LTP state is better than scaling the LTP coefficients, because of the higher coding efficiency for the signal after the first pitch pulse.
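The comparison of the three methods can be reproduced on a toy signal. The sketch below is illustrative only: it assumes a single LTP tap, perfectly repeating pitch pulses and no quantization, none of which is stated in the text, and all names are hypothetical.

```python
def ltp_residual(frame, state, lag, coef, state_scale=1.0):
    """Single-tap long-term prediction without quantization (an illustrative simplification)."""
    history = [state_scale * x for x in state]  # LTP state, optionally scaled down
    residual = []
    for x in frame:
        residual.append(x - coef * history[-lag])
        history.append(x)  # without quantization the reconstructed sample equals the input
    return residual


def energy(signal):
    return sum(x * x for x in signal)


pulse = [1.0, -0.5, 0.25, 0.0, 0.0]  # toy pitch pulse with a period of 5 samples
frame = pulse * 4                     # four identical pitch periods
state = list(pulse)                   # previous pitch pulse held in the LTP state

unconstrained = energy(ltp_residual(frame, state, lag=5, coef=1.0))
scaled_coefs = energy(ltp_residual(frame, state, lag=5, coef=0.5))
scaled_state = energy(ltp_residual(frame, state, lag=5, coef=1.0, state_scale=0.5))
```

In this toy case scaling the state only leaves residual energy in the first pitch period, whereas scaling the coefficient leaves residual energy in every period.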
- FIG. 4e is a schematic representation of a frame according to a preferred embodiment of the present invention.
- the frame additionally comprises an index 110 of the scaling value selected to multiply the LTP state by.
- the encoder 500 comprises a high-pass filter 502, a linear predictive coding (LPC) analysis block 504, a first vector quantizer 506, an open-loop pitch analysis block 508, a long-term prediction (LTP) analysis block 510, a second vector quantizer 512, a noise shaping analysis block 514, a noise shaping quantizer 516, and an arithmetic encoding block 518.
- the LTP analysis block 510 comprises a scaling control module 520, which will be discussed later in relation to Figure 6.
- the high pass filter 502 has an input arranged to receive an input speech signal from an input device such as a microphone, and an output coupled to inputs of the LPC analysis block 504, noise shaping analysis block 514 and noise shaping quantizer 516.
- the LPC analysis block has an output coupled to an input of the first vector quantizer 506, and the first vector quantizer 506 has outputs coupled to inputs of the arithmetic encoding block 518 and noise shaping quantizer 516.
- the LPC analysis block 504 has outputs coupled to inputs of the open-loop pitch analysis block 508 and the LTP analysis block 510.
- the LTP analysis block 510 has an output coupled to an input of the second vector quantizer 512, and the second vector quantizer 512 has outputs coupled to inputs of the arithmetic encoding block 518 and noise shaping quantizer 516.
- the open-loop pitch analysis block 508 has outputs coupled to inputs of the LTP analysis block 510 and the noise shaping analysis block 514.
- the noise shaping analysis block 514 has outputs coupled to inputs of the arithmetic encoding block 518 and the noise shaping quantizer 516.
- the noise shaping quantizer 516 has an output coupled to an input of the arithmetic encoding block 518.
- the arithmetic encoding block 518 is arranged to produce an output bitstream based on its inputs, for transmission from an output device such as a wired modem or wireless transceiver.
- the encoder processes a speech input signal sampled at 16 kHz in frames of 20 milliseconds, with some of the processing done in subframes of 5 milliseconds.
- the output bitstream payload contains arithmetically encoded parameters, and has a bitrate that varies depending on a quality setting provided to the encoder and on the complexity and perceptual importance of the input signal.
- the speech input signal is input to the high-pass filter 502 to remove frequencies below 80 Hz which contain almost no speech energy and may contain noise that can be detrimental to the coding efficiency and cause artifacts in the decoded output signal.
- the high-pass filter 502 is preferably a second order auto-regressive moving average (ARMA) filter.
- n is the sample number.
- the LPC coefficients are used with an LPC analysis filter to create the LPC residual.
- the LPC residual is computed for the current frame, and also for the previous frame using the LPC coefficients derived for the current frame.
- the effect of this is to use an LPC residual generated with constant LPC coefficients in the open loop pitch analysis and LTP analysis. Having the last pitch pulse in the previous frame generated with the same LPC coefficients as the pitch pulses in the current frame improves the open loop pitch estimation and LTP analysis. This is particularly applicable when applying the re-whitening in the noise shaping quantizer 516 as described below.
- the LPC coefficients are transformed to a line spectral frequency (LSF) vector.
- LSFs are quantized using the first vector quantizer 506, a multi-stage vector quantizer (MSVQ) with 10 stages, producing 10 LSF indices that together represent the quantized LSFs.
- MSVQ multi-stage vector quantizer
- the quantized LSFs are transformed back to produce the quantized LPC coefficients for use in the noise shaping quantizer 516.
- the LPC residual is input to the open loop pitch analysis block 508, producing one pitch lag for every 5 millisecond subframe, i.e., four pitch lags per frame.
- the pitch lags are chosen between 32 and 288 samples, corresponding to pitch frequencies from 56 to 500 Hz, which covers the range found in typical speech signals.
- the pitch analysis produces a pitch correlation value which is the normalized correlation of the signal in the current frame and the signal delayed by the pitch lag values. Frames for which the correlation value is below a threshold of 0.5 are classified as unvoiced, i.e., containing no periodic signal, whereas all other frames are classified as voiced.
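The relation between pitch lag and pitch frequency, and the voicing decision, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical and the correlation is computed for a single lag over one span, whereas the text describes one lag per subframe.

```python
import math

SAMPLE_RATE_HZ = 16000
VOICING_THRESHOLD = 0.5  # threshold stated in the text


def lag_to_frequency(lag):
    """Pitch frequency in Hz for a pitch lag given in samples."""
    return SAMPLE_RATE_HZ / lag


def pitch_correlation(signal, start, length, lag):
    """Normalized correlation of signal[start:start+length] with the same span delayed by lag."""
    cur = signal[start:start + length]
    delayed = signal[start - lag:start - lag + length]
    num = sum(a * b for a, b in zip(cur, delayed))
    den = math.sqrt(sum(a * a for a in cur) * sum(b * b for b in delayed))
    return num / den if den > 0.0 else 0.0


def is_voiced(correlation):
    """Frames below the threshold are unvoiced; all others are voiced."""
    return correlation >= VOICING_THRESHOLD
```

A lag of 32 samples maps to 500 Hz and a lag of 288 samples to roughly 56 Hz, matching the range quoted above.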
- the pitch lags are input to the arithmetic coder 518 and noise shaping quantizer 516.
- LPC residual r_LPC is supplied from the LPC analysis block 504 to the LTP analysis block 510.
- the LTP analysis block 510 solves normal equations to find 5 linear prediction filter coefficients b_i such that the energy in the LTP residual r_LTP for that subframe is minimized.
- W_LTP is a weighting matrix containing correlation values
- the LTP residual is computed as the LPC residual in the current subframe minus a filtered and delayed LPC residual.
- the LPC residual in the current subframe and the delayed LPC residual are both generated with an LPC analysis filter controlled by the same LPC coefficients. That means that when the LPC coefficients are updated, an LPC residual is computed not only for the current frame but also a new LPC residual is computed for at least lag + 2 samples preceding the current frame.
- the LTP coefficients for each frame are quantized using a vector quantizer (VQ).
- VQ vector quantizer
- the resulting VQ codebook index is input to the arithmetic coder, and the quantized LTP coefficients bQ are input to the noise shaping quantizer.
- the high-pass filtered input is analyzed by the noise shaping analysis block 514 to find filter coefficients and quantization gains used in the noise shaping quantizer.
- the filter coefficients determine the distribution of the quantization noise over the spectrum, and are chosen such that the quantization is least audible.
- the quantization gains determine the step size of the residual quantizer and as such govern the balance between bitrate and quantization noise level.
- All noise shaping parameters are computed and applied per subframe of 5 milliseconds.
- a 16th order noise shaping LPC analysis is performed on a windowed signal block of 16 milliseconds.
- the signal block has a look-ahead of 5 milliseconds relative to the current subframe, and the window is an asymmetric sine window.
- the noise shaping LPC analysis is done with the autocorrelation method.
- the quantization gain is found as the square-root of the residual energy from the noise shaping LPC analysis, multiplied by a constant to set the average bitrate to the desired level.
- the quantization gain is further multiplied by 0.5 times the inverse of the pitch correlation determined by the pitch analysis, to reduce the level of quantization noise which is more easily audible for voiced signals.
- the quantization gain for each subframe is quantized, and the quantization indices are input to the arithmetic encoder 518.
- the quantized quantization gains are input to the noise shaping quantizer 516.
- the noise shaping quantizer also applies long-term noise shaping. It uses three filter taps, described by:
- the short-term and long-term noise shaping coefficients are input to the noise shaping quantizer 516.
- the high-pass filtered input is also input to the noise shaping quantizer 516.
- the noise shaping quantizer 516 comprises a first addition stage 602, a first subtraction stage 604, a first amplifier 606, a scalar quantizer 608, a second amplifier 609, a second addition stage 610, a shaping filter 612, a prediction filter 614 and a second subtraction stage 616.
- the shaping filter 612 comprises a third addition stage 618, a long-term shaping block 620, a third subtraction stage 622, and a short-term shaping block 624.
- the prediction filter 614 comprises a fourth addition stage 626, a long-term prediction block 628, a fourth subtraction stage 630, and a short-term prediction block 632.
- the long term prediction block 628 comprises an LTP buffer 634.
- the first addition stage 602 has an input arranged to receive the high-pass filtered input from the high-pass filter 502, and another input coupled to an output of the third addition stage 618.
- the first subtraction stage has inputs coupled to outputs of the first addition stage 602 and fourth addition stage 626.
- the first amplifier has a signal input coupled to an output of the first subtraction stage and an output coupled to an input of the scalar quantizer 608.
- the first amplifier 606 also has a control input coupled to the output of the noise shaping analysis block 514.
- the scalar quantiser 608 has outputs coupled to inputs of the second amplifier 609 and the arithmetic encoding block 518.
- the second amplifier 609 also has a control input coupled to the output of the noise shaping analysis block 514, and an output coupled to an input of the second addition stage 610.
- the other input of the second addition stage 610 is coupled to an output of the fourth addition stage 626.
- An output of the second addition stage is coupled back to the input of the first addition stage 602, and to an input of the short-term prediction block 632 and the fourth subtraction stage 630.
- An output of the short-term prediction block 632 is coupled to the other input of the fourth subtraction stage 630.
- the output of the fourth subtraction stage 630 is coupled to the input of the long-term prediction block 628.
- the fourth addition stage 626 has inputs coupled to outputs of the long-term prediction block 628 and short-term prediction block 632.
- the output of the second addition stage 610 is further coupled to an input of the second subtraction stage 616, and the other input of the second subtraction stage 616 is coupled to the input from the high-pass filter 502.
- An output of the second subtraction stage 616 is coupled to inputs of the short-term shaping block 624 and the third subtraction stage 622.
- An output of the short-term shaping block 624 is coupled to the other input of the third subtraction stage 622.
- the output of third subtraction stage 622 is coupled to the input of the long-term shaping block 620.
- the third addition stage 618 has inputs coupled to outputs of the long-term shaping block 620 and short-term shaping block 624.
- the short-term and long-term shaping blocks 624 and 620 are each also coupled to the noise shaping analysis block 514, and the long-term shaping block 620 is also coupled to the open-loop pitch analysis block 508 (connections not shown). Further, the short-term prediction block 632 is coupled to the LPC analysis block 504 via the first vector quantizer 506, and the long-term prediction block 628 is coupled to the LTP analysis block 510 via the second vector quantizer 512 (connections also not shown).
- the purpose of the noise shaping quantizer 516 is to quantize the LTP residual signal in a manner that weights the distortion noise created by the quantisation into less noticeable parts of the frequency spectrum, e.g. where the human ear is more tolerant to noise, and/or where the speech energy is high so that the relative effect of the noise is less.
- the noise shaping quantizer 516 generates a quantized output signal that is identical to the output signal ultimately generated in the decoder.
- the input signal is subtracted from this quantized output signal at the second subtraction stage 616 to obtain the quantization error signal d(n).
- the quantization error signal is input to a shaping filter 612, described in detail later.
- the output of the shaping filter 612 is added to the input signal at the first addition stage 602 in order to effect the spectral shaping of the quantization noise. From the resulting signal, the output of the prediction filter 614, described in detail below, is subtracted at the first subtraction stage 604 to create a residual signal.
- the residual signal is multiplied at the first amplifier 606 by the inverse of the quantized quantization gain from the noise shaping analysis block 514, and input to the scalar quantizer 608.
- the quantization indices of the scalar quantizer 608 represent a signal that is input to the arithmetic encoder 518.
- the scalar quantizer 608 also outputs a quantization signal, which is multiplied at the second amplifier 609 by the quantized quantization gain from the noise shaping analysis block 514 to create an excitation signal.
- the output of the prediction filter 614 is added at the second addition stage to the excitation signal to form the quantized output signal.
- the quantized output signal is input to the prediction filter 614.
- residual is obtained by subtracting a prediction from the input speech signal.
- excitation is based on only the quantizer output. Often, the residual is simply the quantizer input and the excitation is its output.
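The relationship between the residual (quantizer input) and the excitation (quantizer output) can be sketched with a gain-scaled scalar quantizer. This is a simplified illustration: it processes a whole block at once and omits the prediction and noise shaping feedback loops, and the function name is hypothetical.

```python
def quantize_residual(residual, gain):
    """Scale by the inverse gain, round to integer indices, then scale back to form the excitation."""
    indices = [round(r / gain) for r in residual]  # indices passed on for arithmetic encoding
    excitation = [q * gain for q in indices]       # signal the decoder can recreate from the indices
    return indices, excitation
```

A larger gain gives a coarser step size, so fewer distinct indices are needed at the cost of a larger difference between residual and excitation.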
- the shaping filter 612 inputs the quantization error signal d(n) to a short-term shaping filter 624, which uses the short-term shaping coefficients a_shape(i) to create a short-term shaping signal s_short(n), according to the formula:
- the short-term shaping signal is subtracted at the third subtraction stage 622 from the quantization error signal to create a shaping residual signal f(n).
- the shaping residual signal is input to a long-term shaping filter 620 which uses the long-term shaping coefficients b_shape(i) to create a long-term shaping signal s_long(n).
- the short-term and long-term shaping signals are added together at the third addition stage 618 to create the shaping filter output signal.
- the prediction filter 614 inputs the quantized output signal y(n) to a short-term prediction filter 632, which uses the quantized LPC coefficients a_Q to create a short-term prediction signal p_short(n), according to the formula:
- the LPC excitation signal is input to a long-term prediction filter 628 which calculates a prediction signal using the filter coefficients that were derived from correlations in the LTP analysis block 510 (see Figure 5). That is, long-term prediction filter 628 uses the quantized long-term prediction coefficients b_Q(i) to create a long-term prediction signal p_long(n), according to the formula:
- the LPC excitation signal e_LPC(n) is stored in an LTP buffer 634 in the long-term prediction block 628.
- the LTP buffer 634 is of length at least equal to the maximum pitch lag of 288 plus 2.
- the signal contained in the LTP buffer 634 is the LTP filter state.
- the long-term prediction block 628 may modify the LPC excitation e_LPC(n) stored in the encoder LTP buffer 634 at the start of every new input frame classified as voiced, when the quantized LPC coefficients a_Q are updated.
- the modification consists of replacing the LTP filter state with a new LPC excitation signal e_LPC,new computed from the quantized output signal y(n) and the new quantized LPC coefficients a_Q,new:
- $e_{LPC,new}(n) = y(n) - \sum_{i=1}^{16} y(n-i)\, a_{Q,new}(i)$
- the scaling control module 520 in the LTP analysis block 510 may scale down the LTP filter state stored in the LTP buffer 634 at the beginning of every new input frame, before the noise shape quantization is started. Sometimes multiple frames are combined within one packet, in which case the LTP scaling should preferably be applied only for the first frame of each packet (whereas the re-whitening is preferably done for every frame).
- the scaling value is passed from the scaling control module 520 in the LTP analysis block 510 to the long-term prediction block 628 in the noise shaping quantizer 516, where it is used to scale the LTP state stored in the LTP buffer 634.
- the LTP scaling value is calculated by a scaling control module 636 based on information about the packet loss in the channel and information about the speech signal. This module 636 chooses between three scaling values of 0.5, 0.7 or 0.95, where 0.5 gives most error propagation resilience and lowest coding efficiency, and 0.95 gives least error propagation resilience and highest coding efficiency.
- the scaling control module 520 calculates a sensitivity measure that is compared with two thresholds, one for using a scaling value of 0.5 and one for 0.7. Default is using a scaling value of 0.95.
- the sensitivity measure predicts how sensitive the current frame is to errors in the LTP filter state due to packet losses. It is calculated using the following formula:
- PG_LTP is the long-term prediction gain, measured as the ratio of the energies of the LPC residual r_LPC and the LTP residual r_LTP
- the sensitivity measure is a combination of the LTP prediction gain and a high pass version of the same measure.
- the LTP prediction gain is chosen because it directly relates the LTP state error with the output signal error.
- the high pass part is added to put emphasis on signal changes.
- a changing signal has high risk of giving severe error propagation because the LTP state in encoder and decoder will most likely be very different, after a packet loss.
- An example is when losing a voiced onset (see Figure 4c).
- the distribution of scaling values for the frames is dependent on the loss percentage where more of the frames get a scaling value less than 0.95 when the channel loss rate increases. This is done by lowering the sensitivity thresholds as the loss rate increases.
- the state control module 636 only assigns scaling values lower than 0.95 for frames that are the first in a packet.
- the scaling value is supplied from the scaling control module 520 in the LTP analysis block 510 to the arithmetic encoder 518, and from there is transmitted on to the decoder in the encoded signal.
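The selection among the three scaling values can be sketched as a pair of threshold comparisons. This is a hedged illustration: the threshold values are not given in the text and are passed in as hypothetical parameters, and it is assumed that a higher sensitivity selects a lower scaling value, with the threshold for 0.5 above the threshold for 0.7.

```python
def select_ltp_scale(sensitivity, threshold_05, threshold_07, first_in_packet=True):
    """Pick an LTP state scaling value from a sensitivity measure (thresholds are hypothetical)."""
    if not first_in_packet:
        return 0.95  # lower values are only assigned to the first frame of a packet
    if sensitivity > threshold_05:
        return 0.5   # most error propagation resilience, lowest coding efficiency
    if sensitivity > threshold_07:
        return 0.7
    return 0.95      # default: least resilience, highest coding efficiency
```

Lowering both thresholds as the channel loss rate rises would make more frames receive a value below 0.95, in line with the behaviour described above.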
- the short-term and long-term prediction signals are added together to create the prediction filter output signal.
- the LTP state can be either the LPC residual or the LPC excitation signal, depending on details of the encoder. Typically however, as in the described embodiments, it is the LPC excitation signal.
- the LSF indices, LTP indices, quantization gains indices, pitch lags, LTP scaling value indices (if used), and quantization indices are each arithmetically encoded and multiplexed at the arithmetic encoder 518 to create the payload bitstream.
- the arithmetic encoder 518 uses a look-up table with probability values for each index. The look-up tables are created by running a database of speech training signals and measuring frequencies of each of the index values. The frequencies are translated into probabilities through a normalization step.
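The construction of a look-up table from training data can be sketched as counting followed by normalization. The function name and the toy index sequence are hypothetical; the sketch only shows the normalization step and is not an arithmetic coder.

```python
from collections import Counter


def build_probability_table(training_indices, num_symbols):
    """Measure how often each index value occurs and normalize the counts to probabilities."""
    counts = Counter(training_indices)
    total = len(training_indices)
    return [counts.get(symbol, 0) / total for symbol in range(num_symbols)]
```

An index value that never occurs in the training data gets probability zero here; a practical coder would presumably need a nonzero floor for such values, which this sketch does not attempt.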
- the decoder 700 comprises an arithmetic decoding and dequantizing block 702, an excitation generation block 704, an LTP synthesis filter 706, and an LPC synthesis filter 708.
- the LTP synthesis filter 706 comprises an LTP buffer 710.
- the arithmetic decoding and dequantizing block 702 has an input arranged to receive an encoded bitstream from an input device such as a wired modem or wireless transceiver, and has outputs coupled to inputs of each of the excitation generation block 704, LTP synthesis filter 706 and LPC synthesis filter 708.
- the excitation generation block 704 has an output coupled to an input of the LTP synthesis filter 706, and the LTP synthesis block 706 has an output connected to an input of the LPC synthesis filter 708.
- the LPC synthesis filter has an output arranged to provide a decoded output for supply to an output device such as a speaker or headphones.
- the arithmetically encoded bitstream is demultiplexed and decoded to create LSF indices, LTP indices, LTP scaling value indices (if used), quantization gains indices, pitch lags and a signal of quantization indices.
- the LSF indices are converted to quantized LSFs by adding the codebook vectors of the ten stages of the MSVQ.
- the quantized LSFs are transformed to quantized LPC coefficients.
- the LTP codebook is then used to convert the LTP indices to quantized LTP coefficients.
- the gains indices are converted to quantization gains, through look ups in the gain quantization codebook.
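The conversion of LSF indices to quantized LSFs, by adding one codebook vector per MSVQ stage, can be sketched as below. The codebook contents are not given in the text, so `codebooks` is a hypothetical nested list indexed as `codebooks[stage][index]`.

```python
def decode_msvq(indices, codebooks):
    """Sum one codebook vector per stage to rebuild the quantized vector."""
    result = [0.0] * len(codebooks[0][0])
    for stage_index, codebook in zip(indices, codebooks):
        for k, value in enumerate(codebook[stage_index]):
            result[k] += value
    return result
```

With ten stages, `indices` would hold ten entries and `codebooks` ten codebooks; the example works for any number of stages.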
- the excitation quantization indices signal is multiplied by the quantization gain to create an excitation signal e(n).
- the excitation signal is input to the LTP synthesis filter 706 to create the LPC excitation signal e_LPC(n) according to:
- the excitation signal e(n) is stored in an LTP buffer of length at least equal to the maximum pitch lag of 288, plus 2.
- the signal contained in the LTP buffer is the LTP filter state.
- the LPC excitation signal is input to an LPC synthesis filter to create the decoded speech signal y(n) according to
- the LTP synthesis filter 706 may modify the LPC excitation e_LPC(n) stored in the decoder LTP buffer 710 at the start of every new input frame classified as voiced, when the quantized LPC coefficients a_Q are updated.
- the modification consists of replacing the LTP filter state with a new LPC excitation signal e_LPC,new computed from the decoded speech signal y(n) and the new quantized LPC coefficients a_Q,new
- the LTP synthesis filter 706 may use the decoded LTP scale value to scale down the LTP filter state at the beginning of every new input frame, before LTP synthesis filtering is started. That is, scaling the LPC excitation e_LPC(n).
- the LTP synthesis filter 706 in the decoder 700 may use its knowledge about the LTP scale value to improve the LTP state synchronization further as discussed below.
- Figure 8 shows the relationship between the LTP state and LTP synthesis filter output.
- the LTP synthesis filter delays the LTP state by the pitch lag and convolves it with the LTP filter coefficients to create a filtered LTP state.
- the filtered LTP state is added to the excitation signal to create the LTP synthesis filter output, or LPC excitation signal.
- the left figure shows the filtered LTP state and excitation signal without downscaling, where they are orthogonal. If after a packet loss the LTP state is set to zero, resulting in a zero filtered LTP state, then the excitation signal provides the best approximation to the LTP synthesis output that would have been generated if no packet loss had occurred.
- the figure to the right shows the excitation signal with LTP downscaling, using the same optimal LTP coefficients.
- the vectors are not orthogonal anymore, and have positive correlation.
- the positive correlation between filtered reduced LTP state and excitation can be exploited after a packet loss. If after a loss the LTP state is set to zero, the excitation can be scaled up to give a closer match to the LTP synthesis output that would have been generated if no packet loss had occurred.
- the optimal scaling is not known on the decoder side, but can be estimated using for instance a trained statistical approach. It was found heuristically that good performance is obtained when upscaling by a factor of 1.4 when the LTP scale value is 0.7 and upscaling by a factor of 1.8 when the LTP scale value is 0.5.
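The heuristic upscaling after a packet loss can be sketched as a small look-up. Only the two factors quoted above come from the text; leaving the excitation unchanged for any other LTP scale value (such as 0.95) is an assumption, and the names are hypothetical.

```python
UPSCALE_BY_LTP_SCALE = {0.7: 1.4, 0.5: 1.8}  # factors quoted in the text


def upscale_excitation(excitation, ltp_scale):
    """Scale up the excitation of the first frame after a loss, when the LTP state was zeroed."""
    factor = UPSCALE_BY_LTP_SCALE.get(ltp_scale, 1.0)  # assumption: no upscaling otherwise
    return [factor * e for e in excitation]
```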
- LTP state is not set to zero after packet loss, but is approximated using the signal generated with a packet loss concealment unit
- another enhancement is used in the decoder.
- the knowledge of LTP state scaling is exploited by changing the phase of the decoder LTP filter state such as to optimize the correlation with the LTP residual signal for the duration of the first pitch period.
- This enhancement is useful when the pitch lag used by the concealment unit drifts away from the pitch lag used in the encoder.
- the advantage is illustrated in Figure 9, which is an illustration of the resynchronisation of the LTP state after packet loss.
- the first plot is a voiced speech signal without packet loss.
- the second plot illustrates a signal with a lost packet between the vertical lines.
- LTP state scaling by 0.5 is used, but because the phase of the pitch pulse drifts in the concealment signal compared to the lossless signal, the signal after the loss contains a large error in the pitch pulse shape.
- the last plot shows how synchronising the LTP state such that correlation between filtered LTP state and excitation signal is maximized improves the pitch pulse shape.
- the encoder 500 and decoder 700 are preferably implemented in software, such that each of the components 502 to 634 and 702 to 710 comprise modules of software stored on one or more memory devices and executed on a processor.
- a preferred application of the present invention is to encode speech for transmission over a packet-based network such as the Internet, preferably using a peer-to-peer (P2P) system implemented over the Internet, for example as part of a live call such as a Voice over IP (VoIP) call.
- P2P peer-to-peer
- VoIP Voice over IP
- the encoder 500 and decoder 700 are preferably implemented in client application software executed on end-user terminals of two users communicating over the P2P system.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
A method, system and computer program for encoding speech according to a source-filter model. The method comprises deriving a spectral envelope signal representative of a modelled filter and a first remaining signal representative of a modelled source signal, and deriving a second remaining signal from the first remaining signal by, at intervals during the encoding: exploiting a correlation between approximately periodic portions in the first remaining signal to generate a predicted version of a later portion from a stored version of an earlier portion, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal. The method further comprises, once every number of intervals, transforming the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
Description
Speech Coding
Field of the Invention
The present invention relates to the encoding of speech for transmission over a transmission medium, such as by means of an electronic signal over a wired connection or electro-magnetic signal over a wireless connection.
Background
A source-filter model of speech is illustrated schematically in Figure 1a. As shown, speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104. The source signal represents the immediate vibration of the vocal cords, and the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue. The effect of the filter is to alter the frequency profile of the source signal so as to emphasise or diminish certain frequencies. Instead of trying to directly represent an actual waveform, speech encoding works by representing the speech using parameters of a source-filter model.
As illustrated schematically in Figure 1b, the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108. For example, speech may be sampled at 16kHz and processed in frames of 20ms, with some of the processing done in subframes of 5ms (four subframes per frame). Each frame comprises a flag 107 by which it is classed according to its respective type. Each frame is thus classed at least as either "voiced" or "unvoiced", and unvoiced frames are encoded differently than voiced frames. Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
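The example framing can be worked through in a few lines: at 16 kHz, a 20 ms frame holds 320 samples and a 5 ms subframe holds 80. The sketch below is illustrative only; the function name is hypothetical, and dropping a trailing partial frame is an assumption.

```python
SAMPLE_RATE_HZ = 16000
FRAME_MS = 20
SUBFRAME_MS = 5

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000        # 320 samples
SUBFRAME_LEN = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000  # 80 samples


def split_into_frames(samples):
    """Return a list of frames, each a list of four subframes; a trailing partial frame is dropped."""
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        frames.append([frame[i:i + SUBFRAME_LEN] for i in range(0, FRAME_LEN, SUBFRAME_LEN)])
    return frames
```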
For voiced sounds (e.g. vowel sounds), the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice. In that case, the source signal can be modelled as comprising a quasi-periodic signal with each period corresponding to a respective "pitch pulse" comprising a series of peaks of differing amplitudes. The source signal is said to be "quasi" periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant; but over many subframes or frames then the period and form of the signal may change. The approximated period at any given point may be referred to as the pitch lag. An example of a modelled source signal 202 is shown schematically in Figure 2a with a gradually varying period P1, P2, P3, etc., each comprising a pitch period of four peaks which may vary gradually in form and amplitude from one period to the next.
According to many speech coding algorithms such as those using Linear Predictive Coding (LPC), a short-term filter is used to separate out the speech signal into two separate components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal. The signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage. Figure 2b shows a schematic example of a sequence of spectral envelopes 204_1, 204_2, 204_3, etc. varying over time. Once the varying spectral envelope is removed, the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in Figure 2a. The short-term filter works by removing short-term correlations (i.e. short term compared to the pitch period), leading to an LPC residual with less energy than the speech signal.
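How a short-term filter yields a residual with less energy than the input can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration rather than the method of any particular codec: the function names are hypothetical, no windowing is applied, and the filter starts from zero memory.

```python
def autocorrelation(signal, order):
    """Autocorrelation values r[0..order] of the signal."""
    return [sum(signal[n] * signal[n - k] for n in range(k, len(signal)))
            for k in range(order + 1)]


def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion; returns a[1..order] with x(n) predicted as sum a[i] * x(n-i)."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for m in range(1, order + 1):
        if error <= 0.0:
            break  # signal is silent or already perfectly predicted
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / error
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]


def lpc_residual(signal, coeffs):
    """Subtract the short-term prediction from each sample (zero memory before the first sample)."""
    return [x - sum(c * signal[n - i] for i, c in enumerate(coeffs, start=1) if n - i >= 0)
            for n, x in enumerate(signal)]
```

For a strongly correlated input such as a decaying alternating sequence, almost all of the energy is removed by a first-order predictor, leaving a residual concentrated in the first sample.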
The spectral envelope signal and the source signal are each encoded separately for transmission. In the illustrated example, each subframe 108 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) a set of parameters representing the pulses of the source signal 202.
To improve the encoding of the source signal, its periodicity may be exploited. To do this, a long-term prediction (LTP) analysis is used to determine the correlation of the LPC residual signal with itself from one period to the next, i.e. the correlation between the LPC residual signal at the current time and the LPC residual signal after one period at the current pitch lag (correlation being a statistical measure of a degree of relationship between groups of data, in this case the degree of repetition between portions of a signal). In this context the source signal can be said to be "quasi" periodic in that on a timescale of at least one correlation calculation it can be taken to have a meaningful period which is approximately (but not exactly) constant; but over many such calculations then the period and form of the source signal may change more significantly. A set of parameters derived from this correlation are determined to at least partially represent the source signal for each subframe. The set of parameters for each subframe is typically a set of coefficients C of a series, which form a respective vector C_LTP = (C_1, C_2, ..., C_i).
The effect of this inter-period correlation is then removed from the LPC residual, leaving an LTP residual signal representing the source signal with the effect of the correlation between pitch periods removed. To represent the source signal, the LTP vectors and LTP residual signal are encoded separately for transmission.
The sets of LPC parameters, the LTP vectors and the LTP residual signal are each quantised prior to transmission (quantisation being the process of converting a continuous range of values into a set of discrete values, or a larger approximately continuous set of discrete values into a smaller set of discrete values). The advantage of separating out the LPC residual signal into the LTP vectors and LTP residual signal is that the LTP residual typically has a lower energy than the LPC residual, and so requires fewer bits to quantize.
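The quantisation step described above can be illustrated with a uniform scalar quantizer. This is only a minimal sketch: the function names, the step size and the sample values are assumptions chosen for illustration, and the text itself allows scalar, vector or algebraic codebook quantizers.

```python
def quantize(values, step):
    """Return, for each value, the index of the nearest multiple of `step`."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Reconstruct approximate values from the transmitted indices."""
    return [i * step for i in indices]


step = 0.1                          # hypothetical step size
ltp_residual = [0.26, -0.91, 0.04]  # hypothetical residual samples
indices = quantize(ltp_residual, step)
reconstructed = dequantize(indices, step)
```

Only the integer indices need to be transmitted. A lower-energy residual spans fewer quantisation levels at a given step size, which is why it tends to need fewer bits.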
So in the illustrated example, each subframe 106 would comprise: (i) a quantised set of LPC parameters representing the spectral envelope, (ii)(a) a quantised LTP vector related to the correlation between pitch periods in the source signal, and (ii)(b) a quantised LTP residual signal representative of the source signal with the effects of this inter-period correlation removed.
Figure 3a shows a diagram of a linear predictive speech encoder 300 comprising an LPC synthesis filter 306 having a short-term predictor 308 and an LTP synthesis filter 304 having a long-term predictor 310. The output of the short-term predictor 308 is subtracted from the speech input signal to produce an LPC residual signal. The output of the long-term predictor 310 is subtracted from the LPC residual signal to create an LTP residual signal. The LTP residual signal is quantized by a quantizer 302 to produce an excitation signal, and to produce corresponding quantisation indices for transmission to a decoder to allow it to recreate the excitation signal. The quantizer 302 can be a scalar quantizer, a vector quantizer, an algebraic codebook quantizer, or any other suitable quantizer. The output of the long-term predictor 310 in the LTP synthesis filter 304 is added to the excitation signal, which creates the LPC excitation signal. The LPC excitation signal is input to the long-term predictor 310, which is a strictly causal moving average (MA) filter controlled by the pitch lag and quantized LTP coefficients. The output of the short-term predictor 308 in the LPC synthesis filter 306 is added to the LPC excitation signal, which creates the quantized output signal that is fed back for subtraction from the input. The quantized output signal is input to the short-term predictor 308, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
Figure 3b shows a linear predictive speech decoder 350. Quantization indices are input to an excitation generator 352 which generates an excitation signal. The output of a long-term predictor 360 in an LTP synthesis filter 354 is added to the excitation signal, which creates the LPC excitation signal. The LPC excitation signal is input to the long-term predictor 360, which is a strictly causal MA filter controlled by the pitch lag and quantized LTP coefficients. The output of a short-term predictor 358 in a short-term synthesis filter 356 is added to the LPC excitation signal, which creates the quantized output signal. The quantized output signal is input to the short-term predictor 358, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
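The decoder signal flow of Figure 3b can be sketched as two feedback loops in series. The following is a simplified illustration rather than the actual decoder: the function names are hypothetical, the coefficients are held constant over the call, and the LTP taps are assumed to be centred on the pitch lag.

```python
def ltp_synthesis(excitation, lag, ltp_coefs, state):
    """Add the long-term prediction to the excitation (LTP synthesis filter).

    `state` holds past LPC excitation samples (the LTP state) and must
    contain at least lag + len(ltp_coefs) // 2 samples.
    """
    half = len(ltp_coefs) // 2
    if lag <= half or len(state) < lag + half:
        raise ValueError("lag too short or LTP state too small")
    hist = list(state)
    out = []
    for e in excitation:
        # The predictor only reads samples that already exist in `hist`.
        pred = sum(b * hist[len(hist) - lag - (k - half)]
                   for k, b in enumerate(ltp_coefs))
        y = e + pred
        hist.append(y)
        out.append(y)
    return out


def lpc_synthesis(lpc_excitation, lpc_coefs, state):
    """Add the short-term prediction to the LPC excitation (LPC synthesis)."""
    hist = list(state)  # past output samples, at least len(lpc_coefs) of them
    out = []
    for e in lpc_excitation:
        pred = sum(a * hist[-1 - i] for i, a in enumerate(lpc_coefs))
        y = e + pred
        hist.append(y)
        out.append(y)
    return out


def decode(excitation, lag, ltp_coefs, lpc_coefs, ltp_state, lpc_state):
    """Excitation -> LPC excitation -> quantized output signal."""
    lpc_excitation = ltp_synthesis(excitation, lag, ltp_coefs, ltp_state)
    return lpc_synthesis(lpc_excitation, lpc_coefs, lpc_state)
```

Each predictor is a finite sum over past samples, but because its output is fed back into its own input, an impulse in the excitation repeats every pitch lag with decaying amplitude.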
The encoder 300 works by using an LTP analysis (not shown) to determine a correlation between successive received pitch pulses in the LPC residual signal, then passing coefficients of that correlation to the LTP synthesis filter where they are used to predict a version of the later of those pitch pulses from a stored version of the earlier of those pitch pulses based on the correlation. The predicted version of the later pitch pulse is fed back to the input where it is subtracted from the corresponding portion in the actual LPC residual signal, thus removing the effect of the periodicity and thereby deriving an LTP residual signal. For example, referring to Figure 2a, a correlation is determined between the pulses of periods P1 and P2 then used to predict the pulse of P2, and the predicted version of P2 is then subtracted from the actual version to leave a residual which represents the degree to which P1 was not correlated with P2 and so the degree to which the LPC signal was not entirely periodic. Put another way, the LTP synthesis filter uses a long-term prediction to effectively remove or reduce the pitch pulses from the LPC residual signal, leaving an LTP residual signal having lower energy than the LPC residual.
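The idea of predicting one pitch pulse from the previous one can be sketched with a single-tap predictor on a synthetic pulse train. This is an illustrative simplification: the one-tap gain formula, the function names and the test signal are assumptions made for the example.

```python
def ltp_residual(lpc_res, lag, start):
    """One-tap long-term prediction of lpc_res[start:] from the samples
    one pitch lag earlier. Returns (gain, residual)."""
    cur = lpc_res[start:]
    past = lpc_res[start - lag:len(lpc_res) - lag]
    num = sum(c * p for c, p in zip(cur, past))
    den = sum(p * p for p in past)
    gain = num / den if den > 0.0 else 0.0
    return gain, [c - gain * p for c, p in zip(cur, past)]


def energy(x):
    return sum(v * v for v in x)


# Synthetic quasi-periodic LPC residual: four pitch periods whose pulse
# amplitude decays slightly from one period to the next.
lag = 40
pulse = [1.0, 0.6, -0.4, 0.2]
signal = []
for k in range(4):
    period = [0.0] * lag
    for i, v in enumerate(pulse):
        period[i] = v * (1.0 - 0.1 * k)
    signal.extend(period)

gain, residual = ltp_residual(signal, lag, start=lag)
```

Because successive pulses are strongly correlated, the residual left after subtracting the prediction carries only a small fraction of the energy of the original pulses.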
However, it would be desirable to improve some aspects of the LTP prediction, or of other such prediction based on a correlation between portions of a signal representing a source signal of a source-filter model.
Summary
According to one aspect of the present invention, there is provided a method of encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter, the method comprising: receiving a speech signal; from the speech signal, deriving a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; deriving a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and transmitting an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; wherein the method further comprises, once every number of said intervals, transforming the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
In embodiments, at one or more of said intervals, parameters used to derive the first remaining signal may be updated between deriving the respective earlier portion and generating the predicted version of the respective later portion; and said transformation may be performed at said one or more intervals and may comprise updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
The encoding may be performed over a plurality of frames each comprising a plurality of subframes, and each of said intervals may be a subframe; said deriving of the second remaining signal may be performed once per subframe whilst parameters used to derive the first remaining signal may be updated once per frame, hence at one subframe per frame then the predicted version of the later portion may be generated from the earlier portion as derived using a previous frame's parameters but used to remove said effect of periodicity from the first remaining signal as derived using a current frame's parameters; and said transformation of the stored version of the earlier portion may be performed at said one subframe per frame and may comprise updating the stored version of the respective earlier portion of the first residual signal using the current frame's parameters.
The method may comprise determining said correlations using at least one of an open-loop pitch analysis and a long-term prediction analysis, and at least one of those analyses may be based on a version of the first remaining signal derived using said updated parameters for both the previous and current frames.
Said transformation may be so as to result in a greater reduction in overall energy of the second remaining signal relative to the first remaining signal than without said transformation.
Said transformation may comprise re-whitening the stored version of the earlier portion.
The encoded signal may be transmitted as a plurality of packets each encoding a plurality of said intervals, and said transformation of the stored version of the earlier portion may be performed once per packet so as to reduce error propagation caused by potential packet loss in the transmission. Said transformation may be performed for the first interval of each packet.
Said transformation may be based on information about the packet loss in a channel used for said transmission.
Said transformation may comprise scaling down the stored version of the earlier portion by a scaling factor.
The scaling factor may be selected from one of a plurality of specified factors. Said specified factors may have substantially the values of 0.5, 0.7 and 0.95.
Said periodicity may correspond to a perceived pitch of the speech signal.
The derivation of said spectral envelope signal may be by linear predictive coding (LPC) such that said first remaining signal is an LPC residual signal.
Said stored versions of the earlier portions may be stored in the form of a quantized excitation corresponding to respective portions of said LPC residual signal.
Said derivation of the second remaining signal may be by long-term prediction (LTP) such that said second remaining signal is an LTP residual signal.
Each of said stored versions of the earlier portions may comprise an LTP state.
According to another aspect of the present invention, there is provided a method of decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the method comprising: receiving an encoded signal over a communication medium; from the encoded signal, determining a spectral envelope signal representative of the modelled filter; from the encoded signal, determining a second remaining signal; deriving a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and generating a decoded speech signal based on the first excitation signal and spectral envelope signal, and outputting the decoded speech signal to an output device.
According to another aspect of the present invention, there is provided an encoder for encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter, the encoder comprising: an input arranged to receive a speech signal; a first signal processing module configured to derive, from the speech signal, a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; a second signal processing module configured to derive a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and an output arranged to transmit an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; wherein the second signal processing module is further configured to transform, once every number of said intervals, the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
According to another aspect of the present invention, there is provided a decoder for decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the decoder comprising: an input arranged to receive an encoded signal; a first signal processing module configured to determine, from the encoded signal, a spectral envelope signal representative of the modelled filter; and a second signal processing module configured to determine, from the encoded signal, a second remaining signal; wherein the second signal processing module is further configured to derive a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and the decoder further comprises an output module configured to generate a decoded speech signal based on the first excitation signal and spectral envelope signal, and output the decoded speech signal to an output device.
According to further aspects of the present invention, there are provided corresponding computer program products such as client application products.
According to another aspect of the present invention, there is provided a communication system comprising a plurality of end-user terminals each comprising a corresponding encoder and/or decoder.
Brief Description of the Drawings
For a better understanding of the present invention and to show how it may be carried into effect, reference will now be made by way of example to the accompanying drawings in which:
Figure 1a is a schematic representation of a source-filter model of speech;
Figure 1b is a schematic representation of a frame;
Figure 2a is a schematic representation of a source signal;
Figure 2b is a schematic representation of variations in a spectral envelope;
Figure 3a is a schematic block diagram of an encoder;
Figure 3b is a schematic block diagram of a decoder;
Figure 4a shows graphs of an LPC residual, LTP state and LTP residual;
Figure 4b shows further graphs of an LPC residual, LTP state and LTP residual;
Figure 4c shows graphs illustrating error propagation;
Figure 4d shows graphs of an LTP residual according to a number of methods;
Figure 4e is another schematic representation of a frame;
Figure 5 is another schematic block diagram of an encoder;
Figure 6 is a schematic block diagram of a noise shaping quantizer;
Figure 7 is another schematic block diagram of a decoder;
Figure 8 shows schematically an LTP state and LTP synthesis filter output; and
Figure 9 shows graphs illustrating resynchronisation of the LTP state.
Detailed Description of Preferred Embodiments
As discussed, long-term prediction (LTP) is a known technique in speech coding whereby correlations between pitch pulses are exploited to improve coding efficiency. In the encoder, for frames classified as voiced, a long-term prediction filter uses one or more pitch lags and one or more LTP coefficients to compute an LTP residual signal from an LPC residual. The LTP residual has smaller variance and can thus be encoded more efficiently than the LPC residual. The pitch lags and LTP coefficients are sent to the decoder together with the coded LTP residual, or excitation. Here the excitation is used to reconstruct the LPC excitation signal, using an LTP synthesis filter. In speech codecs based on the Code Excited Linear Prediction (CELP) paradigm, the LTP state is sometimes called the adaptive codebook.
However, as discussed in more detail below, the long-term prediction process can itself introduce problems such as difficulties in the prediction or propagation of errors.
In preferred embodiments, the present invention overcomes such problems by providing a method for modifying the LTP state in predictive speech coding. The LTP state is the stored LPC residual or LPC excitation signal from the previous pitch period, from which the following pitch period is to be predicted in order to remove it from the current LPC residual signal and thus derive the LTP residual signal. More generally, the invention can apply to any situation where a first signal is derived to represent the source signal of a source-filter model of speech, and a second signal is then derived by calculating correlation between earlier and later portions of the first signal which have a degree of repetition therebetween. By transforming the stored earlier portion, it is possible to compensate for changes in the filter part of the source-filter model that would lead to an LTP residual signal having a greater energy than it should.
One particular problem with existing encoders is that fluctuations of the LPC coefficients from one frame to the next reduce the correlations between pitch pulses. This, in turn, leads to an LTP or other such residual signal having a greater energy than it should do, and therefore being less efficient to encode in that it requires more bits to quantize. The long-term prediction filter removes or reduces the pitch pulses in the LPC residual signal by predicting one pitch pulse based on the previous one. The LPC residual of one frame is typically computed based on quantized LPC coefficients that may differ from those used to compute the LPC residual of the next frame. Reasons for the difference in quantized LPC coefficients may be the natural evolution of the spectral envelope in speech, and numerical fluctuations in the estimation and quantization of the LPC coefficients. As the shape of a pitch pulse is influenced by the LPC coefficients, the difference in quantized LPC coefficients may cause the last pitch pulse of one frame to have a significantly different shape than the first pitch pulse of the next frame. Consequently, the last pitch pulse of a frame may be a poor basis for predicting the first pitch pulse of the next frame.
The LPC coefficients are typically updated at the start of a frame, and sometimes also during a frame, for example when interpolated LPC coefficients are used. For the duration of one pitch lag following an update of the LPC coefficients, the long-term predictor outputs a signal that is based on the LPC coefficients before the update of the LPC coefficients. However, during that time the long-term predictor output is subtracted, with the aim of minimizing an LPC residual signal based on the LPC coefficients after the update of the LPC coefficients. This creates an inconsistency, where the LTP state is not perfectly suited for minimizing the LTP residual. In particular, the shape of the pitch pulse in the LPC residual signal is influenced by the LPC coefficients, and after updating the LPC coefficients the LTP state may contain a pitch pulse with a different shape than the pitch pulse in the LPC residual. As a result, the long-term predictor is not able to create a long-term prediction signal that efficiently minimizes the LTP residual.
In embodiments of the present invention, problems of this kind can be solved by modifying the LTP state synchronously in encoder and decoder when LPC coefficients are updated. This improves the long-term prediction performance and thus improves coding efficiency. First a quantized speech signal is generated by inputting the LTP state into an LPC synthesis filter controlled by the LPC coefficients before the update of the LPC coefficients. Subsequently, a new LTP state is created by whitening the quantized speech signal with an LPC analysis filter controlled by the LPC coefficients after the update of the LPC coefficients. The LTP state is updated in this manner in encoder and decoder synchronously.
In this case, a preferred modification is to "re-whiten" the LTP state using the same whitening (LPC) coefficients as were used for generating the LPC residual ("whitening" is to equalize the LTP state to flatten its spectral density, so that its energy is more evenly spread across its frequency spectrum). Note that this modification does not necessarily make the LTP more white: the preferred modification is to whiten the stored LTP state using the same, updated whitening (LPC) coefficients as used for generating the current LPC residual.
To elaborate, the whitening operation is done by the LPC analysis, and the LPC synthesis performs the reverse of a whitening operation. Therefore the LTP state was already whitened, but using the LPC coefficients of the previous frame. To compensate for this, the preferred modification is therefore to:
(i) undo the whitening from the previous coefficients by running the LPC excitation through an LPC synthesis filter controlled by the LPC coefficients of the previous frame (note that this was done already anyway in forming the quantized output signal of the previous frame), and
(ii) subsequently whiten the LTP state again with the LPC coefficients of the current frame.
The result of this may actually make the LTP state slightly less white. Thus the modification may be referred to as a "re-whitening" (i.e. re-applying a whitening LPC analysis filter), rather than a whitening per se.
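Steps (i) and (ii) can be sketched as a synthesis filter followed by an analysis filter. This is a simplified illustration under stated assumptions: the function names are hypothetical and both filters start from zero memory, whereas a real codec would carry filter memory over from the previous frame.

```python
def lpc_synthesis(excitation, coefs):
    """Undo whitening: y(n) = e(n) + sum_i a_i * y(n - i), zero initial memory."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a * out[n - i]
                   for i, a in enumerate(coefs, start=1) if n - i >= 0)
        out.append(e + pred)
    return out


def lpc_analysis(signal, coefs):
    """Whiten: r(n) = x(n) - sum_i a_i * x(n - i), zero initial memory."""
    res = []
    for n, x in enumerate(signal):
        pred = sum(a * signal[n - i]
                   for i, a in enumerate(coefs, start=1) if n - i >= 0)
        res.append(x - pred)
    return res


def rewhiten_ltp_state(ltp_state, previous_coefs, current_coefs):
    """Step (i): synthesize with the previous frame's LPC coefficients.
    Step (ii): whiten again with the current frame's LPC coefficients."""
    quantized_speech = lpc_synthesis(ltp_state, previous_coefs)
    return lpc_analysis(quantized_speech, current_coefs)
```

If the coefficients did not change between frames, the two filters cancel and the LTP state is returned unchanged; the modification only has an effect when the LPC coefficients have been updated.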
Figure 4a shows the effect of different LPC coefficients used for generating the LTP state and LPC residual. The top graph shows an LPC residual for one frame of 20 milliseconds, containing six similarly shaped pitch pulses. The similarity allows a long-term predictor to create an LTP residual with significantly less energy than the LPC residual. To predict each pitch pulse, the long-term predictor uses a state containing the previous pitch pulse. However, for the first pitch pulse the long-term predictor uses an LTP state depending on the LPC coefficients from the previous frame. The LTP state is shown in the second graph, and contains a pitch pulse with a different shape. As a result, the LTP residual shown in the bottom graph has more energy for the first pitch pulse than for the subsequent pitch pulses.
Figure 4b shows the same LPC residual in the top graph, but now the LTP state has been modified by first inputting it into an LPC synthesis filter controlled by the LPC coefficients from the previous frame, and then whitening the output from the LPC synthesis filter with an LPC analysis filter controlled by the LPC coefficients of the current frame. The result is an LTP state containing a pitch pulse that matches the first pitch pulse of the LPC residual better. Consequently, the LTP residual in the bottom graph contains less energy for the first pitch pulse than with the unmodified LTP state.
Because modifying the LTP state in the manner described typically leads to an LTP residual with less energy, the LTP residual can be coded more efficiently, thereby reducing the bitrate of the codec.
Another particular problem with existing encoders is packet loss error propagation. When the LTP coefficients are optimized without constraints to minimize the LTP residual energy, the LTP synthesis filter in the decoder (a time-varying AR filter) often has a very long impulse response. This means that an error in the decoder due to a lost packet can have an effect over a long time. This effect is often called error propagation, as the error from the lost frame propagates into future frames. The effect of such error propagation is illustrated in Figure 4c. The top graph shows an error-free decoded signal, the middle graph shows a decoded signal with a packet loss between the vertical lines, and the bottom graph shows the difference between the two decoded signals. As illustrated, the difference lasts much longer than the duration of the lost packet.
One approach to reduce the error propagation is to set constraints on the LTP coefficients so as to shorten the impulse response of the LTP synthesis filter. By doing this the coding gain from LTP is lowered resulting in a need for higher bit rate to maintain the output speech quality in lossless condition.
According to further embodiments of the present invention, the problem of error propagation can be reduced by down-scaling the LTP state synchronously in encoder and decoder. Preferably, error propagation is controlled by scaling down the LTP filter state in both encoder and decoder at the start of each new packet. This gives a better trade off between LTP prediction gain and packet loss error propagation, which translates to a better trade off between bitrate and packet loss sensitivity.
Figure 4d illustrates different methods for limiting error propagation. The top graph shows one 20ms frame of LPC residual. The other three graphs show one corresponding frame of LTP residual for three different methods. The second graph shows the LTP residual for unconstrained LTP coefficients. The LTP residual can be seen to be much reduced compared to the LPC residual. The third graph shows the LTP residual for scaled LTP coefficients scaled by 0.5. The LTP residual is less reduced than for the unconstrained method. The last graph shows the LTP residual for a scaled LTP state scaled by 0.5, with the optimal LTP coefficients used unaltered. The first pitch pulse is less reduced, similar to the LTP residual signal with scaled LTP coefficients. However, the remainder of the LTP residual is as much reduced as with the unconstrained method.
When more than one pitch pulse sits in a frame, downscaling the state only gives an energy increase (and thereby bit rate increase) for the first pitch pulse, and the following pitch pulses are coded as efficiently as with the unconstrained method. In contrast, scaling the gains reduces coding efficiency for all pitch pulses.
When the scaling is set to zero in order to avoid all error propagation, the method of scaling the LTP state is better than scaling the LTP coefficients, because of the higher coding efficiency for the signal after the first pitch pulse.
The inventors' experiments have found that scaling the state is also more efficient when the scaling is between zero and one.
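The effect of down-scaling the LTP state on error propagation can be sketched on the decoder side alone. The example below is a simplified model: it uses a one-tap LTP synthesis filter, a hypothetical "lost" state of zeros, and made-up signal values; only the three scaling factors are taken from the summary above.

```python
SCALE_FACTORS = (0.5, 0.7, 0.95)  # values named in the summary


def scale_ltp_state(state, index):
    """Scale the stored LTP state by the factor selected by `index`."""
    factor = SCALE_FACTORS[index]
    return [factor * v for v in state]


def ltp_synthesize(excitation, state, lag, gain):
    """One-tap LTP synthesis: y(n) = e(n) + gain * y(n - lag)."""
    hist = list(state)  # needs at least `lag` past samples
    for e in excitation:
        hist.append(e + gain * hist[len(hist) - lag])
    return hist[len(state):]


lag, gain = 4, 0.9
good_state = [1.0, -0.5, 0.25, 0.0]  # state after an error-free packet
lost_state = [0.0, 0.0, 0.0, 0.0]    # assumed state after a lost packet
excitation = [0.1] * 12              # same excitation in both cases


def error_energy(scale_index=None):
    """Energy of the output mismatch caused by the wrong LTP state."""
    a, b = good_state, lost_state
    if scale_index is not None:
        a = scale_ltp_state(a, scale_index)
        b = scale_ltp_state(b, scale_index)
    ya = ltp_synthesize(excitation, a, lag, gain)
    yb = ltp_synthesize(excitation, b, lag, gain)
    return sum((p - q) ** 2 for p, q in zip(ya, yb))


ratio = error_energy(0) / error_energy()
```

In this linear model, scaling the state by 0.5 at the start of the packet scales the propagated error by the same factor, so its energy drops to a quarter, while the LTP coefficients used within the packet are left unconstrained.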
The selected scaling value is indicated in the encoded signal to the decoder, preferably once per frame if one frame is encoded per packet. Figure 4e is a schematic representation of a frame according to a preferred embodiment of the present invention. In addition to the classification flag 107 and subframes 108 as discussed in relation to Figure 1b, the frame additionally comprises an index 110 of the scaling value selected to multiply the LTP state by.
An example of an encoder 500 for implementing the present invention is now described in relation to Figure 5.
The encoder 500 comprises a high-pass filter 502, a linear predictive coding (LPC) analysis block 504, a first vector quantizer 506, an open-loop pitch analysis block 508, a long-term prediction (LTP) analysis block 510, a second vector quantizer 512, a noise shaping analysis block 514, a noise shaping quantizer 516, and an arithmetic encoding block 518. The LTP analysis block 510 comprises a scaling control module 520, which will be discussed later in relation to Figure 6. The high-pass filter 502 has an input arranged to receive an input speech signal from an input device such as a microphone, and an output coupled to inputs of the LPC analysis block 504, noise shaping analysis block 514 and noise shaping quantizer 516. The LPC analysis block 504 has an output coupled to an input of the first vector quantizer 506, and the first vector quantizer 506 has outputs coupled to inputs of the arithmetic encoding block 518 and noise shaping quantizer 516. The LPC analysis block 504 has outputs coupled to inputs of the open-loop pitch analysis block 508 and the LTP analysis block 510. The LTP analysis block 510 has an output coupled to an input of the second vector quantizer 512, and the second vector quantizer 512 has outputs coupled to inputs of the arithmetic encoding block 518 and noise shaping quantizer 516. The open-loop pitch analysis block 508 has outputs coupled to inputs of the LTP analysis block 510 and the noise shaping analysis block 514. The noise shaping analysis block 514 has outputs coupled to inputs of the arithmetic encoding block 518 and the noise shaping quantizer 516. The noise shaping quantizer 516 has an output coupled to an input of the arithmetic encoding block 518. The arithmetic encoding block 518 is arranged to produce an output bitstream based on its inputs, for transmission from an output device such as a wired modem or wireless transceiver.
In operation, the encoder processes a speech input signal sampled at 16 kHz in frames of 20 milliseconds, with some of the processing done in subframes of 5 milliseconds. The output bitstream payload contains arithmetically encoded parameters, and has a bitrate that varies depending on a quality setting provided to the encoder and on the complexity and perceptual importance of the input signal.
The speech input signal is input to the high-pass filter 502 to remove frequencies below 80 Hz, which contain almost no speech energy and may contain noise that can be detrimental to the coding efficiency and cause artifacts in the decoded output signal. The high-pass filter 502 is preferably a second-order auto-regressive moving average (ARMA) filter.
The high-pass filtered input x_HP is input to the linear prediction coding (LPC) analysis block 504, which calculates 16 LPC coefficients a_i using the covariance method which minimizes the energy of the LPC residual r_LPC:
$$r_{LPC}(n) = x_{HP}(n) - \sum_{i=1}^{16} x_{HP}(n-i)\, a_i$$
where n is the sample number. The LPC coefficients are used with an LPC analysis filter to create the LPC residual.
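The LPC analysis filter that creates the residual from given coefficients can be sketched as follows. The function name is hypothetical and samples before the start of the input are assumed to be zero; estimating the coefficients themselves (the covariance method) is not shown.

```python
def lpc_residual(x_hp, a):
    """r(n) = x(n) - sum_{i=1..order} a_i * x(n - i).

    `a[0]` holds a_1, `a[1]` holds a_2, and so on. Samples before the
    start of `x_hp` are treated as zero (an assumption of this sketch).
    """
    order = len(a)
    residual = []
    for n in range(len(x_hp)):
        prediction = sum(a[i - 1] * x_hp[n - i]
                         for i in range(1, order + 1) if n - i >= 0)
        residual.append(x_hp[n] - prediction)
    return residual
```

For example, a signal that decays by a constant factor of 0.9 per sample is predicted exactly by the single coefficient a_1 = 0.9, so only its first sample survives in the residual.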
In one particularly advantageous embodiment, here the LPC residual is computed for the current frame, and also for the previous frame using the LPC coefficients derived for the current frame. The effect of this is to use an LPC residual generated with constant LPC coefficients in the open loop pitch analysis and LTP analysis. Having the last pitch pulse in the previous frame generated with the same LPC coefficients as the pitch pulses in the current frame improves the open loop pitch estimation and LTP analysis. This is particularly applicable when applying the re-whitening in the noise shaping quantizer 516 as described below.
The LPC coefficients are transformed to a line spectral frequency (LSF) vector. The LSFs are quantized using the first vector quantizer 506, a multi-stage vector quantizer (MSVQ) with 10 stages, producing 10 LSF indices that together represent the quantized LSFs. The quantized LSFs are transformed back to produce the quantized LPC coefficients for use in the noise shaping quantizer 516.
The LPC residual is input to the open loop pitch analysis block 508, producing one pitch lag for every 5 millisecond subframe, i.e., four pitch lags per frame. The pitch lags are chosen between 32 and 288 samples, corresponding to pitch frequencies from 56 to 500 Hz, which covers the range found in typical speech signals. Also, the pitch analysis produces a pitch correlation value which is the normalized correlation of the signal in the current frame and the signal delayed by the pitch lag values. Frames for which the correlation value is below a threshold of 0.5 are classified as unvoiced, i.e., containing no periodic signal, whereas all other frames are classified as voiced. The pitch lags are input to the arithmetic coder 518 and noise shaping quantizer 516.
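The relationship between lag and pitch frequency, and the voiced/unvoiced decision, may be illustrated with the sketch below. The 16 kHz sampling rate is inferred from the stated lag and frequency ranges; the exhaustive lag search and all function names are assumptions, since the text does not specify how the open loop analysis searches for the lag.

```python
import math

FS_HZ = 16000  # assumed: 16000 / 32 = 500 Hz and 16000 / 288 is about 56 Hz


def lag_to_hz(lag):
    """Pitch frequency corresponding to a lag expressed in samples."""
    return FS_HZ / lag


def normalized_correlation(sig, start, length, lag):
    """Normalized correlation of sig[start:start+length] with itself delayed by lag."""
    cur = sig[start:start + length]
    dly = sig[start - lag:start - lag + length]
    num = sum(c * d for c, d in zip(cur, dly))
    den = math.sqrt(sum(c * c for c in cur) * sum(d * d for d in dly))
    return num / den if den > 0.0 else 0.0


def best_lag(sig, start, length, lo=32, hi=288):
    """Exhaustive search (an assumption) for the lag with the highest correlation."""
    return max(range(lo, hi + 1),
               key=lambda lag: normalized_correlation(sig, start, length, lag))


def is_voiced(correlation, threshold=0.5):
    """Below the threshold is unvoiced; everything else is voiced."""
    return correlation >= threshold
```

The caller must supply at least 288 samples of history before `start` so that the longest lag can be evaluated.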
For voiced frames, a long-term prediction analysis is performed on the LPC residual. The LPC residual $r_{LPC}$ is supplied from the LPC analysis block 504 to the LTP analysis block 510. For each subframe, the LTP analysis block 510 solves normal equations to find 5 linear prediction filter coefficients $b_i$ such that the energy in the LTP residual $r_{LTP}$ for that subframe:
$$r_{LTP}(n) = r_{LPC}(n) - \sum_{i=-2}^{2} r_{LPC}(n - \mathrm{lag} - i)\, b_i$$
is minimized. The normal equations are solved as:
$$b = W_{LTP}^{-1}\, C_{LTP}$$
where $W_{LTP}$ is a weighting matrix containing correlation values
$$W_{LTP}(i,j) = \sum_{n=0}^{79} r_{LPC}(n + 2 - \mathrm{lag} - i)\, r_{LPC}(n + 2 - \mathrm{lag} - j),$$
and $C_{LTP}$ is a correlation vector:
$$C_{LTP}(i) = \sum_{n=0}^{79} r_{LPC}(n)\, r_{LPC}(n + 2 - \mathrm{lag} - i).$$
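A minimal sketch of the LTP analysis for one subframe follows, assuming an 80-sample subframe (5 ms at an assumed 16 kHz) and matrix indices i, j running from 0 to 4, so that list position i holds the tap for offset i - 2 in the residual formula. The function names and the elimination-based solver are illustrative assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ltp_coefficients(r_lpc, start, lag, sub_len=80, taps=5):
    """Solve W b = C for one subframe beginning at index `start` of r_lpc.

    The returned list position i corresponds to tap offset i - 2.
    """
    half = taps // 2

    def delayed(n, i):
        # r_LPC(n + 2 - lag - i), with n counted from the start of the subframe
        return r_lpc[start + n + half - lag - i]

    W = [[sum(delayed(n, i) * delayed(n, j) for n in range(sub_len))
          for j in range(taps)]
         for i in range(taps)]
    C = [sum(r_lpc[start + n] * delayed(n, i) for n in range(sub_len))
         for i in range(taps)]
    return solve_linear(W, C)
```

For a residual made of pitch pulses spaced exactly `lag` samples apart and decaying by a factor of 0.8 per pulse, the solution puts 0.8 on the centre tap and zero on the others.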
Thus, the LTP residual is computed as the LPC residual in the current subframe minus a filtered and delayed LPC residual. The LPC residual in the current subframe and the delayed LPC residual are both generated with an LPC analysis filter controlled by the same LPC coefficients. That means that when the LPC coefficients are updated, an LPC residual is computed not only for the current frame, but a new LPC residual is also computed for at least lag + 2 samples preceding the current frame.
The LTP coefficients for each frame are quantized using a vector quantizer (VQ). The resulting VQ codebook index is input to the arithmetic coder, and the quantized LTP coefficients bQ are input to the noise shaping quantizer.
The high-pass filtered input is analyzed by the noise shaping analysis block 514 to find filter coefficients and quantization gains used in the noise shaping quantizer. The filter coefficients determine the distribution of the quantization noise over the spectrum, and are chosen such that the quantization noise is least audible. The quantization gains determine the step size of the residual quantizer and as such govern the balance between bitrate and quantization noise level.
All noise shaping parameters are computed and applied per subframe of 5 milliseconds. First, a 16th order noise shaping LPC analysis is performed on a windowed signal block of 16 milliseconds. The signal block has a look-ahead of 5 milliseconds relative to the current subframe, and the window is an asymmetric sine window. The noise shaping LPC analysis is done with the autocorrelation method. The quantization gain is found as the square-root of the residual energy from the noise shaping LPC analysis, multiplied by a constant to set the average bitrate to the desired level. For voiced frames, the quantization gain is further multiplied by 0.5 times the inverse of the pitch correlation determined by the pitch analysis, to reduce the level of quantization noise, which is more easily audible for voiced signals. The quantization gain for each subframe is quantized, and the quantization indices are input to the arithmetic encoder 518. The quantized quantization gains are input to the noise shaping quantizer 516.
Next, a set of short-term noise shaping coefficients $a_{shape,i}$ is found by applying bandwidth expansion to the coefficients found in the noise shaping LPC analysis.
This bandwidth expansion moves the roots of the noise shaping LPC polynomial towards the origin, according to the formula:
$$a_{shape,i} = a_{autocorr,i}\, g^{i}$$
where $a_{autocorr,i}$ is the i-th coefficient from the noise shaping LPC analysis, and for the bandwidth expansion factor $g$ a value of 0.94 was found to give good results.
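The effect of bandwidth expansion on the polynomial roots can be checked numerically. The sketch below assumes the prediction-error polynomial convention A(z) = 1 - Σ a_i z^-i, matching the sign used for the LPC residual; the function names are illustrative. Under that convention, scaling coefficient i by g^i gives A_shape(z) = A(z/g), so every root z0 of A moves to g·z0.

```python
def bandwidth_expand(a_autocorr, g=0.94):
    """a_shape_i = a_autocorr_i * g**i for i = 1..order (list index 0 holds i = 1)."""
    return [a * g ** (i + 1) for i, a in enumerate(a_autocorr)]


def lpc_polynomial(a, z):
    """Evaluate A(z) = 1 - sum_i a_i z^-i at a (possibly complex) point z."""
    return 1.0 - sum(ai * z ** -(i + 1) for i, ai in enumerate(a))
```

For example, with coefficients [1.3, -0.6] one root of A lies at 0.65 + j·sqrt(0.71)/2; after expansion with g = 0.94 the corresponding root lies at 0.94 times that value, closer to the origin.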
For voiced frames, the noise shaping quantizer also applies long-term noise shaping. It uses three filter taps, described by:
bshape = 0.5 sqrt(PitchCorrelation) [0.25, 0.5, 0.25].
The short-term and long-term noise shaping coefficients are input to the noise shaping quantizer 516. The high-pass filtered input is also input to the noise shaping quantizer 516.
An example of the noise shaping quantizer 516 is now discussed in relation to Figure 6.
The noise shaping quantizer 516 comprises a first addition stage 602, a first subtraction stage 604, a first amplifier 606, a scalar quantizer 608, a second amplifier 609, a second addition stage 610, a shaping filter 612, a prediction filter 614 and a second subtraction stage 616. The shaping filter 612 comprises a third addition stage 618, a long-term shaping block 620, a third subtraction stage 622, and a short-term shaping block 624. The prediction filter 614 comprises a fourth addition stage 626, a long-term prediction block 628, a fourth subtraction stage 630, and a short-term prediction block 632. The long term prediction block 628 comprises an LTP buffer 634.
The first addition stage 602 has an input arranged to receive the high-pass filtered input from the high-pass filter 502, and another input coupled to an output of the third addition stage 618. The first subtraction stage 604 has inputs coupled to outputs of the first addition stage 602 and fourth addition stage 626. The first amplifier 606 has a signal input coupled to an output of the first subtraction stage 604 and an output coupled to an input of the scalar quantizer 608. The first amplifier 606 also has a control input coupled to the output of the noise shaping analysis block 514. The scalar quantizer 608 has outputs coupled to inputs of the second amplifier 609 and the arithmetic encoding block 518. The second amplifier 609 also has a control input coupled to the output of the noise shaping analysis block 514, and an output coupled to an input of the second addition stage 610. The other input of the second addition stage 610 is coupled to an output of the fourth addition stage 626. An output of the second addition stage is coupled back to the input of the first addition stage 602, and to an input of the short-term prediction block 632 and the fourth subtraction stage 630. An output of the short-term prediction block 632 is coupled to the other input of the fourth subtraction stage 630. The output of the fourth subtraction stage 630 is coupled to the input of the long-term prediction block 628. The fourth addition stage 626 has inputs coupled to outputs of the long-term prediction block 628 and short-term prediction block 632. The output of the second addition stage 610 is further coupled to an input of the second subtraction stage 616, and the other input of the second subtraction stage 616 is coupled to the input from the high-pass filter 502. An output of the second subtraction stage 616 is coupled to inputs of the short-term shaping block 624 and the third subtraction stage 622.
An output of the short-term shaping block 624 is coupled to the other input of the third subtraction stage 622. The output of the third subtraction stage 622 is coupled to the input of the long-term shaping block 620. The third addition stage 618 has inputs coupled to outputs of the long-term shaping block 620 and short-term shaping block 624. The short-term and long-term shaping blocks 624 and 620 are each also coupled to the noise shaping analysis block 514, and the long-term shaping block 620 is also coupled to the open-loop pitch analysis block 508 (connections not shown).
Further, the short-term prediction block 632 is coupled to the LPC analysis block 504 via the first vector quantizer 506, and the long-term prediction block 628 is coupled to the LTP analysis block 510 via the second vector quantizer 512 (connections also not shown).
The purpose of the noise shaping quantizer 516 is to quantize the LTP residual signal in a manner that weights the distortion noise created by the quantisation into less noticeable parts of the frequency spectrum, e.g. where the human ear is more tolerant to noise, and/or where the speech energy is high so that the relative effect of the noise is less.
In operation, all gains and filter coefficients are updated for every subframe, except for the LPC coefficients, which are updated once per frame. The noise shaping quantizer 516 generates a quantized output signal that is identical to the output signal ultimately generated in the decoder. The input signal is subtracted from this quantized output signal at the second subtraction stage 616 to obtain the quantization error signal d(n). The quantization error signal is input to a shaping filter 612, described in detail later. The output of the shaping filter 612 is added to the input signal at the first addition stage 602 in order to effect the spectral shaping of the quantization noise. From the resulting signal, the output of the prediction filter 614, described in detail below, is subtracted at the first subtraction stage 604 to create a residual signal. The residual signal is multiplied at the first amplifier 606 by the inverse of the quantized quantization gain from the noise shaping analysis block 514, and input to the scalar quantizer 608. The quantization indices of the scalar quantizer 608 represent a signal that is input to the arithmetic encoder 518. The scalar quantizer 608 also outputs a quantization signal, which is multiplied at the second amplifier 609 by the quantized quantization gain from the noise shaping analysis block 514 to create an excitation signal. The output of the prediction filter 614 is added at the second addition stage 610 to the excitation signal to form the quantized output signal. The quantized output signal is input to the prediction filter 614.
On a point of terminology, note that there is a small difference between the terms "residual" and "excitation". A residual is obtained by subtracting a prediction from the input speech signal. An excitation is based on only the quantizer output. Often, the residual is simply the quantizer input and the excitation is its output.
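The signal flow of the noise shaping quantizer may be illustrated by the simplified per-sample loop below. It is a sketch under stated assumptions: only the short-term prediction and short-term shaping paths are modelled (the long-term blocks are omitted), a single fixed gain is used, and the scalar quantizer is taken to be rounding to the nearest integer. A matching reconstruction routine shows that the quantized output formed in the encoder is the same signal a decoder rebuilds from the indices.

```python
def noise_shaping_quantize(x, a_q, a_shape, gain):
    """Return (indices, y) for input x; short-term paths only (a simplification)."""
    y = []        # quantized output signal y(n)
    d = []        # quantization error d(n) = y(n) - x(n)
    indices = []
    for n, xn in enumerate(x):
        # Shaping filter output, computed from past quantization errors.
        s = sum(a_shape[i] * d[n - 1 - i]
                for i in range(len(a_shape)) if n - 1 - i >= 0)
        # Prediction filter output, computed from past quantized output.
        p = sum(a_q[i] * y[n - 1 - i]
                for i in range(len(a_q)) if n - 1 - i >= 0)
        residual = (xn + s) - p        # addition stage 602, subtraction stage 604
        q = round(residual / gain)     # amplifier 606 and scalar quantizer 608
        excitation = q * gain          # amplifier 609
        yn = p + excitation            # addition stage 610
        indices.append(q)
        y.append(yn)
        d.append(yn - xn)              # subtraction stage 616
    return indices, y


def reconstruct(indices, a_q, gain):
    """Rebuild y(n) from the indices alone, as a decoder would."""
    y = []
    for n, q in enumerate(indices):
        p = sum(a_q[i] * y[n - 1 - i]
                for i in range(len(a_q)) if n - 1 - i >= 0)
        y.append(p + q * gain)
    return y
```

In this sketch the residual is the quantizer input and the excitation is the scaled quantizer output, in line with the terminology above. Because the error d(n) is fed back through the shaping coefficients, the output noise is the rounding error filtered by the shaping synthesis filter rather than white noise.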
The shaping filter 612 inputs the quantization error signal d(n) to a short-term shaping filter 624, which uses the short-term shaping coefficients $a_{shape}(i)$ to create a short-term shaping signal $s_{short}(n)$, according to the formula:
$$s_{short}(n) = \sum_{i=1}^{16} d(n-i)\, a_{shape}(i).$$
The short-term shaping signal is subtracted at the third subtraction stage 622 from the quantization error signal to create a shaping residual signal f(n). The shaping residual signal is input to a long-term shaping filter 620 which uses the long-term shaping coefficients $b_{shape}(i)$ to create a long-term shaping signal $s_{long}(n)$, according to the formula:
$$s_{long}(n) = \sum_{i=-2}^{2} f(n - \mathrm{lag} - i)\, b_{shape}(i).$$
The short-term and long-term shaping signals are added together at the third addition stage 618 to create the shaping filter output signal.
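The combination of the two shaping signals can be sketched as a function of the past quantization error. The sketch treats samples before the start of the signal as zero and centres whatever odd number of long-term taps it is given around the pitch lag; both choices, and the function name, are assumptions of the example.

```python
def shaping_filter_output(d, n, lag, a_shape, b_shape):
    """Shaping filter output at sample n from the error samples d[0..n-1]."""

    def past(k):
        # Samples outside the stored history are treated as zero (assumption).
        return d[k] if 0 <= k < len(d) else 0.0

    def s_short(m):
        return sum(a_shape[i - 1] * past(m - i)
                   for i in range(1, len(a_shape) + 1))

    def f(m):
        # Shaping residual: error minus its short-term shaping signal.
        return past(m) - s_short(m)

    half = len(b_shape) // 2
    s_long = sum(b_shape[i + half] * f(n - lag - i)
                 for i in range(-half, half + 1))
    return s_short(n) + s_long
```

Since the lag is at least 32 samples, the long-term term only ever reads shaping residual samples that lie well in the past, so the whole output is available before sample n is quantized.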
The prediction filter 614 inputs the quantized output signal y(n) to a short-term prediction filter 632, which uses the quantized LPC coefficients $a_Q$ to create a short-term prediction signal $p_{short}(n)$, according to the formula:
$$p_{short}(n) = \sum_{i=1}^{16} y(n-i)\, a_Q(i).$$
The short-term prediction signal is subtracted at the fourth subtraction stage 630 from the quantized output signal to create an LPC excitation signal $e_{LPC}(n)$:
$$e_{LPC}(n) = y(n) - p_{short}(n) = y(n) - \sum_{i=1}^{16} y(n-i)\, a_Q(i).$$
The LPC excitation signal is input to a long-term prediction filter 628 which calculates a prediction signal using the filter coefficients that were derived from correlations in the LTP analysis block 510 (see Figure 5). That is, long-term prediction filter 628 uses the quantized long-term prediction coefficients $b_Q(i)$ to create a long-term prediction signal $p_{long}(n)$, according to the formula:
$$p_{long}(n) = \sum_{i=-2}^{2} e_{LPC}(n - \mathrm{lag} - i)\, b_Q(i).$$
The LPC excitation signal $e_{LPC}(n)$ is stored in an LTP buffer 634 in the long-term prediction block 628. The LTP buffer 634 is of length at least equal to the maximum pitch lag of 288 plus 2. The signal contained in the LTP buffer 634 is the LTP filter state.
In embodiments of the present invention, the long-term prediction block 628 may modify the LPC excitation $e_{LPC}(n)$ stored in the encoder LTP buffer 634 at the start of every new input frame classified as voiced, when the quantized LPC coefficients $a_Q$ are updated. The modification consists of replacing the LTP filter state with a new LPC excitation signal $e_{LPC,new}$ computed from the quantized output signal y(n) and the new quantized LPC coefficients $a_{Q,new}$:
$$e_{LPC,new}(n) = y(n) - \sum_{i=1}^{16} y(n-i)\, a_{Q,new}(i).$$
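The re-whitening step amounts to running the most recent quantized output samples through the LPC analysis filter with the new coefficients and overwriting the LTP buffer with the result. A minimal sketch, assuming the caller keeps enough output history and using an illustrative function name and a default buffer length of 288 + 2 samples:

```python
def rewhiten_ltp_state(y_hist, a_q_new, state_len=290):
    """Recompute the last state_len samples of e_LPC from y with new coefficients.

    y_hist must hold at least state_len + len(a_q_new) past output samples,
    so that every recomputed sample has its full filter history.
    """
    order = len(a_q_new)
    first = len(y_hist) - state_len
    if first < order:
        raise ValueError("not enough output history for re-whitening")
    return [y_hist[n] - sum(a_q_new[i - 1] * y_hist[n - i]
                            for i in range(1, order + 1))
            for n in range(first, len(y_hist))]
```

After this replacement, the pitch pulses held in the LTP state have been whitened with the same coefficients as the pulses of the frame about to be coded.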
Alternatively or additionally, to deal with the problem of packet loss error propagation, in embodiments of the present invention the scaling control module 520 in the LTP analysis block 510 may scale down the LTP filter state stored in the LTP buffer 634 at the beginning of every new input frame, before the noise shape quantization is started. Sometimes multiple frames are combined within one packet, in which case the LTP scaling should preferably be applied only for the first frame of each packet (whereas the re-whitening is preferably done for every frame). The scaling value is passed from the scaling control module 520 in the LTP analysis block 510 to the long-term prediction block 628 in the noise shaping quantizer 516, where it is used to scale the LTP state stored in the LTP buffer 634.
The LTP scaling value is calculated by the scaling control module 520 based on information about the packet loss in the channel and information about the speech signal. This module 520 chooses between three scaling values of 0.5, 0.7 or 0.95, where 0.5 gives the most error propagation resilience and the lowest coding efficiency, and 0.95 gives the least error propagation resilience and the highest coding efficiency.
To assign the scaling value the scaling control module 520 calculates a sensitivity measure that is compared with two thresholds, one for using a scaling value of 0.5 and one for 0.7. Default is using a scaling value of 0.95. The sensitivity measure predicts how sensitive the current frame is to errors in the LTP filter state due to packet losses. It is calculated using the following formula:
$$S = 0.5 \cdot PG_{LTP} + 0.5 \cdot PG_{LTP,HP}$$
where $PG_{LTP}$ is the long-term prediction gain, measured as the ratio of the energy of the LPC residual $r_{LPC}$ and the LTP residual $r_{LTP}$, and $PG_{LTP,HP}$ is a signal obtained by running $PG_{LTP}$ through a first order high-pass filter according to:
$$PG_{LTP,HP}(n) = PG_{LTP}(n) - PG_{LTP}(n-1) + 0.5 \cdot PG_{LTP,HP}(n-1).$$
The sensitivity measure is a combination of the LTP prediction gain and a high pass version of the same measure. The LTP prediction gain is chosen because it directly relates the LTP state error with the output signal error. The high pass part is added to put emphasis on signal changes. A changing signal has a high risk of giving severe error propagation because the LTP state in the encoder and decoder will most likely be very different after a packet loss. An example is when losing a voiced onset (see Figure 4c).
The distribution of scaling values for the frames is dependent on the loss percentage where more of the frames get a scaling value less than 0.95 when the channel loss rate increases. This is done by lowering the sensitivity thresholds as the loss rate increases.
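The selection logic can be sketched as follows. The text does not give the threshold values, how they depend on the loss rate, or the initial filter state, so the threshold arguments, the zero initial state, and the class and function names are hypothetical. The sketch also assumes that a higher sensitivity maps to a stronger downscaling, so the threshold for 0.5 lies above the threshold for 0.7.

```python
class SensitivityTracker:
    """Tracks PG_LTP and its first-order high-pass version across frames."""

    def __init__(self):
        self.prev_pg = 0.0   # PG_LTP(n-1), assumed zero initially
        self.prev_hp = 0.0   # PG_LTP,HP(n-1), assumed zero initially

    def update(self, pg_ltp):
        """Return S = 0.5 * PG_LTP + 0.5 * PG_LTP,HP for the current frame."""
        hp = pg_ltp - self.prev_pg + 0.5 * self.prev_hp
        self.prev_pg, self.prev_hp = pg_ltp, hp
        return 0.5 * pg_ltp + 0.5 * hp


def choose_ltp_scale(sensitivity, thr_for_05, thr_for_07, first_in_packet=True):
    """Pick 0.5, 0.7 or the default 0.95; thresholds are hypothetical inputs."""
    if not first_in_packet:
        return 0.95          # only the first frame of a packet is downscaled
    if sensitivity > thr_for_05:
        return 0.5
    if sensitivity > thr_for_07:
        return 0.7
    return 0.95
```

Lowering both thresholds as the measured loss rate rises causes more frames to receive 0.5 or 0.7, which is the behaviour described above.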
If multiple frames are encoded and combined for transmission in one packet, then the scaling control module 520 only assigns scaling values lower than 0.95 for frames that are the first in a packet.
The scaling value is supplied from the scaling control module 520 in the LTP analysis block 510 to the arithmetic encoder 518, and from there is transmitted on to the decoder in the encoded signal.
The short-term and long-term prediction signals are added together to create the prediction filter output signal.
Note: the LTP state can be either the LPC residual or the LPC excitation signal, depending on details of the encoder. Typically however, as in the described embodiments, it is the LPC excitation signal.
The LSF indices, LTP indices, quantization gains indices, pitch lags, LTP scaling value indices (if used), and quantization indices are each arithmetically encoded and multiplexed at the arithmetic encoder 518 to create the payload bitstream. The arithmetic encoder 518 uses a look-up table with probability values for each index. The look-up tables are created by running a database of speech training signals and measuring frequencies of each of the index values. The frequencies are translated into probabilities through a normalization step.
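The construction of a look-up table from training data may be illustrated as follows. The function name and the flat list of observed index values are assumptions; the text states only that frequencies are measured and then normalized.

```python
from collections import Counter


def probability_table(training_indices, alphabet_size):
    """Relative frequency of each index value 0..alphabet_size-1 in the training data."""
    counts = Counter(training_indices)
    total = len(training_indices)
    return [counts.get(k, 0) / total for k in range(alphabet_size)]
```

An arithmetic coder cannot encode a symbol whose table probability is zero, so a practical table built this way would also need every index value to receive a nonzero probability.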
An example decoder 700 for use in decoding a signal encoded according to embodiments of the present invention is now described in relation to Figure 7.
The decoder 700 comprises an arithmetic decoding and dequantizing block 702, an excitation generation block 704, an LTP synthesis filter 706, and an LPC synthesis filter 708. The LTP synthesis filter 706 comprises an LTP buffer 710. The arithmetic decoding and dequantizing block 702 has an input arranged to receive an encoded bitstream from an input device such as a wired modem or wireless transceiver, and has outputs coupled to inputs of each of the excitation generation block 704, LTP synthesis filter 706 and LPC synthesis filter 708. The excitation generation block 704 has an output coupled to an input of the LTP synthesis filter 706, and the LTP synthesis block 706 has an output connected to an input of the LPC synthesis filter 708. The LPC synthesis filter has an output arranged to provide a decoded output for supply to an output device such as a speaker or headphones.
At the arithmetic decoding and dequantizing block 702, the arithmetically encoded bitstream is demultiplexed and decoded to create LSF indices, LTP indices, LTP scaling value indices (if used), quantization gains indices, pitch lags and a signal of quantization indices. The LSF indices are converted to quantized LSFs by adding the codebook vectors of the ten stages of the MSVQ. The quantized LSFs are transformed to quantized LPC coefficients. The LTP codebook is then used to convert the LTP indices to quantized LTP coefficients.
The gains indices are converted to quantization gains, through look ups in the gain quantization codebook.
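The multi-stage reconstruction of the quantized LSFs is a sum of one codebook vector per stage. A minimal sketch, in which the nested-list codebook layout and the function name are assumptions:

```python
def msvq_decode(indices, codebooks):
    """Sum the selected codebook vector of each stage.

    codebooks[stage][index] is a vector; all vectors share one dimension.
    """
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for stage, idx in enumerate(indices):
        vec = codebooks[stage][idx]
        out = [o + v for o, v in zip(out, vec)]
    return out
```

In the described decoder there would be ten stages and ten indices, one per stage.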
At the excitation generation block, the excitation quantization indices signal is multiplied by the quantization gain to create an excitation signal e(n).
The excitation signal is input to the LTP synthesis filter 706 to create the LPC excitation signal $e_{LPC}(n)$ according to:
$$e_{LPC}(n) = e(n) + \sum_{i=-2}^{2} e(n - \mathrm{lag} - i)\, b_Q(i),$$
using the pitch lag and quantized LTP coefficients bQ.
The excitation signal e(n) is stored in an LTP buffer of length at least equal to the maximum pitch lag of 288, plus 2. The signal contained in the LTP buffer is the LTP filter state.
The LPC excitation signal is input to an LPC synthesis filter to create the decoded speech signal y(n) according to
$$y(n) = e_{LPC}(n) + \sum_{i=1}^{16} e_{LPC}(n-i)\, a_Q(i)$$
using the quantized LPC coefficients $a_Q$.
In embodiments of the present invention, the LTP synthesis filter 706 may modify the LPC excitation $e_{LPC}(n)$ stored in the decoder LTP buffer 710 at the start of every new input frame classified as voiced, when the quantized LPC coefficients $a_Q$ are updated. The modification consists of replacing the LTP filter state with a new LPC excitation signal $e_{LPC,new}$ computed from the decoded speech signal y(n) and the new quantized LPC coefficients $a_{Q,new}$.
Alternatively or additionally, if state scaling is used, the LTP synthesis filter 706 may use the decoded LTP scale value to scale down the LTP filter state at the beginning of every new input frame, before LTP synthesis filtering is started. That is, scaling the LPC excitation $e_{LPC}(n)$.
In a particularly advantageous embodiment, if an LTP scale value significantly below one is used, e.g. a value of 0.5 or 0.7, in a frame after one or more packet losses, then the LTP synthesis filter 706 in the decoder 700 may use its knowledge about the LTP scale value to improve the LTP state synchronization further as discussed below.
Figure 8 shows the relationship between the LTP state and LTP synthesis filter output. The LTP synthesis filter delays the LTP state by the pitch lag and convolves it with the LTP filter coefficients to create a filtered LTP state. The filtered LTP state is added to the excitation signal to create the LTP synthesis filter output, or LPC excitation signal.
The left figure shows the filtered LTP state and excitation signal without downscaling, where they are orthogonal. If after a packet loss the LTP state is set to zero, resulting in a zero filtered LTP state, then the excitation signal provides the best approximation to the LTP synthesis output that would have been generated if no packet loss had occurred.
The figure to the right shows the excitation signal with LTP downscaling, using the same optimal LTP coefficients. Here the vectors are not orthogonal anymore, and have positive correlation. The positive correlation between filtered reduced LTP state and excitation can be exploited after a packet loss. If after a loss the LTP state is set to zero, the excitation can be scaled up to give a closer match to the LTP synthesis output that would have been generated if no packet loss had occurred. The optimal scaling is not known on the decoder side, but can be estimated using for instance a trained statistical approach. It was found heuristically that good performance is obtained when upscaling by a factor 1.4 when the LTP scale value is 0.7 and upscaling by a factor 1.8 when the LTP scale value is 0.5.
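The heuristic upscaling can be expressed as a small look-up keyed on the decoded LTP scale value. The sketch assumes the decoder recovers the scale value exactly as 0.5, 0.7 or 0.95 and applies no upscaling for 0.95; the text gives no factor for that case, so the default of 1.0 is an assumption, as are the names used.

```python
# Heuristic factors from the text; the keys are the decoded LTP scale values.
UPSCALE_BY_LTP_SCALE = {0.7: 1.4, 0.5: 1.8}


def upscale_excitation_after_loss(excitation, ltp_scale):
    """Scale the excitation of the first frame after a loss (LTP state zeroed)."""
    factor = UPSCALE_BY_LTP_SCALE.get(ltp_scale, 1.0)  # 1.0 is an assumed default
    return [factor * e for e in excitation]
```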
If the LTP state is not set to zero after packet loss, but is approximated using the signal generated with a packet loss concealment unit, another enhancement is used in the decoder. The knowledge of LTP state scaling is exploited by changing the phase of the decoder LTP filter state so as to optimize the correlation with the LTP residual signal for the duration of the first pitch period. This enhancement is useful when the pitch lag used by the concealment unit drifts away from the pitch lag used in the encoder. The advantage is illustrated in Figure 9, which is an illustration of the resynchronisation of the LTP state after packet loss. The first plot is a voiced speech signal without packet loss. The second plot illustrates a signal with a lost packet between the vertical lines. LTP state scaling by 0.5 is used, but because the phase of the pitch pulse drifts in the concealment signal compared to the lossless signal, the signal after the loss contains a large error in the pitch pulse shape. The last plot shows how synchronising the LTP state such that correlation between filtered LTP state and excitation signal is maximized improves the pitch pulse shape.
Resynchronisation of the LTP state, after packet loss, is a known method in the art of predictive speech coding. However, the technique of LTP state downscaling increases the robustness of the LTP state resynchronisation by giving a good estimate of the pitch pulse phase in the encoder, and therefore a good estimate of the error free signal.
The encoder 500 and decoder 700 are preferably implemented in software, such that each of the components 502 to 634 and 702 to 710 comprises modules of software stored on one or more memory devices and executed on a processor. A preferred application of the present invention is to encode speech for transmission over a packet-based network such as the Internet, preferably using a peer-to-peer (P2P) system implemented over the Internet, for example as part of a live call such as a Voice over IP (VoIP) call. In this case, the encoder 500 and decoder 700 are preferably implemented in client application software executed on end-user terminals of two users communicating over the P2P system.
It will be appreciated that the above embodiments are described only by way of example. Other applications and configurations may be apparent to the person skilled in the art given the disclosure herein. The scope of the invention is not limited by the described embodiments, but only by the following claims.
Claims
1. A method of encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter, the method comprising: receiving a speech signal; from the speech signal, deriving a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; deriving a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and transmitting an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; wherein the method further comprises, once every number of said intervals, transforming the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
2. The method of claim 1, wherein: at one or more of said intervals, parameters used to derive the first remaining signal are updated between deriving the respective earlier portion and generating the predicted version of the respective later portion; and said transformation is performed at said one or more intervals and comprises updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
3. The method of claim 2, wherein: the encoding is performed over a plurality of frames each comprising a plurality of subframes, and each of said intervals is a subframe; said deriving of the second remaining signal is performed once per subframe whilst parameters used to derive the first remaining signal are updated once per frame, hence at one subframe per frame then the predicted version of the later portion is generated from the earlier portion as derived using a previous frame's parameters but is used to remove said effect of periodicity from the first remaining signal as derived using a current frame's parameters; and said transformation of the stored version of the earlier portion is performed at said one subframe per frame and comprises updating the stored version of the respective earlier portion of the first residual signal using the current frame's parameters.
4. The method of claim 3, comprising determining said correlations using at least one of an open-loop pitch analysis and a long-term prediction analysis, at least one of which analyses is based on a version of the first remaining signal derived using said updated parameters for both the previous and current frames.
5. The method of any preceding claim, wherein said transformation is so as to result in a greater reduction in overall energy of the second remaining signal relative to the first remaining signal than without said transformation.
6. The method of any preceding claim, wherein said transformation comprises re-whitening the stored version of the earlier portion.
7. The method according to any preceding claim, wherein the encoded signal is transmitted as a plurality of packets each encoding a plurality of said intervals, and said transformation of the stored version of the earlier portion is performed once per packet so as to reduce error propagation caused by potential packet loss in the transmission.
8. The method of claim 7, wherein said transformation is performed for the first interval of each packet.
9. The method of claim 7 or 8, wherein said transformation is based on information about the packet loss in a channel used for said transmission.
10. The method of claim 1, 7, 8 or 9, wherein said transformation comprises scaling down the stored version of the earlier portion by a scaling factor.
11. The method of claim 10, wherein the scaling factor is selected from one of a plurality of specified factors.
12. The method of claim 11, wherein said specified factors have substantially the values of 0.5, 0.7 and 0.95.
13. The method of any preceding claim, wherein said periodicity corresponds to a perceived pitch of the speech signal.
14. The method of any preceding claim, wherein the derivation of said spectral envelope signal is by linear predictive coding (LPC) such that said first remaining signal is an LPC residual signal.
15. The method of claim 7, wherein said stored versions of the earlier portions are stored in the form of a quantized excitation corresponding to respective portions of said LPC residual signal.
16. The method of any preceding claim, wherein said derivation of the second remaining signal is by long-term prediction (LTP) such that said second remaining signal is an LTP residual signal.
17. The method of claim 16, wherein each of said stored versions of the earlier portions each comprises an LTP state.
18. A method of decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the method comprising: receiving an encoded signal; from the encoded signal, determining a spectral envelope signal representative of the modelled filter; from the encoded signal, determining a second remaining signal; deriving a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and generating a decoded speech signal based on the first excitation signal and spectral envelope signal, and outputting the decoded speech signal to an output device.
19. The method of claim 18, wherein: at one or more of said intervals, parameters used to derive the first remaining signal are updated between determining the respective earlier portion and generating the predicted version of the respective later portion; and said transformation is performed at said one or more intervals and comprises updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
20. The method of claim 18 or 19, wherein the encoded speech signal is received as a plurality of packets each encoding a plurality of said intervals, and said transformation of the stored version of the earlier portion is performed once per packet so as to reduce error propagation caused by potential packet loss in the transmission.
21. The method of claim 18 or 20, wherein said transformation comprises scaling down the stored version of the earlier portion by a scaling factor.
22. An encoder for encoding speech according to a source-filter model whereby speech is modelled to comprise a source signal filtered by a time-varying filter, the encoder comprising: an input arranged to receive a speech signal; a first signal processing module configured to derive, from the speech signal, a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; a second signal processing module configured to derive a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; and an output arranged to transmit an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; wherein the second signal processing module is further configured to transform, once every number of said intervals, the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
23. The encoder of claim 22, wherein: the first signal processing module is configured such that, at one or more of said intervals, parameters used to derive the first remaining signal are updated between deriving the respective earlier portion and generating the predicted version of the respective later portion; and the second signal processing module is configured to perform said transformation at said one or more intervals by updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
24. The encoder of claim 23, wherein: the encoding is performed over a plurality of frames each comprising a plurality of subframes, and each of said intervals is a subframe; the second signal processing module is configured to derive the second remaining signal once per subframe whilst the first signal processing module is configured to update said parameters once per frame, hence at one subframe per frame then the predicted version of the later portion is generated from the earlier portion as derived using a previous frame's parameters but is used to remove said effect of periodicity from the first remaining signal as derived using a current frame's parameters; and the second signal processing module is configured to perform said transformation of the stored version of the earlier portion at said one subframe per frame by updating the stored version of the respective earlier portion of the first residual signal using the current frame's parameters.
25. The encoder of claim 24, wherein the second signal processing module comprises at least one of an open-loop pitch analysis block and a long-term prediction analysis block, at least one of which is configured to perform its analysis based on a version of the first remaining signal derived using said updated parameters for both the previous and current frames.
26. The encoder of any of claims 22 to 25, wherein the second signal processing module is configured to perform said transformation so as to result in a greater reduction in overall energy of the second remaining signal relative to the first remaining signal than without said transformation.
27. The encoder of any of claims 22 to 26, wherein the second signal processing module is configured to perform said transformation by re-whitening the stored version of the earlier portion.
28. The encoder of any of claims 22 to 27, wherein the output is arranged to transmit said encoded signal as a plurality of packets each encoding a plurality of said intervals, and the second signal processing module is configured to perform said transformation of the stored version of the earlier portion once per packet so as to reduce error propagation caused by potential packet loss in the transmission.
29. The encoder of claim 28, wherein the second signal processing module is configured to perform said transformation for the first interval of each packet.
30. The encoder of claim 28 or 29, wherein the second signal processing module is configured to perform said transformation based on information about the packet loss in a channel used for said transmission.
31. The encoder of claim 22, 28, 29 or 30, wherein the second signal processing module is configured to perform said transformation by scaling down the stored version of the earlier portion by a scaling factor.
32. The encoder of claim 31, wherein the second signal processing module is configured to select said scaling factor from one of a plurality of specified factors.
33. The encoder of claim 32, wherein said specified factors have substantially the values of 0.5, 0.7 and 0.95.
34. The encoder of any of claims 22 to 33, wherein said periodicity corresponds to a perceived pitch of the speech signal.
35. The encoder of any of claims 22 to 34, wherein the first signal processing module comprises a linear predictive coding (LPC) module such that the derivation of said spectral envelope signal is by linear predictive coding and said first remaining signal is an LPC residual signal.
36. The encoder of claim 35, wherein said stored versions of the earlier portions are stored in the form of a quantized excitation corresponding to respective portions of said LPC residual signal.
37. The encoder of any of claims 22 to 36, wherein the second signal processing module comprises a long-term prediction (LTP) module such that said derivation of the second remaining signal is by long-term prediction and said second remaining signal is an LTP residual signal.
38. The encoder of claim 37, wherein each of said stored versions of the earlier portions comprises an LTP state.
39. A decoder for decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the decoder comprising: an input arranged to receive an encoded signal; a first signal processing module configured to determine, from the encoded signal, a spectral envelope signal representative of the modelled filter; and a second signal processing module configured to determine, from the encoded signal, a second remaining signal; wherein the second signal processing module is further configured to derive a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and the decoder further comprises an output module configured to generate a decoded speech signal based on the first excitation signal and spectral envelope signal, and output the decoded speech signal to an output device.
40. The decoder of claim 39, wherein: the first signal processing module is configured such that, at one or more of said intervals, parameters used to derive the first remaining signal are updated between determining the respective earlier portion and generating the predicted version of the respective later portion; and the second signal processing module is configured to perform said transformation at said one or more intervals by updating the stored version of the respective earlier portion of the first residual signal using the updated parameters.
41. The decoder of claim 39 or 40, wherein the input is arranged to receive the encoded speech signal as a plurality of packets each encoding a plurality of said intervals, and the second signal processing module is configured to perform said transformation of the stored version of the earlier portion once per packet so as to reduce error propagation caused by potential packet loss in the transmission.
42. The decoder of claim 40 or 41, wherein the second signal processing module is configured to perform said transformation by scaling down the stored version of the earlier portion by a scaling factor.
43. A computer program product for encoding speech according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the program comprising code arranged so as when executed on a processor to: receive a speech signal; from the speech signal, derive a spectral envelope signal representative of the modelled filter and a first remaining signal representative of the modelled source signal, the first remaining signal comprising a plurality of successive portions having a degree of periodicity; derive a second remaining signal from the first remaining signal by, at intervals during the encoding of said speech signal: exploiting a correlation between ones of said portions to generate a predicted version of a later of said portions from a stored version of an earlier of said portions, and using the predicted version of the later portion to remove an effect of said periodicity from the first remaining signal; transmit an encoded signal representing said speech signal based on the spectral envelope signal, said correlations and the second remaining signal; and once every number of said intervals, transform the stored version of the earlier portion of the first remaining signal prior to generating the predicted version of the respective later portion.
44. A computer program product for decoding an encoded signal comprising speech encoded according to a source-filter model whereby the speech is modelled to comprise a source signal filtered by a time-varying filter, the program comprising code arranged so as when executed on a processor to: receive an encoded signal over a communication medium; from the encoded signal, determine a spectral envelope signal representative of the modelled filter; from the encoded signal, determine a second remaining signal; derive a first remaining signal representative of the modelled source signal and comprising a plurality of successive portions having a degree of periodicity, by, at intervals during the decoding of said encoded signal: determining from the encoded signal information relating to a correlation between ones of said portions of the first remaining signal, using said information to generate a predicted version of a later of said portions based on a stored version of an earlier of said portions, and reconstructing a corresponding portion of the first remaining signal using the second remaining signal and said predicted version of the later portion; and generate a decoded speech signal based on the first excitation signal and spectral envelope signal, and output the decoded speech signal to an output device.
45. A computer program product comprising code arranged so as when executed on a processor to perform the steps of any of claims 1 to 21.
46. A client application product comprising code arranged so as when executed on a processor to perform the steps of any of claims 1 to 21.
47. A communication system comprising a plurality of end-user terminals, each of the end-user terminals comprising at least one of an encoder according to any of claims 1 to 17 and a decoder according to any of claims 18 to 21.
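The scaling transformation of claims 21, 31 to 33 and 42 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a single-tap predictor with an integer pitch lag, keeps the unquantized first remaining signal as the stored LTP state (claim 36 stores a quantized excitation instead), and applies the scaling at the first interval of each packet as in claim 29. The names `_ltp`, `ltp_encode`, `ltp_decode`, `lag`, `gain` and `subframes_per_packet` are hypothetical; only the factors 0.5, 0.7 and 0.95 come from claim 33.

```python
from collections import deque
from typing import Deque, List, Sequence

# Scaling factors listed in claim 33; every other name here is hypothetical.
SPECIFIED_SCALES = (0.5, 0.7, 0.95)


def _ltp(signal: Sequence[float], lag: int, gain: float, subframe_len: int,
         subframes_per_packet: int, scale: float, decode: bool) -> List[float]:
    # LTP state: the last `lag` samples of the first remaining signal.
    state: Deque[float] = deque([0.0] * lag, maxlen=lag)
    out: List[float] = []
    for start in range(0, len(signal), subframe_len):
        if (start // subframe_len) % subframes_per_packet == 0:
            # First interval of a packet: scale down the stored state
            # before it is used for prediction.
            state = deque((scale * s for s in state), maxlen=lag)
        for value in signal[start:start + subframe_len]:
            predicted = gain * state[0]  # sample from `lag` positions back
            if decode:
                sample = value + predicted   # rebuild first remaining signal
                out.append(sample)
            else:
                sample = value
                out.append(value - predicted)  # second remaining signal
            state.append(sample)
    return out


def ltp_encode(first_remaining: Sequence[float], lag: int, gain: float,
               subframe_len: int, subframes_per_packet: int,
               scale: float) -> List[float]:
    """Return the second remaining (LTP residual) signal."""
    return _ltp(first_remaining, lag, gain, subframe_len,
                subframes_per_packet, scale, decode=False)


def ltp_decode(second_remaining: Sequence[float], lag: int, gain: float,
               subframe_len: int, subframes_per_packet: int,
               scale: float) -> List[float]:
    """Rebuild the first remaining signal from the LTP residual."""
    return _ltp(second_remaining, lag, gain, subframe_len,
                subframes_per_packet, scale, decode=True)
```

With a perfectly periodic input such as `[1.0, -2.0, 3.0, 0.5] * 8`, `lag=4`, `gain=1.0`, four-sample subframes, four subframes per packet and `scale=0.5`, the residual is zero inside a packet and half the input amplitude for the first four samples of the second packet. This shows the trade-off the claims describe: a smaller factor makes each packet depend less on state carried over from the previous packet, which may have been lost, at the cost of a larger residual to encode.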
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10704769.8A EP2384508B1 (en) | 2009-01-06 | 2010-01-05 | Speech coding |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0900142.1 | 2009-01-06 | ||
GB0900142.1A GB2466672B (en) | 2009-01-06 | 2009-01-06 | Speech coding |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2010079167A1 (en) | 2010-07-15 |
WO2010079167A4 (en) | 2010-10-14 |
Family ID: 40379221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2010/050057 WO2010079167A1 (en) | 2009-01-06 | 2010-01-05 | Speech coding |
Country Status (4)
Country | Link |
---|---|
US (1) | US8433563B2 (en) |
EP (1) | EP2384508B1 (en) |
GB (1) | GB2466672B (en) |
WO (1) | WO2010079167A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8392178B2 (en) | 2009-01-06 | 2013-03-05 | Skype | Pitch lag vectors for speech encoding |
US8396706B2 (en) | 2009-01-06 | 2013-03-12 | Skype | Speech coding |
US8433563B2 (en) | 2009-01-06 | 2013-04-30 | Skype | Predictive speech signal coding |
US8452606B2 (en) | 2009-09-29 | 2013-05-28 | Skype | Speech encoding using multiple bit rates |
US8463604B2 (en) | 2009-01-06 | 2013-06-11 | Skype | Speech encoding utilizing independent manipulation of signal and noise spectrum |
WO2013149672A1 (en) | 2012-04-05 | 2013-10-10 | Huawei Technologies Co., Ltd. | Method for determining an encoding parameter for a multi-channel audio signal and multi-channel audio encoder |
US8655653B2 (en) | 2009-01-06 | 2014-02-18 | Skype | Speech coding by quantizing with random-noise signal |
US8670981B2 (en) | 2009-01-06 | 2014-03-11 | Skype | Speech encoding and decoding utilizing line spectral frequency interpolation |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8688437B2 (en) * | 2006-12-26 | 2014-04-01 | Huawei Technologies Co., Ltd. | Packet loss concealment for speech coding |
WO2008108721A1 (en) * | 2007-03-05 | 2008-09-12 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and arrangement for controlling smoothing of stationary background noise |
GB2466671B (en) | 2009-01-06 | 2013-03-27 | Skype | Speech encoding |
US9767822B2 (en) * | 2011-02-07 | 2017-09-19 | Qualcomm Incorporated | Devices for encoding and decoding a watermarked signal |
WO2015134579A1 (en) * | 2014-03-04 | 2015-09-11 | Interactive Intelligence Group, Inc. | System and method to correct for packet loss in asr systems |
US20190051286A1 (en) * | 2017-08-14 | 2019-02-14 | Microsoft Technology Licensing, Llc | Normalization of high band signals in network telephony communications |
US10650837B2 (en) | 2017-08-29 | 2020-05-12 | Microsoft Technology Licensing, Llc | Early transmission in packetized speech |
CN113196387B (en) * | 2019-01-13 | 2024-10-18 | 华为技术有限公司 | Computer-implemented method for audio encoding and decoding and electronic device |
US11437050B2 (en) * | 2019-09-09 | 2022-09-06 | Qualcomm Incorporated | Artificial intelligence based audio coding |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1991003790A1 (en) * | 1989-09-01 | 1991-03-21 | Motorola, Inc. | Digital speech coder having improved sub-sample resolution long-term predictor |
EP0724252A2 (en) * | 1994-12-27 | 1996-07-31 | Nec Corporation | A CELP-type speech encoder having an improved long-term predictor |
US20070255561A1 (en) * | 1998-09-18 | 2007-11-01 | Conexant Systems, Inc. | System for speech encoding having an adaptive encoding arrangement |
US20080154588A1 (en) * | 2006-12-26 | 2008-06-26 | Yang Gao | Speech Coding System to Improve Packet Loss Concealment |
Family Cites Families (90)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS62112221U (en) | 1985-12-27 | 1987-07-17 | ||
US5125030A (en) | 1987-04-13 | 1992-06-23 | Kokusai Denshin Denwa Co., Ltd. | Speech signal coding/decoding system based on the type of speech signal |
US5327250A (en) | 1989-03-31 | 1994-07-05 | Canon Kabushiki Kaisha | Facsimile device |
US5240386A (en) | 1989-06-06 | 1993-08-31 | Ford Motor Company | Multiple stage orbiting ring rotary compressor |
US5187481A (en) | 1990-10-05 | 1993-02-16 | Hewlett-Packard Company | Combined and simplified multiplexing and dithered analog to digital converter |
JP3254687B2 (en) | 1991-02-26 | 2002-02-12 | 日本電気株式会社 | Audio coding method |
US5680508A (en) | 1991-05-03 | 1997-10-21 | Itt Corporation | Enhancement of speech coding in background noise for low-rate speech coder |
US5253269A (en) | 1991-09-05 | 1993-10-12 | Motorola, Inc. | Delta-coded lag information for use in a speech coder |
US5487086A (en) | 1991-09-13 | 1996-01-23 | Comsat Corporation | Transform vector quantization for adaptive predictive coding |
US5327520A (en) * | 1992-06-04 | 1994-07-05 | At&T Bell Laboratories | Method of use of voice message coder/decoder |
JP2800618B2 (en) | 1993-02-09 | 1998-09-21 | 日本電気株式会社 | Voice parameter coding method |
US5357252A (en) | 1993-03-22 | 1994-10-18 | Motorola, Inc. | Sigma-delta modulator with improved tone rejection and method therefor |
US5621852A (en) | 1993-12-14 | 1997-04-15 | Interdigital Technology Corporation | Efficient codebook structure for code excited linear prediction coding |
WO1995018523A1 (en) | 1993-12-23 | 1995-07-06 | Philips Electronics N.V. | Method and apparatus for encoding multibit coded digital sound through subtracting adaptive dither, inserting buried channel bits and filtering, and encoding and decoding apparatus for use with this method |
CA2154911C (en) | 1994-08-02 | 2001-01-02 | Kazunori Ozawa | Speech coding device |
JPH08179795A (en) | 1994-12-27 | 1996-07-12 | Nec Corp | Voice pitch lag coding method and device |
US5646961A (en) | 1994-12-30 | 1997-07-08 | Lucent Technologies Inc. | Method for noise weighting filtering |
JP3334419B2 (en) | 1995-04-20 | 2002-10-15 | ソニー株式会社 | Noise reduction method and noise reduction device |
US5867814A (en) | 1995-11-17 | 1999-02-02 | National Semiconductor Corporation | Speech coder that utilizes correlation maximization to achieve fast excitation coding, and associated coding method |
US20020032571A1 (en) | 1996-09-25 | 2002-03-14 | Ka Y. Leung | Method and apparatus for storing digital audio and playback thereof |
CN1178204C (en) | 1996-11-07 | 2004-12-01 | 松下电器产业株式会社 | Acoustic vector, and acoustic encoding and decoding device |
JP3266178B2 (en) | 1996-12-18 | 2002-03-18 | 日本電気株式会社 | Audio coding device |
JP3523649B2 (en) | 1997-03-12 | 2004-04-26 | 三菱電機株式会社 | Audio encoding device, audio decoding device, audio encoding / decoding device, audio encoding method, audio decoding method, and audio encoding / decoding method |
FI113903B (en) | 1997-05-07 | 2004-06-30 | Nokia Corp | Speech coding |
TW408298B (en) | 1997-08-28 | 2000-10-11 | Texas Instruments Inc | Improved method for switched-predictive quantization |
DE19747132C2 (en) | 1997-10-24 | 2002-11-28 | Fraunhofer Ges Forschung | Methods and devices for encoding audio signals and methods and devices for decoding a bit stream |
JP3132456B2 (en) | 1998-03-05 | 2001-02-05 | 日本電気株式会社 | Hierarchical image coding method and hierarchical image decoding method |
US6470309B1 (en) | 1998-05-08 | 2002-10-22 | Texas Instruments Incorporated | Subframe-based correlation |
JP3180762B2 (en) | 1998-05-11 | 2001-06-25 | 日本電気株式会社 | Audio encoding device and audio decoding device |
CN1143470C (en) | 1998-05-29 | 2004-03-24 | 西门子公司 | Method and device for masking errors |
US6173257B1 (en) | 1998-08-24 | 2001-01-09 | Conexant Systems, Inc | Completed fixed codebook for speech encoder |
US6188980B1 (en) | 1998-08-24 | 2001-02-13 | Conexant Systems, Inc. | Synchronized encoder-decoder frame concealment using speech coding parameters including line spectral frequencies and filter coefficients |
US6104992A (en) | 1998-08-24 | 2000-08-15 | Conexant Systems, Inc. | Adaptive gain reduction to produce fixed codebook target signal |
US6260010B1 (en) | 1998-08-24 | 2001-07-10 | Conexant Systems, Inc. | Speech encoder using gain normalization that combines open and closed loop gains |
US6493665B1 (en) | 1998-08-24 | 2002-12-10 | Conexant Systems, Inc. | Speech classification and parameter weighting used in codebook search |
CA2252170A1 (en) | 1998-10-27 | 2000-04-27 | Bruno Bessette | A method and device for high quality coding of wideband speech and audio signals |
US6691084B2 (en) | 1998-12-21 | 2004-02-10 | Qualcomm Incorporated | Multiple mode variable rate speech coding |
US6456964B2 (en) | 1998-12-21 | 2002-09-24 | Qualcomm, Incorporated | Encoding of periodic speech using prototype waveforms |
FI116992B (en) * | 1999-07-05 | 2006-04-28 | Nokia Corp | Methods, systems, and devices for enhancing audio coding and transmission |
JP4734286B2 (en) | 1999-08-23 | 2011-07-27 | パナソニック株式会社 | Speech encoding device |
US6775649B1 (en) | 1999-09-01 | 2004-08-10 | Texas Instruments Incorporated | Concealment of frame erasures for speech transmission and storage system and method |
US6782360B1 (en) * | 1999-09-22 | 2004-08-24 | Mindspeed Technologies, Inc. | Gain quantization for a CELP speech coder |
US6574593B1 (en) | 1999-09-22 | 2003-06-03 | Conexant Systems, Inc. | Codebook tables for encoding and decoding |
US6604070B1 (en) | 1999-09-22 | 2003-08-05 | Conexant Systems, Inc. | System of encoding and decoding speech signals |
US6959274B1 (en) | 1999-09-22 | 2005-10-25 | Mindspeed Technologies, Inc. | Fixed rate speech compression system and method |
US6523002B1 (en) | 1999-09-30 | 2003-02-18 | Conexant Systems, Inc. | Speech coding having continuous long term preprocessing without any delay |
JP2001175298A (en) | 1999-12-13 | 2001-06-29 | Fujitsu Ltd | Noise suppression device |
WO2001052241A1 (en) | 2000-01-11 | 2001-07-19 | Matsushita Electric Industrial Co., Ltd. | Multi-mode voice encoding device and decoding device |
US6757654B1 (en) | 2000-05-11 | 2004-06-29 | Telefonaktiebolaget Lm Ericsson | Forward error correction in speech coding |
US6862567B1 (en) | 2000-08-30 | 2005-03-01 | Mindspeed Technologies, Inc. | Noise suppression in the frequency domain by adjusting gain according to voicing parameters |
US7171355B1 (en) | 2000-10-25 | 2007-01-30 | Broadcom Corporation | Method and apparatus for one-stage and two-stage noise feedback coding of speech and audio signals |
US7505594B2 (en) | 2000-12-19 | 2009-03-17 | Qualcomm Incorporated | Discontinuous transmission (DTX) controller system and method |
US6996523B1 (en) | 2001-02-13 | 2006-02-07 | Hughes Electronics Corporation | Prototype waveform magnitude quantization for a frequency domain interpolative speech codec system |
FI118067B (en) * | 2001-05-04 | 2007-06-15 | Nokia Corp | Method of unpacking an audio signal, unpacking device, and electronic device |
KR100464369B1 (en) | 2001-05-23 | 2005-01-03 | 삼성전자주식회사 | Excitation codebook search method in a speech coding system |
CA2365203A1 (en) * | 2001-12-14 | 2003-06-14 | Voiceage Corporation | A signal modification method for efficient coding of speech signals |
US6751587B2 (en) | 2002-01-04 | 2004-06-15 | Broadcom Corporation | Efficient excitation quantization in noise feedback coding with general noise shaping |
US7260524B2 (en) | 2002-03-12 | 2007-08-21 | Dilithium Networks Pty Limited | Method for adaptive codebook pitch-lag computation in audio transcoders |
JP4805540B2 (en) | 2002-04-10 | 2011-11-02 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Stereo signal encoding |
US20040083097A1 (en) | 2002-10-29 | 2004-04-29 | Chu Wai Chung | Optimized windows and interpolation factors, and methods for optimizing windows, interpolation factors and linear prediction analysis in the ITU-T G.729 speech coding standard |
CA2415105A1 (en) | 2002-12-24 | 2004-06-24 | Voiceage Corporation | A method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding |
US8359197B2 (en) | 2003-04-01 | 2013-01-22 | Digital Voice Systems, Inc. | Half-rate vocoder |
JP4312000B2 (en) | 2003-07-23 | 2009-08-12 | パナソニック株式会社 | Buck-boost DC-DC converter |
FI118704B (en) | 2003-10-07 | 2008-02-15 | Nokia Corp | Method and device for source coding |
US7556612B2 (en) * | 2003-10-20 | 2009-07-07 | Medical Components, Inc. | Dual-lumen bi-directional flow catheter |
CA2457988A1 (en) | 2004-02-18 | 2005-08-18 | Voiceage Corporation | Methods and devices for audio compression based on acelp/tcx coding and multi-rate lattice vector quantization |
JP4539446B2 (en) | 2004-06-24 | 2010-09-08 | ソニー株式会社 | Delta-sigma modulation apparatus and delta-sigma modulation method |
KR100647290B1 (en) | 2004-09-22 | 2006-11-23 | 삼성전자주식회사 | Voice encoder/decoder for selecting quantization/dequantization using synthesized speech-characteristics |
CA2603246C (en) | 2005-04-01 | 2012-07-17 | Qualcomm Incorporated | Systems, methods, and apparatus for anti-sparseness filtering |
US7684981B2 (en) | 2005-07-15 | 2010-03-23 | Microsoft Corporation | Prediction of spectral coefficients in waveform coding and decoding |
US7787827B2 (en) | 2005-12-14 | 2010-08-31 | Ember Corporation | Preamble detection |
EP1994531B1 (en) | 2006-02-22 | 2011-08-10 | France Telecom | Improved celp coding or decoding of a digital audio signal |
US7873511B2 (en) | 2006-06-30 | 2011-01-18 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Audio encoder, audio decoder and audio processor having a dynamically variable warping characteristic |
US8335684B2 (en) | 2006-07-12 | 2012-12-18 | Broadcom Corporation | Interchangeable noise feedback coding and code excited linear prediction encoders |
JP4769673B2 (en) | 2006-09-20 | 2011-09-07 | 富士通株式会社 | Audio signal interpolation method and audio signal interpolation apparatus |
US7979282B2 (en) | 2006-09-29 | 2011-07-12 | Lg Electronics Inc. | Methods and apparatuses for encoding and decoding object-based audio signals |
US7752038B2 (en) | 2006-10-13 | 2010-07-06 | Nokia Corporation | Pitch lag estimation |
ATE509347T1 (en) | 2006-10-20 | 2011-05-15 | Dolby Sweden Ab | DEVICE AND METHOD FOR CODING AN INFORMATION SIGNAL |
CN101583995B (en) | 2006-11-10 | 2012-06-27 | 松下电器产业株式会社 | Parameter decoding device, parameter encoding device, and parameter decoding method |
KR100788706B1 (en) | 2006-11-28 | 2007-12-26 | 삼성전자주식회사 | Method for encoding and decoding of broadband voice signal |
WO2008151408A1 (en) | 2007-06-14 | 2008-12-18 | Voiceage Corporation | Device and method for frame erasure concealment in a pcm codec interoperable with the itu-t recommendation g.711 |
GB2466670B (en) | 2009-01-06 | 2012-11-14 | Skype | Speech encoding |
GB2466674B (en) | 2009-01-06 | 2013-11-13 | Skype | Speech coding |
GB2466669B (en) | 2009-01-06 | 2013-03-06 | Skype | Speech coding |
GB2466671B (en) | 2009-01-06 | 2013-03-27 | Skype | Speech encoding |
GB2466675B (en) | 2009-01-06 | 2013-03-06 | Skype | Speech coding |
GB2466673B (en) | 2009-01-06 | 2012-11-07 | Skype | Quantization |
GB2466666B (en) | 2009-01-06 | 2013-01-23 | Skype | Speech coding |
GB2466672B (en) | 2009-01-06 | 2013-03-13 | Skype | Speech coding |
US8452606B2 (en) | 2009-09-29 | 2013-05-28 | Skype | Speech encoding using multiple bit rates |
2009
- 2009-01-06 GB GB0900142.1A, patent GB2466672B (en), Active
- 2009-06-02 US US12/455,478, patent US8433563B2 (en), Active
2010
- 2010-01-05 WO PCT/EP2010/050057, patent WO2010079167A1 (en), Application Filing
- 2010-01-05 EP EP10704769.8A, patent EP2384508B1 (en), Active
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8392178B2 (en) | 2009-01-06 | 2013-03-05 | Skype | Pitch lag vectors for speech encoding |
US8396706B2 (en) | 2009-01-06 | 2013-03-12 | Skype | Speech coding |
US8433563B2 (en) | 2009-01-06 | 2013-04-30 | Skype | Predictive speech signal coding |
US8463604B2 (en) | 2009-01-06 | 2013-06-11 | Skype | Speech encoding utilizing independent manipulation of signal and noise spectrum |
US8639504B2 (en) | 2009-01-06 | 2014-01-28 | Skype | Speech encoding utilizing independent manipulation of signal and noise spectrum |
US8655653B2 (en) | 2009-01-06 | 2014-02-18 | Skype | Speech coding by quantizing with random-noise signal |
US8670981B2 (en) | 2009-01-06 | 2014-03-11 | Skype | Speech encoding and decoding utilizing line spectral frequency interpolation |
US8849658B2 (en) | 2009-01-06 | 2014-09-30 | Skype | Speech encoding utilizing independent manipulation of signal and noise spectrum |
US9263051B2 (en) | 2009-01-06 | 2016-02-16 | Skype | Speech coding by quantizing with random-noise signal |
US10026411B2 (en) | 2009-01-06 | 2018-07-17 | Skype | Speech encoding utilizing independent manipulation of signal and noise spectrum |
US8452606B2 (en) | 2009-09-29 | 2013-05-28 | Skype | Speech encoding using multiple bit rates |
WO2013149672A1 (en) | 2012-04-05 | 2013-10-10 | Huawei Technologies Co., Ltd. | Method for determining an encoding parameter for a multi-channel audio signal and multi-channel audio encoder |
Also Published As
Publication number | Publication date |
---|---|
GB2466672A (en) | 2010-07-07 |
GB0900142D0 (en) | 2009-02-11 |
US8433563B2 (en) | 2013-04-30 |
US20100174537A1 (en) | 2010-07-08 |
EP2384508A1 (en) | 2011-11-09 |
EP2384508B1 (en) | 2018-09-05 |
WO2010079167A4 (en) | 2010-10-14 |
GB2466672B (en) | 2013-03-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2384508B1 (en) | Speech coding | |
US8463604B2 (en) | Speech encoding utilizing independent manipulation of signal and noise spectrum | |
US9530423B2 (en) | Speech encoding by determining a quantization gain based on inverse of a pitch correlation | |
US9263051B2 (en) | Speech coding by quantizing with random-noise signal | |
EP2384505B1 (en) | Speech encoding | |
US8396706B2 (en) | Speech coding | |
US8452606B2 (en) | Speech encoding using multiple bit rates | |
EP2384506B1 (en) | Speech coding method and apparatus | |
GB2499505A (en) | Speech signal decoding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 10704769; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWE | WIPO information: entry into national phase | Ref document number: 2010704769; Country of ref document: EP |