DE69132013T2 - PROCEDURE FOR VOICE QUANTIZATION AND ERROR CORRECTION

Info

Publication number: DE69132013T2
Application number: DE69132013T
Authority: DE
Inventors: C. Hardwick; S. Lim
Original assignee: Digital Voice Systems Inc
Current assignee: Digital Voice Systems Inc
Priority date: 1990-12-05
Filing date: 1991-12-04
Publication date: 2000-11-02
Anticipated expiration: 2011-12-05
Also published as: EP0560931A1; EP1211669A3; CA2096425A1; DE69132013D1; EP0893791B1; EP0560931A4; US5226084A; EP1211669B1; CA2096425C; DE69133058D1; DE69133458T2; DE69133058T2; EP1211669A2; EP0893791A3; AU9147091A; DE69133458D1; EP0893791A2; JP3467270B2; EP0560931B1; WO1992010830A1

Description

This invention relates to methods for coding speech and provides examples of methods that can preserve speech quality in the presence of bit errors in a speech signal.

Relevant publications include: J. L. Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pp. 378-386 (discusses a phase vocoder, a frequency-based speech analysis/synthesis system); Quatieri et al., "Speech Transformations Based on a Sinusoidal Representation", IEEE TASSP, Vol. ASSP34, No. 6, Dec. 1986, pp. 1449-1986 (discusses an analysis-synthesis method based on a sinusoidal representation); Griffin, "Multiband Excitation Vocoder", Ph.D. Thesis, M.I.T., 1987 (discusses an 8000 bit/s multi-band excitation speech coder); Griffin et al., "A High Quality 9.6 kbps Speech Coding System", Proc. ICASSP 86, pp. 125-128, Tokyo, Japan, April 13-20, 1986 (discusses a 9600 bit/s multi-band excitation speech coder); Griffin et al., "A New Model-Based Speech Analysis/Synthesis System", Proc. ICASSP 85, pp. 513-516, Tampa, FL, March 26-29, 1985 (discusses the multi-band excitation speech model); Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder", S.M. Thesis, M.I.T., May 1988 (discusses a 4800 bit/s multi-band excitation speech coder); McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech", Proc. ICASSP 85, pp. 945-948, Tampa, FL, March 26-29, 1985 (discusses speech coding based on a sinusoidal representation); Campbell et al., "The New 4800 bps Voice Coding Standard", Mil Speech Tech Conference, Nov. 1989 (discusses error correction in low-rate speech coders); Campbell et al., "CELP Coding for Land Mobile Radio Applications", Proc. ICASSP 90, pp. 465-468, Albuquerque, NM, April 3-6, 1990 (discusses error correction in low-rate speech coders); Levesque et al., Error-Control Techniques for Digital Communication, Wiley, 1985, pp. 157-170 (discusses error correction in general); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984 (discusses quantization in general); Makhoul et al., "Vector Quantization in Speech Coding", Proc. IEEE, 1985, pp. 1551-1588 (discusses vector quantization in general); Jayant et al., "Adaptive Postfiltering of 16 kb/s-ADPCM Speech", Proc. ICASSP 86, pp. 829-832, Tokyo, Japan, April 13-20, 1986 (discusses adaptive postfiltering of speech). The contents of these publications are incorporated herein by reference.

The problem of speech coding (compressing speech into a small number of bits) has a large number of applications and has consequently received considerable attention in the literature. One class of speech coders (vocoders) that has been extensively studied and used in practice is based on an underlying model of speech. Examples from this class of vocoders include linear prediction vocoders, homomorphic vocoders, and channel vocoders. In these vocoders, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or by random noise for unvoiced sounds. For this class of vocoders, speech is analyzed by first dividing it into segments using a window such as a Hamming window. Then, for each speech segment, the excitation parameters and system parameters are estimated and quantized. The excitation parameters consist of the voiced/unvoiced decision and the pitch period. The system parameters consist of the spectral envelope or the impulse response of the system. To reconstruct speech, the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.

Although vocoders based on this underlying speech model have been quite successful in producing intelligible speech, they have not been successful in producing high-quality speech. Consequently, they have not been widely used for high-quality speech coding. The poor quality of the reconstructed speech is due partly to inaccurate estimation of the model parameters and partly to limitations of the speech model.

A new speech model, referred to as the Multi-Band Excitation (MBE) speech model, was developed by Griffin and Lim in 1984. Speech coders based on this new speech model were developed by Griffin and Lim in 1986 and were shown to be capable of producing high-quality speech at rates above 8000 bit/s (bits per second). Subsequent work by Hardwick and Lim produced a 4800 bit/s MBE speech coder that was also capable of producing high-quality speech. This 4800 bit/s speech coder used more sophisticated quantization techniques to achieve, at 4800 bit/s, quality similar to that which earlier MBE speech coders had achieved at 8000 bit/s.

The 4800 bit/s MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated parameters. A discrete speech signal, denoted s(n), is obtained by sampling an analog speech signal. This is typically done at a sampling frequency of 8 kHz, although other sampling frequencies can easily be accommodated by a straightforward change in the various system parameters. The system divides the discrete speech signal into small overlapping segments by multiplying s(n) by a window w(n) (such as a Hamming window or a Kaiser window) to obtain a windowed signal s_w(n). Each speech segment is then analyzed to obtain a set of MBE speech model parameters that characterize that segment. The MBE speech model parameters consist of a fundamental frequency (equivalent to the pitch period), a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases. These model parameters are then quantized using a fixed number of bits for each segment. The resulting bits can then be used to reconstruct the speech signal by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters. A block diagram of a typical MBE speech coder is shown in Fig. 1.
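The windowing step s_w(n) = s(n)·w(n) can be sketched as follows. This is a minimal illustration only: the 256-sample window length, the 160-sample hop, and the function names are assumptions chosen for the example and are not specified in the text.

```python
import math

def hamming(n_len):
    """Return a Hamming window w(n) of length n_len."""
    if n_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def windowed_segments(s, seg_len, hop):
    """Return overlapping segments s_w(n) = s(n) * w(n); overlap needs hop < seg_len."""
    w = hamming(seg_len)
    segments = []
    for start in range(0, len(s) - seg_len + 1, hop):
        segments.append([s[start + n] * w[n] for n in range(seg_len)])
    return segments

# 100 ms of a 200 Hz tone sampled at 8 kHz (test signal, assumed).
signal = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(800)]
frames = windowed_segments(signal, seg_len=256, hop=160)
print(len(frames), len(frames[0]))
```

Each returned segment would then be passed to the parameter estimator; a different sampling frequency only changes the segment length and hop.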

The 4800 bit/s MBE speech coder required a sophisticated method to quantize the spectral amplitudes. For each speech segment, the number of bits available to quantize the spectral amplitudes varied between 50 and 125. In addition, the number of spectral amplitudes in each segment varied between 9 and 60. A quantization method was designed that could efficiently represent all of the spectral amplitudes with the number of bits available for each segment. Although this spectral amplitude quantization method was designed for use in an MBE speech coder, the quantization techniques are equally useful in a number of other speech coding methods, such as the sinusoidal transform coder and the harmonic coder. For a particular speech segment, $\hat{L}$ denotes the number of spectral amplitudes in that segment. The value of $\hat{L}$ is derived from the fundamental frequency $\hat{\omega}_0$ according to the relationship

$$\hat{L} = \left\lfloor \frac{\beta\pi}{\hat{\omega}_0} \right\rfloor \qquad (1)$$

where 0 ≤ β ≤ 1.0 determines the speech bandwidth relative to half the sampling frequency. The function ⌊x⌋ in equation (1) is equal to the largest integer less than or equal to x. The spectral amplitudes are denoted by $\hat{M}_l$ for 1 ≤ l ≤ $\hat{L}$, where $\hat{M}_1$ is the lowest-frequency spectral amplitude and $\hat{M}_{\hat{L}}$ is the highest-frequency spectral amplitude.
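As a worked illustration of how the number of spectral amplitudes follows from the fundamental frequency, the sketch below assumes the floor relationship L = ⌊βπ/ω0⌋ with ω0 in radians per sample. The value β = 0.95 and the function name are assumptions made for the example.

```python
import math

def num_spectral_amplitudes(omega0, beta=0.95):
    """Return floor(beta * pi / omega0) for omega0 in radians per sample."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if omega0 <= 0.0:
        raise ValueError("omega0 must be positive")
    return math.floor(beta * math.pi / omega0)

# Pitch periods of 20 and 100 samples at 8 kHz (400 Hz and 80 Hz).
for period in (20.0, 100.0):
    omega0 = 2.0 * math.pi / period
    print(period, num_spectral_amplitudes(omega0))
```

A lower fundamental frequency packs more harmonics into the same bandwidth, which is why the number of amplitudes to be quantized varies from segment to segment.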

The spectral amplitudes for the current speech segment are quantized by first computing a set of prediction residuals that indicate how much the spectral amplitudes have changed between the previous speech segment and the current one. If $\hat{L}^{0}$ denotes the number of spectral amplitudes in the current speech segment and $\hat{L}^{-1}$ denotes the number in the previous speech segment, then the prediction residuals $\hat{T}_l$ for 1 ≤ l ≤ $\hat{L}^{0}$ are given by equation (2), which combines the spectral amplitudes of the current speech segment with the quantized spectral amplitudes of the previous speech segment. The constant γ in equation (2) is typically equal to 0.7, but any value in the range 0 ≤ γ ≤ 1 can be used.
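One plausible form of the prediction residual is sketched below. It is an assumption, not a transcription of equation (2): each residual is taken as the current amplitude minus γ times the quantized previous amplitude at the same index, and when the current segment has more amplitudes than the previous one, the last previous amplitude is reused. All names are hypothetical.

```python
def prediction_residuals(cur, prev_q, gamma=0.7):
    """Return residuals cur[l] - gamma * prev_q[l] for every current amplitude.

    Assumption: indices beyond the end of prev_q reuse its last element.
    """
    if not prev_q:
        raise ValueError("previous segment must have at least one amplitude")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    residuals = []
    for l, m in enumerate(cur):
        ref = prev_q[l] if l < len(prev_q) else prev_q[-1]
        residuals.append(m - gamma * ref)
    return residuals

cur = [10.0, 8.0, 6.0, 4.0]      # current-segment amplitudes (made-up values)
prev_q = [9.0, 7.0, 5.0]         # quantized previous-segment amplitudes
res = prediction_residuals(cur, prev_q)
print(res)
```

With γ below 1, the influence of an old segment decays over time, so the effect of a corrupted previous segment fades over subsequent segments.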

The prediction residuals are divided into blocks of K elements, where the value of K is typically in the range 4 ≤ K ≤ 12. If $\hat{L}$ is not evenly divisible by K, then the highest-frequency block contains fewer than K elements. This is shown in Fig. 2 for $\hat{L}$ = 34 and K = 8.
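The blocking rule can be illustrated with the 34-element, K = 8 case mentioned above. The function name and the placeholder residual values are assumptions.

```python
def split_into_blocks(values, k):
    """Split values into consecutive blocks of k elements.

    The last (highest-frequency) block holds the remainder and may be shorter.
    """
    if k < 1:
        raise ValueError("k must be positive")
    return [values[i:i + k] for i in range(0, len(values), k)]

residuals = list(range(1, 35))            # 34 placeholder residuals
blocks = split_into_blocks(residuals, 8)
print([len(b) for b in blocks])           # [8, 8, 8, 8, 2]
```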

Each of the prediction residual blocks is then transformed using a Discrete Cosine Transform (DCT), defined by equation (3). The length of the transform for each block, J, is equal to the number of elements in the block. Therefore, all but the highest-frequency block are transformed with a DCT of length K, while the length of the DCT for the highest-frequency block is less than or equal to K. Since the DCT is an invertible transform, the DCT coefficients completely specify the spectral amplitude prediction residuals for the current segment.
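For orientation, a common DCT convention with 1/J normalization is shown below. This form is an assumption consistent with a block of length J and with the inverse transform weighting described later; the symbols x(j) and X(k) are introduced here for illustration only.

```latex
X(k) = \frac{1}{J} \sum_{j=0}^{J-1} x(j)\,
       \cos\!\left[\frac{\pi k \left(j + \tfrac{1}{2}\right)}{J}\right],
\qquad 0 \le k \le J-1
```

Under this normalization, X(0) is simply the average of the block.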

The total number of bits available for quantizing the spectral amplitudes is divided among the DCT coefficients according to a bit allocation rule. This rule attempts to give more bits to the perceptually more important low-frequency blocks than to the perceptually less important high-frequency blocks. In addition, the bit allocation rule divides the bits within a block among the DCT coefficients according to their relative long-term variances. This approach matches the bit allocation to the perceptual characteristics of speech and to the quantization properties of the DCT.

Each DCT coefficient is quantized using the number of bits specified by the bit allocation rule. Typically, uniform quantization is used, although non-uniform or vector quantization can also be used. The step size for each quantizer is determined from the long-term variance of the DCT coefficients and from the number of bits used to quantize each coefficient. Table 1 shows the typical variation of the step size as a function of the number of bits, for a long-term variance equal to σ².

Table 1: Step size of uniform quantizers

Number of bits    Step size
1                 1.2σ
2                 0.85σ
3                 0.65σ
4                 0.42σ
5                 0.28σ
6                 0.14σ
7                 0.07σ
8                 0.035σ
9                 0.0175σ
10                0.00875σ
11                0.00438σ
12                0.00219σ
13                0.00110σ
14                0.000550σ
15                0.000275σ
16                0.000138σ
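A uniform quantizer driven by the Table 1 step sizes could look like the sketch below. The step factors come from the table; the mid-rise cell layout, the clamping at the range limits, and the function names are assumptions made for illustration.

```python
import math

# Step size as a multiple of sigma for 1..16 bits (values from Table 1).
STEP_FACTOR = [1.2, 0.85, 0.65, 0.42, 0.28, 0.14, 0.07, 0.035,
               0.0175, 0.00875, 0.00438, 0.00219, 0.00110,
               0.000550, 0.000275, 0.000138]

def quantize(x, bits, sigma):
    """Return the cell index of x in a uniform mid-rise quantizer (assumed layout)."""
    step = STEP_FACTOR[bits - 1] * sigma
    half = 1 << (bits - 1)
    index = math.floor(x / step) + half
    return max(0, min((1 << bits) - 1, index))   # clamp to the available levels

def dequantize(index, bits, sigma):
    """Return the midpoint of the cell addressed by index."""
    step = STEP_FACTOR[bits - 1] * sigma
    half = 1 << (bits - 1)
    return (index - half + 0.5) * step

idx = quantize(0.3, bits=2, sigma=1.0)
print(idx, dequantize(idx, bits=2, sigma=1.0))
```

Coefficients with a larger long-term standard deviation σ get proportionally larger steps, so the same number of bits covers a wider range.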

Once each DCT coefficient has been quantized using the number of bits specified by the bit allocation rule, the binary representation can be transmitted, stored, etc., depending on the application. The spectral amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining the result with the quantized spectral amplitudes of the previous segment using the inverse of equation (2). The inverse DCT is given by equation (4), where the length J for each block is chosen to be the number of elements in that block, and the weighting α(j) is given by equation (5).
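The invertibility of the transform can be checked with a small round trip. The sketch assumes a forward DCT with 1/J normalization and an inverse whose weight α is 1 for the DC term and 2 for all other terms; this pairing is a standard one but is an assumption here, since the equations themselves are not reproduced.

```python
import math

def dct(x):
    """Forward DCT of one block with 1/J normalization (assumed convention)."""
    J = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (j + 0.5) / J) for j in range(J)) / J
            for k in range(J)]

def idct(X):
    """Inverse DCT; the weight alpha is 1 for k = 0 and 2 otherwise."""
    J = len(X)
    return [sum((1.0 if k == 0 else 2.0) * X[k]
                * math.cos(math.pi * k * (j + 0.5) / J) for k in range(J))
            for j in range(J)]

block = [3.7, 3.1, 2.5, 0.5, -1.0]    # one block of residuals (made-up values)
coeffs = dct(block)
restored = idct(coeffs)
print(max(abs(a - b) for a, b in zip(block, restored)))
```

Because the block length J is taken from the block itself, the same two functions handle the shorter highest-frequency block without special cases.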

A potential problem with the 4800 bit/s MBE speech coder is that the perceived quality of the reconstructed speech can be significantly reduced if bit errors are introduced into the binary representation of the MBE model parameters. Since bit errors exist in many speech coder applications, a robust speech coder must be able to correct, detect and/or tolerate bit errors. One technique that has been found to be very successful is to use error correction codes in the binary representation of the model parameters. Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The error rate estimate can then be used to adaptively process the model parameters in order to reduce the effect of any remaining bit errors. Typically, the error rate is estimated by counting the number of errors corrected (or detected) by the error correction codes in the current segment and then using this information to update the current estimate of the error rate. For example, if each segment contains a (23,12) Golay code that can correct up to three errors in the 23 bits, and $\epsilon_T$ denotes the number of errors (0-3) corrected in the current segment, then the current error rate estimate $\epsilon_R$ is updated according to equation (6), in which β is a constant in the range 0 ≤ β ≤ 1 that controls the adaptability of $\epsilon_R$.
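A running error-rate estimate of this kind is often implemented as a first-order recursive average. The sketch below assumes that form, blending the previous estimate with the fraction of corrected bits in the current 23-bit codeword; the exact update rule and the default β = 0.05 are assumptions, not values taken from the text.

```python
def update_error_rate(eps_r, eps_t, beta=0.05, code_len=23):
    """Return the new estimate (1 - beta) * eps_r + beta * eps_t / code_len."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if not 0 <= eps_t <= code_len:
        raise ValueError("eps_t must lie between 0 and code_len")
    return (1.0 - beta) * eps_r + beta * (eps_t / code_len)

# Feed in the corrected-error counts of five consecutive segments.
estimate = 0.0
for corrected in (0, 1, 0, 3, 0):
    estimate = update_error_rate(estimate, corrected)
print(estimate)
```

A β close to 0 gives a slowly varying estimate, while a β close to 1 makes the estimate follow the most recent segment.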

When error correction or error detection codes are used, the bits representing the speech model parameters are converted into another set of bits that is more robust to bit errors. The use of error correction or detection codes typically increases the number of bits that must be transmitted or stored. The number of additional bits that must be transmitted is usually related to the robustness of the error correction or detection code. In most applications, it is desirable to minimize the total number of bits that are transmitted or stored. In this case, the error correction or detection codes must be selected to maximize overall system performance.

Another problem with this class of speech coding systems is that limitations in the estimation of the speech model parameters can degrade the quality of the synthesized speech. Subsequent quantization of the model parameters causes further degradation. This degradation can take the form of a reverberant or muffled quality in the synthesized speech. In addition, background noise or other artifacts that were not present in the original speech may appear. This form of degradation occurs even when there are no bit errors in the speech data, although bit errors can make the problem worse. Typically, speech coding systems attempt to optimize the parameter estimators and parameter quantizers to minimize this form of degradation. Other systems attempt to reduce the degradation by postfiltering. In postfiltering, the output speech is filtered in the time domain with an adaptive all-pole filter to sharpen the formant peaks. This method does not allow fine control over the spectral enhancement process, and it is computationally expensive and inefficient for frequency-domain speech coders.

The invention described herein applies to many different speech coding methods, including but not limited to linear prediction speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders, and improved multi-band excitation (IMBE) speech coders. For the purpose of describing this invention in detail, we use the 6.4 kbit/s IMBE speech coder that was recently standardized as part of the INMARSAT-M (International Marine Satellite Organization) satellite communication system. This coder uses a robust speech model referred to as the Multi-Band Excitation (MBE) speech model.

Efficient methods for quantizing the MBE model parameters have been developed. These methods are capable of quantizing the model parameters at virtually any bit rate above 2 kbit/s. The 6.4 kbit/s IMBE speech coder used in the INMARSAT-M satellite communication system uses a frame rate of 50 Hz, so 128 bits are available per frame. Of these 128 bits, 45 are reserved for forward error correction. The remaining 83 bits per frame are used to quantize the MBE model parameters, which consist of a fundamental frequency $\hat{\omega}_0$, a set of V/UV decisions $\hat{v}_k$ for 1 ≤ k ≤ $\hat{K}$, and a set of spectral amplitudes $\hat{M}_l$ for 1 ≤ l ≤ $\hat{L}$. The values of $\hat{K}$ and $\hat{L}$ vary depending on the fundamental frequency of each frame. The 83 available bits are divided among the model parameters as shown in Table 2.

Table 2: Bit allocation among model parameters

Parameter                    Number of bits
Fundamental frequency        8
Voiced/unvoiced decisions    $\hat{K}$
Spectral amplitudes          $75 - \hat{K}$
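The bit budget described above can be checked with a short sketch. The function and constant names are illustrative and not part of the source; the figures (6.4 kbit/s, 50 Hz, 45 error-correction bits, 8 bits for the fundamental frequency, one bit per V/UV decision, at most 12 decisions) are those given in the description.

```python
BIT_RATE = 6400      # bits per second (6.4 kbit/s IMBE coder)
FRAME_RATE = 50      # frames per second
FEC_BITS = 45        # bits per frame reserved for forward error correction
PITCH_BITS = 8       # bits per frame for the fundamental frequency


def bit_allocation(num_vuv_decisions):
    """Return the per-frame bit split for a given number of V/UV decisions."""
    if not 1 <= num_vuv_decisions <= 12:
        raise ValueError("the 6.4 kbit/s system uses at most 12 V/UV decisions")
    bits_per_frame = BIT_RATE // FRAME_RATE        # 128 bits per frame
    model_bits = bits_per_frame - FEC_BITS         # 83 bits for model parameters
    # One bit per V/UV decision; whatever is left goes to the spectral amplitudes.
    amplitude_bits = model_bits - PITCH_BITS - num_vuv_decisions
    return {
        "fundamental_frequency": PITCH_BITS,
        "vuv_decisions": num_vuv_decisions,
        "spectral_amplitudes": amplitude_bits,
    }


print(bit_allocation(12))
```

Whatever the number of V/UV decisions, the three entries always sum to 83 bits.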

The fundamental frequency is quantized by first converting it to its equivalent pitch period using equation (7).

$$\hat{P}_0 = \frac{2\pi}{\hat{\omega}_0} \qquad (7)$$

The value of $\hat{P}_0$ is typically restricted to the range $20 \le \hat{P}_0 \le 120$, assuming a sampling rate of 8 kHz. In the 6.4 kbit/s IMBE system this parameter is uniformly quantized using 8 bits and a step size of 0.5. This corresponds to a pitch accuracy of one half sample.
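A minimal sketch of this pitch quantizer follows. The range, step size, and bit count come from the text; the index mapping (index 0 at a period of 20 samples), the clamping, and the function names are assumptions made for illustration.

```python
import math

P_MIN, P_MAX = 20.0, 120.0   # allowed pitch period range in samples
STEP = 0.5                   # quantizer step size in samples
BITS = 8                     # bits available for the pitch index


def quantize_pitch(omega0):
    """Map a fundamental frequency (radians per sample) to an 8-bit index."""
    period = 2.0 * math.pi / omega0              # equation (7)
    period = min(max(period, P_MIN), P_MAX)      # clamp to the stated range (assumption)
    index = int(round((period - P_MIN) / STEP))  # assumed mapping: index 0 <-> period 20
    assert 0 <= index < 2 ** BITS
    return index


def dequantize_pitch(index):
    """Recover the quantized fundamental frequency from the index."""
    period = P_MIN + STEP * index
    return 2.0 * math.pi / period
```

With a step of 0.5 over the range 20 to 120 there are 201 levels, which fit in 8 bits, and the reconstructed pitch period is within a quarter sample of the clamped input.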

The V/UV decisions are binary values, so they can be encoded using a single bit per decision. The 6.4 kbit/s system uses a maximum of 12 decisions, and the width of each frequency band is equal to $3\hat{\omega}_0$. The width of the highest frequency band is adjusted so that it includes frequencies up to 3.8 kHz.

The spectral amplitudes are quantized by forming a set of prediction residuals. Each prediction residual is the difference between the logarithm of the spectral amplitude for the current frame and the logarithm of the spectral amplitude representing the same frequency in the previous speech frame. The spectral amplitude prediction residuals are then divided into six blocks, each containing approximately the same number of prediction residuals. Each of the six blocks is then transformed with a discrete cosine transform (DCT), and the DC coefficients from each of the six blocks are combined into a 6-element prediction residual block average (PRBA) vector. The mean is subtracted from the PRBA vector and quantized using a 6-bit nonuniform quantizer. The zero-mean PRBA vector is then vector quantized using a 10-bit vector quantizer. The 10-bit PRBA codebook was designed using a k-means clustering algorithm on a large training set consisting of zero-mean PRBA vectors from a variety of speech material. The higher-order DCT coefficients that are not included in the PRBA vector are quantized with uniform scalar quantizers using the $59 - \hat{K}$ remaining bits. The bit allocation and quantizer step sizes are based on the long-term variances of the higher-order DCT coefficients.

This quantization method has several advantages. First, it provides very good fidelity using a small number of bits, and it maintains this fidelity as $\hat{L}$ varies over its range. In addition, the computational requirements of this approach are well within the limits required for real-time implementation using a single DSP such as the AT&T DSP32C. Finally, this quantization method separates the spectral amplitudes into a few components, such as the mean of the PRBA vector, that are sensitive to bit errors, and a large number of other components that are not very sensitive to bit errors. Forward error correction can then be used efficiently by providing a high degree of protection for the few sensitive components and a lower degree of protection for the remaining components. This is discussed in the next section.

The invention features an improved method for quantizing the prediction residuals. The prediction residuals are grouped into blocks, the average of the prediction residuals within each block is determined, the averages of all the blocks are grouped into a prediction residual block average (PRBA) vector, and the PRBA vector is encoded. In preferred embodiments, the average of the prediction residuals is obtained either by adding the spectral amplitude prediction residuals within the block and dividing by the number of prediction residuals within that block, or by computing the DCT of the spectral amplitude prediction residuals within a block and using the first coefficient of the DCT as the average. The PRBA vector is preferably encoded using one of two methods: (1) performing a transform, such as the DCT, on the PRBA vector and scalar quantizing the transform coefficients; (2) vector quantizing the PRBA vector. Vector quantization is preferably performed by determining the mean of the PRBA vector, quantizing that mean using scalar quantization, and quantizing the zero-mean PRBA vector using vector quantization with a zero-mean codebook. An advantage of this aspect of the invention is that it allows the prediction residuals to be quantized with less distortion for a given number of bits.

In a preferred aspect, the invention features an improved method for forming the predicted spectral amplitudes. They are based on interpolating the spectral amplitudes of a previous segment in order to estimate the spectral amplitudes of that previous segment at the frequencies of the current segment. This new method corrects for shifts in the frequencies of the spectral amplitudes between segments, with the result that the prediction residuals have a lower variance and can therefore be quantized with less distortion for a given number of bits. In preferred embodiments, the frequencies of the spectral amplitudes are the fundamental frequency and its multiples.

The invention may also feature an improved method for dividing the prediction residuals into blocks. Instead of fixing the length of each block and then dividing the prediction residuals into a variable number of blocks, the prediction residuals are divided into a predetermined number of blocks, and the size of the blocks varies from segment to segment. In preferred embodiments, six (6) blocks are used in all segments; the number of prediction residuals in a lower frequency block is not larger than the number of prediction residuals in a higher frequency block; and the difference between the number of elements in the highest frequency block and the number of elements in the lowest frequency block is less than or equal to one. This new method more closely matches the characteristics of speech, and therefore allows the prediction residuals to be quantized with less distortion for a given number of bits. In addition, it can easily be used with vector quantization to further improve the quantization of the spectral amplitudes.

The invention preferably features an improved method for determining the voiced/unvoiced decisions in the presence of a high bit error rate. The bit error rate is estimated for the current speech segment and compared with a predetermined error rate threshold. If the estimated bit error rate is above the error rate threshold, the voiced/unvoiced decisions for all spectral amplitudes above a predetermined energy threshold are declared voiced for the current segment. This reduces the perceptual effect of bit errors: distortions caused by switching from voiced to unvoiced are reduced.
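The decision rule just described can be sketched as follows. The function name, the argument layout, and the choice to compare each spectral amplitude directly against the energy threshold are assumptions for illustration; the source does not give threshold values here.

```python
def override_vuv(vuv, amplitudes, est_bit_error_rate,
                 error_rate_threshold, energy_threshold):
    """Declare high-energy spectral amplitudes voiced when the error rate is high.

    vuv        -- list of booleans, True meaning voiced, one per spectral amplitude
    amplitudes -- list of spectral amplitudes for the current segment
    """
    if est_bit_error_rate <= error_rate_threshold:
        # Error rate is acceptable: keep the decoded decisions unchanged.
        return list(vuv)
    # Error rate is high: force every sufficiently strong amplitude to voiced.
    return [True if m > energy_threshold else v
            for v, m in zip(vuv, amplitudes)]
```

For example, with decisions `[False, True, False]`, amplitudes `[5.0, 0.1, 0.2]`, an energy threshold of 1.0 and an error rate above the threshold, only the first decision is changed to voiced.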

The invention may also feature an improved method for error correction (or error detection) coding of the speech model parameters. The new method uses at least two types of error correction coding to encode the quantized model parameters. A first type of coding, which adds a greater number of additional bits than a second type of coding, is used for a group of parameters that is more sensitive to bit errors. The other type of error correction coding is used for a second group of parameters that is less sensitive to bit errors than the first. Compared with existing methods, the new method improves the quality of the synthesized speech in the presence of bit errors while reducing the number of additional error correction or detection bits that must be added. In preferred embodiments, the different types of error correction include Golay codes and Hamming codes.

The invention preferably features a further method for improving the quality of synthesized speech in the presence of bit errors. The error rate is estimated from the error correction coding, and one or more model parameters from a previous segment are repeated in the current segment when the error rate for those parameters exceeds a predetermined level. In preferred embodiments, all model parameters are repeated.
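A minimal sketch of this repeat rule, for the preferred case in which all model parameters are repeated, might look like the following. The dictionary representation of a parameter set and the function name are hypothetical; how the error rate is estimated from the error correction decoder is not shown.

```python
def select_parameters(decoded, previous, est_error_rate, repeat_threshold):
    """Return the parameter set to use for synthesis of the current segment.

    decoded  -- parameters decoded for the current segment (dict)
    previous -- parameters used for the previous segment, or None for the first segment
    """
    if previous is not None and est_error_rate > repeat_threshold:
        # Too many errors: repeat all model parameters from the previous segment.
        return dict(previous)
    return dict(decoded)
```

Repeating only a subset of the parameters, as the text also allows, would copy selected keys from `previous` instead of the whole set.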

The invention may also feature a new method for reducing the degradation caused by the estimation and quantization of the model parameters. This new method uses a frequency domain representation of the spectral envelope parameters to enhance regions of the spectrum that are perceptually important and to attenuate regions of the spectrum that are perceptually insignificant. The result is that degradation in the synthesized speech is reduced. A smoothed spectral envelope of the segment is generated by smoothing the spectral envelope, and an enhanced spectral envelope is generated by increasing some frequency regions of the spectral envelope for which the spectral envelope has greater amplitude than the smoothed envelope, and decreasing some frequency regions for which the spectral envelope has lesser amplitude than the smoothed envelope. In preferred embodiments, the smoothed spectral envelope is generated by estimating a low-order model (e.g. an all-pole model) from the spectral envelope. Compared with existing methods, this new method is computationally more efficient for frequency domain speech coders. In addition, this new method improves speech quality by removing the frequency domain constraints imposed by time domain methods.
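The idea of raising regions above the smoothed envelope and lowering regions below it can be illustrated with the sketch below. It is not the preferred embodiment: a 3-point moving average stands in for the low-order all-pole model, and the weighting rule (ratio of envelope to smoothed envelope raised to a power `alpha`) and the value of `alpha` are assumptions chosen only to show the effect.

```python
def enhance_envelope(amplitudes, alpha=0.25):
    """Return an enhanced copy of a list of positive spectral amplitudes."""
    n = len(amplitudes)
    smoothed = []
    for i in range(n):
        # Assumed smoother: 3-point moving average, shortened at the edges.
        window = amplitudes[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    # Amplitudes above the smoothed envelope get a weight > 1, those below get < 1.
    return [m * (m / s) ** alpha for m, s in zip(amplitudes, smoothed)]
```

For the input `[1.0, 4.0, 1.0]` the central peak is increased and its two neighbours are decreased, while a flat envelope is left unchanged.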

Further features and advantages of the invention will be apparent from the following description of the preferred embodiments.

In the drawings:

Figs. 1-2 are diagrams showing prior art speech coding methods.

Fig. 3 is a flow chart showing a preferred embodiment of the invention in which the spectral amplitude prediction accounts for any change in the fundamental frequency.

Fig. 4 is a flow chart showing a preferred embodiment of the invention in which the spectral amplitudes are divided into a fixed number of blocks.

Fig. 5 is a flow chart showing a preferred embodiment of the invention in which a prediction residual block average vector is formed.

Fig. 6 is a flow chart showing a preferred embodiment of the invention in which the prediction residual block average vector is vector quantized.

Fig. 7 is a flow chart showing a preferred embodiment of the invention in which the prediction residual block average vector is quantized with a DCT and scalar quantization.

Fig. 8 is a flow chart showing a preferred embodiment of the encoder of the invention, in which different error correction codes are used for different model parameter bits.

Fig. 9 is a flow chart showing a preferred embodiment of the decoder of the invention, in which different error correction codes are used for different model parameter bits.

Fig. 10 is a flow chart showing a preferred embodiment of the invention in which enhancement of the frequency domain spectral envelope parameters is illustrated.

In the prior art, the spectral amplitude prediction residuals were formed using equation (2). This method does not account for any change in the fundamental frequency between the previous segment and the current segment. To account for the change in the fundamental frequency, a new method has been developed that first interpolates the spectral amplitudes of the previous segment. This is typically done using linear interpolation, although various other forms of interpolation could also be used. The interpolated spectral amplitudes of the previous segment are then resampled at the frequency points corresponding to the multiples of the fundamental frequency of the current segment. This combination of interpolation and resampling produces a set of predicted spectral amplitudes that have been corrected for any change in the fundamental frequency between segments.

Typically, a fraction of the base-two logarithm of the predicted spectral amplitudes is subtracted from the base-two logarithm of the spectral amplitudes of the current segment. If linear interpolation is used to compute the predicted spectral amplitudes, this can be expressed mathematically by equation (8), in which the interpolation weight $\delta_l$ is given by equation (9) and $\gamma$ is a constant subject to $0 \le \gamma \le 1$. Typically $\gamma = 0.7$, but other values of $\gamma$ can also be used. For example, $\gamma$ could be varied adaptively from segment to segment to improve performance. The two fundamental frequency parameters in equation (9) refer to the current segment and the previous segment, respectively. When the two fundamental frequencies are equal, the new method is identical to the old method. In other cases the new method produces a prediction residual with lower variance than the old method. This allows the prediction residuals to be quantized with less distortion for a given number of bits.
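The interpolate, resample, and subtract procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of the patent's equations: the interpolation is done on base-two log amplitudes, positions outside the previous segment's harmonics are clamped to the nearest harmonic, and all names are hypothetical.

```python
import math


def prediction_residuals(cur_amps, cur_w0, prev_amps, prev_w0, gamma=0.7):
    """Residuals for harmonics l = 1..len(cur_amps) of the current segment.

    cur_amps, prev_amps -- positive spectral amplitudes at harmonics 1, 2, ...
    cur_w0, prev_w0     -- fundamental frequencies of current and previous segment
    """
    prev_log = [math.log2(m) for m in prev_amps]
    n_prev = len(prev_log)

    def prev_log_at(k):
        # Log amplitude of previous harmonic k (1-based), clamped at the edges
        # (the edge handling is an assumption).
        k = min(max(k, 1), n_prev)
        return prev_log[k - 1]

    residuals = []
    for l, m in enumerate(cur_amps, start=1):
        # Current harmonic l expressed in units of the previous fundamental.
        pos = l * cur_w0 / prev_w0
        k = math.floor(pos)
        delta = pos - k                      # linear interpolation weight
        predicted = (1.0 - delta) * prev_log_at(k) + delta * prev_log_at(k + 1)
        residuals.append(math.log2(m) - gamma * predicted)
    return residuals
```

When the two fundamental frequencies are equal, `delta` is zero for every harmonic and each residual reduces to the difference between the current log amplitude and `gamma` times the previous log amplitude at the same harmonic, which matches the statement that the new method then coincides with the old one.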

In another aspect of the invention, a new method has been developed for dividing the spectral amplitude prediction residuals into blocks. In the old method, the $\hat{L}$ prediction residuals from the current segment were divided into blocks of K elements, where K = 8 is a typical value. Using this method, it was found that the characteristics of each block were significantly different for large and small values of $\hat{L}$. This reduced the quantization performance and thereby increased the distortion in the spectral amplitudes. To make the characteristics of each block more uniform, a new method was devised that divides the $\hat{L}$ prediction residuals into a fixed number of blocks. The length of each block is chosen such that all blocks within a segment have nearly the same length and the sum of the lengths of all blocks within a segment equals $\hat{L}$. Typically, the prediction residuals are divided into 6 blocks, each of length $\lfloor \hat{L}/6 \rfloor$. If $\hat{L}$ is not evenly divisible by 6, the length of one or more of the higher frequency blocks is increased by one, so that all spectral amplitudes are included in one of the six blocks. This new method is shown in Fig. 4 for the case where 6 blocks are used and $\hat{L} = 34$. With this new method, the approximate percentage of the prediction residuals contained in each block is independent of $\hat{L}$. This reduces the variation in the characteristics of each block and allows more efficient quantization of the prediction residuals.
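The block division rule can be written as a short sketch. The function names are illustrative; the rule itself (a fixed number of blocks, with the extra elements assigned to the higher frequency blocks) follows the description above.

```python
def block_lengths(num_residuals, num_blocks=6):
    """Lengths of the blocks, ordered from lowest to highest frequency."""
    base, extra = divmod(num_residuals, num_blocks)
    # The `extra` highest-frequency blocks each receive one additional element.
    return [base] * (num_blocks - extra) + [base + 1] * extra


def split_into_blocks(residuals, num_blocks=6):
    """Split a list of prediction residuals into consecutive blocks."""
    blocks, start = [], 0
    for length in block_lengths(len(residuals), num_blocks):
        blocks.append(residuals[start:start + length])
        start += length
    return blocks


print(block_lengths(34))   # [5, 5, 6, 6, 6, 6]
```

For 34 residuals the two lowest-frequency blocks hold 5 elements and the four highest-frequency blocks hold 6, so block lengths never decrease with frequency and differ by at most one.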

The quantization of the prediction residuals can be further improved by forming a prediction residual block average (PRBA) vector. The length of the PRBA vector is equal to the number of blocks in the current segment. The elements of this vector correspond to the average of the prediction residuals within each block. Since the first DCT coefficient is equal to the average (or DC value), the PRBA vector can be formed from the first DCT coefficient of each block. This is shown in Fig. 5 for the case where there are 6 blocks in the current segment and $\hat{L} = 34$. This process can be generalized by forming additional vectors from the second (or third, fourth, etc.) DCT coefficient of each block.
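The following sketch forms a PRBA vector from the first DCT coefficient of each block. The DCT used here is normalized by the block length so that coefficient 0 equals the block average; that normalization is an assumption made to match the statement above, and the function names are illustrative.

```python
import math


def dct(block):
    """DCT of a block, scaled so that coefficient 0 is the block average."""
    n = len(block)
    return [
        sum(x * math.cos(math.pi * k * (j + 0.5) / n) for j, x in enumerate(block)) / n
        for k in range(n)
    ]


def prba_vector(blocks):
    """Collect the first (DC) DCT coefficient of each block."""
    return [dct(block)[0] for block in blocks]
```

Taking `dct(block)[1]` (or higher indices) from each block instead would give the additional vectors mentioned in the generalization.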

The elements of the PRBA vector are highly correlated. Therefore, a number of methods can be used to improve the quantization of the spectral amplitudes. One method that can achieve very low distortion with a small number of bits is vector quantization. In this method, a codebook is constructed that contains a number of typical PRBA vectors. The PRBA vector for the current segment is compared with each of the codebook vectors, and the one with the smallest error is chosen as the quantized PRBA vector. The codebook index of the chosen vector is used to form the binary representation of the PRBA vector. A method for performing vector quantization of the PRBA vector has been developed that uses the cascade of a 6-bit non-uniform quantizer for the mean of the vector and a 10-bit vector quantizer for the remaining information. This method is illustrated in Fig. 6 for the case where the PRBA vector always contains 6 elements. Typical values for the 6-bit and 10-bit quantizers are given in the attached appendix.
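The two-stage scheme described above (scalar quantization of the mean, then a codebook search on the zero-mean remainder) can be illustrated with the following sketch. The function names, the tiny three-level mean table and the three-entry codebook are made up for illustration; a real coder of this kind would use 64 mean levels (6 bits) and 1024 codebook vectors (10 bits) of six elements each, with the values from the appendix.

```python
def nearest_index(value, levels):
    """Index of the scalar quantizer level closest to value."""
    return min(range(len(levels)), key=lambda i: (value - levels[i]) ** 2)


def quantize_prba(prba, mean_levels, codebook):
    """Return (mean index, codebook index) for one PRBA vector."""
    mean = sum(prba) / len(prba)
    mean_index = nearest_index(mean, mean_levels)

    # Remove the mean so the codebook only has to describe the vector's shape.
    zero_mean = [x - mean for x in prba]

    def squared_error(i):
        return sum((z - c) ** 2 for z, c in zip(zero_mean, codebook[i]))

    shape_index = min(range(len(codebook)), key=squared_error)
    return mean_index, shape_index


# Toy data (hypothetical): 3 mean levels, 3 zero-mean codebook vectors of length 2.
toy_mean_levels = [0.0, 1.0, 2.0]
toy_codebook = [[-1.0, 1.0], [1.0, -1.0], [0.0, 0.0]]
result = quantize_prba([2.9, 1.1], toy_mean_levels, toy_codebook)
```

The exhaustive search over the codebook is the simplest possible choice; it is shown only to make the "compare with each codebook vector and keep the one with the smallest error" step concrete.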

An alternative method for quantizing the PRBA vector has also been developed. This method requires less computation and storage than vector quantization. In this method, the PRBA vector is first transformed with a DCT as defined in equation (3). The length of the DCT is equal to the number of elements in the PRBA vector. The DCT coefficients are then quantized in a manner similar to that discussed in the prior art. First, a bit allocation rule is used to divide the total number of bits used to quantize the PRBA vector among the DCT coefficients. Scalar quantization (either uniform or non-uniform) is then used to quantize each DCT coefficient using the number of bits specified by the bit allocation rule. This is shown in Fig. 7 for the case where the PRBA vector always contains 6 elements.
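A minimal sketch of this transform-based alternative follows. Equation (3) is not reproduced in this part of the text, so the DCT below is an assumed form, scaled by 1/N so that the first coefficient equals the mean, which is the property the description relies on. The bit allocation `[6, 4, 3, 2, 2, 1]`, the step size and the function names are hypothetical, and only the uniform variant of the scalar quantizer is shown.

```python
import math


def dct(x):
    """Assumed DCT scaling: coefficient 0 is the mean of x."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n)) / n
        for k in range(n)
    ]


def uniform_quantize(value, bits, step):
    """Uniform scalar quantizer: returns (index, reconstructed value)."""
    if bits == 0:
        return 0, 0.0
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    index = max(lo, min(hi, round(value / step)))
    return index, index * step


def quantize_prba_dct(prba, bit_allocation, step=0.25):
    """Transform the PRBA vector, then quantize each coefficient with its own bit count."""
    coeffs = dct(prba)
    return [uniform_quantize(c, b, step) for c, b in zip(coeffs, bit_allocation)]


coeffs = dct([3.0] * 6)
quantized = quantize_prba_dct([3.0] * 6, [6, 4, 3, 2, 2, 1])
```

For a constant PRBA vector all the energy lands in the first coefficient, which is why a bit allocation rule would normally give that coefficient the most bits.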

Various other methods can be used to quantize the PRBA vector efficiently. For example, other transforms, such as the Discrete Fourier Transform, the Fast Fourier Transform, or the Karhunen-Loève Transform, could be used instead of the DCT. In addition, vector quantization can be combined with the DCT or another transform. The improvements gained from this aspect of the invention can be used with a wide variety of quantization methods.

In a further aspect, a new method has been developed for reducing the perceptual effect of bit errors. Error correction codes are used, as in the prior art, to correct infrequent bit errors and to provide an estimate of the error rate εR. The new method uses the error rate estimate to smooth the voiced/unvoiced decisions in order to reduce the perceived effect of any remaining bit errors. This is done by first comparing the error rate to a threshold that marks the rate at which the distortion from uncorrected bit errors in the voiced/unvoiced decisions becomes significant. The exact value of this threshold depends on the amount of error correction applied to the voiced/unvoiced decisions, but a threshold of 0.003 is typical when little error correction has been applied. If the estimated error rate εR is below this threshold, the voiced/unvoiced decisions are not perturbed. If εR is above this threshold, every spectral amplitude for which equation (10) is satisfied is declared voiced.

Although equation (10) assumes a threshold of 0.003, this method can easily be modified to accommodate other thresholds. The parameter SE is a measure of the local average energy contained in the spectral amplitudes. This parameter is typically updated at each segment according to equation (11), where R₀ is given by equation (12).

The initial value of SE is set to an arbitrary value in the range 0 ≤ SE ≤ 10000.0. The purpose of this parameter is to reduce the dependence of equation (10) on the average signal level. This ensures that the new method works as well for low-level signals as it does for high-level signals.

The specific forms of equations (10), (11) and (12) and the constants contained in them can easily be modified while retaining the essential components of the new method. The main components of this new method are, first, the use of an estimate of the error rate to determine whether the voiced/unvoiced decisions need to be smoothed. Second, if smoothing is required, the voiced/unvoiced decisions are perturbed so that all high-energy spectral amplitudes are declared voiced. This eliminates any high-energy voiced-to-unvoiced or unvoiced-to-voiced transitions between segments and consequently improves the perceived quality of the reconstructed speech in the presence of bit errors.
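The two components named above can be sketched as follows. This is an illustration only: the function name `smooth_voicing` is hypothetical, and `energy_threshold` is a stand-in for the right-hand side of equation (10), which depends on SE and εR and is not reproduced here. The default of 0.003 for the rate threshold is the typical value quoted in the text.

```python
def smooth_voicing(amplitudes, voiced, error_rate, energy_threshold,
                   rate_threshold=0.003):
    """Return the (possibly perturbed) voiced/unvoiced decisions for one segment.

    amplitudes       -- decoded spectral amplitudes
    voiced           -- decoded decisions, True meaning voiced
    error_rate       -- estimated bit error rate for the segment
    energy_threshold -- placeholder for the bound in equation (10) (assumption)
    """
    # Component 1: the error rate estimate decides whether smoothing is needed.
    if error_rate <= rate_threshold:
        return list(voiced)
    # Component 2: every high-energy amplitude is declared voiced.
    return [v or a > energy_threshold for a, v in zip(amplitudes, voiced)]


amps = [100.0, 5.0, 80.0]
decisions = [False, False, True]
clean_channel = smooth_voicing(amps, decisions, 0.001, 50.0)
noisy_channel = smooth_voicing(amps, decisions, 0.01, 50.0)
```

On the clean channel the decisions pass through unchanged; on the noisy channel only the strong first amplitude is switched to voiced, while the weak second one keeps its decoded decision.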

In our invention, we divide the quantized speech model parameter bits into three or more different groups according to their sensitivity to bit errors, and we then use different error correction or detection codes for each group. Typically, the group of data bits determined to be most sensitive to bit errors is protected using very effective error correction codes. Less effective error correction or detection codes, which require fewer additional bits, are used to protect the less sensitive data bits. This new method allows the amount of error correction or detection given to each group to be matched to its sensitivity to bit errors. Compared to the prior art, this method has the advantage that the degradation caused by bit errors is reduced and the number of bits required for forward error correction is also reduced.

The particular choice of error correction or detection codes depends on the bit error statistics of the transmission or storage medium and on the desired bit rate. The most sensitive group of bits is typically protected with an effective error correction code, such as a Hamming code, a BCH code, a Golay code, or a Reed-Solomon code. Less sensitive groups of data bits may use these codes or an error detection code. Finally, the least sensitive groups may use error correction or detection codes, or they may use no form of error correction or detection. The invention is described herein using a particular choice of error correction and detection codes that was well suited to a 6.4 kbit/s IMBE speech coder for satellite communications.

In the 6.4 kbit/s IMBE speech coder that was standardized for the INMARSAT-M satellite communication system, the 45 bits per frame reserved for forward error correction are divided among [23,12] Golay codes, which can correct up to 3 errors, [15,11] Hamming codes, which can correct single errors, and parity bits. The six most significant bits of the fundamental frequency and the three most significant bits of the mean of the PRBA vector are first combined with three parity check bits and then encoded in a [23,12] Golay code. A second Golay code is used to encode the three most significant bits of the PRBA vector and the nine most sensitive bits of the higher-order DCT coefficients. All remaining bits except the seven least sensitive bits are then encoded in five [15,11] Hamming codes. The seven least sensitive bits are not protected with error correction codes.
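The figures quoted for this coder can be cross-checked with a few lines of arithmetic. All inputs below come from the description (two Golay codes, five Hamming codes, three parity check bits, seven unprotected bits, 6.4 kbit/s); the 83 model-parameter bits and the 20 ms frame duration are derived values, not numbers stated in the text, and the variable names are illustrative.

```python
GOLAY_N, GOLAY_K = 23, 12        # [23,12] Golay code
HAMMING_N, HAMMING_K = 15, 11    # [15,11] Hamming code
NUM_GOLAY, NUM_HAMMING = 2, 5
PARITY_CHECK_BITS = 3            # carried inside the first Golay codeword
UNPROTECTED_BITS = 7
BIT_RATE = 6400                  # bits per second

golay_redundancy = NUM_GOLAY * (GOLAY_N - GOLAY_K)
hamming_redundancy = NUM_HAMMING * (HAMMING_N - HAMMING_K)

# Bits spent on forward error correction and detection per frame.
fec_bits = golay_redundancy + hamming_redundancy + PARITY_CHECK_BITS

# Total bits per frame: all codewords plus the unprotected bits.
frame_bits = NUM_GOLAY * GOLAY_N + NUM_HAMMING * HAMMING_N + UNPROTECTED_BITS

# Derived quantities (not stated explicitly in the text).
model_bits = frame_bits - fec_bits
frame_ms = 1000 * frame_bits / BIT_RATE
```

The redundancy adds up to the 45 bits mentioned in the description, and the codewords plus unprotected bits add up to the 128 bits per speech segment.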

Before transmission, the 128 bits that represent a particular speech segment are interleaved such that at least five bits separate any two bits of the same codeword. This feature spreads the effect of short error bursts over several different codewords, thereby increasing the probability that the errors can be corrected.
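The effect of interleaving can be shown on a toy example. The sketch below is not the INMARSAT-M interleaver, whose layout is not given here; it assumes six equal-length codewords of five bits each and a plain round-robin order, and the function names are hypothetical. It only demonstrates the property described above: a short burst touches each codeword at most once.

```python
def interleave_order(num_codewords, codeword_length):
    """Round-robin order: bit 0 of every codeword, then bit 1 of every codeword, ...

    Each entry is a (codeword index, bit index) pair naming the bit sent at that position.
    """
    return [(c, b) for b in range(codeword_length) for c in range(num_codewords)]


def min_separation(order):
    """Smallest number of bits lying between two bits of the same codeword."""
    last_seen = {}
    best = None
    for pos, (codeword, _) in enumerate(order):
        if codeword in last_seen:
            gap = pos - last_seen[codeword] - 1
            best = gap if best is None else min(best, gap)
        last_seen[codeword] = pos
    return best


order = interleave_order(6, 5)
separation = min_separation(order)

# A burst hitting six consecutive transmitted bits (positions 7..12).
burst_codewords = [codeword for codeword, _ in order[7:13]]
```

Because each codeword in the burst is hit only once, a code that corrects a single error per codeword would still recover every bit in this toy case.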

At the decoder, the received bits are passed through Golay and Hamming decoders, which attempt to remove any bit errors from the data bits. The three parity check bits are checked, and if no uncorrectable bit errors are detected, the received bits are used to reconstruct the MBE model parameters for the current frame. Otherwise, if an uncorrectable bit error is detected, the received bits for the current frame are ignored and the model parameters from the previous frame are repeated for the current frame.

It has been found that the use of frame repeats improves the perceptual quality of the speech when bit errors are present. We therefore examine each frame of received bits and determine whether the current frame is likely to contain a large number of uncorrectable bit errors. One method used to detect uncorrectable bit errors is to check additional parity bits that are inserted into the data. We also determine whether a large burst of bit errors has been encountered by comparing the number of correctable bit errors with the local estimate of the error rate. If the number of correctable bit errors is substantially greater than the local estimate of the error rate, a frame repeat is performed. In addition, we check each frame for invalid bit sequences (i.e. groups of bits that the encoder never transmits). If an invalid bit sequence is detected, a frame repeat is performed.
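The three triggers for a frame repeat described above can be combined as in the following sketch. The text does not quantify "substantially greater", so the comparison used here (corrected errors against a multiple of the number of errors expected at the local error rate, plus an offset) and the parameters `burst_factor` and `burst_offset` are assumptions chosen purely for illustration, as is the function name.

```python
def should_repeat_frame(parity_ok, corrected_errors, local_error_rate,
                        has_invalid_sequence, frame_bits=128,
                        burst_factor=4.0, burst_offset=2.0):
    """Decide whether to discard the current frame and repeat the previous one."""
    # Trigger 1: the extra parity bits reveal an uncorrectable error.
    # Trigger 2: the frame contains a bit pattern the encoder never transmits.
    if not parity_ok or has_invalid_sequence:
        return True
    # Trigger 3: far more corrected errors than the local error rate predicts,
    # which suggests a burst (threshold form is a hypothetical choice).
    expected_errors = local_error_rate * frame_bits
    return corrected_errors > burst_factor * expected_errors + burst_offset


normal_frame = should_repeat_frame(True, 1, 0.01, False)
burst_frame = should_repeat_frame(True, 10, 0.01, False)
parity_failure = should_repeat_frame(False, 0, 0.0, False)
invalid_frame = should_repeat_frame(True, 0, 0.0, True)
```

When the function returns True, a decoder built along these lines would ignore the received bits and reuse the model parameters of the previous frame.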

The Golay and Hamming decoders also provide information about the number of correctable bit errors in the data. This information is used by the decoder to estimate the bit error rate. The estimate of the bit error rate is used to control adaptive smoothers that increase the perceived speech quality in the presence of uncorrectable bit errors. In addition, the estimate of the error rate can be used to perform frame repeats in poor error environments.
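To make concrete how a decoder can report the number of errors it corrected, here is a textbook [15,11] Hamming code in Python. It uses the classic layout with parity bits at positions 1, 2, 4 and 8; the bit ordering used in the actual codec is not given in this text, so the layout and the function names are assumptions.

```python
def hamming15_encode(data):
    """Encode 11 data bits into a 15-bit codeword (positions 1..15)."""
    code = [0] * 16                      # index 0 is unused
    bits = iter(data)
    for pos in range(1, 16):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    for p in (1, 2, 4, 8):               # each parity bit covers positions with bit p set
        parity = 0
        for pos in range(1, 16):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def hamming15_decode(word):
    """Return (11 data bits, number of corrected errors) for a 15-bit word."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 16):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    corrected = 0
    if syndrome:                         # non-zero syndrome names the flipped position
        code[syndrome] ^= 1
        corrected = 1
    data = [code[pos] for pos in range(1, 16) if pos & (pos - 1)]
    return data, corrected


message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
codeword = hamming15_encode(message)
damaged = list(codeword)
damaged[6] ^= 1                          # flip one transmitted bit
recovered, corrected_count = hamming15_decode(damaged)
```

Summing the corrected-error counts over all codewords of a frame gives the kind of information from which a bit error rate estimate can be formed; how that estimate is smoothed over time is not specified here.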

This aspect of the invention can be used with soft-decision coding to further improve performance. Soft-decision decoding uses additional information on the likelihood of each bit being in error to improve the error correction and detection capabilities of many different codes. Since this additional information is often available from a demodulator in a digital communication system, it can provide improved robustness to bit errors without requiring additional bits for error protection.

We use a new method for frequency-domain parameter enhancement that improves the quality of the synthesized speech. We first locate the perceptually important regions of the speech spectrum. We then increase the amplitude of the perceptually important frequency regions relative to other frequency regions. The preferred method for performing frequency-domain parameter enhancement is to smooth the spectral envelope in order to estimate the general shape of the spectrum. The spectrum can be smoothed by fitting a low-order model, such as an all-pole model, a cepstral model, or a polynomial model, to the spectral envelope. The smoothed spectral envelope is then compared with the unsmoothed spectral envelope, and perceptually important spectral regions are identified as regions where the unsmoothed spectral envelope has greater energy than the smoothed spectral envelope. Likewise, regions where the unsmoothed spectral envelope has less energy than the smoothed spectral envelope are identified as perceptually less important. Parameter enhancement is performed by increasing the amplitude of perceptually important frequency regions and decreasing the amplitude of perceptually less important frequency regions. This new enhancement method increases speech quality by eliminating or reducing many of the errors introduced during the estimation and quantization of the speech parameters. In addition, this new method improves speech intelligibility by sharpening the perceptually important speech formants.
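The smooth-compare-reweight procedure can be sketched as follows. This is not the IMBE enhancement: for brevity the smoothed envelope is a least-squares straight line (a first-order polynomial model, one of the low-order models the text allows), and the weighting rule with its `strength` exponent is a hypothetical choice. The function names are likewise illustrative.

```python
def fit_line(values):
    """Least-squares straight line through the values, used as the smoothed envelope."""
    n = len(values)
    if n < 2:
        return list(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return [mean_y + slope * (i - mean_x) for i in range(n)]


def enhance(amplitudes, strength=0.25):
    """Boost amplitudes above the smoothed envelope and attenuate those below it."""
    smoothed = fit_line(amplitudes)
    enhanced = []
    for amplitude, smooth in zip(amplitudes, smoothed):
        if amplitude > 0 and smooth > 0:
            weight = (amplitude / smooth) ** strength   # >1 above, <1 below the envelope
        else:
            weight = 1.0                                # leave degenerate values untouched
        enhanced.append(amplitude * weight)
    return enhanced


original = [2.0, 8.0, 2.0, 8.0]
result = enhance(original)
```

In this toy input the two peaks lie above the fitted line and are raised, while the two valleys lie below it and are lowered, which is the sharpening effect the description attributes to the enhancement.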

In the IMBE speech decoder, a first-order all-pole model is fitted to the spectral envelope for each frame. This is done by estimating the correlation parameters R₀ and R₁ from the decoded model parameters, namely the decoded spectral amplitudes M̂_l for 1 ≤ l ≤ L̂ and the decoded fundamental frequency ω̂₀ for the current frame. The correlation parameters R₀ and R₁ can be used to estimate a first-order all-pole model. This model is evaluated at the frequencies corresponding to the spectral amplitudes for the current frame (i.e. l·ω̂₀ for 1 ≤ l ≤ L̂) and used to generate a set of weights W_l.

These weights give the ratio of the smoothed all-pole spectrum to the IMBE spectral amplitudes. They are then used to individually control the amount of parameter enhancement applied to each spectral amplitude, which yields the enhanced spectral amplitudes for the current frame for 1 ≤ l ≤ L̂.

The enhanced spectral amplitudes are then used to perform speech synthesis. Using the enhanced model parameters improves speech quality relative to synthesis from the unenhanced model parameters.

A further description of a specific embodiment of a speech coding system that uses this invention can be found in the document entitled "INMARSAT M Voice Codec", a copy of which has been placed in the file of this application.

Claims

1. A method of encoding speech, wherein the speech is broken down into segments, each of the segments representing one of a sequence of time intervals and having a spectrum of frequencies, and for each segment the spectrum is sampled at a group of frequencies to form a group of actual spectral amplitudes, the frequencies at which the spectrum is sampled generally differing from one segment to the next, and wherein the spectral amplitudes for at least one previous segment are used to generate a group of predicted spectral amplitudes for a current segment, and wherein a group of prediction deviations for the current segment, based on a difference between the actual spectral amplitudes for the current segment and the predicted spectral amplitudes for the current segment, is used in subsequent encoding, characterized in that the prediction deviations for a segment are grouped into blocks, an average of the prediction deviations within each block is determined, the averages of the blocks are grouped into a prediction residual block average (PRBA) vector, and the PRBA vector is encoded.

2. The method of claim 1, wherein there are a predetermined number of blocks, the number of blocks being independent of the number of prediction deviations grouped into particular blocks.

3. The method of claim 2, wherein the predicted spectral amplitudes for the current segment are based at least in part on interpolating the spectral amplitudes of a previous segment to estimate the spectral amplitudes in the previous segment at the frequencies of the current segment.

4. A method according to any preceding claim, wherein the difference between the actual spectral amplitudes for the current segment and the predicted spectral amplitudes for the current segment is formed by subtracting a fraction of the predicted spectral amplitudes from the actual spectral amplitudes.

5. A method according to any one of the preceding claims, wherein the spectral amplitudes are obtained using a multi-band excitation speech model.

6. A method according to any one of the preceding claims, wherein only spectral amplitudes from the last previous segment are used in forming the predicted spectral amplitudes of the current segment.

7. A method according to any preceding claim, wherein the spectrum comprises a fundamental frequency and the set of frequencies for a given segment are multiples of the fundamental frequency of the segment.

8. The method of claim 2 or 3, wherein the number of prediction deviations in a lower frequency block is not greater than the number of prediction deviations in a higher frequency block.

9. The method according to any one of claims 2, 3 or 8, wherein the number of blocks is equal to six (6).

10. The method of claim 9, wherein the difference between the number of elements in the block with the highest frequency and the number of elements in the block with the lowest frequency is less than or equal to one.

11. A method according to any one of the preceding claims, wherein the average is calculated by adding the prediction deviations within the block and dividing by the number of prediction deviations within that block.

12. The method of claim 11, wherein the mean is obtained by calculating a Discrete Cosine Transform (DCT) of the spectral amplitude prediction deviations within a block and using the first coefficient of the DCT as the mean.

13. A method according to any one of the preceding claims, wherein encoding the PRBA vector comprises vector quantizing the PRBA vector.

14. The method of claim 13, wherein the vector quantization is performed using a method comprising the following steps:

determining a mean of the PRBA vector;

quantizing the mean using scalar quantization;

subtracting the mean from the PRBA vector to form a zero-mean PRBA vector; and

quantizing the zero-mean PRBA vector using vector quantization with a zero-mean codebook.

15. The method of any preceding claim, wherein the PRBA vector is encoded using a linear transformation of the PRBA vector and a scalar quantization of the transform coefficients.

16. The method of claim 15, wherein the linear transformation comprises a discrete cosine transformation.

17. A method according to any one of the preceding claims, wherein the predicted spectral amplitudes for the current segment are based at least in part on the interpolation of the spectral amplitudes of a previous segment to estimate the spectral amplitudes in the previous segment at the frequencies of the current segment.

18. A method according to any one of the preceding claims, wherein the number of blocks is independent of the number of prediction deviations grouped into particular blocks.

19. The method of claim 18, wherein the predicted spectral amplitudes for the current segment are based at least in part on interpolating the spectral amplitudes of a previous segment to estimate the spectral amplitudes in the previous segment at the frequencies of the current segment.

20. A method according to any preceding claim, wherein a bit error rate for a current speech segment is estimated and compared with a predetermined error rate threshold, and the voiced/unvoiced decisions for spectral amplitudes above a predetermined energy threshold are all declared voiced for the current segment if the estimated bit error rate is above the error rate threshold.

21. The method of claim 20, wherein the predetermined energy threshold depends on the estimate of the bit error rate for the current segment.

22. A method according to any one of the preceding claims, wherein the speech is encoded using a speech model characterized by model parameters, wherein the speech is decomposed into time segments and for each segment model parameters are estimated and quantized, and wherein at least some of the quantized model parameters are encoded using error correction coding, at least two types of error correction coding are used to encode the quantized model parameters, and a first type of coding, which adds a larger number of additional bits than a second type of coding, is used for a first group of quantized model parameters that is more sensitive to bit errors than a second group of quantized model parameters.

23. The method of claim 22, wherein the different types of error correction coding comprise Golay codes and Hamming codes.

24. A method according to any one of the preceding claims, wherein the speech is encoded using a speech model characterized by model parameters, wherein the speech is decomposed into time segments and for each segment model parameters are estimated and quantized, wherein at least some of the quantized model parameters are encoded using error correction coding, and wherein the speech is synthesized from the decoded quantized model parameters, the error correction coding is used in the synthesis to estimate the error rate, and one or more model parameters from a previous segment are repeated in a current segment if the error rate for the parameter exceeds a predetermined level.

25. A method according to any one of claims 22 to 24, wherein the quantized model parameters are those associated with the multiband excitation (MBE) speech coder or the improved multiband excitation (IMBE) speech coder.

26. The method of claim 22 or 23, wherein the error rates are estimated using the error correction codes.

27. The method of claim 26, wherein one or more model parameters are smoothed across a plurality of segments based on the estimated error rate.

28. The method of claim 27, wherein the smoothed model parameters comprise voiced/unvoiced decisions.

29. The method of claim 27, wherein the smoothed model parameters comprise parameters for the multiband excitation (MBE) speech coder or the improved multiband excitation (IMBE) speech coder.

30. The method of claim 29, wherein the value of one or more model parameters in a previous segment is repeated in a current segment if the estimated error rate for the parameters exceeds a predetermined level.

31. A method according to any preceding claim for enhancing speech, wherein a speech signal is decomposed into segments and wherein frequency domain representations of a segment are determined to provide a spectral envelope of the segment and the speech is synthesized from an enhanced spectral envelope, a smoothed spectral envelope of the segment is generated by smoothing the spectral envelope, and an enhanced spectral envelope is generated by increasing some frequency ranges of the spectral envelope for which the spectral envelope has a larger amplitude than the smoothed envelope and decreasing some frequency ranges for which the spectral envelope has a smaller amplitude than the smoothed envelope.

32. The method of claim 31, wherein the frequency domain representation of the spectral envelope is the set of spectral amplitude parameters of the multiband excitation (MBE) speech coder or the improved multiband excitation (IMBE) speech coder.

33. The method of claim 26 or 32, wherein the smoothed spectral envelope is generated by estimating a low order model of the spectral envelope.

34. The method of claim 33, wherein the low order model is an all-pole model.
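
By way of illustration only, the envelope enhancement recited in claim 31 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `smooth_envelope` and `enhance_envelope`, the fixed `gain` factor, and the moving-average smoother are all hypothetical choices made for the example. In particular, the moving average stands in for the low order (for example all-pole) model of claims 33 and 34.

```python
def smooth_envelope(amps, half_width=1):
    """Smooth a list of spectral amplitudes with a moving average.

    Assumption: a moving average replaces the low order model of
    claims 33-34 purely to keep the example short.
    """
    n = len(amps)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        smoothed.append(sum(amps[lo:hi]) / (hi - lo))
    return smoothed


def enhance_envelope(amps, gain=1.2, half_width=1):
    """Raise amplitudes above the smoothed envelope, lower those below it.

    `gain` is a hypothetical fixed factor; the claims do not specify
    how much each frequency range is increased or decreased.
    """
    smoothed = smooth_envelope(amps, half_width)
    enhanced = []
    for a, s in zip(amps, smoothed):
        if a > s:
            enhanced.append(a * gain)   # larger than smoothed: increase
        elif a < s:
            enhanced.append(a / gain)   # smaller than smoothed: decrease
        else:
            enhanced.append(a)          # equal: leave unchanged
    return enhanced


if __name__ == "__main__":
    amplitudes = [1.0, 4.0, 1.0, 4.0, 1.0]
    print(enhance_envelope(amplitudes))
```

With these assumptions, peaks of the envelope are emphasized and valleys are deepened, while a flat envelope passes through unchanged because it equals its own smoothed version.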