
CN108416096B - Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence - Google Patents


Info

Publication number
CN108416096B
Authority
CN
China
Prior art keywords
far-field voice data
signal-to-noise ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810102302.8A
Other languages
Chinese (zh)
Other versions
CN108416096A (en)
Inventor
孙建伟
李超
李鑫
朱唯鑫
文铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810102302.8A priority Critical patent/CN108416096B/en
Publication of CN108416096A publication Critical patent/CN108416096A/en
Application granted granted Critical
Publication of CN108416096B publication Critical patent/CN108416096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/10 Noise analysis or noise optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an artificial-intelligence-based method and device for estimating the signal-to-noise ratio of far-field voice data, wherein the method comprises the following steps: performing state binding on the far-field voice data to be processed based on a pre-trained decision tree model; dividing the noise sections and voice sections in the far-field voice data according to the state binding result; and determining the signal-to-noise ratio of the far-field voice data according to the division result. By applying the scheme of the invention, the accuracy of the signal-to-noise ratio estimation result can be improved.

Description

Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence
[ technical field ]
The present invention relates to computer application technology, and in particular to an artificial-intelligence-based method and device for estimating the signal-to-noise ratio of far-field speech data.
[ background of the invention ]
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others.
Acoustic models for far-field speech products such as smart speakers and in-vehicle systems all require a large amount of far-field speech data during training, while real far-field speech data are limited; to meet the requirements of model training, a large amount of simulated far-field speech data needs to be generated.
Simulated far-field speech data can be generated by adding noise to near-field speech data according to the signal-to-noise ratio (SNR) distribution observed in real scenes; accurate signal-to-noise ratio estimation for real far-field speech data is therefore important.
In the prior art, the signal-to-noise ratio of far-field speech data is generally estimated as follows: first, the amplitude energy of the far-field speech data is obtained; then a segmentation threshold on the amplitude energy is determined, the far-field speech data are divided into noise sections and speech sections by this threshold, and finally the signal-to-noise ratio is calculated from the divided noise sections and speech sections. However, the segmentation threshold in this method is difficult to determine accurately, and if the threshold is inaccurate, the resulting signal-to-noise ratio is also inaccurate.
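For illustration only, the following is a minimal sketch of this prior-art, threshold-based approach, assuming the input is a mono waveform held in a numpy array; the frame length, hop size, and the median-energy threshold are placeholder choices, and the threshold is exactly the quantity that is hard to set well in practice.

```python
import numpy as np

def snr_by_energy_threshold(signal, frame_len=400, hop=160):
    """Prior-art style estimate: label frames as noise or speech with a
    global energy threshold, then compare the average energies (sketch)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(np.asarray(f, dtype=np.float64) ** 2)
                         for f in frames])

    # Placeholder segmentation threshold: the median frame energy.
    # Choosing this value well is the weak point of the method.
    threshold = np.median(energies)
    speech = energies[energies > threshold]
    noise = energies[energies <= threshold]

    if len(speech) == 0 or len(noise) == 0:
        return None  # cannot form both a speech section and a noise section
    return 10.0 * np.log10(speech.mean() / noise.mean())
```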
[ summary of the invention ]
In view of this, the present invention provides a far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence.
The specific technical scheme is as follows:
a far-field speech data signal-to-noise ratio estimation method based on artificial intelligence comprises the following steps:
performing state binding on far-field voice data to be processed based on a decision tree model obtained by pre-training;
dividing a noise section and a voice section in the far-field voice data according to a state binding result;
and determining the signal-to-noise ratio of the far-field voice data according to the division result.
According to a preferred embodiment of the present invention, before the state-binding the far-field speech data to be processed based on the pre-trained decision tree model, the method further includes:
and training to obtain the decision tree model by using the acquired near-field voice data.
According to a preferred embodiment of the present invention, the dividing the noise segment and the speech segment in the far-field speech data according to the state binding result includes:
and according to the acquired state id alignment label, dividing a noise section and a voice section in the far-field voice data.
According to a preferred embodiment of the present invention, the determining the signal-to-noise ratio of the far-field speech data according to the division result includes:
respectively acquiring amplitude energy of the noise section and the voice section;
and calculating the signal-to-noise ratio of the far-field voice data according to the acquired amplitude energy.
According to a preferred embodiment of the invention, the method further comprises:
respectively obtaining signal-to-noise ratios of N pieces of far-field voice data, wherein N is a positive integer greater than one;
generating a signal-to-noise ratio statistical histogram according to the signal-to-noise ratios of the N pieces of far-field voice data; the horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, and the vertical axis of the signal-to-noise ratio statistical histogram is the number of far-field voice data respectively corresponding to the different signal-to-noise ratio values;
and determining the signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram.
According to a preferred embodiment of the present invention, the determining the signal-to-noise ratio distribution range of the far-field speech data according to the signal-to-noise ratio statistical histogram includes:
determining a peak value of a longitudinal axis value in the signal-to-noise ratio statistical histogram;
according to a preset mode, determining a reference value according to the peak value, wherein the reference value is smaller than the peak value;
finding out, in the signal-to-noise ratio statistical histogram, two horizontal-axis values which satisfy the following condition: the corresponding longitudinal-axis value is equal to the reference value;
and taking an interval range formed by the values of the two transverse axes as a signal-to-noise ratio distribution range of the far-field voice data.
According to a preferred embodiment of the present invention, said determining a reference value from said peak value in a predetermined manner comprises:
and taking 1/M of the peak value as the reference value, wherein M is a positive integer greater than one.
An artificial intelligence based far-field speech data signal-to-noise ratio estimation device, comprising: a binding unit, a dividing unit and an estimating unit;
the binding unit is used for carrying out state binding on the far-field voice data to be processed based on a decision tree model obtained by pre-training;
the dividing unit is used for dividing a noise section and a voice section in the far-field voice data according to a state binding result;
and the estimation unit is used for determining the signal-to-noise ratio of the far-field voice data according to the division result.
According to a preferred embodiment of the present invention, the apparatus further comprises: a training unit;
and the training unit is used for training to obtain the decision tree model by utilizing the acquired near-field voice data.
According to a preferred embodiment of the present invention, the dividing unit divides the noise section and the speech section in the far-field speech data according to the acquired state id alignment tag.
According to a preferred embodiment of the present invention, the estimation unit obtains amplitude energies of the noise segment and the speech segment, respectively, and calculates a signal-to-noise ratio of the far-field speech data according to the obtained amplitude energies.
According to a preferred embodiment of the present invention, the apparatus further comprises: a counting unit;
the statistical unit is used for respectively acquiring the signal-to-noise ratios of N pieces of far-field voice data, wherein N is a positive integer larger than one, generating a signal-to-noise ratio statistical histogram according to the signal-to-noise ratios of the N pieces of far-field voice data, the horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, the vertical axis of the signal-to-noise ratio statistical histogram is the number of the far-field voice data respectively corresponding to the different signal-to-noise ratio values, and determining the signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram.
According to a preferred embodiment of the present invention, the statistical unit determines a peak value of the vertical-axis values in the signal-to-noise ratio statistical histogram, determines a reference value according to the peak value in a predetermined manner, where the reference value is smaller than the peak value, finds the two horizontal-axis values in the signal-to-noise ratio statistical histogram whose corresponding vertical-axis value is equal to the reference value, and takes the interval range formed by the two found horizontal-axis values as the signal-to-noise ratio distribution range of the far-field voice data.
According to a preferred embodiment of the present invention, the statistical unit uses 1/M of the peak value as the reference value, where M is a positive integer greater than one.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Based on the above description, it can be seen that, by adopting the scheme of the present invention, the state binding can be performed on the far-field voice data to be processed based on the decision tree model obtained by the pre-training, then the noise section and the voice section in the far-field voice data can be divided according to the state binding result, and further the signal-to-noise ratio of the far-field voice data can be determined according to the dividing result.
[ description of the drawings ]
FIG. 1 is a flowchart illustrating an embodiment of a far-field speech data SNR estimation method based on artificial intelligence according to the present invention.
Fig. 2 is a flowchart of an embodiment of a method for acquiring a signal-to-noise ratio distribution range of far-field speech data according to the present invention.
Fig. 3 is a schematic diagram of a signal-to-noise ratio statistical histogram according to the present invention.
Fig. 4 is a schematic structural diagram of a far-field speech data snr estimation apparatus based on artificial intelligence according to an embodiment of the present invention.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart illustrating an embodiment of a far-field speech data SNR estimation method based on artificial intelligence according to the present invention. As shown in fig. 1, the following detailed implementation is included.
In 101, state binding is performed on far-field speech data to be processed based on a decision tree model obtained through pre-training.
At 102, the noise segment and the voice segment in the far-field voice data are divided according to the state binding result.
At 103, a signal-to-noise ratio of the far-field speech data is determined based on the result of the division.
It can be seen that, in this embodiment, a decision tree model is needed, and the decision tree model is obtained by pre-training. Preferably, the decision tree model can be obtained by training using the acquired near-field speech data.
A sufficient amount of near-field voice data may be obtained as training samples, and the state id alignment label of each training sample may be obtained separately. That is, for each frame in any training sample, the corresponding state id (number) is obtained, so as to form a state id sequence. Different state ids may represent different meanings; for example, some state ids indicate that the corresponding frame data is speech data, while others indicate that the corresponding frame data is non-speech data. How the state id alignment label of each training sample is obtained is not limited here and can be determined according to actual needs.
Based on the obtained training sample, a decision tree model can be obtained through training. The process of training the decision tree model is the process of letting the model learn how to map the speech data to the corresponding state id alignment labels.
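The patent does not give training code. As a loose stand-in only, the sketch below treats state binding as frame-level classification: it assumes the near-field training frames have already been converted to feature vectors (13-dimensional MFCC-like features, purely illustrative) and paired with per-frame state ids from a forced alignment, and it fits a scikit-learn decision tree that predicts a state id for each frame. The phonetic decision trees used for state tying in speech recognition toolkits work differently; this simplification only makes the mapping from frames to state id alignment labels concrete.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one feature vector per frame (e.g. MFCCs)
# and one state id per frame from a forced alignment. Shapes and the
# number of states are illustrative only.
near_field_features = np.random.randn(10000, 13)
state_id_alignment = np.random.randint(0, 120, size=10000)

# Fit a decision tree that maps a frame's features to a state id.
decision_tree = DecisionTreeClassifier(max_depth=20, min_samples_leaf=50)
decision_tree.fit(near_field_features, state_id_alignment)

# "State binding" of new (far-field) frames: predict a state id per frame,
# yielding a state id alignment label for the utterance.
far_field_features = np.random.randn(500, 13)
predicted_state_ids = decision_tree.predict(far_field_features)
```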
Near-field voice data can be regarded as clean voice data. Training the decision tree model with near-field voice data makes it possible to better determine the internal structure of the speech and to state-bind the phonemes of the speech more accurately, so that the resulting decision tree model is more accurate.
After the decision tree model is obtained, the signal-to-noise ratio estimation can be carried out on the far-field voice data by combining the decision tree model.
Specifically, state binding may be performed on far-field speech data to be processed based on a decision tree model, and then a noise segment and a speech segment in the far-field speech data may be partitioned according to a state binding result, so that a signal-to-noise ratio of the far-field speech data may be determined according to the partitioning result.
And performing state binding on the far-field voice data based on the decision tree model to obtain a state binding result, namely a state id alignment label, and further dividing a noise section and a voice section in the far-field voice data based on the state id alignment label.
Based on the obtained state id alignment tag, voice data and non-voice data can be distinguished, in far-field voice data, the non-voice data can be regarded as noise data, and accordingly, a noise section and a voice section in the far-field voice data can be respectively obtained.
Then, the amplitude energies of the noise sections and of the voice sections are acquired respectively, and the signal-to-noise ratio of the far-field voice data is calculated from the acquired amplitude energies according to the signal-to-noise ratio calculation formula. How to obtain the amplitude energy of the noise sections and voice sections is prior art.
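As a concrete illustration of the division and signal-to-noise ratio computation just described, the sketch below assumes a per-frame state id alignment is already available, together with a hypothetical set of state ids taken to denote non-speech; frames mapped to those ids are treated as noise, and the signal-to-noise ratio is computed from the average amplitude energies of the two groups as SNR = 10 * log10(E_speech / E_noise).

```python
import numpy as np

def estimate_snr(frames, state_ids, silence_state_ids):
    """frames: (num_frames, frame_len) array of far-field audio frames.
    state_ids: per-frame state id produced by the decision tree model.
    silence_state_ids: state ids assumed to mark non-speech frames."""
    state_ids = np.asarray(state_ids)
    is_noise = np.isin(state_ids, list(silence_state_ids))
    if not is_noise.any() or is_noise.all():
        raise ValueError("need both noise frames and speech frames")

    energies = np.sum(np.asarray(frames, dtype=np.float64) ** 2, axis=1)
    noise_energy = energies[is_noise].mean()
    speech_energy = energies[~is_noise].mean()

    # SNR in dB from the average amplitude energies of the two sections.
    return 10.0 * np.log10(speech_energy / noise_energy)
```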
Thus, the signal-to-noise ratio of the far-field voice data to be processed is obtained. On this basis, the signal-to-noise ratio distribution range of the far-field voice data can further be determined statistically.
Specifically, the signal-to-noise ratios of N far-field voice data can be respectively obtained, where N is a positive integer greater than one, and then a signal-to-noise ratio statistical histogram can be generated according to the signal-to-noise ratios of the N far-field voice data; the horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, the vertical axis of the signal-to-noise ratio statistical histogram is the number of far-field voice data respectively corresponding to the different signal-to-noise ratio values, and then the signal-to-noise ratio distribution range of the far-field voice data can be determined according to the signal-to-noise ratio statistical histogram.
The specific value of N may be determined according to actual needs, for example, 100,000.
With the above description in mind, fig. 2 is a flowchart of an embodiment of a method for acquiring a signal-to-noise ratio distribution range of far-field speech data according to the present invention. As shown in fig. 2, the following detailed implementation is included.
In 201, a decision tree model is obtained by training using the acquired near-field speech data.
In 202, each far-field voice data of the N far-field voice data is processed as shown in 203-205.
At 203, state binding is performed on the far-field speech data based on the decision tree model.
In 204, the noise segment and the speech segment in the far-field speech data are partitioned according to the state binding result.
That is, the noise sections and voice sections in the far-field voice data are divided according to the acquired state id alignment label.
At 205, a signal-to-noise ratio of the far-field speech data is determined based on the result of the division.
That is, the amplitude energies of the divided noise sections and voice sections are acquired respectively, and the signal-to-noise ratio of the far-field voice data is calculated according to the acquired amplitude energies.
At 206, a signal-to-noise statistical histogram is generated based on the signal-to-noise ratios of the N far-field speech data.
Each of the N pieces of far-field voice data is processed in the manner shown in 203-205, that is, each piece of far-field voice data is processed in the manner of the embodiment shown in fig. 1.
After the signal-to-noise ratio of each far-field voice data is obtained, a signal-to-noise ratio statistical histogram can be generated, and specifically, the signal-to-noise ratio statistical histogram can be generated according to the quantization result of each signal-to-noise ratio.
The statistical range of the signal-to-noise ratio generally has no fixed bounds, but in an actual scene the signal-to-noise ratio of human speech generally does not exceed -100 dB to +100 dB. Therefore, in the present invention, dB can be adopted as the unit and scale when defining the signal-to-noise ratio range, and the statistical range can be defined as -100 dB to +100 dB, or further simplified to 0 dB to 100 dB, because human speech in an actual scene is generally not below 0 dB.
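A minimal sketch of the histogram construction described here, assuming the per-utterance signal-to-noise ratios have already been estimated; values are clipped to the 0 to 100 dB statistical range mentioned above and counted in 1 dB bins, with the signal-to-noise ratio value on the horizontal axis and the utterance count on the vertical axis.

```python
import numpy as np

def snr_histogram(snr_values, low_db=0, high_db=100):
    """Count far-field utterances per 1 dB signal-to-noise ratio bin."""
    snrs = np.clip(np.asarray(snr_values, dtype=np.float64), low_db, high_db)
    bin_edges = np.arange(low_db, high_db + 1)   # 1 dB wide bins
    counts, _ = np.histogram(snrs, bins=bin_edges)
    snr_axis = bin_edges[:-1]                    # horizontal-axis values
    return snr_axis, counts                      # counts: vertical-axis values
```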
Fig. 3 is a schematic diagram of a signal-to-noise ratio statistical histogram according to the present invention. As shown in fig. 3, the horizontal axis represents the different signal-to-noise ratio values, which may range from 0 to 100 dB, and the vertical axis represents the number of far-field speech data corresponding to each signal-to-noise ratio value; if the value of N is 100,000, the counts over all the signal-to-noise ratio values sum to 100,000.
At 207, the signal-to-noise ratio distribution range of the far-field speech data is determined according to the signal-to-noise ratio statistical histogram.
As can be seen from fig. 3, the signal-to-noise ratio of the far-field speech data is approximately Gaussian. Then, when determining the signal-to-noise ratio distribution range of the far-field speech data, the following method can be adopted: firstly, the maximum of the longitudinal-axis values corresponding to the different horizontal-axis values, namely the peak value, is determined; then, a reference value can be determined according to the peak value in a preset mode, the reference value being smaller than the peak value; further, the two horizontal-axis values whose corresponding longitudinal-axis value is equal to the reference value can be found out; and finally, the interval range formed by the two found horizontal-axis values is taken as the signal-to-noise ratio distribution range of the far-field speech data.
Preferably, after the peak value of the longitudinal axis value in the signal-to-noise ratio statistical histogram is determined, two transverse axis values corresponding to the longitudinal axis value being 1/M of the peak value can be respectively found out, and then an interval range formed by the two found transverse axis values is used as the signal-to-noise ratio distribution range of the far-field voice data.
M is a positive integer greater than one, and its specific value can be determined according to actual needs, for example, 5. Assuming that the peak value is 3500, the two horizontal-axis values whose corresponding vertical-axis value is 700 need to be found out; denoting them a and b, [a, b] can be used as the signal-to-noise ratio distribution range of the far-field speech data.
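Continuing the sketch, the code below finds the peak of the histogram counts, takes 1/M of it as the reference value (M = 5 matches the example above, so a peak of 3500 gives 700), and returns the interval [a, b] spanned by the leftmost and rightmost horizontal-axis values whose counts reach the reference value; with discrete bins an exact match to the reference value rarely occurs, so reaching it is used here as one reasonable reading of the procedure.

```python
import numpy as np

def snr_distribution_range(snr_axis, counts, m=5):
    """Return (a, b): the SNR interval whose histogram counts reach peak/m."""
    counts = np.asarray(counts)
    peak = counts.max()          # peak of the vertical-axis values
    reference = peak / m         # reference value, smaller than the peak

    above = np.where(counts >= reference)[0]
    a = snr_axis[above[0]]       # leftmost horizontal-axis value reaching it
    b = snr_axis[above[-1]]      # rightmost horizontal-axis value reaching it
    return a, b
```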
In practical application, the signal-to-noise ratio distribution range can be determined statistically for real far-field voice data, and noise can then be added to near-field voice data according to this distribution range, so that simulated far-field voice data that are as consistent as possible with the real far-field voice data can be obtained.
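For completeness, a sketch of the noising step referred to here, under the assumption that a clean near-field recording and a noise recording are available as numpy arrays: a target signal-to-noise ratio is drawn uniformly from the measured range [a, b], and the noise is scaled so that the mixture matches that target.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB)."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.resize(np.asarray(noise, dtype=np.float64), clean.shape)

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def simulate_far_field(clean, noise, snr_range, rng=None):
    """Draw a target SNR from the measured range [a, b] and apply it."""
    rng = np.random.default_rng() if rng is None else rng
    a, b = snr_range
    return add_noise_at_snr(clean, noise, rng.uniform(a, b))
```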
Further, after the simulated far-field voice data are obtained, the signal-to-noise ratio distribution range of the simulated far-field voice data can be determined statistically and compared with that of the real far-field voice data, to verify whether the simulated far-field voice data are consistent with the real far-field voice data.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In summary, with the scheme of each method embodiment, no segmentation threshold is needed; instead, a decision tree model trained on near-field speech data is used to divide the noise sections and speech sections in far-field speech data, and the signal-to-noise ratio of the far-field speech data is then determined from the divided noise and speech sections, thereby improving the accuracy of the signal-to-noise ratio estimation result.
On this basis, the signal-to-noise ratio distribution range of the far-field voice data can be determined statistically. Because the accuracy of the signal-to-noise ratio estimation result is improved, the accuracy of the resulting signal-to-noise ratio distribution range is correspondingly ensured, so that more realistic simulated far-field voice data can be obtained, and the performance of the acoustic model trained with such far-field voice data, for example its robustness and noise resistance, is improved accordingly.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 4 is a schematic structural diagram of a far-field speech data snr estimation apparatus based on artificial intelligence according to an embodiment of the present invention. As shown in fig. 4, includes: a binding unit 401, a dividing unit 402, and an estimating unit 403.
The binding unit 401 is configured to perform state binding on far-field speech data to be processed based on a decision tree model obtained through pre-training.
A dividing unit 402, configured to divide the noise segment and the voice segment in the far-field voice data according to the state binding result.
An estimating unit 403, configured to determine a signal-to-noise ratio of the far-field speech data according to the division result.
It can be seen that, in this embodiment, a decision tree model is needed, and the decision tree model is obtained by pre-training. Correspondingly, the device shown in fig. 4 may further include: a training unit 400.
And the training unit 400 is configured to train to obtain a decision tree model by using the obtained near-field speech data.
Near-field voice data can be regarded as clean voice data. Training the decision tree model with near-field voice data makes it possible to better determine the internal structure of the speech and to state-bind the phonemes of the speech more accurately, so that the resulting decision tree model is more accurate.
After the decision tree model is obtained, the signal-to-noise ratio estimation can be carried out on the far-field voice data by combining the decision tree model.
First, the binding unit 401 may perform state binding on far-field speech data to be processed based on a decision tree model, and then the dividing unit 402 may divide a noise section and a speech section in the far-field speech data according to the obtained state id alignment tag.
And performing state binding on the far-field voice data based on the decision tree model to obtain a state binding result, namely a state id alignment label, and further dividing a noise section and a voice section in the far-field voice data based on the state id alignment label.
Based on the obtained state id alignment tag, voice data and non-voice data can be distinguished, in far-field voice data, the non-voice data can be regarded as noise data, and accordingly, a noise section and a voice section in the far-field voice data can be respectively obtained.
Then, the estimating unit 403 may obtain the amplitude energies of the noise segment and the speech segment, respectively, and calculate the signal-to-noise ratio of the far-field speech data according to the obtained amplitude energies.
Thus, the signal-to-noise ratio of the far-field voice data to be processed is obtained. On this basis, the signal-to-noise ratio distribution range of the far-field voice data can further be determined statistically. Correspondingly, the device shown in fig. 4 may further include: a statistics unit 404.
The statistical unit 404 is configured to obtain signal-to-noise ratios of N far-field voice data, where N is a positive integer greater than one, generate a signal-to-noise ratio statistical histogram according to the signal-to-noise ratios of the N far-field voice data, where a horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, and a vertical axis of the signal-to-noise ratio statistical histogram is the number of far-field voice data corresponding to the different signal-to-noise ratio values, and determine a signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram.
For each of the N pieces of far-field voice data, the signal-to-noise ratio can be obtained in the manner described above, and a signal-to-noise ratio statistical histogram can then be generated from these signal-to-noise ratios. As shown in fig. 3, the signal-to-noise ratio of the far-field speech data is approximately Gaussian. Then, when determining the signal-to-noise ratio distribution range of the far-field speech data, the following method can be adopted: firstly, the maximum of the longitudinal-axis values corresponding to the different horizontal-axis values, namely the peak value, is determined; then, a reference value can be determined according to the peak value in a preset mode, the reference value being smaller than the peak value; further, the two horizontal-axis values whose corresponding longitudinal-axis value is equal to the reference value can be found out; and finally, the interval range formed by the two found horizontal-axis values is taken as the signal-to-noise ratio distribution range of the far-field speech data.
Preferably, after the peak value of the longitudinal axis value in the signal-to-noise ratio statistical histogram is determined, two transverse axis values corresponding to the longitudinal axis value being 1/M of the peak value can be respectively found out, and then an interval range formed by the two found transverse axis values is used as the signal-to-noise ratio distribution range of the far-field voice data. M is a positive integer greater than one, and the specific value can be determined according to actual needs, for example, 5.
For a specific work flow of the apparatus embodiment shown in fig. 4, please refer to the corresponding description in the foregoing method embodiment, which is not repeated.
In summary, with the scheme of the device embodiment, no segmentation threshold is needed; instead, a decision tree model trained on near-field speech data is used to divide the noise sections and speech sections in far-field speech data, and the signal-to-noise ratio of the far-field speech data is then determined from the divided noise and speech sections, thereby improving the accuracy of the signal-to-noise ratio estimation result.
On this basis, the signal-to-noise ratio distribution range of the far-field voice data can be determined statistically. Because the accuracy of the signal-to-noise ratio estimation result is improved, the accuracy of the resulting signal-to-noise ratio distribution range is correspondingly ensured, so that more realistic simulated far-field voice data can be obtained, and the performance of the acoustic model trained with such far-field voice data, for example its robustness and noise resistance, is improved accordingly.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 5 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 5, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing, such as implementing the methods of the embodiments shown in fig. 1 or 2, by executing programs stored in the memory 28.
The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiments of fig. 1 or 2.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (16)

1. A far-field speech data signal-to-noise ratio estimation method based on artificial intelligence is characterized by comprising the following steps:
based on a decision tree model obtained by pre-training, performing state binding on far-field voice data to be processed, wherein the state binding comprises the following steps: mapping the far-field voice data to a corresponding state id alignment tag;
dividing a noise section and a voice section in the far-field voice data according to a state binding result;
and determining the signal-to-noise ratio of the far-field voice data according to the division result.
2. The method of claim 1,
the method further includes, before the state binding of the far-field speech data to be processed based on the decision tree model obtained by pre-training, the steps of:
and training to obtain the decision tree model by using the acquired near-field voice data.
3. The method of claim 1,
the dividing the noise section and the voice section in the far-field voice data according to the state binding result comprises:
and according to the acquired state id alignment label, dividing a noise section and a voice section in the far-field voice data.
4. The method of claim 1,
the determining the signal-to-noise ratio of the far-field voice data according to the division result comprises:
respectively acquiring amplitude energy of the noise section and the voice section;
and calculating the signal-to-noise ratio of the far-field voice data according to the acquired amplitude energy.
5. The method of claim 1,
the method further comprises the following steps:
respectively obtaining signal-to-noise ratios of N pieces of far-field voice data, wherein N is a positive integer greater than one;
generating a signal-to-noise ratio statistical histogram according to the signal-to-noise ratios of the N pieces of far-field voice data; the horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, and the vertical axis of the signal-to-noise ratio statistical histogram is the number of far-field voice data respectively corresponding to the different signal-to-noise ratio values;
and determining the signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram.
6. The method of claim 5,
the determining the signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram comprises the following steps:
determining the maximum value in the longitudinal axis values corresponding to different horizontal axis values, and taking the maximum value as a peak value;
according to a preset mode, determining a reference value according to the peak value, wherein the reference value is smaller than the peak value;
finding out two cross axis values satisfying the following conditions: the corresponding longitudinal axis value is equal to the reference value;
and taking an interval range formed by the values of the two transverse axes as a signal-to-noise ratio distribution range of the far-field voice data.
7. The method of claim 6,
said determining a reference value from said peak value in a predetermined manner comprises:
and taking 1/M of the peak value as the reference value, wherein M is a positive integer greater than one.
8. An artificial intelligence-based far-field speech data signal-to-noise ratio estimation device, comprising: a binding unit, a dividing unit and an estimating unit;
the binding unit is used for carrying out state binding on the far-field voice data to be processed based on a decision tree model obtained by pre-training, and comprises the following steps: mapping the far-field voice data to a corresponding state id alignment tag;
the dividing unit is used for dividing a noise section and a voice section in the far-field voice data according to a state binding result;
and the estimation unit is used for determining the signal-to-noise ratio of the far-field voice data according to the division result.
9. The apparatus of claim 8,
the device further comprises: a training unit;
and the training unit is used for training to obtain the decision tree model by utilizing the acquired near-field voice data.
10. The apparatus of claim 8,
and the dividing unit divides a noise section and a voice section in the far-field voice data according to the acquired state id alignment label.
11. The apparatus of claim 8,
the estimation unit respectively acquires the amplitude energy of the noise section and the voice section, and calculates the signal-to-noise ratio of the far-field voice data according to the acquired amplitude energy.
12. The apparatus of claim 8,
the device further comprises: a counting unit;
the statistical unit is used for respectively acquiring the signal-to-noise ratios of N pieces of far-field voice data, wherein N is a positive integer larger than one, generating a signal-to-noise ratio statistical histogram according to the signal-to-noise ratios of the N pieces of far-field voice data, the horizontal axis of the signal-to-noise ratio statistical histogram is different signal-to-noise ratio values, the vertical axis of the signal-to-noise ratio statistical histogram is the number of the far-field voice data respectively corresponding to the different signal-to-noise ratio values, and determining the signal-to-noise ratio distribution range of the far-field voice data according to the signal-to-noise ratio statistical histogram.
13. The apparatus of claim 12,
the statistical unit determines the maximum value in the longitudinal axis values corresponding to different horizontal axis values, the maximum value is used as a peak value, a reference value is determined according to the peak value in a preset mode, the reference value is smaller than the peak value, and two horizontal axis values meeting the following conditions are found out: and the corresponding longitudinal axis value is equal to the reference value, and an interval range formed by the two found transverse axis values is used as the signal-to-noise ratio distribution range of the far-field voice data.
14. The apparatus of claim 13,
the statistical unit takes 1/M of the peak value as the reference value, wherein M is a positive integer greater than one.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 7.
16. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201810102302.8A 2018-02-01 2018-02-01 Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence Active CN108416096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810102302.8A CN108416096B (en) 2018-02-01 2018-02-01 Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810102302.8A CN108416096B (en) 2018-02-01 2018-02-01 Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN108416096A CN108416096A (en) 2018-08-17
CN108416096B true CN108416096B (en) 2022-02-25

Family

ID=63127612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810102302.8A Active CN108416096B (en) 2018-02-01 2018-02-01 Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN108416096B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281745B (en) * 2008-05-23 2011-08-10 深圳市北科瑞声科技有限公司 Interactive system for vehicle-mounted voice
CN103971680B (en) * 2013-01-24 2018-06-05 华为终端(东莞)有限公司 A kind of method, apparatus of speech recognition
KR102209689B1 (en) * 2015-09-10 2021-01-28 삼성전자주식회사 Apparatus and method for generating an acoustic model, Apparatus and method for speech recognition
CN107316649B (en) * 2017-05-15 2020-11-20 百度在线网络技术(北京)有限公司 Speech recognition method and device based on artificial intelligence

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009036810A (en) * 2007-07-31 2009-02-19 National Institute Of Information & Communication Technology Near-field sound source separation program, computer-readable recording medium with the program recorded and near-field sound source separation method
CN104952450A (en) * 2015-05-15 2015-09-30 百度在线网络技术(北京)有限公司 Far field identification processing method and device
CN105976827A (en) * 2016-05-26 2016-09-28 南京邮电大学 Integrated-learning-based indoor sound source positioning method
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN107464564A (en) * 2017-08-21 2017-12-12 腾讯科技(深圳)有限公司 voice interactive method, device and equipment
CN107452372A (en) * 2017-09-22 2017-12-08 百度在线网络技术(北京)有限公司 The training method and device of far field speech recognition modeling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
俞大海; 周均扬. Deep HMI: applications of multimodal deep learning technology in smart homes. Proceedings of the 2017 China Household Appliance Technology Conference, 2017. *
刘文举, 聂帅, et al. Research status and progress of speech separation technology based on deep learning. Acta Automatica Sinica, vol. 42, no. 6, 30 June 2016, pp. 819-833. *
张宇; 张鹏远; 颜永红. Far-field speech recognition based on attention LSTM and multi-task learning. Proceedings of the 14th National Conference on Man-Machine Speech Communication (NCMMSC'2017), 2017. *

Also Published As

Publication number Publication date
CN108416096A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107103903B (en) Acoustic model training method and device based on artificial intelligence and storage medium
CN110379416B (en) Neural network language model training method, device, equipment and storage medium
CN107481717B (en) Acoustic model training method and system
CN107680586B (en) Far-field speech acoustic model training method and system
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN107610709B (en) Method and system for training voiceprint recognition model
CN106960219B (en) Picture identification method and device, computer equipment and computer readable medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN110033760B (en) Modeling method, device and equipment for speech recognition
US10867618B2 (en) Speech noise reduction method and device based on artificial intelligence and computer device
CN107481731B (en) Voice data enhancement method and system
CN108573694B (en) Artificial intelligence based corpus expansion and speech synthesis system construction method and device
CN109599095B (en) Method, device and equipment for marking voice data and computer storage medium
CN103229233B (en) For identifying the modelling apparatus of speaker and method and Speaker Recognition System
CN109637525B (en) Method and apparatus for generating an on-board acoustic model
CN108564944B (en) Intelligent control method, system, equipment and storage medium
CN110704597B (en) Dialogue system reliability verification method, model generation method and device
CN112365876A (en) Method, device and equipment for training speech synthesis model and storage medium
US10991363B2 (en) Priors adaptation for conservative training of acoustic model
CN113658586B (en) Training method of voice recognition model, voice interaction method and device
CN108460335B (en) Video fine-granularity identification method and device, computer equipment and storage medium
US10650803B2 (en) Mapping between speech signal and transcript
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN113782029A (en) Training method, device and equipment of speech recognition model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant