
CN111754982B - Noise elimination method and device for voice call, electronic equipment and storage medium - Google Patents

Noise elimination method and device for voice call, electronic equipment and storage medium

Info

Publication number
CN111754982B
Authority
CN
China
Prior art keywords
voice
speaker
category
detected
scoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010570483.4A
Other languages
Chinese (zh)
Other versions
CN111754982A (en)
Inventor
孙岩丹
王瑞璋
马骏
王少军
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010570483.4A
Publication of CN111754982A
Priority to PCT/CN2020/121571
Application granted
Publication of CN111754982B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/04: Speech recognition; segmentation; word boundary detection
    • G10L 15/05: Word boundary detection
    • G10L 17/02: Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Noise filtering with processing in the frequency domain
    • G10L 25/24: Speech analysis with the extracted parameters being the cepstrum
    • G10L 25/27: Speech analysis characterised by the analysis technique
    • G10L 25/30: Analysis technique using neural networks
    • G10L 25/45: Speech analysis characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention relates to voiceprint recognition technology and discloses a noise elimination method for voice calls, which comprises the following steps: performing voice endpoint detection on call audio to obtain a voice set; extracting voice features from the voice set to obtain a voice feature set; intercepting, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set, and scoring the clusters; and dividing the voice set into a first speaker voice and a second speaker voice according to the scores, distinguishing the background voice from the first speaker voice and the second speaker voice, and deleting the background voice from the voice set. The invention also relates to blockchain technology: the call audio can be stored in a blockchain. The invention can delete the background voice in a voice call, thereby improving voice call quality.

Description

Noise elimination method and device for voice call, electronic equipment and storage medium
Technical Field
The present invention relates to the field of voiceprint recognition, and in particular, to a method and apparatus for noise cancellation in a voice call, an electronic device, and a computer readable storage medium.
Background
Customer service systems, particularly intelligent outbound systems, often face background noise interference from the customer's environment. Among all such noises, background voice is the most disruptive: the automatic speech recognition of an intelligent outbound system may recognize the background voice and treat it as the target of the conversation, greatly reducing the success rate of the whole dialogue.
However, current noise cancellation technology mainly eliminates non-voice background noise and has a poor cancellation effect on background voice, resulting in poor voice call quality.
Disclosure of Invention
The invention provides a noise elimination method and apparatus for voice calls, an electronic device, and a computer-readable storage medium, with the main aim of deleting the background voice in a voice call and improving the success rate of a dialogue system.
In order to achieve the above object, the present invention provides a method for eliminating noise in a voice call, including:
detecting voice endpoints of the call audio to obtain a voice set;
extracting voice features from the voice set to obtain a voice feature set;
intercepting, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set, and scoring the obtained clustering results by using a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set;
dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value;
and calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the voice set according to those durations, and deleting the background voice from the voice set.
Optionally, the extracting voice features from the voice set to obtain a voice feature set includes:
pre-emphasis, framing and windowing are carried out on the voice set to obtain a voice frame sequence;
Obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
converting the spectrum into a mel spectrum by a mel filter bank;
and carrying out cepstrum analysis on the Mel frequency spectrum to obtain the voice feature set corresponding to the voice set.
Optionally, the clustering processing is performed on each to-be-detected voice feature set, including:
step a, randomly selecting two feature vectors from the voice feature set to be detected as a class center;
Step b, clustering each feature vector in the voice feature set to be detected with the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes;
step c, updating the category centers of the two initial categories;
and d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
Optionally, the dividing the voice set into the first speaker voice and the second speaker voice according to the scoring value includes:
Selecting one of the voice feature sets to be detected, and obtaining a corresponding scoring value;
Comparing the scoring value with a preset scoring threshold value;
When the scoring value is larger than a preset scoring threshold value, combining the two standard categories of the selected voice feature set to be detected into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center;
when the scoring value is smaller than or equal to a preset scoring threshold value, generating a first speaker sound and a second speaker sound according to the two standard categories;
selecting the next to-be-detected voice feature set, obtaining its corresponding scoring value, and classifying the two standard categories in that voice feature set into the first speaker voice or the second speaker voice according to the scoring value.
Optionally, the classifying the two standard categories in the to-be-detected voice feature set into the first speaker sound or the second speaker sound according to the scoring value includes:
If the scoring value is greater than the scoring threshold, merging two standard categories of the to-be-detected voice feature set into a single voice category, calculating a category center of the single voice category, and classifying the single voice category into the first speaker voice or the second speaker voice according to a cosine distance between the category center of the single voice category and the category centers of the first speaker voice and the second speaker voice;
And if the scoring value is smaller than or equal to a scoring threshold value, classifying the two standard categories into the first speaker sound and the second speaker sound respectively according to cosine distances between the category centers of the two standard categories in the to-be-detected voice feature set and the category centers of the first speaker sound and the second speaker sound.
Optionally, the categorizing includes:
combining the single voice category with the first speaker voice or the second speaker voice, recalculating the combined category center, and accumulating the frame number of the single voice category into the duration of the first speaker voice or the second speaker voice; or
combining the two standard categories with the first speaker voice and the second speaker voice respectively, recalculating the combined category centers, and accumulating the frame numbers of the standard categories into the durations of the first speaker voice and the second speaker voice.
Optionally, the deleting the background voice from the voice set includes:
calculating the duration proportion of the background voice in the call by using a preset duration algorithm;
comparing the duration proportion with a preset proportion threshold value;
And when the duration proportion is greater than the proportion threshold, deleting the background voice from the voice set, and removing the background voice in the call audio.
In order to solve the above problem, the present invention further provides a noise cancellation device for a voice call, the device comprising:
the voice endpoint detection module is used for detecting voice endpoints of call audio to obtain a voice set;
The voice feature extraction module is used for extracting voice features of the voice set to obtain a voice feature set;
The clustering scoring module is used for intercepting, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set, and scoring the obtained clustering results by using a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set;
the voice classification module is used for dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value;
And the background voice removing module is used for calculating the duration of the first speaker voice and the second speaker voice, judging the background voice in the voice set according to the duration of the first speaker voice and the second speaker voice, and deleting the background voice from the voice set.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one instruction; and
And a processor executing the instructions stored in the memory to implement the method for noise cancellation for a voice call as described in any one of the above.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium including a storage data area storing created data and a storage program area storing a computer program which when executed by a processor implements the noise canceling method of a voice call as described in any one of the above.
According to the embodiments of the invention, voice endpoint detection is performed on the call audio and non-human-voice noise is deleted from it, reducing the computer's subsequent processing load; voice features are extracted from the voice set to obtain a voice feature set, facilitating the separation of background voice in the call audio; to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold are intercepted from the voice feature set in time order, each to-be-detected voice feature set is clustered, and the obtained clustering results are scored with a preset evaluation algorithm to obtain a scoring value for each set, so that fragmented, fuzzy, and low-volume background voice can be detected through this combination of clustering and scoring; the voice set is divided into a first speaker voice and a second speaker voice according to the scoring values, and the audio characteristics of the speakers and the background voice can be stored and dynamically updated in real time; and the durations of the first speaker voice and the second speaker voice are calculated, the background voice in the voice set is determined from those durations, and the background voice is deleted from the voice set, improving voice call quality. The noise elimination method and apparatus for voice calls and the computer-readable storage medium proposed by the invention can therefore delete the background voice in a voice call and improve the success rate of a dialogue system.
Drawings
FIG. 1 is a schematic flowchart of a noise elimination method for voice calls according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a voice feature extraction method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a human voice separation method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a noise elimination device for voice calls according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing a noise elimination method for voice calls according to an embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The execution subject of the noise elimination method for voice calls provided by the embodiments of the application includes at least one electronic device that can be configured to execute the method provided by the embodiments of the application, such as a server or a terminal. In other words, the noise elimination method for voice calls may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server side includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
The invention provides a noise elimination method for voice call. Referring to fig. 1, a flowchart of a method for noise cancellation in a voice call according to an embodiment of the invention is shown.
In this embodiment, the method for eliminating noise in a voice call includes:
s1, voice endpoint detection is carried out on call audio to obtain a voice set.
In detail, the call audio in the embodiment of the present invention includes audio generated by a conversation held in a crowd or in an environment with many voices, such as call audio generated when a call is made through a communication system such as a telephone or instant messaging software in an environment full of background voice. The call audio may be retrieved directly from the communication system or called from a database that stores voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, the call audio may also be stored in a blockchain node.
Voice endpoint detection distinguishes voice data from non-voice data (silence and environmental noise) in call audio under noisy or otherwise interfered conditions and determines the start point and end point of the voice data, so as to delete non-voice noise from the call audio, reduce the computer's subsequent processing load, improve efficiency, and provide necessary support for subsequent signal processing.
In a preferred embodiment of the present invention, the voice endpoint detection model may be a voice activity detection (voice activity detection, VAD) model based on a deep neural network (deep neural network, DNN).
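By way of illustration, the following is a minimal Python sketch of this endpoint detection step. It uses the open-source webrtcvad package as a stand-in for the DNN-based VAD model described above; the function name, frame length, and audio format are assumptions for the example, not part of the embodiment.

```python
# Minimal sketch of voice endpoint detection (S1), assuming 16 kHz, 16-bit,
# mono PCM call audio; webrtcvad stands in for the DNN-based VAD model.
import webrtcvad

def detect_voice_segments(pcm: bytes, sample_rate: int = 16000,
                          frame_ms: int = 30):
    """Return (start_sec, end_sec) spans judged to contain human voice."""
    vad = webrtcvad.Vad(2)                            # aggressiveness 0-3
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # webrtcvad needs 10/20/30 ms
    segments, start = [], None
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = offset / 2 / sample_rate                  # frame start in seconds
        if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
            if start is None:
                start = t                             # voice span begins
        elif start is not None:
            segments.append((start, t))               # voice span ends
            start = None
    if start is not None:
        segments.append((start, len(pcm) / 2 / sample_rate))
    return segments
```

The spans returned by such a detector delimit the voice set; everything outside them is discarded as non-voice noise.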
S2, extracting voice characteristics of the voice set to obtain a voice characteristic set.
In detail, referring to FIG. 2, S2 includes:
S21, pre-emphasis, framing and windowing are carried out on the voice set to obtain a voice frame sequence;
Here, the pre-emphasis uses a high-pass filter to boost the high-frequency part of the voice signals in the voice set, flattening the frequency spectrum of the signal; the framing applies a finite-length sliding window to weight the signal and divide it into short segments, so that each segment can be treated as stationary; and the windowing makes the aperiodic speech signal exhibit some features of a periodic function, facilitating the subsequent Fourier expansion.
S22, obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
Preferably, since the characteristics of a speech signal are generally difficult to observe from its time-domain waveform, the speech signal is usually transformed into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices.
S23, converting the frequency spectrum into a Mel frequency spectrum through a Mel filter bank;
The Mel (Mel) filter bank is a triangular filter bank with a Mel scale, and the spectrum can be converted into a Mel spectrum through the Mel filter bank, and the Mel frequency can accurately reflect the auditory characteristics of human ears.
S24, performing cepstrum analysis on the Mel frequency spectrum to obtain a voice feature set corresponding to the voice set.
Further, the cepstrum analysis includes taking a logarithm and applying a discrete cosine transform, and outputting feature vectors. The voice feature set comprises the feature vectors corresponding to the voice frame sequence output after the cepstrum analysis.
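The following Python sketch illustrates S21 to S24, assuming 16 kHz mono input. The explicit pre-emphasis line corresponds to S21, while librosa.feature.mfcc internally performs the framing, windowing, FFT, mel filtering, and log-DCT cepstral analysis of S22 to S24; the coefficient count and frame parameters are assumed example values.

```python
# Sketch of voice feature extraction (S21-S24): pre-emphasis, framing and
# windowing, FFT, mel filter bank, then log + DCT (cepstral analysis).
import numpy as np
import librosa

def extract_voice_features(y: np.ndarray, sr: int = 16000,
                           n_mfcc: int = 13) -> np.ndarray:
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])     # S21: pre-emphasis filter
    # S22-S24: windowed FFT, mel filter bank, log + discrete cosine transform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)
    return mfcc.T                                  # shape: (frames, n_mfcc)
```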
S3, intercepting, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, carrying out clustering processing on each to-be-detected voice feature set, and scoring the obtained clustering results by using a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set.
In the embodiment of the invention, when the accumulated time length of the voice feature set reaches the preset time length threshold value, one detection calculation is performed, and the voice feature set obtained by the accumulation is called a voice feature set to be detected.
In detail, the clustering processing for each to-be-detected voice feature set includes:
step a, randomly selecting two feature vectors from the voice feature set to be detected as a class center;
And b, for each feature vector in the voice feature set to be detected, clustering the feature vector with the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes.
In detail, the embodiment of the invention calculates the distance between the feature vector and each category center by using a distance algorithm such as the Euclidean distance:
L(X, Yi) = ||X - Yi||
wherein L(X, Yi) is the distance value, X is a category center, and Yi is a feature vector in the to-be-detected voice feature set.
Step c, updating the category centers of the two initial categories;
preferably, the embodiment of the present invention calculates the average value of all feature vectors in each initial category, and updates the average value to the category center of the category.
And d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
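A minimal sketch of steps a to d follows: a two-class k-means with a fixed iteration budget in place of a convergence test, as in step d. The Euclidean distance and the iteration count are assumptions for the example.

```python
# Sketch of steps a-d: cluster one to-be-detected voice feature set into two
# standard categories, stopping after a preset number of iterations (step d).
import numpy as np

def cluster_two_categories(features, n_iter: int = 10, seed: int = 0):
    """features: (n_frames, dim) array. Returns (labels, category_centers)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # step a: randomly pick two feature vectors as the initial category centers
    centers = features[rng.choice(len(features), size=2, replace=False)]
    for _ in range(n_iter):
        # step b: assign every feature vector to its nearest category center
        dists = np.linalg.norm(features[:, None, :] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # step c: update each category center to the mean of its members
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers
```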
Further, the method scores the two obtained standard categories by using a preset evaluation algorithm to obtain the scoring value of the standard categories. Preferably, in an embodiment of the present invention, the evaluation algorithm takes the form of a log-likelihood ratio:
Score(n1, n2) = log P(n1, n2|Hs) - log( P(n1|Hd) * P(n2|Hd) )
wherein n1 and n2 are the category centers of the two standard categories; Hs is the hypothesis that the two standard categories belong to the same class, and Hd is the hypothesis that they belong to different classes; P(n1, n2|Hs) is the likelihood that n1 and n2 come from the same space; and P(n1|Hd), P(n2|Hd) are the likelihoods that n1 and n2 come from different spaces. The likelihood function is a function of the statistical model parameters, used to test whether a hypothesis holds.
Preferably, the higher the scoring value, the greater the likelihood that the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely that the voices corresponding to the two standard categories belong to the same speaker.
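Since the embodiment fixes only the general likelihood-ratio form of the evaluation algorithm and leaves the densities as a modelling choice, the sketch below is a toy instantiation under stated assumptions: both hypotheses share one Gaussian covariance, Hs pools the two centers around their midpoint, and Hd models each center with a global background density. Production speaker systems typically use PLDA scoring instead.

```python
# Toy instantiation of the evaluation algorithm
#   Score(n1, n2) = log P(n1, n2|Hs) - log( P(n1|Hd) * P(n2|Hd) )
# under assumed Gaussian densities (not specified by the embodiment).
import numpy as np
from scipy.stats import multivariate_normal

def score_standard_categories(n1, n2, bg_mean=None, bg_cov=None) -> float:
    n1, n2 = np.asarray(n1, float), np.asarray(n2, float)
    dim = n1.shape[0]
    bg_mean = np.zeros(dim) if bg_mean is None else bg_mean
    bg_cov = np.eye(dim) if bg_cov is None else bg_cov
    same = multivariate_normal(mean=(n1 + n2) / 2, cov=bg_cov)  # Hs density
    diff = multivariate_normal(mean=bg_mean, cov=bg_cov)        # Hd density
    # high when n1 and n2 sit close together relative to the background spread
    return (same.logpdf(n1) + same.logpdf(n2)
            - diff.logpdf(n1) - diff.logpdf(n2))
```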
S4, dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value.
In detail, referring to FIG. 3, S4 includes:
S40, selecting one of the voice feature sets to be detected, and obtaining a corresponding scoring value;
S41, comparing the scoring value with a preset scoring threshold;
When the scoring value is greater than a preset scoring threshold, executing S42, combining the two standard categories of the selected voice feature set to be detected into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center;
Wherein the first speaker's voice includes a voice feature including the single voice category and a category center and a duration including a number of frames of the single voice category.
When the scoring value is less than or equal to a preset scoring threshold, executing S43, and generating a first speaker sound and a second speaker sound according to the two standard categories;
Similarly, the first speaker's voice and the second speaker's voice include a voice feature including the standard class and a class center and a duration including the number of frames of the standard class.
S44, selecting the next to-be-detected voice feature set, obtaining its corresponding scoring value, and classifying the two standard categories in that voice feature set into the first speaker voice or the second speaker voice according to the scoring value;
S45, judging whether every to-be-detected voice feature set has been selected, and repeating S44 until all of them have been selected, to obtain the first speaker voice and the second speaker voice.
In detail, the classifying the two standard classes in the to-be-detected voice feature set into the first speaker's voice or the second speaker's voice includes:
In one embodiment of the present invention, if the score value of the to-be-detected voice feature set is greater than the score threshold, merging two standard classes of the to-be-detected voice feature set into a single voice class, calculating a class center of the single voice class, calculating a cosine distance between the class center of the single voice class and the class centers of the first speaker and the second speaker, and classifying the single voice class into the first speaker or the second speaker according to the cosine distance.
The single voice category is classified into the first speaker if the cosine distance between its category center and that of the first speaker is smaller, and into the second speaker if the cosine distance to the category center of the second speaker is smaller.
The categorizing comprises: combining the single voice category with the first speaker voice or the second speaker voice, and recalculating the combined category center; and accumulating the frame number of the single voice category and the duration of the first speaker voice or the second speaker voice.
In another embodiment of the present invention, if the score value is less than or equal to the score threshold, the two standard classes are respectively classified into the first speaker and the second speaker according to the cosine distance by calculating the cosine distance between the class center of each standard class and the class centers of the first speaker and the second speaker in the to-be-detected speech feature set.
If the cosine distance between the category center of standard category A and the category center of the first speaker voice is smaller, and the cosine distance between the category center of standard category B and the category center of the second speaker voice is smaller, standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Similarly, the categorizing includes: combining standard category A and standard category B with the first speaker voice and the second speaker voice respectively, and recalculating the combined category centers; and accumulating the frame numbers of standard category A and standard category B into the durations of the first speaker voice and the second speaker voice.
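The categorizing step can be illustrated with the following sketch: a category (the merged single voice category, or one standard category) is attached to whichever running speaker has the smaller cosine distance to its category center, after which that speaker's center is recalculated and its duration accumulated. The dict-based bookkeeping is an assumed structure for the example, not taken from the embodiment.

```python
# Sketch of categorizing: fold a category into the nearer speaker by cosine
# distance, then recalculate that speaker's center and accumulate duration.
import numpy as np

def cosine_distance(a, b) -> float:
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def categorize(center, n_frames: int, speakers: list) -> None:
    """speakers: two dicts with a running 'center' vector and 'frames' count."""
    nearest = min(speakers, key=lambda s: cosine_distance(center, s["center"]))
    total = nearest["frames"] + n_frames
    # recalculate the combined category center, weighted by frame counts
    nearest["center"] = (nearest["center"] * nearest["frames"]
                         + center * n_frames) / total
    nearest["frames"] = total  # the duration accumulates as a frame count
```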
S5, calculating the duration of the first speaker voice and the duration of the second speaker voice, judging the background voice in the voice set according to the duration of the first speaker voice and the duration of the second speaker voice, and deleting the background voice from the voice set.
Preferably, in call audio, the target speaker generally speaks for longer than the background speaker; therefore, in the embodiment of the present invention, whichever of the first speaker voice and the second speaker voice has the longer duration is taken as the target speaker voice, and the other is taken as the background voice.
In detail, the deleting the background voice from the voice set includes:
calculating the duration proportion of the background voice in the call by using a preset duration algorithm;
comparing the duration proportion with a preset proportion threshold value;
And when the duration proportion is greater than the proportion threshold, deleting the background voice from the voice set, and removing the background voice in the call audio.
Wherein, the duration algorithm is as follows:
R=t/T
wherein R is the duration proportion of the background voice in the call, t is the duration of the background speaker voice, and T is the total call duration, i.e., the sum of the durations of the target speaker voice and the background speaker voice.
Preferably, when the duration proportion is smaller than the proportion threshold, the current call suffers little background voice interference and the call audio does not need to be processed; when the duration proportion is larger than the proportion threshold, the call suffers serious background voice interference, and deleting the background voice from the voice set can reduce misrecognition caused by the background voice and improve voice call quality.
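The decision itself reduces to comparing R = t/T with the preset proportion threshold, as in the following sketch; the 0.2 default is an assumed example value, since the embodiment leaves the proportion threshold to configuration.

```python
# Sketch of the background-voice decision: compare R = t/T with a preset
# proportion threshold (the 0.2 default is an assumed example value).
def should_delete_background(t_background: float, t_target: float,
                             ratio_threshold: float = 0.2) -> bool:
    total = t_background + t_target      # T: total call duration
    return total > 0 and (t_background / total) > ratio_threshold
```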
According to the embodiments of the invention, voice endpoint detection is performed on the call audio and non-human-voice noise is deleted from it, reducing the computer's subsequent processing load; voice features are extracted from the voice set to obtain a voice feature set, facilitating the separation of background voice in the call audio; to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold are intercepted from the voice feature set in time order, each to-be-detected voice feature set is clustered, and the obtained clustering results are scored with a preset evaluation algorithm to obtain a scoring value for each set, so that fragmented, fuzzy, and low-volume background voice can be detected through this combination of clustering and scoring; the voice set is divided into a first speaker voice and a second speaker voice according to the scoring values, and the audio characteristics of the speakers and the background voice can be stored and dynamically updated in real time; and the durations of the first speaker voice and the second speaker voice are calculated, the background voice in the voice set is determined from those durations, and the background voice is deleted from the voice set, improving voice call quality. The noise elimination method and apparatus for voice calls and the computer-readable storage medium proposed by the invention can therefore delete the background voice in a voice call and improve the success rate of a dialogue system.
As shown in fig. 4, a functional block diagram of the noise cancellation device for voice call according to the present invention is shown.
The noise canceling device 100 for voice call according to the present invention may be installed in an electronic apparatus. The noise elimination device for voice call may include a voice endpoint detection module 101, a voice feature extraction module 102, a cluster scoring module 103, a voice classification module 104, and a background voice removal module 105, depending on the functions implemented. The module of the present invention may also be referred to as a unit, meaning a series of computer program segments capable of being executed by the processor of the electronic device and of performing fixed functions, stored in the memory of the electronic device.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the voice endpoint detection module 101 is configured to perform voice endpoint detection on call audio to obtain a voice speech set.
In detail, the call audio in the embodiment of the present invention includes audio generated by a conversation held in a crowd or in an environment with many voices, such as call audio generated when a call is made through a communication system such as a telephone or instant messaging software in an environment full of background voice. The call audio may be retrieved directly from the communication system or called from a database that stores voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, the call audio may also be stored in a blockchain node.
Voice endpoint detection distinguishes voice data from non-voice data (silence and environmental noise) in call audio under noisy or otherwise interfered conditions and determines the start point and end point of the voice data, so as to delete non-voice noise from the call audio, reduce the computer's subsequent processing load, improve efficiency, and provide necessary support for subsequent signal processing.
In a preferred embodiment of the present invention, the voice endpoint detection model may be a voice activity detection (voice activity detection, VAD) model based on a deep neural network (deep neural network, DNN).
The voice feature extraction module 102 is configured to perform voice feature extraction on the voice set to obtain a voice feature set.
In detail, the speech feature extraction module 102 specifically performs:
pre-emphasis, framing and windowing are carried out on the voice set to obtain a voice frame sequence;
Obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
converting the spectrum into a mel spectrum by a mel filter bank;
and carrying out cepstrum analysis on the Mel frequency spectrum to obtain the voice feature set corresponding to the voice set. Here, the pre-emphasis uses a high-pass filter to boost the high-frequency part of the voice signals in the voice set, flattening the frequency spectrum of the signal; the framing applies a finite-length sliding window to weight the signal and divide it into short segments, so that each segment can be treated as stationary; and the windowing makes the aperiodic speech signal exhibit some features of a periodic function, facilitating the subsequent Fourier expansion.
Preferably, since the characteristics of a speech signal are generally difficult to observe from its time-domain waveform, the speech signal is usually transformed into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices.
The Mel (Mel) filter bank is a triangular filter bank with a Mel scale, and the spectrum can be converted into a Mel spectrum through the Mel filter bank, and the Mel frequency can accurately reflect the auditory characteristics of human ears.
Further, the cepstrum analysis includes taking a logarithm and applying a discrete cosine transform, and outputting feature vectors. The voice feature set comprises the feature vectors corresponding to the voice frame sequence output after the cepstrum analysis.
The clustering scoring module 103 is configured to intercept, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, perform clustering processing on each to-be-detected voice feature set, and score the obtained clustering results by using a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set.
In the embodiment of the invention, when the accumulated time length of the voice feature set reaches the preset time length threshold value, one detection calculation is performed, and the voice feature set obtained by the accumulation is called a voice feature set to be detected.
In detail, the clustering processing for each to-be-detected voice feature set includes:
step a, randomly selecting two feature vectors from the voice feature set to be detected as a class center;
And b, for each feature vector in the voice feature set to be detected, clustering the feature vector with the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes.
In detail, the embodiment of the invention calculates the distance between the feature vector and each category center by using a distance algorithm such as the Euclidean distance:
L(X, Yi) = ||X - Yi||
wherein L(X, Yi) is the distance value, X is a category center, and Yi is a feature vector in the to-be-detected voice feature set.
Step c, updating the category centers of the two initial categories;
preferably, the embodiment of the present invention calculates the average value of all feature vectors in each initial category, and updates the average value to the category center of the category.
And d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
Further, the method scores the two obtained standard categories by using a preset evaluation algorithm to obtain the scoring value of the standard categories. Preferably, in an embodiment of the present invention, the evaluation algorithm takes the form of a log-likelihood ratio:
Score(n1, n2) = log P(n1, n2|Hs) - log( P(n1|Hd) * P(n2|Hd) )
wherein n1 and n2 are the category centers of the two standard categories; Hs is the hypothesis that the two standard categories belong to the same class, and Hd is the hypothesis that they belong to different classes; P(n1, n2|Hs) is the likelihood that n1 and n2 come from the same space; and P(n1|Hd), P(n2|Hd) are the likelihoods that n1 and n2 come from different spaces. The likelihood function is a function of the statistical model parameters, used to test whether a hypothesis holds.
Preferably, the higher the scoring value, the greater the likelihood that the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely that the voices corresponding to the two standard categories belong to the same speaker.
The voice classification module 104 is configured to divide the voice set into a first speaker voice and a second speaker voice according to the scoring value.
In detail, the voice classification module 104 is specifically configured to:
Selecting one of the voice feature sets to be detected, and obtaining a corresponding scoring value;
Comparing the scoring value with a preset scoring threshold value;
when the scoring value is smaller than or equal to a preset scoring threshold value, generating a first speaker sound and a second speaker sound according to the two standard categories;
selecting the next voice feature set to be detected, obtaining a corresponding scoring value, and classifying two standard categories in the voice feature set to be detected into the first speaker voice or the second speaker voice according to the scoring value;
and judging whether every to-be-detected voice feature set has been selected, repeating the selection until all of them have been selected, to obtain the first speaker voice and the second speaker voice.
When the scoring value is greater than a preset scoring threshold, the voice classification module 104 merges the two standard categories of the selected voice feature set to be detected into a single voice category, calculates a category center of the single voice category, and generates a first speaker voice according to the single voice category and the category center;
Wherein the first speaker's voice includes a voice feature including the single voice category and a category center and a duration including a number of frames of the single voice category.
Similarly, the first speaker's voice and the second speaker's voice include a voice feature including the standard class and a class center and a duration including the number of frames of the standard class.
In detail, the classifying the two standard classes in the to-be-detected voice feature set into the first speaker's voice or the second speaker's voice includes:
In one embodiment of the present invention, if the score value of the to-be-detected voice feature set is greater than the score threshold, merging two standard classes of the to-be-detected voice feature set into a single voice class, calculating a class center of the single voice class, calculating a cosine distance between the class center of the single voice class and the class centers of the first speaker and the second speaker, and classifying the single voice class into the first speaker or the second speaker according to the cosine distance.
The single voice category is classified into the first speaker if the cosine distance between its category center and that of the first speaker is smaller, and into the second speaker if the cosine distance to the category center of the second speaker is smaller.
The categorizing comprises: combining the single voice category with the first speaker voice or the second speaker voice, and recalculating the combined category center; and accumulating the frame number of the single voice category and the duration of the first speaker voice or the second speaker voice.
In another embodiment of the present invention, if the score value is less than or equal to the score threshold, the two standard classes are respectively classified into the first speaker and the second speaker according to the cosine distance by calculating the cosine distance between the class center of each standard class and the class centers of the first speaker and the second speaker in the to-be-detected speech feature set.
If the cosine distance between the category center of standard category A and the category center of the first speaker voice is smaller, and the cosine distance between the category center of standard category B and the category center of the second speaker voice is smaller, standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Similarly, the categorizing includes: combining standard category A and standard category B with the first speaker voice and the second speaker voice respectively, and recalculating the combined category centers; and accumulating the frame numbers of standard category A and standard category B into the durations of the first speaker voice and the second speaker voice.
The background voice removing module 105 is configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the voice set according to those durations, and delete the background voice from the voice set.
Preferably, in call audio, the target speaker generally speaks for longer than the background speaker; therefore, in the embodiment of the present invention, whichever of the first speaker voice and the second speaker voice has the longer duration is taken as the target speaker voice, and the other is taken as the background voice.
In detail, the background voice removal module 105 deletes the background voice from the voice set through the following steps:
calculating the duration proportion of the background voice in the call by using a preset duration algorithm;
comparing the duration proportion with a preset proportion threshold value;
And when the duration proportion is greater than the proportion threshold, deleting the background voice from the voice set, and removing the background voice in the call audio.
Wherein, the duration algorithm is as follows:
R=t/T
wherein R is the duration proportion of the background voice in the call, t is the duration of the background speaker voice, and T is the total call duration, i.e., the sum of the durations of the target speaker voice and the background speaker voice.
Preferably, when the duration proportion is smaller than the proportion threshold, the current call suffers little background voice interference and the call audio does not need to be processed; when the duration proportion is larger than the proportion threshold, the call suffers serious background voice interference, and deleting the background voice from the voice set can reduce misrecognition caused by the background voice and improve voice call quality.
As shown in fig. 5, a schematic structural diagram of an electronic device implementing a noise cancellation method for voice call according to the present invention is shown.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a noise cancellation program 12 for a voice call, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the noise cancellation program 12 for voice calls, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device: it connects the parts of the entire electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (e.g., the noise cancellation program for voice calls) and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. The bus is arranged to enable connection and communication between the memory 11 and the at least one processor 10, etc.
Fig. 5 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that these embodiments are for illustration only, and the scope of the patent application is not limited to this configuration.
The noise cancellation program 12 of the voice call stored in the memory 11 in the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, can implement:
detecting voice endpoints of the call audio to obtain a voice set;
extracting voice features from the voice set to obtain a voice feature set;
intercepting, from the voice feature set in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set, and scoring the obtained clustering results by using a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set;
dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value;
and calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the voice set according to those durations, and deleting the background voice from the voice set.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims should not be considered as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope.

Claims (8)

1. A method for noise cancellation in a voice call, the method comprising:
detecting voice endpoints of call audio to obtain a voice set;
extracting voice features of the voice set to obtain a voice feature set;
intercepting, from the voice feature set in chronological order, to-be-detected voice feature sets whose accumulated duration equals a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set to obtain two corresponding standard categories, and scoring the obtained clustering results with a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set;
dividing the voice set into a first speaker voice and a second speaker voice according to the scoring values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the voice set according to those durations, and deleting the background voice from the voice set;
wherein the dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value comprises: selecting one of the to-be-detected voice feature sets and obtaining its corresponding scoring value; comparing the scoring value with a preset scoring threshold; when the scoring value is greater than the preset scoring threshold, merging the two standard categories of the selected to-be-detected voice feature set into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center; when the scoring value is less than or equal to the preset scoring threshold, generating a first speaker voice and a second speaker voice according to the two standard categories; and selecting the next to-be-detected voice feature set, obtaining its corresponding scoring value, and classifying the two standard categories in that to-be-detected voice feature set into the first speaker voice or the second speaker voice according to the scoring value;
wherein the classifying the two standard categories in the to-be-detected voice feature set into the first speaker voice or the second speaker voice according to the scoring value comprises: if the scoring value is greater than the scoring threshold, merging the two standard categories of the to-be-detected voice feature set into a single voice category, calculating a category center of the single voice category, and classifying the single voice category into the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single voice category and the category centers of the first speaker voice and the second speaker voice; and if the scoring value is less than or equal to the scoring threshold, classifying the two standard categories into the first speaker voice and the second speaker voice respectively according to the cosine distances between the category centers of the two standard categories in the to-be-detected voice feature set and the category centers of the first speaker voice and the second speaker voice;
and wherein a higher scoring value indicates a greater possibility that the voices corresponding to the two standard categories belong to the same speaker, and a lower scoring value indicates a smaller possibility that they belong to the same speaker.
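As an illustration of the routing logic above, the following sketch (in Python, assuming numpy arrays of frame-level features) seeds the speaker classes from the first to-be-detected window and then attaches each subsequent category to the nearest class by cosine distance. The 0.5 threshold, the dict-based speaker state, and the seeding behaviour for the first window are illustrative assumptions; the claim itself does not fix these details.

    import numpy as np

    def cosine_sim(a, b):
        # cosine similarity between two category centers (distance = 1 - similarity)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def attach(speakers, cat):
        # classify one category into the nearest speaker class and update it
        center = cat.mean(axis=0)
        best = max(speakers, key=lambda s: cosine_sim(center, s["center"]))
        best["frames"] = np.vstack([best["frames"], cat])
        best["center"] = best["frames"].mean(axis=0)

    def split_speakers(windows, score_thresh=0.5):
        # windows: list of (category_a, category_b, score) per to-be-detected set
        speakers = []
        for cat_a, cat_b, score in windows:
            if score > score_thresh:
                cats = [np.vstack([cat_a, cat_b])]   # high score: one speaker, merge
            else:
                cats = [cat_a, cat_b]                # low score: two speakers
            if not speakers:
                # the first window seeds one class (merged) or two classes
                speakers = [{"center": c.mean(axis=0), "frames": c} for c in cats]
            else:
                for cat in cats:
                    attach(speakers, cat)
        return speakers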
2. The method for noise cancellation in a voice call according to claim 1, wherein said extracting voice features of the voice set to obtain a voice feature set comprises:
performing pre-emphasis, framing and windowing on the voice set to obtain a voice frame sequence;
obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through a fast Fourier transform;
converting the frequency spectrum into a mel spectrum through a mel filter bank;
and performing cepstrum analysis on the mel spectrum to obtain the voice feature set corresponding to the voice set.
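A minimal sketch of this pipeline, assuming the librosa library: its feature.mfcc call bundles the framing, windowing, FFT, mel filter bank, and cepstral (DCT) steps, while pre-emphasis is applied separately. The 0.97 coefficient, 25 ms frames, 10 ms hop, and 13 coefficients are common defaults assumed here, not values fixed by this claim.

    import librosa

    def extract_features(voice, sr=16000):
        # pre-emphasis boosts high frequencies before spectral analysis
        emphasized = librosa.effects.preemphasis(voice, coef=0.97)
        mfcc = librosa.feature.mfcc(
            y=emphasized, sr=sr, n_mfcc=13,
            n_fft=int(0.025 * sr),       # 25 ms frames -> spectrum per frame (FFT)
            hop_length=int(0.010 * sr),  # 10 ms hop (framing)
            window="hamming",            # windowing before the FFT
        )
        return mfcc.T                    # (n_frames, 13) voice feature set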
3. The method for noise cancellation in a voice call according to claim 1, wherein said clustering each of said to-be-detected voice feature sets comprises:
step a, randomly selecting two feature vectors from the to-be-detected voice feature set as class centers;
step b, clustering each feature vector in the to-be-detected voice feature set with the nearest class center by calculating the distance from the feature vector to each class center, to obtain two initial categories;
step c, updating the category centers of the two initial categories;
and step d, repeating step b and step c until the number of iterations reaches a preset iteration threshold, to obtain two standard categories.
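A minimal sketch of steps a through d, assuming the to-be-detected voice feature set is an (n_frames, n_dims) numpy array; Euclidean distance and the empty-class guard are illustrative choices, since the claim only requires a distance to each class center and a preset iteration threshold.

    import numpy as np

    def cluster_two(features, n_iters=10, seed=0):
        rng = np.random.default_rng(seed)
        # step a: randomly pick two feature vectors as the initial class centers
        centers = features[rng.choice(len(features), size=2, replace=False)]
        for _ in range(n_iters):  # step d: repeat b and c up to the iteration threshold
            # step b: assign every feature vector to its nearest class center
            dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step c: update each class center as the mean of its members
            centers = np.array([
                features[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in (0, 1)
            ])
        return features[labels == 0], features[labels == 1]  # two standard categories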
4. The method of noise cancellation for a voice call of claim 1, wherein the classifying into the first speaker voice or the second speaker voice comprises:
combining the single voice category with the first speaker voice or the second speaker voice, recalculating the combined category center, and accumulating the frame number of the single voice category into the duration of the first speaker voice or the second speaker voice; or
combining the two standard categories with the first speaker voice and the second speaker voice respectively, recalculating the combined category centers, and accumulating the frame numbers of the standard categories into the durations of the first speaker voice and the second speaker voice.
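A minimal sketch of this bookkeeping, reusing the dict-based speaker state from the sketch after claim 1; the 10 ms frame hop used to convert accumulated frame counts into seconds is an assumption.

    import numpy as np

    def merge_into_speaker(speaker, category, frame_hop_s=0.01):
        # combine the newly classified category with the speaker class
        speaker["frames"] = np.vstack([speaker["frames"], category])
        # recalculate the combined category center
        speaker["center"] = speaker["frames"].mean(axis=0)
        # accumulate the frame count as the speaker's duration
        speaker["duration_s"] = len(speaker["frames"]) * frame_hop_s
        return speaker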
5. The noise cancellation method of a voice call of any one of claims 1 to 4, wherein said deleting the background voice from the voice set comprises:
calculating the duration proportion of the background voice in the call by using a preset duration algorithm;
and when the duration proportion is greater than a preset proportion threshold, deleting the background voice from the voice set, thereby removing the background voice from the call audio.
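A minimal sketch of this decision, assuming the voice set and background voice are lists of (start_s, end_s) segments; the plain ratio stands in for the "preset duration algorithm", and the 0.1 threshold is illustrative.

    def remove_background(voice_set, background, call_duration_s, ratio_thresh=0.1):
        # duration proportion of the background voice in the whole call
        bg_duration = sum(end - start for start, end in background)
        proportion = bg_duration / call_duration_s if call_duration_s else 0.0
        if proportion > ratio_thresh:
            # delete every background segment from the voice set
            return [seg for seg in voice_set if seg not in background]
        return voice_set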
6. A noise cancellation device for a voice call, the device comprising:
the voice endpoint detection module is used for detecting voice endpoints of call audio to obtain a voice set;
The voice feature extraction module is used for extracting voice features of the voice set to obtain a voice feature set;
the clustering scoring module is used for intercepting, from the voice feature set in chronological order, to-be-detected voice feature sets whose accumulated duration equals a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set to obtain two corresponding standard categories, and scoring the obtained clustering results with a preset evaluation algorithm to obtain a scoring value for each to-be-detected voice feature set;
the voice classification module is used for dividing the voice set into a first speaker voice and a second speaker voice according to the scoring values;
the background voice removing module is used for calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the voice set according to those durations, and deleting the background voice from the voice set;
wherein the dividing the voice set into a first speaker voice and a second speaker voice according to the scoring value comprises: selecting one of the to-be-detected voice feature sets and obtaining its corresponding scoring value; comparing the scoring value with a preset scoring threshold; when the scoring value is greater than the preset scoring threshold, merging the two standard categories of the selected to-be-detected voice feature set into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center; when the scoring value is less than or equal to the preset scoring threshold, generating a first speaker voice and a second speaker voice according to the two standard categories; and selecting the next to-be-detected voice feature set, obtaining its corresponding scoring value, and classifying the two standard categories in that to-be-detected voice feature set into the first speaker voice or the second speaker voice according to the scoring value;
wherein the classifying the two standard categories in the to-be-detected voice feature set into the first speaker voice or the second speaker voice according to the scoring value comprises: if the scoring value is greater than the scoring threshold, merging the two standard categories of the to-be-detected voice feature set into a single voice category, calculating a category center of the single voice category, and classifying the single voice category into the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single voice category and the category centers of the first speaker voice and the second speaker voice; and if the scoring value is less than or equal to the scoring threshold, classifying the two standard categories into the first speaker voice and the second speaker voice respectively according to the cosine distances between the category centers of the two standard categories in the to-be-detected voice feature set and the category centers of the first speaker voice and the second speaker voice;
and wherein a higher scoring value indicates a greater possibility that the voices corresponding to the two standard categories belong to the same speaker, and a lower scoring value indicates a smaller possibility that they belong to the same speaker.
7. An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
A processor executing instructions stored in the memory to perform the method of noise cancellation for a voice call as claimed in any one of claims 1 to 5.
8. A computer-readable storage medium comprising a stored data area and a stored program area, the stored data area storing created data and the stored program area storing a computer program, wherein the computer program, when executed by a processor, implements the noise cancellation method of a voice call according to any one of claims 1 to 5.
CN202010570483.4A 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium Active CN111754982B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010570483.4A CN111754982B (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium
PCT/CN2020/121571 WO2021151310A1 (en) 2020-06-19 2020-10-16 Voice call noise cancellation method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010570483.4A CN111754982B (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111754982A CN111754982A (en) 2020-10-09
CN111754982B (en) 2024-11-05

Family

ID=72675687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010570483.4A Active CN111754982B (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111754982B (en)
WO (1) WO2021151310A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754982B (en) * 2020-06-19 2024-11-05 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium
CN112700790A (en) * 2020-12-11 2021-04-23 广州市申迪计算机系统有限公司 IDC machine room sound processing method, system, equipment and computer storage medium
CN113255362B (en) * 2021-05-19 2024-02-02 平安科技(深圳)有限公司 Method and device for filtering and identifying human voice, electronic device and storage medium
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN114070935B (en) * 2022-01-12 2022-04-15 百融至信(北京)征信有限公司 Intelligent outbound interruption method and system
CN115394310B (en) * 2022-08-19 2023-04-07 中邮消费金融有限公司 Neural network-based background voice removing method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033027A1 (en) * 2005-08-03 2007-02-08 Texas Instruments, Incorporated Systems and methods employing stochastic bias compensation and bayesian joint additive/convolutive compensation in automatic speech recognition
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
CN110797021B (en) * 2018-05-24 2022-06-07 腾讯科技(深圳)有限公司 Hybrid speech recognition network training method, hybrid speech recognition device and storage medium
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
CN110136749B (en) * 2019-06-14 2022-08-16 思必驰科技股份有限公司 Method and device for detecting end-to-end voice endpoint related to speaker
CN111754982B (en) * 2020-06-19 2024-11-05 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium

Also Published As

Publication number Publication date
CN111754982A (en) 2020-10-09
WO2021151310A1 (en) 2021-08-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant