WO2022105861A1 - Method, apparatus, electronic device, and medium for recognizing speech - Google Patents

Method, apparatus, electronic device, and medium for recognizing speech

Info

Publication number
WO2022105861A1
WO2022105861A1 PCT/CN2021/131694 CN2021131694W WO2022105861A1 WO 2022105861 A1 WO2022105861 A1 WO 2022105861A1 CN 2021131694 W CN2021131694 W CN 2021131694W WO 2022105861 A1 WO2022105861 A1 WO 2022105861A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
text
recognized
speech
matched
Prior art date
Application number
PCT/CN2021/131694
Other languages
English (en)
French (fr)
Inventor
许凌
何怡
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Priority to US18/037,546 (published as US20240021202A1)
Publication of WO2022105861A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L2025/783Detection of presence or absence of voice signals based on threshold decision

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, an electronic device, and a medium for recognizing speech.
  • speech recognition technology is finding more and more applications. For example, voice interaction on smart devices and content review on audio, short-video, and live-streaming platforms all rely on the results of speech recognition.
  • a related approach is to use existing speech recognition models to extract features from the audio to be recognized, recognize the acoustic states, and output the corresponding recognized text through a language model.
  • Embodiments of the present disclosure propose methods, apparatuses, electronic devices, and media for recognizing speech.
  • in a first aspect, an embodiment of the present disclosure provides a method for recognizing speech, the method comprising: acquiring audio to be recognized, wherein the audio to be recognized includes a speech segment; determining the start and end times corresponding to the speech segment included in the audio to be recognized; extracting at least one speech segment from the audio to be recognized according to the determined start and end times; and performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • an embodiment of the present disclosure provides an apparatus for recognizing speech, the apparatus comprising: an acquisition unit configured to acquire audio to be recognized, wherein the audio to be recognized includes a speech segment; a first determination unit configured to determine the start and end times corresponding to the speech segment included in the audio to be recognized; an extraction unit configured to extract at least one speech segment from the audio to be recognized according to the determined start and end times; and a generation unit configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • an embodiment of the present disclosure provides an electronic device, the electronic device comprising: one or more processors; and a storage device on which one or more programs are stored, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect.
  • an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, wherein, when the program is executed by a processor, the method described in any implementation of the first aspect is implemented.
  • the method, apparatus, and electronic device for recognizing speech provided by the embodiments of the present disclosure can decompose the speech contained in the original audio into speech segments.
  • the recognized text corresponding to the entire audio is generated by fusing the recognition results of the extracted speech segments, so that the speech segments can be recognized in parallel and the speed of speech recognition is improved.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;
  • FIG. 2 is a flowchart of one embodiment of a method for recognizing speech according to the present disclosure;
  • FIG. 3 is a schematic diagram of an application scenario of the method for recognizing speech according to an embodiment of the present disclosure;
  • FIG. 4 is a flowchart of yet another embodiment of a method for recognizing speech according to the present disclosure;
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for recognizing speech according to the present disclosure;
  • FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
  • FIG. 1 illustrates an exemplary architecture 100 to which the method for recognizing speech or the apparatus for recognizing speech may be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is a medium used to provide a communication link between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
  • the terminal devices 101, 102, and 103 interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, social platform software, text editing applications, and voice interaction applications.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • When the terminal devices 101, 102, and 103 are hardware, they can be various electronic devices supporting voice interaction, including but not limited to smart phones, tablet computers, smart speakers, laptop computers, and desktop computers.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services), or as a single piece of software or software module. There is no specific limitation here.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the speech recognition programs running on the terminal devices 101, 102, and 103.
  • the background server can analyze and process the acquired audio to be recognized, generate a processing result (such as recognized text), and feed the processing result back to the terminal device.
  • the server may be hardware or software.
  • When the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services), or as a single piece of software or software module. There is no specific limitation here.
  • the method for recognizing speech provided by the embodiments of the present disclosure is generally executed by the server 105, and accordingly, the apparatus for recognizing speech is generally set in the server 105.
  • the method for recognizing speech provided by the embodiments of the present disclosure may also be executed by the terminal devices 101, 102, and 103.
  • the apparatus for recognizing speech may accordingly also be set in the terminal devices 101, 102, and 103. In this case, the network 104 and the server 105 may not exist.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • the method for recognizing speech includes the following steps:
  • Step 201: Acquire the audio to be recognized.
  • the execution body of the method for recognizing speech may acquire the audio to be recognized through a wired connection or a wireless connection.
  • the audio to be recognized may include speech segments.
  • the above-mentioned speech segment may be, for example, audio of a person speaking or singing.
  • the above-mentioned execution body may acquire pre-stored audio to be recognized locally.
  • the above-mentioned execution body may also acquire the audio to be recognized sent by an electronic device (for example, a terminal device shown in FIG. 1) that is communicatively connected to it.
  • Step 202: Determine the start and end times corresponding to the speech segment included in the audio to be recognized.
  • the above-mentioned execution body may determine the start and end times corresponding to the speech segment included in the audio to be recognized obtained in the above step 201 in various ways.
  • the above-mentioned execution body may extract audio segments from the audio to be recognized through an endpoint detection algorithm. Afterwards, the execution body may extract audio features from the extracted audio segment. Next, the execution body may determine the similarity between the extracted audio features and a preset speech feature template, where the preset speech feature template is obtained by feature extraction on the speech of a large number of speakers. In response to determining that the similarity between the extracted audio features and the speech feature template is greater than a preset threshold, the execution body may determine the start and end points corresponding to the extracted audio features as the start and end times corresponding to the speech segment.
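  • The template comparison described above can be sketched as follows. The text does not name a similarity measure, so cosine similarity is an assumption here, and `cosine_similarity`, `is_speech_segment`, and the default threshold of 0.8 are hypothetical names and values chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def is_speech_segment(segment_feature, speech_template, threshold=0.8):
    """Accept the segment as speech when its similarity to the preset
    template exceeds the preset threshold (assumed value)."""
    return cosine_similarity(segment_feature, speech_template) > threshold
```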
  • the above-mentioned execution body may determine the start and end times corresponding to the speech segments included in the audio to be recognized according to the following steps:
  • the first step is to extract the audio frame feature of the audio to be recognized, and generate the first audio frame feature.
  • the above-mentioned execution body may extract the audio frame feature of the audio to be recognized obtained in the foregoing step 201 in various ways, thereby generating the first audio frame feature.
  • the above-mentioned execution body may sample the audio to be recognized and perform feature extraction on the sampled audio frames, so as to generate the above-mentioned first audio frame feature.
  • the extracted features may include, but are not limited to, at least one of the following: filter bank (Fbank) features, Linear Prediction Cepstral Coefficients (LPCC), and Mel-Frequency Cepstral Coefficients (MFCC).
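  • A minimal sketch of frame-level feature extraction, under stated assumptions: the 25 ms frame length and 10 ms hop are common defaults rather than values from the text, and per-frame log energy is used only as a stand-in for the Fbank, LPCC, or MFCC features named above. `frame_signal` and `log_energy` are hypothetical helper names.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy(frame, eps=1e-10):
    """One scalar feature per frame: log of the mean squared amplitude."""
    return math.log(sum(x * x for x in frame) / len(frame) + eps)


def extract_frame_features(samples, sample_rate):
    """Return one feature value per frame of the input audio."""
    return [log_energy(f) for f in frame_signal(samples, sample_rate)]
```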
  • the second step is to determine the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • the foregoing executive body may determine the probability that the audio frame corresponding to the first audio frame feature belongs to speech in various manners.
  • the above executive body may determine the similarity between the first audio frame feature generated in the first step and the preset speech frame feature template.
  • the above-mentioned preset speech frame feature template is obtained based on frame feature extraction of speeches of a large number of speakers.
  • the execution body may determine the determined similarity as the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • the above-mentioned execution body may input the above-mentioned first audio frame feature into a pre-trained speech detection model, and generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • the above-mentioned speech detection model may include various neural network models for classification.
  • the above-mentioned speech detection model may output the probability that the above-mentioned first audio frame feature belongs to each category (e.g., speech, ambient sound, pure music).
  • the above-mentioned speech detection model can be obtained by training through the following steps:
  • the execution body used for training the above-mentioned speech detection model may acquire a first training sample set through a wired or wireless connection.
  • the first training samples in the above-mentioned first training sample set may include first sample audio frame features and corresponding sample labeling information.
  • the above-mentioned first sample audio frame feature may be obtained by feature extraction on the first sample audio.
  • the above-mentioned sample labeling information may be used to represent the category to which the above-mentioned first sample audio belongs.
  • the above categories may include speech.
  • the speech category may be further divided into human speaking and human singing.
  • the above categories may also include, for example, pure music and other sounds (e.g., ambient sounds and animal calls).
  • the above-mentioned executive body may acquire an initial speech detection model for classification through a wired or wireless connection.
  • the above-mentioned initial speech detection model may include various neural networks for audio feature classification, such as an RNN (Recurrent Neural Network), a BiLSTM (Bi-directional Long Short-Term Memory network), or a DFSMN (Deep Feed-forward Sequential Memory Network).
  • the above-mentioned initial speech detection model may be a 10-layer DFSMN-structured network.
  • each layer of DFSMN structure can be composed of hidden layers and memory modules.
  • the last layer of the above network can be constructed based on the softmax function, and the number of output units it includes can be consistent with the number of categories for classification.
  • the first sample audio frame feature in the first training sample set is used as the input of the initial speech detection model, and the annotation information corresponding to the input first sample audio frame feature is used as the expected output, and the speech detection model is obtained by training.
  • the above-mentioned execution body may use the first sample audio frame features in the first training sample set obtained in the above step S1 as the input of the initial speech detection model, use the annotation information corresponding to the input first sample audio frame features as the expected output, and obtain the speech detection model by training through machine learning.
  • the above-mentioned execution body may use the cross-entropy (CE) criterion to adjust the network parameters of the above-mentioned initial speech detection model, thereby obtaining the above-mentioned speech detection model.
  • the above-mentioned executive body can use a pre-trained speech detection model to determine whether each frame belongs to a speech frame, thereby improving the recognition accuracy of the speech frame.
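  • To illustrate the softmax output layer and the cross-entropy criterion mentioned above, the sketch below computes both for a single frame. The category names follow the text; the function names are hypothetical, and a real model would be trained with a deep-learning framework rather than with this hand-written arithmetic.

```python
import math

# One output unit per category, as described for the last layer.
CATEGORIES = ["speech", "pure_music", "other"]


def softmax(logits):
    """Numerically stable softmax over the output units."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy loss for one frame whose labeled category is target_index."""
    return -math.log(softmax(logits)[target_index])


def speech_probability(logits):
    """Probability that the frame belongs to the speech category."""
    return softmax(logits)[CATEGORIES.index("speech")]
```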
  • the third step is to generate the start and end times corresponding to the speech segment according to a comparison between the determined probability and a preset threshold.
  • the execution body may generate the start and end times corresponding to the speech segment in various ways.
  • the execution body may first select the probabilities greater than the preset threshold. Then, the execution body may determine the start and end times of each audio segment composed of consecutive audio frames corresponding to the selected probabilities as the start and end times of a speech segment.
  • the above-mentioned execution body can determine the start and end times corresponding to the speech segment according to the probability that each audio frame in the audio to be recognized belongs to speech, thereby improving the detection accuracy of the start and end times corresponding to the speech segment.
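  • The threshold comparison described above can be sketched as a single pass over the per-frame speech probabilities. The function name `speech_segments`, the threshold of 0.5, and the 10 ms frame shift are assumptions for illustration.

```python
def speech_segments(probs, threshold=0.5, frame_ms=10):
    """Return (start_ms, end_ms) pairs for runs of consecutive frames
    whose speech probability is greater than the threshold."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i  # a new run of speech frames begins here
        elif start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # the audio ends while still inside a run
        segments.append((start * frame_ms, len(probs) * frame_ms))
    return segments
```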
  • the above-mentioned execution subject may generate the start and end times corresponding to the speech segment according to the following steps:
  • the above-mentioned execution body may use a preset sliding window to select the probabilities corresponding to a first number of audio frames.
  • the width of the above-mentioned preset sliding window may be preset according to the actual application scenario, for example, 10 milliseconds.
  • the above-mentioned first number may refer to the number of audio frames included in the above-mentioned preset sliding window.
  • the above-mentioned execution body may determine a statistical value of the probabilities selected in the above-mentioned step S1 in various ways.
  • the above statistical value can be used to characterize the overall magnitude of the selected probabilities.
  • the above-mentioned statistical value may be a value obtained by weighted summation.
  • the above statistical value may also include, but is not limited to, at least one of the following: maximum value, minimum value, and median.
  • In response to determining that the statistical value is greater than the preset threshold, the above-mentioned execution body may determine that the audio segment composed of the first number of audio frames corresponding to the selected probabilities belongs to a speech segment. The execution body may therefore determine the endpoint times corresponding to the above-mentioned sliding window as the start and end times corresponding to the above-mentioned speech segment.
  • the above-mentioned execution body can reduce the influence of "glitches" in the original speech on the detection accuracy of the speech segment, thereby improving the detection accuracy of the start and end times corresponding to the speech segment and providing a data basis for subsequent speech recognition.
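  • The sliding-window variant can be sketched as below: one statistic is computed per window position and compared with the threshold, so that a single low-probability frame inside a stretch of speech does not split the segment. The window size, the threshold, and the choice of the mean as the default statistic are assumptions; the text also allows a weighted sum, maximum, minimum, or median.

```python
import statistics


def windowed_speech_flags(probs, window=3, threshold=0.5,
                          statistic=statistics.mean):
    """For each window position, report whether the statistic of the
    frame probabilities inside the window exceeds the threshold."""
    flags = []
    for start in range(0, len(probs) - window + 1):
        selected = probs[start:start + window]
        flags.append(statistic(selected) > threshold)
    return flags
```

  • Passing `statistic=statistics.median`, `max`, or `min` switches to the other statistics listed in the text without changing the rest of the function.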
  • Step 203: Extract at least one speech segment from the audio to be recognized according to the determined start and end times.
  • the above-mentioned execution body may extract at least one speech segment from the audio to be recognized in various ways.
  • the start and end times of the above-mentioned extracted speech segments are generally consistent with the determined start and end times.
  • the above-mentioned execution body may also split or merge audio segments according to the determined start and end times, so as to keep the length of the generated speech segments within a certain range.
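  • One possible way to keep segment lengths within a range is sketched below. The merge and split rules, the function name `normalize_segments`, and all default durations are assumptions; the text only states that segments may be split or merged.

```python
def normalize_segments(segments, min_ms=500, max_ms=3000, max_gap_ms=200):
    """Merge short neighbouring segments and split overlong ones.

    segments: list of (start_ms, end_ms) pairs sorted by start time.
    Note that merging keeps the short non-speech gap between two segments.
    """
    merged = []
    for start, end in segments:
        if merged:
            prev_start, prev_end = merged[-1]
            short = (prev_end - prev_start) < min_ms or (end - start) < min_ms
            close = start - prev_end <= max_gap_ms
            fits = end - prev_start <= max_ms
            if short and close and fits:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))

    result = []
    for start, end in merged:
        while end - start > max_ms:  # cut fixed-length chunks off the front
            result.append((start, start + max_ms))
            start += max_ms
        result.append((start, end))
    return result
```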
  • Step 204: Perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • the above-mentioned execution body may use various speech recognition technologies to perform speech recognition on the at least one speech segment extracted in step 203, thereby generating recognized text corresponding to each speech segment. Then, the execution body may combine the recognized texts corresponding to the speech segments, thereby generating the recognized text corresponding to the audio to be recognized.
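  • Because each extracted segment is recognized independently, the segments can be processed in parallel and the results combined in their original order. The sketch below assumes a caller-supplied `recognize_segment` function (hypothetical) that maps one segment to its text; the thread pool and the space separator are illustrative choices, not details from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_audio(segments, recognize_segment, max_workers=4):
    """Recognize every segment in parallel and join the texts in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # regardless of which segment finishes first.
        texts = list(pool.map(recognize_segment, segments))
    return " ".join(text for text in texts if text)
```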
  • the above-mentioned execution body may perform speech recognition on at least one extracted speech segment according to the following steps, and generate recognized text corresponding to the audio to be recognized:
  • the first step is to extract frame features of speech from the extracted at least one speech segment to generate second audio frame features.
  • the above-mentioned execution body may extract the frame features of the speech from the at least one speech segment extracted in step 203 in various ways, to generate the second audio frame features.
  • the above-mentioned second audio frame feature may include, but is not limited to, at least one of the following: Fbank feature, LPCC feature, and MFCC feature.
  • the above-mentioned execution body may generate the above-mentioned second audio frame feature in a manner similar to that of generating the first audio frame feature in the above-mentioned step 202.
  • If the first audio frame feature has already been generated, the execution body may directly select the corresponding audio frame features from the generated first audio frame feature to generate the second audio frame feature.
  • the second step is to input the second audio frame feature into the pre-trained acoustic model to obtain a second number of phoneme sequences to be matched, and their corresponding scores, for the second audio frame feature.
  • the above-mentioned executive body may input the second audio frame feature into the pre-trained acoustic model, and obtain the second number of phoneme sequences to be matched corresponding to the second audio frame feature and the corresponding score.
  • the above-mentioned acoustic model may include various models used for acoustic state determination in speech recognition.
  • the above-mentioned acoustic model may output the phonemes of the audio frame corresponding to the above-mentioned second audio frame feature and the corresponding probability.
  • the above-mentioned executive body may determine the second number of phoneme sequences with the highest probability corresponding to the above-mentioned second audio frame feature and the corresponding score based on the Viterbi algorithm.
  • the above acoustic model can be obtained by training through the following steps:
  • the executive body used for training the above-mentioned acoustic model may acquire the above-mentioned second training sample set through a wired or wireless connection.
  • the second training samples in the above-mentioned second training sample set may include second sample audio frame features and corresponding sample texts.
  • the above-mentioned second sample audio frame feature can be obtained based on the feature extraction of the second sample audio.
  • the above-mentioned sample text can be used to characterize the content of the above-mentioned second sample audio.
  • the above-mentioned sample text may be a directly obtained phoneme sequence, such as "nihao".
  • the above-mentioned sample text may also be a phoneme sequence converted from a text (for example, "Hello") according to a preset dictionary library.
  • the above-mentioned executive body may acquire the initial acoustic model through wired or wireless connection.
  • the above-mentioned initial acoustic model may include various neural networks for acoustic state determination, such as RNN, BiLSTM, DFSMN.
  • the above-mentioned initial acoustic model may be a 30-layer DFSMN-structured network.
  • each layer of DFSMN structure can be composed of hidden layers and memory modules.
  • the last layer of the above network can be constructed based on the softmax function, and the number of output units it includes can be consistent with the number of recognizable phonemes.
  • the above-mentioned execution body may use the second sample audio frame features in the second training sample set obtained in the above step S1 as the input of the initial acoustic model, use the syllables indicated by the sample text corresponding to the input second sample audio frame features as the desired output, and pre-train the initial acoustic model based on the first training criterion.
  • the above-mentioned first training criterion may be generated based on an audio frame sequence.
  • the above-mentioned first training criterion may include a CTC (Connectionist Temporal Classification) criterion.
  • the above-mentioned execution body may use a preset window function to convert the phonemes indicated by the second sample text obtained in step S1 into phoneme labels used for the second training criterion.
  • the above-mentioned window function may include, but is not limited to, at least one of the following: rectangular window, triangular window.
  • the above-mentioned second training criterion may be generated based on audio frames, such as the CE criterion.
  • For example, the phoneme sequence indicated by the second sample text may be "nihao", and the execution body may convert it into "nnniihhao" by using the preset window function.
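  • One way to read the example above is that each phoneme label is repeated over the frames it spans, which yields one label per audio frame for the frame-level criterion. The sketch below reproduces the "nihao" example under that reading; the per-phoneme frame counts are hypothetical inputs, since the text does not say how they are determined, and `expand_phoneme_labels` is an illustrative name.

```python
def expand_phoneme_labels(phonemes, frames_per_phoneme):
    """Repeat each phoneme label over the number of frames it spans."""
    if len(phonemes) != len(frames_per_phoneme):
        raise ValueError("one frame count is required per phoneme")
    labels = []
    for phoneme, count in zip(phonemes, frames_per_phoneme):
        labels.extend([phoneme] * count)
    return labels


# Frame counts chosen so the output matches the example in the text.
frame_labels = "".join(expand_phoneme_labels("nihao", [3, 2, 2, 1, 1]))
```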
  • the above-mentioned execution body may use the second sample audio frame features in the second training sample set obtained in step S1 as the input of the initial acoustic model pre-trained in step S3, use the phoneme labels converted in step S4 that correspond to the input second sample audio frame features as the desired output, and adjust the parameters of the pre-trained initial acoustic model by using the above-mentioned second training criterion to obtain the acoustic model.
  • the above-mentioned execution body may combine a training criterion generated on the sequence dimension (such as the CTC criterion) with a training criterion generated on the frame dimension (such as the CE criterion), which both reduces the workload of labeling samples and ensures the validity of the trained model.
  • the third step is to input the second number of to-be-matched phoneme sequences into the pre-trained language model to obtain the to-be-matched texts, and their corresponding scores, for the second number of to-be-matched phoneme sequences.
  • the above-mentioned execution body may input the second number of to-be-matched phoneme sequences obtained in the second step into the pre-trained language model, and obtain the to-be-matched text corresponding to the second number of to-be-matched phoneme sequences and the corresponding score.
  • the language model may output the text to be matched and the corresponding score corresponding to each of the second number of phoneme sequences to be matched.
  • the above scores are usually positively correlated with the probability of occurrence in the preset corpus and with the degree of grammaticality.
  • Step 4 According to the obtained scores corresponding to the phoneme sequence to be matched and the text to be matched, select the text to be matched from the obtained text to be matched as the matched text corresponding to the at least one speech segment.
  • the above-mentioned execution body may select the text to be matched from the obtained text to be matched in various ways as corresponding to at least one speech segment. matching text.
  • the above-mentioned execution body may first select a phoneme sequence to be matched whose score corresponding to the obtained phoneme sequence to be matched is greater than the first preset threshold. Then, the execution body may select the text to be matched with the highest score corresponding to the text to be matched from the selected phoneme sequence to be matched as the matched text corresponding to the speech segment corresponding to the above phoneme sequence to be matched.
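  • the threshold-based selection described above might be sketched as follows. The candidate layout (a phoneme sequence, its score, and a list of scored texts), the function name `select_by_threshold`, and the threshold value used in the usage line are assumptions made for illustration; the disclosure does not prescribe them.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (phoneme_sequence, phoneme_score, [(text, text_score), ...])
Candidate = Tuple[str, float, List[Tuple[str, float]]]


def select_by_threshold(candidates: List[Candidate], first_threshold: float) -> Optional[str]:
    """Keep phoneme sequences whose score exceeds the threshold, then return
    the highest-scoring text among the texts of the kept sequences."""
    best_text: Optional[str] = None
    best_score = float("-inf")
    for _phonemes, phoneme_score, texts in candidates:
        if phoneme_score <= first_threshold:
            continue  # this phoneme sequence is filtered out
        for text, text_score in texts:
            if text_score > best_score:
                best_text, best_score = text, text_score
    return best_text  # None when no phoneme sequence passes the threshold
```

  • for instance, `select_by_threshold([("nihao", 82, [("Hello", 95)])], first_threshold=70)` keeps the only candidate and returns its text; a threshold of 90 would filter it out and return `None`.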
  • the above-mentioned execution body can also select the text to be matched from the obtained texts to be matched through the following steps as the matched text corresponding to the at least one speech segment.
  • the above-mentioned execution body may perform a weighted sum of the obtained scores corresponding to the phoneme sequence to be matched and the text to be matched corresponding to the same speech segment, and generate a total score corresponding to each text to be matched.
  • the scores corresponding to the to-be-matched phoneme sequences "nihao” and “niao” corresponding to the speech segment 001 may be 82 and 60, respectively.
  • the scores corresponding to the to-be-matched texts "Hello” and "Fuhao” corresponding to the above-mentioned to-be-matched phoneme sequence "nihao” may be 95 and 72, respectively.
  • the above-mentioned execution body may select the to-be-matched text with the highest total score from the to-be-matched texts obtained in the above step S1 as the matched text corresponding to the at least one speech segment.
  • the above-mentioned execution body may assign different weights to the obtained scores corresponding to the phoneme sequence to be matched and the text to be matched according to the actual application scenario, so as to be more suitable for different application scenarios.
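  • the weighted-sum selection above can be illustrated with a minimal sketch that reuses the example scores given for speech segment 001. The equal weights of 0.5, the dictionary layout, and the function name `select_by_total_score` are assumptions; in practice the weights would be chosen for the application scenario.

```python
from typing import Dict, List, Tuple


def select_by_total_score(
    phoneme_scores: Dict[str, float],
    text_scores: Dict[str, List[Tuple[str, float]]],
    phoneme_weight: float = 0.5,  # assumed weight, not taken from the disclosure
    text_weight: float = 0.5,     # assumed weight, not taken from the disclosure
) -> Tuple[str, float]:
    """Return the (text, total_score) pair with the highest weighted sum.

    Raises ValueError if there is no text to be matched at all.
    """
    totals = []
    for phonemes, candidates in text_scores.items():
        for text, text_score in candidates:
            total = phoneme_weight * phoneme_scores[phonemes] + text_weight * text_score
            totals.append((text, total))
    return max(totals, key=lambda item: item[1])


# Scores from the example for speech segment 001.
phoneme_scores = {"nihao": 82.0, "niao": 60.0}
text_scores = {"nihao": [("Hello", 95.0), ("Fuhao", 72.0)]}
best_text, best_total = select_by_total_score(phoneme_scores, text_scores)
```

  • with these assumed weights, "Hello" receives 0.5 × 82 + 0.5 × 95 = 88.5 and "Fuhao" receives 0.5 × 82 + 0.5 × 72 = 77, so "Hello" is selected.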
  • the recognition text corresponding to the audio to be recognized is generated.
  • the above-mentioned execution body may generate the recognized text corresponding to the audio to be recognized in various ways.
  • the above-mentioned execution body may arrange the selected matched texts according to the sequence of the corresponding speech segments in the above-mentioned audio to be recognized, and perform text post-processing, thereby generating recognized text corresponding to the above-mentioned audio to be recognized.
  • the above-mentioned execution body can generate the recognition text from two dimensions of the phoneme sequence and the language model, so as to improve the recognition accuracy.
  • FIG. 3 is a schematic diagram of an application scenario of the method for recognizing speech according to an embodiment of the present disclosure.
  • the user 301 uses the terminal device 302 to record audio as the audio to be recognized 303 .
  • the background server 304 acquires the above-mentioned audio to be recognized 303 .
  • the background server 304 may determine the start and end times 305 of the speech segment included in the audio to be recognized 303 .
  • the start time and end time of the speech segment A may be 0"24 and 1"15, respectively.
  • the background server 304 may extract at least one speech segment 306 from the audio to be recognized 303 .
  • audio frames corresponding to 0"24-1"15 in the audio to be recognized 303 may be extracted as speech segments. Then, the background server 304 may perform speech recognition on the extracted speech segment 306 to generate recognized text 306 corresponding to the audio to be recognized 303 .
  • the above-mentioned recognized text 306 may be "Hello everyone, welcome to XX class", composed of the recognized texts corresponding to multiple speech segments.
  • the background server 304 may also feed back the generated recognition text 306 to the terminal device 302 .
  • one of the existing technologies usually performs speech recognition directly on the acquired audio. Since the audio often includes non-speech content, the process of extracting features and performing speech recognition consumes too many resources, and the accuracy of speech recognition is adversely affected.
  • the speech contained in the original audio is decomposed into speech segments by extracting speech segments from the audio to be recognized according to the determined start and end times corresponding to the speech segments.
  • the recognition text corresponding to the entire audio is generated by fusing the recognition results of the extracted speech segments, so that the speech segments can be recognized in parallel and the speed of speech recognition is improved.
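  • the extract-then-recognize-in-parallel flow described above might look like the following sketch. The `recognize` callable stands in for whatever speech recognizer is used and is purely hypothetical, as are the function names and the choice of a thread pool; audio is assumed to be a flat sequence of samples at a known sample rate.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def extract_segment(samples: Sequence[float], sample_rate: int, segment: Segment) -> Sequence[float]:
    """Slice out the samples that fall between the segment's start and end times."""
    start, end = segment
    return samples[int(start * sample_rate):int(end * sample_rate)]


def recognize_in_parallel(
    samples: Sequence[float],
    sample_rate: int,
    segments: List[Segment],
    recognize: Callable[[Sequence[float]], str],  # hypothetical recognizer
) -> str:
    """Recognize each speech segment concurrently and fuse results in time order."""
    ordered = sorted(segments)  # order by start time
    clips = [extract_segment(samples, sample_rate, seg) for seg in ordered]
    with ThreadPoolExecutor() as pool:
        texts = list(pool.map(recognize, clips))  # map preserves input order
    return " ".join(text for text in texts if text)
```

  • because `pool.map` returns results in input order, the fused text follows the order of the segments in the audio even when individual segments finish at different times.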
  • the process 400 of the method for recognizing speech includes the following steps:
  • Step 401 acquiring the video file to be reviewed.
  • the execution body of the method for recognizing speech may obtain the video file to be reviewed in various ways, either locally or from a communicatively connected electronic device (for example, the terminal devices 101, 102, and 103 shown in FIG. 1).
  • the above-mentioned file to be reviewed may be, for example, a streaming video of a live broadcast platform, or a submitted video of a short video platform.
  • Step 402 extract the audio track from the video file to be reviewed, and generate the audio to be recognized.
  • the above-mentioned execution body may extract the audio track from the to-be-reviewed video file obtained in the above-mentioned step 401 in various ways to generate the to-be-recognized audio.
  • the above-mentioned execution body may convert the above-mentioned extracted audio track into an audio file in a pre-specified format as the above-mentioned to-be-identified audio.
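  • one possible way to extract and convert the audio track is to call the ffmpeg command-line tool, as sketched below. The disclosure does not name a tool or a target format; the use of ffmpeg, the 16 kHz mono 16-bit PCM WAV output, and both function names are assumptions, and ffmpeg must be installed for `extract_audio_track` to run.

```python
import subprocess
from typing import List


def build_extract_command(video_path: str, audio_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes the audio track as mono 16-bit PCM."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", video_path,          # input video file to be reviewed
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", "1",                # single (mono) channel
        audio_path,
    ]


def extract_audio_track(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
```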
  • Step 403 Determine the start and end times corresponding to the speech segment included in the audio to be recognized.
  • Step 404 Extract at least one speech segment from the audio to be recognized according to the determined start and end time.
  • Step 405 Perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • steps 403, 404, and 405 are respectively consistent with steps 202, 203, and 204 in the foregoing embodiment. The above descriptions of steps 202, 203, and 204 and their optional implementations also apply to steps 403, 404, and 405, and are not repeated here.
  • Step 406 Determine whether there are words in the preset word set in the recognized text.
  • the above-mentioned execution subject may determine whether there are words in the preset vocabulary set in the recognized text generated in step 405 in various ways.
  • the above-mentioned preset word set may include a preset sensitive word set.
  • the above sensitive word set may include, for example, advertising terms, uncivilized terms, and the like.
  • the above-mentioned execution body may determine whether there are words in the preset vocabulary set in the recognized text according to the following steps:
  • the words in the preset word set are divided into a third number of retrieval units.
  • the above-mentioned execution body may split the words in the above-mentioned preset word set into a third number of retrieval units.
  • the words in the preset word set may include "time-limited seckill", and the above-mentioned execution body may use word segmentation technology to split "time-limited seckill" into "time-limited" and "seckill" as retrieval units.
  • the second step according to the number of words in the recognized text that match the retrieval unit, it is determined whether there are words in the preset vocabulary set in the recognized text.
  • the above-mentioned execution body may first match the recognized text generated in the above-mentioned step 405 against the above-mentioned retrieval units to determine the number of matching retrieval units. Then, according to the determined number of retrieval units, the above-mentioned execution body can determine in various ways whether words in the preset word set exist in the above-mentioned recognized text. As an example, in response to the determined number of retrieval units corresponding to the same word being greater than 1, the above-mentioned execution body may determine that a word in the preset word set exists in the recognized text.
  • the above-mentioned execution body may further determine that there are words in the preset vocabulary set in the recognized text in response to determining that all retrieval units belonging to the same word in the above-mentioned preset vocabulary set exist in the recognized text.
  • the above-mentioned execution subject can implement fuzzy matching of search terms, thereby enhancing the strength of the review.
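  • the retrieval-unit matching above can be sketched as follows for the variant in which all retrieval units of a word must appear in the recognized text. The mapping from each preset word to its retrieval units is assumed to have been produced beforehand by a word segmentation step; the mapping layout and the function name `matched_words` are hypothetical.

```python
from typing import Dict, List


def matched_words(recognized_text: str, retrieval_units: Dict[str, List[str]]) -> List[str]:
    """Return the preset words whose retrieval units all occur in the text.

    The units need not be adjacent or in order, which is what makes the
    match fuzzy compared with searching for the whole word.
    """
    return [
        word
        for word, units in retrieval_units.items()
        if units and all(unit in recognized_text for unit in units)
    ]


# Retrieval units for the example word, assumed to come from word segmentation.
units = {"time-limited seckill": ["time-limited", "seckill"]}
hits = matched_words("join the seckill now, a time-limited offer", units)
```

  • here `hits` contains "time-limited seckill" even though the exact phrase never appears, whereas a text containing only "time-limited" would produce no hit.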
  • the words in the preset word set may correspond to risk level information.
  • the above risk level information may be used to represent different urgency levels, such as priority processing levels, sequential processing levels, and the like.
  • Step 407 in response to the determination of existence, sending the video file to be reviewed and the identification text to the target terminal.
  • the execution subject may send the video file to be reviewed and the recognized text to the target terminal in various ways.
  • the above-mentioned target terminal may be a terminal for reviewing the video to be reviewed, such as a terminal for manual review or a terminal for performing keyword review using other review technologies.
  • the target terminal may also be a terminal that sends the video file to be reviewed, so as to prompt a user using the terminal to adjust the video file to be reviewed.
  • the execution subject may send the video file to be reviewed and the identification text to the target terminal according to the following steps:
  • risk level information corresponding to the matched word is determined.
  • the above-mentioned execution body may determine the risk level information corresponding to the above-mentioned matched words.
  • the video file to be reviewed and the identification text are sent to the terminal that matches the determined risk level information.
  • the execution subject may send the video file to be reviewed and the identification text to the terminal that matches the determined risk level information.
  • the above-mentioned execution subject may send the video file to be reviewed and the identification text corresponding to the risk level information used to represent the priority processing to the terminal used for the priority processing.
  • the above-mentioned execution subject may store the video file to be reviewed and the identification text corresponding to the risk level information used to represent the sequential processing in the to-be-reviewed queue. Then, the to-be-reviewed video file and the identification text are selected from the above-mentioned to-be-reviewed queue and sent to the terminal for review.
  • the above-mentioned execution body can perform hierarchical processing on to-be-reviewed video files triggering keywords of different risk levels, which improves processing efficiency and flexibility.
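  • the hierarchical handling by risk level might be sketched as below. The level names "priority" and "sequential", the class name `ReviewDispatcher`, and the callback used to reach the priority terminal are all assumptions for illustration; the disclosure only states that items are routed according to their risk level information.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

ReviewItem = Tuple[str, str]  # (video_file_path, recognized_text)


class ReviewDispatcher:
    """Send priority items at once; queue the rest for sequential review."""

    def __init__(self, send_to_priority_terminal: Callable[[ReviewItem], None]) -> None:
        self._send_priority = send_to_priority_terminal
        self.pending: Deque[ReviewItem] = deque()  # to-be-reviewed queue

    def dispatch(self, item: ReviewItem, risk_level: str) -> None:
        if risk_level == "priority":
            self._send_priority(item)   # goes straight to the priority terminal
        else:
            self.pending.append(item)   # waits its turn in the queue

    def next_for_review(self) -> Optional[ReviewItem]:
        """Pop the oldest queued item, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None
```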
  • the process 400 of the method for recognizing speech in this embodiment embodies the steps of extracting audio from the video file to be reviewed and, in response to determining that words in the preset word set exist in the recognized text corresponding to the extracted audio, sending the video file to be reviewed and the recognized text to the target terminal. Therefore, by sending only videos that hit a specific word to the target terminal, the solution described in this embodiment can significantly reduce the amount of video review and effectively improve the efficiency of video review when the target terminal is used for video content review.
  • the present disclosure provides an embodiment of an apparatus for recognizing speech, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 or FIG. 4. Specifically, the apparatus can be applied to various electronic devices.
  • the apparatus 500 for recognizing speech includes an acquiring unit 501 , a first determining unit 502 , an extracting unit 503 and a generating unit 504 .
  • the acquiring unit 501 is configured to acquire the audio to be recognized, wherein the audio to be recognized includes voice segments;
  • the first determining unit 502 is configured to determine the start and end times corresponding to the speech segments included in the audio to be recognized;
  • the extraction unit 503 is configured to extract at least one speech segment from the audio to be recognized according to the determined start and end time;
  • the generating unit 504 is configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • for the specific processing of the acquiring unit 501, the first determining unit 502, the extracting unit 503, and the generating unit 504 and the technical effects they bring, reference may be made to the related descriptions of step 201, step 202, step 203, and step 204 in the embodiment corresponding to FIG. 2, respectively, which will not be repeated here.
  • the foregoing first determination unit 502 may include a first determination subunit (not shown in the figure) and a first generation subunit (not shown in the figure).
  • the above-mentioned first determining subunit may be configured to determine the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • the above-mentioned first generating subunit may be configured to generate start and end times corresponding to the speech segment according to the comparison between the determined probability and a preset threshold.
  • the above-mentioned first determining subunit may be further configured to: input the first audio frame feature into a pre-trained speech detection model, and generate an audio frame corresponding to the first audio frame feature Probability of belonging to speech.
  • the above-mentioned speech detection model may be obtained by training through the following steps: acquiring a first training sample set, wherein each first training sample in the first training sample set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction from first sample audio, and the sample annotation information is used to represent the category to which the first sample audio belongs, the category including speech; acquiring an initial speech detection model for classification; and using the first sample audio frame feature in the first training sample set as the input of the initial speech detection model and the annotation information corresponding to the input first sample audio frame feature as the expected output, training to obtain the speech detection model.
  • the above-mentioned first generation subunit may include a first selection module (not shown in the figure), a determination module (not shown in the figure), and a first generation module (not shown in the figure).
  • the above-mentioned first selection module may be configured to use a preset sliding window to select probabilities corresponding to the first number of audio frames.
  • the determination module described above may be configured to determine a statistical value of the selected probability.
  • the above-mentioned first generating module may be configured to, in response to determining that the statistical value is greater than the above-mentioned preset threshold, generate the start and end times corresponding to the speech segment according to the audio segment composed of the first number of audio frames corresponding to the selected probability.
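  • the sliding-window logic of the three modules above might be sketched as follows. Several details are assumptions not fixed by the disclosure: the statistical value is taken to be the mean, the window advances one frame at a time, overlapping windows that pass the threshold are merged into one segment, and each frame is assumed to cover `frame_shift_s` seconds.

```python
from typing import List, Optional, Sequence, Tuple


def speech_segments(
    probabilities: Sequence[float],
    window_size: int,
    threshold: float,
    frame_shift_s: float = 0.01,  # assumed duration covered by one frame
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of the detected speech segments."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    # Mark every frame that lies in a window whose mean probability passes the threshold.
    is_speech = [False] * len(probabilities)
    for start in range(len(probabilities) - window_size + 1):
        window = probabilities[start:start + window_size]
        if sum(window) / window_size > threshold:
            for i in range(start, start + window_size):
                is_speech[i] = True

    # Merge runs of consecutive speech frames into start and end times.
    segments: List[Tuple[float, float]] = []
    seg_start: Optional[int] = None
    for i, flag in enumerate(is_speech + [False]):  # sentinel closes the last run
        if flag and seg_start is None:
            seg_start = i
        elif not flag and seg_start is not None:
            segments.append((seg_start * frame_shift_s, i * frame_shift_s))
            seg_start = None
    return segments
```

  • averaging over a window rather than thresholding single frames makes the detected boundaries less sensitive to isolated frames whose probability briefly spikes or dips.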
  • the foregoing generating unit 504 may include a second generating subunit (not shown in the figure), a third generating subunit (not shown in the figure), and a fourth generating subunit (not shown in the figure), a selection sub-unit (not shown in the figure), and a fifth generation sub-unit (not shown in the figure).
  • the above-mentioned second generating subunit may be configured to extract the frame feature of the speech from the extracted at least one speech segment, and generate the second audio frame feature.
  • the above-mentioned third generating subunit may be configured to input the second audio frame feature into the pre-trained acoustic model, and obtain a second number of to-be-matched phoneme sequences and corresponding scores corresponding to the second audio frame feature.
  • the above-mentioned fourth generating subunit may be configured to input the second number of to-be-matched phoneme sequences into the pre-trained language model, and obtain the to-be-matched text and corresponding scores corresponding to the second number of to-be-matched phoneme sequences.
  • the above selection subunit may be configured to select the text to be matched from the obtained text to be matched as the matching text corresponding to the at least one speech segment according to the obtained scores corresponding to the phoneme sequence to be matched and the text to be matched respectively.
  • the above-mentioned fifth generating subunit may be configured to generate recognized text corresponding to the audio to be recognized according to the selected matching text.
  • the above acoustic model may be obtained by training through the following steps: obtaining a second training sample set; obtaining an initial acoustic model; using the second sample audio frame feature in the second training sample set as the input of the initial acoustic model and the phoneme indicated by the sample text corresponding to the input second sample audio frame feature as the expected output, pre-training the initial acoustic model based on the first training criterion; converting the phoneme indicated by the second sample text into a phoneme label for the second training criterion; and using the second sample audio frame feature in the second training sample set as the input of the pre-trained initial acoustic model and the phoneme label corresponding to the input second sample audio frame feature as the expected output, training the pre-trained initial acoustic model by using the second training criterion to obtain the acoustic model.
  • the above-mentioned selection subunit may include a second generation module (not shown in the figure) and a second selection module (not shown in the figure).
  • the above-mentioned second generation module may be configured to perform a weighted sum of the obtained scores corresponding to the phoneme sequence to be matched and the text to be matched, respectively, to generate a total score corresponding to each text to be matched.
  • the above-mentioned second selection module may be configured to select the text to be matched with the highest total score from the obtained texts to be matched as the matched text corresponding to the at least one speech segment.
  • the foregoing obtaining unit 501 may include an obtaining subunit (not shown in the figure) and a sixth generating subunit (not shown in the figure).
  • the obtaining subunit may be configured to obtain the video file to be reviewed.
  • the above sixth generating subunit may be configured to extract the audio track from the video file to be reviewed to generate the audio to be recognized.
  • the above apparatus for recognizing speech may further include: a second determining unit (not shown in the figure), and a sending unit (not shown in the figure).
  • the above-mentioned second determining unit may be configured to determine whether words in the preset vocabulary exist in the recognized text.
  • the above-mentioned sending unit may be configured to send the video file to be reviewed and the identification text to the target terminal in response to determining the existence.
  • the above-mentioned second determination unit may include a split subunit (not shown in the figure) and a second determination subunit (not shown in the figure).
  • the above-mentioned splitting subunit may be configured to split the words in the preset word set into a third number of retrieval units.
  • the above-mentioned second determination subunit may be configured to determine whether words in the preset word set exist in the recognized text according to the number of words in the recognized text that match the retrieval unit.
  • the above-mentioned second determining subunit may be further configured to, in response to determining that all retrieval units belonging to the same word in the preset word set exist in the recognized text, determine that the recognized text contains words in the preset word set.
  • the words in the preset word set may correspond to risk level information.
  • the above-mentioned sending unit may include a third determining subunit (not shown in the figure) and a sending subunit (not shown in the figure). Wherein, the above-mentioned third determination subunit may be configured to, in response to determining the existence, determine the risk level information corresponding to the matched word.
  • the above-mentioned sending subunit may be configured to send the video file to be reviewed and the identification text to the terminal matching the determined risk level information.
  • the extraction unit 503 extracts the speech segment from the audio to be recognized according to the start and end times corresponding to the speech segment determined by the first determination unit 502, thereby realizing the separation of speech from the original audio.
  • the generating unit 504 fuses the recognition results of the speech segments extracted by the extracting unit 503 to generate the recognized text corresponding to the entire audio, so that the speech segments can be recognized in parallel and the speed of speech recognition is improved.
  • referring to FIG. 6, it shows a schematic structural diagram of an electronic device (e.g., the server in FIG. 1) 600 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the server shown in FIG. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604 .
  • generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609.
  • Communication means 609 may allow electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 shows electronic device 600 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • when the computer program is executed by the processing device 601, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, they cause the electronic device to: acquire audio to be recognized, wherein the audio to be recognized includes speech segments; determine the start and end times corresponding to the speech segments included in the audio to be recognized; extract at least one speech segment from the audio to be recognized according to the determined start and end times; and perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • Computer program code for performing operations of embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language, Python, or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner.
  • the described unit may also be set in a processor, for example, it may be described as: a processor, including an acquisition unit, a first determination unit, an extraction unit, and a generation unit. In some cases, the names of these units do not constitute a limitation on the unit itself.
  • the acquiring unit may also be described as "a unit for acquiring audio to be recognized, wherein the audio to be recognized includes a voice segment".
  • the present disclosure provides a method for recognizing speech, the method comprising: acquiring audio to be recognized, wherein the audio to be recognized includes a speech segment; determining the start and end times corresponding to the speech segment included in the audio to be recognized; extracting at least one speech segment from the audio to be recognized according to the determined start and end times; and performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • the above-mentioned determining the start and end times corresponding to the speech segments included in the audio to be recognized includes: extracting audio frame features of the audio to be recognized to generate the first audio frame feature; determining the probability that the audio frame corresponding to the first audio frame feature belongs to speech; and generating the start and end times corresponding to the speech segment according to the comparison between the determined probability and a preset threshold.
  • the above-mentioned determining the probability that the audio frame corresponding to the first audio frame feature belongs to speech includes: inputting the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • the above-mentioned speech detection model is trained by the following steps: acquiring a first training sample set, wherein each first training sample in the first training sample set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction from first sample audio, and the sample annotation information is used to represent the category to which the first sample audio belongs, the category including speech; acquiring an initial speech detection model for classification; and using the first sample audio frame feature in the first training sample set as the input of the initial speech detection model and the annotation information corresponding to the input first sample audio frame feature as the expected output, training to obtain the speech detection model.
  • generating the start and end times corresponding to the speech segment according to the comparison between the determined probability and a preset threshold includes: using a preset sliding The window selects the probability corresponding to the first number of audio frames; determines the statistical value of the selected probability; in response to determining that the statistical value is greater than the preset threshold, according to the audio segment formed by the first number of audio frames corresponding to the selected probability, Generate the start and end times corresponding to the speech segment.
  • the above-mentioned performing speech recognition on at least one extracted speech segment to generate recognized text corresponding to the audio to be recognized includes: The frame feature of the at least one speech segment is extracted from the speech, and the second audio frame feature is generated; the second audio frame feature is input into the pre-trained acoustic model, and the second number of phoneme sequences to be matched corresponding to the second audio frame feature are obtained and Corresponding score; input the second number of the phoneme sequences to be matched into the pre-trained language model, and obtain the text to be matched and the corresponding score corresponding to the second number of the phoneme sequences to be matched; According to the scores corresponding to the matched texts, the to-be-matched texts are selected from the obtained to-be-matched texts as the matched texts corresponding to at least one speech segment; according to the selected matched texts, the recognized texts corresponding to the to-be-recognized audios are generated
  • The acoustic model is trained by the following steps: acquiring a second training sample set, wherein each second training sample in the second training sample set includes a second sample audio frame feature and corresponding sample text, the second sample audio frame feature is obtained by feature extraction on second sample audio, and the sample text represents the content of the second sample audio; acquiring an initial acoustic model; using the second sample audio frame feature in the second training sample set as the input of the initial acoustic model and the phonemes indicated by the sample text corresponding to the input second sample audio frame feature as the expected output, pre-training the initial acoustic model based on a first training criterion, wherein the first training criterion is generated based on the audio frame sequence; converting, with a preset window function, the phonemes indicated by the second sample text into phoneme labels for a second training criterion, wherein the second training criterion is generated based on the audio frame; and using the second sample audio frame feature in the second training sample set as the input of the pre-trained initial acoustic model and the phoneme labels corresponding to the input second sample audio frame feature as the expected output, training the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model.
  • Selecting, according to the scores corresponding to the phoneme sequences to be matched and to the texts to be matched, a text to be matched from the obtained texts to be matched as the matched text corresponding to the at least one speech segment includes: computing a weighted sum of the scores corresponding to the phoneme sequences to be matched and to the texts to be matched to generate a total score for each text to be matched; and selecting, from the obtained texts to be matched, the text to be matched with the highest total score as the matched text corresponding to the at least one speech segment.
  • Acquiring the audio to be recognized includes: acquiring a video file to be reviewed; and extracting the audio track from the video file to be reviewed to generate the audio to be recognized. The method further includes: determining whether a word in a preset word set exists in the recognized text; and, in response to determining that such a word exists, sending the video file to be reviewed and the recognized text to a target terminal.
  • Determining whether a word in the preset word set exists in the recognized text includes: splitting the words in the preset word set into a third number of retrieval units; and determining, according to the number of words in the recognized text that match the retrieval units, whether a word in the preset word set exists in the recognized text.
  • Determining, according to the number of words in the recognized text that match the retrieval units, whether a word in the preset word set exists in the recognized text includes: in response to determining that all retrieval units belonging to the same word in the preset word set exist in the recognized text, determining that a word in the preset word set exists in the recognized text.
  • The words in the preset word set correspond to risk level information, and sending the video file to be reviewed and the recognized text to the target terminal in response to determining that such a word exists includes: in response to determining that such a word exists, determining the risk level information corresponding to the matched word; and sending the video file to be reviewed and the recognized text to a terminal matching the determined risk level information.
  • The present disclosure provides an apparatus for recognizing speech, the apparatus comprising: an acquisition unit configured to acquire audio to be recognized, wherein the audio to be recognized includes a speech segment; a first determining unit configured to determine the start and end times corresponding to the speech segment included in the audio to be recognized; an extraction unit configured to extract at least one speech segment from the audio to be recognized according to the determined start and end times; and a generating unit configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
  • The first determining unit includes: a first determining subunit configured to determine the probability that the audio frame corresponding to a first audio frame feature belongs to speech; and a first generating subunit configured to generate the start and end times corresponding to the speech segment according to a comparison between the determined probability and a preset threshold.
  • The first determining subunit is further configured to: input the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
  • The speech detection model is trained by the following steps: acquiring a first training sample set; acquiring an initial speech detection model for classification; and using the first sample audio frame feature in the first training sample set as the input of the initial speech detection model and the annotation information corresponding to the input first sample audio frame feature as the expected output, training to obtain the speech detection model, wherein each first training sample in the first training sample set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction on first sample audio, the sample annotation information represents the category to which the first sample audio belongs, and the categories include speech.
  • The first generating subunit includes: a first selecting module configured to select, with a preset sliding window, the probabilities corresponding to a first number of audio frames; a determining module configured to determine a statistical value of the selected probabilities; and a first generating module configured to, in response to determining that the statistical value is greater than the preset threshold, generate the start and end times corresponding to the speech segment according to the audio segment formed by the first number of audio frames corresponding to the selected probabilities.
  • The generating unit includes: a second generating subunit configured to extract frame features of speech from the extracted at least one speech segment to generate a second audio frame feature; a third generating subunit configured to input the second audio frame feature into a pre-trained acoustic model to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame feature and the corresponding scores; a fourth generating subunit configured to input the second number of phoneme sequences to be matched into a pre-trained language model to obtain the texts to be matched corresponding to the second number of phoneme sequences to be matched and the corresponding scores; a selection subunit configured to select, according to the scores corresponding to the phoneme sequences to be matched and to the texts to be matched, a text to be matched from the obtained texts to be matched as the matched text corresponding to the at least one speech segment; and a fifth generating subunit configured to generate, according to the selected matched text, the recognized text corresponding to the audio to be recognized.
  • The acoustic model is trained by the following steps: acquiring a second training sample set; acquiring an initial acoustic model; using the second sample audio frame feature in the second training sample set as the input of the initial acoustic model and the phonemes indicated by the sample text corresponding to the input second sample audio frame feature as the expected output, pre-training the initial acoustic model based on a first training criterion; converting, with a preset window function, the phonemes indicated by the second sample text into phoneme labels for a second training criterion; and using the second sample audio frame feature in the second training sample set as the input of the pre-trained initial acoustic model and the phoneme labels corresponding to the input second sample audio frame feature as the expected output, training the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model, wherein each second training sample in the second training sample set includes a second sample audio frame feature and corresponding sample text, the second sample audio frame feature is obtained by feature extraction on second sample audio, the sample text represents the content of the second sample audio, the first training criterion is generated based on the audio frame sequence, and the second training criterion is generated based on the audio frame.
  • The selection subunit includes: a second generating module configured to compute a weighted sum of the scores corresponding to the phoneme sequences to be matched and to the texts to be matched to generate a total score for each text to be matched; and a second selecting module configured to select, from the obtained texts to be matched, the text to be matched with the highest total score as the matched text corresponding to the at least one speech segment.
  • The acquisition unit includes: an acquisition subunit configured to acquire a video file to be reviewed; and a sixth generating subunit configured to extract the audio track from the video file to be reviewed to generate the audio to be recognized. The apparatus for recognizing speech further includes: a second determining unit configured to determine whether a word in a preset word set exists in the recognized text; and a sending unit configured to send the video file to be reviewed and the recognized text to a target terminal in response to determining that such a word exists.
  • The second determining unit includes: a splitting subunit configured to split the words in the preset word set into a third number of retrieval units; and a second determining subunit configured to determine, according to the number of words in the recognized text that match the retrieval units, whether a word in the preset word set exists in the recognized text.
  • The second determining subunit is further configured to, in response to determining that all retrieval units belonging to the same word in the preset word set exist in the recognized text, determine that a word in the preset word set exists in the recognized text.
  • The words in the preset word set correspond to risk level information, and the sending unit includes: a third determining subunit configured to, in response to determining that such a word exists, determine the risk level information corresponding to the matched word; and a sending subunit configured to send the video file to be reviewed and the recognized text to a terminal matching the determined risk level information.
  • The present disclosure provides an electronic device comprising: one or more processors; and a storage device on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
  • The present disclosure provides a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the above-described method for recognizing speech.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

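A method, apparatus, electronic device, and medium for recognizing speech. The method includes: acquiring audio to be recognized (201), wherein the audio to be recognized includes a speech segment; determining the start and end times corresponding to the speech segment included in the audio to be recognized (202); extracting at least one speech segment from the audio to be recognized according to the determined start and end times (203); and performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized (204). The speech contained in the original audio is thereby decomposed into speech segments, which provides a basis for recognizing the speech segments in parallel and increasing the speed of speech recognition.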

Description

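Method, apparatus, electronic device, and medium for recognizing speech
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202011314072.5, filed on November 20, 2020 and entitled "Method, apparatus, electronic device, and medium for recognizing speech", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method, apparatus, electronic device, and medium for recognizing speech.
Background
With the rapid development of artificial intelligence, speech recognition is being applied ever more widely. For example, voice interaction with smart devices and content review on audio, short-video, and live-streaming platforms both depend on speech recognition results.
A related approach uses existing speech recognition models to perform feature extraction and acoustic state recognition on the audio to be recognized, and outputs the corresponding recognized text through a language model.
Summary
Embodiments of the present disclosure propose a method, apparatus, electronic device, and medium for recognizing speech.
In a first aspect, an embodiment of the present disclosure provides a method for recognizing speech, the method including: acquiring audio to be recognized, wherein the audio to be recognized includes a speech segment; determining the start and end times corresponding to the speech segment included in the audio to be recognized; extracting at least one speech segment from the audio to be recognized according to the determined start and end times; and performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
In a second aspect, an embodiment of the present disclosure provides an apparatus for recognizing speech, the apparatus including: an acquisition unit configured to acquire audio to be recognized, wherein the audio to be recognized includes a speech segment; a first determining unit configured to determine the start and end times corresponding to the speech segment included in the audio to be recognized; an extraction unit configured to extract at least one speech segment from the audio to be recognized according to the determined start and end times; and a generating unit configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
In a third aspect, an embodiment of the present disclosure provides an electronic device including: one or more processors; and a storage device on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
The method, apparatus, and electronic device for recognizing speech provided by embodiments of the present disclosure extract speech segments from the audio to be recognized according to the determined start and end times of the speech segments, thereby decomposing the speech contained in the original audio into speech segments. In addition, the recognition results of the extracted speech segments are fused to generate the recognized text for the whole audio, so the speech segments can be recognized in parallel and the speed of speech recognition is increased.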
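Brief description of the drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
FIG. 1 is a diagram of an exemplary system architecture to which an embodiment of the present disclosure may be applied;
FIG. 2 is a flowchart of one embodiment of the method for recognizing speech according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of the method for recognizing speech according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another embodiment of the method for recognizing speech according to the present disclosure;
FIG. 5 is a schematic structural diagram of one embodiment of the apparatus for recognizing speech according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.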
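Detailed description
The present disclosure is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present disclosure and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the present disclosure.
It should be noted that the embodiments of the present disclosure and the features of the embodiments may be combined with each other where no conflict arises. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary architecture 100 to which the method for recognizing speech or the apparatus for recognizing speech of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
The terminal devices 101, 102, 103 interact with the server 105 through the network 104 to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browsers, shopping applications, search applications, instant messaging tools, social platform software, text editing applications, and voice interaction applications.
The terminal devices 101, 102, 103 may be hardware or software. As hardware, they may be any electronic device that supports voice interaction, including but not limited to smartphones, tablet computers, smart speakers, laptop computers, and desktop computers. As software, they may be installed on the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, for example a back-end server that supports the speech recognition programs running on the terminal devices 101, 102, 103. The back-end server may analyze and otherwise process the acquired audio to be recognized, generate a processing result (such as recognized text), and may also feed the processing result back to the terminal device.
It should be noted that the server may be hardware or software. As hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. As software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for recognizing speech provided by embodiments of the present disclosure is generally executed by the server 105, and accordingly the apparatus for recognizing speech is generally provided in the server 105. Optionally, where computing power permits, the method may also be executed by the terminal devices 101, 102, 103, and accordingly the apparatus may also be provided in the terminal devices 101, 102, 103. In that case, the network 104 and the server 105 may be absent.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.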
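With continued reference to FIG. 2, a flow 200 of one embodiment of the method for recognizing speech according to the present disclosure is shown. The method for recognizing speech includes the following steps:
Step 201, acquire audio to be recognized.
In this embodiment, the execution body of the method for recognizing speech (for example, the server 105 shown in FIG. 1) may acquire the audio to be recognized through a wired or wireless connection. The audio to be recognized may include a speech segment. A speech segment may be, for example, audio of a person speaking or singing. As an example, the execution body may acquire pre-stored audio to be recognized locally. As another example, the execution body may acquire audio to be recognized sent by an electronic device communicatively connected to it (for example, a terminal device shown in FIG. 1).
Step 202, determine the start and end times corresponding to the speech segment included in the audio to be recognized.
In this embodiment, the execution body may determine, in various ways, the start and end times corresponding to the speech segment included in the audio to be recognized acquired in step 201. As an example, the execution body may extract audio clips from the audio to be recognized using an endpoint detection algorithm. The execution body may then extract audio features from the extracted audio clips. Next, the execution body may determine the similarity between the extracted audio features and a preset speech feature template, where the preset speech feature template is obtained by feature extraction on the speech of a large number of speakers. In response to determining that the similarity between the extracted audio features and the speech feature template is greater than a preset threshold, the execution body may determine the start and end points corresponding to the extracted audio features as the start and end times corresponding to the speech segment.
In some optional implementations of this embodiment, the execution body may determine the start and end times corresponding to the speech segment included in the audio to be recognized by the following steps:
First step, extract audio frame features of the audio to be recognized to generate a first audio frame feature.
In these implementations, the execution body may extract, in various ways, the audio frame features of the audio to be recognized acquired in step 201, thereby generating the first audio frame feature. As an example, the execution body may sample the audio to be recognized and perform feature extraction on the sampled audio frames to generate the first audio frame feature. The extracted features may include, but are not limited to, at least one of the following: Fbank features, Linear Predictive Cepstral Coefficients (LPCC), and Mel Frequency Cepstrum Coefficients (MFCC).
Second step, determine the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
In these implementations, the execution body may determine this probability in various ways. As an example, the execution body may determine the similarity between the first audio frame feature generated in the first step and a preset speech frame feature template, where the preset speech frame feature template is obtained by frame feature extraction on the speech of a large number of speakers. In response to determining that the determined similarity is greater than a preset threshold, the execution body may take the determined similarity as the probability that the audio frame corresponding to the first audio frame feature belongs to speech.
Optionally, the execution body may input the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech. The speech detection model may include various neural network models used for classification. As an example, the speech detection model may output the probability that the first audio frame feature belongs to each category (for example speech, ambient sound, or pure music).
Optionally, the speech detection model may be trained by the following steps:
S1, acquire a first training sample set.
In these implementations, the execution body that trains the speech detection model may acquire the first training sample set through a wired or wireless connection. A first training sample in the first training sample set may include a first sample audio frame feature and corresponding sample annotation information. The first sample audio frame feature may be obtained by feature extraction on first sample audio. The sample annotation information may represent the category to which the first sample audio belongs. The categories may include speech. Optionally, speech may be further divided into human speaking and human singing. The categories may also include, for example, pure music and other (for example ambient sound or animal calls).
S2, acquire an initial speech detection model for classification.
In these implementations, the execution body may acquire the initial speech detection model for classification through a wired or wireless connection. The initial speech detection model may include various neural networks used for audio feature classification, such as an RNN (Recurrent Neural Network), a BiLSTM (Bi-directional Long Short Term Memory network), or a DFSMN (Deep Feed-Forward Sequential Memory Network). As an example, the initial speech detection model may be a network with a 10-layer DFSMN structure, in which each DFSMN layer may consist of a hidden layer and a memory module. The last layer of the network may be built on the softmax function, and its number of output units may equal the number of categories.
S3, use the first sample audio frame feature in the first training sample set as the input of the initial speech detection model and the annotation information corresponding to the input first sample audio frame feature as the expected output, and train to obtain the speech detection model.
In these implementations, the execution body may use the first sample audio frame feature in the first training sample set acquired in step S1 as the input of the initial speech detection model and the annotation information corresponding to the input first sample audio frame feature as the expected output, and obtain the speech detection model by machine learning. As an example, the execution body may adjust the network parameters of the initial speech detection model using the cross entropy criterion (CE criterion), thereby obtaining the speech detection model.
Based on the above optional implementation, the execution body can use the pre-trained speech detection model to determine whether each frame is a speech frame, which improves the accuracy of speech frame identification.
Third step, generate the start and end times corresponding to the speech segment according to a comparison between the determined probability and a preset threshold.
In these implementations, the execution body may generate the start and end times corresponding to the speech segment in various ways, according to the comparison between the probability determined in the second step and the preset threshold.
As an example, the execution body may first select the probabilities greater than the preset threshold. The execution body may then determine the start and end times of the audio segment formed by the consecutive audio frames corresponding to the selected probabilities as the start and end times of the speech segment.
Based on the above optional implementation, the execution body can determine the start and end times corresponding to the speech segment from the probability that each audio frame in the audio to be recognized belongs to speech, which improves the detection precision of those start and end times.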
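The simple thresholding example in the third step can be sketched as follows. This is a minimal illustration rather than the implementation in the disclosure: the function name `segments_from_probs`, the 10 ms frame shift, and the choice to report times as integer milliseconds are all assumptions made for the example.

```python
from typing import List, Tuple


def segments_from_probs(
    probs: List[float],
    threshold: float,
    frame_shift_ms: int = 10,  # assumed frame shift; the source does not fix one
) -> List[Tuple[int, int]]:
    """Group consecutive frames whose speech probability exceeds the
    threshold and return (start_ms, end_ms) pairs for each run."""
    segments: List[Tuple[int, int]] = []
    run_start = None  # index of the first frame of the current speech run
    for index, prob in enumerate(probs):
        if prob > threshold and run_start is None:
            run_start = index
        elif prob <= threshold and run_start is not None:
            segments.append((run_start * frame_shift_ms, index * frame_shift_ms))
            run_start = None
    if run_start is not None:  # the audio ended while still inside speech
        segments.append((run_start * frame_shift_ms, len(probs) * frame_shift_ms))
    return segments
```

Each returned pair marks the start of the first frame in a run and the end of its last frame, so the pairs can be passed directly to the extraction in step 203.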
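Optionally, the execution body may generate the start and end times corresponding to the speech segment, according to the comparison between the determined probability and the preset threshold, by the following steps:
S1, select, with a preset sliding window, the probabilities corresponding to a first number of audio frames.
In these implementations, the execution body may use the preset sliding window to select the probabilities corresponding to the first number of audio frames. The width of the preset sliding window may be set in advance according to the actual application scenario, for example 10 milliseconds. The first number typically refers to the number of audio frames contained in the preset sliding window.
S2, determine a statistical value of the selected probabilities.
In these implementations, the execution body may determine the statistical value of the probabilities selected in step S1 in various ways. The statistical value may represent the overall magnitude of the selected probabilities. As an example, the statistical value may be the value obtained by a weighted sum. Optionally, the statistical value may also include, but is not limited to, at least one of the following: maximum, minimum, and median.
S3, in response to determining that the statistical value is greater than the preset threshold, generate the start and end times corresponding to the speech segment according to the audio segment formed by the first number of audio frames corresponding to the selected probabilities.
In these implementations, in response to determining that the statistical value determined in step S2 is greater than the preset threshold, the execution body may determine that the audio segment formed by the first number of audio frames corresponding to the selected probabilities is a speech segment. The execution body may then determine the endpoint times of the sliding window as the start and end times corresponding to the speech segment.
Based on the above optional implementation, the execution body can reduce the effect of "glitches" in the original speech on the accuracy of speech segment detection, which improves the accuracy of the detected start and end times and provides a data basis for the subsequent speech recognition.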
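Steps S1 to S3 can be sketched as below. The sketch makes several assumptions that the text leaves open: the statistic defaults to the arithmetic mean (an equal-weight weighted sum), the window advances one frame at a time, and overlapping windows that pass the threshold are merged into one segment. The function name `smoothed_speech_spans` is hypothetical, and spans are returned as frame indices.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def smoothed_speech_spans(
    probs: Sequence[float],
    window_size: int,                                  # the "first number" of frames
    threshold: float,
    stat: Callable[[Sequence[float]], float] = mean,   # assumed statistic
) -> List[Tuple[int, int]]:
    """Slide a window over per-frame speech probabilities and return
    merged (start_frame, end_frame) spans whose statistic exceeds the
    threshold. end_frame is exclusive."""
    spans: List[Tuple[int, int]] = []
    for start in range(0, len(probs) - window_size + 1):
        window = probs[start:start + window_size]
        if stat(window) > threshold:
            end = start + window_size
            if spans and start <= spans[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans
```

With a three-frame window, a single low-probability frame inside a run of speech frames no longer splits the run, which is the glitch suppression the paragraph above describes. Passing `stat=max`, `stat=min`, or `statistics.median` gives the other statistics the text mentions.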
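Step 203, extract at least one speech segment from the audio to be recognized according to the determined start and end times.
In this embodiment, the execution body may extract at least one speech segment from the audio to be recognized in various ways, according to the start and end times determined in step 202. The start and end times of an extracted speech segment are usually the same as the determined start and end times. Optionally, the execution body may also split or merge audio segments according to the determined start and end times, so that the length of the generated speech segments stays within a certain range.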
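The optional splitting and merging can be illustrated with the sketch below. The text states only the goal (segment lengths within a certain range), so the policy here is an assumption: a short segment is merged with the next one when the gap between them is small, and a long segment is cut into pieces of at most `max_len_ms`. All names and parameters are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms)


def normalize_segments(
    segments: List[Span], min_len_ms: int, max_len_ms: int, max_gap_ms: int
) -> List[Span]:
    """Merge short neighbouring segments, then split overlong ones."""
    merged: List[Span] = []
    for start, end in segments:
        if merged:
            prev_start, prev_end = merged[-1]
            too_short = prev_end - prev_start < min_len_ms
            close = start - prev_end <= max_gap_ms
            fits = end - prev_start <= max_len_ms
            if too_short and close and fits:
                merged[-1] = (prev_start, end)  # absorb into the previous one
                continue
        merged.append((start, end))

    result: List[Span] = []
    for start, end in merged:
        while end - start > max_len_ms:  # cut fixed-size pieces off the front
            result.append((start, start + max_len_ms))
            start += max_len_ms
        result.append((start, end))
    return result
```

Cutting at fixed offsets can split a word in two; a production system would more plausibly cut at the lowest-probability frame near the boundary, but the source does not specify this.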
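Step 204, perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
In this embodiment, the execution body may use various speech recognition techniques to recognize the at least one speech segment extracted in step 203, thereby generating the recognized text corresponding to each speech segment. The execution body may then merge the generated recognized texts of the speech segments to generate the recognized text corresponding to the audio to be recognized.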
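Because each segment is recognized independently and the results are merged afterwards, the segments can be processed in parallel. The sketch below shows one way to do this with a thread pool; `recognize_segment` is a hypothetical callable standing in for whatever recognizer is used, and representing a segment as `(start_ms, end_ms, payload)` is an assumption of the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Tuple

Segment = Tuple[int, int, Any]  # (start_ms, end_ms, audio payload)


def recognize_audio(
    segments: List[Segment],
    recognize_segment: Callable[[Segment], str],  # hypothetical recognizer
    max_workers: int = 4,
) -> str:
    """Recognize segments in parallel and join the texts in time order."""
    ordered = sorted(segments, key=lambda seg: seg[0])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the merged text
        # follows the order of the segments in the original audio.
        texts = list(pool.map(recognize_segment, ordered))
    return "".join(texts)
```

A thread pool suits recognizers that release the GIL or call a remote service; for CPU-bound pure-Python recognizers, a process pool would be the more likely choice.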
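In some optional implementations of this embodiment, the execution body may perform speech recognition on the extracted at least one speech segment and generate the recognized text corresponding to the audio to be recognized by the following steps:
First step, extract frame features of speech from the extracted at least one speech segment to generate a second audio frame feature.
In these implementations, the execution body may extract, in various ways, frame features of speech from the at least one speech segment extracted in step 203 to generate the second audio frame feature. The second audio frame feature may include, but is not limited to, at least one of the following: Fbank features, LPCC features, and MFCC features. As an example, the execution body may generate the second audio frame feature in a manner similar to that used to generate the first audio frame feature in step 202. As another example, when the first audio frame feature and the second audio frame feature have the same form, the execution body may directly select the corresponding audio frame features from the generated first audio frame feature to generate the second audio frame feature.
Second step, input the second audio frame feature into a pre-trained acoustic model to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame feature and the corresponding scores.
In these implementations, the execution body may input the second audio frame feature into the pre-trained acoustic model to obtain the second number of phoneme sequences to be matched corresponding to the second audio frame feature and the corresponding scores. The acoustic model may include various models used to determine acoustic states in speech recognition. As an example, the acoustic model may output the phonemes of the audio frames corresponding to the second audio frame feature and the corresponding probabilities. The execution body may then determine, based on the Viterbi algorithm, the second number of most probable phoneme sequences corresponding to the second audio frame feature and the corresponding scores.
Optionally, the acoustic model may be trained by the following steps:
S1, acquire a second training sample set.
In these implementations, the execution body that trains the acoustic model may acquire the second training sample set through a wired or wireless connection. A second training sample in the second training sample set may include a second sample audio frame feature and corresponding sample text. The second sample audio frame feature may be obtained by feature extraction on second sample audio. The sample text may represent the content of the second sample audio. The sample text may be a directly acquired phoneme sequence, for example "nihao". The sample text may also be a phoneme sequence converted from written text (for example "你好") using a preset dictionary.
S2, acquire an initial acoustic model.
In these implementations, the execution body may acquire the initial acoustic model through a wired or wireless connection. The initial acoustic model may include various neural networks used for acoustic state determination, such as an RNN, a BiLSTM, or a DFSMN. As an example, the initial acoustic model may be a network with a 30-layer DFSMN structure, in which each DFSMN layer may consist of a hidden layer and a memory module. The last layer of the network may be built on the softmax function, and its number of output units may equal the number of recognizable phonemes.
S3, use the second sample audio frame feature in the second training sample set as the input of the initial acoustic model and the phonemes indicated by the sample text corresponding to the input second sample audio frame feature as the expected output, and pre-train the initial acoustic model based on a first training criterion.
In these implementations, the execution body may use the second sample audio frame feature in the second training sample set acquired in step S1 as the input of the initial acoustic model and the phonemes indicated by the sample text corresponding to the input second sample audio frame feature as the expected output, and pre-train the initial acoustic model based on the first training criterion. The first training criterion may be generated based on the audio frame sequence. As an example, the first training criterion may include the CTC (Connectionist Temporal Classification) criterion.
S4, convert, with a preset window function, the phonemes indicated by the second sample text into phoneme labels for a second training criterion.
In these implementations, the execution body may use the preset window function to convert the phonemes indicated by the second sample text acquired in step S1 into phoneme labels for the second training criterion. The window function may include, but is not limited to, at least one of the following: a rectangular window and a triangular window. The second training criterion may be generated based on the audio frame, for example the CE criterion. As an example, the phonemes indicated by the second sample text may be "nihao", and the execution body may use the preset window function to convert them into "nnniihhao".
S5, use the second sample audio frame feature in the second training sample set as the input of the pre-trained initial acoustic model and the phoneme labels corresponding to the input second sample audio frame feature as the expected output, and train the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model.
In these implementations, the execution body may use the second sample audio frame feature in the second training sample set acquired in step S1 as the input of the initial acoustic model pre-trained in step S3 and the phoneme labels converted in step S4, corresponding to the input second sample audio frame feature, as the expected output, and adjust the parameters of the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model.
Based on the above optional implementation, the execution body can combine a training criterion generated along the sequence dimension (for example the CTC criterion) with a training criterion generated along the frame dimension (for example the CE criterion), which both reduces the workload of sample annotation and ensures the effectiveness of the trained model.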
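The label conversion in step S4 turns a sequence-level target into one label per frame, which is what a frame-level criterion such as CE needs. The sketch below reproduces the "nihao" to "nnniihhao" example with a rectangular window, read here as a hard assignment of each frame to exactly one phoneme. The per-phoneme frame counts are an assumption: the text does not say how they are obtained (one plausible source is the alignment produced by the CTC pre-trained model), and the function name is hypothetical.

```python
from typing import List, Sequence


def expand_phonemes(phonemes: Sequence[str], frame_counts: Sequence[int]) -> List[str]:
    """Repeat each phoneme for the number of frames assigned to it,
    yielding one label per audio frame (rectangular-window assignment)."""
    if len(phonemes) != len(frame_counts):
        raise ValueError("need exactly one frame count per phoneme")
    labels: List[str] = []
    for phoneme, count in zip(phonemes, frame_counts):
        labels.extend([phoneme] * count)
    return labels


# Assumed durations: n spans 3 frames, i and h span 2 each, a and o span 1 each.
frame_labels = expand_phonemes(list("nihao"), [3, 2, 2, 1, 1])
```

A triangular window would instead give soft labels that taper toward the phoneme boundaries; that variant is not shown here.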
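Third step, input the second number of phoneme sequences to be matched into a pre-trained language model to obtain the texts to be matched corresponding to the second number of phoneme sequences to be matched and the corresponding scores.
In these implementations, the execution body may input the second number of phoneme sequences to be matched obtained in the second step into the pre-trained language model to obtain the texts to be matched corresponding to those phoneme sequences and the corresponding scores. The language model may output the texts to be matched corresponding to each of the second number of phoneme sequences to be matched, together with their scores. A score is usually positively correlated with the probability of occurrence in a preset corpus and with grammaticality.
Fourth step, select, according to the scores corresponding to the phoneme sequences to be matched and to the texts to be matched, a text to be matched from the obtained texts to be matched as the matched text corresponding to the at least one speech segment.
In these implementations, the execution body may make this selection in various ways. As an example, the execution body may first select the phoneme sequences to be matched whose scores are greater than a first preset threshold. The execution body may then select, from the texts to be matched corresponding to the selected phoneme sequences, the text to be matched with the highest score as the matched text for the speech segment corresponding to those phoneme sequences.
Optionally, the execution body may also select a text to be matched from the obtained texts to be matched as the matched text corresponding to the at least one speech segment by the following steps:
S1, compute a weighted sum of the scores corresponding to the phoneme sequences to be matched and to the texts to be matched to generate a total score for each text to be matched.
In these implementations, the execution body may compute a weighted sum of the scores of the phoneme sequences to be matched and of the texts to be matched that correspond to the same speech segment, generating a total score for each text to be matched. As an example, the phoneme sequences to be matched "nihao" and "niao" corresponding to speech segment 001 may have scores of 82 and 60 respectively. The texts to be matched "你好" and "拟好" corresponding to the phoneme sequence "nihao" may have scores of 95 and 72 respectively. The texts to be matched "鸟" and "你啊哦" corresponding to the phoneme sequence "niao" may have scores of 67 and 55 respectively. Suppose the weights of the phoneme sequence score and the text score are 30% and 70% respectively. The execution body may then determine that the total score of "你好" is 82*30%+95*70%=91.1 and that the total score of "鸟" is 60*30%+67*70%=64.9.
S2, select, from the obtained texts to be matched, the text to be matched with the highest total score as the matched text corresponding to the at least one speech segment.
In these implementations, the execution body may select, from the texts to be matched obtained in step S1, the text to be matched with the highest total score as the matched text corresponding to the at least one speech segment.
Based on the above optional implementation, the execution body can assign different weights to the scores of the phoneme sequences to be matched and of the texts to be matched according to the actual application scenario, so as to better suit different application scenarios.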
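The weighted selection in steps S1 and S2 can be checked with a short sketch that reuses the numbers from the example (weights of 30% and 70%). The data layout and the function name `pick_best_text` are assumptions made for illustration.

```python
from typing import List, Tuple

# (text to be matched, phoneme-sequence score, language-model score)
Candidate = Tuple[str, float, float]


def pick_best_text(
    candidates: List[Candidate],
    acoustic_weight: float = 0.3,
    language_weight: float = 0.7,
) -> Tuple[str, float]:
    """Return the candidate text with the highest weighted total score,
    together with that score."""
    scored = [
        (text, acoustic_weight * acoustic + language_weight * language)
        for text, acoustic, language in candidates
    ]
    return max(scored, key=lambda item: item[1])


candidates = [
    ("你好", 82, 95),    # from phoneme sequence "nihao"
    ("拟好", 82, 72),    # from phoneme sequence "nihao"
    ("鸟", 60, 67),      # from phoneme sequence "niao"
    ("你啊哦", 60, 55),  # from phoneme sequence "niao"
]
best_text, best_total = pick_best_text(candidates)
```

Here `best_text` is "你好", with a total of about 91.1 (up to floating-point rounding), matching the worked figures in the text. Raising `acoustic_weight` favours what was heard over what is linguistically likely, which is the scenario-specific tuning the paragraph above refers to.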
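Fifth step, generate, according to the selected matched text, the recognized text corresponding to the audio to be recognized.
In these implementations, the execution body may generate the recognized text corresponding to the audio to be recognized in various ways, according to the matched text selected in the fourth step. As an example, the execution body may arrange the selected matched texts in the order in which the corresponding speech segments appear in the audio to be recognized and apply text post-processing, thereby generating the recognized text corresponding to the audio to be recognized.
Based on the above optional implementation, the execution body can generate the recognized text from two dimensions, the phoneme sequence and the language model, to improve recognition accuracy.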
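With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for recognizing speech according to an embodiment of the present disclosure. In the application scenario of FIG. 3, a user 301 records audio with a terminal device 302 as the audio to be recognized 303. A back-end server 304 acquires the audio to be recognized 303. The back-end server 304 may then determine the start and end times 305 of the speech segments included in the audio to be recognized 303. For example, the start time and end time of speech segment A may be 0"24 and 1"15 respectively. According to the determined start and end times 305 of the speech segments, the back-end server 304 may extract at least one speech segment 306 from the audio to be recognized 303. For example, the audio frames corresponding to 0"24~1"15 in the audio to be recognized 303 may be extracted as a speech segment. The back-end server 304 may then perform speech recognition on the extracted speech segments 306 to generate the recognized text 306 corresponding to the audio to be recognized 303. For example, the recognized text 306 may be "大家好,欢迎来到XX课堂" ("Hello everyone, welcome to the XX classroom"), formed by combining the recognized texts corresponding to multiple speech segments. Optionally, the back-end server 304 may also feed the generated recognized text 306 back to the terminal device 302.
At present, one existing approach usually performs speech recognition directly on the acquired audio. Because audio often contains non-speech content, feature extraction and speech recognition consume excessive resources and recognition accuracy suffers. By contrast, the method provided by the above embodiment of the present disclosure extracts speech segments from the audio to be recognized according to the determined start and end times of the speech segments, thereby decomposing the speech contained in the original audio into speech segments. In addition, the recognition results of the extracted speech segments are fused to generate the recognized text for the whole audio, so the speech segments can be recognized in parallel and the speed of speech recognition is increased.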
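With further reference to FIG. 4, a flow 400 of another embodiment of the method for recognizing speech is shown. The flow 400 of the method for recognizing speech includes the following steps:
Step 401, acquire a video file to be reviewed.
In this embodiment, the execution body of the method for recognizing speech (for example, the server 105 shown in FIG. 1) may acquire the video file to be reviewed in various ways, either locally or from a communicatively connected electronic device (for example, the terminal devices 101, 102, 103 shown in FIG. 1). The video file to be reviewed may be, for example, streaming video from a live-streaming platform or a video submitted to a short-video platform.
Step 402, extract the audio track from the video file to be reviewed to generate the audio to be recognized.
In this embodiment, the execution body may extract the audio track from the video file to be reviewed acquired in step 401 in various ways to generate the audio to be recognized. As an example, the execution body may convert the extracted audio track into an audio file in a pre-specified format and use it as the audio to be recognized.
Step 403, determine the start and end times corresponding to the speech segment included in the audio to be recognized.
Step 404, extract at least one speech segment from the audio to be recognized according to the determined start and end times.
Step 405, perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.
Steps 403, 404, and 405 are the same as steps 202, 203, and 204 in the foregoing embodiment respectively. The descriptions above of steps 202, 203, and 204 and their optional implementations also apply to steps 403, 404, and 405 and are not repeated here.
Step 406, determine whether a word in a preset word set exists in the recognized text.
In this embodiment, the execution body may determine, in various ways, whether a word in the preset word set exists in the recognized text generated in step 405. The preset word set may include a preconfigured set of sensitive words. The set of sensitive words may include, for example, advertising language and uncivil language.
In some optional implementations of this embodiment, the execution body may determine whether a word in the preset word set exists in the recognized text by the following steps:
First step, split the words in the preset word set into a third number of retrieval units.
In these implementations, the execution body may split the words in the preset word set into the third number of retrieval units. As an example, the preset word set may include the word "限时秒杀" ("limited-time flash sale"), and the execution body may use word segmentation to split "限时秒杀" into "限时" ("limited-time") and "秒杀" ("flash sale") as retrieval units.
Second step, determine, according to the number of words in the recognized text that match the retrieval units, whether a word in the preset word set exists in the recognized text.
In these implementations, the execution body may first match the recognized text generated in step 405 against the retrieval units to determine the number of matched retrieval units. The execution body may then determine, in various ways and according to that number, whether a word in the preset word set exists in the recognized text. As an example, in response to determining that the number of matched retrieval units corresponding to the same word is greater than 1, the execution body may determine that a word in the preset word set exists in the recognized text.
Optionally, the execution body may also determine that a word in the preset word set exists in the recognized text in response to determining that all retrieval units belonging to the same word in the preset word set exist in the recognized text.
Based on the above optional implementation, the execution body can perform fuzzy matching of search terms, which strengthens the review.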
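The "all retrieval units present" rule can be sketched as follows. The segmentation of each preset word into units is assumed to have been done beforehand by a word segmenter and is passed in as a dictionary; matching uses plain substring search, which is a simplification of matching against the words of the recognized text. The function name is hypothetical.

```python
from typing import Dict, List


def find_preset_words(recognized_text: str, preset_units: Dict[str, List[str]]) -> List[str]:
    """Return the preset words whose retrieval units all occur somewhere
    in the recognized text, in any order and not necessarily adjacent."""
    hits: List[str] = []
    for word, units in preset_units.items():
        if units and all(unit in recognized_text for unit in units):
            hits.append(word)
    return hits


# Assumed pre-segmented word set, following the "限时秒杀" example in the text.
preset_units = {"限时秒杀": ["限时", "秒杀"]}
matched = find_preset_words("今天秒杀活动限时开放", preset_units)
```

Because the units may appear in any order and separated by other words, the text "今天秒杀活动限时开放" matches "限时秒杀" even though the exact string never occurs, which is the fuzzy matching described above. The price is more false positives, which the downstream re-review is presumably meant to absorb.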
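In some optional implementations of this embodiment, the words in the preset word set may correspond to risk level information. The risk level information may represent different degrees of urgency, for example a priority handling level and an in-order handling level.
Step 407, in response to determining that such a word exists, send the video file to be reviewed and the recognized text to a target terminal.
In this embodiment, in response to determining that a word in the preset word set exists in the recognized text generated in step 405, the execution body may send the video file to be reviewed and the recognized text to the target terminal in various ways. As an example, the target terminal may be a terminal used to re-review the video to be reviewed, for example a manual review terminal or a terminal that performs keyword review using other review techniques. As another example, the target terminal may be the terminal that sent the video file to be reviewed, so as to prompt the user of that terminal to adjust the video file.
In some optional implementations of this embodiment, where the words in the preset word set correspond to risk level information, the execution body may send the video file to be reviewed and the recognized text to the target terminal by the following steps:
First step, in response to determining that such a word exists, determine the risk level information corresponding to the matched word.
In these implementations, in response to determining that such a word exists, the execution body may determine the risk level information corresponding to the matched word.
Second step, send the video file to be reviewed and the recognized text to a terminal matching the determined risk level information.
In these implementations, the execution body may send the video file to be reviewed and the recognized text to the terminal matching the determined risk level information. As an example, the execution body may send the video file to be reviewed and the recognized text whose risk level information indicates priority handling to a terminal used for priority handling. As another example, the execution body may place the video file to be reviewed and the recognized text whose risk level information indicates in-order handling into a review queue, and then select video files to be reviewed and recognized texts from the review queue and send them to a terminal used for re-review.
Based on the above optional implementation, the execution body can handle video files to be reviewed in tiers according to the risk level of the keywords they trigger, which improves processing efficiency and flexibility.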
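The tiered handling can be illustrated with the sketch below. The two level labels, the in-memory list standing in for the priority terminal, and the deque standing in for the review queue are all assumptions; the text does not specify how levels are encoded or how terminals are addressed. The rule that an item takes the most urgent level among its matched words is also an assumption.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

PRIORITY = "priority"  # hypothetical label: handle first
IN_ORDER = "in_order"  # hypothetical label: handle in queue order

ReviewItem = Tuple[str, str]  # (video file id, recognized text)


def route_for_review(
    item: ReviewItem,
    matched_words: List[str],
    risk_levels: Dict[str, str],
    priority_outbox: List[ReviewItem],
    review_queue: Deque[ReviewItem],
) -> str:
    """Send the item to the priority outbox if any matched word is
    priority-level; otherwise append it to the in-order review queue."""
    levels = {risk_levels.get(word, IN_ORDER) for word in matched_words}
    level = PRIORITY if PRIORITY in levels else IN_ORDER
    if level == PRIORITY:
        priority_outbox.append(item)
    else:
        review_queue.append(item)
    return level
```

A reviewer terminal would then drain `review_queue` with `popleft()` so that in-order items are handled first in, first out.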
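As can be seen from FIG. 4, the flow 400 of the method for recognizing speech in this embodiment adds the step of extracting audio from the video file to be reviewed, and the step of sending the video file to be reviewed and the recognized text to the target terminal in response to determining that a word in the preset word set exists in the recognized text corresponding to the extracted audio. The solution described in this embodiment therefore sends only the videos that hit specific words to the target terminal. When the target terminal is used to re-review video content, this can significantly reduce the number of videos entering review and effectively improve review efficiency. Moreover, because the speech in the video file is converted into recognized text for content review, the specific words that were hit can be located faster than by listening to the audio frame by frame, which both broadens the dimensions of video review and improves review efficiency.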
进一步参考图5,作为对上述各图所示方法的实现,本公开提供了用于识别语音的装置的一个实施例,该装置实施例与图2或图4所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图5所示,本实施例提供的用于识别语音的装置500包括获取单元501、第一确定单元502、提取单元503和生成单元504。其中,获取单元501,被配置成获取待识别音频,其中,待识别音频中包括语音片段;第一确定单元502,被配置成确定待识别音频中包括的语音片段对应的起止时刻;提取单元503,被配置成根据所确定的起止时刻,从待识别音频中提取至少一个语音片段;生成单元504,被配置成对所提取的至少一个语音片段进行语音识别,生成待识别音频对应的识别文本。
在本实施例中,用于识别语音的装置500中:获取单元501、第一确定单元502、提取单元503和生成单元504的具体处理及其所带来的技术效果可分别参考图2对应实施例中的步骤201、步骤202、步骤203和步骤204的相关说明,在此不再赘述。
在本实施例的一些可选的实现方式中,上述第一确定单元502可以包括第一确定子单元(图中未示出)、第一生成子单元(图中未示出)。其中,上述第一确定子单元可以被配置成确定第一音频帧特征对应的音频帧属于语音的概率。上述第一生成子单元可以被配置成根据所确定的概率与预设阈值的比较,生成语音片段对应的起止时刻。
在本实施例的一些可选的实现方式中,上述第一确定子单元可以进一步被配置成:将第一音频帧特征输入至预先训练的语音检测模型,生成第一音频帧特征对应的音频帧属于语音的概率。
在本实施例的一些可选的实现方式中,上述语音检测模型可以通过以下步骤训练得到:获取第一训练样本集合;获取用于分类的初始语音检测模型;将第一训练样本集合中的第一样本音频帧特征作为初始语音检测模型的输入,将与输入的第一样本音频帧特征对应的标注信息作为期望输出,训练得到语音检测模型,其中,第一训练样本集合中的第一训练样本包括第一样本音频帧特征和对应的样本标注信息,第一样本音频帧特征基于对第一样本音频的特征提取得到,样本标注信息用于表征第一样本音频所属的类别,类别包括语音。
在本实施例的一些可选的实现方式中,上述第一生成子单元可以包括第一选取模块(图中未示出)、确定模块(图中未示出)、第一生成模块(图中未示出)。其中,上述第一选取模块可以被配置成利用预设滑动窗选取第一数目个音频帧对应的概率。上述确定模块可以被配置成确定所选取的概率的统计值。上述第一生成模块可以被配置成响应于确定统计值大于上述预设阈值,根据所选取的概率对应的第一数目个音频帧所组成的音频片段,生成语音片段对应的起止时刻。
在本实施例的一些可选的实现方式中,上述生成单元504可以包括第二生成子单元(图中未示出)、第三生成子单元(图中未示出)、第四生成子单元(图中未示出)、选取子单元(图中未示出)、第五生成子单元(图中未示出)。其中,上述第二生成子单元可以被配置成对所提取的至少一个语音片段提取语音的帧特征,生成第二音频帧特征。上述第三生成子单元可以被配置成将第二音频帧特征输入至预先训练的声学模型,得到与第二音频帧特征对应的第二数目个待匹配音素序列以及对应的得分。上述第四生成子单元可以被配置成将第二数目个待匹配音素序列输入至预先训练的语言模型,得到第二数目个待匹配音素序列对应的待匹配文本以及对应的得分。上述选取子单元可以被配置成根据所得到的待匹配音素序列和待匹配文本分别对应的得分,从所得到的待匹配文本中选取待匹配文本作为与至少一个语音片段对应的匹配文本。上述第五生成子单元可以被配置成根据所选取的匹配文本,生成待识别音频对应的识别文本。
在本实施例的一些可选的实现方式中,上述声学模型可以通过以下步骤训练得到:获取第二训练样本集合;获取初始声学模型;将第二训练样本集合中的第二样本音频帧特征作为初始声学模型的输入,将与输入的第二样本音频帧特征对应的样本文本所指示的音素作为期望输出,基于第一训练准则对初始声学模型进行预训练;利用预设的窗函数,将第二样本文本所指示的音素转换为用于第二训练准则的音素标签;将第二训练样本集合中的第二样本音频帧特征作为预训练后的初始声学模型的输入,将与输入的第二样本音频帧特征对应的音素标签作为期望输出,利用第二训练准则对预训练后的初始声学模型进行训练,得到声学模型,其中,第二训练样本集合中的第二训练样本包括第二样本音频帧特征和对应的样本文本,第二样本音频帧特征基于对第二样本音频的特征提取得到,样本文本用于表征第二样本音频的内容,第一训练准则基于音频帧序列生成,第二训练准则基于音频帧生成。
在本实施例的一些可选的实现方式中,上述选取子单元可以包括第二生成模块(图中未示出)、第二选取模块(图中未示出)。其中,上述第二生成模块可以被配置成对所得到的待匹配音素序列和待匹配文本分别对应的得分进行加权求和,生成各待匹配文本对应的总得分。上述第二选取模块可以被配置成从所得到的待匹配文本中选取总得分最高的待匹配文本作为与至少一个语音片段对应的匹配文本。
在本实施例的一些可选的实现方式中,上述获取单元501可以包括获取子单元(图中未示出)、第六生成子单元(图中未示出)。其中,上述获取子单元可以被配置成获取待审核视频文件。上述第六生成子单元可以被配置成从待审核视频文件中提取音轨,生成待识别音频。上述用于识别语音的装置还可以包括:第二确定单元(图中未示出)、发送单元(图中未示出)。其中,上述第二确定单元可以被配置成确定识别文本中是否存在预设词集中的词。上述发送单元可以被配置成响应于确定存在,将待审核视频文件和识别文本发送至目标终端。
在本实施例的一些可选的实现方式中,上述第二确定单元可以包括拆分子单元(图中未示出)、第二确定子单元(图中未示出)。其中,上述拆分子单元可以被配置成将预设词集中的词拆分成第三数目个检索单元。上述第二确定子单元可以被配置成根据识别文本中的词与检索单元相匹配的数目,确定识别文本中是否存在预设词集中的词。
在本实施例的一些可选的实现方式中,上述第二确定子单元502可以进一步被配置成响应于确定识别文本中存在属于预设词集中的同一词的所有检索单元,确定识别文本中存在预设词集中的词。
在本实施例的一些可选的实现方式中,上述预设词集中的词可以对应有风险级别信息。上述发送单元可以包括第三确定子单元(图中未示出)、发送子单元(图中未示出)。其中,上述第三确定子单元可以被配置成响应于确定存在,确定匹配的词对应的风险级别信息。上述发送子单元可以被配置成将待审核视频文件和识别文本发送至与所确定的风险级别信息匹配的终端。
本公开的上述实施例提供的装置,通过提取单元503根据第一确定单元502所确定的语音片段对应的起止时刻从待识别音频中提取语音片段,实现了语音从原始音频中的分离。而且,还通过生成单元504将提取单元503所提取的各语音片段的识别结果融合生成整个音频对应的识别文本,从而可以对各语音片段进行并行识别,提升语音识别的速度。
下面参考图6,其示出了适于用来实现本公开实施例的电子设备(例如图1中的服务器)600的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图6示出的服务器仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图6所示,电子设备600可以包括处理装置(例如中央处理器、图形处理器等)601,其可以根据存储在只读存储器(ROM)602中的程序或者从存储装置608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储有电子设备600操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605 也连接至总线604。
通常,以下装置可以连接至I/O接口605:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606;包括例如液晶显示器(LCD,Liquid Crystal Display)、扬声器、振动器等的输出装置607;包括例如磁带、硬盘等的存储装置608;以及通信装置609。通信装置609可以允许电子设备600与其他设备进行无线或有线通信以交换数据。虽然图6示出了具有各种装置的电子设备600,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。图6中示出的每个方框可以代表一个装置,也可以根据需要代表多个装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置609从网络上被下载和安装,或者从存储装置608被安装,或者从ROM 602被安装。在该计算机程序被处理装置601执行时,执行本公开的实施例的方法中限定的上述功能。
需要说明的是,本公开的实施例所述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开的实施例中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开的实施例中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(Radio Frequency,射频)等等,或者上述的任意合适的组合。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取待识别音频,其中,待识别音频中包括语音片段;确定待识别音频中包括的语音片段对应的起 止时刻;根据所确定的起止时刻,从待识别音频中提取至少一个语音片段;对所提取的至少一个语音片段进行语音识别,生成待识别音频对应的识别文本。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开实施例的操作的计算机程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言、Python或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开的各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。所描述的单元也可以设置在处理器中,例如,可以描述为:一种处理器,包括获取单元、第一确定单元、提取单元、生成单元。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定,例如,获取单元还可以被描述为“获取待识别音频的单元,其中,待识别音频中包括语音片段”。
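According to one or more embodiments of the present disclosure, the present disclosure provides a method for recognizing speech, the method including: acquiring audio to be recognized, where the audio to be recognized includes speech segments; determining the start and end times corresponding to the speech segments included in the audio to be recognized; extracting at least one speech segment from the audio to be recognized according to the determined start and end times; and performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.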
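The extraction step can be illustrated with a minimal sketch. It assumes the audio is already decoded into a flat list of samples and that the start and end times are given in seconds; the function name `extract_segments` and the toy 8 Hz sample rate are hypothetical and not part of the disclosure.

```python
def extract_segments(samples, sample_rate, intervals):
    """Cut one slice of samples per (start_s, end_s) interval given in seconds.

    Hypothetical helper: the disclosure only says segments are extracted
    according to the determined start and end times.
    """
    segments = []
    for start_s, end_s in intervals:
        # Convert times to sample indices and clamp them to the audio bounds.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(len(samples), int(round(end_s * sample_rate)))
        if end > start:
            segments.append(samples[start:end])
    return segments


# Toy input: 16 fake samples at 8 Hz, i.e. two seconds of "audio".
audio = list(range(16))
segments = extract_segments(audio, 8, [(0.25, 0.75), (1.5, 2.5)])
```

Only the extracted slices would then be passed to the recognizer, so non-speech stretches are never decoded.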
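According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, determining the start and end times corresponding to the speech segments included in the audio to be recognized includes: extracting audio frame features of the audio to be recognized to generate first audio frame features; determining the probability that the audio frame corresponding to each first audio frame feature belongs to speech; and generating the start and end times corresponding to the speech segments according to a comparison of the determined probabilities with a preset threshold.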
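According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, determining the probability that the audio frame corresponding to the first audio frame feature belongs to speech includes: inputting the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.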
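According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, the speech detection model is trained through the following steps: acquiring a first training sample set, where each first training sample in the set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction on first sample audio, the sample annotation information represents the category to which the first sample audio belongs, and the categories include speech; acquiring an initial speech detection model for classification; and taking the first sample audio frame features in the first training sample set as input to the initial speech detection model and the annotation information corresponding to the input features as the expected output, and training to obtain the speech detection model.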
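According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, generating the start and end times corresponding to the speech segments according to the comparison of the determined probabilities with the preset threshold includes: selecting, with a preset sliding window, the probabilities corresponding to a first number of audio frames; determining a statistical value of the selected probabilities; and, in response to determining that the statistical value is greater than the preset threshold, generating the start and end times corresponding to a speech segment from the audio segment formed by the first number of audio frames corresponding to the selected probabilities.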
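The sliding-window decision above can be sketched as follows. This is an illustrative reading, not the disclosed implementation: it assumes the statistical value is the mean, that overlapping or adjacent qualifying windows are merged into one segment, and that each frame lasts 10 ms. The name `speech_intervals` and all default parameters are hypothetical.

```python
def speech_intervals(probs, window=3, threshold=0.5, frame_ms=10):
    """Return merged (start_ms, end_ms) spans whose windowed mean exceeds threshold.

    probs holds one speech probability per audio frame. The mean as the
    statistic and the merging of overlapping windows are assumptions.
    """
    spans = []  # merged [first_frame, end_frame) spans, end exclusive
    for i in range(len(probs) - window + 1):
        mean = sum(probs[i:i + window]) / window  # statistical value of the window
        if mean > threshold:
            if spans and i <= spans[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                spans[-1][1] = i + window
            else:
                spans.append([i, i + window])
    # Convert frame indices to integer milliseconds.
    return [(a * frame_ms, b * frame_ms) for a, b in spans]


frame_probs = [0.1, 0.2, 0.9, 0.9, 0.8, 0.1, 0.1, 0.1]
intervals = speech_intervals(frame_probs)
```

Averaging over a window rather than thresholding each frame smooths out isolated low-probability frames inside a spoken stretch, at the cost of slightly widening the detected boundaries.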
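According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, performing speech recognition on the extracted at least one speech segment to generate the recognized text corresponding to the audio to be recognized includes: extracting frame features of the speech from the extracted at least one speech segment to generate second audio frame features; inputting the second audio frame features into a pre-trained acoustic model to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame features, together with their scores; inputting the second number of phoneme sequences to be matched into a pre-trained language model to obtain the texts to be matched corresponding to the second number of phoneme sequences, together with their scores; selecting, according to the scores of the obtained phoneme sequences to be matched and of the texts to be matched, one of the texts to be matched as the matching text corresponding to the at least one speech segment; and generating the recognized text corresponding to the audio to be recognized according to the selected matching text.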
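According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, the acoustic model is trained through the following steps: acquiring a second training sample set, where each second training sample in the set includes a second sample audio frame feature and corresponding sample text, the second sample audio frame feature is obtained by feature extraction on second sample audio, and the sample text represents the content of the second sample audio; acquiring an initial acoustic model; taking the second sample audio frame features in the second training sample set as input to the initial acoustic model and the phonemes indicated by the sample text corresponding to the input features as the expected output, and pre-training the initial acoustic model based on a first training criterion, where the first training criterion is generated based on audio frame sequences; using a preset window function to convert the phonemes indicated by the sample text into phoneme labels for a second training criterion, where the second training criterion is generated based on audio frames; and taking the second sample audio frame features in the second training sample set as input to the pre-trained initial acoustic model and the phoneme labels corresponding to the input features as the expected output, and training the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model.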
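According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, selecting one of the texts to be matched as the matching text corresponding to the at least one speech segment according to the scores of the obtained phoneme sequences to be matched and of the texts to be matched includes: computing a weighted sum of the score of each phoneme sequence to be matched and the score of its text to be matched to generate a total score for each text to be matched; and selecting, from the obtained texts to be matched, the text with the highest total score as the matching text corresponding to the at least one speech segment.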
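A minimal sketch of the weighted-sum selection follows. The weights, the log-probability-style scores, and the name `pick_best` are assumptions chosen for illustration; the disclosure specifies only that the two scores are weighted, summed, and the highest total wins.

```python
def pick_best(candidates, acoustic_weight=1.0, lm_weight=0.5):
    """Pick the text whose weighted total score is highest.

    candidates: list of (text, acoustic_score, lm_score), higher is better.
    The weights are hypothetical tuning parameters.
    """
    def total(candidate):
        _, acoustic_score, lm_score = candidate
        return acoustic_weight * acoustic_score + lm_weight * lm_score

    return max(candidates, key=total)[0]


# Toy candidates with made-up scores.
candidates = [
    ("recognize speech", -12.0, -4.0),      # total: -12.0 + 0.5 * -4.0 = -14.0
    ("wreck a nice beach", -11.5, -9.0),    # total: -11.5 + 0.5 * -9.0 = -16.0
]
best = pick_best(candidates)
```

In this toy case the second candidate has the better acoustic score, but the language model score outweighs it, which is the point of combining the two.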
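According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, acquiring the audio to be recognized includes: acquiring a video file to be reviewed; and extracting the audio track from the video file to be reviewed to generate the audio to be recognized. The method further includes: determining whether a word from a preset word set is present in the recognized text; and, in response to determining that such a word is present, sending the video file to be reviewed and the recognized text to a target terminal.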
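According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, determining whether a word from the preset word set is present in the recognized text includes: splitting each word in the preset word set into a third number of retrieval units; and determining whether a word from the preset word set is present in the recognized text according to the number of matches between the words in the recognized text and the retrieval units.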
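According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, determining whether a word from the preset word set is present in the recognized text according to the number of matches between the words in the recognized text and the retrieval units includes: in response to determining that all retrieval units belonging to the same word in the preset word set are present in the recognized text, determining that a word from the preset word set is present in the recognized text.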
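The retrieval-unit check can be sketched as below. The disclosure does not say how a word is split, so this sketch assumes whitespace-separated tokens as retrieval units and a case-insensitive match; `split_units` and `find_preset_words` are hypothetical names.

```python
def split_units(word):
    """Split a preset word or phrase into retrieval units (assumed: whitespace tokens)."""
    return word.lower().split()


def find_preset_words(recognized_text, preset_words):
    """Return the preset words whose retrieval units ALL occur in the recognized text."""
    tokens = set(recognized_text.lower().split())
    hits = []
    for word in preset_words:
        units = split_units(word)
        # A word counts as present only when every one of its units is found.
        if units and all(unit in tokens for unit in units):
            hits.append(word)
    return hits


hits = find_preset_words(
    "click the link for a free prize today",
    ["free prize", "wire transfer"],
)
```

Because the units are looked up independently, a preset phrase is still detected when the recognizer inserts other words between its parts, which makes the check more tolerant of recognition noise than an exact substring search.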
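According to one or more embodiments of the present disclosure, in the method for recognizing speech provided by the present disclosure, the words in the preset word set have corresponding risk level information; and sending the video file to be reviewed and the recognized text to the target terminal in response to determining that such a word is present includes: in response to determining that such a word is present, determining the risk level information corresponding to the matched word; and sending the video file to be reviewed and the recognized text to the terminal that matches the determined risk level information.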
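Routing by risk level might look like the following sketch. The word list, the two risk levels, the terminal identifiers, and the rule of taking the highest level when several words match are all assumptions made for illustration; the disclosure states only that the file and text go to the terminal matching the determined risk level.

```python
# Hypothetical configuration: preset words, their risk levels, and terminals.
RISK_LEVELS = {"free prize": "low", "wire transfer": "high"}
TERMINALS = {"low": "reviewer-queue-a", "high": "reviewer-queue-b"}
RANK = {"low": 0, "high": 1}


def route(matched_words):
    """Return the terminal for the highest risk level among the matched words.

    Expects at least one matched word, since routing only happens after a
    preset word was found in the recognized text.
    """
    level = max((RISK_LEVELS[word] for word in matched_words), key=RANK.__getitem__)
    return TERMINALS[level]


target = route(["free prize", "wire transfer"])
```

A caller would then send the video file and the recognized text to `target`; how that transfer happens is outside the scope of this sketch.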
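According to one or more embodiments of the present disclosure, the present disclosure provides an apparatus for recognizing speech, the apparatus including: an acquiring unit configured to acquire audio to be recognized, where the audio to be recognized includes speech segments; a first determining unit configured to determine the start and end times corresponding to the speech segments included in the audio to be recognized; an extracting unit configured to extract at least one speech segment from the audio to be recognized according to the determined start and end times; and a generating unit configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.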
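According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the first determining unit includes: a first determining subunit configured to determine the probability that the audio frame corresponding to the first audio frame feature belongs to speech; and a first generating subunit configured to generate the start and end times corresponding to the speech segments according to a comparison of the determined probabilities with a preset threshold.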
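According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the first determining subunit is further configured to: input the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.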
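According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the speech detection model is trained through the following steps: acquiring a first training sample set; acquiring an initial speech detection model for classification; and taking the first sample audio frame features in the first training sample set as input to the initial speech detection model and the annotation information corresponding to the input features as the expected output, and training to obtain the speech detection model. Here, each first training sample in the first training sample set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction on first sample audio, the sample annotation information represents the category to which the first sample audio belongs, and the categories include speech.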
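According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the first generating subunit includes: a first selecting module configured to select, with a preset sliding window, the probabilities corresponding to a first number of audio frames; a determining module configured to determine a statistical value of the selected probabilities; and a first generating module configured to, in response to determining that the statistical value is greater than the preset threshold, generate the start and end times corresponding to a speech segment from the audio segment formed by the first number of audio frames corresponding to the selected probabilities.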
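According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the generating unit includes: a second generating subunit configured to extract frame features of the speech from the extracted at least one speech segment to generate second audio frame features; a third generating subunit configured to input the second audio frame features into a pre-trained acoustic model to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame features, together with their scores; a fourth generating subunit configured to input the second number of phoneme sequences to be matched into a pre-trained language model to obtain the texts to be matched corresponding to the second number of phoneme sequences, together with their scores; a selecting subunit configured to select, according to the scores of the obtained phoneme sequences to be matched and of the texts to be matched, one of the texts to be matched as the matching text corresponding to the at least one speech segment; and a fifth generating subunit configured to generate the recognized text corresponding to the audio to be recognized according to the selected matching text.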
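According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the acoustic model is trained through the following steps: acquiring a second training sample set; acquiring an initial acoustic model; taking the second sample audio frame features in the second training sample set as input to the initial acoustic model and the phonemes indicated by the sample text corresponding to the input features as the expected output, and pre-training the initial acoustic model based on a first training criterion; using a preset window function to convert the phonemes indicated by the sample text into phoneme labels for a second training criterion; and taking the second sample audio frame features in the second training sample set as input to the pre-trained initial acoustic model and the phoneme labels corresponding to the input features as the expected output, and training the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model. Here, each second training sample in the second training sample set includes a second sample audio frame feature and corresponding sample text, the second sample audio frame feature is obtained by feature extraction on second sample audio, the sample text represents the content of the second sample audio, the first training criterion is generated based on audio frame sequences, and the second training criterion is generated based on audio frames.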
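According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the selecting subunit includes: a second generating module configured to compute a weighted sum of the score of each phoneme sequence to be matched and the score of its text to be matched to generate a total score for each text to be matched; and a second selecting module configured to select, from the obtained texts to be matched, the text with the highest total score as the matching text corresponding to the at least one speech segment.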
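According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the acquiring unit includes: an acquiring subunit configured to acquire a video file to be reviewed; and a sixth generating subunit configured to extract the audio track from the video file to be reviewed to generate the audio to be recognized. The apparatus for recognizing speech further includes: a second determining unit configured to determine whether a word from a preset word set is present in the recognized text; and a sending unit configured to, in response to determining that such a word is present, send the video file to be reviewed and the recognized text to a target terminal.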
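According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the second determining unit includes: a splitting subunit configured to split each word in the preset word set into a third number of retrieval units; and a second determining subunit configured to determine whether a word from the preset word set is present in the recognized text according to the number of matches between the words in the recognized text and the retrieval units.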
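According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the second determining subunit is further configured to, in response to determining that all retrieval units belonging to the same word in the preset word set are present in the recognized text, determine that a word from the preset word set is present in the recognized text.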
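According to one or more embodiments of the present disclosure, in the apparatus for recognizing speech provided by the present disclosure, the words in the preset word set have corresponding risk level information; and the sending unit includes: a third determining subunit configured to, in response to determining that such a word is present, determine the risk level information corresponding to the matched word; and a sending subunit configured to send the video file to be reviewed and the recognized text to the terminal that matches the determined risk level information.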
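According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, the electronic device including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the first aspect.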
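According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the above method for recognizing speech.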
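The above description covers only preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the concept of the present disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.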

Claims (15)

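  1. A method for recognizing speech, comprising:
    acquiring audio to be recognized, wherein the audio to be recognized includes speech segments;
    determining start and end times corresponding to the speech segments included in the audio to be recognized;
    extracting at least one speech segment from the audio to be recognized according to the determined start and end times;
    performing speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.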
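  2. The method according to claim 1, wherein determining the start and end times corresponding to the speech segments included in the audio to be recognized comprises:
    extracting audio frame features of the audio to be recognized to generate first audio frame features;
    determining a probability that the audio frame corresponding to the first audio frame feature belongs to speech;
    generating the start and end times corresponding to the speech segments according to a comparison of the determined probability with a preset threshold.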
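  3. The method according to claim 2, wherein determining the probability that the audio frame corresponding to the first audio frame feature belongs to speech comprises:
    inputting the first audio frame feature into a pre-trained speech detection model to generate the probability that the audio frame corresponding to the first audio frame feature belongs to speech.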
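  4. The method according to claim 3, wherein the speech detection model is trained through the following steps:
    acquiring a first training sample set, wherein a first training sample in the first training sample set includes a first sample audio frame feature and corresponding sample annotation information, the first sample audio frame feature is obtained by feature extraction on first sample audio, the sample annotation information represents the category to which the first sample audio belongs, and the categories include speech;
    acquiring an initial speech detection model for classification;
    taking the first sample audio frame features in the first training sample set as input to the initial speech detection model and the annotation information corresponding to the input first sample audio frame features as the expected output, and training to obtain the speech detection model.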
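  5. The method according to claim 2, wherein generating the start and end times corresponding to the speech segments according to the comparison of the determined probability with the preset threshold comprises:
    selecting, with a preset sliding window, the probabilities corresponding to a first number of audio frames;
    determining a statistical value of the selected probabilities;
    in response to determining that the statistical value is greater than the preset threshold, generating the start and end times corresponding to a speech segment from the audio segment formed by the first number of audio frames corresponding to the selected probabilities.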
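  6. The method according to claim 1, wherein performing speech recognition on the extracted at least one speech segment to generate the recognized text corresponding to the audio to be recognized comprises:
    extracting frame features of the speech from the extracted at least one speech segment to generate second audio frame features;
    inputting the second audio frame features into a pre-trained acoustic model to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame features, together with their scores;
    inputting the second number of phoneme sequences to be matched into a pre-trained language model to obtain texts to be matched corresponding to the second number of phoneme sequences to be matched, together with their scores;
    selecting, according to the scores of the obtained phoneme sequences to be matched and of the texts to be matched, one of the obtained texts to be matched as the matching text corresponding to the at least one speech segment;
    generating the recognized text corresponding to the audio to be recognized according to the selected matching text.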
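  7. The method according to claim 6, wherein the acoustic model is trained through the following steps:
    acquiring a second training sample set, wherein a second training sample in the second training sample set includes a second sample audio frame feature and corresponding sample text, the second sample audio frame feature is obtained by feature extraction on second sample audio, and the sample text represents the content of the second sample audio;
    acquiring an initial acoustic model;
    taking the second sample audio frame features in the second training sample set as input to the initial acoustic model and the phonemes indicated by the sample text corresponding to the input second sample audio frame features as the expected output, and pre-training the initial acoustic model based on a first training criterion, wherein the first training criterion is generated based on audio frame sequences;
    using a preset window function to convert the phonemes indicated by the sample text into phoneme labels for a second training criterion, wherein the second training criterion is generated based on audio frames;
    taking the second sample audio frame features in the second training sample set as input to the pre-trained initial acoustic model and the phoneme labels corresponding to the input second sample audio frame features as the expected output, and training the pre-trained initial acoustic model with the second training criterion to obtain the acoustic model.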
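  8. The method according to claim 6, wherein selecting one of the obtained texts to be matched as the matching text corresponding to the at least one speech segment according to the scores of the obtained phoneme sequences to be matched and of the texts to be matched comprises:
    computing a weighted sum of the scores of the obtained phoneme sequences to be matched and of the texts to be matched to generate a total score for each text to be matched;
    selecting, from the obtained texts to be matched, the text to be matched with the highest total score as the matching text corresponding to the at least one speech segment.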
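  9. The method according to any one of claims 1-8, wherein acquiring the audio to be recognized comprises:
    acquiring a video file to be reviewed;
    extracting an audio track from the video file to be reviewed to generate the audio to be recognized; and
    the method further comprises:
    determining whether a word from a preset word set is present in the recognized text;
    in response to determining that such a word is present, sending the video file to be reviewed and the recognized text to a target terminal.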
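  10. The method according to claim 9, wherein determining whether a word from the preset word set is present in the recognized text comprises:
    splitting a word in the preset word set into a third number of retrieval units;
    determining whether a word from the preset word set is present in the recognized text according to the number of matches between the words in the recognized text and the retrieval units.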
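  11. The method according to claim 10, wherein determining whether a word from the preset word set is present in the recognized text according to the number of matches between the words in the recognized text and the retrieval units comprises:
    in response to determining that all retrieval units belonging to the same word in the preset word set are present in the recognized text, determining that a word from the preset word set is present in the recognized text.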
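  12. The method according to claim 9, wherein the words in the preset word set have corresponding risk level information; and
    sending the video file to be reviewed and the recognized text to the target terminal in response to determining that such a word is present comprises:
    in response to determining that such a word is present, determining the risk level information corresponding to the matched word;
    sending the video file to be reviewed and the recognized text to a terminal that matches the determined risk level information.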
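  13. An apparatus for recognizing speech, comprising:
    an acquiring unit configured to acquire audio to be recognized, wherein the audio to be recognized includes speech segments;
    a first determining unit configured to determine start and end times corresponding to the speech segments included in the audio to be recognized;
    an extracting unit configured to extract at least one speech segment from the audio to be recognized according to the determined start and end times;
    a generating unit configured to perform speech recognition on the extracted at least one speech segment to generate recognized text corresponding to the audio to be recognized.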
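  14. An electronic device, comprising:
    one or more processors;
    a storage device on which one or more programs are stored;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-12.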
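  15. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-12.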
PCT/CN2021/131694 2020-11-20 2021-11-19 Method and apparatus for recognizing voice, electronic device and medium WO2022105861A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/037,546 US20240021202A1 (en) 2020-11-20 2021-11-19 Method and apparatus for recognizing voice, electronic device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011314072.5A CN112530408A (zh) 2020-11-20 2020-11-20 Method and apparatus for recognizing voice, electronic device and medium
CN202011314072.5 2020-11-20

Publications (1)

Publication Number Publication Date
WO2022105861A1 true WO2022105861A1 (zh) 2022-05-27

Family

ID=74982098

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131694 WO2022105861A1 (zh) 2020-11-20 2021-11-19 Method and apparatus for recognizing voice, electronic device and medium

Country Status (3)

Country Link
US (1) US20240021202A1 (zh)
CN (1) CN112530408A (zh)
WO (1) WO2022105861A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160820B (zh) * 2021-04-28 2024-02-27 百度在线网络技术(北京)有限公司 语音识别的方法、语音识别模型的训练方法、装置及设备
CN113053363B (zh) * 2021-05-12 2024-03-01 京东科技控股股份有限公司 语音识别方法、语音识别装置和计算机可读存储介质
CN113824986B (zh) * 2021-09-18 2024-03-29 北京云上曲率科技有限公司 基于上下文直播音频审核方法、装置、存储介质及设备
CN114220436A (zh) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 语音处理方法、装置、计算机设备和存储介质
CN114255754A (zh) * 2021-12-27 2022-03-29 贝壳找房网(北京)信息技术有限公司 语音识别方法、电子设备、程序产品和存储介质
CN114898271B (zh) * 2022-05-26 2024-10-22 中国平安人寿保险股份有限公司 视频内容监控方法、装置、设备及介质
CN115209188B (zh) * 2022-09-07 2023-01-20 北京达佳互联信息技术有限公司 多帐号同时直播的检测方法、装置、服务器及存储介质
CN115512692B (zh) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质
CN117033308B (zh) * 2023-08-28 2024-03-26 中国电子科技集团公司第十五研究所 一种基于特定范围的多模态检索方法及装置
CN118072901B (zh) * 2024-04-18 2024-07-19 中国人民解放军海军青岛特勤疗养中心 一种基于语音识别的门诊电子病历生成方法及系统
CN118296186B (zh) * 2024-06-05 2024-10-11 上海蜜度科技股份有限公司 视频广告检测方法、系统、存储介质及电子设备
CN118298852B (zh) * 2024-06-06 2024-09-10 中国科学院自动化研究所 一种基于高频特征的区域生成音频检测与定位方法及装置

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308653A (zh) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 一种应用于语音识别系统的端点检测方法
US20120179465A1 (en) * 2011-01-10 2012-07-12 International Business Machines Corporation Real time generation of audio content summaries
CN103165130A (zh) * 2013-02-06 2013-06-19 湘潭安道致胜信息科技有限公司 语音文本匹配云系统
WO2014069443A1 (ja) * 2012-10-31 2014-05-08 日本電気株式会社 不満通話判定装置及び不満通話判定方法
CN105654947A (zh) * 2015-12-30 2016-06-08 中国科学院自动化研究所 一种获取交通广播语音中路况信息的方法及系统
US20170110146A1 (en) * 2014-09-17 2017-04-20 Kabushiki Kaisha Toshiba Voice segment detection system, voice starting end detection apparatus, and voice terminal end detection apparatus
CN107452401A (zh) * 2017-05-27 2017-12-08 北京字节跳动网络技术有限公司 一种广告语音识别方法及装置
JP2018072697A (ja) * 2016-11-02 2018-05-10 日本電信電話株式会社 音素崩れ検出モデル学習装置、音素崩れ区間検出装置、音素崩れ検出モデル学習方法、音素崩れ区間検出方法、プログラム
CN109584896A (zh) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 一种语音芯片及电子设备
CN110335612A (zh) * 2019-07-11 2019-10-15 招商局金融科技有限公司 基于语音识别的会议记录生成方法、装置及存储介质
CN111050201A (zh) * 2019-12-10 2020-04-21 Oppo广东移动通信有限公司 数据处理方法、装置、电子设备及存储介质
CN111476615A (zh) * 2020-05-27 2020-07-31 杨登梅 一种基于语音识别的产品需求确定方法

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9728185B2 (en) * 2014-05-22 2017-08-08 Google Inc. Recognizing speech using neural networks
US10453434B1 (en) * 2017-05-16 2019-10-22 John William Byrd System for synthesizing sounds from prototypes
CN108346428B (zh) * 2017-09-13 2020-10-02 腾讯科技(深圳)有限公司 语音活动检测及其模型建立方法、装置、设备及存储介质
CN108124191B (zh) * 2017-12-22 2019-07-12 北京百度网讯科技有限公司 一种视频审核方法、装置及服务器
CN108564941B (zh) * 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质
CN109473123B (zh) * 2018-12-05 2022-05-31 百度在线网络技术(北京)有限公司 语音活动检测方法及装置
CN111462735B (zh) * 2020-04-10 2023-11-28 杭州网易智企科技有限公司 语音检测方法、装置、电子设备及存储介质
CN111883139A (zh) * 2020-07-24 2020-11-03 北京字节跳动网络技术有限公司 用于筛选目标语音的方法、装置、设备和介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052683A (zh) * 2023-03-31 2023-05-02 中科雨辰科技有限公司 一种平板电脑上离线语音录入的数据采集方法
CN116052683B (zh) * 2023-03-31 2023-06-13 中科雨辰科技有限公司 一种平板电脑上离线语音录入的数据采集方法
CN116153294A (zh) * 2023-04-14 2023-05-23 京东科技信息技术有限公司 语音识别方法、装置、系统、设备及介质
CN116153294B (zh) * 2023-04-14 2023-08-08 京东科技信息技术有限公司 语音识别方法、装置、系统、设备及介质

Also Published As

Publication number Publication date
US20240021202A1 (en) 2024-01-18
CN112530408A (zh) 2021-03-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21894008

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18037546

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21894008

Country of ref document: EP

Kind code of ref document: A1