CN108510981B - Method and system for acquiring voice data - Google Patents
- Publication number
- CN108510981B (application CN201810324045.2A)
- Authority
- CN
- China
- Prior art keywords
- voice data
- voice
- recognition model
- application object
- user
- Prior art date
- 2018-04-12
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/12—Fingerprints or palmprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/34—Microprocessors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/36—Memories
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/38—Displays
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2250/00—Details of telephonic subscriber devices
- H04M2250/12—Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephone Function (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present invention provides a method and system for acquiring voice data, including: when a user makes a voice call, saving the voice data streams transmitted in real time within the intelligent terminal system, saving the microphone's input voice data stream as first voice data and the earpiece's output voice data stream as second voice data; detecting whether the first voice data and the second voice data meet the training requirements of a speech recognition model; if so, further judging whether the first voice data comes from the application object of the speech recognition model; if so, marking the first voice data as application-object voice data and the second voice data as non-application-object voice data; if not, marking both the first voice data and the second voice data as non-application-object voice data. By improving the way voice data is acquired, the method of the present invention relieves the user of the burden of training the speech recognition model and improves the user experience.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a method and system for acquiring voice data.
Background
Speech recognition on mobile terminals falls into two categories: semantic recognition and speaker recognition.
Speaker recognition is commonly called voiceprint recognition. It is generally divided into text-dependent and text-independent approaches.
Text-dependent speaker recognition usually requires the user to repeat a fixed phrase two or three times, and the extracted feature information is recorded as the enrollment (Enroll). At recognition time, the user must read the same fixed phrase for verification (Predict).
Text-independent speaker recognition does not require the user to read fixed sentences. The user supplies a large amount of voice data as training input (Train), and the user's feature information is highly refined by training on that data. The training data must contain both the voice of the target user (the application object of the speech recognition model) and the voices of other people. No fixed phrase is needed at recognition time either; ordinary speech can be used directly.
In the prior art, mobile intelligent terminals cannot distinguish user identity during speech recognition and do not separate the voice feature values of different users, so the same terminal serves the voice commands of any user; confidentiality and exclusivity are poor.
Taking the voice assistant as an example, existing mobile intelligent terminals require a fixed wake-up procedure before the assistant can be used. This is a defect of text-dependent speech recognition: it cannot break free of the fixed text and cannot respond quickly to arbitrary voice commands from the target user (application object). All voice commands become usable only after the assistant has been woken up, any user can wake the assistant with the fixed phrase and issue commands, and since the assistant cannot recognize the speaker's identity, every command is executed.
Text-independent speaker recognition uses machine learning: a complete learning model is built and trained on a large amount of voice data to obtain highly refined user feature information and model parameters. With a trained model, the speaker can be recognized with high accuracy from arbitrary speech input, free of the fixed-text restriction.
However, implementing text-independent speaker recognition on a mobile intelligent terminal requires a large amount of voice data from both the enrolled person and non-enrolled people. The training process is long and tedious, which is a real challenge to the user experience: users do not want to spend time and effort entering voice data, and collecting voice data from people who are not the application object of the model is an awkward problem for the end user. Without sufficient training data, high recognition accuracy cannot be achieved, which is why no text-independent speaker recognition system has appeared on existing mobile intelligent terminals.
No effective solution has yet been proposed for the above problems, in particular for how to acquire voice data for the text-independent speech recognition models used by terminal applications.
Summary of the Invention
The present invention provides a method and system for acquiring voice data that reduce the user's burden by improving the voice data acquisition process.
The present invention provides a method for acquiring voice data, the voice data being used to train a speech recognition model, the method comprising the following steps:
Step A-1: when the user makes a voice call, save the voice data streams transmitted in real time within the intelligent terminal system, saving the microphone's input voice data stream as first voice data and the earpiece's output voice data stream as second voice data;
Step A-2: detect whether the first voice data and the second voice data meet the training requirements of the speech recognition model; if so, perform step A-3;
Step A-3: judge whether the first voice data comes from the application object of the speech recognition model; if so, perform step A-4; if not, perform step A-5;
Step A-4: mark the first voice data as application-object voice data and the second voice data as non-application-object voice data, where the application-object voice data is used for learning the application object's voice features in the speech recognition model and the non-application-object voice data is used for learning the voice features of non-application objects;
Step A-5: mark both the first voice data and the second voice data as non-application-object voice data.
The present invention also provides a system for acquiring voice data, the voice data being used to train a speech recognition model, the system comprising:
a saving module: when the user makes a voice call, save the voice data streams transmitted in real time within the intelligent terminal system, saving the microphone's input voice data stream as first voice data and the earpiece's output voice data stream as second voice data;
a detection module: detect whether the first voice data and the second voice data meet the training requirements of the speech recognition model; if so, invoke the user judgment module;
a user judgment module: judge whether the first voice data comes from the application object of the speech recognition model; if so, invoke voice object marking module 1; if not, invoke voice object marking module 2;
voice object marking module 1: mark the first voice data as application-object voice data and the second voice data as non-application-object voice data, where the application-object voice data is used for learning the application object's voice features in the speech recognition model and the non-application-object voice data is used for learning the voice features of non-application objects;
voice object marking module 2: mark both the first voice data and the second voice data as non-application-object voice data.
By saving the voice data of the user's voice calls, the present invention uses the microphone's input voice data (the first voice data) for learning the application object's voice features in the speech recognition model, and the earpiece's output voice data (the second voice data) for learning the voice features of non-application objects. The training voice data is passed to the speech recognition model "silently" in the background of the mobile intelligent terminal, so the user need not do boring and complicated input work; the training burden is reduced and the user experience improved. Moreover, the method and system of the present application can be applied to any neural-network-based speech recognition model and therefore have a wide scope of application. With the voice data acquisition method and system of the present application, text-independent speaker recognition can be realized on mobile intelligent terminals, breaking through the limitations of existing text-dependent speech recognition and letting the terminal understand each user's characteristics and usage habits more intelligently, with stronger exclusivity and security.
Brief Description of the Drawings
Fig. 1 is a flowchart of the voice data acquisition method of the present invention;
Fig. 2 is an embodiment of Fig. 1;
Fig. 3 is a structural diagram of the voice data acquisition system of the present invention;
Fig. 4 is an embodiment of Fig. 3.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the voice data acquisition method of the present invention, comprising the following steps (summarized in a code sketch after the list):
Step A-1 (S101): when the user makes a voice call, save the voice data streams transmitted in real time within the intelligent terminal system, saving the microphone's input voice data stream as first voice data and the earpiece's output voice data stream as second voice data;
Step A-2 (S102): detect whether the first voice data and the second voice data meet the training requirements of the speech recognition model; if so, perform step A-3;
Step A-3 (S103): judge whether the first voice data comes from the application object of the speech recognition model; if so, perform step A-4; if not, perform step A-5;
Step A-4 (S104): mark the first voice data as application-object voice data and the second voice data as non-application-object voice data, where the application-object voice data is used for learning the application object's voice features in the speech recognition model and the non-application-object voice data is used for learning the voice features of non-application objects;
Step A-5 (S105): mark both the first voice data and the second voice data as non-application-object voice data.
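The decision flow of steps A-2 to A-5 can be captured in a short sketch. This is only an illustration: the two predicates stand in for steps A-2 and A-3, whose concrete implementations are discussed below, and all names are assumptions rather than part of the invention.

```c
#include <stdbool.h>
#include <stddef.h>

enum label { APPLICATION_OBJECT, NON_APPLICATION_OBJECT };

struct clip {
    const short *pcm;   /* 16-bit PCM samples of the saved stream */
    size_t n;           /* number of samples */
    enum label label;   /* training label assigned below */
};

/* Placeholder stubs for steps A-2 and A-3; real implementations
   are described later in the text. */
static bool meets_training_requirements(const struct clip *c) { return c->n > 0; }
static bool is_application_object(const struct clip *c) { (void)c; return true; }

/* Steps A-2..A-5 of Fig. 1: label one pair of saved streams, or
   return false if the pair is unusable for training. */
bool label_call_data(struct clip *mic, struct clip *ear)
{
    if (!meets_training_requirements(mic) ||
        !meets_training_requirements(ear))
        return false;                           /* step A-2 fails */

    if (is_application_object(mic)) {           /* step A-3 */
        mic->label = APPLICATION_OBJECT;        /* step A-4 */
        ear->label = NON_APPLICATION_OBJECT;
    } else {                                    /* step A-5 */
        mic->label = NON_APPLICATION_OBJECT;
        ear->label = NON_APPLICATION_OBJECT;
    }
    return true;
}
```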
In step A-1, a voice call includes not only ordinary audio calls but also VoIP, VoLTE and other video calls, as well as the real-time audio and video calls of instant messaging apps, such as WeChat's "video chat" or "voice chat".
When the user initiates a voice call, execution of the method of Fig. 1 is triggered. When applied to WeChat or QQ, the method of Fig. 1 is triggered when the corresponding action is detected, for example when the "video chat" or "voice chat" button is pressed or takes effect.
The saving of voice data streams in step A-1 can be placed in the hardware device operation layer of the mobile intelligent terminal's operating system. When the user starts a voice call, this layer backs up and saves the microphone's input voice data and the earpiece's output voice data in real time; the microphone's input voice data represents the terminal user's speech, while the earpiece's output voice data represents the speech transmitted in real time from the far end to the terminal user.
Taking Android as an example, the hardware device operation layer is the Android HAL. The call state can be determined from the call_connected attribute of tiny_audio_device in the Audio HAL: when adev->call_connected is true, a voice call is in progress.
In the Audio HAL, when audio_hw_device is AUDIO_DEVICE_IN_BUILTIN_MIC, the microphone device is currently working; when audio_hw_device is AUDIO_DEVICE_OUT_EARPIECE, the earpiece device is currently working.
Further, the voice data about to be output to the earpiece can be backed up and saved in the Audio HAL's out_write() function (write corresponds to playback); likewise, the voice data input from the microphone can be backed up and saved in the in_read() function (read corresponds to recording).
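Concretely, the tap amounts to a few lines inside those two HAL callbacks. The sketch below is a minimal, self-contained illustration assuming a tinyalsa-style Audio HAL; tapped_out_write(), tapped_in_read(), save_pcm() and the trimmed-down tiny_audio_device are illustrative stand-ins for vendor internals, not the actual HAL signatures.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for the vendor's device struct. */
struct tiny_audio_device { bool call_connected; };

static struct tiny_audio_device g_adev;
static FILE *dump_mic;   /* first voice data: microphone input  */
static FILE *dump_ear;   /* second voice data: earpiece output  */

static void save_pcm(FILE *f, const void *buf, size_t bytes)
{
    if (f)
        fwrite(buf, 1, bytes, f);   /* append raw PCM frames */
}

/* Wraps the HAL playback write: during a call the buffer holds the
   far-end voice on its way to the earpiece (second voice data). */
size_t tapped_out_write(const void *buffer, size_t bytes)
{
    if (g_adev.call_connected)
        save_pcm(dump_ear, buffer, bytes);
    /* ... then hand the buffer to the real out_write() ... */
    return bytes;
}

/* Wraps the HAL capture read: during a call the buffer holds the
   near-end user's voice from the microphone (first voice data). */
size_t tapped_in_read(void *buffer, size_t bytes)
{
    /* ... first fill the buffer via the real in_read() ... */
    if (g_adev.call_connected)
        save_pcm(dump_mic, buffer, bytes);
    return bytes;
}
```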
In addition, in step A-1, the first voice data and the second voice data may be stored in the ROM or RAM of the mobile intelligent terminal.
The voice feature extraction of a neural-network speech recognition model requires a large amount of personal speech to be input in advance for training, so as to obtain the speaker's voiceprint features. In the existing approach, a dedicated program has the user record voice data sentence by sentence for training; this costs the user extra time devoted solely to personal voice feature training, and the process is complicated and tedious.
The training method of Fig. 1 of the present application instead uses the intelligent terminal to collect the instant-messaging voice data that the user generates in daily work and life, and uses the collected voice data to train the speech recognition model, continuously improving its recognition accuracy. Compared with the prior art, the user need not do boring and complicated input work; the training burden is reduced and the user experience improved.
At the same time, with the training method of Fig. 1 applied day after day, text-independent speaker recognition becomes feasible on mobile intelligent terminals, breaking through the limitations of existing text-dependent speech recognition and letting the terminal understand each user's characteristics and usage habits more intelligently, with stronger exclusivity and security.
Taking the voice assistant as an example, once text-independent speaker recognition is added, the user's identity can be recognized, so that only voice commands from the application object of the speech recognition model are processed, enhancing security and exclusivity.
In step A-2 of Fig. 1, detecting whether the first voice data and the second voice data meet the model training requirements may, for example, proceed as follows: first detect whether they contain non-silent features; if not, no human voice was recorded and the data do not meet the training requirements; if so, further detect whether the speech in them is clear; if it is not clear, it is useless for model training and likewise does not meet the requirements.
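One possible realization of the non-silence check is a simple frame-energy test, sketched below. The frame length, sample rate and energy threshold are illustrative tuning values, not values prescribed by the invention.

```c
#include <stdbool.h>
#include <stddef.h>

#define FRAME     160                  /* 20 ms at 8 kHz (illustrative) */
#define THRESHOLD (1000.0 * 1000.0)    /* mean squared amplitude        */

/* Returns true if any frame of 16-bit PCM exceeds the average-energy
   threshold, i.e. the clip contains non-silent content. */
bool contains_speech_energy(const short *pcm, size_t n)
{
    for (size_t i = 0; i + FRAME <= n; i += FRAME) {
        double energy = 0.0;
        for (size_t j = 0; j < FRAME; j++)
            energy += (double)pcm[i + j] * pcm[i + j];
        if (energy / FRAME > THRESHOLD)
            return true;    /* found a non-silent frame */
    }
    return false;           /* all frames below threshold: silence */
}
```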
Optionally, in step A-2 of Fig. 1, before step A-3 is executed, voice cleaning may first be performed on the first voice data and the second voice data, with step A-3 executed afterwards. Voice cleaning includes denoising, noise reduction and similar processing, which gives the first and second voice data better quality and thereby improves the model training result.
In step A-3 of Fig. 1, whether the first voice data comes from the application object of the speech recognition model can be judged by face recognition, fingerprint verification, or a dialog asking the user to confirm their identity. With face recognition, the camera actively collects the user's face information and a comparison determines whether the user is the application object of the speech recognition model; if collection fails, the user is prompted for input. Fingerprint verification generally prompts the user to supply the fingerprint of a designated finger and decides by comparison whether the user is the application object.
In Fig. 1, to save training time for the speech recognition model, its generic feature modules, in particular the non-application-object voice feature module, may be trained in advance.
On the other hand, to avoid occupying terminal system resources and leaking user privacy, the first voice data and the second voice data are used to train the speech recognition model immediately after step A-4 or step A-5; when training finishes, if the first and second voice data have not been updated, they are cleared and the flow of Fig. 1 exits. In other words, after training, the related voice data is cleared and the flow of Fig. 1 ends.
Fig. 2 extends the method of Fig. 1 and gives an embodiment of a specific application, comprising the following steps:
Step A-11 (S201): when the user makes a voice call, save the microphone's input voice data stream as third voice data and the earpiece's output voice data stream as fourth voice data;
Step A-12 (S202): when the third voice data reaches a preset duration, set the first voice data equal to the third voice data and clear the third voice data, then perform step A-2 while returning to step A-11 (a buffering sketch follows this list);
Step A-13 (S203): when the fourth voice data reaches the preset duration, set the second voice data equal to the fourth voice data and clear the fourth voice data, then perform step A-2 while returning to step A-11;
Step A-2 (S204): detect whether the first voice data and the second voice data meet the training requirements of the speech recognition model; if so, perform step A-31;
Step A-31 (S205): use the speech recognition model to judge whether the first voice data comes from the model's application object, and output the confidence of the judgment; if the confidence is below the threshold, perform step A-32; if the judgment is that the speaker is the application object and the confidence is at or above the threshold, perform step A-4; if the judgment is that the speaker is not the application object and the confidence is at or above the threshold, perform step A-5;
Step A-32 (S206): check whether the user has already confirmed their identity during this voice call; if not, ask the user to confirm whether they are the application object of the speech recognition model and record the result; if the user is the application object, perform step A-4; if not, perform step A-5;
Step A-4 (S207): mark the first voice data as application-object voice data and the second voice data as non-application-object voice data, where the application-object voice data is used for learning the application object's voice features in the speech recognition model and the non-application-object voice data is used for learning the voice features of non-application objects;
Step A-5 (S208): mark both the first voice data and the second voice data as non-application-object voice data.
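The buffering sketch referenced above: steps A-11 to A-13 amount to accumulating each stream into a staging buffer and handing off a copy every time the preset duration is reached. SEGMENT_SAMPLES and the accumulate() helper are illustrative assumptions; for simplicity the sketch expects each incoming chunk to be smaller than one segment, as audio callbacks typically deliver.

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define SEGMENT_SAMPLES (16000 * 15)   /* illustrative: 15 s at 16 kHz */

struct staging {
    short buf[SEGMENT_SAMPLES];   /* third or fourth voice data */
    size_t used;
};

/* Steps A-11..A-13: append newly captured samples to the staging
   buffer; when the preset duration is reached, copy the segment out
   (it becomes the first or second voice data for steps A-2..A-5)
   and clear the staging buffer. Returns true when a segment is
   ready; out must hold SEGMENT_SAMPLES samples. */
bool accumulate(struct staging *s, const short *in, size_t n, short *out)
{
    bool emitted = false;
    for (size_t i = 0; i < n; i++) {
        s->buf[s->used++] = in[i];
        if (s->used == SEGMENT_SAMPLES) {        /* preset duration reached */
            memcpy(out, s->buf, sizeof s->buf);  /* hand off the segment */
            s->used = 0;                         /* clear third/fourth data */
            emitted = true;
        }
    }
    return emitted;
}
```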
The method of Fig. 1 of the present application can save all the data of one voice call as the first voice data and the second voice data and then train the speech recognition model on it, or it can be configured, as in the method of Fig. 2, to train while saving. Likewise, steps A-11 to A-13 of Fig. 2 can be replaced by step A-1 of Fig. 1, chosen according to actual needs.
In steps A-12 and A-13 of Fig. 2, the preset duration is greater than 10 seconds, or greater than the time taken to perform steps A-2 to A-5 of Fig. 2.
Step A-31 of Fig. 2 does not use face recognition or fingerprint recognition for user identity authentication; the speech recognition model itself authenticates the user. At the start of training, the model's judgments are error-prone, so the user is asked to confirm identity manually; as the model's recognition accuracy grows, manual involvement is no longer needed. The method of Fig. 2 can therefore run "silently" in the background, continuously training the speech recognition model without the user being aware of it.
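The fallback logic of steps A-31/A-32 reduces to a small decision rule, sketched below under the assumption that the model returns a label plus a confidence in [0, 1]; CONFIDENCE_THRESHOLD and the enum names are illustrative choices, not values fixed by the invention.

```c
#include <stdbool.h>

#define CONFIDENCE_THRESHOLD 0.8   /* illustrative; the invention only
                                      assumes some threshold exists */

enum decision { MARK_APPLICATION, MARK_NON_APPLICATION, ASK_USER };

/* Step A-31: trust the model's speaker judgment only when its
   confidence clears the threshold; otherwise fall back to asking the
   user to confirm identity once per call (step A-32). */
enum decision decide(bool model_says_application, double confidence)
{
    if (confidence < CONFIDENCE_THRESHOLD)
        return ASK_USER;                        /* step A-32 */
    return model_says_application ? MARK_APPLICATION
                                  : MARK_NON_APPLICATION;
}
```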
The present invention also includes a system for acquiring voice data, as shown in Fig. 3, comprising:
a saving module: when the user makes a voice call, save the voice data streams transmitted in real time within the intelligent terminal system, saving the microphone's input voice data stream as first voice data and the earpiece's output voice data stream as second voice data;
a detection module: detect whether the first voice data and the second voice data meet the training requirements of the speech recognition model; if so, invoke the user judgment module;
a user judgment module: judge whether the first voice data comes from the application object of the speech recognition model; if so, invoke voice object marking module 1; if not, invoke voice object marking module 2;
voice object marking module 1: mark the first voice data as application-object voice data and the second voice data as non-application-object voice data, where the application-object voice data is used for learning the application object's voice features in the speech recognition model and the non-application-object voice data is used for learning the voice features of non-application objects;
voice object marking module 2: mark both the first voice data and the second voice data as non-application-object voice data.
Optionally, as shown in Fig. 4, the saving module may instead comprise a loop recording module, a transfer module 1 and a transfer module 2:
loop recording module: save the microphone's input voice data stream as third voice data and the earpiece's output voice data stream as fourth voice data;
transfer module 1: when the third voice data reaches a preset duration, set the first voice data equal to the third voice data and clear the third voice data, invoke the detection module, and return to the loop recording module;
transfer module 2: when the fourth voice data reaches the preset duration, set the second voice data equal to the fourth voice data and clear the fourth voice data, invoke the detection module, and return to the loop recording module.
Optionally, as shown in Fig. 4, the user judgment module may instead comprise a speech-recognition-model user judgment module and a user confirmation module:
speech-recognition-model user judgment module: use the speech recognition model to judge whether the first voice data comes from the model's application object, and output the confidence of the result; if the confidence is below the threshold, invoke the user confirmation module; if the judgment is that the speaker is the application object and the confidence is at or above the threshold, invoke voice object marking module 1; if the judgment is that the speaker is not the application object and the confidence is at or above the threshold, invoke voice object marking module 2;
user confirmation module: check whether the user has already confirmed their identity during this voice call; if not, ask the user to confirm whether they are the application object of the speech recognition model and record the result; if the user is the application object, invoke voice object marking module 1; if not, invoke voice object marking module 2.
Optionally, in the detection module, detecting whether the first voice data and the second voice data meet the training requirements of the speech recognition model includes:
detecting whether the first voice data and the second voice data contain non-silent features; if not, they do not meet the model training requirements; if so, further detecting whether the speech in them is clear; if it is not clear, they likewise do not meet the model training requirements.
Optionally, in the detection module, "if so, invoke the user judgment module" includes:
if so, performing voice cleaning on the first voice data and the second voice data and then invoking the user judgment module.
It should be noted that the embodiments of the voice data acquisition system of the present invention follow the same principles as the embodiments of the voice data acquisition method, and the relevant parts may be cross-referenced.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope; any modification, equivalent replacement or improvement made within the spirit and principles of the technical solution of the present invention shall fall within the protection scope of the present invention.
Claims (10)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810324045.2A CN108510981B (en) | 2018-04-12 | 2018-04-12 | Method and system for acquiring voice data |
KR1020190035388A KR102714096B1 (en) | 2018-04-12 | 2019-03-27 | Electronic apparatus and operation method thereof |
US16/382,712 US10984795B2 (en) | 2018-04-12 | 2019-04-12 | Electronic apparatus and operation method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810324045.2A CN108510981B (en) | 2018-04-12 | 2018-04-12 | Method and system for acquiring voice data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108510981A CN108510981A (en) | 2018-09-07 |
CN108510981B true CN108510981B (en) | 2020-07-24 |
Family
ID=63381824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810324045.2A Active CN108510981B (en) | 2018-04-12 | 2018-04-12 | Method and system for acquiring voice data |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR102714096B1 (en) |
CN (1) | CN108510981B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020096078A1 (en) * | 2018-11-06 | 2020-05-14 | 주식회사 시스트란인터내셔널 | Method and device for providing voice recognition service |
KR20220120197A (en) * | 2021-02-23 | 2022-08-30 | 삼성전자주식회사 | Electronic apparatus and controlling method thereof |
EP4207805A4 (en) * | 2021-02-23 | 2024-04-03 | Samsung Electronics Co., Ltd. | ELECTRONIC DEVICE AND CONTROL METHOD THEREFOR |
US12260865B2 (en) | 2021-08-09 | 2025-03-25 | Electronics And Telecommunications Research Institute | Automatic interpretation server and method based on zero UI for connecting terminal devices only within a speech-receiving distance |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002196781A (en) * | 2000-12-26 | 2002-07-12 | Toshiba Corp | Voice interactive system and recording medium used for the same |
KR100864828B1 (en) * | 2006-12-06 | 2008-10-23 | 한국전자통신연구원 | System for obtaining speaker's information using the speaker's acoustic characteristics |
JP5158174B2 (en) * | 2010-10-25 | 2013-03-06 | 株式会社デンソー | Voice recognition device |
US9489950B2 (en) * | 2012-05-31 | 2016-11-08 | Agency For Science, Technology And Research | Method and system for dual scoring for text-dependent speaker verification |
CN104517587B (en) * | 2013-09-27 | 2017-11-24 | 联想(北京)有限公司 | A kind of screen display method and electronic equipment |
KR102274317B1 (en) * | 2013-10-08 | 2021-07-07 | 삼성전자주식회사 | Method and apparatus for performing speech recognition based on information of device |
KR101564087B1 (en) * | 2014-02-06 | 2015-10-28 | 주식회사 에스원 | Method and apparatus for speaker verification |
CN103956169B (en) * | 2014-04-17 | 2017-07-21 | 北京搜狗科技发展有限公司 | A kind of pronunciation inputting method, device and system |
KR20160098581A (en) * | 2015-02-09 | 2016-08-19 | 홍익대학교 산학협력단 | Method for certification using face recognition an speaker verification |
KR102371697B1 (en) * | 2015-02-11 | 2022-03-08 | 삼성전자주식회사 | Operating Method for Voice function and electronic device supporting the same |
KR101618512B1 (en) * | 2015-05-06 | 2016-05-09 | 서울시립대학교 산학협력단 | Gaussian mixture model based speaker recognition system and the selection method of additional training utterance |
CN105976820B (en) * | 2016-06-14 | 2019-12-31 | 上海质良智能化设备有限公司 | Voice emotion analysis system |
- 2018-04-12: CN application CN201810324045.2A filed; granted as CN108510981B (Active)
- 2019-03-27: KR application KR1020190035388A filed; granted as KR102714096B1 (Active)
Also Published As
Publication number | Publication date |
---|---|
CN108510981A (en) | 2018-09-07 |
KR20190119521A (en) | 2019-10-22 |
KR102714096B1 (en) | 2024-10-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||