CN111601215A - Scene-based key information reminding method, system and device - Google Patents

Scene-based key information reminding method, system and device

Info

Publication number
CN111601215A
Authority
CN
China
Prior art keywords
keyword
audio stream
voice
reminder
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313790.4A
Other languages
Chinese (zh)
Inventor
张时嘉
曾娟鹃
张亦农
王海业
由海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xijueshuo Information Technology Co ltd
Original Assignee
Nanjing Xijueshuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xijueshuo Information Technology Co ltd
Priority to CN202010313790.4A
Publication of CN111601215A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a key information reminder method, a system, and an embedded audio playback device. Compared with the prior art, the embedded audio playback device can independently perform scene-based real-time detection, reminding, recording and playback of key information in continuous speech; it is convenient to use and protects privacy well. The key information reminder system and method train, in advance, a keyword recognition model that closely matches the application scenario, based on the actual needs of the current scenario and on keywords or training samples customized by the user. This effectively improves the accuracy of recognizing key information in a continuous speech stream, and reminders are output and the audio saved in time for the information that deserves attention in the current scenario and the information the user is interested in, giving a good user experience.

Figure 202010313790

Description

Scene-based key information reminder method, system and device

Technical field

The present invention relates to the technical field of embedded devices, and in particular to a scene-based key information reminder system, a method, and an embedded audio playback device.

Background

The Internet and mobile communication networks have now reached countless households and every corner of daily life. Remote audio and video applications built on these communication platforms, such as online meetings, online teaching, online business negotiation and online sales, have grown as computer networking, audio and video processing, and embedded devices built around a system-on-chip (SoC) have matured. Used with embedded devices such as mobile phones, earphones, tablet computers and speakers, these applications remove geographic limits, let people in different places interact by voice and video in real time, and bring great convenience to work and life. For example, during the current epidemic, students can continue attending classes from home through online teaching platforms. Students often attend online classes wearing a headset and move around freely within a certain range during the lesson. The drawback is that online teaching lacks a classroom atmosphere and the teacher cannot observe each child's attention in time, so it depends heavily on the student's self-discipline. Once a student's attention wanders or the student starts playing, nobody can give a timely reminder or correction, and the content being taught is missed. A similar situation arises in online video conferences, where key spoken information is missed because of a personal interruption or a phone call. Online class and video conferencing software on a phone or computer usually has no function that reminds the user of key information in what the speaker at the opposite end says. Even where such a function exists, the local user is not necessarily beside the phone or computer. It is therefore highly desirable to implement key information reminding directly in the accessory devices of the phone or computer that are closest to the local user, such as headsets or speakers, so that the user's attention can be drawn back to the class or conference immediately.

In recent years, speech recognition has been increasingly used to monitor speech and identify important information. Supported by Moore's Law and big data, speech recognition based on artificial intelligence has moved from shallow recognition into the deep learning stage. Speech recognition based on deep learning theory and neural network models can produce more accurate results, and is therefore widely used in fields such as intelligent voice wake-up, intelligent voice control and intelligent voice dialogue.

After in-depth research, however, the inventors found that applying artificial-intelligence speech recognition to key voice information reminding in current remote audio and video applications runs into several technical bottlenecks, for example:

First, in artificial-intelligence speech recognition the recognition model is the key to recognition accuracy. Today's applications of intelligent voice wake-up, voice control and voice dialogue usually use a general-purpose speech recognition model: the provider of the device or application trains the model in advance, and the criteria for what counts as important information and the choice of training samples are decided entirely by that provider. If such a general-purpose model is simply reused in remote audio and video applications, it adapts poorly to the variety of application scenarios, and may even give a poor user experience because recognition accuracy cannot be guaranteed.

Second, artificial-intelligence speech recognition, and deep learning in particular, requires a large amount of high-precision computation, which depends on strong hardware support in terms of memory, computing overhead and power consumption. Such techniques are therefore mostly deployed on large dedicated computing platforms such as GPUs and FPGAs, which are high in cost, power consumption and performance. On the low-power, low-performance embedded devices that ordinary consumers use most (accessory devices of phones or computers such as headsets, portable speakers, phone watches and conference terminals), independent keyword recognition that does not rely on a phone or the cloud is very rare. Where it exists, it uses relatively simple techniques such as isolated-word recognition, fixed keyword sets or restricted sentence patterns to implement simple, low-level functions such as voice wake-up or smart-home voice control, and does not achieve key voice information reminding in a complex, continuous speech stream. For example, some well-known intelligent voice assistants on the market upload the speech stream captured by the embedded device to a phone or the cloud for recognition, and can usually recognize only a single utterance. Uploading the captured speech stream to the cloud or a remote device and waiting for the result usually gives a poor user experience because of the long delay, and the user's privacy is hard to guarantee. An important reason is that the limited computing power and power budget of embedded hardware cannot adequately support current large-vocabulary continuous speech recognition.

Third, speech recognition in today's consumer products recognizes keywords or full speech in the locally input voice stream and then performs some interaction. It lacks a function that, in a specific scenario, recognizes keywords of interest in speech travelling in the other direction, that is, coming from the remote end, and then issues a reminder.

It is therefore necessary to propose a scene-based key information reminder technique that addresses at least one of the above technical defects.

Summary of the invention

In view of this, the present invention provides a key information reminder method, device, system and embedded audio playback device that can effectively remind the user to pay attention to key information.

To achieve the above object, a first aspect of the present invention provides an embedded audio playback device comprising a speaker and a communication unit, and further comprising a control unit, a storage unit, a speech recognition unit and a reminder unit, wherein:

the communication unit receives an audio stream from a remote end;

the speech recognition unit comprises a keyword recognition model unit, which stores a scene-based keyword recognition model;

the keywords are associated with an application scenario and comprise a set of words that need particular attention in that scenario, one or more of which are specified in advance by the user;

the speech recognition unit extracts a voice signal from the audio stream and uses the scene-based keyword recognition model to detect in real time whether the voice signal contains a keyword;

the control unit is configured, when the voice signal contains a keyword, to start recording the received audio stream and to control the reminder unit to output a key information reminder;

the storage unit stores the recorded audio stream; and

the speaker plays the audio stream or, in response to a playback instruction, plays back the recorded audio stream.

Preferably, the scene-based keyword recognition model may be obtained in advance by training with a deep learning algorithm on a training sample library that contains speech samples for the keywords and/or speech samples of a specific person for the keywords;

the control unit may also be configured to download the scene-based keyword recognition model from the remote end through the communication unit.

Preferably, the speech recognition unit may further comprise a speech preprocessing unit that preprocesses the input audio stream to remove noise, background voices and music and to extract the voice signal.

Preferably, the speech recognition unit may further comprise a neural network processing unit that, based on the keyword recognition model, applies a deep learning algorithm to the voice signal, or to the voice signal processed by the speech preprocessing unit, so as to infer and decide on the words appearing in the voice signal and determine whether they include a keyword.

Preferably, the reminder unit may be one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module.

Further, the device may also comprise an input unit for receiving a recording stop instruction and a playback instruction entered by the user;

when the voice signal contains a keyword, the control unit may start to continuously compress and encode the received audio stream and store it locally;

the control unit may stop recording when it receives a recording stop instruction or when the continuous recording time exceeds a first predetermined duration;

when it receives an instruction to play back local audio, the control unit may play the locally stored recorded audio stream.

Further, the control unit may also be configured to send a recording start instruction to the remote end when the voice signal contains a keyword, so that the remote end starts continuously recording the audio stream it sends, and to send a recording stop instruction to the remote end when a stop-recording instruction is received before the continuous recording time exceeds a second predetermined duration;

when it receives an instruction to play back remote audio, the control unit may send a playback request to the remote end, and receive and play the recorded audio stream stored at the remote end.
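
The local recording behaviour described above (start on a keyword, stop on a stop instruction or once the first predetermined duration is exceeded, play back on request) can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `RecordingController`, its method names and the 60-second default are hypothetical, audio frames are opaque byte strings rather than compressed audio, and a new keyword hit is assumed to overwrite the previous recording.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordingController:
    """Hypothetical sketch of local recording control.

    max_seconds stands in for the "first predetermined duration";
    the default of 60.0 is an assumption, not a value from the source.
    """
    max_seconds: float = 60.0
    recording: bool = False
    started_at: Optional[float] = None
    stored_frames: List[bytes] = field(default_factory=list)

    def on_keyword(self, now: float) -> None:
        # A detected keyword starts a new recording unless one is running.
        if not self.recording:
            self.recording = True
            self.started_at = now
            self.stored_frames = []  # assumption: overwrite the old recording

    def on_frame(self, frame: bytes, now: float) -> None:
        # Store frames while recording; stop once the time limit is exceeded.
        if not self.recording:
            return
        if now - self.started_at > self.max_seconds:
            self.recording = False
            return
        self.stored_frames.append(frame)

    def on_stop_instruction(self) -> None:
        # The user's recording stop instruction ends recording immediately.
        self.recording = False

    def playback(self) -> bytes:
        # Return the locally stored recording for playback.
        return b"".join(self.stored_frames)
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to exercise in isolation.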

Preferably, the embedded audio playback device is a headset or a speaker with a call function.

A second aspect of the present invention provides a key information reminder system comprising an embedded audio playback device and a remote device, wherein:

the remote device receives keywords defined by the user, and/or speech samples of a specific person that are provided by the user and contain at least the keywords, for use in obtaining a scene-based keyword recognition model; the keywords are associated with an application scenario and comprise a set of words that need particular attention in that scenario;

the scene-based keyword recognition model is obtained in advance by training on a training sample library that contains speech samples for the keywords and/or speech samples of a specific person for the keywords;

the embedded audio playback device communicates with the remote device, and receives and plays an audio stream from the remote device;

the embedded audio playback device also obtains a voice signal from the audio stream, performs speech recognition on it using the scene-based keyword recognition model, and detects in real time whether the voice signal contains a keyword;

when the voice signal contains a keyword, the embedded audio playback device generates a key information reminder and starts recording the received audio stream; and

the embedded audio playback device plays the recorded audio stream in response to a playback instruction.

Further, the system may also comprise a cloud server, wherein:

the remote device communicates with the cloud server and sends the keywords and/or the specific person's speech samples to the cloud server;

the cloud server uses the received keywords and/or the specific person's speech samples to expand its standard sample library into a training sample library and, based on the training sample library, trains the scene-based keyword recognition model with a deep learning algorithm; and

the remote device receives the scene-based keyword recognition model from the cloud server and downloads it to the embedded audio playback device.

Preferably, the remote device uses the keywords entered by the user and/or the speech samples of a specific person that are provided by the user and contain at least the keywords to expand a standard sample library into a training sample library and, based on the training sample library, trains the scene-based keyword recognition model with a deep learning algorithm;

the remote device then downloads the scene-based keyword recognition model to the embedded audio playback device.

A third aspect of the present invention provides a key information reminder method, comprising:

receiving keywords defined by the user, and/or speech samples of a specific person that are provided by the user and contain at least the keywords, the keywords being associated with an application scenario and comprising a set of words that need particular attention in that scenario;

training the scene-based keyword recognition model on a training sample library that contains speech samples for the keywords and/or speech samples of a specific person for the keywords;

obtaining a voice signal from an audio stream while the audio stream is being received and played;

performing speech recognition on the voice signal using the scene-based keyword recognition model, and detecting in real time whether the voice signal contains a keyword;

when the voice signal contains a keyword, generating a key information reminder and starting to record the received audio stream; and

playing the recorded audio stream in response to a playback instruction.
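
The runtime part of the method above (play, detect, remind, record) can be sketched as a single loop. This is an illustrative sketch only: the function name `run_reminder_loop` and its callback parameters are hypothetical, the trained keyword recognition model is abstracted as a `detect` callable supplied by the caller, and recording is assumed to begin with the frame in which the keyword was found.

```python
from typing import Callable, Iterable, List, Optional

Frame = bytes
# The detector stands in for the scene-based keyword recognition model.
# It returns the keyword found in a frame, or None if there is none.
Detector = Callable[[Frame], Optional[str]]


def run_reminder_loop(
    frames: Iterable[Frame],
    detect: Detector,
    play: Callable[[Frame], None],
    remind: Callable[[str], None],
) -> List[Frame]:
    """Play every frame; on a keyword hit, remind the user and record."""
    recorded: List[Frame] = []
    recording = False
    for frame in frames:
        play(frame)                 # playback continues regardless of detection
        keyword = detect(frame)
        if keyword is not None:
            remind(keyword)         # output the key information reminder
            recording = True        # start recording the received stream
        if recording:
            recorded.append(frame)
    return recorded
```

Passing the detector in as a callable keeps the loop independent of how the model is trained or executed, so a stub can be used for testing and an on-device neural network in deployment.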

Preferably, a broad range of speech samples is collected in advance to form a standard sample library;

speech samples that contain at least the keywords are obtained according to the keywords;

the speech samples containing the keywords and/or the specific person's speech samples are added to the standard sample library to form a training sample library, and the scene-based keyword recognition model is trained on the training sample library with a deep learning algorithm.
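
The expansion of the standard sample library into a training sample library can be sketched as follows. This is a minimal sketch under stated assumptions: the `Sample` record, its fields and the function `build_training_library` are hypothetical, each sample is assumed to carry a text transcript so that keyword membership can be checked by substring match, and the deep learning training step itself is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class Sample:
    path: str        # hypothetical location of the audio clip
    transcript: str  # assumed text of what is spoken in the clip
    speaker: str     # speaker label


def build_training_library(
    standard: Sequence[Sample],
    keyword_samples: Iterable[Sample],
    speaker_samples: Iterable[Sample],
    keywords: Sequence[str],
) -> List[Sample]:
    """Expand the standard library with samples that contain a keyword.

    keyword_samples are generic clips gathered for the keywords;
    speaker_samples are clips of a specific person provided by the user.
    """
    library = list(standard)
    seen = set(library)
    for sample in list(keyword_samples) + list(speaker_samples):
        contains_keyword = any(k in sample.transcript for k in keywords)
        if contains_keyword and sample not in seen:
            library.append(sample)   # keep only new, keyword-bearing samples
            seen.add(sample)
    return library
```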

Further, the step of obtaining a voice signal from the audio stream may also include a preprocessing step that removes noise, music and background voices.

Preferably, performing speech recognition on the voice signal, or on the preprocessed voice signal, using the scene-based keyword recognition model and detecting in real time whether it contains a keyword may specifically include: constructing a deep learning neural network based on the keyword recognition model, and continuously feeding the voice signal into the network for processing, so as to infer and decide on the words appearing in the voice signal and determine whether they include a keyword.

Preferably, recording the received audio stream may specifically include: when the voice signal contains a keyword, starting to continuously compress and encode the received audio stream and store it locally;

stopping local recording when a recording stop instruction is received or when the continuous recording time exceeds a first predetermined duration;

and playing the recorded audio stream in response to a playback instruction specifically includes: in response to an instruction to play back local audio, playing the locally stored recorded audio stream.

Preferably, recording the received audio stream may specifically include: when the voice signal contains a keyword, sending a recording start instruction to the remote end, whereupon the remote end starts continuously recording the audio stream it sends and stores it at the remote end;

when a stop-recording instruction is received before the continuous recording time exceeds a second predetermined duration, sending a recording stop instruction to the remote end, whereupon the remote end stops recording;

and playing the recorded audio stream in response to a playback instruction specifically includes: in response to an instruction to play back remote audio, sending a playback request to the remote end, and receiving and playing the recorded audio stream stored at the remote end.
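
The remote recording variant above can be sketched from the local device's side as a thin client that only sends control messages. Everything named here is an assumption made for illustration: the class `RemoteRecordingClient`, the message strings `RECORD_START`, `RECORD_STOP` and `PLAYBACK_REQUEST`, and the 300-second default for the second predetermined duration. The sketch also assumes that the remote end stops by itself once that duration has elapsed, which the text does not state explicitly.

```python
from typing import Callable, Optional


class RemoteRecordingClient:
    """Hypothetical local-side controller for recording at the remote end."""

    def __init__(self, send: Callable[[str], None], max_seconds: float = 300.0) -> None:
        self._send = send                  # transport to the remote end (assumed)
        self._max_seconds = max_seconds    # "second predetermined duration" (assumed value)
        self._started_at: Optional[float] = None

    def _active(self, now: float) -> bool:
        # A remote recording counts as active until the duration limit passes.
        return self._started_at is not None and now - self._started_at <= self._max_seconds

    def on_keyword(self, now: float) -> None:
        # Ask the remote end to start recording unless it is already doing so.
        if not self._active(now):
            self._send("RECORD_START")
            self._started_at = now

    def on_stop_instruction(self, now: float) -> None:
        # Forward the stop only if the duration limit has not been exceeded.
        if self._active(now):
            self._send("RECORD_STOP")
        self._started_at = None

    def on_playback_instruction(self) -> None:
        # Request the remotely stored recording for playback.
        self._send("PLAYBACK_REQUEST")
```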

Preferably, the key information reminder may take the form of one or more of a visual reminder, a tactile reminder and an auditory reminder, alone or in combination;

the visual reminder includes a light-effect reminder and a remote text message reminder;

the tactile reminder includes a vibration reminder;

the auditory reminder includes a voice reminder and a music reminder.
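
The reminder forms listed above map naturally onto an enumeration grouped by modality. The following is a small sketch for illustration; the names `Reminder`, `MODALITY` and `issue_reminders` are hypothetical, and real hardware actions (driving an LED, a vibrator or the speaker) are replaced by descriptive strings.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Reminder(Enum):
    LIGHT = "light effect"
    REMOTE_TEXT = "remote text message"
    VIBRATION = "vibration"
    VOICE = "voice"
    MUSIC = "music"


# Grouping of each reminder form under its modality, as listed in the text.
MODALITY: Dict[Reminder, str] = {
    Reminder.LIGHT: "visual",
    Reminder.REMOTE_TEXT: "visual",
    Reminder.VIBRATION: "tactile",
    Reminder.VOICE: "auditory",
    Reminder.MUSIC: "auditory",
}


def issue_reminders(enabled: Iterable[Reminder], keyword: str) -> List[str]:
    """Describe each enabled reminder; a device would trigger hardware here."""
    return [
        f"{MODALITY[r]}/{r.value}: keyword '{keyword}' detected"
        for r in enabled
    ]
```

Several forms can be enabled at once, matching the statement that the reminder may combine one or more modalities.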

The beneficial effects of the present invention are as follows. Compared with the prior art, the embedded audio playback device provided by the invention can independently perform scene-based real-time detection, reminding, recording and playback of key information in continuous speech; it is convenient to use and protects privacy well. The key information reminder system and method provided by the invention train, in advance, a keyword recognition model that closely matches the application scenario, based on the actual needs of the current scenario and on keywords or training samples customized by the user. This effectively improves the accuracy of recognizing key information in a continuous speech stream, and reminders are output and the audio saved in time for the information that deserves attention in the current scenario and the information the user is interested in, giving a good user experience.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of the key information reminder method of Embodiment 1 of the present invention;

FIG. 2 is a circuit block diagram of the embedded audio playback device of Embodiment 2 of the present invention;

FIG. 3 is a system architecture diagram of the key information reminder system of Embodiment 3 of the present invention.

Detailed description

Over the more than 40 years in which Moore's Law has held, semiconductor chip design and manufacturing processes have improved rapidly; chip computing power and on-chip storage capacity have grown substantially while power consumption has steadily fallen. This has made it possible to apply artificial intelligence widely in small, low-power embedded devices. The present invention is a technical improvement addressing the defect in the prior art that people using remote audio and video applications easily miss important information from the opposite end. Specifically, on an embedded device, scene-based artificial-intelligence speech recognition is applied to voice information to identify, in real time, information of interest coming from the opposite end, to output a timely reminder, and to save the key audio stream. The invention is applicable to different application scenarios and meets the individual needs of different users, and so effectively addresses the defects of the prior art. In this document, "real-time" means that the embedded audio playback device has enough computing power to recognize keywords in an audio stream played at its original speed.

The technical solutions of the present invention are further described below by way of example, with reference to the drawings and embodiments. The described embodiments are only some of the embodiments of the present application and are not an exhaustive list. Provided there is no conflict, the embodiments in the present application and the features of those embodiments may be combined with one another.

Embodiment 1:

As shown in FIG. 1, in accordance with the core idea of the present invention, this embodiment provides a key information reminder method, in which:

Step 100: initialization.

This step is the processing performed before key information reminding begins. It is mainly used to check and update the software and hardware environment configuration, parameters and programs that key information reminding requires. It may include establishing a communication connection between the local device and the remote device by wireless or wired communication, and may also include obtaining the scene-based keyword recognition model.

It should be noted that "local" and "remote" are relative terms in this document. "Local" refers to the party or end that receives the audio stream and generates the key information reminder. "Remote" refers to another party or end that is independent of the local end, communicates with it by wire or wirelessly, either directly or indirectly through one or more intermediaries, and sends the audio stream to it. It should also be noted that the "remote end" in this document is not the same concept as the "opposite end" commonly used in the field when describing voice calls. The "opposite end" is the other party that originates the audio stream; the "remote end" is the first recipient of the call audio stream after the opposite end sends it; and the "local" end is correspondingly the final recipient of the call audio stream.

In a specific implementation, the "local" end may be an audio playback device based on an embedded system (an "embedded audio playback device" for short). An "embedded system" as used herein is a special-purpose computer system embedded in a host system. It is application-centric, based on computer technology, and tailorable in both software and hardware, and it suits applications with strict requirements on function, reliability, cost, size, and power consumption. An "embedded device" is a device that contains an embedded system. It is generally based on an ARM core and architecture, or on another low-power core and architecture, and implements specific functions and applications, in contrast to a general-purpose, multi-function PC. Examples include headphones, speakers, phone watches, and conference terminals. The "remote" end may be an end-user computer system, a network server or server system, a mobile computing device, a consumer electronic device, another suitable electronic device, or any combination or part of these, such as a mobile phone, tablet, computer, or smart TV.

Remote audio and video applications serve a wide variety of scenarios and carry a huge amount of voice information, and what counts as key information differs from person to person and from scenario to scenario. For example, people often join video conferences or online classes through accessory devices of a mobile phone or computer, such as headsets or speakers. In a video conference, a user is probably most concerned with the parts that involve them, such as their department, their supervisor, and business related to them, so the keywords used to identify key information should be the department name, the supervisor's name, the user's own name, business names, task assignments, delivery deadlines, and so on. In an online class, a student is probably most concerned with the knowledge points the teacher covers, so the keywords should be "key point", "difficult point", "exam point", "summary", "review", and so on. In a customer service center, an agent is probably most concerned with complaints raised by customers, so the keywords need to include "complaint", "suggestion", "quality", "service attitude", and so on. If a full-text speech recognition model were used across these different scenarios while maintaining recognition accuracy, the model would have to be trained on a massive number of speech samples. This is usually hard to achieve in accessory devices of a mobile phone or computer. On the one hand, massive speech samples are difficult to obtain; on the other hand, training on them places very high demands on computer hardware, and the high implementation cost limits adoption of the technology in such accessory devices.

Therefore, in this embodiment, the step of acquiring a scene-based keyword recognition model, and in particular acquiring it in an accessory embedded device of a mobile phone or computer, serves to adjust and update the keyword recognition model according to the actual application scenario, so that the model better fits the current scenario and meets the user's needs. The keywords are associated with the application scenario and comprise a group of words that deserve particular attention in that scenario. Different application scenarios may correspond to different keywords. The user can set or specify one or more of the keyword words according to actual needs.

The step of acquiring the scene-based keyword recognition model may specifically include: receiving keyword vocabulary defined by the user, and/or receiving voice samples, provided by the user, of a specific person that contain at least the keywords; using the keywords and the specific person's voice samples to extend a standard sample library, thereby forming a training sample library; and training on the training sample library to obtain the scene-based keyword recognition model. The standard sample library may be a set of training samples formed from a broad range of speech samples collected in advance.

The step of receiving the user-defined keyword vocabulary, and/or receiving the user-provided voice samples of a specific person that contain at least the keywords, is usually performed at the remote end, using the richer user interface available on a mobile phone or computer.

In an optional implementation, the user can define a custom keyword set in advance through the remote end according to their preferences, needs, and usage scenario. The provider of the audio stream can also generate a default keyword set based on factors such as the usage scenario, the content of the audio stream, and the user's habits. The remote end can also display a number of default keyword words in advance for the user to select, add to, or remove, so as to form the keyword vocabulary set associated with the application scenario.

To match the hardware environment of the embedded device, an upper limit may be set on the number of words in the keyword set, for example 30 words.
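As an illustration only, the capped, user-editable keyword set described above could be modeled as follows. The class name `SceneKeywords`, the constant `MAX_KEYWORDS`, and the sample words are assumptions made for this sketch; the source fixes only the idea of a per-scenario vocabulary with an upper limit such as 30.

```python
from dataclasses import dataclass, field

MAX_KEYWORDS = 30  # example cap from the text; a real limit depends on the device


@dataclass
class SceneKeywords:
    """Keyword vocabulary tied to one application scenario (hypothetical structure)."""
    scene: str
    words: list = field(default_factory=list)

    def add(self, word: str) -> bool:
        """Add a word unless it is empty, a duplicate, or the cap is reached."""
        word = word.strip()
        if not word or word in self.words or len(self.words) >= MAX_KEYWORDS:
            return False
        self.words.append(word)
        return True

    def remove(self, word: str) -> None:
        """Remove a word if it is present."""
        if word in self.words:
            self.words.remove(word)


# Start from provider defaults, then let the user edit the set.
online_class = SceneKeywords("online_class")
for default_word in ["key point", "difficult point", "exam point", "summary", "review"]:
    online_class.add(default_word)
online_class.remove("review")
online_class.add("homework")
```

Rejecting additions beyond the cap, rather than evicting older words, keeps the user's earlier choices stable; either policy would be compatible with the text.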

In addition, many factors can affect the accuracy of speech recognition, including the speaker's gender, age, physiological characteristics of pronunciation, dialect, non-native pronunciation, emotional state, and environmental noise. For example, the same word "重点" ("key point") is pronounced very differently by speakers from Sichuan and from Guangdong. Therefore, this embodiment can also obtain voice samples, provided by the user, that contain at least the keywords and carry the accent of a specific person, and use them to extend the standard sample library. For example, a student can provide recordings of a teacher's lessons, and an employee can provide a recording of their manager speaking in a meeting.

After the keywords set by the user are obtained, they are used to select samples from the existing massive standard speech sample library to form the training sample library. After the user-provided voice samples of a specific person containing at least the keywords are received, those samples are also added to the training sample library. Because the scene-based keyword recognition model in this embodiment is trained on a sample library containing speech samples closely related to the application scenario, recognition accuracy can be effectively improved.
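The selection-then-extension procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Sample` record, the transcript-based keyword match, and the file paths are all hypothetical, and no audio is actually loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    transcript: str           # text spoken in the clip
    audio_path: str           # placeholder path; no audio is read in this sketch
    speaker: str = "generic"


def build_training_library(standard, keywords, speaker_samples):
    """Select keyword-bearing clips from the standard library, then append
    the user-provided clips of the specific speaker that contain a keyword."""
    def has_keyword(sample):
        return any(k in sample.transcript for k in keywords)

    selected = [s for s in standard if has_keyword(s)]
    selected += [s for s in speaker_samples if has_keyword(s)]
    return selected


standard = [
    Sample("the key point of this chapter", "std/001.wav"),
    Sample("good morning everyone", "std/002.wav"),
    Sample("a summary of last week", "std/003.wav"),
]
teacher = [
    Sample("remember this key point", "user/t01.wav", speaker="teacher"),
    Sample("open your books", "user/t02.wav", speaker="teacher"),
]
library = build_training_library(standard, ["key point", "summary"], teacher)
```

The resulting library would then be passed to whichever training algorithm is chosen; training itself is outside this sketch.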

In this embodiment, the keyword recognition model may be trained using the Hidden Markov Model (HMM) and Dynamic Topic Models (DTM), which have already been used successfully for speech and text recognition, together with the various classical artificial intelligence speech recognition algorithms derived from such techniques. It may also be trained using deep-learning-based algorithms and related algorithms developed in the future. Deep learning is an important area of machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, interpreting data such as images, sound, and text by imitating the brain's mechanisms. The core of deep learning is to learn more useful features by building machine learning models with multiple hidden layers and training them on large amounts of data, ultimately improving classification or prediction accuracy. At present, the mainstream deep learning algorithms in computer vision and natural language processing are the convolutional neural network (CNN) and the recurrent neural network (RNN); others include the long short-term memory (LSTM) network and the deep fully convolutional neural network (DFCNN). In a specific implementation, this embodiment may adopt any applicable deep learning algorithm, existing or future, including but not limited to these.

In a preferred implementation, this embodiment uses continuous-speech keyword recognition based on a deep learning algorithm. For example, after the training sample library is obtained, a deep learning algorithm such as a convolutional neural network (CNN) or a recurrent neural network (RNN) is trained on the training sample library to obtain the scene-based keyword recognition model.

The keyword recognition model using a deep learning algorithm may be trained at the remote end or in the cloud. The "cloud" as used herein refers to a cloud computing server or back-end server with powerful processing and storage capabilities. In a preferred implementation, training is completed in the cloud so as to make full use of the cloud's hardware resources and computing power. Specifically, after the user enters keyword vocabulary at the remote end or uploads voice samples of a specific person containing the keywords, the remote end sends the keywords and/or voice samples to the cloud. The cloud can then obtain voice samples containing the keywords by various means, such as from the Internet, add those voice samples and the specific person's voice samples to its standard sample library to form a training sample library, and train on that library to obtain the scene-based keyword recognition model.

Further, after training is complete, the remote end receives the scene-based keyword recognition model from the cloud.

When training is instead completed at the remote end, the process follows the cloud training process described above and is not repeated here.

The initialization step may further include a step of updating the local keyword recognition model, in which the remote end transfers the scene-based keyword recognition model down to the local end. The transfer may be started by the remote end actively sending an update request to the local end, or by the remote end responding to an update request from the local end.

Once the initialization step is complete, the following real-time key information detection and reminder process can begin.

Step 110: while receiving and playing an audio stream, acquire a voice signal from the audio stream.

In the key information reminder process of this embodiment, while the audio stream of a remote audio and video application is being received and played, the key information contained in the voice information is recognized and a reminder is given.

In a preferred implementation, when the voice signal is acquired from the audio stream in this step, background sound cancellation is also applied to the audio stream to remove background noise, background voices, music, and the like, and to extract the foreground voice signal. This raises the signal-to-noise ratio and, in turn, the success rate of speech recognition.

Step 120: use the scene-based keyword recognition model to perform speech recognition on the voice signal, detecting in real time whether the voice signal contains a preset keyword.

When detecting whether the voice signal contains a keyword, the voice signal is considered to contain a keyword as soon as any one of the keyword words is detected.

In a preferred implementation, scene-based key information recognition uses continuous-speech keyword recognition based on a deep learning algorithm. Specifically, a deep learning neural network is constructed from the keyword recognition model, and the continuous voice signal to be recognized is fed into the network for processing, so that the words appearing in the voice signal are inferred and judged to determine whether they include a keyword.

In this embodiment, scene-based key information recognition uses continuous-speech keyword recognition based on a deep learning algorithm. Unlike the large-vocabulary continuous speech recognition of the prior art, it does not need to recognize every word; it only detects whether one or more of the keywords set by the user appear in the continuous voice stream. On the one hand, this allows real-time detection on a continuous voice stream. On the other hand, it places low demands on the hardware's computing power, storage space, and power consumption, so it can be used in small, low-power embedded systems. At the same time, scene-based recognition effectively improves recognition accuracy and the user experience of speech recognition.
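The idea of spotting keywords in a continuous stream, rather than transcribing everything, can be illustrated with a sliding window and a score threshold. This is a toy sketch, not the patented model: `toy_score` stands in for the neural network by matching text fragments, and the window size, threshold, and keyword list are assumptions.

```python
from collections import deque


def first_keyword_hit(frames, score_fn, window=3, threshold=0.8):
    """Slide a fixed-size window over the stream and return (frame_index, keyword)
    for the first keyword whose score reaches the threshold, or None."""
    recent = deque(maxlen=window)
    for index, frame in enumerate(frames):
        recent.append(frame)
        for keyword, score in score_fn(list(recent)).items():
            if score >= threshold:
                return index, keyword
    return None


KEYWORDS = ["deadline", "complaint"]


def toy_score(window_frames):
    """Stand-in for the neural network: 1.0 if the keyword text is in the window."""
    text = "".join(window_frames)
    return {k: (1.0 if k in text else 0.0) for k in KEYWORDS}


hit = first_keyword_hit(["the", "dead", "line", "is", "friday"], toy_score)
miss = first_keyword_hit(["see", "you", "soon"], toy_score)
```

Only the last few frames are held in memory at any time, which is the property that makes this style of detection attractive on small embedded hardware.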

When the voice signal is found not to contain a keyword, the process returns to step 110 and continues to examine the audio stream acquired subsequently.

When the voice signal contains at least one keyword, steps 130 and 140 are performed.

Step 130: generate a key information reminder.

The key information reminder may include visual, tactile, and auditory reminders.

Visual reminders include light-effect reminders and text message reminders, such as an LED indicator flashing or showing a specific light effect, a flashing pattern on the remote screen, or a text message at the remote end (for example, a notification from a mobile app).

Tactile reminders include vibration reminders, such as a vibrating alert that follows a predetermined pattern.

Auditory reminders include voice reminders and music reminders, such as reminders that use predetermined voice content or music.

In a specific implementation, one or more of the above reminder methods can be selected according to the actual application scenario. For example, only a light-effect reminder or only a music reminder may be configured, or a message may be sent to the associated computer application (app) at the same time as a vibration reminder so as to achieve a double reminder.
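One way to make the reminder methods selectable per scenario is a small registry of channels. The sketch below is illustrative: the channel names and the strings the handlers return are assumptions, and real handlers would drive an LED, a vibrator, a speaker, or an app notification instead of returning text.

```python
def led_blink(keyword):
    return f"LED: blink for '{keyword}'"


def vibrate(keyword):
    return "VIBRATOR: predetermined pattern"


def play_tone(keyword):
    return "SPEAKER: ding-dong tone"


def app_message(keyword):
    return f"APP: keyword '{keyword}' detected"


# Hypothetical channel registry; names are chosen for this sketch only.
CHANNELS = {
    "light": led_blink,
    "vibration": vibrate,
    "tone": play_tone,
    "app": app_message,
}


def remind(keyword, enabled):
    """Run every enabled channel, skipping names that are not registered."""
    return [CHANNELS[name](keyword) for name in enabled if name in CHANNELS]


# The "double reminder" from the text: vibration plus an app message.
double = remind("complaint", ["vibration", "app"])
```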

Step 140: start recording the received audio stream.

In this embodiment, when the voice information of the current audio stream is found to contain a keyword, recording of the received audio stream is started at the same time as the reminder is generated, so that the user misses as little important content as possible.

When the audio stream is recorded, the recording may start at the keyword itself, or at the audio received after the keyword appears. Alternatively, it may start a fixed time before the keyword, using the segment of the current audio stream that has already been compressed and encoded on a rolling basis when the keyword appears. That is, the recorded audio stream may or may not include the audio in which the keyword appears, and may also include audio from before the keyword appeared.
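The three possible recording start points can be illustrated with a rolling buffer that always holds the most recent frames. This is a sketch under assumptions: frames are plain integers standing in for encoded audio, and the names `PreRollBuffer` and `initial_frames` and the mode labels are invented for illustration.

```python
from collections import deque


class PreRollBuffer:
    """Keeps only the most recent encoded frames (rolling buffer)."""

    def __init__(self, pre_roll_frames):
        self._ring = deque(maxlen=pre_roll_frames)

    def push(self, frame):
        self._ring.append(frame)  # the oldest frame drops out when full

    def snapshot(self):
        return list(self._ring)


def initial_frames(pre_roll, keyword_frame, mode):
    """Frames the recording starts with, for the three start points in the text.

    mode "before": rolling history plus the keyword frame
    mode "at":     the keyword frame only
    mode "after":  nothing yet; recording begins with the next frame
    """
    if mode == "before":
        return pre_roll + [keyword_frame]
    if mode == "at":
        return [keyword_frame]
    if mode == "after":
        return []
    raise ValueError(f"unknown mode: {mode}")


buffer = PreRollBuffer(pre_roll_frames=3)
for frame in range(6):          # frames 0..5 arrive before the keyword
    buffer.push(frame)
history = buffer.snapshot()     # only the last three frames are still held
```

A fixed-length buffer bounds memory use regardless of how long the stream runs, which fits the limited storage of an embedded device.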

The recorded audio stream is compressed, encoded, and stored locally for local playback. Recording continues until a recording stop instruction is received or the continuous recording time exceeds a first predetermined duration, at which point recording stops. The first predetermined duration takes the limited capacity of the local storage into account and can be set relatively short, for example 1 to 2 minutes. Important content usually appears in the speech shortly after the keyword, so even a short first predetermined duration is likely to capture the most important content and lets the user grasp it quickly when replaying the recording.
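The two stop conditions for local recording (an explicit stop instruction, or exceeding the first predetermined duration) can be sketched as below. Timestamps are passed in explicitly so the example is deterministic; the class name `Recorder` and the 90-second limit are assumptions, and compression and storage are omitted.

```python
class Recorder:
    """Local recording that ends on a stop request or after max_seconds."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds  # the "first predetermined duration"
        self.frames = []
        self.started_at = None
        self.active = False

    def start(self, now):
        self.frames, self.started_at, self.active = [], now, True

    def feed(self, frame, now, stop_requested=False):
        """Append a frame while recording is active; return whether it still is."""
        if not self.active:
            return False
        if stop_requested or now - self.started_at > self.max_seconds:
            self.active = False
            return False
        self.frames.append(frame)
        return True


rec = Recorder(max_seconds=90)
rec.start(now=0)
for t in (0, 30, 60, 90, 120):   # the frame at t=120 exceeds the limit
    rec.feed(f"frame@{t}", now=t)
```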

In a preferred implementation, when the voice signal contains a keyword, a recording start instruction may also be sent to the remote end, causing the remote end to start continuously recording the audio stream it sends and to store the recording remotely. A timer may be started after the recording start instruction is sent. If a stop-recording instruction is received before the remote end's continuous recording time exceeds a second predetermined duration, a recording stop instruction is sent to the remote end. The remote recording can thus be ended at any time within the second predetermined duration, which makes the recording length more controllable. When the continuous recording time exceeds the second predetermined duration, the remote end may stop recording automatically.

To help the user grasp important information as fully as possible and reduce omissions, the second predetermined duration may be set greater than or equal to the first predetermined duration. That is, the second predetermined duration is made relatively long, for example 2 to 5 minutes, so that a longer audio stream containing the key information is saved for the user to play back.

Of course, when the local storage space is large enough, the first predetermined duration may instead be set greater than or equal to the second predetermined duration. In that case a sufficiently long recording is stored locally while the remote end keeps a shorter one, so that the user or other people can play back the recording at the remote end to learn the key information quickly.
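The local side of the remote-recording exchange can be sketched as a small controller: it sends a start message when a keyword is detected, and forwards a stop only if the user asks within the second predetermined duration. The message strings `RECORD_START` and `RECORD_STOP`, the class name, and the 180-second value are assumptions; the source does not specify a wire format.

```python
class RemoteRecordingControl:
    """Local-side control of recording at the remote end (illustrative only)."""

    def __init__(self, send, second_duration):
        self.send = send                        # callable that transmits a message
        self.second_duration = second_duration  # the "second predetermined duration"
        self.started_at = None

    def on_keyword(self, now):
        self.send("RECORD_START")
        self.started_at = now                   # start timing

    def on_user_stop(self, now):
        """Forward a stop only while the remote end is still within its limit."""
        if self.started_at is None:
            return False
        within = now - self.started_at <= self.second_duration
        if within:
            self.send("RECORD_STOP")
        self.started_at = None
        return within


sent = []
control = RemoteRecordingControl(sent.append, second_duration=180)
control.on_keyword(now=0)
stopped_early = control.on_user_stop(now=100)   # within 180 s, so a stop is sent
control.on_keyword(now=500)
stopped_late = control.on_user_stop(now=700)    # remote end has already auto-stopped
```

After the limit has passed no stop message is needed, because the remote end is described as stopping on its own.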

In addition, in an optional implementation, the remote end may also perform full-text speech recognition on the audio stream while recording it, so as to obtain and store the corresponding text.

Step 150: in response to a playback instruction, play the recorded audio stream.

In this step, the locally recorded and stored audio stream may be played in response to an instruction to play back local audio. Alternatively, in response to an instruction to play back remote audio, a playback request may be sent to the remote end, and the recorded audio stream stored there is received and played.

In an optional implementation, recorded audio streams may be stored locally in order of recording start time; correspondingly, during playback they may be played one after another in order of recording start time.
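Keeping recordings ordered by start time, so that playback follows the same order, can be sketched with a sorted insert. The `Clip` and `ClipStore` names, the fields, and the sample values are assumptions for illustration.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Clip:
    start_time: float                        # the only field used for ordering
    keyword: str = field(compare=False)
    data: bytes = field(default=b"", compare=False)


class ClipStore:
    """Keeps clips sorted by recording start time."""

    def __init__(self):
        self._clips = []

    def add(self, clip):
        bisect.insort(self._clips, clip)     # insert at the sorted position

    def playback_order(self):
        return [clip.keyword for clip in self._clips]


store = ClipStore()
store.add(Clip(30.0, "deadline"))
store.add(Clip(10.0, "budget"))
store.add(Clip(20.0, "summary"))
order = store.playback_order()
```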

It should be noted that step 150 is performed in response to a playback instruction, so it need not be carried out after step 140. In other words, the method may be implemented so that playback instructions are detected at any time during use in order to play back recordings.

In a typical application scenario, the key information reminder method of this embodiment can be applied in a call center system. Call center operators usually answer hundreds of voice calls a day, which is very demanding work. Callers, because of differences in verbal ability, accents, or even their emotional state, often find it hard to express the main purpose of their call clearly in a short time. If the operator is not fully concentrated, it is easy to miss important information or even misunderstand the caller, with undesirable consequences. With the method of this embodiment, the operator wears a headset capable of key information reminding while answering calls. The headset automatically recognizes whether the caller's speech contains keywords such as "call the police", "complaint", or "fraudster", and promptly draws the operator's attention to the key information. The headset can also record the key information, or notify the remote end it is connected to (such as a call center management platform or a call transfer platform) to record it. The operator can then use the playback function to understand the key information more accurately and completely and to better understand the caller's intent. The key information reminder method of this embodiment thus not only reminds the operator promptly and effectively, but also helps the operator review the call, which reduces information loss and considerably eases the operator's workload.

Embodiment 2:

Referring to FIG. 2, in accordance with the core idea of the present invention, this embodiment provides an embedded audio playback device comprising a communication unit, a speaker, a control unit, a storage unit, a speech recognition unit, and a reminder unit.

The storage unit is used to store data, programs, and the like related to the operation of the device.

The communication unit may be a wired communication unit or a wireless communication unit, or may include both a wired communication module and a wireless communication module. Specifically, the communication unit may be implemented as a Bluetooth communication unit, a Wi-Fi communication unit, an Internet network interface, a dedicated wired audio transmission interface, a USB interface, a Micro USB interface, a Mini USB interface, a Type-C interface, a Lightning interface, or any other communication unit, known now or available in the future, that can be used in this embodiment.

The communication unit receives the audio stream from the remote end.

The speech recognition unit is used to extract a voice signal from the audio stream and to detect in real time, using the scene-based keyword recognition model, whether the voice signal contains a keyword.

The control unit is the control center of the device. It connects to the other units of the device through various interfaces and lines, and monitors and schedules them as a whole to implement the device's functions. In particular, when the voice signal contains a keyword, it starts recording the received audio stream and controls the reminder unit to output a key information reminder.

In this embodiment, the keywords are associated with the application scenario and comprise a group of words that deserve particular attention in that scenario, one or more of which are specified in advance by the user.

The speech recognition unit includes a keyword recognition model unit, which stores the scene-based keyword recognition model. The scene-based keyword recognition model is trained in advance on a training sample library that contains speech samples for the keywords and/or speech samples of a specific person for the keywords. In a preferred implementation, the scene-based keyword recognition model is trained with a deep learning algorithm, and the speech recognition unit can use it to perform continuous-speech keyword recognition so as to detect in real time whether the voice signal contains a keyword.

The speech recognition unit may also include a speech preprocessing unit, which preprocesses the input audio stream to remove noise, music, background voices, and the like, and to extract a voice signal with a high signal-to-noise ratio.

The speech recognition unit may also include a neural network processing unit, which processes the voice signal with a deep learning algorithm based on the keyword recognition model, so that the words appearing in the voice signal are inferred and judged to determine whether they include a keyword. The neural network processing unit may be an embedded neural-network processing unit (NPU), a dedicated neural network array processing unit, a DSP, an embedded processor, or any other processing module that can be used to process massive multimedia data in a neural network.

In this embodiment, the keyword recognition model is trained externally and downloaded into the device before use. The control unit is therefore also used to download the scene-based keyword recognition model from the remote end through the communication unit.

The reminder unit is one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module, and a music message generation module. The indicator light module may be an LED indicator that outputs a reminder by flashing or by displaying a specific pattern. The vibrator module may vibrate at a predetermined frequency. The text message generation module may generate a text message in a predetermined message format, such as a text message containing the currently recognized keyword. The voice message generation module may generate a voice message in a predetermined voice message format, such as a voice message containing the currently recognized keyword. The sound message generation module may select a segment from pre-stored sound data in a preset manner to serve as a sound message, such as a "beep-beep" or "ding-dong" tone.

The speaker is used to play the audio stream, to play back the recorded audio stream, or to play the voice message or sound message. It should be understood that in some specific implementations the speaker may work with the control unit and the storage unit to take over the function of the reminder unit, for example when only sound reminders are used.

所述嵌入式音频播放装置还包括输入单元,用于接收用户输入的各项控制指令,例如,接收用户输入的回放指令、停止提醒指令、录制停止指令等。The embedded audio playback device further includes an input unit for receiving various control instructions input by the user, for example, receiving a playback instruction, a stop reminder instruction, a recording stop instruction and the like input by the user.

所述输入单元可以为触控面板、按键、语音命令输入模组等各种机械或语音输入模组。The input unit may be various mechanical or voice input modules such as touch panels, buttons, and voice command input modules.

The storage unit is used to store the recorded audio stream.

In an optional implementation, when the voice signal contains a keyword, the control unit starts to continuously compress and encode the received audio stream and store it locally; the control unit stops recording when it receives a recording stop instruction or when the continuous recording time exceeds a first predetermined duration;

when it receives an instruction to play back local audio, the control unit plays the locally stored recorded audio stream;

the control unit is further configured to send a recording start instruction to the remote end when the voice signal contains a keyword, so that the remote end starts to continuously record the audio stream it is sending, and to send a recording stop instruction to the remote end when a stop recording instruction is received while the continuous recording time has not exceeded a second predetermined duration;

when it receives an instruction to play back remote audio, the control unit sends a playback request to the remote end, and receives and plays the recorded audio stream stored at the remote end.
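The control-unit behaviour described above can be sketched as a small state holder. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the class name `RecordingController`, the message strings `RECORD_START` and `RECORD_STOP`, and the plain list standing in for the communication unit are all hypothetical. The source fixes only the trigger conditions and the two predetermined durations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordingController:
    """Hypothetical model of the control unit's recording logic."""
    first_limit_s: float    # first predetermined duration (local recording)
    second_limit_s: float   # second predetermined duration (remote recording)
    local_start: Optional[float] = None
    remote_start: Optional[float] = None
    sent: List[str] = field(default_factory=list)  # stand-in for the communication unit

    def on_keyword(self, now: float) -> None:
        # A keyword starts local recording and asks the remote end to record too.
        if self.local_start is None:
            self.local_start = now
        if self.remote_start is None:
            self.remote_start = now
            self.sent.append("RECORD_START")

    def on_tick(self, now: float) -> None:
        # Local recording ends by itself once the first duration is exceeded.
        if self.local_start is not None and now - self.local_start > self.first_limit_s:
            self.local_start = None

    def on_stop_instruction(self, now: float) -> None:
        # A user stop ends local recording. It is forwarded to the remote end
        # only while the second duration has not been exceeded.
        self.local_start = None
        if self.remote_start is not None:
            if now - self.remote_start <= self.second_limit_s:
                self.sent.append("RECORD_STOP")
            self.remote_start = None
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to exercise on a host machine.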

In addition, the embedded audio playback device may further include a power supply unit that provides the power required for the device to operate. The power supply unit may be a power supply circuit module powered by a button battery or a rechargeable battery, a power management module that powers the device from an external power input, or a circuit module that draws power through a wired communication interface.

Obviously, the embedded audio playback device of this embodiment can be used to implement some or all of the methods, processes or steps of the key information reminder method described in Embodiment 1. Descriptions of parts that are the same as or similar to Embodiment 1 are not repeated here.

The embedded audio playback device may be implemented as a head-mounted audio playback device, such as any of various wired or wireless earphone devices; as any of various portable speakers; or as an accessory device for a mobile phone or computer, such as a phone watch, a portable game console, or a portable multimedia player. For example, in a typical application scenario, the embedded audio playback device is a speaker with a call function. An LED indicator is provided on the housing of the speaker, and a scene-based keyword recognition model is downloaded into the speaker in advance, so that the speaker can continuously detect, in real time, the voice information it is currently playing. When the current voice information contains a keyword, the LED indicator starts flashing to alert the user. The speaker has an intelligent voice control function: the user can issue control instructions by voice to make the speaker turn off the LED indicator, stop recording, start playback, and so on. For the detailed process by which the speaker implements the key information reminder, refer to Embodiment 1 above and the relevant parts of this embodiment; it is not repeated here.

Embodiment 3

Based on the core idea of the present invention, this embodiment provides a key information reminder system comprising an embedded audio playback device and a remote device.

The remote device receives a user-defined keyword vocabulary, and/or a user-provided voice sample of a specific person that contains at least the keywords, for use in obtaining a scene-based keyword recognition model; the keywords are associated with an application scenario and comprise a set of words that require particular attention in that application scenario;

the scene-based keyword recognition model is obtained in advance by training on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords;

the embedded audio playback device communicates with the remote device, and receives and plays the audio stream from the remote device; the communication may take any suitable form, such as wired communication (for example Ethernet, USB, Lightning, or optical fiber) or wireless communication (for example WiFi, Bluetooth, or IR);

the embedded audio playback device further obtains a voice signal from the audio stream, performs speech recognition on the voice signal using the scene-based keyword recognition model, and detects in real time whether the voice signal contains a keyword;

when the voice signal contains a keyword, the embedded audio playback device generates a key information reminder and starts recording the received audio stream;

the embedded audio playback device plays the recorded audio stream in response to a playback instruction.

As one option, the keyword recognition model is trained on the remote device: the remote device uses the user-defined keyword vocabulary and/or the user-provided voice sample of a specific person containing at least the keywords to expand its standard sample library into a training sample library, and trains on the training sample library to obtain the scene-based keyword recognition model;

the remote device downloads the scene-based keyword recognition model to the embedded audio playback device.

As another optional implementation, the keyword recognition model is trained in the cloud, and the system further includes a cloud server;

the remote device communicates with the cloud server and sends the keywords and/or the voice sample of the specific person to the cloud server;

the cloud server uses the received keywords and the voice sample of the specific person to expand its standard sample library into a training sample library, and trains on the training sample library to obtain the scene-based keyword recognition model;

the remote device receives the scene-based keyword recognition model from the cloud server and downloads it to the embedded audio playback device.

Obviously, the key information reminder system of this embodiment can be used to implement some or all of the methods, processes or steps of the key information reminder method described in Embodiment 1. The embedded audio playback device described in Embodiment 2 can also be used to implement the key information reminder system of this embodiment. For similar technical details, refer to the descriptions of the foregoing embodiments; they are not repeated here.

A typical application scenario is described below as an example, to explain the core idea of the embodiments of the present invention more clearly and in more detail.

Referring to FIG. 3, in this application scenario the key information reminder system includes a video playback device (such as a tablet computer) 300, an earphone 310 and a cloud server 320.

The earphone 310 may be an over-ear, in-ear or ear-hook earphone; it may be wired or wireless; it may have a single earpiece 311 or left and right earpieces 311; and the left and right earpieces 311 may be joined or separate.

The earphone 310 communicates with the video playback device 300 by wire or wirelessly and thereby receives the audio stream from the video playback device 300. The video playback device 300 may be the user's personal computer, tablet computer, smart TV, mobile phone, or the like. The user watches video programs on the video playback device 300. FIG. 3 shows a student attending an online class on a tablet computer.

The video playback device 300 can also access the cloud server 320 over a network, which may be a local area network, a wide area network, a cellular network, or a combination thereof.

The earphone 310 is provided with an LED indicator 312 and buttons 313 to 316. The LED indicator 312 can emit a flashing red light; button 313 is the volume up button, button 314 is the play/pause button, button 315 is the stop reminder/stop recording/playback button, and button 316 is the volume down button. Button 315 may be configured so that a single press stops the reminder, stops recording and starts playback all at once, or so that a single press stops the reminder and the recording and two consecutive presses start playback. The configuration can be chosen according to the actual implementation environment and is not specifically limited by the present invention.

The LED indicator 312 may also be placed on an external microphone (not shown) of the earphone 310. When wearing the earphone, the user can position the external microphone in front of the lips, so that the LED indicator 312 is easier to see when it lights up as a reminder.

In addition, a vibrator (not shown) is provided in the earphone 310. The vibrator may be implemented with any existing or future applicable technology, which is not specifically limited by the present invention; for example, it may be an eccentric motor with a cam.

The cloud server 320 can train and generate the keyword recognition model using the aforementioned deep learning algorithm. In a specific implementation, the cloud server 320 may collect a wide range of voice samples in advance and, after processing such as vocabulary labeling, form them into a standard sample library.

In this application scenario, the key information reminder system implements the key information reminder as follows:

Step 1: initialization.

Before the key information reminder process starts, an initialization step checks and updates the software and hardware environment configuration and the parameter settings that the devices in the system need in order to run and communicate.

This includes setting the keywords and obtaining a new keyword recognition model, specifically:

The user sets the keyword vocabulary through the video playback device 300. For example, before an online class a student can enter words such as "key point", "exam" and "summary", as well as the student's own name, as keywords. Letting users set keywords themselves yields personalized keywords that fit the current application scenario.

To match the hardware power consumption and computing power of the earphone 310, the number of keyword words is capped at 20.
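The cap on the keyword vocabulary could be enforced on the video playback device when the user edits the list. The following is a minimal sketch assuming a hypothetical helper `add_keywords`; only the limit of 20 comes from the text, while de-duplication and whitespace trimming are illustrative assumptions.

```python
from typing import Iterable, List

MAX_KEYWORDS = 20  # upper limit stated for the earphone 310


def add_keywords(current: Iterable[str], new_words: Iterable[str]) -> List[str]:
    """Return the merged keyword list, refusing to exceed MAX_KEYWORDS."""
    merged = list(current)
    for word in new_words:
        word = word.strip()
        if not word or word in merged:
            continue  # assumption: blanks and duplicates are silently ignored
        if len(merged) >= MAX_KEYWORDS:
            raise ValueError(f"at most {MAX_KEYWORDS} keywords are allowed")
        merged.append(word)
    return merged
```

Raising an error rather than truncating silently lets the user interface tell the student which word did not fit.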

When a new word is entered into the keywords on the video playback device 300, the video playback device 300 accesses the cloud server 320, sends it a request to update the keyword recognition model, and sends it the keywords.

After receiving the keywords, the cloud server 320 can compare their words with the keyword vocabulary it already holds. If every word in the keywords sent by the video playback device 300 is already among the keywords held by the cloud server 320, the existing standard sample library is used directly as the training sample library, and a new scene-based keyword recognition model is trained for those keywords with the deep learning algorithm. If some of the words are not among the keywords held by the cloud server 320, voice samples containing those words are obtained from the Internet and added to the standard sample library to form the training sample library, and a new keyword recognition model is then trained.
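The comparison performed by the cloud server reduces to a set-membership check that decides whether the standard sample library must be expanded before training. A minimal sketch, assuming a hypothetical function `plan_training`; sample collection and training themselves are outside its scope.

```python
from typing import Iterable, List, Set, Tuple


def plan_training(requested: Iterable[str], known: Set[str]) -> Tuple[bool, List[str]]:
    """Decide how to build the training sample library.

    Returns (use_standard_library_as_is, missing_words). When missing_words is
    non-empty, voice samples for those words must be added to the standard
    sample library before training.
    """
    missing = [word for word in requested if word not in known]
    return (len(missing) == 0, missing)
```

For instance, a request containing a student's name that the server has never seen would come back with that name listed as missing.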

The user can also upload, through the video playback device 300, a voice sample of a specific person that contains one or more of the keyword words; for example, a student uploads audio of a particular teacher's voice to the video playback device 300. The video playback device 300 uploads the voice sample of the specific person to the cloud server 320 to expand the standard sample library of the cloud server 320, so that the cloud server 320 can train a new keyword recognition model on a training sample library that contains at least the voice sample of the specific person containing the keywords.

In response to the update request from the video playback device 300, the cloud server 320 sends the trained scene-based keyword recognition model to the video playback device 300.

After receiving the keyword recognition model from the cloud server 320, the video playback device 300 downloads it to the earphone 310, so that the earphone 310 updates its locally stored keyword recognition model.

It should be noted that setting the keywords and obtaining a new keyword recognition model may be done in the initialization step or at any suitable time while the system is running, as determined by the actual situation; the present invention places no limitation on this.

Step 2: the earphone 310 receives the audio stream.

After system initialization is complete, the user can press button 314 on the earphone 310 to start receiving and playing the audio stream from the video playback device 300. For example, the student now attends the online course using the earphone 310 and the tablet computer 300.

Step 3: the earphone 310 obtains the voice signal from the audio stream, performs speech recognition on it, and uses the scene-based keyword recognition model to detect in real time whether the voice signal contains a preset keyword.

The earphone 310 has a built-in speech recognition unit, which may be an embedded neural network processor. It builds a neural network from the keyword recognition model and processes data with a deep learning algorithm, so as to recognize keywords in the continuously input voice signal in real time.

The audio stream of an online course may contain various sound signals such as music and speech. The earphone 310 extracts the voice signal and uses the scene-based keyword recognition model and the deep learning algorithm to detect whether it contains a preset keyword. For example, if the student has preset the keyword "summary", then when the teacher says "Next, let's do a summary of the main content of this lesson", the keyword is detected in the current voice signal. If the student has also set his or her own name or student number as a keyword, the earphone 310 serves as a useful reminder when the teacher calls on the student.

When no keyword is recognized, the earphone 310 continues to receive and play the audio stream and does not proceed to the following steps. It should be understood that while the system is issuing a key information reminder, the earphone 310 can continue to receive and play the audio stream unaffected.
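The text does not say how the neural network's outputs become a yes/no decision. One common approach in keyword spotting, shown here purely as an assumption, is to average each keyword's per-frame score over a short window and fire when the average crosses a threshold. The class `KeywordTrigger`, the window length and the threshold are hypothetical, and the model that produces the scores is not shown.

```python
from collections import deque
from typing import Deque, Dict, List


class KeywordTrigger:
    """Hypothetical post-processing of per-frame keyword scores."""

    def __init__(self, window: int = 5, threshold: float = 0.8) -> None:
        self.window = window
        self.threshold = threshold
        self.history: Dict[str, Deque[float]] = {}

    def update(self, scores: Dict[str, float]) -> List[str]:
        """Feed one frame of scores; return the keywords detected on this frame."""
        fired = []
        for keyword, score in scores.items():
            recent = self.history.setdefault(keyword, deque(maxlen=self.window))
            recent.append(score)
            # Fire only on a full window whose mean score is high enough.
            if len(recent) == self.window and sum(recent) / self.window >= self.threshold:
                fired.append(keyword)
                recent.clear()  # avoid firing again on the same utterance
        return fired
```

An empty return value corresponds to the case where no keyword is recognized and playback simply continues.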

Step 4: the earphone 310 generates a key information reminder and records the audio stream.

When the earphone 310 detects that the current voice signal contains a preset keyword, it starts its vibrator. The user can stop the vibration with button 315. If the user has not stopped the vibration within a predetermined vibration time, for example 10 seconds, the vibration stops automatically and the LED indicator 312 starts to emit a flashing red light. The red light may keep flashing for a longer flashing time, or keep flashing until the user stops it with button 315. If, before generating a new vibration, the earphone 310 finds that the LED indicator 312 is currently active (flashing red), it does not generate a new vibration and instead leaves the LED indicator 312 in its current state. In this way, a student who is still wearing the earphone is alerted to the key information by vibration, while a student who has taken the earphone off is alerted by the light.
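The escalation from vibration to LED can be read as a three-state machine. The sketch below is one interpretation, assuming hypothetical names (`State`, `Reminder`) and the variant in which the LED flashes until button 315 is pressed. What happens when a second keyword arrives while the vibrator is still running is not stated in the text; the sketch assumes the current vibration simply continues.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    VIBRATING = "vibrating"
    LED_FLASHING = "led_flashing"


class Reminder:
    """Hypothetical reminder state machine for the earphone 310."""

    def __init__(self, vibration_timeout_s: float = 10.0) -> None:
        self.timeout = vibration_timeout_s
        self.state = State.IDLE
        self.vibration_start: Optional[float] = None

    def on_keyword(self, now: float) -> None:
        if self.state is State.IDLE:
            self.state = State.VIBRATING
            self.vibration_start = now
        # LED_FLASHING: no new vibration, the LED keeps its current state.
        # VIBRATING: assumption, the running vibration is left untouched.

    def on_tick(self, now: float) -> None:
        # Vibration that the user did not stop in time escalates to the LED.
        if self.state is State.VIBRATING and now - self.vibration_start > self.timeout:
            self.state = State.LED_FLASHING

    def on_button_315(self) -> None:
        # The user acknowledges the reminder.
        self.state = State.IDLE
        self.vibration_start = None
```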

While generating the key information reminder, the earphone 310 also starts recording the received audio stream. Specifically:

The recorded audio stream is stored locally for up to a first predetermined duration. The first predetermined duration should be less than or equal to the maximum length of audio that the earphone 310 can store, and may be a preset fixed value. For example, if the earphone 310 can store at most 2 minutes of audio, the first predetermined duration may be 2 minutes; alternatively it may be 30 seconds, in which case the earphone 310 can store up to 4 audio clips of at most 30 seconds each.
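The relationship between storage capacity and the first predetermined duration is a simple integer division. A small sketch, with `max_clips` as a hypothetical helper name:

```python
def max_clips(capacity_s: int, first_duration_s: int) -> int:
    """Number of full-length clips that fit in the earphone's audio storage."""
    if not 0 < first_duration_s <= capacity_s:
        raise ValueError("first duration must be positive and no longer than capacity")
    return capacity_s // first_duration_s
```

With the figures from the text, `max_clips(120, 30)` gives 4 and `max_clips(120, 120)` gives 1.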

When it starts recording the received audio stream, the earphone 310 also sends a recording start instruction and the detected keyword to the video playback device 300.

After receiving the recording start instruction from the earphone 310, the video playback device 300 starts recording the audio stream that it is sending.

Step 5: the video playback device 300 converts the recorded voice signal into text and stores it.

The video playback device 300 can obtain the voice signal from the recorded audio stream, convert it in full into text using any existing speech-to-text method, and store the text. When storing, it can also store the detected keyword received from the earphone 310, the text and the recording in association with one another, so that the user can later select and consult them.
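Associated storage of keyword, text and recording could look like the record type below. This is only a sketch: `ClipRecord`, `ClipStore` and the idea of referring to the recording by a file path are assumptions, and the speech-to-text step is taken as already done.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClipRecord:
    keyword: str      # keyword reported by the earphone 310
    transcript: str   # full text converted from the recorded voice signal
    audio_path: str   # hypothetical location of the stored recording


class ClipStore:
    """In-memory index of recordings, grouped by the keyword that triggered them."""

    def __init__(self) -> None:
        self._by_keyword: Dict[str, List[ClipRecord]] = {}

    def add(self, record: ClipRecord) -> None:
        self._by_keyword.setdefault(record.keyword, []).append(record)

    def lookup(self, keyword: str) -> List[ClipRecord]:
        # Return a copy so callers cannot mutate the index.
        return list(self._by_keyword.get(keyword, []))
```

Indexing by keyword lets the student pull up every passage where, say, "exam" was mentioned, together with its text and audio.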

Step 6: recording stops.

When the user inputs a recording stop instruction through button 315, or when the continuous recording time exceeds the first predetermined duration without a stop recording instruction from the user, the earphone 310 stops recording the audio stream.

When the user inputs a recording stop instruction through button 315, or when the continuous recording time exceeds the second predetermined duration without a stop recording instruction from the user, the video playback device 300 stops recording the audio stream.

Step 7: recording playback.

In this embodiment, the user can play back the recording on the earphone 310 or on the video playback device 300.

For example, when the student starts local playback by pressing button 315 twice in succession, the earphone 310 plays the locally stored recorded audio stream while continuing to play the audio stream from the video playback device 300. The two audio streams may be mixed before playback, or one of the two earpieces 311 may play one stream while the other earpiece plays the other.

Alternatively, when the student starts remote playback by pressing button 315 three times in succession, the earphone 310 sends a playback request instruction to the video playback device 300; on receiving it, the video playback device 300 sends the audio stream it has recorded to the earphone 310.

In addition, the student can enter a playback instruction directly on the video playback device 300 to play the recorded audio stream stored there.

The student can also choose, on the video playback device 300, which of its recorded audio streams to play.
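The button handling and the two playback layouts in step 7 can be sketched as follows. The action names, the function names `actions_for` and `route_streams`, and the equal-weight averaging used for mixing are illustrative assumptions; the text only says that the two streams are either mixed or sent to separate earpieces.

```python
from typing import Dict, List, Tuple

# Press counts of button 315 under the configuration used in this scenario.
ACTIONS: Dict[int, Tuple[str, ...]] = {
    1: ("stop_reminder", "stop_recording"),
    2: ("play_local_recording",),
    3: ("request_remote_playback",),
}


def actions_for(press_count: int) -> Tuple[str, ...]:
    """Map consecutive presses of button 315 to actions; unknown counts do nothing."""
    return ACTIONS.get(press_count, ())


def route_streams(live: List[float], recorded: List[float],
                  mode: str = "mix") -> Tuple[List[float], List[float]]:
    """Return (left, right) sample lists for the two earpieces 311."""
    n = max(len(live), len(recorded))
    live = live + [0.0] * (n - len(live))              # pad the shorter stream
    recorded = recorded + [0.0] * (n - len(recorded))  # with silence
    if mode == "mix":
        mixed = [(a + b) / 2 for a, b in zip(live, recorded)]
        return mixed, mixed
    if mode == "split":
        return live, recorded  # one stream per earpiece
    raise ValueError(f"unknown mode: {mode}")
```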

Step 8: consulting the text.

In this step, the student can consult, on the video playback device 300, the text corresponding to the recorded audio stream, which makes it easier to review and take notes.

As the above embodiments and typical application scenario show, the key information reminder method, system and embedded audio playback device provided by the embodiments of the present invention perform real-time detection, reminding and playback of key information in continuous speech on a small, low-power embedded device. They are convenient to use, simple to operate and widely applicable; they effectively remind the user of key information and allow it to be saved and reviewed, reducing the loss caused by missed key information and increasing user satisfaction with remote audio and video applications.

Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Those of ordinary skill in the art may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of this application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that they are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (15)

1. An embedded audio playback device, comprising a speaker and a communication unit, characterized in that it further comprises a control unit, a storage unit, a speech recognition unit and a reminder unit, wherein:
the communication unit receives an audio stream from a remote end;
the speech recognition unit comprises a keyword recognition model unit for storing a scene-based keyword recognition model;
the keywords are associated with an application scenario and comprise a set of words that require particular attention in that application scenario, one or more of the words being specified in advance by the user;
the speech recognition unit extracts a voice signal from the audio stream and uses the scene-based keyword recognition model to detect in real time whether the voice signal contains a keyword;
the control unit is configured to, when the voice signal contains a keyword, start recording the received audio stream and control the reminder unit to output a key information reminder;
the storage unit is used to store the recorded audio stream;
the speaker is used to play the audio stream or, in response to a playback instruction, to play back the recorded audio stream.

2. The embedded audio playback device according to claim 1, characterized in that the scene-based keyword recognition model is obtained in advance by training with a deep learning algorithm on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords;
the control unit is further configured to download the scene-based keyword recognition model from the remote end through the communication unit.

3. The embedded audio playback device according to claim 2, characterized in that the speech recognition unit further comprises a voice preprocessing unit for preprocessing the input audio stream to remove noise, background voices and music, and to extract the voice signal;
the speech recognition unit further comprises a neural network processing unit for processing, based on the keyword recognition model and using a deep learning algorithm, the voice signal or the voice signal output by the voice preprocessing unit, so as to infer and decide on the words appearing in the voice signal and determine whether they include a keyword.

4. The embedded audio playback device according to claim 1, characterized in that the reminder unit is one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module, and a music message generation module.

5. The embedded audio playback device according to claim 1, characterized in that it further comprises an input unit for receiving a recording stop instruction and a playback instruction input by the user;
when the voice signal contains a keyword, the control unit starts to continuously compress and encode the received audio stream and store it locally;
the control unit stops recording when it receives a recording stop instruction or when the continuous recording time exceeds a first predetermined duration;
when it receives an instruction to play back local audio, the control unit plays the locally stored recorded audio stream;
the control unit is further configured to send a recording start instruction to the remote end when the voice signal contains a keyword, so that the remote end starts to continuously record the audio stream it is sending, and to send a recording stop instruction to the remote end when a stop recording instruction is received while the continuous recording time has not exceeded a second predetermined duration;
when it receives an instruction to play back remote audio, the control unit sends a playback request to the remote end, and receives and plays the recorded audio stream stored at the remote end.

6. The embedded audio playback device according to any one of claims 1 to 5, characterized in that the embedded audio playback device is an earphone or a speaker with a call function.

7. A key information reminder system, characterized in that it comprises an embedded audio playback device and a remote device, wherein:
the remote device receives a user-defined keyword vocabulary, and/or a user-provided voice sample of a specific person that contains at least the keywords, for use in obtaining a scene-based keyword recognition model; the keywords are associated with an application scenario and comprise a set of words that require particular attention in that application scenario;
the scene-based keyword recognition model is obtained in advance by training on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords;
the embedded audio playback device communicates with the remote device, and receives and plays the audio stream from the remote device;
the embedded audio playback device further obtains a voice signal from the audio stream, performs speech recognition on the voice signal using the scene-based keyword recognition model, and detects in real time whether the voice signal contains a keyword;
when the voice signal contains a keyword, the embedded audio playback device generates a key information reminder and starts recording the received audio stream;
the embedded audio playback device plays the recorded audio stream in response to a playback instruction.

8. 
key information reminder system as claimed in claim 7, is characterized in that: also comprises cloud server, 所述远端设备与所述云服务器通信,将所述关键词和/或特定人的语音样本发送至所述云服务器;The remote device communicates with the cloud server, and sends the keyword and/or the voice sample of a specific person to the cloud server; 所述云服务器将接收到的关键词和/或所述特定人的语音样本用于对其标准样本库进行扩充形成训练样本库,并基于所述训练样本库,采用深度学习算法训练获得所述基于场景的关键词识别模型;The cloud server uses the received keywords and/or the voice samples of the specific person to expand its standard sample library to form a training sample library, and uses deep learning algorithm training to obtain the training sample library based on the training sample library. Scenario-based keyword recognition model; 所述远端设备接收来自所述云服务器的基于场景的关键词识别模型,并将所述基于场景的关键词识别模型下载至所述嵌入式音频播放装置。The remote device receives the scene-based keyword recognition model from the cloud server, and downloads the scene-based keyword recognition model to the embedded audio playback device. 9.如权利要求7所述关键信息提醒系统,其特征在于:所述远端设备将用户输入的关键词词汇和/或用户提供的至少包含所述关键词的特定人的语音样本用于对标准样本库进行扩充,形成训练样本库,并基于所述训练样本库,采用深度学习算法训练获得所述基于场景的关键词识别模型;9. The key information reminder system according to claim 7, wherein the remote device uses the keyword vocabulary input by the user and/or the voice sample of a specific person provided by the user that at least contains the keyword for the The standard sample library is expanded to form a training sample library, and based on the training sample library, a deep learning algorithm is used to train to obtain the scene-based keyword recognition model; 所述远端设备将所述基于场景的关键词识别模型下载至所述嵌入式音频播放装置。The remote device downloads the scene-based keyword recognition model to the embedded audio playback device. 10.一种关键信息提醒方法,其特征在于:10. 
A key information reminder method, characterized in that: 接收用户自定义的关键词词汇,和/或用户提供的、至少包含所述关键词的特定人的语音样本;所述关键词和应用场景关联,包含一组在该应用场景中需要重点关注的词汇;Receive a user-defined keyword vocabulary, and/or a user-provided voice sample of a specific person that contains at least the keyword; the keyword is associated with an application scenario, and includes a set of items that need to be focused on in the application scenario vocabulary; 基于包含针对所述关键词的语音样本,和/或针对所述关键词的特定人的语音样本的训练样本库,训练获得所述基于场景的关键词识别模型;The scene-based keyword recognition model is obtained by training based on a training sample library comprising speech samples for the keyword and/or speech samples of a specific person for the keyword; 在接收和播放音频流时,自所述音频流中获取语音信号;When receiving and playing an audio stream, obtain a voice signal from the audio stream; 采用所述基于场景的关键词识别模型针对所述语音信号进行语音识别,实时检测所述语音信号中是否包含关键词;Use the scene-based keyword recognition model to perform speech recognition on the voice signal, and detect in real time whether the voice signal contains keywords; 当所述语音信号中包含关键词时,产生关键信息提醒,并开始录制所接收的音频流;When a keyword is included in the voice signal, a key information reminder is generated, and the received audio stream is started to be recorded; 响应于回放指令,播放所录制的音频流。In response to the playback instruction, the recorded audio stream is played. 11.如权利要求10所述关键信息提醒方法,其特征在于:11. 
key information reminder method as claimed in claim 10, is characterized in that: 预先采集广泛的语音样本,形成标准样本库;Collect a wide range of speech samples in advance to form a standard sample library; 根据所述关键词获取至少包含所述关键词的语音样本;Acquiring at least a voice sample containing the keyword according to the keyword; 将所述包含所述关键词的语音样本和/或所述特定人的语音样本扩充至所述标准样本库,形成训练样本库,基于所述训练样本库采用深度学习算法训练获得所述基于场景的关键词识别模型。Expanding the speech samples containing the keywords and/or the speech samples of the specific person to the standard sample library to form a training sample library, and using deep learning algorithm training based on the training sample library to obtain the scene-based keyword recognition model. 12.如权利要求10所述关键信息提醒方法,其特征在于:所述的自所述音频流中获取语音信号步骤中,还包括消除噪声、音乐声、背景人声的预处理步骤;12. The key information reminder method as claimed in claim 10, characterized in that: in the described step of acquiring the voice signal from the audio stream, it also includes a preprocessing step of eliminating noise, musical sound, and background vocals; 所述采用所述基于场景的关键词识别模型针对所述语音信号或预处理后的语音信号进行语音识别,实时检测所述语音信号中是否包含关键词,具体包括:构建基于所述关键词识别模型的深度学习神经网络,将语音信号连续输入所述深度学习神经网络进行数据处理,以对所述语音信号中出现的词汇进行推理和判决,确定其中是否包含关键词词汇。The using the scene-based keyword recognition model to perform speech recognition on the speech signal or the preprocessed speech signal, and detecting in real time whether the speech signal contains keywords, specifically includes: constructing a recognition system based on the keyword The deep learning neural network of the model continuously inputs the speech signal into the deep learning neural network for data processing, so as to infer and judge the words appearing in the speech signal, and determine whether a keyword word is included in the speech signal. 13.如权利要求10所述关键信息提醒方法,其特征在于:所述录制所接收的音频流,具体包括:在所述语音信号中包含关键词时,开始对接收到的音频流进行持续压缩编码并本地存储;13. 
The key information reminder method according to claim 10, wherein the recording of the received audio stream specifically comprises: when the voice signal contains a keyword, starting to continuously compress the received audio stream encoded and stored locally; 接收到录制停止指令或持续录制时间超过第一预定时长时,停止本地录制;When receiving a recording stop instruction or the continuous recording time exceeds the first predetermined duration, stop the local recording; 所述响应于回放指令,播放所录制的音频流,具体包括:响应于回放本地音频指令,播放本地存储的录制音频流。The playing the recorded audio stream in response to the playback instruction specifically includes: in response to the playback local audio instruction, playing the locally stored recorded audio stream. 14.如权利要求10所述关键信息提醒方法,其特征在于:所述录制所接收的音频流,具体包括:在所述语音信号中包含关键词时,向远端发送录制开始指令,远端开始对所发送的音频流持续录制,并进行远端存储;14. key information reminder method as claimed in claim 10, it is characterised in that: described recording the received audio stream, specifically comprising: when the voice signal contains a keyword, sending a recording start instruction to the far end, the far end Start continuous recording of the sent audio stream and store it remotely; 持续录制时间未超过第二预定时长且接收到停止录制指令时,向远端发送录制停止指令,远端停止录制;When the continuous recording time does not exceed the second predetermined duration and a stop recording instruction is received, the recording stop instruction is sent to the remote end, and the remote end stops recording; 所述响应于回放指令,播放所录制的音频流,具体包括:响应于回放远端音频指令,向远端发送回放请求,并接收和播放远端存储的录制音频流。The playing the recorded audio stream in response to the playback instruction specifically includes: in response to the playback remote audio instruction, sending a playback request to the remote end, and receiving and playing the recorded audio stream stored at the remote end. 15.如权利要求10所述关键信息提醒方法,其特征在于:所述关键信息提醒为视觉提醒、触觉提醒和听觉提醒中的一种或多种形式的组合;15. 
The key information reminder method according to claim 10, wherein the key information reminder is a combination of one or more forms in a visual reminder, a tactile reminder and an auditory reminder; 所述视觉提醒包括光效提醒、远端文字消息提醒;The visual reminder includes light effect reminder and remote text message reminder; 所述触觉提醒包括振动提醒;The tactile reminder includes a vibration reminder; 所述听觉提醒包括语音提醒、音乐提醒。The auditory reminders include voice reminders and music reminders.
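The control flow recited in claims 1, 5 and 13 (detect a keyword, issue a reminder, record until a stop instruction arrives or a time limit is reached, then play back on request) can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `RecordingController` and all of its members are hypothetical names, text tokens stand in for audio frames, a plain set lookup stands in for the deep-learning keyword recognition model, and a frame count stands in for the "first predetermined duration".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class RecordingController:
    """Hypothetical stand-in for the control unit; not the patent's design."""
    keywords: Set[str]              # lower-case keywords for one scenario
    max_frames: int                 # stand-in for the first predetermined duration
    remind: Callable[[str], None]   # stand-in for the reminder unit
    clips: List[List[str]] = field(default_factory=list)  # stand-in for storage
    recording: bool = False

    def on_frame(self, frame: str) -> None:
        """Process one unit of the incoming stream (here a text token)."""
        # Assumption: a set lookup replaces the keyword recognition model.
        if not self.recording and frame.lower() in self.keywords:
            self.remind(frame)       # output the key information reminder
            self.clips.append([])    # open a new recorded clip
            self.recording = True
        if self.recording:
            clip = self.clips[-1]
            clip.append(frame)
            # Stop automatically once the length limit is reached.
            if len(clip) >= self.max_frames:
                self.recording = False

    def stop(self) -> None:
        """Handle a user-issued recording stop instruction."""
        self.recording = False

    def playback(self) -> List[str]:
        """Handle a playback instruction: return everything recorded so far."""
        return [frame for clip in self.clips for frame in clip]


# Usage: an announcement in which "boarding" is the user-specified keyword.
alerts: List[str] = []
ctrl = RecordingController(keywords={"boarding"}, max_frames=3,
                           remind=alerts.append)
for token in ["flight", "CA123", "boarding", "at", "gate", "12", "now"]:
    ctrl.on_frame(token)
print(alerts, ctrl.playback())
```

In this sketch the frame that triggers detection is itself the first recorded frame; a real device would more likely keep a short look-back buffer so that the words just before the keyword are also captured, which the claims neither require nor exclude.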
CN202010313790.4A 2020-04-20 2020-04-20 Scene-based key information reminding method, system and device Pending CN111601215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313790.4A CN111601215A (en) 2020-04-20 2020-04-20 Scene-based key information reminding method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313790.4A CN111601215A (en) 2020-04-20 2020-04-20 Scene-based key information reminding method, system and device

Publications (1)

Publication Number Publication Date
CN111601215A (en) 2020-08-28

Family

ID=72183273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313790.4A Pending CN111601215A (en) 2020-04-20 2020-04-20 Scene-based key information reminding method, system and device

Country Status (1)

Country Link
CN (1) CN111601215A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111109888A (en) * 2018-10-31 2020-05-08 仁宝电脑工业股份有限公司 Intelligent wine cabinet and management method for same
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
CN115188397A (en) * 2022-09-07 2022-10-14 云丁网络技术(北京)有限公司 Media output control method, device, equipment and readable medium
WO2023283965A1 (en) * 2021-07-16 2023-01-19 华为技术有限公司 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145341A (en) * 2006-09-04 2008-03-19 美商富迪科技股份有限公司 Method, system and apparatus for improved voice recognition
US20150253973A1 (en) * 2014-03-10 2015-09-10 Htc Corporation Reminder generating method and a mobile electronic device using the same
CN107464557A (en) * 2017-09-11 2017-12-12 广东欧珀移动通信有限公司 Call recording method, device, mobile terminal and storage medium
CN109451158A (en) * 2018-11-09 2019-03-08 维沃移动通信有限公司 Reminder method and device
CN109979440A (en) * 2019-03-13 2019-07-05 广州市网星信息技术有限公司 Keyword sample determination method, speech recognition method, device, equipment and medium
CN110556110A (en) * 2019-10-24 2019-12-10 北京九狐时代智能科技有限公司 Voice processing method and device, intelligent terminal and storage medium
CN212588503U (en) * 2020-04-20 2021-02-23 南京西觉硕信息科技有限公司 Embedded audio playing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘欢: "基于软交换系统录音方式的应用研究", 《通信电源技术》, 31 December 2015 (2015-12-31), pages 130 - 131 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111109888A (en) * 2018-10-31 2020-05-08 仁宝电脑工业股份有限公司 Intelligent wine cabinet and management method for same
CN111109888B (en) * 2018-10-31 2022-10-14 仁宝电脑工业股份有限公司 Intelligent wine cabinet and management method therefor
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
CN113468317B (en) * 2021-06-26 2024-03-08 北京网聘信息技术有限公司 Resume screening method, system, equipment and storage medium
WO2023283965A1 (en) * 2021-07-16 2023-01-19 华为技术有限公司 Method and apparatus for listening to speech by using device instead of ears, method and apparatus for generating speech recognition model, electronic device, and medium
CN115188397A (en) * 2022-09-07 2022-10-14 云丁网络技术(北京)有限公司 Media output control method, device, equipment and readable medium

Similar Documents

Publication Publication Date Title
CN111601215A (en) Scene-based key information reminding method, system and device
CN111630876B (en) Audio device and audio processing method
CN106874265B (en) Content output method matched with user emotion, electronic equipment and server
CN105320726B (en) Reduce the demand to manual beginning/end point and triggering phrase
US9788105B2 (en) Wearable headset with self-contained vocal feedback and vocal command
EP3611724A1 (en) Voice response method and device, and smart device
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
IL229370A (en) Interface apparatus and method for providing interaction of a user with network entities
CN101808047A (en) Instant messaging partner robot and instant messaging method with messaging partner
CN103269405A (en) Method and device for hinting friendlily
CN112204654B (en) System and method for predictive dialog content generation based on predictions
KR20190005103A (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
WO2018088319A1 (en) Reproduction terminal and reproduction method
CN108648754A (en) Sound control method and device
CN212588503U (en) Embedded audio playing device
KR20230133864A (en) Systems and methods for handling speech audio stream interruptions
CN110111795B (en) Voice processing method and terminal equipment
CN111339881A (en) Baby growth monitoring method and system based on emotion recognition
JP2021117371A (en) Information processor, information processing method and information processing program
CN112672207B (en) Audio data processing method, device, computer equipment and storage medium
WO2019242415A1 (en) Position prompt method, device, storage medium and electronic device
CN110196900A (en) Exchange method and device for terminal
CN110459239A (en) Role analysis method, apparatus and computer readable storage medium based on voice data
JP6867543B1 (en) Information processing equipment, information processing methods and programs
FR2899097A1 (en) Hearing-impaired person helping system for understanding and learning oral language, has system transmitting sound data transcription to display device, to be displayed in field of person so that person observes movements and transcription

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination