WO2021051506A1 - Voice interaction method and apparatus, computer device and storage medium - Google Patents

Voice interaction method and apparatus, computer device and storage medium Download PDF

Info

Publication number
WO2021051506A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
audio signal
response
analysis result
customer
Prior art date
Application number
PCT/CN2019/116512
Other languages
French (fr)
Chinese (zh)
Inventor
周定军
王健宗
彭俊清
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051506A1 publication Critical patent/WO2021051506A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/527Centralised call answering arrangements not requiring operator intervention
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • This application relates to the field of natural language processing, and in particular to a voice interaction method, device, computer equipment and storage medium.
  • the system architecture of an intelligent voice outbound platform is generally based on a telephone exchange platform and a variety of voice processing engines, such as a speech recognition engine (ASR), a semantic understanding engine (NLP), a speech synthesis engine (TTS), etc.
  • ASR: speech recognition engine; NLP: semantic understanding engine; TTS: speech synthesis engine
  • the basic processing flow of this intelligent voice outbound platform includes: converting the customer's speech into text through the speech recognition engine, further parsing the text through the semantic understanding engine to obtain an analysis result, selecting a response sentence according to the analysis result, and finally synthesizing the response sentence into a response voice through the speech synthesis engine and transmitting the response voice to the customer.
  • a voice interaction method including:
  • a response voice is generated according to the response sentence, and the response voice is sent to the customer corresponding to the customer channel.
  • a voice interaction device includes:
  • the audio judgment module is used to obtain the audio signal of the customer channel while the dialogue voice is being played, and to judge whether the specified parameter of the audio signal is greater than a first preset threshold;
  • the playback suspension module is configured to suspend playback of the dialogue voice if the specified parameter of the audio signal is greater than the first preset threshold
  • the response sentence determination module is used to parse the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;
  • the response voice sending module is used to generate a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and to send the response voice to the customer corresponding to the customer channel.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: while the dialogue voice is being played, obtain the audio signal of the customer channel, and determine whether the specified parameter of the audio signal is greater than a first preset threshold;
  • a response voice is generated according to the response sentence, and the response voice is sent to the customer corresponding to the customer channel.
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: while the dialogue voice is being played, obtain the audio signal of the customer channel, and determine whether the specified parameter of the audio signal is greater than a first preset threshold;
  • a response voice is generated according to the response sentence, and the response voice is sent to the customer corresponding to the customer channel.
  • FIG. 1 is a schematic diagram of an application environment of a voice interaction method in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice interaction method in an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a voice interaction method in an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice interaction method in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a voice interaction method in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a voice interaction method in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a voice interaction device in an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the voice interaction method provided in this embodiment can be applied in the application environment as shown in FIG. 1, where the terminal device communicates with the server through the network.
  • terminal devices include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with an independent server or a server cluster composed of multiple servers.
  • a voice interaction method is provided.
  • the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • the voice interaction method can be applied to an intelligent outbound call platform, and can also be applied to an intelligent response platform, or other intelligent interactive platforms.
  • the server can be set with multiple processing processes for processing audio signals transmitted through the client channel.
  • the client may refer to a client terminal carried by the customer; the server establishes a communication connection with the client (in some cases a call connection) to realize intelligent interaction with the customer.
  • the voice interaction method provided in this embodiment can be applied to scenarios such as customer return visits and questionnaire surveys.
  • the client may refer to an application terminal with a voice recording device, such as a self-service business terminal.
  • the voice interaction method can also be applied to one-to-many interaction scenarios.
  • the server simultaneously establishes a call connection with multiple clients.
  • the server can be based on the telephone soft-switch platform FreeSwitch and use shared-memory techniques to store the audio data of a specific customer channel.
  • with shared memory, the input and output voice of the same voice channel share the same memory buffer; when an input or output voice operation is performed, the buffer is locked to guarantee exclusive access, and after the operation is completed the lock is released so that subsequent operations can use the buffer again.
  • shared memory can be organically combined with message queues, state machines, multi-thread synchronization and other technologies to achieve multi-channel speech recognition and speech synthesis.
  • the dialogue voice may be generated based on the client's last speech data, or may be generated based on a preset response text.
  • playing the dialogue voice may mean sending the synthesized dialogue voice to the client.
  • playing the dialogue voice may instead mean sending the corresponding dialogue text and speech parameters to the client, which then synthesizes the dialogue voice from that text and those parameters.
  • the server is also provided with a special process for monitoring whether the designated parameter of the audio signal of the client channel is greater than the first preset threshold.
  • the specified parameter may refer to the volume of the audio signal
  • the first preset threshold may refer to the volume threshold.
  • the designated parameters may also be other audio parameters.
  • the value of the first preset threshold can be set according to actual needs, for example, it can be set to 15-25 decibels. In other cases, the first preset threshold may be determined based on the signal-to-noise ratio of the client channel.
  • the signal in the signal-to-noise ratio of the customer channel refers to the loudest audio within a specified time period
  • the noise refers to the average level of the background noise within that time period (a preset algorithm can determine which part of the audio within the period belongs to the background noise).
  • if the audio signal of the customer channel is greater than the first preset threshold, it indicates that the dialogue voice currently played by the server has been interrupted (possibly by the customer's voice, or by the customer's environment, such as loud noise).
  • the server then suspends playback of the above-mentioned dialogue voice. If the server streams audio data to the client in real time, it suspends playback by stopping the transmission of audio data; if the server sends dialogue text and voice parameters and the client synthesizes the dialogue voice, it suspends playback by sending a stop-playback instruction that makes the client stop playing the dialogue voice.
  • the audio signal of the customer channel corresponding to the analysis result may include the audio signal at the moment it is judged whether the specified parameter exceeds the first preset threshold, plus a certain period of subsequent audio.
  • the latest end point of that period may be the moment the audio signal of the customer channel is judged to be below the second preset threshold.
  • the audio signal is first analysed to determine whether it contains a human voice. If it does, the audio signal is parsed further; the parsed content includes, but is not limited to, text data and tone information. Semantic analysis can then be performed on the parsed text data to determine the customer's intention.
  • Each analysis result can correspond to a specific response sentence.
  • if the final analysis result is "a wrong number was dialled", the corresponding response sentence may be "Oh, sorry, the call was made by mistake; I will note it here so we do not disturb you again".
  • if the final analysis result is "the customer does not need the service currently offered", the corresponding response sentence may be "Then we will not disturb you further; please hang up first. Wishing you well, goodbye".
  • if the final analysis result is "the customer's intention is unclear", the corresponding response sentence may be "Sorry, I did not quite catch that; could you repeat the question?".
  • if the final analysis result is "the customer suspects that the agent is a robot", the corresponding response sentence may be "You are really sharp, you heard it; I am an intelligent customer-service agent and I am glad to serve you".
  • if the final analysis result is "the customer's environment is very noisy", the corresponding response sentence may be "It is rather noisy on your side; I am not sure whether you could hear what I just said".
  • the second preset threshold can be adjusted according to different analysis results. For example, if the analysis result determines that the audio signal is not a human voice, the second preset threshold may be 55-75 decibels; if the analysis result determines that the audio signal is a human voice, the second preset threshold may be the same as the first preset threshold. After it is determined that the response voice can be issued, the response voice can be generated according to the response sentence, and the response voice can be sent to the customer so that the customer can hear the response voice.
  • according to survey data, after the voice interaction method provided by the embodiments of this application was adopted, customer satisfaction increased from 50% to 80%
  • the business compliance rate also increased from 40% to 70%.
  • the reason is that, because the embodiments of this application have good adaptability (the audio signal of the customer channel is monitored), they can respond flexibly and promptly to customer feedback, improving interactivity with customers and the fluency of intelligent voice communication
  • customer satisfaction and the business compliance rate are therefore also greatly improved.
  • in steps S10-S40, while the dialogue voice is being played, the audio signal of the customer channel is obtained and it is judged whether the specified parameter of the audio signal exceeds the first preset threshold, so as to detect whether the customer has interrupted or whether there is loud environmental noise on the channel. If the specified parameter of the audio signal exceeds the first preset threshold, playback of the dialogue voice is suspended, pausing the voice output so as not to interfere with the customer's speech. The audio signal is parsed, an analysis result is obtained, and a response sentence is determined from the analysis result, so that feedback appropriate to the actual situation (namely the response sentence) is produced.
  • a response voice is generated according to the response sentence and sent to the customer corresponding to the customer channel, so that an appropriate response voice is used to interact with the customer at an appropriate moment.
  • the method further includes:
  • S102 Establish a call connection with the customer according to the customer information
  • S103 Determine initial voice parameters and initial dialogue text according to the customer information and preset interactive tasks
  • S104 Generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text.
  • the customer information includes, but is not limited to, the customer's name, age, occupation, contact information, and historical communication records.
  • the contact method can refer to a mobile phone number or a landline. You can establish a call connection with the customer by calling the customer's mobile phone number or fixed-line phone.
  • the preset interactive tasks can refer to the purpose of this exchange, such as user return visits, user surveys, business recommendations, and so on.
  • the initial speech parameters may include pronunciation gender, speaking speed, intonation, volume and so on.
  • the initial dialogue text can be the first sentence or sentences of dialogue after the server establishes a call connection with the customer. For example, if the customer information shows that the customer's surname is "Li", the following initial dialogue text is used when calling the customer: "Hello, is this Mr. Li?". After the customer confirms his identity, the following initial dialogue text can be used: "Hello, Mr. Li, I have a questionnaire survey that will take about 3 minutes of your time; is now a convenient moment?".
  • the corresponding initial dialogue speech can be synthesized by the speech synthesis engine.
  • a speech synthesis engine with a higher degree of immersion can be selected to generate an initial dialogue voice that is closer to the voice of a real person.
  • the initial dialogue voice can be sent to the client carried by the client through the call connection, and the client receives the initial dialogue voice through the client.
  • in steps S101 and S102, customer information is obtained so as to acquire the customer's contact details, and a call connection with the customer is established according to the customer information.
  • the initial voice parameters and initial dialogue text are determined according to the customer profile and preset interactive tasks, and data is prepared for generating the initial dialogue voice.
  • an initial conversation speech is generated to convert the text data into audio data.
  • the initial dialogue voice is sent to the client so that the client can receive the initial dialogue voice.
  • step S30 includes:
  • the server can set a human voice recognition program to determine whether the audio signal contains human voice.
  • the human voice recognition program produces two possible judgment results: the audio contains a human voice, or it does not.
  • a number of different connection sentences can be preset and associated with the different judgment results. For example, when it is judged that the audio signal does not contain a human voice and the customer's environment is determined to be relatively noisy, the connection sentence can be "Mr. X, it is a bit noisy on your side; should I increase the volume and repeat that?".
  • the first voice adjustment parameter may be generated based on the judgment result to change the volume of the response voice.
  • the dialogue voice here refers to the dialogue voice that was interrupted by the noise; part or all of its content can be selected and combined with the connection sentence to generate the response sentence.
  • the generated response sentence is associated with the first voice adjustment parameter, and the two together are used to synthesize the corresponding response voice.
  • in steps S301-S303, the audio signal of the customer channel is analysed and an analysis result is obtained, where the analysis result is either that the audio signal contains a human voice or that it does not, so as to distinguish different response scenarios. If the obtained analysis result is that the audio signal does not contain a human voice, the connection sentence and the first voice adjustment parameter corresponding to that result are selected, so that the appropriate response steps are taken when the analysis result indicates environmental noise.
  • the response sentence is generated according to the connection sentence and the dialogue voice, and the response sentence is associated with the first voice adjustment parameter, so as to generate a response sentence suited to environmental noise.
  • the method further includes:
  • if the obtained analysis result is that the audio signal contains a human voice, convert the audio signal of the customer channel into text data through a speech recognition engine, and recognize the tone type of the audio signal of the customer channel through a preset tone recognition model
  • the human voice in the audio signal needs to be further identified to learn the needs of the client.
  • the specific recognition steps include: first converting the audio signal into text data through the speech recognition engine, and then recognizing the semantic information of the text data through the semantic understanding engine.
  • the tone type of the audio signal can be recognized at the same time.
  • a preset tone recognition model can be used to recognize the tone type of the audio signal.
  • the recognized tone types include two types, one is positive and the other is negative. In the advanced tone recognition model, more than two tone types can be identified.
  • the second voice adjustment parameter matching the tone type can be selected to adjust the voice parameter of the response voice.
  • the preset response sentence database is pre-stored with multiple response sentences, which are matched with specific semantic information. After recognizing the semantic information in the audio information, the response sentence with the highest matching degree can be found in the preset response sentence database. At the same time, the second voice adjustment parameter is associated with the response sentence.
  • in steps S304-S306, if the obtained analysis result is that the audio signal contains a human voice, the audio signal of the customer channel is converted into text data by the speech recognition engine, and the preset tone recognition model is used to recognize the tone type of the audio signal of the customer channel, so as to identify the content and tone of the customer's current utterance.
  • the semantic information of the text data is recognized by the semantic understanding engine to further determine the customer's needs.
  • the response sentence matching the semantic information is selected from a preset response sentence database, a second voice adjustment parameter matching the tone type is obtained, and the second voice adjustment parameter is associated with the response sentence, so as to select an appropriate response sentence with which to reply to the customer.
  • step S40 includes:
  • multiple background noise types can be preset, the similarity between the current audio signal and the feature values of each background noise type is calculated, and the background noise type with the highest similarity is selected as the background noise type of the audio signal.
  • the preset background noise type can be a road scene, a commercial street scene, a supermarket scene, etc.
  • Each background noise type matches a second preset threshold.
  • the second preset threshold for road scene matching may be 80 decibels
  • the second preset threshold for commercial street scene matching may be 70 decibels.
  • if the audio signal is greater than the second preset threshold, the background noise is very loud; even if the dialogue voice were played, the customer could hardly hear it. It is therefore necessary to wait until the audio signal falls below the second preset threshold before the response voice is sent out.
  • a segment of the audio signal can be buffered over a preset buffering time interval; if the highest volume of the audio signal within that interval is less than the second preset threshold, the audio signal is judged to be less than the second preset threshold, and if it is greater than or equal to the second preset threshold, the audio signal is judged to be greater than or equal to the second preset threshold (a sketch of this check appears after this list).
  • the buffering time interval can be 0.3 to 0.5 seconds, and it can vary with the type of background noise.
  • the background noise type of the audio signal of the customer channel is identified to determine the type of scene the customer is currently in. The second preset threshold that matches the background noise type is acquired so as to select an appropriate response threshold (i.e., the second preset threshold).
  • the response voice is generated according to the response sentence and the first voice adjustment parameter, and the response voice is sent to the client.
  • the voice interaction method provided by the embodiments of the present application can improve the adaptability of intelligent voice, enhance the interaction with customers, and improve the fluency of communication with customers.
  • a voice interaction device is provided, and the voice interaction device corresponds to the voice interaction method in the foregoing embodiment one-to-one.
  • the voice interaction device includes an audio judgment module 10, a playback suspension module 20, a response sentence determination module 30, and a response voice sending module 40.
  • the detailed description of each functional module is as follows:
  • the audio judgment module 10 is used to obtain the audio signal of the customer channel while the dialogue voice is being played, and to judge whether the specified parameter of the audio signal is greater than a first preset threshold;
  • the playback suspension module 20 is configured to suspend playback of the dialogue voice if the specified parameter of the audio signal is greater than the first preset threshold
  • the response sentence determination module 30 is configured to parse the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;
  • the response voice sending module 40 is configured to generate a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and to send the response voice to the customer corresponding to the customer channel.
  • the voice interaction device further includes:
  • an information acquisition module, used to obtain customer information
  • a call connection establishment module, configured to establish a call connection with the customer according to the customer information
  • a dialogue text determination module, used to determine the initial voice parameters and the initial dialogue text according to the customer information and a preset interactive task;
  • an initial dialogue voice generation module, configured to generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text
  • an initial dialogue voice sending module, used to send the initial dialogue voice to the customer.
  • the response sentence determination module 30 includes:
  • a parsing unit configured to analyze the audio signal of the client channel and obtain an analysis result of the audio signal, wherein the analysis result includes that the audio signal contains human voice or the audio signal does not contain human voice;
  • connection sentence unit for selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result that does not contain human voice if the obtained analysis result is that the audio signal does not contain human voice;
  • the first generating response sentence unit is configured to generate the response sentence according to the connection sentence and the dialogue voice, and associate the response sentence with the first voice adjustment parameter.
  • the response sentence determination module 30 further includes:
  • the voice recognition unit is configured to, if the obtained analysis result is that the audio signal contains human voice, convert the audio signal of the customer channel into text data through a voice recognition engine, and recognize the voice through a preset tone recognition model The tone type of the audio signal of the client channel;
  • the semantic understanding unit is used to identify the semantic information of the text data through the semantic understanding engine
  • the second generating response sentence unit is configured to select the response sentence matching the semantic information from a preset response sentence database, and obtain a second voice adjustment parameter matching the tone type, and the second voice adjustment The parameter is associated with the response sentence.
  • the response voice sending module 40 includes:
  • a background noise recognition unit configured to recognize the background noise type of the audio signal of the client channel
  • An acquiring threshold unit configured to acquire the second preset threshold matching the background noise type
  • a response voice sending unit, configured to generate the response voice according to the response sentence and the first voice adjustment parameter when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, and to send the response voice to the customer corresponding to the customer channel.
  • Each module in the above-mentioned voice interaction device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store data related to the above-mentioned voice interaction method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a voice interaction method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device including a memory, a processor, and computer readable instructions stored in the memory and capable of running on the processor.
  • the processor executes the computer readable instructions, the following steps are implemented:
  • a response voice is generated according to the response sentence, and the response voice is sent to the customer corresponding to the customer channel.
  • in one embodiment, a computer-readable storage medium is provided, which includes a non-volatile readable storage medium and a volatile readable storage medium.
  • the readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the following steps are implemented:
  • a response voice is generated according to the response sentence, and the response voice is sent to the customer corresponding to the customer channel.
  • a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions.
  • the computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium
  • when the computer-readable instructions are executed, the processes of the above-mentioned method embodiments may be included.
  • any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
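As referenced in the buffering discussion above, the sketch below shows one way to pick the second preset threshold from a noise-scene table and to test a buffered window against it. The scene labels, decibel values and the use of cosine similarity over feature vectors are illustrative assumptions, not values fixed by this application.

```python
# Illustrative scene-to-threshold table (decibel values assumed, not prescribed).
SECOND_THRESHOLD_BY_SCENE = {"road": 80.0, "commercial_street": 70.0, "supermarket": 65.0, "quiet": 55.0}

def classify_background_noise(features, scene_features) -> str:
    """Pick the preset background-noise scene whose feature vector is most similar
    to the current audio's features (cosine similarity is just one possible measure)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    return max(scene_features, key=lambda scene: cosine(features, scene_features[scene]))

def channel_is_quiet(window_volumes_db, scene: str) -> bool:
    """A buffered window (e.g. 0.3-0.5 s of frames) counts as below the second preset
    threshold only if its loudest frame stays under the threshold matched to the scene."""
    threshold = SECOND_THRESHOLD_BY_SCENE.get(scene, 60.0)
    return max(window_volumes_db) < threshold
```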

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice interaction method, the method comprising: acquiring an audio signal of a customer channel when playing back a dialogue voice, and determining whether a specified parameter of the audio signal is greater than a first preset threshold (S10); if the specified parameter of the audio signal is greater than the first preset threshold, then suspending playback of the dialogue voice (S20); parsing the audio signal and acquiring a parsing result of the audio signal, and determining a response sentence according to the parsing result (S30); and when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to a customer corresponding to the customer channel (S40). The described method may improve the adaptability of intelligent voice, enhance interaction with customers, and improve the fluency of communication with customers.

Description

Voice interaction method, device, computer equipment and storage medium
This application is based on, and claims priority from, the Chinese invention application No. 201910883213.6, filed on September 18, 2019 and entitled "Voice interaction method, device, computer equipment and storage medium".
Technical field
This application relates to the field of natural language processing, and in particular to a voice interaction method, device, computer equipment and storage medium.
Background
At present, the system architecture of an intelligent voice outbound-call platform is generally based on a telephone exchange platform and a variety of voice processing engines, such as a speech recognition engine (ASR), a semantic understanding engine (NLP) and a speech synthesis engine (TTS). The basic processing flow of such a platform is: the customer's speech is converted into text by the speech recognition engine; the text is then further parsed by the semantic understanding engine to obtain an analysis result; a response sentence is selected according to the analysis result; and finally the response sentence is synthesized into a response voice by the speech synthesis engine and transmitted to the customer.
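As a rough illustration of this conventional flow only (not the platform's actual implementation), the sketch below strings the three engines together; the engine callables are passed in as parameters because their concrete APIs are not specified by this application.

```python
from typing import Callable

def handle_customer_audio(
    audio: bytes,
    asr: Callable[[bytes], str],            # speech recognition engine (ASR): audio -> text
    nlp_parse: Callable[[str], str],        # semantic understanding engine (NLP): text -> analysis result
    select_response: Callable[[str], str],  # response-sentence selection from the analysis result
    tts: Callable[[str], bytes],            # speech synthesis engine (TTS): sentence -> response voice
) -> bytes:
    """Conventional outbound-platform flow: speech -> text -> analysis -> response sentence -> response voice."""
    text = asr(audio)
    analysis = nlp_parse(text)
    sentence = select_response(analysis)
    return tts(sentence)
```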
However, this kind of interaction is mechanical and tedious: the intelligent voice system adapts poorly and cannot respond flexibly and promptly to customer feedback, which reduces interactivity with customers and affects the fluency of intelligent voice communication with them.
Summary of the invention
Based on this, it is necessary to provide, for the above technical problems, a voice interaction method, device, computer equipment and storage medium that improve the adaptability of intelligent voice, enhance interaction with customers, and improve the fluency of communication with customers.
A voice interaction method, including:
when a dialogue voice is being played, obtaining the audio signal of the customer channel, and determining whether a specified parameter of the audio signal is greater than a first preset threshold;
if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
parsing the audio signal, obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to the customer corresponding to the customer channel.
A voice interaction device, including:
an audio judgment module, used to obtain the audio signal of the customer channel while a dialogue voice is being played, and to judge whether a specified parameter of the audio signal is greater than a first preset threshold;
a playback suspension module, configured to suspend playback of the dialogue voice if the specified parameter of the audio signal is greater than the first preset threshold;
a response sentence determination module, used to parse the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;
a response voice sending module, used to generate a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and to send the response voice to the customer corresponding to the customer channel.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps: when a dialogue voice is being played, obtaining the audio signal of the customer channel, and determining whether a specified parameter of the audio signal is greater than a first preset threshold;
if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
parsing the audio signal, obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to the customer corresponding to the customer channel.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: when a dialogue voice is being played, obtaining the audio signal of the customer channel, and determining whether a specified parameter of the audio signal is greater than a first preset threshold;
if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
parsing the audio signal, obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to the customer corresponding to the customer channel.
The details of one or more embodiments of this application are set forth in the following drawings and description; other features and advantages of this application will become apparent from the description, the drawings and the claims.
Description of the drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the voice interaction method in an embodiment of this application;
FIG. 2 is a schematic flowchart of the voice interaction method in an embodiment of this application;
FIG. 3 is a schematic flowchart of the voice interaction method in an embodiment of this application;
FIG. 4 is a schematic flowchart of the voice interaction method in an embodiment of this application;
FIG. 5 is a schematic flowchart of the voice interaction method in an embodiment of this application;
FIG. 6 is a schematic flowchart of the voice interaction method in an embodiment of this application;
FIG. 7 is a schematic structural diagram of the voice interaction device in an embodiment of this application;
FIG. 8 is a schematic diagram of a computer device in an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The voice interaction method provided in this embodiment can be applied in the application environment shown in FIG. 1, in which a terminal device communicates with the server over a network. Terminal devices include, but are not limited to, personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a voice interaction method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S10. When a dialogue voice is being played, obtain the audio signal of the customer channel, and determine whether a specified parameter of the audio signal is greater than a first preset threshold.
S20. If the specified parameter of the audio signal is greater than the first preset threshold, suspend playback of the dialogue voice.
S30. Parse the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result.
S40. When the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generate a response voice according to the response sentence, and send the response voice to the customer corresponding to the customer channel.
In this embodiment, the voice interaction method can be applied to an intelligent outbound-call platform, an intelligent answering platform, or another intelligent interactive platform. The server can run multiple processing processes for handling the audio signals transmitted through customer channels. In some cases, the client is a client terminal carried by the customer, and the server establishes a communication connection with the client (in some cases a call connection) to interact intelligently with the customer; in this case the voice interaction method provided in this embodiment can be applied to scenarios such as customer return visits and questionnaire surveys. In other cases, the client may be an application terminal with a voice recording device, such as a self-service business terminal.
In one example, the voice interaction method can also be applied to one-to-many interaction scenarios, for example when the server establishes call connections with multiple clients at the same time. In this case, the server can be based on the telephone soft-switch platform FreeSwitch and use shared-memory techniques to store the audio data of a specific customer channel. Here, the input and output voice of the same voice channel share the same memory buffer: when an input or output voice operation is performed, the buffer is locked to guarantee exclusive access, and after the operation is completed the lock is released so that subsequent operations can use the buffer again. In a concrete implementation, shared memory can be combined with message queues, state machines, multi-thread synchronization and other techniques to achieve multi-channel speech recognition and speech synthesis.
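To make the locking idea above concrete, here is a deliberately simplified, single-process sketch of a per-channel audio store; the class and method names are hypothetical, and FreeSwitch's actual shared-memory facilities are not modeled.

```python
import threading
from collections import defaultdict

class ChannelAudioStore:
    """One buffer per customer channel; input and output operations on the same
    channel share that buffer and are serialized by the channel's lock."""

    def __init__(self) -> None:
        self._buffers = defaultdict(bytearray)
        self._locks = defaultdict(threading.Lock)

    def write(self, channel_id: str, chunk: bytes) -> None:
        with self._locks[channel_id]:      # lock the buffer for exclusive access
            self._buffers[channel_id].extend(chunk)

    def read_all(self, channel_id: str) -> bytes:
        with self._locks[channel_id]:      # lock is released automatically on exit
            data = bytes(self._buffers[channel_id])
            self._buffers[channel_id].clear()
            return data
```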
Specifically, the dialogue voice may be generated from the customer's most recent utterance, or from a preset response text. In particular, playing the dialogue voice may mean sending the synthesized dialogue voice to the client. In some cases, for example when a suitable application is installed on the client, playing the dialogue voice may instead mean sending the corresponding dialogue text and speech parameters to the client, which then synthesizes the dialogue voice from them.
The server also runs a dedicated process that monitors whether the specified parameter of the audio signal of the customer channel is greater than the first preset threshold. Here the specified parameter may be the volume of the audio signal and the first preset threshold a volume threshold; in some cases other audio parameters can be used. The value of the first preset threshold can be set as needed, for example 15-25 decibels. Alternatively, it can be determined from the signal-to-noise ratio of the customer channel, where the signal is the loudest audio within a specified time period and the noise is the average level of the background noise within that period (a preset algorithm can determine which part of the audio belongs to the background noise).
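One plausible way to implement the volume check, assuming 16-bit little-endian PCM frames: the application quotes thresholds in plain decibels, while the sketch below measures level in dB relative to digital full scale, so the numeric thresholds would need calibrating; the noise-floor margin rule at the end is likewise only an assumption.

```python
import math
import struct

def frame_volume_db(frame: bytes) -> float:
    """Approximate level of a 16-bit PCM frame in dB relative to full scale."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768.0) if rms > 0 else float("-inf")

def is_interrupted(frame: bytes, first_threshold_db: float) -> bool:
    """True when the specified parameter (volume) exceeds the first preset threshold."""
    return frame_volume_db(frame) > first_threshold_db

def snr_based_threshold(noise_floor_db: float, margin_db: float = 6.0) -> float:
    """One possible SNR-derived first threshold: a fixed margin above the measured noise floor."""
    return noise_floor_db + margin_db
```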
When the audio signal of the customer channel exceeds the first preset threshold, the dialogue voice currently played by the server has been interrupted, either by the customer's voice or by the customer's environment, for example loud noise. The server then suspends playback of the dialogue voice. If the server streams audio data to the client in real time, it suspends playback by stopping the transmission of audio data; if the server sends dialogue text and speech parameters and the client synthesizes the dialogue voice, it suspends playback by sending a stop-playback instruction that makes the client stop playing the dialogue voice.
After playback is suspended, the corresponding response strategy is determined from the analysis result of the audio signal of the customer channel. The analysed audio may include the signal at the moment the specified parameter was judged against the first preset threshold plus a period of subsequent audio, at the latest up to the moment the signal of the customer channel is judged to be below the second preset threshold. There may be many different analysis results. For example, the audio signal is first analysed to determine whether it contains a human voice; if it does, it is parsed further into, among other things, text data and tone information, and semantic analysis can then be performed on the text to determine the customer's intention. Each analysis result can correspond to a specific response sentence.
For example, if the final analysis result is "a wrong number was dialled", the corresponding response sentence may be "Oh, sorry, the call was made by mistake; I will note it here so we do not disturb you again". If the result is "the customer does not need the service currently offered", the response may be "Then we will not disturb you further; please hang up first. Wishing you well, goodbye". If the result is "the customer's intention is unclear", the response may be "Sorry, I did not quite catch that; could you repeat the question?". If the result is "the customer suspects the agent is a robot", the response may be "You are really sharp, you heard it; I am an intelligent customer-service agent and I am glad to serve you". If the result is "the customer's environment is very noisy", the response may be "It is rather noisy on your side; I am not sure whether you could hear what I just said".
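The result-to-sentence pairing described above can be held in a simple lookup table. The sketch below merely restates the examples from this paragraph; the keys and the fallback rule are illustrative assumptions, not a prescribed set.

```python
RESPONSE_SENTENCES = {
    "wrong_number": "Oh, sorry, the call was made by mistake; I will note it here so we do not disturb you again.",
    "service_not_needed": "Then we will not disturb you further; please hang up first. Wishing you well, goodbye.",
    "intent_unclear": "Sorry, I did not quite catch that; could you repeat the question?",
    "suspects_robot": "You are really sharp, you heard it; I am an intelligent customer-service agent, glad to serve you.",
    "noisy_environment": "It is rather noisy on your side; I am not sure whether you could hear what I just said.",
}

def response_for(analysis_result: str) -> str:
    # Fall back to asking for a repeat when the analysis result is not recognized.
    return RESPONSE_SENTENCES.get(analysis_result, RESPONSE_SENTENCES["intent_unclear"])
```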
After the response sentence is determined, a suitable moment must be chosen to issue the corresponding response voice; it can be generated and sent when the audio signal falls below the second preset threshold. The second preset threshold can be adjusted according to the analysis result. For example, if the analysis result indicates that the audio signal is not a human voice, the second preset threshold may be 55-75 decibels; if it is a human voice, the second preset threshold may equal the first preset threshold. Once it is determined that the response voice can be issued, it is generated from the response sentence and sent to the customer so that the customer hears it.
Survey data show that, after the voice interaction method provided by the embodiments of this application was adopted, customer satisfaction rose from 50% to 80% and the business compliance rate rose from 40% to 70%. The reason is that, thanks to its good adaptability (the audio signal of the customer channel is monitored), the method responds flexibly and promptly to customer feedback, improving interactivity and the fluency of intelligent voice communication, which in turn greatly raises customer satisfaction and the business compliance rate.
In steps S10-S40, while the dialogue voice is being played, the audio signal of the customer channel is obtained and it is judged whether its specified parameter exceeds the first preset threshold, so as to detect whether the customer has interrupted or whether there is loud environmental noise on the channel. If the specified parameter exceeds the first preset threshold, playback of the dialogue voice is suspended so that the voice output pauses and does not interfere with the customer's speech. The audio signal is parsed, an analysis result is obtained, and a response sentence is determined from it, producing feedback appropriate to the actual situation. When the specified parameter of the audio signal of the customer channel is below the second preset threshold, a response voice is generated from the response sentence and sent to the customer corresponding to the customer channel, so that an appropriate response is given at an appropriate moment.
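Putting S10-S40 together, a highly simplified control loop for one customer channel might look as follows. It reuses the frame_volume_db and response_for helpers sketched earlier; the frame source, player object and threshold values are placeholders rather than the platform's real interfaces.

```python
def interaction_loop(get_frame, player, parse_audio, send_response,
                     first_threshold_db: float, second_threshold_db: float) -> None:
    """S10-S40: monitor the customer channel while a dialogue voice plays,
    suspend playback on interruption, analyse the audio, reply once the channel is quiet."""
    interrupting = bytearray()
    while player.is_playing():
        frame = get_frame()                              # S10: audio from the customer channel
        if frame_volume_db(frame) > first_threshold_db:  # specified parameter > first threshold
            player.stop()                                # S20: suspend the dialogue voice
            interrupting.extend(frame)
            break
    if not interrupting:
        return                                           # playback finished without interruption
    while True:                                          # keep collecting until the channel is quiet
        frame = get_frame()
        if frame_volume_db(frame) < second_threshold_db:
            break                                        # S40 pre-condition met
        interrupting.extend(frame)
    analysis = parse_audio(bytes(interrupting))          # S30: obtain the analysis result
    send_response(response_for(analysis))                # S30/S40: response sentence -> response voice
```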
Optionally, as shown in FIG. 3, before step S10, the method further includes:
S101. Obtain customer information;
S102. Establish a call connection with the customer according to the customer information;
S103. Determine initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
S104. Generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
S105. Send the initial dialogue voice to the customer.
In this embodiment, the customer information includes, but is not limited to, the customer's name, age, occupation, contact information, and historical communication records. Here, the contact information may be a mobile phone number or a landline number, and the call connection with the customer may be established by dialing that number.
The preset interactive task refers to the purpose of the current communication, such as a customer follow-up visit, a customer survey, or a product recommendation. The initial voice parameters may include the speaker's gender, speaking rate, intonation, volume, and so on. The initial dialogue text is the first sentence or sentences of dialogue after the server establishes the call connection with the customer. For example, if the customer information indicates that the customer's surname is "Li", the following initial dialogue text may be used when calling the customer: "Hello, is this Mr. Li?" After the customer confirms his identity, the following initial dialogue text may be used: "Hello, Mr. Li, I have a questionnaire that will take about three minutes of your time. Is now a convenient moment?"
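As a sketch of how the initial dialogue text might be assembled from the customer profile and the preset interactive task, the snippet below fills simple templates with the customer's surname. The task names, template wording, and honorific handling are illustrative assumptions.

```python
# Sketch: build the initial dialogue text from customer data and the task type.
TASK_TEMPLATES = {
    "survey": "Hello, {title}, I have a questionnaire that will take about "
              "three minutes of your time. Is now a convenient moment?",
    "follow_up": "Hello, {title}, I am calling to follow up on your recent "
                 "service experience. Do you have a moment?",
}

def initial_dialogue_text(customer: dict, task: str) -> list[str]:
    """Return the opening sentences for the call (assumed structure)."""
    title = (f"Mr. {customer['surname']}" if customer.get("gender") == "male"
             else f"Ms. {customer['surname']}")
    greeting = f"Hello, is this {title}?"          # identity confirmation first
    follow_on = TASK_TEMPLATES[task].format(title=title)
    return [greeting, follow_on]

# Example: initial_dialogue_text({"surname": "Li", "gender": "male"}, "survey")
```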
After the initial voice parameters and the initial dialogue text are determined, the corresponding initial dialogue voice can be synthesized by the speech synthesis engine. Here, a speech synthesis engine with a higher degree of realism may be selected, so that the generated initial dialogue voice is closer to a real human voice.
After the initial dialogue voice is generated, it can be sent over the call connection to the client device carried by the customer, and the customer receives the initial dialogue voice through that client device.
In steps S101 to S105, the customer information is obtained so as to acquire the customer's contact information, and a call connection with the customer is established according to the customer information. The initial voice parameters and the initial dialogue text are determined according to the customer information and the preset interactive task, preparing the data for generating the initial dialogue voice. The initial dialogue voice is generated from the initial voice parameters and the initial dialogue text, converting the text data into audio data, and the initial dialogue voice is then sent to the customer so that the customer receives it.
Optionally, as shown in FIG. 4, step S30 includes:
S301. Analyze the audio signal of the customer channel and obtain an analysis result of the audio signal, where the analysis result indicates either that the audio signal contains a human voice or that it does not;
S302. If the obtained analysis result is that the audio signal does not contain a human voice, select the connection sentence and the first voice adjustment parameter corresponding to that analysis result;
S303. Generate the response sentence according to the connection sentence and the dialogue voice, and associate the response sentence with the first voice adjustment parameter.
In this embodiment, the server may be provided with a human-voice recognition program for determining whether the audio signal contains a human voice. The program yields one of two results: human voice or non-human voice. Several different connection sentences may be preset and associated with the different results. For example, when it is determined that the audio signal does not contain a human voice and that the customer's environment is rather noisy, the connection sentence may be "Mr. X, it is a bit noisy on your side; shall I raise the volume and repeat what I just said?" The first voice adjustment parameter may be generated based on the result so as to change the volume of the response voice. Here, the dialogue voice refers to the dialogue voice that was interrupted by the noise; part or all of its content may be selected and combined with the connection sentence to form the response sentence. The generated response sentence is associated with the adjusted first voice adjustment parameter, and the two together are used to synthesize the corresponding response voice.
In steps S301 to S303, the audio signal of the customer channel is analyzed and its analysis result is obtained, where the analysis result indicates whether or not the audio signal contains a human voice, so as to distinguish between different handling scenarios. If the obtained analysis result is that the audio signal does not contain a human voice, the connection sentence and the first voice adjustment parameter corresponding to that analysis result are selected, so that an appropriate response step is taken when the analysis result indicates environmental noise. The response sentence is generated according to the connection sentence and the dialogue voice and is associated with the first voice adjustment parameter, so as to produce a response sentence suited to environmental noise.
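A minimal sketch of the no-human-voice branch of steps S301-S303, assuming the first voice adjustment parameter takes the form of a simple volume-gain value and that the connection sentence is pre-stored; the specific numbers and the adjustment rule are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ResponsePlan:
    sentence: str          # response sentence to be synthesized
    volume_gain_db: float  # first voice adjustment parameter (assumed form)

CONNECTION_SENTENCE_NOISE = ("Mr. X, it is a bit noisy on your side; "
                             "shall I raise the volume and repeat what I just said?")

def plan_for_noise(interrupted_text: str, noise_level_db: float) -> ResponsePlan:
    """S302/S303: build the response for a non-human-voice interruption."""
    gain = 6.0 if noise_level_db > 70 else 3.0   # assumed volume-adjustment rule
    # Reuse part of the interrupted dialogue voice after the connection sentence.
    sentence = f"{CONNECTION_SENTENCE_NOISE} {interrupted_text}"
    return ResponsePlan(sentence=sentence, volume_gain_db=gain)
```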
Optionally, as shown in FIG. 5, after step S301, the method further includes:
S304. If the obtained analysis result is that the audio signal contains a human voice, convert the audio signal of the customer channel into text data through a speech recognition engine, and recognize the tone type of the audio signal of the customer channel through a preset tone recognition model;
S305. Recognize the semantic information of the text data through a semantic understanding engine;
S306. Select, from a preset response sentence database, the response sentence matching the semantic information, and obtain a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
In this embodiment, if the audio signal of the customer channel contains a human voice, the human voice in the audio signal needs to be further recognized in order to learn the customer's needs. The specific recognition steps are: first convert the audio signal into text data through the speech recognition engine, and then recognize the semantic information of the text data through the semantic understanding engine. While the audio signal is being converted into text data, the tone type of the audio signal can be recognized at the same time, using a preset tone recognition model. In a simplified tone recognition model, two tone types are recognized: positive and negative. In a more advanced tone recognition model, more than two tone types can be recognized. After the tone type of the audio signal is recognized, the second voice adjustment parameter matching the tone type can be selected to adjust the voice parameters of the response voice.
The preset response sentence database stores multiple response sentences in advance, each matched with specific semantic information. After the semantic information in the audio signal is recognized, the response sentence with the highest matching degree can be looked up in the preset response sentence database. At the same time, the second voice adjustment parameter is associated with the response sentence.
In steps S304 to S306, if the obtained analysis result is that the audio signal contains a human voice, the audio signal of the customer channel is converted into text data through the speech recognition engine, and the tone type of the audio signal of the customer channel is recognized through the preset tone recognition model, so as to identify the content and tone of the customer's current utterance. The semantic information of the text data is recognized through the semantic understanding engine to further determine the customer's needs. The response sentence matching the semantic information is selected from the preset response sentence database, and the second voice adjustment parameter matching the tone type is obtained and associated with the response sentence, so that an appropriate response sentence is selected to respond to the customer's words.
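The database lookup in steps S304-S306 could be sketched as a keyword-overlap match plus a tone-to-parameter table; a real deployment would use the semantic understanding engine's own scoring, so the entries, tone labels, and parameter fields below are illustrative assumptions only.

```python
# Sketch: pick the best-matching response sentence and a tone-matched
# second voice adjustment parameter (rate and volume are assumed fields).
RESPONSE_DB = [
    {"keywords": {"price", "cost", "fee"},
     "sentence": "Let me explain the fees for you."},
    {"keywords": {"busy", "later"},
     "sentence": "No problem, when would be a better time to call?"},
]
TONE_PARAMS = {
    "positive": {"rate": 1.0, "volume_gain_db": 0.0},
    "negative": {"rate": 0.9, "volume_gain_db": -2.0},  # slower and softer (assumed)
}

def match_response(text: str, tone: str):
    """Return (response sentence, second voice adjustment parameter)."""
    words = set(text.lower().split())
    params = TONE_PARAMS.get(tone, TONE_PARAMS["positive"])
    best = max(RESPONSE_DB, key=lambda item: len(item["keywords"] & words))
    if not best["keywords"] & words:
        return "Sorry, could you tell me a bit more about that?", params
    return best["sentence"], params
```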
Optionally, as shown in FIG. 6, step S40 includes:
S401. Identify the background noise type of the audio signal of the customer channel;
S402. Obtain the second preset threshold matching the background noise type;
S403. When the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generate the response voice according to the response sentence and the first voice adjustment parameter, and send the response voice to the customer corresponding to the customer channel.
In this embodiment, multiple background noise types can be preset; the similarity between the current audio signal and the feature values of each background noise type is computed, and the background noise type with the highest similarity is selected as the background noise type of the audio signal. The preset background noise types may be, for example, a road scene, a commercial street scene, or a supermarket scene. Each background noise type is matched with a second preset threshold. For example, the second preset threshold matching the road scene may be 80 decibels, and the second preset threshold matching the commercial street scene may be 70 decibels.
If the audio signal is greater than the second preset threshold, the background noise is loud, and even if the dialogue voice were played the customer would find it difficult to hear the content clearly; therefore the system waits until the audio signal falls below the second preset threshold before playing the response voice. When determining whether the audio signal is less than the second preset threshold, a segment of the audio signal may be buffered over a preset buffering interval: if the highest volume of the audio signal within the buffering interval is less than the second preset threshold, the audio signal is judged to be less than the second preset threshold; if the highest volume within the buffering interval is greater than or equal to the second preset threshold, the audio signal is judged to be greater than or equal to the second preset threshold. The buffering interval may be 0.3 to 0.5 seconds and may vary with the background noise type.
In steps S401 to S403, the background noise type of the audio signal of the customer channel is identified to determine the type of scene the customer is currently in, and the second preset threshold matching that background noise type is obtained so that an appropriate response threshold (namely the second preset threshold) is selected. When the specified parameter of the audio signal of the customer channel is less than the second preset threshold, the response voice is generated according to the response sentence and the first voice adjustment parameter and sent to the customer corresponding to the customer channel, so that the interaction with the customer takes place at a better moment.
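Steps S401-S403 could be sketched as follows, with the noise "feature value" reduced to a single typical level per scene and the buffered check implemented over a short list of recent level samples; both simplifications are assumptions made for illustration.

```python
# Sketch: match the background to a preset noise scene, then gate the response
# voice on a short buffered window of level samples.
NOISE_SCENES = {                 # scene -> (typical level dB, second threshold dB)
    "road": (75.0, 80.0),
    "commercial_street": (65.0, 70.0),
    "supermarket": (60.0, 65.0),
}

def classify_scene(avg_level_db: float) -> str:
    """S401: pick the scene whose typical level is closest to the measured one."""
    return min(NOISE_SCENES, key=lambda s: abs(NOISE_SCENES[s][0] - avg_level_db))

def can_send_response(level_samples_db: list[float], scene: str) -> bool:
    """S402/S403: true when the 0.3-0.5 s buffer stays below the scene threshold."""
    threshold = NOISE_SCENES[scene][1]
    return max(level_samples_db) < threshold
```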
The voice interaction method provided by the embodiments of this application improves the adaptability of the intelligent voice system, enhances the interaction with customers, and improves the fluency of communication with customers.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In one embodiment, a voice interaction apparatus is provided, and the voice interaction apparatus corresponds one-to-one to the voice interaction method in the foregoing embodiments. As shown in FIG. 7, the voice interaction apparatus includes an audio judgment module 10, a playback suspension module 20, a response sentence determination module 30, and a response voice sending module 40. The functional modules are described in detail as follows:
the audio judgment module 10 is configured to obtain the audio signal of the customer channel while the dialogue voice is being played, and to judge whether the specified parameter of the audio signal is greater than a first preset threshold;
the playback suspension module 20 is configured to suspend playback of the dialogue voice if the specified parameter of the audio signal is greater than the first preset threshold;
the response sentence determination module 30 is configured to analyze the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;
the response voice sending module 40 is configured to, when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generate a response voice according to the response sentence and send the response voice to the customer corresponding to the customer channel.
Optionally, the voice interaction apparatus further includes:
a customer information obtaining module, configured to obtain customer information;
a call connection establishing module, configured to establish a call connection with the customer according to the customer information;
a dialogue text determining module, configured to determine initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
an initial dialogue voice generating module, configured to generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
an initial dialogue voice sending module, configured to send the initial dialogue voice to the customer.
Optionally, the response sentence determination module 30 includes:
an analysis unit, configured to analyze the audio signal of the customer channel and obtain the analysis result of the audio signal, where the analysis result indicates either that the audio signal contains a human voice or that it does not;
a connection sentence selecting unit, configured to, if the obtained analysis result is that the audio signal does not contain a human voice, select the connection sentence and the first voice adjustment parameter corresponding to that analysis result;
a first response sentence generating unit, configured to generate the response sentence according to the connection sentence and the dialogue voice, and to associate the response sentence with the first voice adjustment parameter.
Optionally, the response sentence determination module 30 further includes:
a speech recognition unit, configured to, if the obtained analysis result is that the audio signal contains a human voice, convert the audio signal of the customer channel into text data through a speech recognition engine, and recognize the tone type of the audio signal of the customer channel through a preset tone recognition model;
a semantic understanding unit, configured to recognize the semantic information of the text data through a semantic understanding engine;
a second response sentence generating unit, configured to select, from a preset response sentence database, the response sentence matching the semantic information, and to obtain a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
Optionally, the response voice sending module 40 includes:
a background noise recognition unit, configured to recognize the background noise type of the audio signal of the customer channel;
a threshold obtaining unit, configured to obtain the second preset threshold matching the background noise type;
a response voice sending unit, configured to, when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generate the response voice according to the response sentence and the first voice adjustment parameter, and send the response voice to the customer corresponding to the customer channel.
For specific limitations on the voice interaction apparatus, reference may be made to the above limitations on the voice interaction method, which are not repeated here. Each module in the above voice interaction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of a computer device in the form of hardware, or stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
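Purely as an illustration of the one-to-one correspondence between the modules and the method steps, the apparatus could be wired together roughly as below; the module interfaces are assumptions, not part of the disclosed implementation.

```python
class VoiceInteractionApparatus:
    """Sketch: modules 10-40 composed as collaborating objects (assumed interfaces)."""

    def __init__(self, audio_judge, playback, responder, sender):
        self.audio_judge = audio_judge   # module 10: threshold check on the channel
        self.playback = playback         # module 20: suspend/resume dialogue voice
        self.responder = responder       # module 30: analyze audio, pick response sentence
        self.sender = sender             # module 40: synthesize and send response voice

    def on_audio_frame(self, frame):
        if self.audio_judge.exceeds_first_threshold(frame):
            self.playback.suspend()
            sentence = self.responder.determine_sentence(frame)
            self.sender.send_when_quiet(sentence)
```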
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store data related to the above voice interaction method. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a voice interaction method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
while the dialogue voice is being played, obtaining the audio signal of the customer channel, and judging whether the specified parameter of the audio signal is greater than a first preset threshold;
if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to the customer corresponding to the customer channel.
In one embodiment, a computer-readable storage medium is provided. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium. The readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
while the dialogue voice is being played, obtaining the audio signal of the customer channel, and judging whether the specified parameter of the audio signal is greater than a first preset threshold;
if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to the customer corresponding to the customer channel.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the scope of protection of this application.

Claims (20)

  1. A voice interaction method, comprising:
    while a dialogue voice is being played, obtaining an audio signal of a customer channel, and judging whether a specified parameter of the audio signal is greater than a first preset threshold;
    if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
    analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
    when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to a customer corresponding to the customer channel.
  2. The voice interaction method according to claim 1, wherein before the obtaining an audio signal of a customer channel while a dialogue voice is being played and judging whether a specified parameter of the audio signal is greater than a first preset threshold, the method further comprises:
    obtaining customer information;
    establishing a call connection with the customer according to the customer information;
    determining initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
    generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
    sending the initial dialogue voice to the customer.
  3. The voice interaction method according to claim 1, wherein the analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result comprises:
    analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice;
    if the obtained analysis result is that the audio signal does not contain a human voice, selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result of not containing a human voice;
    generating the response sentence according to the connection sentence and the dialogue voice, and associating the response sentence with the first voice adjustment parameter.
  4. The voice interaction method according to claim 3, wherein after the analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice, the method further comprises:
    if the obtained analysis result is that the audio signal contains a human voice, converting the audio signal of the customer channel into text data through a speech recognition engine, and recognizing a tone type of the audio signal of the customer channel through a preset tone recognition model;
    recognizing semantic information of the text data through a semantic understanding engine;
    selecting, from a preset response sentence database, the response sentence matching the semantic information, and obtaining a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
  5. The voice interaction method according to claim 3, wherein the generating a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and sending the response voice to the customer corresponding to the customer channel comprises:
    identifying a background noise type of the audio signal of the customer channel;
    obtaining the second preset threshold matching the background noise type;
    when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generating the response voice according to the response sentence and the first voice adjustment parameter, and sending the response voice to the customer corresponding to the customer channel.
  6. A voice interaction apparatus, comprising:
    an audio judgment module, configured to obtain an audio signal of a customer channel while a dialogue voice is being played, and to judge whether a specified parameter of the audio signal is greater than a first preset threshold;
    a playback suspension module, configured to suspend playback of the dialogue voice if the specified parameter of the audio signal is greater than the first preset threshold;
    a response sentence determination module, configured to analyze the audio signal, obtain an analysis result of the audio signal, and determine a response sentence according to the analysis result;
    a response voice sending module, configured to, when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generate a response voice according to the response sentence and send the response voice to a customer corresponding to the customer channel.
  7. The voice interaction apparatus according to claim 6, further comprising:
    a customer information obtaining module, configured to obtain customer information;
    a call connection establishing module, configured to establish a call connection with the customer according to the customer information;
    a dialogue text determining module, configured to determine initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
    an initial dialogue voice generating module, configured to generate an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
    an initial dialogue voice sending module, configured to send the initial dialogue voice to the customer.
  8. The voice interaction apparatus according to claim 6, wherein the response sentence determination module comprises:
    an analysis unit, configured to analyze the audio signal of the customer channel and obtain the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice;
    a connection sentence selecting unit, configured to, if the obtained analysis result is that the audio signal does not contain a human voice, select a connection sentence and a first voice adjustment parameter corresponding to the analysis result of not containing a human voice;
    a first response sentence generating unit, configured to generate the response sentence according to the connection sentence and the dialogue voice, and to associate the response sentence with the first voice adjustment parameter.
  9. The voice interaction apparatus according to claim 8, wherein the response sentence determination module further comprises:
    a speech recognition unit, configured to, if the obtained analysis result is that the audio signal contains a human voice, convert the audio signal of the customer channel into text data through a speech recognition engine, and recognize a tone type of the audio signal of the customer channel through a preset tone recognition model;
    a semantic understanding unit, configured to recognize semantic information of the text data through a semantic understanding engine;
    a second response sentence generating unit, configured to select, from a preset response sentence database, the response sentence matching the semantic information, and to obtain a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
  10. The voice interaction apparatus according to claim 8, wherein the response voice sending module comprises:
    a background noise recognition unit, configured to recognize a background noise type of the audio signal of the customer channel;
    a threshold obtaining unit, configured to obtain the second preset threshold matching the background noise type;
    a response voice sending unit, configured to, when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generate the response voice according to the response sentence and the first voice adjustment parameter, and send the response voice to the customer corresponding to the customer channel.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    while a dialogue voice is being played, obtaining an audio signal of a customer channel, and judging whether a specified parameter of the audio signal is greater than a first preset threshold;
    if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
    analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
    when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to a customer corresponding to the customer channel.
  12. The computer device according to claim 11, wherein before the obtaining an audio signal of a customer channel while a dialogue voice is being played and judging whether a specified parameter of the audio signal is greater than a first preset threshold, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining customer information;
    establishing a call connection with the customer according to the customer information;
    determining initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
    generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
    sending the initial dialogue voice to the customer.
  13. The computer device according to claim 11, wherein the analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result comprises:
    analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice;
    if the obtained analysis result is that the audio signal does not contain a human voice, selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result of not containing a human voice;
    generating the response sentence according to the connection sentence and the dialogue voice, and associating the response sentence with the first voice adjustment parameter.
  14. The computer device according to claim 13, wherein after the analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice, the processor, when executing the computer-readable instructions, further implements the following steps:
    if the obtained analysis result is that the audio signal contains a human voice, converting the audio signal of the customer channel into text data through a speech recognition engine, and recognizing a tone type of the audio signal of the customer channel through a preset tone recognition model;
    recognizing semantic information of the text data through a semantic understanding engine;
    selecting, from a preset response sentence database, the response sentence matching the semantic information, and obtaining a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
  15. The computer device according to claim 13, wherein the generating a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and sending the response voice to the customer corresponding to the customer channel comprises:
    identifying a background noise type of the audio signal of the customer channel;
    obtaining the second preset threshold matching the background noise type;
    when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generating the response voice according to the response sentence and the first voice adjustment parameter, and sending the response voice to the customer corresponding to the customer channel.
  16. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    while a dialogue voice is being played, obtaining an audio signal of a customer channel, and judging whether a specified parameter of the audio signal is greater than a first preset threshold;
    if the specified parameter of the audio signal is greater than the first preset threshold, suspending playback of the dialogue voice;
    analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result;
    when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, generating a response voice according to the response sentence, and sending the response voice to a customer corresponding to the customer channel.
  17. The readable storage medium according to claim 16, wherein before the obtaining an audio signal of a customer channel while a dialogue voice is being played and judging whether a specified parameter of the audio signal is greater than a first preset threshold, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    obtaining customer information;
    establishing a call connection with the customer according to the customer information;
    determining initial voice parameters and an initial dialogue text according to the customer information and a preset interactive task;
    generating an initial dialogue voice according to the initial voice parameters and the initial dialogue text;
    sending the initial dialogue voice to the customer.
  18. The readable storage medium according to claim 16, wherein the analyzing the audio signal and obtaining an analysis result of the audio signal, and determining a response sentence according to the analysis result comprises:
    analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice;
    if the obtained analysis result is that the audio signal does not contain a human voice, selecting a connection sentence and a first voice adjustment parameter corresponding to the analysis result of not containing a human voice;
    generating the response sentence according to the connection sentence and the dialogue voice, and associating the response sentence with the first voice adjustment parameter.
  19. The readable storage medium according to claim 18, wherein after the analyzing the audio signal of the customer channel and obtaining the analysis result of the audio signal, wherein the analysis result comprises that the audio signal contains a human voice or that the audio signal does not contain a human voice, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    if the obtained analysis result is that the audio signal contains a human voice, converting the audio signal of the customer channel into text data through a speech recognition engine, and recognizing a tone type of the audio signal of the customer channel through a preset tone recognition model;
    recognizing semantic information of the text data through a semantic understanding engine;
    selecting, from a preset response sentence database, the response sentence matching the semantic information, and obtaining a second voice adjustment parameter matching the tone type, the second voice adjustment parameter being associated with the response sentence.
  20. The readable storage medium according to claim 18, wherein the generating a response voice according to the response sentence when the specified parameter of the audio signal of the customer channel is less than a second preset threshold, and sending the response voice to the customer corresponding to the customer channel comprises:
    identifying a background noise type of the audio signal of the customer channel;
    obtaining the second preset threshold matching the background noise type;
    when the specified parameter of the audio signal of the customer channel is less than the second preset threshold, generating the response voice according to the response sentence and the first voice adjustment parameter, and sending the response voice to the customer corresponding to the customer channel.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910883213.6 2019-09-18
CN201910883213.6A CN110661927B (en) 2019-09-18 2019-09-18 Voice interaction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021051506A1 true WO2021051506A1 (en) 2021-03-25

Family

ID=69038207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116512 WO2021051506A1 (en) 2019-09-18 2019-11-08 Voice interaction method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110661927B (en)
WO (1) WO2021051506A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111273990A (en) * 2020-01-21 2020-06-12 腾讯科技(深圳)有限公司 Information interaction method and device, computer equipment and storage medium
CN111654581A (en) * 2020-04-30 2020-09-11 南京智音云数字科技有限公司 Intelligent dialogue robot control method and system
CN111752523A (en) * 2020-05-13 2020-10-09 深圳追一科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN111629110A (en) * 2020-06-11 2020-09-04 中国建设银行股份有限公司 Voice interaction method and voice interaction system
CN111797215B (en) * 2020-06-24 2024-08-13 北京小米松果电子有限公司 Dialogue method, dialogue device and storage medium
CN114077840A (en) * 2020-08-17 2022-02-22 大众问问(北京)信息科技有限公司 Method, device, equipment and storage medium for optimizing voice conversation system
CN112820316A (en) * 2020-12-31 2021-05-18 大唐融合通信股份有限公司 Intelligent customer service dialogue method and system
CN112908314B (en) * 2021-01-29 2023-01-10 深圳通联金融网络科技服务有限公司 Intelligent voice interaction method and device based on tone recognition
CN112883178B (en) * 2021-02-18 2024-03-29 Oppo广东移动通信有限公司 Dialogue method, dialogue device, dialogue server and dialogue storage medium
CN113066489B (en) * 2021-03-16 2024-10-29 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
CN113096645A (en) * 2021-03-31 2021-07-09 闽江学院 Telephone voice processing method
CN113257242B (en) * 2021-04-06 2024-07-30 杭州远传新业科技股份有限公司 Voice broadcasting suspension method, device, equipment and medium in self-service voice service
CN113160817B (en) * 2021-04-22 2024-06-28 平安科技(深圳)有限公司 Voice interaction method and system based on intention recognition
CN113836172A (en) * 2021-09-30 2021-12-24 深圳追一科技有限公司 Interaction method, interaction device, electronic equipment, storage medium and computer program product
CN114285830B (en) * 2021-12-21 2024-05-24 北京百度网讯科技有限公司 Voice signal processing method, device, electronic equipment and readable storage medium
CN118233438B (en) * 2024-05-27 2024-08-09 烟台小樱桃网络科技有限公司 Comprehensive real-time audio/video multimedia communication platform, system, management system and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978980B (en) * 2015-07-03 2018-03-02 上海斐讯数据通信技术有限公司 A kind of method for controlling sound to play and sound playing system
CN109462707A (en) * 2018-11-13 2019-03-12 平安科技(深圳)有限公司 Method of speech processing, device and computer equipment based on automatic outer call system
CN109949071A (en) * 2019-01-31 2019-06-28 平安科技(深圳)有限公司 Products Show method, apparatus, equipment and medium based on voice mood analysis
CN109977218B (en) * 2019-04-22 2019-10-25 浙江华坤道威数据科技有限公司 A kind of automatic answering system and method applied to session operational scenarios

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6882973B1 (en) * 1999-11-27 2005-04-19 International Business Machines Corporation Speech recognition system with barge-in capability
US20040083107A1 (en) * 2002-10-21 2004-04-29 Fujitsu Limited Voice interactive system and method
EP1494208A1 (en) * 2003-06-30 2005-01-05 Harman Becker Automotive Systems GmbH Method for controlling a speech dialog system and speech dialog system
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
US20180033424A1 (en) * 2016-07-28 2018-02-01 Red Hat, Inc. Voice-controlled assistant volume control
CN107146613A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 Voice interaction method and device
CN109903758A (en) * 2017-12-08 2019-06-18 阿里巴巴集团控股有限公司 Audio processing method, device and terminal device
CN109509471A (en) * 2018-12-28 2019-03-22 浙江百应科技有限公司 Method for interrupting the dialogue of an intelligent voice robot based on a VAD algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113473345A (en) * 2021-06-30 2021-10-01 歌尔科技有限公司 Wearable device hearing assistance control method, device and system and readable storage medium
CN113473345B (en) * 2021-06-30 2022-11-01 歌尔科技有限公司 Wearable device hearing assistance control method, device and system and readable storage medium

Also Published As

Publication number Publication date
CN110661927B (en) 2022-08-26
CN110661927A (en) 2020-01-07

Similar Documents

Publication Title
WO2021051506A1 (en) Voice interaction method and apparatus, computer device and storage medium
US11210461B2 (en) Real-time privacy filter
US9293133B2 (en) Improving voice communication over a network
US9571638B1 (en) Segment-based queueing for audio captioning
CN107818798A (en) Customer service quality evaluation method, device, equipment and storage medium
US10217466B2 (en) Voice data compensation with machine learning
CN109873907B (en) Call processing method, device, computer equipment and storage medium
KR102535790B1 (en) Methods and apparatus for managing holds
CN109712610A (en) Method and apparatus for recognizing voice
US20150149162A1 (en) Multi-channel speech recognition
US11699043B2 (en) Determination of transcription accuracy
CN113779208A (en) Method and device for man-machine conversation
CN111696576A (en) Intelligent voice robot dialogue testing system
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
US20210312143A1 (en) Real-time call translation system and method
WO2019242415A1 (en) Position prompt method, device, storage medium and electronic device
CN116016779A (en) Voice call translation assistance method, system, computer equipment and storage medium
JP2005308950A (en) Speech processor and speech processing system
EP3641286B1 (en) Call recording system for automatically storing a call candidate and call recording method
KR102378895B1 (en) Method for learning wake-word for speech recognition, and computer program recorded on record-medium for executing method therefor
RU2783966C1 (en) Method for processing incoming calls
CN110125946A (en) Automatic call method, device, electronic equipment and computer-readable medium
KR20230156599A (en) System for recording and managing calls in a contact center
WO2024050487A1 (en) Systems and methods for substantially real-time speech, transcription, and translation
KR20150010499A (en) Method and device for improving voice recognition using telephone conversation voice

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 19945841
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry in European phase
Ref document number: 19945841
Country of ref document: EP
Kind code of ref document: A1