CN114420130A - Telephone voice interaction method, device, equipment and storage medium - Google Patents
- Publication number
- CN114420130A (application CN202210096102.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- event
- voice data
- data
- telephone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
- G10L15/26 — Speech to text systems
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/00 — Speaker identification or verification techniques
- G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
- G10L2015/223 — Execution procedure of a spoken command
Abstract
The invention discloses a telephone voice interaction method, apparatus, device and storage medium. The method comprises: continuously collecting voice data after the telephone is connected, and storing the voice data in segments; performing speech and event recognition on the voice data segment by segment to generate voice text and process events; matching a response action based on the process event; and, when the response action is to play voice, matching a voice file based on the voice text and playing it. Storing and recognizing the voice data in segments effectively reduces the performance requirements on the server and avoids a single event having to process a large data stream, while generating the voice text and process events through segment-by-segment recognition effectively improves the softswitch platform's speed of response to the caller and avoids making the caller wait for a reply.
Description
Technical Field
The embodiments of the invention relate to telephone voice interaction technology, and in particular to a telephone voice interaction method, apparatus, device and storage medium.
Background
With the development of artificial intelligence and intelligent telephone robot technologies, more and more sales and customer-service operations dial and answer marketing calls through intelligent telephone robots, effectively improving the efficiency of telemarketing and customer service.
The operating logic of existing intelligent telephone robots is relatively simple: after a caller finishes a stretch of speech, the collected audio is recognized into voice text based on speech recognition technology, a scripted reply is matched based on that text, and the matched voice is finally played. Because recognition only begins once the caller has finished speaking, the reply interval is relatively long, making the system feel slow to respond. Moreover, behaviors such as the caller interrupting playback or remaining silent for a long time cannot be accurately identified, so the intelligent telephone robot's responses are stilted and insufficiently flexible.
Disclosure of Invention
The invention provides a telephone voice interaction method, apparatus, device and storage medium for responding more promptly to a caller's speech.
In a first aspect, an embodiment of the present invention provides a telephone voice interaction method, including:
continuously acquiring voice data after the telephone is connected, and storing the voice data in a segmented manner;
performing voice and event recognition on the voice data segment by segment to generate voice text and process events;
matching a response action based on the process event;
and when the response action is to play voice, matching a voice file based on the voice text and playing it.
Optionally, the voice data includes at least one segment;
the continuous voice data acquisition after the call is connected comprises the following steps:
continuously acquiring call data of a telephone after the telephone is dialed;
and storing the collected call data as a section of voice data at a preset time interval.
Optionally, the performing speech and event recognition on the speech data to generate a speech text and a process event includes:
carrying out voice recognition on the voice data to generate a speaking starting event and a speaking ending event;
and carrying out voice recognition on the voice data based on the speaking starting event and the speaking ending event to obtain a voice text.
Optionally, the performing voice recognition on the voice data to generate a speaking start event and a speaking end event includes:
identifying whether the voice data at the current moment is human voice data;
if the voice data at the current moment is human voice data, judging whether the voice data at the previous moment is human voice data;
if the voice data at the previous moment is not human voice data, generating a speaking start event;
if the voice data at the current moment is not human voice data, judging whether the voice data at the previous moment is human voice data;
and if the voice data at the previous moment is human voice data, generating a speaking end event.
Optionally, the process event further comprises an interruption event;
before generating the speaking start event if the voice data at the previous moment is not human voice data, the method further comprises:
acquiring whether a voice file is currently playing;
if yes, generating an interruption event;
if not, generating a speaking start event.
Optionally, the VAD algorithm of WebRTC is used to identify whether the current voice data is human voice data.
Optionally, after the generating of the speaking end event, the method further comprises:
timing the duration for which the voice data is not human voice data;
and generating a reminder event when the duration reaches a preset time threshold.
In a second aspect, an embodiment of the present invention further provides a telephone voice interaction apparatus, including:
the acquisition module is used for continuously acquiring voice data after the call is connected;
the recognition module is used for carrying out voice and event recognition on the voice data so as to generate a voice text and a process event;
a matching module to match response actions based on the process event;
and the response module is used for matching the voice file based on the voice text and playing it when the response action is to play voice.
In a third aspect, an embodiment of the present invention further provides a telephone voice interaction device, where the device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the telephone voice interaction method of the first aspect.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the telephone voice interaction method according to the first aspect.
By continuously collecting voice data after the telephone is connected, storing it in segments, and then performing segment-by-segment speech and event recognition on the voice data to generate voice text and process events, the invention effectively reduces the performance requirements on the server, avoids a single event having to process a large data stream, effectively improves the softswitch platform's speed of response to the caller, and avoids making the caller wait for a reply.
Drawings
Fig. 1 is a flowchart of a telephone voice interaction method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a telephone voice interaction apparatus according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a telephone voice interaction device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a telephone voice interaction method according to an embodiment of the present invention. The method is applicable to scenarios in which an intelligent telephone robot makes marketing calls or answers customer-service calls. It may be executed by a telephone voice interaction apparatus, which may be implemented in software and/or hardware and configured in a computer device such as a server, workstation or personal computer. The method specifically includes the following steps:
and step 110, continuously acquiring voice data after the call is connected, and storing the voice data in a segmented mode.
In the embodiment of the invention, the FreeSwitch telephony softswitch platform can be used to connect calls and collect voice data. FreeSwitch is a cross-platform open-source telephony softswitch platform, written mainly in C and released under the MPL 1.1 license. It is highly scalable, covering everything from the simplest softphone to a commercial-grade softswitch platform. It supports communication protocols such as SIP, Skype, H.323, IAX and Google Talk, as well as many advanced SIP features such as presence, BLF, SLA, TCP, TLS and sRTP. It can be used as a pure SBC, for example as a proxy for T.38 and other point-to-point communications, and can connect to other open-source VoIP systems, such as OpenPBX, Bayonne, YATE and Asterisk, as a B2BUA. FreeSwitch supports voice codecs at various bandwidths, supports high-definition calls at 8, 16, 32 and 48 kHz sampling rates, and can automatically transcode when bridging audio at different rates.
Here, the voice data refers to digitized information (data) converted from an analog input (electric signal) of an original sound.
In a specific implementation, voice data collection may begin after the call is dialed or when it is answered, depending on the actual situation, and continues until the call ends. In the embodiment of the invention, the collected voice data is stored and transmitted in segments: during collection the data is split by a set time interval or data size, and once that interval or size is reached the segment is passed to the next step for processing. To make responses during the call more timely, the interval or size is set relatively small, so that each segment can be processed promptly and a response action taken in time. For example, the time interval may be set to 1 ms, 2 ms, 3 ms, 5 ms, 10 ms, 20 ms, 30 ms, 100 ms or 200 ms. Correspondingly, when segmenting by data size, the time span covered by each segment should likewise be relatively small.
The relationship between the time interval and the data size is:

data_size = (sampling precision / 8) × sampling rate × time interval / 1000

For example, with a time interval of 40 ms and PCM-encoded voice data at 16-bit precision and an 8 kHz sampling rate, the corresponding data size is 16/8 × 8000 × 40/1000 = 640 bytes.
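As a brief sketch, the segment-size relationship above can be expressed directly in code (the function name and parameter names are illustrative, not from the patent):

```python
def segment_size_bytes(bit_depth: int, sample_rate_hz: int, interval_ms: int) -> int:
    """Bytes per segment = (bit depth / 8) * sample rate * interval / 1000."""
    return (bit_depth // 8) * sample_rate_hz * interval_ms // 1000

# 16-bit PCM at an 8 kHz sampling rate, segmented every 40 ms
print(segment_size_bytes(16, 8000, 40))  # 640
```

The same function also gives, e.g., 3200 bytes for a 200 ms interval at the same precision and sampling rate.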
Speech and event recognition is performed on the speech data segment by segment to generate speech text and process events, step 120.
In a specific implementation, the process events may include the caller answering the call, starting to speak, finishing a sentence, interrupting playback, hanging up, and the like. These events can be identified from whether the caller is speaking and from the call state, without recognizing the text content of the voice data, which effectively reduces the amount of computation.
After the process events are recognized, speech-to-text recognition is further performed on the voice data. Speech recognition means having a machine convert a speech signal into corresponding text or commands through recognition and understanding; the technology mainly involves feature extraction, pattern-matching criteria and model training. Speech recognition is now relatively mature, and the voice data recognition in the embodiment of the invention can directly use existing speech recognition technology.
In the specific implementation, the voice data is segmented, so that the pressure on the server is relatively low when the voice recognition technology is used for recognition, more timely voice text recognition can be realized, and the response speed is improved.
For example, in the embodiment of the invention, recognition of voice text and process events can be implemented by combining an ASR module with the FreeSwitch telephony softswitch platform. ASR (Automatic Speech Recognition) converts speech into text; usable services include Tencent Cloud speech recognition (ASR) and iFlytek's real-time speech recognition ASR, which transmits voice data over WebSocket. The source of the ASR module used in the invention is not particularly limited, as long as it meets the voice text recognition requirements of the embodiment.
Step 130, matching response actions based on the process event.
In a specific implementation, voice text and process events are generated from the voice data recognized in the preceding steps, where each process event corresponds to a call action of the caller, such as answering, hanging up, speaking, interrupting a voice file being played by the FreeSwitch telephony softswitch platform, or failing to respond for a long time.
The FreeSwitch telephony softswitch platform needs to respond to these process events (call actions) to interact with the caller. For example, when a marketing call is dialed automatically using the telephone voice interaction provided by this embodiment, a preset opening speech is played after the caller answers; a corresponding answer is given when the caller asks a question; when the caller interrupts the voice being played, playback is paused, the system judges whether the interruption is genuine, and then either resumes playback or switches to a new reply voice file; and after the caller hangs up, the call data is uploaded and stored.
And step 140, when the response action is to play voice, matching the voice file based on the voice text and playing it.
In this step, when the process event indicates that the caller has raised a question or a demand, the matched response action is to reply to the caller by playing voice; the specific operation is to match a voice file based on the voice text and play it.
According to the technical scheme of this embodiment, voice data is continuously collected after the telephone is connected and stored in segments, and speech and event recognition is then performed on the voice data segment by segment to generate voice text and process events. Storing and recognizing the voice data in segments effectively reduces the performance requirements on the server and avoids a single event having to process a large data stream, while segment-by-segment recognition effectively improves the telephone softswitch platform's speed of response to the caller and avoids making the caller wait for a reply.
On the basis of the above technical scheme, the voice data comprises at least one segment. As can be seen from the foregoing, a single segment only occurs when the actual call duration is short, i.e. when it does not exceed the segment interval time or data size set in the embodiment of the invention.
And in step 110, may include:
and step 111, continuously acquiring call data of the phone after the phone is dialed.
In a specific implementation, the Freeswitch telephone soft switch platform can be used for realizing the connection of a telephone and the continuous acquisition of voice data, and temporarily storing the acquired voice data into a cache area.
And 112, storing the acquired call data as a section of voice data at preset time intervals.
In addition to the above-mentioned manner of using the preset time interval in step 112, the voice data may be segmented by using a preset data size and then sent to the next step for processing.
To make responses during the call more timely, the set time interval or data size is kept relatively small, so that each segment of voice data can be processed promptly and a response action taken in time. For example, the time interval may be set to 1 ms, 2 ms, 3 ms, 5 ms, 10 ms, 20 ms, 30 ms, 100 ms or 200 ms. Correspondingly, when segmenting by data size, the time span covered by each segment should likewise be relatively small.
Step 120 may include:
and step 121, performing voice recognition on the voice data to generate a speaking starting event and a speaking ending event.
In the embodiment of the present invention, voice recognition here means recognizing the human voice in the voice data, judging whether the caller is speaking, and thereby determining the current communication scenario (process event) with the caller: whether the caller has started speaking, generating a speaking start event; whether the caller has finished speaking, generating a speaking end event; whether the caller has interrupted the voice being played to them, generating an interruption event; and so on.
In a specific implementation, besides the talk start event and the talk end event, the system can also include a listen event, an interrupt event, a hang-up event, and the like.
And step 122, carrying out voice recognition on the voice data based on the speaking starting event and the speaking ending event to obtain a voice text.
In a practical implementation, sentence division of the voice data can be achieved based on the speaking start and speaking end events; that is, the voice data is further divided according to each utterance of the caller, which ensures sentence integrity during text recognition and hence accuracy in semantic judgment.
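To illustrate the sentence division based on speaking start and end events, here is a minimal sketch (the class, method names and event plumbing are assumptions for illustration, not the patent's actual implementation) that buffers the per-interval audio segments between a speaking start event and a speaking end event:

```python
class SentenceAssembler:
    """Collects per-interval audio segments into whole utterances,
    delimited by speaking start/end events."""

    def __init__(self):
        self._current = []       # segments of the utterance in progress
        self._speaking = False
        self.sentences = []      # one bytes entry per completed utterance

    def on_start(self):
        # Speaking start event: open a new utterance buffer.
        self._speaking = True
        self._current = []

    def on_segment(self, segment):
        # Audio outside an utterance (silence, line noise) is not buffered.
        if self._speaking:
            self._current.append(segment)

    def on_end(self):
        # Speaking end event: the buffered segments form one complete sentence.
        if self._speaking:
            self.sentences.append(b"".join(self._current))
            self._speaking = False
```

Each completed entry in `sentences` would then be handed to the ASR service as one sentence, preserving sentence integrity for the semantic matching step.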
In step 121, performing voice recognition on the voice data to generate a speaking start event and a speaking end event, which may include:
step 1211, identify whether the current voice data is voice data.
In a specific implementation, human voice recognition can use the VAD algorithm in Google's open-source WebRTC framework. The module mainly detects whether a piece of voice data contains human speech: the function input is a piece of binary voice data, and the output is -1 or 1, where -1 indicates that the data is not human voice and 1 indicates that it is. In other embodiments, human voice recognition can be implemented in other ways, as long as it can judge whether the caller is currently speaking.
Step 1212, if the voice data at the current moment is voice data, determining whether the voice data at the previous moment is voice data.
And step 1213, if the previous time is not the voice data, generating a speaking starting event.
When the current data is judged to be human voice, judging whether the data at the previous moment was also human voice determines whether the caller was already speaking, and hence whether the caller is now starting to speak or continuing to speak.
Step 1214, if the voice data at the current moment is not the human voice data, determining whether the voice data at the previous moment is the human voice data.
And 1215, if the previous moment is the voice data, generating an ending speaking event.
When the current data is judged not to be human voice, judging whether the data at the previous moment was human voice determines whether the caller was speaking just before, and hence whether the caller has now finished speaking or is listening to the voice being played to them.
Further, in one embodiment, the process event further comprises a break event;
before step 1213, it further comprises:
whether the voice file is currently played or not is acquired, that is, whether the question of the caller is currently answered or the service is currently introduced to the caller is judged.
If the voice data at the current moment is the voice data and the voice data at the previous moment is not the voice data, an interruption event is generated if the voice file is detected to be played currently, and if not, a speaking starting event is generated.
After generating the end-of-talk event at step 1215, further comprising:
and the timing voice data is not the duration of the human voice data, and when the duration reaches a preset time threshold, a reminding event is generated.
By timing the duration that the voice data is not the voice data, the silence duration of the caller can be obtained, and whether the caller has long-time no response is determined based on the length of the played voice, and a corresponding reminding event is generated.
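The silence timing described above can be sketched as follows (the threshold value, event name "REMIND" and method signatures are illustrative assumptions, not taken from the patent):

```python
class SilenceMonitor:
    """Times how long the voice data has contained no human voice and
    emits a reminder event once a preset threshold is reached."""

    def __init__(self, threshold_s=5.0):
        self.threshold_s = threshold_s
        self._silent_since = None  # timestamp at which silence began

    def on_frame(self, is_voice, now_s):
        if is_voice:
            self._silent_since = None  # caller spoke: reset the timer
            return None
        if self._silent_since is None:
            self._silent_since = now_s  # silence begins
            return None
        if now_s - self._silent_since >= self.threshold_s:
            self._silent_since = now_s  # re-arm so the reminder fires once per interval
            return "REMIND"
        return None
```

In the deployed system, each incoming voice data segment would call `on_frame` with the VAD result, and a "REMIND" return value would trigger playing a prompt to the silent caller.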
In one embodiment of the invention, the telephone voice interaction method is deployed on a Freeswitch telephone softswitch platform, and an intelligent voice ASR module and an intelligent voice IVR module are integrated on the Freeswitch telephone softswitch platform.
Wherein the intelligent speech ASR module may include the following functionality:
1) Automatic speech recognition. That is, voice data recognition based on WebSocket.
2) Generation of process events. These include a speaking start event, a speaking end event, an interruption event, a speech recognition completion event, a speech recognition failure event, a speech recognition intermediate-result return event, and the like.
The intelligent voice IVR module may include the following functions:
1) event monitoring: various events generated by the Freeswitch telephony switch platform are monitored and received.
2) Event registration and processing: first, the processing functions of various events are registered, and when a certain event is received, the corresponding processing function is called.
3) Interacting with the service background: constructing notification events for the service background from the various events generated by the FreeSwitch telephone exchange platform, and executing action instructions issued by the service background.
The automatic speech recognition of the intelligent speech ASR module can be based on iFlytek's real-time speech recognition ASR, which transmits voice data over WebSocket. It can recognize single-channel, PCM-encoded voice data with 16-bit precision at an 8 kHz sampling rate, and returns the recognized sentence text in real time.
The recognition process of the intelligent speech ASR module is as follows:
1) and establishing an ASR service connection request in a websocket mode.
2) And continuously acquiring the voice data of the Freeswitch telephone exchange platform at certain time intervals (milliseconds) until the call is ended.
When ASR recognition is started, a media_bug and a callback function are added to the FreeSwitch telephone switching platform through its switch_core_media_bug_add function, for copying and processing voice data.
As the FreeSwitch telephone exchange platform continuously collects voice data and writes it into the newly added media_bug, the registered callback function is executed to process the voice data in the media_bug.
In this step, the callback function reads the voice data from the media_bug, stores it in the cache, and sends it to the ASR server for speech recognition.
3) Asynchronously receiving the returned voice data recognition results (voice text).
In this step, a thread is created when the ASR service is started, dedicated to processing the response data returned by the ASR.
Generating speech recognition process events
In this step, the vad module in Google's open-source WebRTC framework is used. The module mainly detects whether a piece of voice data contains human speech: the function input is a piece of binary voice data and the output is -1 or 1, where -1 indicates that the data is not human voice and 1 indicates that it is.
From the response data returned by the vad module and the ASR service, several key events can be defined as follows:
1) Generating an ASR service connection establishment success event: ASR_START_SUCCESS
The event may be generated when the connection with the ASR server is successful.
2) Generating an ASR service connection establishment failure event: ASR_START_FAIL
This event may be generated when the connection with the ASR server fails.
3) Generating a speaking start event: SENTENCE_BEGIN
In the callback function described in the foregoing step, the WebRtcVad_Process function of the vad module is called to determine whether the current voice data is human speech. Two session variables, is_voice and speaking, are set in the component's ASR module, both with initial values of false. When the current voice data is judged to be human voice, is_voice is set to true. The speaking start event SENTENCE_BEGIN is then generated when speaking is false and is_voice is true, at which point speaking is also set to true. Through the state transitions of these two variables, the SENTENCE_BEGIN event is generated only when speech actually starts, preventing the event from being generated continuously while the caller is talking.
4) Generating a speaking end event: SENTENCE_END
Similar to the principle of step 3), when the WebRtcVad_Process function judges that the current voice data is not human speech, is_voice is set to false. The speaking end event SENTENCE_END is then generated when speaking is true and is_voice is false, at which point speaking is also set to false. Through the state transitions of these two variables, the SENTENCE_END event is generated only when speech stops, preventing the event from being generated while the caller is not speaking.
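The two-variable mechanism of steps 3) and 4) can be sketched as a small state machine (a sketch for illustration; the patent's actual implementation lives inside the FreeSwitch callback, and the class and method names here are assumptions):

```python
class TalkStateMachine:
    """Generates SENTENCE_BEGIN only on the silence-to-voice transition and
    SENTENCE_END only on the voice-to-silence transition, mirroring the
    is_voice / speaking session variables described above."""

    def __init__(self):
        self.speaking = False  # initial value false, as in the text

    def on_vad_result(self, vad):
        """vad follows the convention used here: 1 = human voice, -1 = not."""
        is_voice = vad == 1
        if is_voice and not self.speaking:
            self.speaking = True
            return "SENTENCE_BEGIN"
        if not is_voice and self.speaking:
            self.speaking = False
            return "SENTENCE_END"
        return None  # no transition, no event: avoids firing on every frame
```

Feeding the per-segment VAD results 1, 1, -1 through `on_vad_result` yields SENTENCE_BEGIN, then nothing, then SENTENCE_END, exactly the once-per-transition behavior the text describes.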
5) Speech recognition failure event: ASR_TASK_FAIL
This event is generated when an error occurs while sending the voice data or during ASR server recognition, such as a network outage.
6) Speech recognition completion event: SENTENCE_COMPLETE
This event is generated when the ASR server has recognized the content of a sentence.
The main functions of the intelligent voice IVR module are registering event handling functions, listening for and processing events, and executing instructions issued by the service background.
The method comprises the following specific steps:
registering event handling functions
Events received by the intelligent voice IVR include, but are not limited to: CHANNEL_ANSWER, CHANNEL_HANGUP_COMPLETE, RECORD_START, CHANNEL_EXECUTE_COMPLETE, PLAYBACK_START, PLAYBACK_STOP, and CUSTOM (custom events). Different processing logic is therefore required for each event.
For example, receiving a CHANNEL_ANSWER event requires changing the call state to answered; receiving a CHANNEL_HANGUP_COMPLETE event requires disconnecting the socket connection with the FreeSwitch telephone exchange platform; receiving a PLAYBACK_START event requires setting the playback state to true; receiving a SENTENCE_BEGIN event requires resetting the silence timer; receiving a SENTENCE_END event requires starting the silence timer and feeding the recognized content back to the service background; and so on.
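One common way to realize the registration step above is a dispatch table mapping event names to handlers. The handler bodies and the call-state dict below are illustrative assumptions, not the patent's actual code.

```python
# Dispatch table: one handler per event name.
handlers = {}

def register(event_name):
    """Decorator that records a handler for the given event name."""
    def wrap(fn):
        handlers[event_name] = fn
        return fn
    return wrap

@register("CHANNEL_ANSWER")
def on_answer(call):
    call["state"] = "answered"   # change the call state to answered

@register("SENTENCE_BEGIN")
def on_sentence_begin(call):
    call["silence_ms"] = 0       # reset the silence timer

@register("PLAYBACK_START")
def on_playback_start(call):
    call["playing"] = True       # mark playback as in progress

def dispatch(event_name, call):
    """Look up and invoke the handler; unknown events are ignored."""
    fn = handlers.get(event_name)
    if fn is not None:
        fn(call)
```

Unregistered events fall through silently, so new event types can be added by registering a handler without touching the dispatch loop.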
Listening and processing events
The IVR module of the component starts the socket service, listens on a port (for example, port 8041), and waits for the FreeSwitch telephone exchange platform to connect. After a FreeSwitch session thread connects to the intelligent IVR component, a new thread is created to receive and process the events generated by the FreeSwitch platform. Events received and processed include, but are not limited to: CHANNEL_ANSWER, CHANNEL_HANGUP_COMPLETE, RECORD_START, CHANNEL_EXECUTE_COMPLETE, PLAYBACK_START, PLAYBACK_STOP, SENTENCE_BEGIN, and SENTENCE_END.
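The listen-and-spawn-a-thread flow can be sketched as follows. Real FreeSwitch event sockets speak a richer header/body protocol; the one-event-name-per-line framing here is a simplifying assumption, as are the function names.

```python
import socket
import threading

def handle_session(conn, log):
    """One thread per FreeSwitch session (sketch): read newline-delimited
    event names and record them until the hangup event arrives."""
    with conn, conn.makefile("r") as events:
        for line in events:
            name = line.strip()
            log.append(name)
            if name == "CHANNEL_HANGUP_COMPLETE":
                break  # FreeSwitch hung up; end this session

def serve(host="127.0.0.1", port=8041):
    """Listen on the IVR port (8041 in the example above) and spawn a
    new thread for each FreeSwitch connection."""
    srv = socket.create_server((host, port))
    while True:
        conn, _addr = srv.accept()
        threading.Thread(
            target=handle_session, args=(conn, []), daemon=True
        ).start()
```

The session handler can be exercised without a real switch by feeding it one end of a socket pair.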
Sending service event notifications and executing instructions issued by the service background
The intelligent IVR module and the service background exchange data through message bodies in JSON format. The messages comprise request message bodies and response message bodies. Request message bodies are sent by the intelligent IVR module to the service background; response message bodies are sent by the service background to the intelligent IVR module.
The service event notifications mainly include: enter_notify, answer_notify, interrupt_notify, silent_notify, leave_notify, and asr_notify.
enter_notify: notifies the service background when the FreeSwitch telephone exchange platform connects to the intelligent IVR module after the call is dialed.
answer_notify: notifies the service background when the user answers the call.
asr_notify: notifies the service background when the user finishes a sentence and the recognition result is obtained.
interrupt_notify: notifies the service background when the user interrupts by speaking while the robot is playing a recording.
silent_notify: notifies the service background when the user has been silent for a certain time.
leave_notify: notifies the service background with the call detail data when the call ends.
The response actions mainly include: start_asr, playback, break_uuid, and noop.
start_asr: generally issued when the service background receives an enter_notify notification; the FreeSwitch telephone exchange platform loads and starts the intelligent ASR module of the component.
playback: generally issued when the service background receives an answer_notify, asr_notify, or silent_notify notification; the FreeSwitch telephone exchange platform plays the specified voice file.
break_uuid: generally issued when the service background receives an interrupt_notify notification; the FreeSwitch telephone exchange platform interrupts the currently playing voice file and executes the subsequent playback action.
noop: generally issued when the service background receives a leave_notify notification. This instruction is a no-op; the FreeSwitch telephone exchange platform performs no action.
Request message body: {
"calleeid": "1362286xxxx", // called number
"callerid": "1326504xxxx", // calling number
"notify": "xxx_notify" // event type
}
Response message body: {
"action": "xxx", // action to perform
"params": { action-related parameter settings },
"after_action": "xxx", // action to perform subsequently
"after_params": { after_action-related parameter settings }
}
For example, when the intelligent IVR module receives a CHANNEL_ANSWER event, it constructs an answer_notify message body and sends it to the service background to notify it that the client has answered the call; the service background then issues a playback instruction to play the opening announcement recording.
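A minimal sketch of building the request body and parsing the response body, following the JSON formats shown above. The "calleeid" spelling is reconstructed from the garbled source text and is an assumption, as are the helper names.

```python
import json

def build_request(callee: str, caller: str, notify: str) -> str:
    """Build a request message body in the format shown above.
    Field name 'calleeid' is a reconstruction (assumption)."""
    return json.dumps(
        {"calleeid": callee, "callerid": caller, "notify": notify}
    )

def parse_response(body: str):
    """Extract the action and its parameters from a response body."""
    msg = json.loads(body)
    return msg["action"], msg.get("params", {})
```
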
Example two
Fig. 2 is a structural diagram of a telephone voice interaction apparatus according to a second embodiment of the present invention. The device includes: an acquisition module 21, a recognition module 22, a matching module 23 and a response module 24. Wherein:
the acquisition module 21 is used for continuously acquiring voice data after the call is connected;
a recognition module 22 for performing speech and event recognition on the speech data to generate speech text and process events;
a matching module 23 for matching response actions based on process events;
and the response module 24 is used for matching a voice file based on the voice text and playing it when the response action is to play voice.
The voice data includes at least one segment;
the acquisition module 21 includes:
the acquisition unit is used for continuously acquiring call data of the phone after the phone is dialed;
and the segmentation unit is used for storing the acquired call data as a section of voice data at preset time intervals.
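The segmentation unit's fixed-interval splitting can be sketched as slicing the continuous call stream into equal-sized chunks. The sample rate and interval below are illustrative assumptions, not values fixed by the description.

```python
def segment_stream(pcm: bytes, bytes_per_segment: int):
    """Split a continuous call stream into fixed-size voice-data
    segments, as the segmentation unit does at a preset time interval.

    For 8 kHz 16-bit mono audio, a 200 ms interval corresponds to
    8000 * 2 * 0.2 = 3200 bytes (both figures are assumptions for
    illustration); the final segment may be shorter.
    """
    return [
        pcm[i : i + bytes_per_segment]
        for i in range(0, len(pcm), bytes_per_segment)
    ]
```
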
The identification module 22 includes:
the event identification unit is used for carrying out voice identification on voice data to generate a speaking starting event and a speaking ending event;
and the voice recognition unit is used for carrying out voice recognition on the voice data based on the speaking starting event and the speaking ending event to obtain a voice text.
An event recognition unit comprising:
the human voice identification subunit is used for identifying whether the current voice data is human voice data;
a first previous moment judgment subunit, configured to judge whether the voice data at the previous moment is human voice data if the voice data at the current moment is human voice data;
a start generation subunit, configured to generate a speaking start event if the voice data at the previous moment is not human voice data;
a second previous moment judgment subunit, configured to judge whether the voice data at the previous moment is human voice data if the voice data at the current moment is not human voice data;
and the ending generation subunit is used for generating a speaking end event if the voice data at the previous moment is human voice data.
The process event also includes a break event;
before the start generation subunit generates the speaking start event when the previous moment is not human voice data, the event recognition unit further includes:
a playing state subunit, configured to determine whether a voice file is currently being played;
an interruption generation subunit, configured to generate an interruption event if a voice file is playing;
and if no voice file is playing, the start generation subunit generates the speaking start event.
In the embodiment of the invention, the VAD algorithm of WebRTC is used to identify whether the current voice data is human voice data.
Further comprising:
the timing subunit is used for timing the duration for which the voice data is not human voice data;
and the reminding generation subunit is used for generating a reminding event when the duration reaches a preset time threshold.
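The timing and reminding subunits can be sketched as a counter over per-segment VAD verdicts. The class name, segment duration, and threshold are illustrative assumptions; only the accumulate-and-fire behavior follows the description.

```python
class SilenceTimer:
    """Count consecutive non-voice segments and emit a reminder event
    once the accumulated silence reaches a preset threshold."""

    def __init__(self, threshold_ms: int = 5000, segment_ms: int = 200):
        self.threshold_ms = threshold_ms  # preset time threshold
        self.segment_ms = segment_ms      # duration of one segment
        self.silent_ms = 0                # accumulated silence

    def feed(self, is_voice: bool):
        """Feed one segment's VAD verdict; return the event or None."""
        if is_voice:
            self.silent_ms = 0            # voice resets the timer
            return None
        self.silent_ms += self.segment_ms
        if self.silent_ms == self.threshold_ms:
            return "silent_notify"        # fire exactly once per run
        return None
```

Firing only when the counter equals the threshold (rather than at or above it) produces one reminder per silent stretch instead of one per segment.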
The telephone voice interaction device provided by the embodiment of the invention can execute the telephone voice interaction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic apparatus includes a processor 30, a memory 31, a communication module 32, an input device 33, and an output device 34; the number of the processors 30 in the electronic device may be one or more, and one processor 30 is taken as an example in fig. 3; the processor 30, the memory 31, the communication module 32, the input device 33 and the output device 34 in the electronic device may be connected by a bus or other means, and the bus connection is taken as an example in fig. 3.
The memory 31 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the modules corresponding to the telephone voice interaction method in the present embodiment (for example, the acquisition module 21, the recognition module 22, the matching module 23, and the response module 24 in a telephone voice interaction apparatus). The processor 30 executes various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 31, so as to implement the telephone voice interaction method.
The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 31 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 31 may further include memory located remotely from the processor 30, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
And the communication module 32 is used for establishing connection with the display screen and realizing data interaction with the display screen. The input device 33 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus.
The electronic device provided by the embodiment of the invention can execute the telephone voice interaction method provided by any embodiment of the invention, and has corresponding functions and beneficial effects.
Example four
A fourth embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a telephone voice interaction method, the method including:
continuously acquiring voice data after the telephone is connected, and storing the voice data in a segmented manner;
performing voice and event recognition on the voice data segment by segment to generate voice text and process events;
matching a response action based on the process event;
and when the response action is used for playing voice, matching a voice file based on the voice text and playing.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling a computer electronic device (which may be a personal computer, a server, or a network electronic device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the telephone voice interaction apparatus, the units and modules included in the embodiment are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A telephone voice interaction method, comprising:
continuously acquiring voice data after the telephone is connected, and storing the voice data in a segmented manner;
performing voice and event recognition on the voice data segment by segment to generate voice text and process events;
matching a response action based on the process event;
and when the response action is used for playing voice, matching a voice file based on the voice text and playing.
2. A telephone voice interaction method according to claim 1, wherein the voice data comprises at least one segment;
the continuous voice data acquisition after the call is connected comprises the following steps:
continuously acquiring call data of a telephone after the telephone is dialed;
and storing the collected call data as a section of voice data at a preset time interval.
3. The telephony voice interaction method of claim 1, wherein the performing voice and event recognition on the voice data to generate voice text and process events comprises:
carrying out voice recognition on the voice data to generate a speaking starting event and a speaking ending event;
and carrying out voice recognition on the voice data based on the speaking starting event and the speaking ending event to obtain a voice text.
4. The method of claim 3, wherein the performing voice recognition on the voice data to generate a talk start event and a talk end event comprises:
identifying whether the current voice data is human voice data;
if the voice data at the current moment is the human voice data, judging whether the voice data at the previous moment is the human voice data;
if the previous moment is not the voice data, generating a speaking starting event;
if the voice data at the current moment is not the human voice data, judging whether the voice data at the previous moment is the human voice data or not;
and if the previous moment is the voice data, generating a speech ending event.
5. The telephony voice interaction method of claim 4, wherein the process event further comprises a break event;
before generating the speaking start event if the previous time is not the voice data, the method further comprises:
acquiring whether a current voice file is playing;
if yes, generating an interruption event;
if not, generating a speaking starting event.
6. A telephone voice interaction method according to any one of claims 3 to 5, characterized in that the VAD algorithm of WebRTC is used to identify whether the voice data is human voice data.
7. The telephony voice interaction method of any of claims 3-5, further comprising, after the generating an end-of-talk event:
timing the duration for which the voice data is not human voice data;
and generating a reminding event when the duration reaches a preset time threshold.
8. A telephony voice interaction device, comprising:
the acquisition module is used for continuously acquiring voice data after the call is connected;
the recognition module is used for carrying out voice and event recognition on the voice data so as to generate a voice text and a process event;
a matching module to match response actions based on the process event;
and the response module is used for matching a voice file based on the voice text and playing it when the response action is to play voice.
9. A telephony voice interaction device, the device comprising:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the telephony voice interaction method of any of claims 1-7.
10. A storage medium containing computer-executable instructions for performing the telephony voice interaction method of any of claims 1-7 when executed by a computer processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210096102.2A CN114420130A (en) | 2022-01-26 | 2022-01-26 | Telephone voice interaction method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210096102.2A CN114420130A (en) | 2022-01-26 | 2022-01-26 | Telephone voice interaction method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114420130A true CN114420130A (en) | 2022-04-29 |
Family
ID=81276832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210096102.2A Pending CN114420130A (en) | 2022-01-26 | 2022-01-26 | Telephone voice interaction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114420130A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114783420A (en) * | 2022-06-22 | 2022-07-22 | 成都博点科技有限公司 | Data processing method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10277740B2 (en) | Live person detection in an automated calling system | |
US7921214B2 (en) | Switching between modalities in a speech application environment extended for interactive text exchanges | |
US11240374B2 (en) | Call processing method and apparatus, server, storage medium, and system | |
WO2021051506A1 (en) | Voice interaction method and apparatus, computer device and storage medium | |
CN110557451B (en) | Dialogue interaction processing method and device, electronic equipment and storage medium | |
CN111131638A (en) | Intelligent outbound voice robot system and outbound method | |
US20100284522A1 (en) | Interactive voice response (ivr) system call interruption handling | |
US20130204607A1 (en) | Voice Detection For Automated Communication System | |
CN109697243A (en) | Ring-back tone clustering method, device, medium and calculating equipment | |
CN111402881B (en) | Intelligent dialogue robot system and method for realizing intelligent dialogue | |
WO2015166391A1 (en) | Voice call diversion to alternate communication method | |
CN112637431A (en) | Voice interaction method and device and computer readable storage medium | |
CN112822337B (en) | Smart phone platform, incoming call method, outgoing call method, device, and storage medium | |
CN114420130A (en) | Telephone voice interaction method, device, equipment and storage medium | |
CN114401252B (en) | Calling method of telephone traffic system and telephone traffic system | |
CN111629110A (en) | Voice interaction method and voice interaction system | |
US20130151248A1 (en) | Apparatus, System, and Method For Distinguishing Voice in a Communication Stream | |
CN111884886B (en) | Intelligent household communication method and system based on telephone | |
EP4027630A1 (en) | Group calling system, group calling method, and program | |
CN114979387A (en) | Network telephone service method, system, equipment and medium based on analog telephone | |
US11037567B2 (en) | Transcription of communications | |
CN113611302A (en) | Voice control method, device, equipment, storage medium and program | |
CN114070935B (en) | Intelligent outbound interruption method and system | |
CN114679515B (en) | Method, device, equipment and storage medium for judging connection time point of outbound system | |
CN116567148A (en) | Intelligent outbound control method, device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||