CN114679515A - Method, device, equipment and storage medium for judging connection time point of outbound system - Google Patents
Method, device, equipment and storage medium for judging connection time point of outbound system
- Publication number
- CN114679515A (application CN202210598979.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- outbound
- signal
- time point
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/2281—Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a method, a device, equipment and a storage medium for judging the connection time point of an intelligent outbound system, belonging to the field of AI (artificial intelligence) telephone outbound calls. Audio data of the outbound robot is acquired in real time during the outbound call; if a sound signal exists, a first audio feature signal is extracted and the probability that it belongs to human speech or environmental background sound is judged. Secondary feature extraction is then performed on the first audio feature signal, the top-n features are spliced with that probability, and the probability of human speech or environmental background sound is judged again; if this probability is greater than a threshold, the current time point is taken as the connection time point, a connected signal is returned to the outbound robot, and the judgment of the connection time point ends. By combining a deep learning model with a traditional machine learning model, the invention reduces dependence on expert knowledge in the audio field, needs no call to ASR technology for an intermediate speech-to-text result, and suits secondary-outbound robot scenarios with extremely high real-time requirements.
Description
Technical Field
The invention relates to the field of AI telephone outbound calls, and in particular to a method, a device, equipment and a storage medium for judging the connection time point of an intelligent outbound system.
Background
At present, attention to personal privacy protection, and to the protection of personal telephone numbers in particular, keeps growing. Intelligent outbound robots applied in many fields therefore often place calls as a secondary outbound call through a virtual extension number. In a normal call, when the receiving party answers, operators such as China Telecom, China Mobile and China Unicom return a connected signal to inform the dialing party that the call has been answered. A secondary outbound scenario is different: the outbound robot first connects to the operator, the operator then performs a secondary outbound call through the virtual extension number, and when the receiver answers, the operator does not return a connected signal.
In a secondary outbound scenario, the audio signal before the connection time point is in one of the following states: 1. silence; 2. beep tone; 3. polyphonic ringback tone; 4. machine alert tone. After the connection time point, the audio signal consists of: 1. human speech; 2. environmental background sound. If the AI outbound robot correctly judges the connection time point of the secondary outbound call, the conversation experience improves greatly and the receiver's waiting time after answering is reduced.
Conventional techniques generally monitor the audio signal in real time after the secondary outbound call, convert it into text with ASR (speech-to-text) technology, and identify the connection moment through a large number of rule decisions on the text data, for example treating the call as connected when the text matches a greeting such as "wei" or "hello"; or they blend in audio features on top of the text data, where the audio features are extracted manually from the audio data, and train a binary-classification machine learning model on the combined text and audio features to judge the connection time point automatically. These methods have the following problems:
(1) The rule-based method must enumerate connection-time utterances very comprehensively so that no important rule is omitted; it adapts poorly to different scenarios and may even misjudge automatic replies, such as the powered-off announcement, as a connection.
(2) Converting speech to text requires calling ASR technology, which introduces long delays, lags the conversation experience, and is costly.
(3) Manually extracted audio features are discrete signals and lack a stage in which the features interact, so the audio information is not fully used; moreover, manual feature extraction depends heavily on expert knowledge and adapts poorly to different application scenarios.
Disclosure of Invention
To solve the long delay, high misjudgment rate and high cost of existing methods for judging the connection time point of an intelligent outbound system, the invention provides a method, a device, equipment and a storage medium for judging the connection time point of an outbound system. By combining a deep learning model with a traditional machine learning model, the scheme reduces dependence on expert knowledge in the audio field; in particular, it needs no call to ASR technology for an intermediate speech-to-text result, and thus meets the real-time requirements of the secondary-outbound robot scenario.
To achieve this purpose, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a method for judging the connection time point of an outbound system, comprising the following steps:
step 1, acquiring audio data of an outbound robot in real time during the outbound call, filtering the audio data and judging whether a sound signal exists; if not, continuing to monitor the audio data; if so, extracting a first audio feature signal from the audio data;
step 2, judging, with a first machine learning model, the probability that the first audio feature signal belongs to human speech or environmental background sound;
step 3, performing secondary feature extraction on the first audio feature signal of step 1 with a Yamnet model, taking the top-n features, and splicing the probability of human speech or environmental background sound obtained in step 2 onto the top-n features to obtain a second audio feature signal;
step 4, judging, with a second machine learning model, the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than a threshold, taking the current time point as the connection time point, returning a connected signal to the outbound robot, ending the judgment of the connection time point for the current call and stopping monitoring the audio data; otherwise, returning to step 1.
In a second aspect, the present invention provides an apparatus for determining a connection time point of an outbound call system, including:
the outbound robot, configured to dial the receiver's virtual extension number; after the operator connects the virtual extension number, a secondary outbound call is placed to the receiver's real mobile phone number;
the audio clip window module, configured to continuously monitor audio data, collect one audio clip every m milliseconds and store it in the test list;
the test list module, configured to store newly monitored audio data, initially empty;
the sound signal judgment module, configured to filter the audio data and judge whether a sound signal exists; if so, it extracts the first audio feature signal from the audio data and transmits it to the first machine learning model module and the Yamnet model module; if not, no further action is taken;
the first machine learning model module, configured to judge the probability that the received first audio feature signal belongs to human speech or environmental background sound and transmit that probability to the Yamnet model module;
the Yamnet model module, configured to perform secondary feature extraction on the received first audio feature signal, take the top-n features, splice the received probability of human speech or environmental background sound onto the top-n features to obtain a second audio feature signal, and transmit it to the second machine learning model module;
and the second machine learning model module, configured to judge the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than the threshold, it takes the current time point as the connection time point, returns a connected signal to the outbound robot, ends the judgment of the connection time point for the current call, and meanwhile signals the audio clip window module to stop monitoring audio data.
In a third aspect, the present invention provides an electronic device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to implement the above method for determining a time point at which the outbound system is turned on when executing the computer program.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for determining a turn-on time point of an outbound system as described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention combines a deep learning model with a traditional machine learning model, reducing dependence on expert knowledge in the audio field at low cost.
(2) The method judges directly from the audio signal whether the call is connected, without the intermediate result of ASR (speech-to-text) technology; in secondary-outbound robot scenarios with extremely high real-time requirements, this greatly improves the outbound robot's response speed and the conversation experience.
(3) The invention involves no text-rule judgment and can greatly reduce the misjudgment rate compared with existing rule-based methods; in particular, no rules need updating as language usage evolves, so the application range is wide.
Drawings
FIG. 1 is a schematic diagram of a secondary-outbound robot scenario according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for judging the connection time point of an outbound system according to an exemplary embodiment;
FIG. 3 is a block diagram of a device for judging the connection time point of an outbound system according to an exemplary embodiment;
FIG. 4 is a diagram of the terminal structure of an electronic device implementing the method for judging the connection time point of an outbound system according to an exemplary embodiment.
Detailed Description
The invention is further illustrated with reference to the following figures and examples. The figures are only schematic illustrations of the invention, some of the block diagrams shown in the figures are functional entities, which do not necessarily have to correspond to physically or logically separate entities, which may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Unless defined otherwise, technical or scientific terms used herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a" and "an" and "the" and similar referents in the context of this application do not denote a limitation of quantity, either in the singular or the plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as referred to in this application, are intended to cover non-exclusive inclusions; reference throughout this application to "connected," "coupled," and the like is not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Fig. 1 shows a typical secondary-outbound robot scenario: the outbound robot cannot know the receiver's real mobile phone number; it dials the virtual extension number and the operator switches the call. When the receiver answers, the operator does not return a connected signal to the outbound robot, and unlike a human caller, the AI outbound robot cannot tell by itself whether the receiver has answered, so it cannot know when the call was connected. The invention is proposed against this background: with the method of the invention, whether the receiver has answered can be judged quickly and a connected signal fed back to the outbound robot.
For example, in a takeout-delivery scenario, the mobile phone number a user leaves with a merchant is encrypted by the platform to protect the user's privacy. When the order is delivered or about to be delivered, the courier dials the user's virtual extension number, the operator switches the call to the user's real number, and once the user answers, the courier can talk to the user. As intelligent outbound systems develop, an outbound robot may inform the user that the order is delivered or about to be delivered, but the outbound robot cannot, like the courier, judge by itself whether the user has answered the phone.
As shown in Fig. 2, the flow of the method for judging the connection time point of the outbound system is as follows:
s1, judging whether a sound signal exists or not, and extracting a first audio characteristic signal:
acquiring audio data of the outbound robot in the outbound process in real time, filtering the audio data, judging whether an audio signal exists or not, and if not, continuously monitoring the audio data; if yes, extracting a first audio characteristic signal from the audio data.
In this step, while the outbound robot's audio data is acquired in real time, one audio clip is collected every m milliseconds and stored in a test list; when the test list reaches the preset length, all audio data in it are taken out for subsequent processing while monitoring continues and new audio clips keep being stored in the test list.
The value of m is chosen according to the actual situation; preferably 10 ≤ m ≤ 30, i.e. an audio clip is collected every 10 to 30 milliseconds. When the collected audio clips accumulate to, for example, 2000 milliseconds, it is judged whether those 2000 ms of audio are human speech, environmental background sound, or a pre-connection status sound. To guarantee real-time data transmission, this embodiment uses the websocket protocol; and because audio formats are numerous, the PCM audio format is used for transmission to increase the system's robustness.
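As an illustration of this collection loop, here is a minimal Python sketch; the 20 ms clip size, the 2000 ms buffer length and the read_pcm_chunk/process callables are illustrative assumptions, not details fixed by the embodiment.

```python
# Minimal sketch of the audio-clip window described above; chunk size,
# buffer length and the two callables are assumptions for illustration.
CLIP_MS = 20          # m milliseconds per clip (the text suggests 10-30 ms)
TEST_LIST_MS = 2000   # process once ~2000 ms of audio has accumulated
CLIPS_NEEDED = TEST_LIST_MS // CLIP_MS

def monitor(read_pcm_chunk, process):
    """Collect one PCM clip every CLIP_MS ms; hand a full test list downstream."""
    test_list = []                                  # initially empty
    while True:
        test_list.append(read_pcm_chunk(CLIP_MS))   # e.g. bytes from a websocket
        if len(test_list) >= CLIPS_NEEDED:
            segment = b"".join(test_list)           # 2000 ms of raw PCM
            test_list.clear()                       # keep monitoring new clips
            if process(segment):                    # True once connection detected
                break
```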
When judging whether a sound signal exists (a sound signal includes human speech and background sound), this embodiment makes a simple judgment using acoustic knowledge such as the audio energy value: if the energy is zero, no sound signal exists, the subsequent judgment stops and new audio data keeps being monitored; if the energy is greater than zero, the audio may belong to the connected state, to an automatic reply in the powered-off state (e.g. "Hello, the number you dialed is powered off"), or to other possibilities, and the first audio feature signal is then extracted from the audio data for subsequent processing.
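The zero-energy check can be sketched as follows, assuming 16-bit little-endian mono PCM input; the sample format is an assumption, since the embodiment only specifies the PCM family.

```python
import numpy as np

def has_sound_signal(pcm_bytes: bytes) -> bool:
    """Energy-based check from step S1: zero energy means no sound signal.
    Assumes 16-bit little-endian mono PCM (an assumption, not fixed above)."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float64)
    energy = float(np.sum(samples ** 2))
    return energy > 0.0
```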
S2, first judgment of whether the call is connected:
The first machine learning model is used to judge the probability that the first audio feature signal belongs to human speech or environmental background sound.
In this step, the first machine learning model performs binary classification: the audio either belongs to human speech or environmental background sound, or it does not.
Before the invention is put into use, the CNN convolutional neural network model must be trained. A large amount of audio data is collected as training samples, including audio generated in the connected state, audio generated in the powered-off state, audio generated in the unanswered state, and the like. Audio generated in the connected state serves as positive samples and the rest as negative samples; the first audio feature signal is extracted from every positive and negative sample with the method of step S1, and these first audio feature signals are used for supervised training of the CNN model.
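A hedged sketch of such a CNN binary classifier follows; the embodiment fixes neither the architecture nor the input dimensionality, so the layer sizes and the 768-dimensional frame features below (matching a wav2vec-style encoder, itself an assumption) are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the first (CNN) classifier from step S2; layer sizes and the
# 768-dim frame-feature input are assumptions, not part of the patent text.
class FirstClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time frames
            nn.Flatten(),
            nn.Linear(64, 1),          # binary: speech/background vs. not
        )

    def forward(self, x):              # x: (batch, feat_dim, frames)
        return torch.sigmoid(self.net(x)).squeeze(-1)  # probability in [0, 1]

model = FirstClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
# Supervised training over positive (connected-state) / negative samples:
# for features, labels in train_loader:   # labels: 1 = connected-state audio
#     opt.zero_grad()
#     loss = loss_fn(model(features), labels.float())
#     loss.backward(); opt.step()
```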
In actual use, whatever the first machine learning model decides, the probability that the audio belongs to human speech or environmental background sound is recorded as an auxiliary signal. For example, if the judgment yields a probability of 0.3 of belonging to human speech or environmental background sound and 0.7 of not belonging, then 0.3 is used as the auxiliary signal in subsequent processing.
S3, extracting the second audio feature signal:
The Yamnet model performs secondary feature extraction on the first audio feature signal of S1; the top-n features are taken, and the probability of human speech or environmental background sound obtained in S2 is spliced onto the top-n features to obtain the second audio feature signal.
In this step, the Yamnet model processes the first audio feature signal to obtain a 521-dimensional feature vector; the top-100 probability values are extracted from this final 521-dimensional vector of the Yamnet model, and the probability of human speech or environmental background sound obtained in step S2 is appended, giving a 101-dimensional feature vector as the second audio feature signal.
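A minimal sketch of this step, assuming the publicly released YAMNet checkpoint on TensorFlow Hub; whether the embodiment uses that exact checkpoint, or how it pools frame-level scores, is not stated, so the mean pooling below is an assumption.

```python
import numpy as np
import tensorflow_hub as hub

# Step S3 sketch: run YAMNet, keep the 100 highest of its 521 class scores,
# and append the step-S2 probability to form the 101-dim second feature.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def second_feature(waveform: np.ndarray, p_first: float) -> np.ndarray:
    # waveform: mono float32 samples at 16 kHz, as the hub model expects
    scores, _, _ = yamnet(waveform)             # scores: (frames, 521)
    clip_scores = scores.numpy().mean(axis=0)   # pool frames -> 521-dim vector
    top100 = np.sort(clip_scores)[::-1][:100]   # top-100 probability values
    return np.concatenate([top100, [p_first]])  # 101-dim second audio feature
```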
S4, second judgment of whether the call is connected:
The second machine learning model is used to judge the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than the threshold, the current time point is taken as the connection time point, a connected signal is returned to the outbound robot, the judgment of the connection time point for the current call ends, and monitoring of the audio data stops; otherwise, the flow returns to step S1.
In this step, the second machine learning model performs binary classification: the audio either belongs to human speech or environmental background sound, or it does not.
Before the invention is put into use, the LightGBM model must be trained. A large amount of audio data is collected as training samples, including audio generated in the connected state, the powered-off state, the unanswered state, and the like; connected-state audio serves as positive samples and the rest as negative samples. The second audio feature signal is extracted from every positive and negative sample with the method of step S3. Because extracting the second audio feature signal requires the judgment result of the first machine learning model, the first machine learning model must be trained before the second. The second audio feature signals of the positive and negative samples are then used for supervised training of the LightGBM model.
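A hedged sketch of training and applying the LightGBM classifier follows; the hyperparameters and the randomly generated stand-in data are assumptions, since the embodiment names only the model, the 101-dimensional input and an example threshold of 0.5.

```python
import numpy as np
import lightgbm as lgb

# Step S4 sketch; random stand-in data replaces the real second audio
# features, and the hyperparameters are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.random((1000, 101))             # stand-in for 101-dim second features
y = rng.integers(0, 2, 1000)            # 1 = connected-state audio (positive)

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X, y)

THRESHOLD = 0.5                         # example threshold from the text
p = clf.predict_proba(X[:1])[0, 1]      # probability of "connected"
connected = p > THRESHOLD               # the step-S4 decision rule
```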
In practical use, the second machine learning model judges the probability that the second audio feature signal belongs to human speech or environmental background sound and compares it with a threshold, e.g. 0.5. When the probability exceeds the threshold, the current time point is taken as the connection time point, a connected signal is returned to the outbound robot, the judgment of the connection time point for the current call ends, and monitoring of the audio data stops.
The current time point is the time corresponding to the most recently collected audio clip in the audio data from which the latest second audio feature signal was derived; it is slightly earlier than the time at which the connected signal is returned to the outbound robot.
In one embodiment of the invention, to extract more reasonable audio features, the wav2vec pre-trained model is used to extract the first audio feature signal from the audio data; wav2vec is an unsupervised pre-training algorithm that generates speech representations.
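A minimal sketch of this extraction, assuming the Hugging Face wav2vec 2.0 base checkpoint as a stand-in; the embodiment names wav2vec but no specific checkpoint or framework, so both choices below are assumptions.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Sketch of step S1's feature extraction; wav2vec2-base stands in for the
# unnamed wav2vec-style model of the embodiment (an assumption).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def first_feature(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.squeeze(0)                        # first audio feature signal
```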
In a specific implementation of the invention, the Yamnet model is a pre-trained deep network that, based on the AudioSet-YouTube corpus, predicts scores for the audio data over 521 audio event classes, yielding a 521-dimensional feature vector. In this embodiment the top-100 prediction result, i.e. the 100 highest-scoring classes and their scores, is used for subsequent processing.
In summary, the embodiments combine a deep learning model with a traditional machine learning model, reducing dependence on expert knowledge in the audio field; in particular, no call to ASR technology is needed for an intermediate speech-to-text result, which meets the real-time requirements of the secondary-outbound robot scenario.
In this embodiment, a device for judging the connection time point of an outbound system is also provided. The device implements the foregoing embodiments, and what has already been described is not repeated. The terms "module", "unit", "subunit" and the like used below may denote a combination of software and/or hardware realizing a predetermined function. Although the means described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible.
Fig. 3 is a block diagram of the device for judging the connection time point of the outbound system according to this embodiment; the device includes:
the outbound robot, configured to dial the receiver's virtual extension number; after the operator connects the virtual extension number, a secondary outbound call is placed to the receiver's real mobile phone number;
the audio clip window module, configured to continuously monitor audio data, collect one audio clip every m milliseconds and store it in the test list;
the test list module, configured to store newly monitored audio data, initially empty;
the sound signal judgment module, configured to filter the audio data and judge whether a sound signal exists; if so, it extracts the first audio feature signal from the audio data and transmits it to the first machine learning model module and the Yamnet model module; if not, no further action is taken;
the first machine learning model module, configured to judge the probability that the received first audio feature signal belongs to human speech or environmental background sound and transmit that probability to the Yamnet model module;
the Yamnet model module, configured to perform secondary feature extraction on the received first audio feature signal, take the top-n features, splice the received probability of human speech or environmental background sound onto the top-n features to obtain a second audio feature signal, and transmit it to the second machine learning model module;
and the second machine learning model module, configured to judge the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than the threshold, it takes the current time point as the connection time point, returns a connected signal to the outbound robot, ends the judgment of the connection time point for the current call, and meanwhile signals the audio clip window module to stop monitoring audio data.
In some embodiments, the audio clip window module collects one audio clip every m milliseconds by setting the window size: with the window size set to m milliseconds and the end of the window aligned with the audio at the current time t, the window [A(t-m+1), A(t-m+2), ..., A(t)] is taken as the audio clip collected at the current moment; collecting such clips every m milliseconds yields continuous audio data.
In some embodiments, the audio data stored in the test list module are arranged by collection time, and the list is initially empty. When the stored audio reaches the preset length, all of it is taken out, the test list module is emptied, and the next round of data collection proceeds; the emptying does not affect the next round of collection.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Referring to fig. 4, there is also provided in the present embodiment an electronic device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to implement the above method for determining a time point at which the outbound system is turned on when executing the computer program.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring audio data of the outbound robot in real time during the outbound call, filtering the audio data and judging whether a sound signal exists; if not, continuing to monitor the audio data; if so, extracting a first audio feature signal from the audio data with a wav2vec pre-trained model;
S2, judging, with the CNN convolutional neural network model, the probability that the first audio feature signal belongs to human speech or environmental background sound;
S3, performing secondary feature extraction on the first audio feature signal of S1 with the Yamnet model, taking the top-100 features, and splicing the probability of human speech or environmental background sound obtained in S2 onto the top-100 features to obtain a second audio feature signal;
S4, judging, with the LightGBM model, the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than the threshold, taking the current time point as the connection time point, returning a connected signal to the outbound robot, ending the judgment of the connection time point for the current call and stopping monitoring the audio data.
It is obvious that the drawings are only examples or embodiments of the present application, and it is obvious to those skilled in the art that the present application can be applied to other similar cases according to the drawings without creative efforts. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
The embodiment of the device for judging the connection time point of the outbound system of the present invention can be applied to any apparatus with data processing capability, such as a computer. The device embodiments may be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, as a logical device it is formed by the processor of the apparatus reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, Fig. 4 shows a hardware structure diagram provided by this embodiment: besides the processor, memory, network interface and non-volatile memory shown in Fig. 4, the apparatus in which the device resides may also include other hardware according to its actual function, which is not detailed here.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the method for judging the connection time point of the outbound system.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
In this embodiment, a test was carried out in a real secondary outbound scenario; the test process is as follows:
The applicant cooperated with a merchant to test the outbound system using the method provided by this application; 569 effective outbound calls were tested in total. During the outbound calls, the real connection time point and the connection time point judged by the outbound system were recorded in real time. A prediction is considered correct when the connection time point judged by the outbound system falls within a ±100-millisecond offset interval around the real time point, and wrong otherwise; this is expressed as: accuracy = number of correct predictions / total number of calls in the test set.
In addition, real-time performance was recorded from the real connection time point and the connection time point judged by the outbound system, as follows: using the call audio file after dial-out, if, for example, the real connection time point is 8 seconds after dial-out and the judged connection time point is 8.5 seconds after dial-out, the real-time performance is the difference between the two, i.e. 0.5 second.
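The two evaluation indicators can be computed as in the minimal sketch below; the list-based call-log format and the function name are illustrative assumptions, not part of the test setup described above.

```python
def evaluate(true_ms, predicted_ms, tolerance_ms=100):
    """Accuracy within +/-100 ms of the true connection time, plus mean lag.
    true_ms / predicted_ms: per-call connection times in milliseconds,
    measured from dial-out (the log format is an assumption)."""
    correct = sum(abs(p - t) <= tolerance_ms
                  for t, p in zip(true_ms, predicted_ms))
    accuracy = correct / len(true_ms)      # correct predictions / total calls
    mean_lag = sum(p - t for t, p in zip(true_ms, predicted_ms)) / len(true_ms)
    return accuracy, mean_lag              # e.g. 8.5 s - 8 s -> 0.5 s lag
```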
In this embodiment, the connection time point judgment method proposed in this application is compared with the rule-based judgment method, using accuracy and real-time performance as indicators; the test results are shown in Table 1.
Table 1:
as can be seen from the test results, the method of the application is superior to a rule-based method in both accuracy and real-time. According to the method, whether the audio signal data are connected or not is judged directly, an intermediate result of the asr technology is not needed, the response speed of the outbound robot can be greatly increased in a secondary outbound robot scene with extremely high real-time requirements, and the conversation experience is improved.
The above embodiments express only several implementations of the present application and are described in relative detail, but they should not be construed as limiting the scope of patent protection. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope shall be subject to the appended claims.
Claims (10)
1. A method for judging the connection time point of an outbound system, characterized by comprising the following steps:
step 1, acquiring audio data of an outbound robot in real time during the outbound call, filtering the audio data and judging whether a sound signal exists; if not, continuing to monitor the audio data; if so, extracting a first audio feature signal from the audio data;
step 2, judging, with a first machine learning model, the probability that the first audio feature signal belongs to human speech or environmental background sound;
step 3, performing secondary feature extraction on the first audio feature signal of step 1 with a Yamnet model, taking the top-n features, and splicing the probability of human speech or environmental background sound obtained in step 2 onto the top-n features to obtain a second audio feature signal;
step 4, judging, with a second machine learning model, the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than a threshold, taking the current time point as the connection time point, returning a connected signal to the outbound robot, ending the judgment of the connection time point for the current call and stopping monitoring the audio data; otherwise, returning to step 1.
2. The method for judging the connection time point of an outbound system according to claim 1, wherein, when the audio data of the outbound robot is acquired in real time during the outbound call, one audio clip is collected every m milliseconds and stored in a test list; when the length of the test list meets the preset length requirement, all audio data in the test list are taken out for subsequent processing while monitoring continues and new audio clips keep being stored in the test list.
3. The method for judging the connection time point of an outbound system according to claim 2, wherein m satisfies 10 ≤ m ≤ 30.
4. The method for judging the connection time point of an outbound system according to claim 1, wherein step 1 judges whether a sound signal exists from the energy value of the audio data, the sound signal comprising human speech and background sound.
5. The method for judging the connection time point of an outbound system according to claim 1, wherein step 1 extracts the first audio feature signal from the audio data with a wav2vec pre-trained model.
6. The method for judging the connection time point of an outbound system according to claim 1, wherein, when the Yamnet model performs secondary feature extraction on the first audio feature signal of step 1, a 521-dimensional feature vector is obtained and the top-100 features are retained for subsequent processing.
7. The method of claim 1, wherein the first machine learning model is a CNN convolutional neural network model, and the second machine learning model is a LightGBM model.
8. An apparatus for judging the connection time point of an outbound system, characterized by comprising:
the outbound robot, configured to dial the receiver's virtual extension number; after the operator connects the virtual extension number, a secondary outbound call is placed to the receiver's real mobile phone number;
the audio clip window module, configured to continuously monitor audio data, collect one audio clip every m milliseconds and store it in the test list;
the test list module, configured to store newly monitored audio data, initially empty;
the sound signal judgment module, configured to filter the audio data and judge whether a sound signal exists; if so, it extracts the first audio feature signal from the audio data and transmits it to the first machine learning model module and the Yamnet model module; if not, no further action is taken;
the first machine learning model module, configured to judge the probability that the received first audio feature signal belongs to human speech or environmental background sound and transmit that probability to the Yamnet model module;
the Yamnet model module, configured to perform secondary feature extraction on the received first audio feature signal, take the top-n features, splice the received probability of human speech or environmental background sound onto the top-n features to obtain a second audio feature signal, and transmit it to the second machine learning model module;
and the second machine learning model module, configured to judge the probability that the second audio feature signal belongs to human speech or environmental background sound; if the probability is greater than the threshold, it takes the current time point as the connection time point, returns a connected signal to the outbound robot, ends the judgment of the connection time point for the current call, and meanwhile signals the audio clip window module to stop monitoring audio data.
9. An electronic device comprising a memory and a processor;
the memory for storing a computer program;
the processor, when executing the computer program, is configured to implement a method for determining a point in time at which an outbound system is turned on according to any of claims 1-7.
10. A computer-readable storage medium, on which a program is stored, which program, when being executed by a processor, is adapted to carry out a method for determining a point in time of activation of an outbound system according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210598979.1A CN114679515B (en) | 2022-05-30 | 2022-05-30 | Method, device, equipment and storage medium for judging connection time point of outbound system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210598979.1A CN114679515B (en) | 2022-05-30 | 2022-05-30 | Method, device, equipment and storage medium for judging connection time point of outbound system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114679515A true CN114679515A (en) | 2022-06-28 |
CN114679515B CN114679515B (en) | 2022-08-30 |
Family
ID=82081197
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210598979.1A Active CN114679515B (en) | 2022-05-30 | 2022-05-30 | Method, device, equipment and storage medium for judging connection time point of outbound system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114679515B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106504768A (en) * | 2016-10-21 | 2017-03-15 | 百度在线网络技术(北京)有限公司 | Phone testing audio frequency classification method and device based on artificial intelligence |
CN106686191A (en) * | 2015-11-06 | 2017-05-17 | 北京奇虎科技有限公司 | Processing method for adaptively identifying harassing call and processing system thereof |
CN109086264A (en) * | 2017-06-14 | 2018-12-25 | 松下知识产权经营株式会社 | It speaks and continues determination method, speaks and continue decision maker and recording medium |
US10277745B1 (en) * | 2017-05-30 | 2019-04-30 | Noble Systems Corporation | Answering machine detection for a contact center |
CN110010121A (en) * | 2019-03-08 | 2019-07-12 | 平安科技(深圳)有限公司 | Verify method, apparatus, computer equipment and the storage medium of the art that should answer |
CN110266896A (en) * | 2019-05-15 | 2019-09-20 | 平安科技(深圳)有限公司 | Method of calling, device, computer equipment and storage medium based on virtual-number |
CN111508527A (en) * | 2020-04-17 | 2020-08-07 | 北京帝派智能科技有限公司 | Telephone answering state detection method, device and server |
CN112399019A (en) * | 2020-09-16 | 2021-02-23 | 中国农业银行股份有限公司河北省分行 | Intelligent outbound call method, terminal equipment and readable storage medium |
US20210074316A1 (en) * | 2019-09-09 | 2021-03-11 | Apple Inc. | Spatially informed audio signal processing for user speech |
CN112750465A (en) * | 2020-12-29 | 2021-05-04 | 昆山杜克大学 | Cloud language ability evaluation system and wearable recording terminal |
JP2021078012A (en) * | 2019-11-08 | 2021-05-20 | 株式会社ハロー | Answering machine determination device, method and program |
CN114418733A (en) * | 2021-12-16 | 2022-04-29 | 上海浦东发展银行股份有限公司 | Off-hook optimal time point prediction method and system based on feedforward neural network |
-
2022
- 2022-05-30 CN CN202210598979.1A patent/CN114679515B/en active Active
Non-Patent Citations (2)
Title |
---|
A. Rajagopal; Nirmala Vedamanickam: "New Approach to Human AI Interaction to Address Digital Divide & AI Divide: Creating an Interactive AI Platform to Connect Teachers & Students", 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT) *
Bai Xue: "Design and Implementation of an Intelligent Call Center", China Masters' Theses Full-text Database *
Also Published As
Publication number | Publication date |
---|---|
CN114679515B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102237087B (en) | Voice control method and voice control device | |
CN107995360A (en) | Call handling method and Related product | |
CN109559754B (en) | Voice rescue method and system for tumble identification | |
CN102404462A (en) | Call progress analysis method and device for telephone dialing system | |
CN104599675A (en) | Speech processing method, device and terminal | |
CN110062097B (en) | Crank call processing method and device, mobile terminal and storage medium | |
JPH1063293A (en) | Telephone voice recognition device | |
US20030088403A1 (en) | Call classification by automatic recognition of speech | |
CN101902517B (en) | Communication terminal and incoming call answering method | |
CN114679515B (en) | Method, device, equipment and storage medium for judging connection time point of outbound system | |
US20030083875A1 (en) | Unified call classifier for processing speech and tones as a single information stream | |
US5692040A (en) | Method of and apparatus for exchanging compatible universal identification telephone protocols over a public switched telephone network | |
CN101340681A (en) | Implementing method and apparatus for shielding DTMF sound of outer telephone | |
CN108540680A (en) | Switching method and device of speaking state and conversation system | |
CN208386657U (en) | Can the change of voice phone | |
CN208656882U (en) | Call center's traffic administration system | |
CN107357859A (en) | A kind of intelligent terminal for realizing that knowledge base shows automatically by voice collecting | |
CN109584877B (en) | Voice interaction control method and device | |
CN104717346A (en) | Call hanging-up method and device | |
CN110232919A (en) | Real-time voice stream extracts and speech recognition system and method | |
US20030081756A1 (en) | Multi-detector call classifier | |
CN103516865A (en) | Photographing system and photographing method | |
CN103929532A (en) | Information processing method and electronic equipment | |
CN108551514A (en) | A kind of telephone device of complete acoustic control | |
CN114420130A (en) | Telephone voice interaction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |