CN113327621A - Model training method, user identification method, system, device and medium
- Publication number: CN113327621A (application number CN202110641691.3A)
- Authority: CN (China)
- Prior art keywords: audio data, voiceprint, module, loss function, feature vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
- G10L17/18: Speaker identification or verification techniques; artificial neural networks, connectionist approaches
- G06F16/68: Information retrieval of audio data; retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06N3/045: Computing arrangements based on neural network models; combinations of networks
Abstract
The invention discloses a model training method, a user identification method, a system, a device and a medium. The model training method comprises the following steps: acquiring a plurality of audio data in a training sample; setting a loss function to be used in training a neural network model; and performing model training on the neural network model based on each audio data and the loss function to obtain a voiceprint recognition model, wherein the loss function comprises a first loss function and a second loss function. By training the neural network model on the acquired audio data with the combined loss function, the method improves the discrimination and performance of the voiceprint recognition model, avoids the problem that the model is difficult to converge, and thereby improves the accuracy and security with which the voiceprint recognition model identifies the identity of an incoming line user.
Description
Technical Field
The invention relates to the technical field of voiceprint recognition, and in particular to a model training method, a user identification method, a system, a device and a medium.
Background
With the vigorous development of contactless multi-modal technology, multi-modal fused verification is gradually becoming a trend. Identification by a single technology has strong limitations: in some scenarios its accuracy cannot meet commercial requirements, and relying on a single identification technology alone carries vulnerabilities and security risks, whereas a product form that fuses voiceprint recognition with other modalities can markedly improve the user experience. Recently, deepfake fraud (voice or video generated by artificial intelligence) has been on the rise, and voiceprint recognition is gradually becoming the focus of audio and video authentication technology: imitated or synthesized speech easily deceives the human ear but has difficulty deceiving a voiceprint recognition system, so voiceprint recognition technology discriminates well against spoofing. Due to the particularity of each scenario, however, a general-purpose voiceprint recognition system cannot meet the service requirements.
At present, voiceprint-based systems in the OTA (Online Travel Agency) scenario still lack a voiceprint recognition framework with high real-time performance for verifying user identity; moreover, because the number of users is large and the recording environment is complex, building a voiceprint library of tens of millions of entries is extremely difficult.
In the OTA industry, after a user finishes placing a hotel order, there are cases of a "stranger" checking in or modifying the hotel order information, behavior that seriously harms the information security of the user and the interests of the OTA platform. The OTA industry requires huge call-center support and inevitably generates a large volume of call recordings every day. These recordings involve both the guest end and the customer-service end; the customer-service environment is relatively uniform, but the guest environment is extremely complex and changes dynamically, which poses a great challenge to a voiceprint recognition system. Because existing models are difficult to converge and have low discrimination and poor performance, they identify the identity of an incoming line user with low accuracy and poor security.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, in which models that are difficult to converge and have low discrimination and poor performance identify the identity of an incoming line user with low accuracy and poor security, and provides a model training method, a user identification method, a system, a device and a medium.
The invention solves the technical problems through the following technical scheme:
the invention provides a model training method in a first aspect, which comprises the following steps:
obtaining a training sample, wherein the training sample comprises a plurality of audio data;
setting a loss function used in the training of the neural network model;
performing model training on the neural network model based on each audio data and the loss function to obtain a voiceprint recognition model;
The loss function comprises a first loss function and a second loss function; the first loss function is a loss function that classifies based on feature angles, and the second loss function is a loss function that discriminates between-class and within-class distances.
Preferably, the expression of the loss function is $L_{total} = \alpha L_{AAM} + \beta L_{cos}$, where $L_{total}$ denotes the total loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, $\beta$ denotes the weight of $L_{cos}$, $N$ denotes the number of training samples, $s$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation margin, $i$ indexes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
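The patent text gives only the combination formula and the symbol list; the full expressions of the two component losses are not reproduced. For reference, a sketch of the standard published forms these symbols correspond to is given below, assuming AAM-softmax takes the ArcFace-style additive angular margin form and cos-softmax takes the CosFace-style additive cosine margin form (whether the patent's cos-softmax carries its own margin $m$ is an assumption):

```latex
% Assumed standard forms of the two component losses:
L_{AAM} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{s\cos(\theta_{y_i}+m)}}
           {e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}
\qquad
L_{cos} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{s(\cos\theta_{y_i}-m)}}
           {e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}
```

Both forms push the angle $\theta_{y_i}$ between a sample and its own class toward zero while the margin $m$ enlarges the decision boundary, which matches the stated goals of maximizing the angular classification boundary and the between-class distance.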
Preferably, the neural network model comprises 3 convolutional layers, 9 residual link layers, 3 SE-block (squeeze-and-excitation) layers, an attention pooling layer and an affine layer; each convolutional layer is combined with 3 residual link layers and 1 SE-block layer to form a 15-layer neural network.
Preferably, after the step of obtaining the training samples, the model training method further includes:
performing data enhancement processing on each audio data;
performing noise removal processing on each piece of data-enhanced audio data to obtain the denoised audio data;
the step of performing model training on the neural network model based on each piece of audio data and the loss function to obtain a voiceprint recognition model specifically includes:
performing model training on the neural network model based on each piece of denoised audio data and the loss function to obtain a voiceprint recognition model.
The invention provides a model training system in a second aspect, which comprises a first acquisition module, a setting module and a training module;
the first obtaining module is used for obtaining a training sample, and the training sample comprises a plurality of audio data;
the setting module is used for setting a loss function used in the training of the neural network model;
the training module is used for carrying out model training on the neural network model based on each audio data and the loss function so as to obtain a voiceprint recognition model;
The loss function comprises a first loss function and a second loss function; the first loss function is a loss function that classifies based on feature angles, and the second loss function is a loss function that discriminates between-class and within-class distances.
Preferably, the expression of the loss function is $L_{total} = \alpha L_{AAM} + \beta L_{cos}$, where $L_{total}$ denotes the total loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, $\beta$ denotes the weight of $L_{cos}$, $N$ denotes the number of training samples, $s$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation margin, $i$ indexes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
Preferably, the neural network model comprises 3 convolutional layers, 9 residual link layers, 3 SE-block layers, an attention pooling layer and an affine layer; each convolutional layer is combined with 3 residual link layers and 1 SE-block layer to form a 15-layer neural network.
Preferably, the model training system further comprises a first processing module and a second processing module;
the first processing module is used for performing data enhancement processing on each audio data;
the second processing module is used for carrying out noise removal processing on the audio data after each data enhancement to obtain each audio data after noise removal;
the training module is specifically configured to perform model training on the neural network model based on each of the denoised audio data and the loss function to obtain a voiceprint recognition model.
A third aspect of the present invention provides a user identification method, including:
acquiring audio data of an incoming line user and identification information corresponding to the audio data;
extracting voiceprint features of the audio data;
inputting the voiceprint features into a voiceprint recognition model to obtain a first voiceprint feature vector, wherein the voiceprint recognition model is obtained by training through the model training method of the first aspect;
acquiring a first historical voiceprint characteristic vector corresponding to the identification information from a voiceprint database;
calculating a first similarity score between the first voiceprint feature vector and the first historical voiceprint feature vector;
and if the first similarity score is larger than a similarity threshold value, determining that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person.
Preferably, before the step of extracting the voiceprint features of the audio data, the user identification method further includes:
performing code conversion on the audio data of the incoming line user to obtain code-converted audio data;
performing audio cutting and noise removal processing on the audio data after code conversion;
splicing the audio data after audio cutting and noise removal;
the step of extracting the voiceprint features of the audio data specifically includes:
and extracting the voiceprint characteristics of the spliced audio data.
Preferably, after the step of determining that the incoming user and the user corresponding to the first historical voiceprint feature vector are the same person if the first similarity score is greater than the similarity threshold, the user identification method further includes:
calculating the audio quality score of the audio data corresponding to the first voiceprint feature vector and the audio data corresponding to the first historical voiceprint feature vector;
updating the voiceprint feature vector with the higher audio quality score into the voiceprint database;
and/or,
the user identification method further comprises the following steps:
if the first similarity score is not greater than the similarity threshold, determining whether a second historical voiceprint feature vector corresponding to the identification information exists in the voiceprint database; if so, calculating a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector, and, when the second similarity score is greater than the similarity threshold, replacing the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector.
The fourth aspect of the present invention provides a user identification system, which includes a second obtaining module, an extracting module, a third obtaining module, a fourth obtaining module, a first calculating module and a determining module;
the second acquisition module is used for acquiring audio data of an incoming line user and identification information corresponding to the audio data;
the extraction module is used for extracting the voiceprint features of the audio data;
the third obtaining module is configured to input the voiceprint features into a voiceprint recognition model to obtain a first voiceprint feature vector, where the voiceprint recognition model is obtained by training with the model training system according to the second aspect;
the fourth obtaining module is used for obtaining a first historical voiceprint feature vector corresponding to the identification information from a voiceprint database;
the first calculation module is used for calculating a first similarity score of the first voiceprint feature vector and the first historical voiceprint feature vector;
the determining module is configured to determine that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person if the first similarity score is greater than a similarity threshold.
Preferably, the user identification system further comprises a conversion module, a third processing module and a splicing module;
the conversion module is used for performing code conversion on the audio data of the incoming line user to obtain the audio data after code conversion;
the third processing module is used for performing audio cutting and noise removal processing on the audio data after code conversion;
the splicing module is used for splicing the audio data after audio cutting and noise removal;
the extraction module is specifically used for extracting the voiceprint features of the spliced audio data.
Preferably, the user identification system further comprises a second calculation module and a storage module;
the second calculation module is used for calculating the audio quality scores of the audio data corresponding to the first voiceprint feature vector and the audio data corresponding to the first historical voiceprint feature vector;
the storage module is used for updating the voiceprint feature vectors with high audio quality scores into the voiceprint database;
and/or,
the user identification system also comprises a judgment module, a third calculation module and a replacement module;
the judging module is used for judging whether a second historical voiceprint characteristic vector corresponding to the identification information exists in the voiceprint database or not if the first similarity score is not larger than the similarity threshold, and calling the third calculating module if the second historical voiceprint characteristic vector exists;
the third calculation module is used for calculating a second similarity score of the first voiceprint feature vector and the second historical voiceprint feature vector;
the replacement module is configured to replace the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector when the second similarity score is greater than the similarity threshold.
A fifth aspect of the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the model training method according to the first aspect or the user recognition method according to the third aspect when executing the computer program.
A sixth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a model training method as described in the first aspect or performs a user recognition method as described in the third aspect.
The positive progress effects of the invention are as follows:
By performing model training on the neural network model with the plurality of audio data in the acquired training samples and the loss function set for that training, the method obtains a voiceprint recognition model. The loss function comprises a first loss function and a second loss function, so the model is trained with the two losses combined; this improves the discrimination and performance of the voiceprint recognition model, avoids the problem that the voiceprint recognition model is difficult to converge, and thereby improves the accuracy and security with which the voiceprint recognition model identifies the identity of an incoming line user.
Drawings
Fig. 1 is a flowchart of a model training method according to embodiment 1 of the present invention.
Fig. 2 is a block diagram of a model training system according to embodiment 2 of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.
Fig. 4 is a flowchart of a user identification method according to embodiment 5 of the present invention.
Fig. 5 is a schematic block diagram of a user identification system according to embodiment 6 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides a model training method, including:
Step 101, obtaining a training sample, wherein the training sample comprises a plurality of audio data.
In the OTA industry, when a guest or a hotel communicates with OTA customer service in real time by telephone, each single-channel audio stream, whether from the guest side or the customer-service side, spans the whole call duration. This embodiment may collect audio data of different users as training samples, and may also collect audio data of both different users and different customer-service agents as training samples.
Step 102, performing data enhancement processing on each piece of audio data.
Step 103, performing noise removal processing on each piece of data-enhanced audio data to obtain the denoised audio data.
For example, in this embodiment 1600 pieces of customer-service audio data and 8000 pieces of guest audio data are collected as training samples, amounting to 1.2 million audio segments. Data enhancement processing, comprising speed change, addition of mechanical noise and reverberation, is performed on the 1.2 million audio segments; after enhancement the data amount to 2.5 million audio segments involving 9600 different speakers in total, so that the training samples are more numerous and richer. Then VAD (voice endpoint detection) is used to perform endpoint detection on the audio data and remove part of the noise, finally yielding the denoised audio data, with which the neural network model is trained in batches.
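A minimal sketch of the three augmentations named above (speed change, additive mechanical noise, reverberation) is shown below; it is illustrative only, the helper names and parameter values are assumptions, and the patent does not disclose its augmentation implementation:

```python
# Illustrative sketch (not the patent's code) of the three named augmentations.
import numpy as np
from scipy.signal import resample, fftconvolve

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed by resampling (factor > 1 speeds up)."""
    return resample(wav, int(len(wav) / factor))

def add_noise(wav: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a (mechanical) noise recording into the speech at a target SNR."""
    noise = np.resize(noise, wav.shape)          # tile/trim noise to match length
    p_sig = np.mean(wav ** 2)
    p_noi = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_sig / (p_noi * 10 ** (snr_db / 10)))
    return wav + scale * noise

def add_reverb(wav: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate room reverberation by convolving with an impulse response."""
    out = fftconvolve(wav, rir)[: len(wav)]
    return out / (np.max(np.abs(out)) + 1e-12)   # renormalize to avoid clipping
```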
Step 104, setting a loss function to be used in training the neural network model.
In this embodiment, the loss function includes a first loss function and a second loss function. The first loss function is a loss function that classifies based on feature angles, whose advantage is that it maximizes the classification boundary in the angular space of the feature representation; the second loss function is a loss function that discriminates between-class and within-class distances, whose effect is that the between-class distance is maximized while the within-class distance is minimized. In this embodiment, the first loss function is specifically AAM-softmax (additive angular margin softmax loss) and the second loss function is specifically cos-softmax (cosine softmax loss); AAM-softmax and cos-softmax are combined to form the loss function used in training the neural network model, so that the model discriminates better both between speakers and within classes.
In this embodiment, the expression of the loss function is $L_{total} = \alpha L_{AAM} + \beta L_{cos}$, where $L_{total}$ denotes the total loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, and $\beta$ denotes the weight of $L_{cos}$, with $\alpha + \beta = 1$ (for example, $\alpha$ may be set to 0.6 and $\beta$ to 0.4). $N$ denotes the number of training samples, $s$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation margin (for example, $m$ may be set to 0.35 to make the neural network model more effective), $i$ indexes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
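As an illustration of how the combined loss could be computed during training, a minimal PyTorch sketch is given below. It assumes the standard ArcFace-style AAM-softmax and CosFace-style cos-softmax forms and reuses the same margin m for both, which the patent does not confirm; the class count of 9600 matches the number of speakers in this embodiment:

```python
# Minimal sketch of L_total = alpha*L_AAM + beta*L_cos (assumed standard forms).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedMarginLoss(nn.Module):
    def __init__(self, emb_dim=512, n_classes=9600, s=30.0, m=0.35,
                 alpha=0.6, beta=0.4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m, self.alpha, self.beta = s, m, alpha, beta

    def forward(self, emb, labels):
        # cosine of the angle between each embedding and each class weight
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cos.size(1)).float()
        # AAM-softmax: additive angular margin on the target class only
        logits_aam = self.s * torch.cos(theta + self.m * one_hot)
        # cos-softmax: additive cosine margin on the target class only
        logits_cos = self.s * (cos - self.m * one_hot)
        l_aam = F.cross_entropy(logits_aam, labels)
        l_cos = F.cross_entropy(logits_cos, labels)
        return self.alpha * l_aam + self.beta * l_cos
```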
Step 105, performing model training on the neural network model based on each piece of denoised audio data and the loss function to obtain a voiceprint recognition model.
In this embodiment, model training is performed on the neural network model based on each audio data and the loss function to obtain a voiceprint recognition model, and specifically, model training is performed on the neural network model based on each denoised audio data and the loss function to obtain the voiceprint recognition model.
In this embodiment, each audio data with noise removed is input into the neural network model, and the loss function is used to perform model training on the neural network model, so as to obtain the voiceprint recognition model.
In this embodiment, the neural network model comprises 3 convolutional layers, 9 residual link layers, 3 SE-block layers, an attention pooling layer and an affine layer; each convolutional layer is combined with 3 residual layers and 1 SE-block layer to form a 15-layer neural network. Specifically, the dimension of the model's input features is set to (400, 101): 101-dimensional spectrogram features are extracted as the input of the neural network model. The network structure is designed as 3 convolutional layers, 9 residual link layers and 3 SE-block layers, with each convolutional layer combined with 3 residual layers and 1 SE-block layer, forming an effective 15-layer design. The convolutional layers effectively extract deep voiceprint features from the audio spectrogram; the residual layers effectively regulate the convergence behavior of the network; and the SE-block layers add channel attention over the audio data, making useful feature expressions more salient. The network ends with an attention pooling layer, which converts the frame-level features of the audio into segment-level features while weighting the importance of each frame of data; an affine layer after it generates a 512-dimensional voiceprint feature vector, which is passed to AAM-softmax and cos-softmax to compute the loss.
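A condensed PyTorch sketch of this 15-layer design follows. Channel widths, kernel sizes and the exact residual and pooling variants are assumptions; the patent fixes only the layer counts, the (400, 101) input features and the 512-dimensional output:

```python
# Illustrative sketch: 3 x (conv + 3 residual + SE) = 15 layers,
# then attention pooling and an affine layer -> 512-dim voiceprint vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):                      # squeeze-and-excitation (channel attention)
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):                      # x: (B, C, T)
        w = self.fc(x.mean(dim=2))             # squeeze over time
        return x * w.unsqueeze(-1)             # re-weight channels

class ResidualLayer(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, 3, padding=1)
        self.bn = nn.BatchNorm1d(ch)
    def forward(self, x):
        return F.relu(x + self.bn(self.conv(x)))

class AttentionPooling(nn.Module):             # frame-level -> segment-level features
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Conv1d(ch, 1, 1)
    def forward(self, x):                      # x: (B, C, T)
        w = torch.softmax(self.score(x), dim=2)
        return (x * w).sum(dim=2)              # weighted sum over frames: (B, C)

class VoiceprintNet(nn.Module):
    def __init__(self, feat_dim=101, ch=256, emb_dim=512):
        super().__init__()
        blocks, in_ch = [], feat_dim
        for _ in range(3):                     # 3 x (1 conv + 3 residual + 1 SE)
            blocks += [nn.Conv1d(in_ch, ch, 5, padding=2), nn.ReLU(),
                       ResidualLayer(ch), ResidualLayer(ch), ResidualLayer(ch),
                       SEBlock(ch)]
            in_ch = ch
        self.backbone = nn.Sequential(*blocks)
        self.pool = AttentionPooling(ch)
        self.affine = nn.Linear(ch, emb_dim)   # 512-dim voiceprint vector
    def forward(self, x):                      # x: (B, 400, 101) spectrogram frames
        h = self.backbone(x.transpose(1, 2))   # -> (B, 101, 400) for Conv1d
        return self.affine(self.pool(h))
```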
In this embodiment, model training is performed on the neural network model with the plurality of audio data in the acquired training samples and the configured loss function, so as to obtain a voiceprint recognition model. The loss function comprises a first loss function and a second loss function, so the model is trained with the two losses combined; this improves the discrimination and performance of the voiceprint recognition model, avoids the problem that the voiceprint recognition model is difficult to converge, and thereby improves the accuracy and security with which the voiceprint recognition model identifies the identity of an incoming line user.
Example 2
As shown in fig. 2, the present embodiment provides a model training system, which includes a first obtaining module 1, a first processing module 2, a second processing module 3, a setting module 4, and a training module 5.
The first obtaining module 1 is configured to obtain a training sample, where the training sample includes a plurality of audio data.
In the OTA industry, when a guest or a hotel communicates with OTA customer service in real time by telephone, each single-channel audio stream, whether from the guest side or the customer-service side, spans the whole call duration. This embodiment may collect audio data of different users as training samples, and may also collect audio data of both different users and different customer-service agents as training samples.
The first processing module 2 is configured to perform data enhancement processing on each piece of audio data.
The second processing module 3 is configured to perform noise removal processing on each data-enhanced audio data to obtain each noise-removed audio data.
For example, in this embodiment 1600 pieces of customer-service audio data and 8000 pieces of guest audio data are collected as training samples, amounting to 1.2 million audio segments. Data enhancement processing, comprising speed change, addition of mechanical noise and reverberation, is performed on the 1.2 million audio segments; after enhancement the data amount to 2.5 million audio segments involving 9600 different speakers in total, so that the training samples are more numerous and richer. Then VAD is used to perform endpoint detection on the audio data and remove part of the noise, finally yielding the denoised audio data, with which the neural network model is trained in batches.
The setting module 4 is used for setting a loss function used in the neural network model training.
In this embodiment, the loss function includes a first loss function and a second loss function. The first loss function is a loss function that classifies based on feature angles, whose advantage is that it maximizes the classification boundary in the angular space of the feature representation; the second loss function is a loss function that discriminates between-class and within-class distances, whose effect is that the between-class distance is maximized while the within-class distance is minimized. In this embodiment, the first loss function is specifically AAM-softmax and the second loss function is specifically cos-softmax; AAM-softmax and cos-softmax are combined to form the loss function used in training the neural network model, so that the model discriminates better both between speakers and within classes.
In this embodiment, the expression of the loss function is $L_{total} = \alpha L_{AAM} + \beta L_{cos}$, where $L_{total}$ denotes the total loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, and $\beta$ denotes the weight of $L_{cos}$, with $\alpha + \beta = 1$ (for example, $\alpha$ may be set to 0.6 and $\beta$ to 0.4). $N$ denotes the number of training samples, $s$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation margin (for example, $m$ may be set to 0.35 to make the neural network model more effective), $i$ indexes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
The training module 5 is specifically configured to perform model training on the neural network model based on each audio data with noise removed and the loss function, so as to obtain a voiceprint recognition model.
In this embodiment, the training module 5 is configured to perform model training on the neural network model based on each audio data and the loss function to obtain a voiceprint recognition model, and specifically, the training module 5 performs model training on the neural network model based on each denoised audio data and the loss function to obtain the voiceprint recognition model.
In this embodiment, each audio data with noise removed is input into the neural network model, and the loss function is used to perform model training on the neural network model, so as to obtain the voiceprint recognition model.
In this embodiment, the neural network model comprises 3 convolutional layers, 9 residual link layers, 3 SE-block layers, an attention pooling layer and an affine layer; each convolutional layer is combined with 3 residual layers and 1 SE-block layer to form a 15-layer neural network. Specifically, the dimension of the model's input features is set to (400, 101): 101-dimensional spectrogram features are extracted as the input of the neural network model. The network structure is designed as 3 convolutional layers, 9 residual link layers and 3 SE-block layers, with each convolutional layer combined with 3 residual layers and 1 SE-block layer, forming an effective 15-layer design. The convolutional layers effectively extract deep voiceprint features from the audio spectrogram; the residual layers effectively regulate the convergence behavior of the network; and the SE-block layers add channel attention over the audio data, making useful feature expressions more salient. The network ends with an attention pooling layer, which converts the frame-level features of the audio into segment-level features while weighting the importance of each frame of data; an affine layer after it generates a 512-dimensional voiceprint feature vector, which is passed to AAM-softmax and cos-softmax to compute the loss.
In this embodiment, model training is performed on the neural network model with the plurality of audio data in the acquired training samples and the configured loss function, so as to obtain a voiceprint recognition model. The loss function comprises a first loss function and a second loss function, so the model is trained with the two losses combined; this improves the discrimination and performance of the voiceprint recognition model, avoids the problem that the voiceprint recognition model is difficult to converge, and thereby improves the accuracy and security with which the voiceprint recognition model identifies the identity of an incoming line user.
Example 3
Fig. 3 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the model training method of embodiment 1 when executing the program. The electronic device 30 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 3, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
The processor 31 executes various functional applications and data processing, such as the model training method of embodiment 1 of the present invention, by executing the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 36. As shown in FIG. 3, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the model training method provided in embodiment 1.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform a method for model training as described in embodiment 1 when the program product is run on the terminal device.
Program code for carrying out the invention may be written in any combination of one or more programming languages, and the program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.
Example 5
As shown in fig. 4, the present embodiment provides a user identification method, including:
Step 201, acquiring audio data of an incoming line user and identification information corresponding to the audio data.
Call recordings in the OTA scenario are dual-channel, so the audio needs to be separated: the sound is split into two channels, one storing the customer-service recording and the other storing the customer recording, so that the recordings of the different channels can be processed. Audio separation is performed according to the interleaved storage mode of the recording, and the recordings of the user side and the customer-service side are collected separately.
In this embodiment, audio data transmitted in real time over the telephone channel is received, the audio data is logically checked, and the call recordings of the application scenario are screened out; the logical check includes the recording's attribute encoding mode. The channels are separated to obtain the recordings of the incoming line user and of the customer-service end, the incoming-line state of the call is judged, the voiceprint information is looked up, and different services are invoked for different departments. Specifically, the audio data of the user's call to customer service is acquired first, and the left and right channels of the audio data are then separated to obtain the audio data of the incoming line user. When the call is judged to be in the connected state, the identification information corresponding to the incoming line user's audio data is acquired (the identification information may be a telephone number or another identifier; this embodiment takes the telephone number as an example), and if a first historical voiceprint feature vector corresponding to the telephone number can be found in the voiceprint database, the department to which the telephone number belongs is determined.
Steps 202 and 203, performing code conversion on the audio data of the incoming line user to obtain code-converted audio data, and performing audio cutting and noise removal processing on the code-converted audio data. In this embodiment, the code conversion specifically converts the encoded audio data into digital audio data, so as to improve the efficiency of identifying the incoming line user.
Step 204, splicing the audio data after audio cutting and noise removal.
In this embodiment, audio transcoding and VAD operations are performed on the received audio data: after the code conversion, audio cutting and noise removal processing is performed on the audio data, and the cut, denoised audio segments are then spliced back together.
In this embodiment, VAD is used to perform endpoint detection on the audio data. The detection combines two VAD passes with TDNN (time-delay neural network) deep-learning endpoint detection: the first pass locates the positions of the endpoints within a call; the second pass applies the TDNN to consecutive frames, whereby if any one of N consecutive frames of audio exceeds an experimentally determined threshold (the threshold is set according to the actual situation), all N frames are regarded as valid frames, and if the energy values of N consecutive frames are all below the threshold, the N frames are regarded as invalid frames and removed.
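A minimal sketch of the frame-energy decision rule described above follows: within a window of N consecutive frames, all N frames are kept if any frame exceeds the threshold and dropped if all fall below it. The frame length, N and the threshold are assumptions (the patent leaves the threshold to experiment), and the TDNN scoring pass is replaced here by raw frame energy:

```python
# Illustrative sketch of the N-consecutive-frame valid/invalid rule.
import numpy as np

def energy_vad(wav: np.ndarray, frame_len: int = 400, n: int = 10,
               threshold: float = 1e-3) -> np.ndarray:
    n_frames = len(wav) // frame_len
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    keep = np.zeros(n_frames, dtype=bool)
    for start in range(0, n_frames, n):             # windows of N frames
        window = energy[start:start + n]
        if np.any(window > threshold):              # any frame above threshold
            keep[start:start + len(window)] = True  # keep the whole window
    return frames[keep].reshape(-1)                 # splice the valid frames together
```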
Step 205, extracting the voiceprint features of the spliced audio data.
In this embodiment, the voiceprint features of the audio data are extracted, and specifically, the voiceprint features of the spliced audio data are extracted.
Step 206, inputting the voiceprint features into the voiceprint recognition model to obtain a first voiceprint feature vector. In this embodiment, the first voiceprint feature vector is a 512-dimensional vector; in a specific implementation, the extracted voiceprint features of the spliced audio data are input into the voiceprint recognition model for decoding, and a 512-dimensional first voiceprint feature vector is decoded as the information of that segment of audio.
Step 207, acquiring a first historical voiceprint feature vector corresponding to the identification information from the voiceprint database.
In this embodiment, a first historical voiceprint feature vector corresponding to a telephone number of an incoming line user is obtained from a voiceprint database according to the telephone number.
Step 208, calculating a first similarity score between the first voiceprint feature vector and the first historical voiceprint feature vector. In this embodiment, the first similarity score is calculated with the cosine-similarity algorithm.
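A minimal sketch of the cosine-similarity score between two 512-dimensional voiceprint vectors (the threshold value shown is an assumption; the patent does not disclose it):

```python
# Cosine similarity between the current and the stored voiceprint vector.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# e.g. accept the caller as the same person when the score exceeds a threshold:
# same_person = cosine_similarity(current_vec, historical_vec) > 0.7  # assumed value
```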
Step 209, if the first similarity score is greater than the similarity threshold, determining that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person.
In this embodiment, based on the voiceprint recognition model obtained by training with the combination of AAM-softmax and cos-softmax, an end-to-end training and prediction mode is realized and the manual setting of a threshold is avoided, so that it can be judged directly whether the incoming line user is the user corresponding to the telephone number. This strengthens the customer-service audit of incoming callers, reduces the error rate of OTA customer-service processing, safeguards the security of genuine customers and of the OTA platform, and helps subsequent offline customer-service quality inspection.
In an implementation scenario, as shown in fig. 4, the user identification method further includes:
Step 210, calculating the audio quality scores of the audio data corresponding to the first voiceprint feature vector and of the audio data corresponding to the first historical voiceprint feature vector. In this embodiment, the audio quality score of the audio data is calculated from the signal-to-noise ratio or the amount of amplitude clipping of the audio: the larger the signal-to-noise ratio, the smaller the noise and the higher the audio quality score; and the smaller the clipping, the higher the audio quality score.
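As an illustration, a sketch of a quality score combining the two cues named above (signal-to-noise ratio, amplitude clipping) follows; the weighting, the clipping detector and the assumption that the waveform is normalized to [-1, 1] are not from the patent:

```python
# Illustrative quality score: reward SNR, penalize clipping.
import numpy as np

def quality_score(wav: np.ndarray, noise_floor: np.ndarray) -> float:
    snr_db = 10 * np.log10(np.mean(wav ** 2) / (np.mean(noise_floor ** 2) + 1e-12))
    clip_ratio = np.mean(np.abs(wav) >= 0.999)   # fraction of near-full-scale samples
    return snr_db - 100.0 * clip_ratio           # assumed weighting
```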
Step 211, updating the voiceprint feature vector with the higher audio quality score into the voiceprint database.
In this embodiment, the audio data with poor quality is screened out according to the audio quality score, and the voiceprint feature vector with high audio quality score is updated to the voiceprint database.
In this embodiment, the user identification method further includes: and if the first similarity score is not greater than the similarity threshold, judging whether a second historical voiceprint feature vector corresponding to the identification information exists in the voiceprint database, if so, calculating a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector, and replacing the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector when the second similarity score is greater than the similarity threshold.
In this embodiment, the pre-trained voiceprint recognition model is invoked and it is judged whether a historical voiceprint feature vector (the first historical voiceprint feature vector) corresponding to the incoming line user's telephone number exists in the voiceprint database. If not, the current voiceprint feature vector (the first voiceprint feature vector) for the telephone number is stored in the voiceprint database. If so, the similarity score (the first similarity score) between the historical vector and the current vector is calculated. If the first similarity score is greater than the similarity threshold, the first historical voiceprint feature vector is similar to the first voiceprint feature vector; the audio quality scores of the audio corresponding to the two vectors are then calculated, and the vector with the higher audio quality score is updated into the voiceprint database. If the first similarity score is not greater than the similarity threshold, the two vectors are not similar; it is then judged whether a second historical voiceprint feature vector corresponding to the telephone number exists in the voiceprint database. If it exists, a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector is calculated; when the second similarity score is greater than the similarity threshold, the first voiceprint feature vector is similar to the second historical voiceprint feature vector, and the first historical voiceprint feature vector in the voiceprint database is replaced with the second historical voiceprint feature vector. If the first voiceprint feature vector is not similar to the second historical voiceprint feature vector, updating of the voiceprint database is stopped; and if no second historical voiceprint feature vector exists in the voiceprint database, whether to store the vector is decided according to the result of the customer-service agent's verification of the user's identity. Updating the voiceprint database in this way provides service support for identifying users with the voiceprint recognition model.
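The branching logic of this update flow can be summarized in a short sketch; the database interface (the db methods and record attributes) and the threshold are assumptions, and cosine_similarity and quality_score refer to the illustrative helpers sketched earlier:

```python
# Illustrative sketch of the voiceprint-database update flow described above.
def update_voiceprint_db(db, phone, current_vec, current_audio, threshold=0.7):
    first_hist = db.get_first(phone)          # first historical vector, or None
    if first_hist is None:
        db.store(phone, current_vec)          # no history: enrol the caller
        return
    score1 = cosine_similarity(current_vec, first_hist.vector)
    if score1 > threshold:                    # same person: keep the better audio
        if quality_score(current_audio, db.noise_floor) > first_hist.quality:
            db.store(phone, current_vec)
        return
    second_hist = db.get_second(phone)        # fall back to a second stored vector
    if second_hist is None:
        return                                # defer to manual customer-service check
    score2 = cosine_similarity(current_vec, second_hist.vector)
    if score2 > threshold:                    # promote the second vector
        db.replace_first_with(phone, second_hist)
```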
Example 6
As shown in fig. 5, the present embodiment provides a user identification system, which includes a second obtaining module 6, a converting module 7, a third processing module 8, a splicing module 9, an extracting module 10, a third obtaining module 11, a fourth obtaining module 12, a first calculating module 13, and a determining module 14.
The second obtaining module 6 is configured to obtain the audio data of the incoming line user and identification information corresponding to the audio data.
Call recordings in the OTA scenario are dual-channel, so the audio needs to be separated: the sound is split into two channels, one storing the customer-service recording and the other storing the customer recording, so that the recordings of the different channels can be processed. Audio separation is performed according to the interleaved storage mode of the recording, and the recordings of the user side and the customer-service side are collected separately.
In this embodiment, audio data transmitted in real time over the telephone channel is received, the audio data is logically checked, and the call recordings of the application scenario are screened out; the logical check includes the recording's attribute encoding mode. The channels are separated to obtain the recordings of the incoming line user and of the customer-service end, the incoming-line state of the call is judged, the voiceprint information is looked up, and different services are invoked for different departments. Specifically, the audio data of the user's call to customer service is acquired first, and the left and right channels of the audio data are then separated to obtain the audio data of the incoming line user. When the call is judged to be in the connected state, the identification information corresponding to the incoming line user's audio data is acquired (the identification information may be a telephone number or another identifier; this embodiment takes the telephone number as an example), and if a first historical voiceprint feature vector corresponding to the telephone number can be found in the voiceprint database, the department to which the telephone number belongs is determined.
The conversion module 7 is configured to perform transcoding on the audio data of the incoming subscriber to obtain transcoded audio data.
In this embodiment, the code conversion of the incoming line user's audio data specifically converts the encoded audio data into digital audio data, so as to improve the efficiency of identifying the incoming line user.
The third processing module 8 is configured to perform audio cutting and noise removal processing on the audio data after the code conversion.
The splicing module 9 is used for splicing the audio data after audio cutting and noise removal.
In this embodiment, audio transcoding and VAD operations are performed on the received audio data: after the code conversion, audio cutting and noise removal processing is performed on the audio data, and the cut, denoised audio segments are then spliced back together.
In this embodiment, VAD is used to perform endpoint detection on the audio data. The detection combines two VAD passes with TDNN deep-learning endpoint detection: the first pass locates the positions of the endpoints within a call; the second pass applies the TDNN to consecutive frames, whereby if any one of N consecutive frames of audio exceeds an experimentally determined threshold (the threshold is set according to the actual situation), all N frames are regarded as valid frames, and if the energy values of N consecutive frames are all below the threshold, the N frames are regarded as invalid frames and removed.
The extraction module 10 is specifically configured to extract a voiceprint feature of the spliced audio data.
In this embodiment, the extraction module 10 is configured to extract a voiceprint feature of the audio data, and specifically, the extraction module 10 is specifically configured to extract a voiceprint feature of the spliced audio data.
The third obtaining module 11 is configured to input the voiceprint features into a voiceprint recognition model to obtain a first voiceprint feature vector, where the voiceprint recognition model is obtained by using the model training system in embodiment 2.
In this embodiment, the first voiceprint feature vector is a 512-dimensional vector; in a specific implementation, the extracted voiceprint features of the spliced audio data are input into the voiceprint recognition model for decoding, and a 512-dimensional first voiceprint feature vector is decoded as the information of that segment of audio.
The fourth obtaining module 12 is configured to obtain a first historical voiceprint feature vector corresponding to the identification information from the voiceprint database.
In this embodiment, a first historical voiceprint feature vector corresponding to a telephone number of an incoming line user is obtained from a voiceprint database according to the telephone number.
The first calculating module 13 is configured to calculate a first similarity score between the first voiceprint feature vector and the first historical voiceprint feature vector.
In this embodiment, a first similarity score between the first voiceprint feature vector and the first historical voiceprint feature vector is calculated by a cosine similarity algorithm.
The determining module 14 is configured to determine that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person if the first similarity score is greater than the similarity threshold.
In this embodiment, based on the voiceprint recognition model obtained by training with the combination of AAM-softmax and cos-softmax, an end-to-end training and prediction mode is realized and the manual setting of a threshold is avoided, so that it can be judged directly whether the incoming line user is the user corresponding to the telephone number. This strengthens the customer-service audit of incoming callers, reduces the error rate of OTA customer-service processing, safeguards the security of genuine customers and of the OTA platform, and helps subsequent offline customer-service quality inspection.
In an implementation scenario, as shown in fig. 5, the subscriber identity system further includes a second calculation module 15, a storage module 16, a determination module 17, a third calculation module 18, and a replacement module 19.
The second calculating module 15 is configured to calculate an audio quality score of the audio data corresponding to the first voiceprint feature vector and the audio data corresponding to the first historical voiceprint feature vector.
In this embodiment, the audio quality score of the audio data is calculated from the signal-to-noise ratio or the degree of clipping of the audio: the higher the signal-to-noise ratio, the lower the noise and the higher the audio quality score; and the less the clipping, the higher the audio quality score.
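By way of illustration, one way to score audio quality from the two cues named above (signal-to-noise ratio and clipping) is sketched below; the combination formula, the clip level, and the separate noise estimate are assumptions, as the text does not fix a formula:

```python
import numpy as np

def audio_quality_score(signal: np.ndarray, noise: np.ndarray,
                        clip_level: float = 0.99) -> float:
    """Higher SNR -> higher score; more clipping -> lower score.
    Assumes samples are normalised to [-1, 1]."""
    snr_db = 10.0 * np.log10(np.mean(signal ** 2) / (np.mean(noise ** 2) + 1e-12))
    clip_ratio = float(np.mean(np.abs(signal) >= clip_level))  # fraction of clipped samples
    return snr_db * (1.0 - clip_ratio)
```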
The storage module 16 is configured to update the voiceprint feature vector with the higher audio quality score into the voiceprint database.
The judging module 17 is configured to judge whether there is a second historical voiceprint feature vector corresponding to the identification information in the voiceprint database if the first similarity score is not greater than the similarity threshold, and if so, invoke the third calculating module 18.
The third calculating module 18 is configured to calculate a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector.
The replacing module 19 is configured to replace the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector when the second similarity score is greater than the similarity threshold.
In this embodiment, the pre-trained voiceprint recognition model is called and the voiceprint database is maintained as follows. First, it is judged whether a historical voiceprint feature vector (i.e., the first historical voiceprint feature vector) corresponding to the telephone number of the incoming line user exists in the voiceprint database; if not, the current voiceprint feature vector (i.e., the first voiceprint feature vector) of the telephone number is stored in the voiceprint database. If it exists, the similarity score (i.e., the first similarity score) between the first historical voiceprint feature vector and the first voiceprint feature vector is calculated. If the first similarity score is greater than the similarity threshold, the two vectors are similar; the audio quality scores of the audio data corresponding to the first voiceprint feature vector and to the first historical voiceprint feature vector are then calculated, and the vector with the higher audio quality score is updated into the voiceprint database. If the first similarity score is not greater than the similarity threshold, the two vectors are not similar, and it is judged whether a second historical voiceprint feature vector corresponding to the telephone number exists in the voiceprint database. If it exists, a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector is calculated; if the second similarity score is greater than the similarity threshold, the two vectors are similar, and the first historical voiceprint feature vector in the voiceprint database is replaced with the second historical voiceprint feature vector. If the first voiceprint feature vector is not similar to the second historical voiceprint feature vector, updating of the voiceprint database is stopped. If no second historical voiceprint feature vector exists in the voiceprint database, customer service verifies the user's identity, and whether to store the current voiceprint feature vector is determined according to the verification result. Keeping the voiceprint database updated in this way provides a service guarantee for identifying users based on the voiceprint recognition model.
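For readability, the update flow just described can be summarised in the following sketch; the dict-based store, the hypothetical second-vector slot, and the similarity/quality callables are illustrative stand-ins for the modules above:

```python
def update_voiceprint_db(db: dict, phone: str, current_vec,
                         threshold: float, similarity, quality) -> None:
    """Sketch of the voiceprint-database update flow.
    similarity(a, b) -> float and quality(vec) -> float stand in for the
    cosine-similarity and audio-quality steps described above."""
    first_hist = db.get(phone)
    if first_hist is None:
        db[phone] = current_vec                    # no history: store current vector
        return
    if similarity(current_vec, first_hist) > threshold:
        if quality(current_vec) > quality(first_hist):
            db[phone] = current_vec                # keep the better-quality vector
        return
    second_hist = db.get(phone + ":second")        # hypothetical second-vector slot
    if second_hist is not None and similarity(current_vec, second_hist) > threshold:
        db[phone] = second_hist                    # promote the second history vector
    # otherwise: stop updating, or defer to customer-service identity verification
```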
Example 7
The schematic structural diagram of the electronic device provided in embodiment 7 of the present invention is the same as the structure shown in fig. 3. The electronic device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the user identification method of embodiment 5 when executing the program. The electronic device 30 shown in fig. 3 is only an example and imposes no limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 3, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
The processor 31 executes various functional applications and data processing, such as the user identification method of embodiment 5 of the present invention, by executing the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown in fig. 3, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
Example 8
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the user identification method provided in embodiment 5.
More specific examples that the readable storage medium may employ include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the user identification method described in embodiment 5.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.
Claims (16)
1. A method of model training, comprising:
obtaining a training sample, wherein the training sample comprises a plurality of audio data;
setting a loss function used in the training of the neural network model;
performing model training on the neural network model based on each audio data and the loss function to obtain a voiceprint recognition model;
the loss functions comprise a first loss function and a second loss function, the first loss function is classified based on the characteristic angle, and the second loss function is classified between classes and in-class loss functions.
2. The model training method of claim 1, wherein the loss function is expressed as: $L_{total} = \alpha L_{AAM} + \beta L_{cos}$
where $L_{total}$ denotes the loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, $\beta$ denotes the weight of $L_{cos}$, $N$ denotes the number of training samples, $S$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation distance, $i$ denotes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label $y_i$, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
3. The model training method of claim 1, wherein the neural network model comprises 3 convolutional layers, 9 residual connection layers, 3 SE-block layers, an attention pooling layer, and an affine layer; each convolutional layer is combined with 3 residual connection layers and 1 SE-block layer to form a 15-layer neural network.
4. The model training method of claim 1, wherein after the step of obtaining training samples, the model training method further comprises:
performing data enhancement processing on each audio data;
carrying out noise removal processing on the audio data after each data enhancement to obtain each audio data after noise removal;
the step of performing model training on the neural network model based on each piece of audio data and the loss function to obtain a voiceprint recognition model specifically includes:
and carrying out model training on the neural network model based on each audio data after the noise is removed and the loss function so as to obtain a voiceprint recognition model.
5. A model training system, characterized by comprising a first obtaining module, a setting module, and a training module;
the first obtaining module is used for obtaining a training sample, and the training sample comprises a plurality of audio data;
the setting module is used for setting a loss function used in the training of the neural network model;
the training module is used for carrying out model training on the neural network model based on each audio data and the loss function so as to obtain a voiceprint recognition model;
the loss functions comprise a first loss function and a second loss function, the first loss function is classified based on the characteristic angle, and the second loss function is classified between classes and in-class loss functions.
6. The model training system of claim 5, wherein the loss function is expressed as: $L_{total} = \alpha L_{AAM} + \beta L_{cos}$
where $L_{total}$ denotes the loss function, $L_{AAM}$ denotes the first loss function, $L_{cos}$ denotes the second loss function, $\alpha$ denotes the weight of $L_{AAM}$, $\beta$ denotes the weight of $L_{cos}$, $N$ denotes the number of training samples, $S$ denotes the scaling-factor hyperparameter of the cosine distance, $m$ denotes the separation distance, $i$ denotes the $i$-th training sample, $y_i$ denotes the label corresponding to the $i$-th training sample, $\theta_{y_i}$ denotes the angle between the $i$-th training sample and its label $y_i$, and $\theta_j$ denotes the angle between the $j$-th training sample and the $j$-th label.
7. The model training system of claim 5, wherein the neural network model comprises 3 convolutional layers, 9 residual connection layers, 3 SE-block layers, an attention pooling layer, and an affine layer; each convolutional layer is combined with 3 residual connection layers and 1 SE-block layer to form a 15-layer neural network.
8. The model training system of claim 5, further comprising a first processing module and a second processing module;
the first processing module is used for performing data enhancement processing on each audio data;
the second processing module is used for carrying out noise removal processing on the audio data after each data enhancement to obtain each audio data after noise removal;
the training module is specifically configured to perform model training on the neural network model based on each of the denoised audio data and the loss function to obtain a voiceprint recognition model.
9. A method for identifying a user, comprising:
acquiring audio data of an incoming line user and identification information corresponding to the audio data;
extracting voiceprint features of the audio data;
inputting the voiceprint features into a voiceprint recognition model to obtain a first voiceprint feature vector, wherein the voiceprint recognition model is obtained by training through the model training method of any one of claims 1 to 4;
acquiring a first historical voiceprint characteristic vector corresponding to the identification information from a voiceprint database;
calculating a first similarity score between the first voiceprint feature vector and the first historical voiceprint feature vector;
and if the first similarity score is larger than a similarity threshold value, determining that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person.
10. The user identification method of claim 9, wherein the step of extracting the voiceprint features of the audio data is preceded by the user identification method further comprising:
performing code conversion on the audio data of the incoming line user to obtain code-converted audio data;
performing audio cutting and noise removal processing on the audio data after code conversion;
splicing the audio data after audio cutting and noise removal;
the step of extracting the voiceprint features of the audio data specifically includes:
and extracting the voiceprint characteristics of the spliced audio data.
11. The user identification method of claim 9, wherein after the step of determining that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person if the first similarity score is greater than the similarity threshold, the user identification method further comprises:
calculating the audio quality score of the audio data corresponding to the first voiceprint feature vector and the audio data corresponding to the first historical voiceprint feature vector;
updating the voiceprint feature vector with the higher audio quality score into the voiceprint database;
and/or,
the user identification method further comprises the following steps:
if the first similarity score is not larger than the similarity threshold, judging whether a second historical voiceprint feature vector corresponding to the identification information exists in the voiceprint database, if so, calculating a second similarity score between the first voiceprint feature vector and the second historical voiceprint feature vector, and replacing the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector when the second similarity score is larger than the similarity threshold.
12. A user identification system, characterized by comprising a second obtaining module, an extraction module, a third obtaining module, a fourth obtaining module, a first calculation module, and a determination module;
the second obtaining module is used for obtaining audio data of an incoming line user and identification information corresponding to the audio data;
the extraction module is used for extracting the voiceprint features of the audio data;
the third obtaining module is configured to input the voiceprint features into a voiceprint recognition model to obtain a first voiceprint feature vector, where the voiceprint recognition model is obtained by training with the model training system according to any one of claims 5 to 8;
the fourth obtaining module is used for obtaining a first historical voiceprint feature vector corresponding to the identification information from a voiceprint database;
the first calculation module is used for calculating a first similarity score of the first voiceprint feature vector and the first historical voiceprint feature vector;
the determining module is configured to determine that the incoming line user and the user corresponding to the first historical voiceprint feature vector are the same person if the first similarity score is greater than a similarity threshold.
13. The user identification system of claim 12, wherein the user identification system further comprises a conversion module, a third processing module, and a splicing module;
the conversion module is used for performing code conversion on the audio data of the incoming line user to obtain the audio data after code conversion;
the third processing module is used for performing audio cutting and noise removal processing on the audio data after code conversion;
the splicing module is used for splicing the audio data after audio cutting and noise removal;
the extraction module is specifically used for extracting the voiceprint features of the spliced audio data.
14. The user identification system of claim 12, wherein the user identification system further comprises a second computing module and a storage module;
the second calculation module is used for calculating the audio quality scores of the audio data corresponding to the first voiceprint feature vector and the audio data corresponding to the first historical voiceprint feature vector;
the storage module is used for updating the voiceprint feature vectors with high audio quality scores into the voiceprint database;
and/or,
the user identification system also comprises a judgment module, a third calculation module and a replacement module;
the judging module is used for judging whether a second historical voiceprint characteristic vector corresponding to the identification information exists in the voiceprint database or not if the first similarity score is not larger than the similarity threshold, and calling the third calculating module if the second historical voiceprint characteristic vector exists;
the third calculation module is used for calculating a second similarity score of the first voiceprint feature vector and the second historical voiceprint feature vector;
the replacement module is configured to replace the first historical voiceprint feature vector in the voiceprint database with the second historical voiceprint feature vector when the second similarity score is greater than the similarity threshold.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the model training method according to any one of claims 1-4 or performs the user recognition method according to any one of claims 9-11 when executing the computer program.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a model training method as claimed in any one of claims 1 to 4, or carries out a user recognition method as claimed in any one of claims 9 to 11.