CN111916057A - Language identification method and device, electronic equipment and computer readable storage medium - Google Patents
Language identification method and device, electronic equipment and computer readable storage medium Download PDFInfo
- Publication number
- CN111916057A CN111916057A CN202010569842.4A CN202010569842A CN111916057A CN 111916057 A CN111916057 A CN 111916057A CN 202010569842 A CN202010569842 A CN 202010569842A CN 111916057 A CN111916057 A CN 111916057A
- Authority
- CN
- China
- Prior art keywords
- language
- target
- module
- identification model
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
An embodiment of the present application provides a language identification method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: identifying the acquired language as a first target language, wherein the acquired language is voice information; matching a language identification model corresponding to the first target language category, and judging the content of the first target language according to the language identification model; and outputting a matched first target answer according to the judged language content. With this technical scheme, a language identification model can be constructed using a neural network, the language category can be identified more accurately, and a response can be output in the same category of language as the input, improving the user's product experience and the system's affinity.
Description
Technical Field
The invention relates to the technical field of intelligent decision making, and in particular to a language identification method, a language identification apparatus, an electronic device, and a computer-readable storage medium.
Background
The traditional mode of manual customer-service dialing struggles to meet the business scenarios of many companies because of its low efficiency and high cost. With the development of artificial intelligence and natural language understanding technology, and with progress in traditional outbound-call technology, intelligent outbound systems have gradually replaced manual customer-service dialing in many traditional service scenarios thanks to higher concurrency and lower cost overhead. However, for user groups spread over multiple geographic regions with complex dialect situations, a single speech recognition model has a low recognition rate across different dialects, and the intelligent outbound system cannot cope well with multi-dialect scenarios.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks. The technical scheme adopted by the application is as follows:
in a first aspect, an embodiment of the present application provides a language identification method, where the method includes:
identifying the acquired language as a first target language; wherein the acquired language is voice information;
matching a language identification model corresponding to the first target language type, and judging the content of the first target language according to the language identification model;
and outputting the matched first target answer according to the judged language content.
Optionally, the language category includes a national language or a local dialect.
Optionally, the language identification model includes: pre-acquiring training data of at least one language;
and carrying out training processing by utilizing a convolutional neural network model and the training data of the at least one language to obtain the language identification model.
Optionally, the obtaining the language identification model by performing the training process using the convolutional neural network model and the training data of the at least one language further includes:
converting the training data of the at least one language into two-dimensional spectrograms, and generating a training set and a test set respectively;
inputting the two-dimensional spectrogram of the training set into an initialized convolutional neural network model for model training to form a language identification model;
and testing the language identification model by using a regression classifier and the two-dimensional spectrograms of the test set.
Optionally, the outputting the matched first target answer according to the determined language content specifically includes:
according to the judged language content, the language identification model outputs a second target answer in a matched text form;
processing the second target answer in the text form into a first target answer; wherein the first target answer is a voice of the same kind as the first target language.
In a second aspect, an embodiment of the present invention provides a language identification apparatus, comprising: an input module, an identification module, a matching module, a judgment module, and an output module; wherein,
the recognition module is used for recognizing the language acquired by the input module as a first target language; wherein the acquired language is voice information;
the matching module is used for matching the language identification model corresponding to the first target language type;
the judging module is used for judging the content of the first target language according to the language identification model;
and the output module is used for outputting the matched first target answer according to the judged language content.
Optionally, the language category includes a national language or a local dialect.
Optionally, the language identification model includes:
the input module acquires training data of at least one language in advance;
and carrying out training processing by utilizing a convolutional neural network model and the training data of the at least one language to obtain the language identification model.
Optionally, the apparatus further comprises a language processing module;
according to the language content judged by the judging module, the output module outputs a second target answer in a matched text form;
the language processing module is used for processing the second target answer in the text form into a first target answer; wherein the first target answer is a voice of the same kind as the first target language.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory;
the memory is used for storing operation instructions;
the processor is used for executing the language identification method by calling the operation instruction.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the above language identification method.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
according to the scheme provided by the embodiment of the application, the acquired language is identified as the first target language; wherein the acquired language is voice information; matching a language identification model corresponding to the first target language type, and judging the content of the first target language according to the language identification model; and outputting the matched first target answer according to the judged language content. Based on the scheme, the language identification model can be constructed by utilizing the neural network, the language type can be identified more accurately, the language of the same type as the input language is output to respond, and the product experience and affinity of a user are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a language identification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for dialect recognition using a convolutional neural network according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an intelligent outbound system design based on dialect type identification according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a language identification apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
A current intelligent outbound system can be configured with only one language identification model, for example a unified Mandarin recognition model. But when calling dialect users in places such as Sichuan, Shanghai, or Guangdong, many users speak the local dialect, and the unified Mandarin model's recognition rate for these dialects is low, which disrupts the question-and-answer business flow and seriously harms the user experience. Moreover, training a single model that can recognize multiple dialects simultaneously is difficult, and its accuracy cannot be guaranteed. The intelligent outbound system therefore needs a solution that can handle various dialects, Mandarin, and even foreign languages at the same time while ensuring a high recognition rate. Aimed at these problems, the invention designs an intelligent outbound method and system based on identifying different language categories, which can greatly improve the poor experience encountered by dialect user groups in different regions.
The embodiment of the application provides a language identification method, a language identification device, an electronic device and a computer-readable storage medium, which aim to solve at least one of the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic flowchart of a language identification method provided in an embodiment of the present application, and as shown in fig. 1, the method mainly includes:
step S101, identifying the acquired language as a first target language; wherein the acquired language is voice information;
step S102, matching a language identification model corresponding to the first target language type, and judging the content of the first target language according to the language identification model;
and S103, outputting the matched first target answer according to the judged language content.
Optionally, the language category includes a national language or a local dialect.
In a specific embodiment of the present application: IVR (Interactive Voice Response) is a powerful automatic telephone service system. In this embodiment, the IVR acquires the language and identifies it as a first target language, the acquired language being voice information; that is, the dialect user's voice information is obtained from the IVR, and the language identification model of the corresponding dialect is selected according to the language category, i.e., the dialect-type identification result. Intent judgment is then performed on the speech recognition result through natural language processing, the content of the obtained dialect speech is judged, and the matched text answer is returned. Finally, the speech synthesis model of the corresponding dialect is selected according to the dialect-type identification result, the text answer is synthesized into the corresponding speech, i.e., synthesized dialect speech, and the dialect answer is played to the user through the IVR.
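The selection-by-category flow above can be sketched as a model registry keyed by the detected dialect category. This is a minimal illustrative sketch: the registry names, the lambda stand-ins for the recognition and synthesis models, and the function `answer_call` are all hypothetical, standing in for the CNN classifier, speech recognizer, natural language processing module, and synthesizer of the embodiment.

```python
# Hypothetical model registry keyed by the detected language/dialect category.
# The lambdas are placeholders for real ASR and TTS engines.
ASR_MODELS = {
    "mandarin": lambda audio: "text in Mandarin",
    "sichuan":  lambda audio: "text in Sichuan dialect",
}
TTS_MODELS = {
    "mandarin": lambda text: b"mandarin-audio",
    "sichuan":  lambda text: b"sichuan-audio",
}

def answer_call(audio, category):
    """Select the recognition and synthesis models matching the caller's
    dialect category, judge the content, and return speech of the same
    category (the 'first target answer')."""
    text = ASR_MODELS[category](audio)       # speech -> text
    reply_text = f"reply to: {text}"         # stand-in for the NLP intent step
    return TTS_MODELS[category](reply_text)  # text -> same-category speech

print(answer_call(b"...", "sichuan"))
```

In a real system each registry entry would wrap an actual engine; the dictionary lookup is what "matching a language identification model corresponding to the first target language category" reduces to.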
Optionally, the language identification model includes: pre-acquiring training data of at least one language;
and performing training processing by using a convolutional neural network model and the training data of the at least one language to obtain the language identification model. Optionally, obtaining the language identification model by performing the training processing using the convolutional neural network model and the training data of the at least one language further includes: converting the training data of the at least one language into two-dimensional spectrograms, and generating a training set and a test set respectively; inputting the two-dimensional spectrograms of the training set into an initialized convolutional neural network model for model training to form the language identification model; and testing the language identification model by using a regression classifier and the two-dimensional spectrograms of the test set.
In the embodiments of the present application, dialect is used as the example language for describing the embodiments. The user's dialect speech is acquired, the user's dialect type is judged by identifying a key value in the received dialect, and the dialect defaults to Mandarin when the judgment value falls below a judgment threshold. Once the dialect type is judged, the corresponding language recognition model is selected for subsequent speech recognition and speech synthesis, which improves the recognition rate and allows speech audio of the corresponding dialect to be synthesized. Dialect-type identification uses a CNN (convolutional neural network) for model training and dialect classification. The training data of the at least one language is converted into two-dimensional spectrograms, and a training set and a test set are generated respectively; the two-dimensional spectrograms of the training set are input into an initialized convolutional neural network model for training to form the language identification model. Specifically, labeled dialect audio files in wav format, comprising single characters, words, and sentences, are converted into two-dimensional spectrograms through windowing, framing, and short-time Fourier transform to obtain the training set and the test set. The CNN model is initialized, trained with the training set, and a Softmax regression classifier performs classification verification of the language category. After the trained model is obtained, it is tested with the test set. The two-dimensional spectrogram converts the speech signal to the frequency domain, which has the advantages of avoiding interference caused by noise and better reflecting the characteristics of speech.
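As an illustration of the classification step (convolution over a spectrogram followed by a Softmax regression classifier), the forward pass can be sketched in plain NumPy. This is a toy sketch under assumed shapes (a 32x32 spectrogram patch, four random 3x3 filters, three dialect classes); a real implementation would use a trained deep-learning framework model, and all weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(spec, kernels, W_out):
    """Conv -> ReLU -> global average pool -> Softmax over dialect classes."""
    feats = np.array([np.maximum(conv2d(spec, k), 0).mean() for k in kernels])
    return softmax(W_out @ feats)

spec = rng.standard_normal((32, 32))       # assumed spectrogram patch
kernels = rng.standard_normal((4, 3, 3))   # 4 random 3x3 filters
W_out = rng.standard_normal((3, 4))        # 3 hypothetical dialect classes
probs = forward(spec, kernels, W_out)
print(probs)  # class probabilities, summing to 1
```

The class with the highest probability would be taken as the dialect-type key value; in practice the filters and output weights come from training on the labeled spectrogram training set.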
The two-dimensional spectrogram is obtained by performing a short-time Fourier transform (STFT) on the continuous speech signal: the long signal is windowed and framed (selecting, e.g., a Hamming or rectangular window), a fast Fourier transform (FFT) is applied to each frame, and the per-frame results are stacked along the other dimension to obtain the spectrogram. The specific steps are as follows:

Let the discrete time-domain sampled signal be x(n), n = 0, 1, ..., N-1, where n is the time-domain sample index and N is the total number of samples. After windowing and framing, x(n) is expressed as x_n(m), m = 0, 1, ..., N-1, where n is the frame index, m is the sample index within a frame, and N is the number of samples per frame. The short-time Fourier transform of x(n) is then:

X(n, k) = \sum_{m=0}^{N-1} x_n(m) \, w(m) \, e^{-j 2\pi k m / N}

where w(m) is the selected window function and 0 <= k <= N-1. Then |X(n, k)| is the spectral estimate of x(n), and the spectral energy density function at frame n (the two-dimensional spectrogram) is:

P(n, k) = |X(n, k)|^2
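The windowing/framing, per-frame FFT, and stacking steps above can be sketched directly in NumPy. The frame length, hop size, and the 440 Hz test tone are illustrative choices, not values taken from the embodiment.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Two-dimensional spectrogram P(n, k) = |X(n, k)|**2: window and frame
    the long signal, FFT each frame, and stack the per-frame results."""
    window = np.hamming(frame_len)                 # window function w(m)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    X = np.fft.rfft(frames * window, axis=1)       # X(n, k), one row per frame n
    return np.abs(X) ** 2                          # spectral energy density P(n, k)

# Illustrative input: a 1-second 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
P = spectrogram(np.sin(2 * np.pi * 440 * t))
print(P.shape)  # (number of frames, frame_len // 2 + 1)
```

Each row of P is one frame's energy spectrum; the peak of the first row lands near bin 440 * frame_len / fs, i.e., the tone's frequency.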
Fig. 2 is a schematic diagram of the dialect recognition method using a convolutional neural network: a dialect user answers an outbound call, the call arrives through the operator, the dialect user's voice signal is obtained through the IVR, and a short-time Fourier transform of the signal yields a two-dimensional spectrogram, on which the trained CNN model performs dialect-type identification.
Optionally, in an embodiment of the present application, outputting the matched first target answer according to the judged language content specifically includes: according to the judged language content, the language identification model outputs a matched second target answer in text form; and processing the second target answer in text form into a first target answer, where the first target answer is speech of the same category as the first target language. Specifically, taking dialects as the example in a real-life scenario: the user answers the outbound call by voice (i.e., voice information is received), the dialect the user speaks is identified through dialect-type identification (the speech category is identified through the language identification model), and the corresponding key value is determined. The key value selects the language identification model matching the user's own dialect, the speech is transcribed to text with high accuracy, the result is processed by natural language technology to analyze the user's intent and produce response text, and finally the corresponding speech synthesis model for the dialect is selected through the key value, synthesizing dialect speech (the target answer) for the user to hear. Optionally, to improve response efficiency, dialect-type identification may determine the valid dialect-type key value only on the dialect user's first response and cache the key value in the IVR; when the user responds again, the valid dialect-type key value is read directly from the cache for speech recognition and speech synthesis, without performing dialect-type identification again.
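The cache-on-first-response optimization described above can be sketched as follows. `identify_dialect` is a hypothetical stand-in for the CNN dialect-type classifier, and `IvrSession` is an assumed name for per-call IVR state, not an API of any real IVR product.

```python
def identify_dialect(audio):
    """Placeholder for the CNN dialect-type classifier."""
    return "sichuan"

class IvrSession:
    """Per-call state: classify once, then reuse the cached key value."""
    def __init__(self):
        self.dialect_key = None   # no valid key value cached yet
        self.identifications = 0  # count classifier invocations

    def dialect_for(self, audio):
        if self.dialect_key is None:               # first response: classify
            self.dialect_key = identify_dialect(audio)
            self.identifications += 1              # ...and cache the result
        return self.dialect_key                    # later turns: read cache

session = IvrSession()
for _ in range(3):
    key = session.dialect_for(b"utterance")
print(key, session.identifications)  # classifier ran only once
```

Across three simulated turns the classifier runs once; subsequent speech recognition and synthesis reuse the cached key value, which is the stated efficiency gain.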
FIG. 3 is a schematic diagram of the design of an intelligent outbound system based on dialect-type identification:
1. The outbound platform of the intelligent outbound system places an outbound call to a specific dialect user through a telecom operator and plays a welcome message;
2. the dialect user responds to the welcome message (the language is acquired);
3. if the IVR has not cached a valid dialect-type key value, dialect-type identification is performed on the dialect user's response speech through the IVR;
4. the identified valid dialect-type key value is cached in the IVR for this call;
5. if a valid dialect-type key value exists, the language identification model corresponding to the key value is selected (matching the corresponding language identification model according to the first target language category), and the dialect user's response speech is converted into text;
6. the recognized text is sent to the natural language processing module, which judges the content of the dialect user's speech and returns the matching response text (according to the judged language content, the language identification model outputs a second target answer in matched text form);
7. the outbound response text is sent to the speech synthesis module, the valid dialect-type key value is read from the cache, the corresponding speech synthesis model is selected, and the corresponding audio data is synthesized (the second target answer in text form is processed into a first target answer, where the first target answer is speech of the same category as the first target language);
8. the synthesized outbound response speech is played to the dialect user through the IVR and the operator network, completing one round of interactive response with the dialect user.
Fig. 4 is a schematic structural diagram of a language identification apparatus according to the present invention, the apparatus comprising: an input module 401, a recognition module 402, a matching module 403, a judgment module 404, and an output module 405; wherein,
the recognition module is used for recognizing the language acquired by the input module as a first target language; wherein the acquired language is voice information;
the matching module is used for matching the language identification model corresponding to the first target language type;
the judging module is used for judging the content of the first target language according to the language identification model;
and the output module is used for outputting the matched first target answer according to the judged language content.
Optionally, the language category includes a national language or a local dialect.
Optionally, the language identification model includes:
the input module acquires training data of at least one language in advance;
and carrying out training processing by utilizing a convolutional neural network model and the training data of the at least one language to obtain the language identification model.
Optionally, the apparatus further comprises a language processing module;
according to the language content judged by the judging module, the output module outputs a second target answer in a matched text form;
the language processing module is used for processing the second target answer in the text form into a first target answer; wherein the first target answer is a voice of the same kind as the first target language.
It is understood that the above modules of the language identification device in the present embodiment have functions of implementing the corresponding steps of the method in the embodiment shown in fig. 1. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules can be software and/or hardware, and each module can be implemented independently or by integrating a plurality of modules. For the functional description of each module, reference may be specifically made to the corresponding description of the method in the embodiment shown in fig. 1, and details are not repeated here.
The embodiment of the application provides an electronic device, which comprises a processor and a memory;
a memory for storing operating instructions;
and the processor is used for executing the language identification method provided by any embodiment of the application by calling the operation instruction.
As an example, fig. 5 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applicable, and as shown in fig. 5, the electronic device 2000 includes: a processor 2001 and a memory 2003. Wherein the processor 2001 is coupled to a memory 2003, such as via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that the transceiver 2004 is not limited to one in practical applications, and the structure of the electronic device 2000 is not limited to the embodiment of the present application.
The processor 2001 is applied to the embodiment of the present application to implement the method shown in the above method embodiment. The transceiver 2004 may include a receiver and a transmitter, and the transceiver 2004 is applied to the embodiments of the present application to implement the functions of the electronic device of the embodiments of the present application to communicate with other devices when executed.
The Processor 2001 may be a CPU (Central Processing Unit), general Processor, DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), FPGA (Field Programmable Gate Array) or other Programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 2001 may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs and microprocessors, etc.
The Memory 2003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
Optionally, the memory 2003 is used for storing application program code for performing the disclosed aspects, and is controlled in execution by the processor 2001. The processor 2001 is configured to execute the application program code stored in the memory 2003 to implement the language identification method provided in any of the embodiments of the present application.
The electronic device provided by the embodiment of the application is applicable to any embodiment of the method, and is not described herein again.
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the language identification method shown in the above method embodiment.
The computer-readable storage medium provided in the embodiments of the present application is applicable to any of the embodiments of the foregoing method, and is not described herein again.
According to the solution provided in the embodiments of the present application, the acquired speech is identified as a first target language, where the acquired input is voice information; a language identification model corresponding to the first target language is matched, and the content of the first target language is judged according to that model; and a matched first target answer is output according to the judged content. On this basis, a language identification model can be built with a neural network, the language type can be identified more accurately, and a response can be output in the same language as the input, improving the user's product experience and the product's approachability.
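The flow in this summary can be sketched in a few lines; every name below (`detect_language`, `MODEL_REGISTRY`, the stub recognizers, the echoed answer) is a hypothetical placeholder standing in for the language detector, the per-language identification models, and the answer matching described above, not an API from this application:

```python
def detect_language(audio):
    """Placeholder detector: pretend the input already carries a
    language label (a real system would classify the waveform)."""
    return audio["lang"]

# Hypothetical registry mapping a language label to its matched
# identification model (here, stub recognizers returning fixed text).
MODEL_REGISTRY = {
    "mandarin": lambda audio: "ni hao",
    "cantonese": lambda audio: "nei hou",
}

def answer(audio):
    lang = detect_language(audio)     # first target language
    model = MODEL_REGISTRY[lang]      # matched identification model
    text = model(audio)               # judged language content
    # Respond in the same language as the input; speech synthesis of
    # the text answer is stubbed out here.
    return {"lang": lang, "text": f"echo: {text}"}
```

For example, `answer({"lang": "mandarin"})` dispatches to the Mandarin stub and returns a Mandarin-labelled response, which is the same-language-response behaviour the summary describes.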
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is likewise not necessarily sequential, and they may be performed in turn or in alternation with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.
Claims (10)
1. A method of language identification, the method comprising:
identifying the acquired language as a first target language; wherein the acquired language is voice information;
matching a language identification model corresponding to the first target language type, and judging the content of the first target language according to the language identification model;
and outputting the matched first target answer according to the judged language content.
2. The method according to claim 1, wherein the language includes a national language or a local dialect.
3. The language identification method according to claim 1, wherein the language identification model is obtained by:
pre-acquiring training data of at least one language;
and performing training with a convolutional neural network model and the training data of the at least one language to obtain the language identification model.
4. The method of claim 3, wherein obtaining the language identification model by training with the convolutional neural network model and the training data of the at least one language further comprises:
converting the training data of the at least one language into two-dimensional spectrograms, and generating a training set and a test set, respectively;
inputting the two-dimensional spectrograms of the training set into an initialized convolutional neural network model for model training to form the language identification model;
and testing the language identification model with a regression classifier and the two-dimensional spectrograms of the test set.
5. The method according to claim 1 or 4, wherein the outputting the matched first target answer according to the determined language content specifically includes:
according to the judged language content, the language identification model outputs a second target answer in a matched text form;
processing the second target answer in text form into the first target answer; wherein the first target answer is speech in the same language as the first target language.
6. A language identification apparatus, the apparatus comprising: an input module, a recognition module, a matching module, a judgment module, and an output module; wherein,
the recognition module is used for recognizing the language acquired by the input module as a first target language; wherein the acquired language is voice information;
the matching module is used for matching the language identification model corresponding to the first target language type;
the judging module is used for judging the content of the first target language according to the language identification model;
and the output module is used for outputting the matched first target answer according to the judged language content.
7. The language identification apparatus of claim 6, wherein the language identification model is obtained by:
the input module pre-acquiring training data of at least one language;
and performing training with a convolutional neural network model and the training data of the at least one language to obtain the language identification model.
8. The language identification apparatus of claim 6 or 7, wherein the apparatus further comprises a language processing module;
according to the language content judged by the judging module, the output module outputs a second target answer in a matched text form;
the language processing module is used for processing the second target answer in text form into the first target answer; wherein the first target answer is speech in the same language as the first target language.
9. An electronic device comprising a processor and a memory;
the memory is used for storing operation instructions;
the processor is used for executing the method of any one of claims 1-5 by calling the operation instruction.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method of any one of claims 1-5.
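Claims 3 and 4 above turn training audio into two-dimensional spectrograms before feeding a convolutional neural network. The spectrogram step can be sketched with plain NumPy; the frame length, hop size, and the synthetic 440 Hz test tone are illustrative assumptions, and the CNN training and regression-classifier test stages are out of scope of this sketch:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Convert a 1-D waveform into a 2-D log-magnitude spectrogram
    (time frames x frequency bins), the representation claims 3-4
    feed into the convolutional neural network. Frame and hop sizes
    are illustrative choices, not values from the patent."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log1p(mag)                       # compress dynamic range

# Toy input: one second of a 440 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
# spec is a 2-D array: one row per frame, one column per frequency bin.
```

Stacking such 2-D arrays (one per utterance) yields the training-set and test-set inputs the claims describe; the per-frame energy peak sits near bin 440 / (8000 / 256) ≈ 14, which is what a CNN would pick up on.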
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010569842.4A CN111916057A (en) | 2020-06-20 | 2020-06-20 | Language identification method and device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111916057A true CN111916057A (en) | 2020-11-10 |
Family
ID=73226088
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010569842.4A Pending CN111916057A (en) | 2020-06-20 | 2020-06-20 | Language identification method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111916057A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114495931A (en) * | 2022-01-28 | 2022-05-13 | 达闼机器人股份有限公司 | Voice interaction method, system, device, equipment and storage medium |
CN117995166A (en) * | 2024-01-26 | 2024-05-07 | 长沙通诺信息科技有限责任公司 | Natural language data analysis method and system based on voice recognition |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150006148A1 (en) * | 2013-06-27 | 2015-01-01 | Microsoft Corporation | Automatically Creating Training Data For Language Identifiers |
CN104391673A (en) * | 2014-11-20 | 2015-03-04 | 百度在线网络技术(北京)有限公司 | Voice interaction method and voice interaction device |
CN105957516A (en) * | 2016-06-16 | 2016-09-21 | 百度在线网络技术(北京)有限公司 | Switching method and device for multiple voice identification models |
US20190096396A1 (en) * | 2016-06-16 | 2019-03-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Multiple Voice Recognition Model Switching Method And Apparatus, And Storage Medium |
CN109256118A (en) * | 2018-10-22 | 2019-01-22 | 江苏师范大学 | End-to-end Chinese dialects identifying system and method based on production auditory model |
CN110211565A (en) * | 2019-05-06 | 2019-09-06 | 平安科技(深圳)有限公司 | Accent recognition method, apparatus and computer readable storage medium |
CN110827793A (en) * | 2019-10-21 | 2020-02-21 | 成都大公博创信息技术有限公司 | Language identification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111667814B (en) | Multilingual speech synthesis method and device | |
CN111048064B (en) | Voice cloning method and device based on single speaker voice synthesis data set | |
CN110782872A (en) | Language identification method and device based on deep convolutional recurrent neural network | |
CN111477216A (en) | Training method and system for pronunciation understanding model of conversation robot | |
CN103377651B (en) | The automatic synthesizer of voice and method | |
CN103514882A (en) | Voice identification method and system | |
CN111312292A (en) | Emotion recognition method and device based on voice, electronic equipment and storage medium | |
CN111508466A (en) | Text processing method, device and equipment and computer readable storage medium | |
CN114416989A (en) | Text classification model optimization method and device | |
CN111916057A (en) | Language identification method and device, electronic equipment and computer readable storage medium | |
US7844459B2 (en) | Method for creating a speech database for a target vocabulary in order to train a speech recognition system | |
CN116631412A (en) | Method for judging voice robot through voiceprint matching | |
CN115116458A (en) | Voice data conversion method and device, computer equipment and storage medium | |
CN111640423B (en) | Word boundary estimation method and device and electronic equipment | |
CN113724698B (en) | Training method, device, equipment and storage medium of voice recognition model | |
CN117351948A (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
Zhu | [Retracted] Multimedia Recognition of Piano Music Based on the Hidden Markov Model | |
CN117496960A (en) | Training method and device of voice recognition model, electronic equipment and storage medium | |
CN113823271B (en) | Training method and device for voice classification model, computer equipment and storage medium | |
CN115798456A (en) | Cross-language emotion voice synthesis method and device and computer equipment | |
CN114566156A (en) | Keyword speech recognition method and device | |
CN113921042A (en) | Voice desensitization method and device, electronic equipment and storage medium | |
CN111899738A (en) | Dialogue generating method, device and storage medium | |
CN113505612B (en) | Multi-user dialogue voice real-time translation method, device, equipment and storage medium | |
CN117672221B (en) | Information transmission communication control method and system through voice control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |

Effective date of registration: 20220914 Address after: 12 / F, 15 / F, 99 Yincheng Road, Pudong New Area pilot Free Trade Zone, Shanghai, 200120 Applicant after: Jianxin Financial Science and Technology Co.,Ltd. Address before: 25 Financial Street, Xicheng District, Beijing 100033 Applicant before: CHINA CONSTRUCTION BANK Corp. Applicant before: Jianxin Financial Science and Technology Co.,Ltd. |

RJ01 | Rejection of invention patent application after publication |

Application publication date: 20201110 |