CN111026884A

CN111026884A - Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus

Info

Publication number: CN111026884A
Application number: CN201911271656.6A
Authority: CN
Inventors: 张献涛; 张猛; 暴筱; 林小俊
Original assignee: Nanchang Zhonghui Zhiying Information Technology Co Ltd
Current assignee: Shanghai Yishang Network Technology Co ltd
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2020-04-17
Anticipated expiration: 2039-12-12
Also published as: CN111026884B

Abstract

The invention discloses a dialogue corpus generation method for improving the quality and diversity of a human-computer interaction dialogue corpus. The method comprises the following steps: 1) performing synonymous-sentence expansion on the selected dialogue corpora to form a candidate set; 2) performing anomaly detection on each dialogue corpus in the candidate set to obtain an abnormal value for each dialogue corpus; 3) saving the dialogue corpora whose abnormal value is lower than a set scoring threshold into the improved dialogue corpus; 4) performing semantic analysis on the dialogue corpora whose abnormal value is higher than or equal to the scoring threshold: if the dialogue data is erroneous, discarding it directly; if the dialogue data is diverse, executing step 5); otherwise, storing the current dialogue corpus in the improved dialogue corpus; 5) taking the dialogue data judged to be diverse as input again and executing steps 1) to 4) until a stop condition is reached, at which point the iteration stops. The invention achieves quality control and diversity expansion of the original dialogue corpus.

Description

Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus

Technical Field

The invention belongs to the technical field of information technology and data mining, and relates to a dialogue corpus generation method for improving the quality and diversity of a man-machine interaction dialogue corpus.

Background

With the continuous development of science and technology, artificial intelligence models are increasingly applied in intelligent systems, creating a variety of human-computer interaction requirements. How to carry out human-computer interaction more effectively is a problem that urgently needs to be solved. Currently, most human-computer interaction models are data-driven: they are trained on a corpus to obtain well-performing parameters and are then deployed in systems. A high-quality corpus therefore plays an increasingly important role.

People use a rich variety of expressions in human-computer interaction and place high demands on the accuracy of semantic understanding. To train an accurate and robust model, an accurate, high-quality dialogue corpus is required, and the corpus should be as rich as possible and cover varied ways of expressing the same meaning.

Chinese patent ZL201510251428.8 discloses a corpus screening method and apparatus, wherein the corpus screening method includes: performing cross check based on the first corpus set to obtain a first check result; judging whether the first check result meets a first preset condition or not; when the first check result meets the first preset condition, performing public check based on the first corpus set to obtain a second check result; judging whether the first corpus set needs to be screened or not according to the second check result; and when the first corpus set is judged to be required to be screened, performing first screening processing on the first corpus set. The method solves the problem of low quality of training samples caused by influence of subjective preference when the corpora are screened in the related technology, and further achieves the effect of improving the quality of the training samples.

Chinese patent ZL201310344326.1 provides a corpus expansion device, which includes: the screening unit screens out an initial corpus sample according to preset corpus screening conditions; and the expansion unit is used for identifying the collected corpus according to the initial corpus sample and the expansion strategy to obtain an expanded corpus sample, and performing corpus expansion again based on the expanded corpus sample and the expansion strategy. The method carries out machine labeling on the large-scale training corpus in an automatic mode, thereby greatly saving the time period and the cost for manufacturing the large-scale training corpus and improving the labeling accuracy.

Currently, most corpus processing methods amount to simple cleaning: abnormal data inconsistent with expectations or with the overall distribution is removed according to some standard. The invention focuses on the abnormal data in the human-computer dialogue corpus and divides it into error data and special-case data. Error data needs to be eliminated, whereas special-case data is an unusual but valid expression that enhances the diversity of expressions in the corpus and needs to be retained and further expanded. In this way the quality of the human-computer interaction dialogue corpus is improved, and with it the precision of models subsequently trained on the corpus.

Disclosure of Invention

In view of the technical problems in the prior art, the invention aims to provide a method for improving the quality and diversity of a human-computer interaction dialogue corpus. Based on statistical or machine learning methods applied to the corpus, the invention performs quality control and diversity expansion on the original dialogue corpus.

The technical scheme adopted by the invention is as follows:

A method for improving the quality and diversity of a human-computer interaction dialogue corpus comprises the following steps:

1) performing synonymous-sentence expansion on the selected input dialogue corpora to form a candidate set;

2) performing anomaly detection on the dialogue corpora in the candidate set and outputting an abnormal value for each dialogue corpus as its score;

3) sorting by score, determining a threshold by the adjacent maximum difference method, and storing the dialogue corpora whose score is lower than the threshold into the improved dialogue corpus;

4) performing further semantic analysis on the outliers whose abnormal value is higher than the threshold:

4.1) if the dialogue data is erroneous, discarding it directly;

4.2) if the dialogue data has good diversity, using it as input again and returning to step 1) for another iteration;

4.3) if the dialogue data is correct but of ordinary diversity, storing it in the improved dialogue corpus;

5) stopping the iteration when a stop condition is reached.

Further, the synonymous-sentence expansion of step 1) may be performed manually: several annotators expand the input dialogue corpora, which makes good use of human experience and knowledge. To save labor and improve efficiency, automatic synonymous-sentence expansion can also be adopted. Specifically, operations such as word order exchange, stop word deletion, synonym replacement and cross-language translation are randomly applied to the dialogue corpora to complete the expansion.

Further, in step 2) the question sentences of all segmented dialogue corpora are vectorized, i.e. each is represented by a fixed-length vector d_text. The vectors of all dialogue corpora are then averaged to obtain the average vector d_mean. Using the distance formula Dis provided by the invention, the distance between each dialogue corpus and the average vector is calculated and used as its difference score. The higher the score, the greater the difference. A large difference may mean that the dialogue corpus is an incorrect sentence, which should be discarded to improve the quality of the corpus; it may also mean that the dialogue corpus is correct but an unusual expression, which helps increase the diversity of the corpus and should be retained and further expanded.

Further, in step 3), dialogue corpora with a low score can be regarded as common, valid expressions and are stored in the improved corpus.

Further, step 4) judges the high-scoring dialogue corpora left over from step 3) case by case. The judgment can be made manually: several annotators determine the category of each input dialogue corpus from experience, which makes good use of human knowledge and allows flexible handling. To save labor and improve efficiency, an automatic judgment method can also be adopted, in which a comprehensive judgment is made from automatic web search index results and the results returned by a question-answering model.

Further, in step 4.1), if the dialogue corpus is erroneous and irrelevant to the intended target of the dialogue, or its category cannot be judged for the time being, it is discarded directly, which improves the quality of the corpus.

Further, in step 4.2), if the dialogue corpus is correct and is a less common expression that helps increase the diversity of the corpus, it is retained and used as a seed dialogue corpus for the next iteration, and step 1) is repeated.

Further, in step 4.3), a dialogue corpus that is correct and valid but ordinarily expressed is stored directly in the improved corpus.

Further, in step 5), since the method is an iterative updating process, different stop conditions can be set as required, for example: a fixed number of iterations is reached, the input produced by step 4.2) is empty, or the number of dialogues in the corpus has grown to a preset number.

The innovation of the method is that it focuses on the abnormal data of the dialogue corpus, so that the iterative process both improves corpus quality and increases corpus diversity. The synonymous-sentence expansion method of step 1), the abnormal data detection method of step 2) and the differentiated processing method of step 4) are all novel, feasible and effective.

The invention also provides an artificial intelligence model training method which is characterized in that the artificial intelligence model is trained by the corpora in the dialogue corpus obtained by the method.

The invention also provides a human-computer interaction method which is characterized in that the human-computer interaction is carried out by adopting the artificial intelligence model obtained by training.

Compared with the prior art, the invention has the following positive effects:

the invention can improve the quality and expand the diversity of the human-computer interaction dialogue corpus, can reduce the input of manpower, and can improve the accuracy and the robustness of the algorithm model by the improved corpus.

Drawings

FIG. 1 is a flow chart of the steps of the present invention;

FIG. 2 is a flow diagram of an automatic synonym expansion;

FIG. 3 is an exemplary diagram of a synonym expansion;

FIG. 4 is a flow chart of abnormal dialog corpus detection.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Taking the improvement of a human-computer interaction corpus in the hotel domain as an example, the invention specifically describes how the proposed steps improve the quality and expand the diversity of the original corpus. The invention first performs basic word segmentation on the initial corpus. The initial corpus Corpus_init is composed of a series of n dialogue corpora (question-answer pairs) and can be expressed as {QApair₁, QApair₂, …, QApair_n}. Each question-answer pair has a question, an answer and a labeled question-answer intent; for example, the i-th question-answer pair may be expressed as (Sentence_i, Answer_i, Intent_i), e.g. ("asking hotel rooms to cover wifi", "our hotel is full coverage wireless, without password direct connection", "ask for network"). In general, the Answer is essentially fixed under the same intent. The invention therefore focuses on expanding the quality and diversity of the question Sentence under the same Intent.

The Chinese word segmentation is the basic step of Chinese natural language processing, and the word segmentation adopts a method of combining dictionary word segmentation and statistical word segmentation. Firstly, a maximum matching word segmentation method based on a dictionary is adopted, and a word segmentation method of sequence labeling (conditional random field) is adopted for ambiguous parts of word segmentation.
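The dictionary stage of this segmentation can be sketched as follows. This is a minimal illustration that assumes forward (left-to-right) maximum matching; the text does not state the matching direction. The function name `max_match_segment`, the `max_word_len` limit and the toy dictionary are hypothetical, and the conditional random field pass for ambiguous spans is not shown.

```python
def max_match_segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position take the longest dictionary word.

    Assumption: forward matching with a fixed maximum word length; characters
    that start no dictionary word are emitted as single-character tokens.
    """
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"请问", "酒店", "房间", "覆盖"}
print(max_match_segment("请问酒店房间覆盖", toy_dictionary))
```

In a fuller pipeline, spans where dictionary matching is ambiguous would be handed to the sequence-labeling model described above.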

Thus, the user's question Sentence_i is composed of a sequence of segmented words and may be expressed as

Sentence_i = (w_{i,1}, w_{i,2}, …, w_{i,k}, …, w_{i,max})

where i denotes the question of the i-th dialogue corpus, k is the position of the word in the sentence, and max is the maximum number of words allowed in a sentence. The invention takes max as 100; if a sentence is longer, the words beyond this length are truncated.

FIG. 1 is a flow chart of the steps of the method of the present invention, the following are the specific implementation steps:

Step 1: the synonymous-sentence expansion can be performed manually on the dialogue corpora, which makes good use of human experience and knowledge; mature approaches such as crowdsourcing can also be used to expand the dialogue corpora. To save labor and improve efficiency, automatic synonymous-sentence expansion can also be adopted. Specifically, operations such as word order exchange, stop word deletion, synonym replacement and cross-language translation are randomly applied to the dialogue corpora to complete the expansion. The expansion flow is shown in FIG. 2, and a simple example is shown in FIG. 3.

The invention defines four basic operations, as follows:

1. Word order exchange: for the question sentence of the input dialogue corpus, any one word is selected and exchanged with the word that follows it. In actual human-computer dialogue, users often change the word order of an expression. For example, the question "asking for hotel room coverage wifi" may equally be phrased with two adjacent words exchanged, so word order exchange enhances the diversity of expressions and covers the expression habits of some users. This operation is optional and is skipped with probability p₁, where p₁ lies in [0,1]; the invention takes p₁ = 0.8.

2. Stop word deletion: for the question sentence of the input dialogue corpus, a manually compiled dictionary of common stop words is used to check whether the sentence contains stop words, and any that are found are deleted. For example, the word "ask" in "ask hotel room coverage wifi" can be deleted without affecting the meaning of the sentence. This operation is optional and is skipped with probability p₂, where p₂ lies in [0,1]; the invention takes p₂ = 0.4.

3. Synonym replacement: for the question sentence of the input dialogue corpus, a synonym dictionary (obtained by manual correction of the public "Harvard big word forest") is used to check whether the sentence contains words that have synonyms, and if so each such word is replaced with a synonym of the same meaning from the dictionary. For example, in "ask for hotel room wifi to cover how", a synonym can be found for "hotel", and the sentence is rewritten with that synonym in its place. Synonym replacement maintains semantic consistency and can introduce unseen words (words that do not appear in the original corpus but are in the synonym dictionary), which enhances the diversity of expressions. This operation is optional and is skipped with probability p₃, where p₃ lies in [0,1]; the invention takes p₃ = 0.1.

4. Cross-language translation: this operation uses current machine translation technology to expand synonymous sentences through the changes in expression that arise when translating between languages. Specifically, an existing machine translation service (such as Google Translate or Baidu Translate) translates the question sentence of the dialogue corpus into a first intermediate language, then from the first intermediate language into a second intermediate language, and finally from the second intermediate language back into Chinese. The returned result is compared with the original input and retained if it differs. For example, "how do wifi coverage of hotel room to ask" may be translated into English as "do the hotel room has wifi coverage", then from English into French as "La chambre d'hôtel est-elle couverte par un réseau sans fil", and finally from French back into Chinese as "whether the hotel room is covered by wireless network". The final returned expression preserves semantic consistency while being richer in expression.

This operation is optional and is skipped with probability p₄, where p₄ lies in [0,1]; the invention takes p₄ = 0.3.

Each of the four operations above is randomly skipped or applied, so the operations act on the candidate dialogue corpus with probabilities 1-p₁, 1-p₂, 1-p₃ and 1-p₄ respectively. FIG. 3 shows an example: after the first two operations there are already 4 possible variants, and after the remaining operations there are 16 possible variants.
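The first three operations and their skip probabilities can be sketched as below. This is an illustrative sketch, not the patent's implementation: the function names, the dictionary layout and the injected random generator are assumptions, and the cross-language translation step is omitted because it depends on an external translation service.

```python
import random

# Skip probabilities p1..p3 as stated in the text (p4 = 0.3 for translation, omitted here).
P_SKIP = {"swap": 0.8, "stop": 0.4, "synonym": 0.1}


def swap_adjacent(words, rng):
    """Exchange one randomly chosen word with the word that follows it."""
    out = list(words)
    if len(out) >= 2:
        k = rng.randrange(len(out) - 1)
        out[k], out[k + 1] = out[k + 1], out[k]
    return out


def delete_stop_words(words, stop_words):
    """Drop every word that appears in the stop word dictionary."""
    return [w for w in words if w not in stop_words]


def replace_synonyms(words, synonyms, rng):
    """Replace each word that has dictionary synonyms with one of them."""
    return [rng.choice(synonyms[w]) if w in synonyms else w for w in words]


def expand(words, stop_words, synonyms, rng, p_skip=P_SKIP):
    """Produce one variant; each operation is skipped with its own probability."""
    out = list(words)
    if rng.random() >= p_skip["swap"]:      # applied with probability 1 - p1
        out = swap_adjacent(out, rng)
    if rng.random() >= p_skip["stop"]:      # applied with probability 1 - p2
        out = delete_stop_words(out, stop_words)
    if rng.random() >= p_skip["synonym"]:   # applied with probability 1 - p3
        out = replace_synonyms(out, synonyms, rng)
    return out


rng = random.Random(0)
variant = expand(["ask", "hotel", "room", "wifi"], {"ask"}, {"hotel": ["inn"]}, rng)
print(variant)
```

Calling `expand` repeatedly on the same seed sentence and keeping the distinct results gives a candidate set; passing the generator explicitly makes runs reproducible.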

Step 2: the invention vectorizes all the segmented dialogue corpora, i.e. each is represented by a fixed-length vector d_text, and the distance is calculated on this basis. The specific steps are shown in FIG. 4.

The system maps each word to a low-dimensional continuous vector. A deep text representation model (e.g. Word2Vec, a tool that converts words into vector form) can be used to represent the question sentences of the dialogue corpora and obtain word vectors. Each word in the sentence Sentence_i is mapped to a vector, where the vector dimension is taken to be 200.

The word vectors of a sentence are then added together to obtain the semantic vector of the dialogue corpus:

d_text = v_1 + v_2 + … + v_m

where v_k is the vector of the k-th word and m is the number of words in the sentence.

Further, the vectors of the n dialogue corpora are averaged to obtain the average vector:

d_mean = (d_text,1 + d_text,2 + … + d_text,n) / n

The distance between each dialogue corpus and the average vector is then calculated and used as the difference score of that dialogue corpus. The invention designs a dedicated distance formula intended to amplify the differences between dialogue corpora and make them as distinguishable as possible. The distance Dis between the input vector d_text and the average vector d_mean is calculated as follows:

[Formula for Dis: given as an image in the original publication and not reproduced in this text.]

The threshold in the formula is a predefined constant that only needs to be a small value; the invention takes 0.01.

The higher the Dis score, the greater the difference. A large difference may indicate that the dialogue corpus is incorrect and should be discarded to improve corpus quality; it may also indicate that the dialogue corpus is correct but an unusual expression, which helps increase the diversity of the corpus and needs to be retained.
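The scoring flow of step 2 can be sketched as follows. Because the patent's own Dis formula is not reproduced in this text, the sketch substitutes plain Euclidean distance as a stand-in; that choice, the function names, the handling of unknown words and the 2-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def sentence_vector(words, word_vectors, dim):
    """Add the word vectors of a sentence to get its semantic vector d_text."""
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words without a vector are ignored
        for j in range(dim):
            total[j] += vec[j]
    return total


def mean_vector(vectors):
    """Average the sentence vectors component-wise to get d_mean."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(x, d):
    """Stand-in distance; NOT the Dis formula defined by the patent."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, d)))


def score_sentences(sentences, word_vectors, dim, distance=euclidean):
    """Return one difference score per sentence: its distance to d_mean."""
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    d_mean = mean_vector(vectors)
    return [distance(v, d_mean) for v in vectors]


# Toy 2-dimensional vectors (the text uses 200 dimensions).
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "z": [10.0, 10.0]}
scores = score_sentences([["a", "b"], ["a", "b"], ["z"]], toy_vectors, dim=2)
print(scores)
```

The `distance` parameter is left pluggable so that the patent's own formula, once available, could be passed in without changing the rest of the flow.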

Step 3: sort by score and save the low-scoring dialogue corpora into the improved dialogue corpus Corpus_improved. In step 2, each of the n dialogue corpora has been given a score. The dialogue corpora are then sorted by score in increasing order, from low to high. Different threshold selection methods may be chosen as required. In the invention, the threshold Score_threshold is determined by the adjacent maximum difference method, and the first r dialogue corpora, whose scores are lower than the threshold, are selected. These dialogue corpora have low scores and small differences, can be considered valid and accurate, and are put into Corpus_improved. The remaining n-r dialogue corpora are analyzed and processed in the next step. The adjacent maximum difference method proposed here is calculated as follows:

3.1 Sorting: the n dialogue corpora are sorted by the distance Score from low to high, giving (Sentence₁, Score₁), (Sentence₂, Score₂), …, (Sentence_n, Score_n).

3.2 Calculating adjacent differences: the difference Delta_k between adjacent entries in the ranking is calculated as follows:

Delta_k = Score_k - Score_{k-1}, k ∈ [2, n]

3.3 Taking the maximum difference to determine the scoring threshold Score_threshold: the maximum of the differences in 3.2 is denoted Delta_q, meaning that the difference between Sentence_{q-1} and Sentence_q is the largest. The average of their two scores is taken as the scoring threshold:

Score_threshold = (Score_{q-1} + Score_q) / 2

3.4 Using the threshold Score_threshold from 3.3, the first r dialogue corpora, whose scores are lower than the threshold, are selected. These dialogue corpora have low scores and small differences, can be considered valid and accurate, and are put into Corpus_improved. The remaining n-r dialogue corpora are analyzed and processed in the next step.
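Steps 3.1 to 3.4 can be sketched in a few lines. The function names are hypothetical, the sketch assumes ascending order as described in step 3, and when several gaps tie for the maximum it takes the first one, which the text does not specify.

```python
def adjacent_max_difference_threshold(scores):
    """Return the midpoint of the largest gap between adjacent sorted scores."""
    ordered = sorted(scores)                       # 3.1: sort from low to high
    if len(ordered) < 2:
        raise ValueError("at least two scores are needed")
    # 3.2: gaps[j] is the difference between ordered[j + 1] and ordered[j]
    gaps = [ordered[k] - ordered[k - 1] for k in range(1, len(ordered))]
    # 3.3: position of the largest gap (first one on ties: an assumption)
    q = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return (ordered[q - 1] + ordered[q]) / 2


def split_by_threshold(scored):
    """3.4: separate (sentence, score) pairs into kept items and outliers."""
    threshold = adjacent_max_difference_threshold([s for _, s in scored])
    kept = [sent for sent, s in scored if s < threshold]
    outliers = [sent for sent, s in scored if s >= threshold]
    return kept, outliers, threshold


pairs = [("s1", 0.10), ("s2", 0.20), ("s3", 0.25), ("s4", 0.90), ("s5", 1.00)]
kept, outliers, threshold = split_by_threshold(pairs)
print(kept, outliers, threshold)
```

In this toy run the largest gap lies between 0.25 and 0.90, so the three low-scoring sentences are kept and the other two go on to the semantic analysis of step 4.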

Step 4: further semantic analysis is performed on the high-scoring outliers. This analysis requires manually judging the quality and diversity of each outlier from the dialogue corpus itself, mainly whether the current dialogue corpus is consistent with the target intent and whether its expression is clear. Each outlier is finally assigned to one of four classes: ① erroneous dialogue data, ② diverse dialogue data, ③ correct data with ordinary diversity, and ④ data that cannot be semantically judged for the time being.

In order to improve the efficiency, the classification and screening of semantic analysis can be automatically carried out. The comprehensive judgment can be carried out through automatic network search indexes and results returned by the question-answering model. The automatic judgment process is as follows:

4.1 Calculating the occurrence frequency of the dialogue corpus

For the question Sentence_i of a dialogue corpus, the word string of the sentence is used as the search query for an Internet search index (Baidu, Google, etc.), which returns the number of related results found, count_i. This value represents how frequently the dialogue corpus occurs, giving the tuple (Sentence_i, count_i). The method is not limited to search engine data: an index database can also be constructed in-house, returning a similar measure through well-known techniques such as inverted indexes.

4.2 Verifying the question-answering effect of the dialogue corpus

First, a well-known question-answering model such as DSSM (Deep Structured Semantic Models) is trained on the corpus Corpus_init to obtain an automatic question-answering model, QA-Model, which can answer input questions. Then the string of the question Sentence_i of a dialogue corpus is used as input and answered automatically by QA-Model. There are two possible results: a string answer is returned if the question can be answered with the model's current capability, and a null result is returned if the model currently cannot answer it. This gives the tuple (Sentence_i, Answer_i-QA-Model).

4.3 Classification

Combining the tuples from 4.1 and 4.2, a triple (Sentence_i, count_i, Answer_i-QA-Model) is obtained for each dialogue corpus. Sentence_i is classified according to count_i and Answer_i-QA-Model as follows:

when count_i is greater than K and Answer_i-QA-Model is null, it is classified as ① erroneous dialogue data;

when count_i is not greater than K and Answer_i-QA-Model is not null, it is classified as ② diverse dialogue data;

when count_i is greater than K and Answer_i-QA-Model is not null, it is classified as ③ correct data with ordinary diversity;

when count_i is not greater than K and Answer_i-QA-Model is null, it is classified as ④ data that cannot be semantically judged for the time being.

The above threshold K can be empirically specified, and the value selected in the present invention is 100000.
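The four-way rule of 4.3 reduces to two boolean tests and can be sketched as below. The enum and function names are hypothetical, and treating both `None` and the empty string as a null answer is an assumption about how QA-Model signals that it cannot answer.

```python
from enum import Enum

K = 100000  # threshold value stated in the text


class Category(Enum):
    ERROR = 1      # ① erroneous dialogue data
    DIVERSE = 2    # ② diverse dialogue data
    ORDINARY = 3   # ③ correct data with ordinary diversity
    UNDECIDED = 4  # ④ cannot be semantically judged for the time being


def classify(count, answer, k=K):
    """Classify one outlier from its search count and its QA-Model answer."""
    frequent = count > k
    answered = bool(answer)  # assumption: None or "" means a null answer
    if frequent and not answered:
        return Category.ERROR
    if not frequent and answered:
        return Category.DIVERSE
    if frequent and answered:
        return Category.ORDINARY
    return Category.UNDECIDED


print(classify(50, "our hotel has full wireless coverage"))
```

Here a rarely seen question that the model can still answer lands in the diverse class, which is the class fed back into step 1.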

4.4 According to the classification, the following processing is carried out:

if the data is ① erroneous dialogue data, it is discarded directly. For example, given that Intent is "query network", a dialogue corpus containing the question "send me a network line" does not conform to the intent and needs to be discarded;

if the data is ② dialogue data with good diversity, it is used as input again and step 1) is entered for another iteration. For example, given that Intent is "query network", a dialogue corpus containing "how do i can connect wifi to hotel" has good diversity and conforms to the intent; it is not only put into the improved corpus Corpus_improved but is also expanded further by iterating step 1 again;

if the data is ③ correct data with ordinary diversity, it is put directly into the improved corpus Corpus_improved. For example, given that Intent is "query network", a dialogue corpus containing "hotel wifi is a full coverage bar" has ordinary diversity.

Step 5: the method is an iterative updating process, and different stop conditions can be set as required, for example: a fixed number of iterations is reached, the input produced by step 4 is empty, or the number of dialogues in the improved corpus Corpus_improved reaches a preset number.

The final improved corpus Corpus_improved obtained by the method is a human-computer dialogue corpus of reliable quality and rich diversity.
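The overall iteration of steps 1 to 5 can be sketched as a loop over caller-supplied functions. Everything here is an illustrative assumption rather than the patent's implementation: the names `improve_corpus`, `expand`, `score_all` and `classify`, the string labels, and the choice to keep every candidate when fewer than two scores are available.

```python
def pick_threshold(scores):
    """Midpoint of the largest gap between adjacent sorted scores."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        return float("inf")  # assumption: too few candidates, keep them all
    q = max(range(1, len(ordered)), key=lambda k: ordered[k] - ordered[k - 1])
    return (ordered[q - 1] + ordered[q]) / 2


def improve_corpus(seeds, expand, score_all, classify,
                   max_iterations=3, target_size=None):
    """Iterate expansion, scoring and classification until a stop condition."""
    improved, current = [], list(seeds)
    for _ in range(max_iterations):            # stop: fixed iteration count
        if not current:                        # stop: nothing left to expand
            break
        candidates = [c for seed in current for c in expand(seed)]  # step 1
        scores = score_all(candidates)                              # step 2
        threshold = pick_threshold(scores)                          # step 3
        current = []
        for cand, score in zip(candidates, scores):
            if score < threshold:
                improved.append(cand)          # common, valid expression
                continue
            label = classify(cand)             # step 4
            if label == "diverse":
                improved.append(cand)          # kept and also re-expanded
                current.append(cand)
            elif label == "ordinary":
                improved.append(cand)
            # "error" and "undecided" items are discarded
        if target_size is not None and len(improved) >= target_size:
            break                              # stop: corpus is large enough
    return improved


result = improve_corpus(
    ["s"],
    expand=lambda s: [s + "1", s + "2", s + "x"],
    score_all=lambda cands: [5.0 if c.endswith("x") else 1.0 for c in cands],
    classify=lambda c: "diverse",
    max_iterations=2,
)
print(result)
```

The sketch does not remove duplicates, so a real pipeline would likely deduplicate the improved corpus before training on it.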

The dictionary for the maximum matching method mentioned above and the training corpus for the supervised conditional random field model both come from 100,000 manually labeled user reviews.

Test results on several groups of dialogue corpora show that the method for improving the quality and diversity of the human-computer dialogue corpus reduces erroneous dialogue corpora by about 20%, improving the quality of the corpus, and expands the number of corpora by about 60%, increasing the diversity of the corpus. On the improved corpus, the precision of the human-computer dialogue model generally improves by 3 to 7 percentage points.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A dialogue corpus generation method for improving quality and diversity of a man-machine interaction dialogue corpus comprises the following steps:

1) carrying out synonymy sentence expansion on the selected dialogue corpus to form a candidate set;

2) carrying out anomaly detection on each dialogue corpus in the candidate set to obtain an abnormal value of each dialogue corpus;

3) saving the dialogue corpora whose abnormal value is lower than the set scoring threshold into the improved dialogue corpus;

4) performing semantic analysis on the dialogue corpora whose abnormal value is higher than or equal to the scoring threshold: if the dialogue data is erroneous, discarding it directly; if the dialogue data is diverse, executing step 5); otherwise, storing the current dialogue corpus in the improved dialogue corpus;

5) taking the dialogue data judged to be diverse as input again and executing steps 1) to 4) until a stop condition is reached, at which point the iteration stops.

2. The method of claim 1, wherein the candidate set is generated by:

11) selecting two adjacent words from the dialogue corpus each time and exchanging them to obtain a plurality of expanded sentences;

12) deleting the stop words in each expanded sentence by using the stop word dictionary;

13) judging whether each participle in each sentence has a synonym or not by using the synonym dictionary, if so, replacing the corresponding participle by using the synonym in the synonym dictionary, and expanding each sentence into a plurality of sentences;

14) for each expanded sentence, firstly translating the sentence into a first intermediate language, then translating the sentence from the first intermediate language into a second intermediate language, and then translating the sentence from the second intermediate language back to the original language or translating the sentence into the original language after a plurality of times of language conversion; and then comparing whether the returned result after the translation conversion for multiple times is consistent with the original sentence or not, if not, storing the returned result and the original sentence into a candidate set, otherwise, storing the original sentence into the candidate set.

3. The method according to claim 2, wherein the word order exchange of step 11), the stop word deletion of step 12), the synonym replacement of step 13) and the cross-language translation of step 14) each have a corresponding skip probability, which sets the probability of skipping the processing of that step.

4. The method according to claim 1, wherein in step 2), the segmented sentence of each dialogue corpus in the candidate set is first vectorized to obtain a vector d_text of a set length; the vectors corresponding to all dialogue corpora in the candidate set are then averaged to obtain an average vector d_mean; and the distance between each vector d_text and the average vector d_mean is then calculated and used as the difference value of the corresponding dialogue corpus.

5. The method of claim 4, wherein the distance is

[Formula for the distance: given as an image in the original publication and not reproduced in this text.]

where threshold is a defined threshold, N is the dimension of the vector, x_i is the i-th component of the vector d_text, and d_i is the i-th component of the average vector d_mean.

6. The method of claim 1, wherein in step 3), the scoring threshold is determined by the adjacent maximum difference method, which is as follows:

31) sorting the dialogue corpora by abnormal value and recording the ranking as (Sentence₁, Score₁), (Sentence₂, Score₂), …, (Sentence_n, Score_n), where Sentence_n is the sentence corresponding to the n-th dialogue corpus and Score_n is the abnormal value of the n-th dialogue corpus;

32) computing the difference between adjacent entries in the ranking, Delta_k = Score_k - Score_{k-1}, k ∈ [2, n];

33) taking the maximum of the differences obtained in step 32), denoted Delta_q, and taking the average of the two adjacent abnormal values Score_q and Score_{q-1} corresponding to Delta_q as the scoring threshold Score_threshold.

7. The method of claim 1, wherein the semantic analysis of the dialogue corpus is performed by:

41) using the question Sentence_i of the dialogue corpus as a search query and counting the occurrence frequency count_i of Sentence_i;

42) inputting Sentence_i into an automatic question-answering model to obtain the returned result Answer_i-QA-Model;

43) classifying Sentence_i according to count_i and Answer_i-QA-Model: when count_i is greater than a set threshold K and Answer_i-QA-Model is null, it is classified as erroneous dialogue data; when count_i is not greater than K and Answer_i-QA-Model is not null, it is classified as diverse dialogue data; when count_i is greater than K and Answer_i-QA-Model is not null, it is classified as correct data with ordinary diversity; when count_i is not greater than K and Answer_i-QA-Model is null, it is classified as data that cannot be semantically judged for the time being.

8. The method of claim 1, wherein the selected dialogue corpus comprises question sentences, answers and labeled question-answer intents; and in step 1), synonymous-sentence expansion is performed on the question sentences in the selected dialogue corpus.

9. A method for training an artificial intelligence model, wherein the artificial intelligence model is trained using the corpora in the dialogue corpus obtained by the method of claim 1.

10. A method for human-computer interaction, characterized in that the artificial intelligence model trained by the method of claim 9 is used for human-computer interaction.