CN111026884A - Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus - Google Patents
- Publication number
- CN111026884A (CN201911271656.6A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- dialogue
- sentence
- data
- dialogue corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a dialogue corpus generation method for improving the quality and diversity of a human-computer interaction dialogue corpus. The method comprises the following steps: 1) performing synonymous sentence expansion on the selected dialogue corpora to form a candidate set; 2) performing anomaly detection on each dialogue corpus in the candidate set to obtain an anomaly value for each dialogue corpus; 3) saving the dialogue corpora whose anomaly value is lower than the set scoring threshold into the improved dialogue corpus; 4) performing semantic analysis on the dialogue corpora whose anomaly value is higher than or equal to the scoring threshold: if the dialogue data is erroneous, discarding it directly; if the dialogue data is diverse, executing step 5); otherwise, saving the current dialogue corpus into the improved dialogue corpus; 5) taking the dialogue data judged to be diverse as input again and executing steps 1) to 4) until a stop condition is reached, then stopping the iteration. The invention achieves quality control and diversity expansion of the original dialogue corpus.
Description
Technical Field
The invention belongs to the technical field of information technology and data mining, and relates to a dialogue corpus generation method for improving the quality and diversity of a man-machine interaction dialogue corpus.
Background
With the continuous development of science and technology, artificial intelligence models are increasingly applied in intelligent systems, which raises a variety of human-computer interaction requirements. How to carry out human-computer interaction more effectively is a problem that urgently needs to be solved. Currently, most human-computer interaction models are data-driven: they are trained on a corpus to obtain well-performing parameters and are then deployed in systems. A high-quality corpus therefore plays an increasingly important role.
Humans use a rich variety of expressions in human-computer interaction and place high demands on the accuracy of semantic understanding. To train an accurate and robust model, a high-quality dialogue corpus is required, and the corpus should be as rich as possible and cover diverse expressions.
Chinese patent ZL201510251428.8 discloses a corpus screening method and apparatus, wherein the corpus screening method includes: performing cross check based on the first corpus set to obtain a first check result; judging whether the first check result meets a first preset condition or not; when the first check result meets the first preset condition, performing public check based on the first corpus set to obtain a second check result; judging whether the first corpus set needs to be screened or not according to the second check result; and when the first corpus set is judged to be required to be screened, performing first screening processing on the first corpus set. The method solves the problem of low quality of training samples caused by influence of subjective preference when the corpora are screened in the related technology, and further achieves the effect of improving the quality of the training samples.
Chinese patent ZL201310344326.1 provides a corpus expansion device, which includes: the screening unit screens out an initial corpus sample according to preset corpus screening conditions; and the expansion unit is used for identifying the collected corpus according to the initial corpus sample and the expansion strategy to obtain an expanded corpus sample, and performing corpus expansion again based on the expanded corpus sample and the expansion strategy. The method carries out machine labeling on the large-scale training corpus in an automatic mode, thereby greatly saving the time period and the cost for manufacturing the large-scale training corpus and improving the labeling accuracy.
Currently, most corpus processing methods simply clean the data, removing, according to various criteria, abnormal data that is inconsistent with expectations or with the overall distribution. The invention focuses on the abnormal data in a human-computer dialogue corpus and divides it into erroneous data and special-case data. Erroneous data needs to be eliminated, whereas special-case data is an unusual expression that is not commonly used but enhances the diversity of expressions in the corpus, so it needs to be retained and further expanded. In this way the quality of the human-computer interaction dialogue corpus is improved, and the corpus in turn improves the accuracy of subsequent model training.
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a method for improving the quality and diversity of a human-computer interaction dialogue corpus. Based on statistical or machine learning methods applied to the corpus, the invention performs quality control and diversity expansion on the original dialogue corpus.
The technical scheme adopted by the invention is as follows:
A method for improving the quality and diversity of a human-computer interaction dialogue corpus comprises the following steps:
1) performing synonymous sentence expansion on the input selected dialogue corpora to form a candidate set;
2) performing anomaly detection on the dialogue corpora in the candidate set and outputting an anomaly value for each dialogue corpus as its score;
3) sorting by score, determining a threshold by the adjacent maximum difference method, and saving the dialogue corpora whose score is lower than the threshold into the improved dialogue corpus;
4) performing further semantic analysis on the outliers whose anomaly value is higher than the threshold:
4.1) if the dialogue data is erroneous, discarding it directly;
4.2) if the dialogue data has good diversity, using it as input again and returning to step 1) for another iteration;
4.3) otherwise, if the dialogue data is correct but of ordinary diversity, saving it into the improved dialogue corpus;
5) stopping the iteration when a stop condition is reached.
Further, the synonymous sentence expansion of step 1) can be done manually: several annotators expand the input dialogue corpora, which makes good use of human knowledge. To save labor and improve efficiency, automatic synonymous sentence expansion can also be used. Specifically, the expansion can be completed by randomly applying operations such as word order exchange, stop word deletion, synonym replacement and cross-language translation to the dialogue corpora.
Further, in step 2), the questions of all segmented dialogue corpora are vectorized, i.e. each is represented by a fixed-length vector $d_{text}$. The mean of the vectors of all dialogue corpora is then computed to obtain the average vector $d_{mean}$. Using the distance $Dis$ formula provided by the invention, the distance between each dialogue corpus and the average vector is calculated and used as its difference score. The higher the score, the greater the difference. A large difference may mean that the dialogue corpus is an incorrect sentence, which should be discarded to improve the quality of the corpus; it may also mean that the dialogue corpus is correct but an unusual expression, which helps increase the diversity of the corpus and should be retained and further expanded.
Further, in step 3), dialogue corpora with low scores can be regarded as common, valid expressions and are saved into the improved corpus.
Further, in step 4), the high-scoring dialogue corpora remaining after step 3) are judged and handled according to their category. The judgment can be made manually: several annotators determine the category of each input dialogue corpus based on experience. This makes good use of human knowledge and allows flexible handling. To save labor and improve efficiency, an automatic judgment method can also be used, in which a comprehensive judgment is made from automated Internet search index results and the results returned by a question-answering model.
Further, in step 4.1), if the dialogue corpus is wrong and unrelated to the intended target of the dialogue, or its category cannot be determined for the time being, it is discarded directly, which improves the quality of the corpus.
Further, in step 4.2), if the dialogue corpus is correct and is a less common expression that helps increase the diversity of the corpus, it is retained and used as a seed dialogue corpus, i.e. as the input of the next iteration, and step 1) is repeated.
Further, in step 4.3), a dialogue corpus that is correct and valid but expressed in an ordinary way is saved directly into the improved corpus.
Further, in step 5), the method is an iterative updating process, and different stop conditions can be set as required, for example: a fixed number of iterations is reached, the input from step 4.2) is empty, or the number of dialogues in the corpus has grown to a preset number.
The innovation of the method is that it focuses on the abnormal data of the dialogue corpus; through iterative processing, it both improves the quality of the corpus and increases its diversity. The synonymous sentence expansion method of step 1), the abnormal data detection method of step 2) and the category-specific processing method of step 4) are all novel, feasible and effective.
The invention also provides an artificial intelligence model training method, characterized in that the artificial intelligence model is trained on the corpora in the dialogue corpus obtained by the above method.
The invention also provides a human-computer interaction method, characterized in that human-computer interaction is carried out using the artificial intelligence model obtained by this training.
Compared with the prior art, the invention has the following positive effects:
The invention improves the quality and expands the diversity of the human-computer interaction dialogue corpus, reduces the manual effort required, and the improved corpus increases the accuracy and robustness of the algorithm model.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a flow diagram of an automatic synonym expansion;
FIG. 3 is an exemplary diagram of a synonym expansion;
FIG. 4 is a flow chart of abnormal dialog corpus detection.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Taking the improvement of a human-computer interaction corpus in the hotel domain as an example, the following describes in detail how the proposed steps improve the quality and expand the diversity of the original corpus. The invention first performs basic word segmentation on the initial corpus. The initial corpus $Corpus_{init}$ consists of $n$ dialogue corpora (question-answer pairs) and can be expressed as $\{QApair_1, QApair_2, \ldots, QApair_n\}$. Each question-answer pair has a question, an answer and a labeled intent; the $i$-th pair can be expressed as $(Sentence_i, Answer_i, Intent_i)$, for example ("asking hotel rooms to cover wifi", "our hotel has full wireless coverage; connect directly without a password", "inquire about network"). In general, the Answer is essentially fixed for a given intent. The invention therefore currently focuses on the quality and diversity of the expanded question Sentence under the same Intent.
Chinese word segmentation is a basic step of Chinese natural language processing. The invention combines dictionary-based and statistical segmentation: dictionary-based maximum matching is applied first, and a sequence labeling method (conditional random field) is applied to the ambiguous parts of the segmentation.
Thus, the question $Sentence_i$ spoken by the user consists of several segmented words and can be expressed as $Sentence_i = \{w_i^1, w_i^2, \ldots, w_i^k, \ldots, w_i^{max}\}$, where $i$ indicates the question of the $i$-th dialogue corpus, $k$ is the word index, and $max$ is the maximum number of words allowed in a sentence. The invention takes $max = 100$; if a sentence is longer, the words beyond this length are truncated.
FIG. 1 is a flow chart of the steps of the method. The specific implementation steps are as follows:
Step 1: Synonymous sentence expansion. The dialogue corpora can be expanded manually, which makes good use of human knowledge; mature approaches such as crowdsourcing also exist for expanding a corpus. To save labor and improve efficiency, automatic synonymous sentence expansion can be used instead. Specifically, operations such as word order exchange, stop word deletion, synonym replacement and cross-language translation are applied at random to the dialogue corpora to complete the expansion. The expansion flow is shown in FIG. 2, and a simple example is shown in FIG. 3.
The invention defines four basic operations:
1. Word order exchange: for the question of an input dialogue corpus, select any word $w_i^k$ and exchange it with the next word $w_i^{k+1}$. In real human-computer dialogue, users often change the word order of an expression; for example, "asking for hotel room coverage wifi" can also be expressed with two adjacent words exchanged. Word order exchange therefore enhances the diversity of expressions and covers the expression habits of some users. This operation is optional and is skipped with probability $p_1 \in [0,1]$; the invention takes $p_1 = 0.8$.
2. Stop word deletion: for the question of an input dialogue corpus, a common stop word dictionary, compiled manually, is used to check whether the sentence contains stop words; any stop words found are deleted. For example, the word "ask" in "ask hotel room coverage wifi" can be deleted without affecting the meaning of the sentence. This operation is optional and is skipped with probability $p_2 \in [0,1]$; the invention takes $p_2 = 0.4$.
3. Synonym replacement: for the question of an input dialogue corpus, a synonym dictionary (obtained by manual correction of the public Harvard big word forest) is used to check whether the question contains words that have synonyms; if so, they are replaced with synonyms of the same meaning from the dictionary. For example, in "ask for hotel room wifi to cover how", the word "hotel" has a synonym in the dictionary (another word for "hotel"), so the sentence can be rewritten with that synonym in its place. Synonym replacement preserves semantic consistency and can introduce unseen words (words that do not appear in the original corpus but are in the synonym dictionary), which enhances the diversity of expressions. This operation is optional and is skipped with probability $p_3 \in [0,1]$; the invention takes $p_3 = 0.1$.
4. Cross-language translation: this operation uses current machine translation technology to expand synonymous sentences by exploiting the changes of expression that arise when translating between languages. Using existing machine translation services (such as Google Translate or Baidu Translate), the question of the dialogue corpus is translated into intermediate language I, then from intermediate language I into intermediate language II, and finally from intermediate language II back into Chinese. The returned result is compared with the original input and retained if it differs. For example, "how do wifi coverage of hotel room to ask" may be translated into English as "do the hotel room has wifi coverage", then from English into French as "La chambre d'hôtel est-elle couverte par un réseau sans fil", and finally from French back into Chinese as "whether the hotel room is covered by wireless network". The final returned expression preserves semantic consistency while being richer in expression. This operation is optional and is skipped with probability $p_4 \in [0,1]$; the invention takes $p_4 = 0.3$.
Each of the four operations is thus skipped or applied at random: operations 1 to 4 are applied to the candidate dialogue corpus with probabilities $1-p_1$, $1-p_2$, $1-p_3$ and $1-p_4$ respectively. FIG. 3 shows an example: after the first two operations there are already 4 possible variants, and after the remaining operations there are 16.
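The random application of the four operations can be sketched as below. This is an illustrative sketch under several assumptions: sentences are already segmented into word lists, the stop word set and synonym map are toy stand-ins for the manually compiled dictionaries, and `translate` is a hypothetical caller-supplied function wrapping a round-trip machine translation service (no real translation API is called here). Only the skip probabilities come from the text.

```python
import random

# Skip probabilities p1..p4 as given in the text.
P_SWAP, P_STOP, P_SYN, P_TRANS = 0.8, 0.4, 0.1, 0.3


def swap_adjacent(words, rng):
    """Exchange one randomly chosen word with the word that follows it."""
    if len(words) < 2:
        return list(words)
    k = rng.randrange(len(words) - 1)
    out = list(words)
    out[k], out[k + 1] = out[k + 1], out[k]
    return out


def delete_stop_words(words, stop_words):
    """Remove every word found in the stop word set."""
    return [w for w in words if w not in stop_words]


def replace_synonyms(words, synonyms):
    """Replace each word that has an entry in the synonym map."""
    return [synonyms.get(w, w) for w in words]


def expand(words, stop_words, synonyms, translate=None, rng=random):
    """Apply the four operations in order; operation j is skipped with probability p_j."""
    out = list(words)
    if rng.random() >= P_SWAP:
        out = swap_adjacent(out, rng)
    if rng.random() >= P_STOP:
        out = delete_stop_words(out, stop_words)
    if rng.random() >= P_SYN:
        out = replace_synonyms(out, synonyms)
    if translate is not None and rng.random() >= P_TRANS:
        out = translate(out)  # hypothetical round-trip translation hook
    return out


def candidate_set(words, n_samples, **kwargs):
    """Sample repeatedly and keep the distinct variants that differ from the input."""
    variants = {tuple(expand(words, **kwargs)) for _ in range(n_samples)}
    variants.discard(tuple(words))
    return [list(v) for v in variants]
```

A call such as `candidate_set(["ask", "hotel", "room", "wifi"], 50, stop_words={"ask"}, synonyms={"hotel": "inn"})` returns the distinct variants found in 50 random draws; passing `rng=random.Random(0)` makes a run reproducible.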
Step 2: the present invention can make vectorization treatment of all dialog linguistic data after word separation, i.e. use a fixed vector dtextAnd (4) performing representation. On the basis, the distance is calculated. The specific steps are shown in fig. 4.
The system maps each word to a low-dimensional continuous vector. The text depth representation model (e.g., Word2Vec) can be used to represent the question sentences of the dialogue corpus in the text segment to obtain Word vectors. word2vec is a tool that converts words into vector form. For sentencesEach word in (1)Can be mapped to a vector, where the vector dimension is taken to be 200, e.g.Then, the addition operation is carried out according to the word vectors to obtain the representation of the semantic vectors of the dialogue linguistic dataSuch as
Further, averaging the vectors of n dialog corpora to obtain an average vectorSuch asAnd calculating the distance between each dialogue corpus and the average vector, wherein the distance is used as the difference score value of the dialogue corpuses. The invention designs a distance calculation formula in particular and amplifies the difference and the distinguishability between the dialogue linguistic data as much as possible. For the input vectorAnd the average vectorThe distance Dis of (d) is calculated as follows:
the threshold in the formula is a defined threshold, and it is sufficient to ensure that the difference is small. The value in the present invention is 0.01.
The higher the score of Dis indicates the greater the difference, and the "greater difference" may indicate that the corpus is an incorrect corpus, which should be discarded to improve the corpus quality. It is also possible to say that the dialog corpus is correct, and is only an unusual expression, which is beneficial to increase the diversity of the corpus and needs to be preserved.
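The scoring pipeline of step 2 can be sketched as follows. Two things here are assumptions and not taken from the patent: the plain Euclidean distance is only a stand-in for the $Dis$ formula, which is not reproduced in this text, and the two-dimensional `toy_vectors` replace the 200-dimensional Word2Vec vectors. Function names are illustrative.

```python
import math


def sentence_vector(words, word_vectors, dim):
    """Sum the vectors of the words in a sentence; unknown words contribute nothing."""
    total = [0.0] * dim
    for w in words:
        for j, x in enumerate(word_vectors.get(w, ())):
            total[j] += x
    return total


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Stand-in distance; the patent's own Dis formula is not available here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def difference_scores(sentences, word_vectors, dim):
    """Score each sentence by its distance from the average sentence vector."""
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    d_mean = mean_vector(vectors)
    return [euclidean(v, d_mean) for v in vectors]


# Hypothetical toy vectors (dimension 2 instead of 200).
toy_vectors = {"hotel": [1.0, 0.0], "wifi": [0.0, 1.0], "cable": [5.0, 5.0]}
toy_sentences = [["hotel", "wifi"], ["wifi", "hotel"], ["cable"]]
scores = difference_scores(toy_sentences, toy_vectors, dim=2)
```

In this toy run the first two sentences receive the same score, because summing word vectors ignores word order, and the third sentence scores higher because it lies further from the mean.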
Step 3: The dialogue corpora are sorted by score, and those with low scores are saved into the improved dialogue corpus $Corpus_{improved}$. In step 2, each of the $n$ dialogue corpora was given a score. The dialogue corpora are now sorted by score in increasing order, from low to high. Different threshold selection methods may be used as required; in the invention, the threshold $Score_{threshold}$ is determined by the adjacent maximum difference method, and the first $r$ dialogue corpora, whose scores are lower than the threshold, are selected. These have low scores and small differences, so they can be regarded as valid and accurate dialogue corpora and put into $Corpus_{improved}$. The remaining $n-r$ dialogue corpora are analyzed and processed in the next step. The adjacent maximum difference method proposed here is calculated as follows:
3.1 Sorting: the $n$ dialogue corpora are sorted by distance score from low to high: $(Sentence_1, Score_1), (Sentence_2, Score_2), \ldots, (Sentence_n, Score_n)$.
3.2 Calculating adjacent differences: the difference $\Delta_k$ between adjacent entries is calculated as $\Delta_k = Score_k - Score_{k-1}$, for $k = 2, \ldots, n$.
3.3 Taking the maximum difference and determining the score threshold $Score_{threshold}$: the maximum of the differences in 3.2 is denoted $\Delta_q$, meaning that the difference between $Sentence_{q-1}$ and $Sentence_q$ is the largest. The average of their two scores is taken as the scoring threshold: $Score_{threshold} = \frac{Score_{q-1} + Score_q}{2}$.
3.4 According to the threshold $Score_{threshold}$ from 3.3, the first $r$ dialogue corpora, whose scores are lower than the threshold, are selected. These have low scores and small differences, so they can be regarded as valid and accurate dialogue corpora and put into $Corpus_{improved}$. The remaining $n-r$ dialogue corpora are analyzed and processed in the next step.
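Sub-steps 3.1 to 3.4 can be sketched as below. This is a minimal illustration that assumes ascending sort order and breaks ties between equally large gaps by taking the first one (the text does not specify tie handling); the function names are illustrative.

```python
def adjacent_max_difference_threshold(scores):
    """Return the score threshold chosen by the adjacent maximum difference method."""
    if len(scores) < 2:
        raise ValueError("at least two scores are needed")
    ordered = sorted(scores)  # 3.1: increasing order, low to high
    # 3.2: deltas[j] is the gap between the neighbours ordered[j] and ordered[j + 1].
    deltas = [high - low for low, high in zip(ordered, ordered[1:])]
    # 3.3: locate the largest gap and average the two scores on either side of it.
    q = max(range(len(deltas)), key=lambda j: deltas[j])
    return (ordered[q] + ordered[q + 1]) / 2


def split_by_threshold(scored_sentences):
    """3.4: split (sentence, score) pairs into (kept, outliers) around the threshold."""
    threshold = adjacent_max_difference_threshold([s for _, s in scored_sentences])
    kept = [pair for pair in scored_sentences if pair[1] < threshold]
    outliers = [pair for pair in scored_sentences if pair[1] >= threshold]
    return kept, outliers


scored = [("a", 0.2), ("b", 0.1), ("c", 0.9), ("d", 0.25), ("e", 1.0)]
kept, outliers = split_by_threshold(scored)
```

With these toy scores the largest gap lies between 0.25 and 0.9, so the threshold is about 0.575; `kept` holds the three low-scoring pairs and `outliers` the two that go on to step 4.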
Step 4: Further semantic analysis is performed on the high-scoring outliers. The analysis can be done manually: the quality and diversity of each dialogue corpus are judged, mainly by checking whether it is consistent with the target intent and whether its expression is clear. Each outlier is finally assigned to one of four classes: ① erroneous dialogue data, ② diverse dialogue data, ③ correct data of ordinary diversity, and ④ data whose semantics cannot be judged for the time being.
To improve efficiency, the classification and screening of the semantic analysis can also be carried out automatically. A comprehensive judgment is made from automated Internet search index results and the results returned by a question-answering model. The automatic judgment process is as follows:
4.1 Calculating the frequency of occurrence of the dialogue corpus
For the question $Sentence_i$ of a dialogue corpus, the word string of the sentence is used as the query for an Internet search index (Baidu, Google, etc.), and the number of related results found, $count_i$, is returned. This value represents how frequently the dialogue corpus occurs, and yields the tuple $(Sentence_i, count_i)$. The method is not limited to search engine data: an index database can also be built in-house, with similar measures returned through well-known techniques such as inverted indexes.
4.2 Verifying the question-answering effect of the dialogue corpus
First, a well-known question-answering model, such as DSSM (Deep Structured Semantic Model), is trained on the corpus $Corpus_{init}$ to obtain an automatic question-answering model, QA-Model, that can answer input questions. Then the string of the question $Sentence_i$ of a dialogue corpus is used as input and answered automatically by QA-Model. There are two possible results: a string answer, if the question can be answered with the model's current capability, or a null result, if the model currently cannot answer. This yields the tuple $(Sentence_i, Answer_{i\text{-}QA\text{-}Model})$.
4.3 Classification
From the tuples obtained in 4.1 and 4.2, a triple $(Sentence_i, count_i, Answer_{i\text{-}QA\text{-}Model})$ is obtained for each dialogue corpus. According to $count_i$ and $Answer_{i\text{-}QA\text{-}Model}$, $Sentence_i$ is classified as follows:
when $count_i$ is greater than $K$ and $Answer_{i\text{-}QA\text{-}Model}$ is null, it is classified as ① erroneous dialogue data;
when $count_i$ is not greater than $K$ and $Answer_{i\text{-}QA\text{-}Model}$ is not null, it is classified as ② diverse dialogue data;
when $count_i$ is greater than $K$ and $Answer_{i\text{-}QA\text{-}Model}$ is not null, it is classified as ③ correct data of ordinary diversity;
when $count_i$ is not greater than $K$ and $Answer_{i\text{-}QA\text{-}Model}$ is null, it is classified as ④ data whose semantics cannot be judged for the time being.
The threshold $K$ can be specified empirically; the value selected in the invention is 100000.
4.4 According to the classification, the following processing is performed:
if the dialogue data is ① erroneous dialogue data (or ④ data whose category cannot be judged for the time being), it is discarded directly. For example, when the Intent is "inquire about network", a dialogue corpus containing the question "send me a network cable" does not match the intent and needs to be discarded;
if the dialogue data is ② dialogue data with good diversity, it is used as input again and step 1) is entered for another iteration. For example, when the Intent is "inquire about network", a dialogue corpus containing the question "how can I connect to wifi at the hotel" has good diversity and matches the intent; it is not only put into the improved corpus $Corpus_{improved}$ but also expanded further by iterating step 1 again;
if the dialogue data is ③ correct data of ordinary diversity, it is put directly into the improved corpus $Corpus_{improved}$. For example, when the Intent is "inquire about network", a dialogue corpus containing the question "the hotel wifi has full coverage, right" has ordinary diversity and is put directly into $Corpus_{improved}$;
and 5: the method is an iterative updating process, and different stopping conditions can be set according to requirements. For example, a fixed number of iterations is reached, the input is null in step 4, and Corpus is promotedimprovedThe number of the sessions in (1) may satisfy a preset number.
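The iterative update with its three example stopping conditions can be sketched as follows. This is a minimal sketch under stated assumptions: `run_round` is a hypothetical callable standing in for one pass of steps 1 to 4, assumed to return the sentences to store in the improved corpus together with the sentences judged diverse (which, per the text, are stored as well as fed back). The names `build_corpus`, `max_iterations`, and `target_size` are illustrative and do not come from the patent.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for one pass of steps 1-4: takes the input sentences,
# returns (sentences to store, sentences judged diverse and fed back as input).
Round = Callable[[List[str]], Tuple[List[str], List[str]]]


def build_corpus(seed: List[str], run_round: Round,
                 max_iterations: int = 10, target_size: int = 1000) -> List[str]:
    corpus_improved: List[str] = []
    current = list(seed)
    for _ in range(max_iterations):              # stop: fixed iteration count
        if not current:                          # stop: no input left from step 4
            break
        stored, diverse = run_round(current)
        for sentence in stored:
            if sentence not in corpus_improved:  # avoid storing duplicates
                corpus_improved.append(sentence)
        if len(corpus_improved) >= target_size:  # stop: preset corpus size reached
            break
        current = diverse                        # diverse data becomes the next input
    return corpus_improved
```

Whichever condition is met first ends the loop, so the three conditions can be combined freely or disabled by choosing large limits.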
The final improved corpus Corpus_improved is the human-computer dialogue corpus of reliable quality and rich diversity obtained by this method.
The dictionary for the maximum matching method mentioned above and the training corpus for the supervised conditional random field model both come from 100,000 user reviews manually labeled for the present invention.
Tests on several groups of dialogue corpora show that the method removes about 20% of erroneous dialogue corpora, improving corpus quality, and expands the number of corpus entries by about 60%, increasing corpus diversity. On the improved corpus, the precision of the human-computer dialogue model generally improves by 3 to 7 percentage points.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.
Claims (10)
1. A dialogue corpus generation method for improving the quality and diversity of a human-computer interaction dialogue corpus, comprising the following steps:
1) performing synonymous sentence expansion on the selected dialogue corpus to form a candidate set;
2) performing anomaly detection on each dialogue corpus in the candidate set to obtain an abnormal value for each dialogue corpus;
3) saving the dialogue corpora whose abnormal value is lower than a set scoring threshold into the improved dialogue corpus;
4) performing semantic analysis on the dialogue corpora whose abnormal value is higher than or equal to the scoring threshold: if they are erroneous dialogue data, discarding them directly; if they are diverse dialogue data, executing step 5); otherwise, saving the current dialogue corpus into the improved dialogue corpus;
5) taking the dialogue data judged to be diverse as input again and executing steps 1) to 4) until a stopping condition is reached, then stopping the iteration.
2. The method of claim 1, wherein the candidate set is generated by:
11) selecting two adjacent words from the dialogue corpus each time and exchanging them to obtain a plurality of expanded sentences;
12) deleting the stop words in each expanded sentence by using a stop word dictionary;
13) using a synonym dictionary to judge whether each segmented word in each sentence has a synonym and, if so, replacing the corresponding word with its synonym from the synonym dictionary, so that each sentence is expanded into a plurality of sentences;
14) for each expanded sentence, firstly translating the sentence into a first intermediate language, then translating the sentence from the first intermediate language into a second intermediate language, and then translating the sentence from the second intermediate language back to the original language or translating the sentence into the original language after a plurality of times of language conversion; and then comparing whether the returned result after the translation conversion for multiple times is consistent with the original sentence or not, if not, storing the returned result and the original sentence into a candidate set, otherwise, storing the original sentence into the candidate set.
3. The method according to claim 2, wherein the word order exchange of step 11), the stop word deletion of step 12), the synonym substitution of step 13), and the cross-language translation processing of step 14) each correspond to a skip probability, which sets the probability of skipping the processing of the corresponding step.
4. The method according to claim 1, wherein in step 2), the word-segmented sentence of each dialogue corpus in the candidate set is first vectorized to obtain a vector d_text of a set length; then the mean of the vectors corresponding to all dialogue corpora in the candidate set is calculated to obtain a mean vector d_mean; then the distance between each vector d_text and the mean vector d_mean is calculated and used as the difference value of the corresponding dialogue corpus.
6. The method of claim 1, wherein in step 3), the scoring threshold is determined according to an adjacent maximum difference method; wherein the adjacent maximum difference method is as follows:
31) sorting the dialogue corpora by abnormal value and recording the sorted result as (Sentence_1, Score_1), (Sentence_2, Score_2), ..., (Sentence_n, Score_n), where Sentence_n is the sentence corresponding to the n-th dialogue corpus and Score_n is the abnormal value of the n-th dialogue corpus;
33) taking the maximum difference in the result obtained in step 32) and denoting it Delta_q; taking the two adjacent abnormal values Score_q and Score_q-1 corresponding to Delta_q as the score threshold Score_threshold.
7. The method of claim 1, wherein the semantic analysis of the dialogue corpus is performed by:
41) taking the question Sentence_i in the dialogue corpus as a search term and counting the occurrence frequency count_i of Sentence_i;
42) inputting Sentence_i into an automatic question-answering model to obtain the returned result Answer_i-QA-Model;
43) classifying Sentence_i according to count_i and Answer_i-QA-Model: when count_i is greater than a set threshold K and Answer_i-QA-Model is null, classifying it as erroneous dialogue data; when count_i is not greater than the set threshold K and Answer_i-QA-Model is not null, classifying it as diverse dialogue data; when count_i is greater than K and Answer_i-QA-Model is not null, classifying it as correct data with ordinary diversity; when count_i is not greater than K and Answer_i-QA-Model is null, classifying it as data whose semantics cannot be determined for the time being.
8. The method of claim 1, wherein the selected dialogue corpus comprises question sentences, answers, and labeled question-answer intentions; and in step 1), synonymous sentence expansion is performed on the question sentences in the selected dialogue corpus.
9. A method for training an artificial intelligence model, wherein the artificial intelligence model is trained using the dialogue corpus obtained by the method of claim 1.
10. A method for human-computer interaction, characterized in that the artificial intelligence model trained by the method of claim 9 is used for human-computer interaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911271656.6A CN111026884B (en) | 2019-12-12 | 2019-12-12 | Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111026884A (en) | 2020-04-17 |
CN111026884B (en) | 2023-06-02 |
Family
ID=70208856
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911271656.6A Active CN111026884B (en) | 2019-12-12 | 2019-12-12 | Dialog corpus generation method for improving quality and diversity of man-machine interaction dialog corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111026884B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20210112. Address after: Room b-9014, 8th floor, building 1, 188 Changyi Road, Baoshan District, Shanghai 200441. Applicant after: Shanghai Yishang Network Technology Co.,Ltd. Address before: Room 1506, 15 / F, building 1, yangyangchun investment building, 66 Yangming East Road, Donghu District, Nanchang City, Jiangxi Province, 330000. Applicant before: Nanchang Zhonghui Zhiying Information Technology Co.,Ltd. |
GR01 | Patent grant | ||