CN110222192A - Corpus construction method and apparatus - Google Patents
- Publication number
- CN110222192A CN110222192A CN201910420207.7A CN201910420207A CN110222192A CN 110222192 A CN110222192 A CN 110222192A CN 201910420207 A CN201910420207 A CN 201910420207A CN 110222192 A CN110222192 A CN 110222192A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- corpus
- data
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
This application provides a corpus construction method and apparatus, relating to the field of communications, that can quickly and conveniently build a customer-service corpus. The method comprises: obtaining a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword; and obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword. The method is used to build a customer-service corpus.
Description
Technical field
This application relates to the field of computers, and in particular to a corpus construction method and apparatus.
Background technique
With the development and spread of the Internet, electronic commerce has continued to grow: the number of users shopping online keeps increasing, and e-commerce transaction volume keeps rising. Users' demand for e-commerce services grows accordingly, which places ever greater pressure on customer-service staff. To relieve this pressure, common user questions that arise during e-commerce transactions are handed over to automated customer-service robots. A customer-service robot typically processes a question by detecting the question it receives, looking up the corresponding answer in a customer-service corpus according to the question's keywords, and returning that answer.
At present, customer-service corpora are usually built manually: staff collect common customer-service data in the e-commerce field, analyze the data, identify the common questions in the customer-service domain and the answer to each question, and store the question-answer pairs in a customer-service corpus database. When the customer-service robot needs to answer a question posed by a user, it matches the corresponding answer from the corpus database and returns it to the user.
This manual approach consumes considerable manpower, material resources, and time, and depends heavily on the experience of the staff. Moreover, as the volume of customer-service data to be analyzed grows exponentially, existing schemes for building customer-service corpora can no longer meet demand.
Summary of the invention
Embodiments of this application provide a corpus construction method and apparatus for building a customer-service corpus automatically.
To achieve the above objective, this application adopts the following technical solutions.
In a first aspect, this application provides a corpus construction method. The method comprises: obtaining a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword; and obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
In the corpus construction method provided by the embodiments of this application, a target corpus database and a keyword database are obtained, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword. The similarity between each text in the at least one text and each first keyword in the at least one first keyword is determined, so that texts and first keywords are associated through similarity, and the corresponding text can be retrieved through multiple keywords and a similarity search. A corpus is then obtained according to each text, each first keyword, and the similarity between each text and each first keyword. This ensures that the corpus-building process is largely automated, so a corpus can be built quickly and conveniently.
In a second aspect, this application provides a corpus construction apparatus. The apparatus includes: an obtaining unit, configured to obtain a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; and a processing unit, configured to determine the similarity between each text in the at least one text and each first keyword in the at least one first keyword, and further configured to obtain a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
In a third aspect, this application provides a corpus construction system. The system includes a corpus construction apparatus, wherein the corpus construction apparatus is configured to execute the corpus construction method of the first aspect or any implementation thereof.
In a fourth aspect, this application provides a corpus construction apparatus. The apparatus includes a processor and a communication interface. The communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the corpus construction method of the first aspect or any implementation thereof.
In a fifth aspect, this application provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium; when the instructions are executed, the corpus construction method of the first aspect or any implementation thereof is implemented.
In a sixth aspect, this application provides a computer program product comprising instructions. When the computer program product runs on a computer, the computer executes the corpus construction method of the first aspect or any implementation thereof.
In a seventh aspect, an embodiment of this application provides a chip. The chip includes a processor and a communication interface; the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the corpus construction method of the first aspect or any possible implementation thereof.
Specifically, the chip provided in this embodiment of the application further includes a memory for storing the computer program or instructions.
Detailed description of the invention
Fig. 1 is an architecture diagram of a corpus construction system provided by this application;
Fig. 2 is a first flowchart of a corpus construction method provided by the embodiments of this application;
Fig. 3 is a second flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 4 is a third flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 5 is a fourth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 6 is a fifth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 7 is a sixth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 8 is a structural diagram of a corpus construction apparatus provided by the embodiments of this application;
Fig. 9 is a structural diagram of another corpus construction apparatus provided by the embodiments of this application.
Specific embodiment
The corpus construction method and apparatus provided by this application are described in detail below with reference to the accompanying drawings.
The terms "first" and "second" in the description and drawings of this application are used to distinguish different objects, not to describe a particular order of objects.
In addition, the terms "include" and "have" and any variants thereof mentioned in the description of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
It should be noted that in the embodiments of this application, words such as "illustrative" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "illustrative" or "for example" should not be construed as preferable to, or more advantageous than, other embodiments or designs. Rather, the words "illustrative" and "for example" are intended to present a concept in a concrete manner.
In the description of this application, unless otherwise specified, "plurality" means two or more.
The terms involved in this application are explained below to help the reader:
E-commerce customer-service data standardization techniques:
Pattern matching: a regular expression is commonly used to match the character strings to be processed in a text, which are then replaced, filtered, or padded according to the actual business scenario. Removing digits, English letters, punctuation marks, and the like makes the text better reflect the real business scenario.
Chinese language model n-gram analysis: n-gram analysis splits a character string, by some minimal unit, into contiguous substrings of length n, keeping the most meaningful substrings to facilitate subsequent analysis. For example, with n=1 (a unigram), taking a single letter as the minimal unit, the word "flood" is split into "f", "l", "o", "o", "d". For a larger n, such as n=5, among the contiguous five-character substrings of the word "flooding" one clearly wants to keep "flood"; but with n=4, "ding" in "flooding" might also be judged a meaningful word. For a complete sentence, words are usually used as the smallest split unit.
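As a minimal sketch of the splitting step just described (an illustration only, not taken from the patent), the contiguous n-grams of a string can be enumerated as follows:

```python
def ngrams(s, n):
    """Split a string into all contiguous substrings of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# With n=1 each character is its own unit.
print(ngrams("flood", 1))     # ['f', 'l', 'o', 'o', 'd']
# With n=5, "flood" appears among the 5-grams of "flooding".
print(ngrams("flooding", 5))  # ['flood', 'loodi', 'oodin', 'oding']
# With n=4, the substring "ding" also appears, illustrating the ambiguity.
print("ding" in ngrams("flooding", 4))  # True
```

In practice the "minimal unit" would be a character or a word rather than a Latin letter, but the enumeration is the same.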
E-commerce customer-service corpus labeling techniques:
Domain term matching: a domain lexicon is used to match the text, and words belonging to a given field are classified accordingly. If no domain lexicon is available, keyword-extraction algorithms such as term frequency-inverse document frequency (TF-IDF), left-right entropy, mutual information, or TextRank can extract the keywords from the text collection; after deduplication, these keywords serve as the domain lexicon.
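For illustration only (the patent names TF-IDF but gives no formula), a minimal TF-IDF scorer over a small tokenized document collection might look like this, assuming the common tf × log(N / df) weighting:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term of each document with tf * log(N / df).

    docs: list of token lists. Returns a list of {term: score} dicts,
    one per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["refund", "order", "refund"],
        ["order", "shipping"],
        ["refund", "policy"]]
s = tfidf(docs)
# "refund" and "order" each appear in 2 of 3 documents, but "refund"
# occurs twice in the first document, so it outscores "order" there.
```

High-scoring terms across the collection could then be deduplicated into a domain lexicon, as the paragraph above describes.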
Building a machine-learning classification model: given a large labeled text collection, a classification model is trained with a machine-learning algorithm such as a support vector machine (SVM), random forest, or k-nearest neighbors (KNN). Once the model meets the required evaluation criteria, it can be used to automatically label new, unclassified corpus data.
Building a deep-learning classification model: given a large labeled text collection, a neural-network model is trained with a deep-learning algorithm such as a convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), or text convolutional neural network (TextCNN). Once the model meets the required evaluation criteria, it can be used to automatically label new, unclassified corpus data.
Customer-service domain term extraction techniques:
Keyword-extraction algorithm TextRank:
Preprocessing: the text is segmented into words using a hidden Markov model (HMM) algorithm; this step also includes stop-word removal and part-of-speech filtering.
Constructing word windows: for each word, the four words before and after it form that word's window. If a word appears multiple times after segmentation, a window is taken at each occurrence and the results are then merged and deduplicated.
Iterative voting: each word's final score is determined by the votes of the words in its window, and the score a word casts is related to its own weight.
Tallying the voting scores: finally, the K highest-scoring words are output as the keywords.
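A minimal sketch of the windowing and voting steps above (assuming, for illustration, an undirected co-occurrence graph and the conventional damping factor 0.85; this is a simplification, not the patent's exact procedure):

```python
def textrank(words, window=4, damping=0.85, iters=50):
    """Rank words by iterative voting over a co-occurrence graph."""
    # Build undirected co-occurrence edges within the word window;
    # repeated words share one merged, deduplicated neighbor set.
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j and words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Iterative voting: each word's score is fed by its neighbors,
    # and each vote is scaled by the voter's own connectivity.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(neighbors[v])
                for v in neighbors[w] if neighbors[v])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

tokens = ["switch", "human", "agent", "human", "agent", "refund"]
ranked = textrank(tokens, window=2)
# The best-connected words ("human", "agent") rank first.
```

The top-K entries of `ranked` would then be taken as the keywords.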
Keyword-extraction algorithm latent Dirichlet allocation (LDA), a document-topic generation model:
LDA is a document-topic generation model, also called a three-layer Bayesian probability model, comprising a three-level structure of words, topics, and documents. "Generation model" means that each word of an article is obtained through the process of selecting a topic with a certain probability and then selecting this word from that topic with a certain probability. LDA is an unsupervised learning technique that can be used to identify topic information hidden in massive document collections. It uses the bag-of-words method, which represents a document as a word-frequency vector, thereby converting text information into mathematical information.
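The bag-of-words representation mentioned above can be sketched as follows (this illustrates only the word-frequency vector that LDA consumes, not LDA's probabilistic inference itself):

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each tokenized document as a word-frequency vector
    over a shared, sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words([["refund", "order", "refund"],
                            ["order", "status"]])
# vocab: ['order', 'refund', 'status']
# vecs:  [[1, 2, 0], [1, 0, 1]]  -- word order within a document is discarded
```

A topic model such as LDA would then be fit on these vectors.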
The corpus construction method provided by the embodiments of this application is applied to a corpus construction system as shown in Fig. 1. As shown in Fig. 1, the corpus construction system includes a corpus construction apparatus 101, a first database 102, and at least one second database 103.
The corpus construction apparatus 101 is configured to obtain the target corpus database and the keyword database from the second database 103. The corpus construction apparatus 101 is further configured to process the target corpus data and store the processed corpus data in the first database 102. The processed corpus data includes the texts, the keywords, and the similarities between the texts and the keywords.
The first database 102 is configured to store the processed corpus data. The first database 102 is further configured to, after receiving a keyword sent by a customer-service robot, retrieve the text related to the keyword from the first database and return the text to the customer-service robot.
The second database 103 is configured to store the target corpus data. Illustratively, the target corpus data may include customer-service corpus data common to multiple industries and customer-service corpus data of a professional domain.
The industry-general customer-service corpus data is text in question-and-answer form. For example: Q: What is the service hotline? A: 400-0000-0000. Q: How do I reach a human agent? A: Press 0 to be transferred to a human agent. And so on for similar industry-general customer-service corpus data.
The professional-domain customer-service corpus data is plain-text customer-service corpus data or dialogue-form customer-service corpus data, for example, the process of bulk commercial procurement, the after-sales application process, or the billing and settlement method.
The embodiments of this application provide a corpus construction method applied to the corpus construction system shown in Fig. 1. The method can be applied to corpus construction in many fields, such as e-commerce customer service, network-operator customer service, and bank customer service. The embodiments of this application are described in detail using the e-commerce customer-service field as an example.
As shown in Fig. 2, the corpus construction method includes:
Step 101: obtain a target corpus database and a keyword database. The target corpus database includes at least one text, and the keyword database includes at least one first keyword.
The target corpus database includes one or a combination of the following: industry-general customer-service corpus data and professional-domain customer-service corpus data. The target corpus data is plain-text corpus data (i.e., texts composed of Chinese characters). Each text can be split into multiple keywords by a segmentation algorithm or segmentation tool.
The keyword database may be an existing, previously built keyword database, or a keyword database rebuilt in the embodiments of this application; this application does not limit this. The first keywords include keywords of industry-general customer-service corpus data and/or keywords of professional-domain customer-service corpus data.
Specifically, the corpus construction apparatus obtains the target corpus database and the keyword database from the multiple second databases respectively.
Step 102: determine the similarity between each text in the at least one text and each first keyword in the at least one first keyword.
Specifically, any first text in the at least one text can be converted into a first vector, and any keyword in the at least one first keyword can be converted into a second vector. The vector distance between the first vector and the second vector is computed and taken as the similarity between the first text and the first keyword. In other words, the text and the first keyword are each converted into a mathematical vector, and the similarity between the text and the first keyword is computed from the distance between the two vectors.
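As an illustration of this step (the patent does not fix a particular distance; cosine similarity over word-count vectors is assumed here as one common choice):

```python
import math
from collections import Counter

def to_vector(tokens, vocab):
    """Convert a token list into a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["how", "switch", "human", "agent", "refund"]
text_vec = to_vector(["how", "switch", "human", "agent"], vocab)
kw_vec = to_vector(["human", "agent"], vocab)
sim = cosine_similarity(text_vec, kw_vec)
# A keyword sharing no words with the text scores 0.0.
unrelated = cosine_similarity(text_vec, to_vector(["refund"], vocab))
```

In practice the vectors could equally be word embeddings; only the distance computation matters for this step.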
Illustratively, for the text "How do I reach a human agent?", the keyword database may contain the first keywords "how", "switch", and "human agent", as well as other first keywords with low relevance to the text, such as "buy", "product", "exchange", and "payment".
The corpus construction apparatus determines, by a predetermined algorithm, the similarity of each of the first keywords "how", "switch", and "human agent" to the text. For example, the first keyword "human agent" is the item of highest importance in this text, so its similarity to the text is high. The first keyword "how" is a generic interrogative, so its similarity to the text is low. The first keyword "switch" denotes the action being performed, with importance between "human agent" and "how", so its similarity to the text is at an intermediate level.
Illustratively, in step 102, the similarity between each first keyword and each text can be computed by the keyword-extraction algorithms TextRank and LDA described above.
It should be understood that one text may correspond to multiple first keywords, and the similarities between the text and each of those first keywords may be the same or different.
Meanwhile first keyword may also correspond to multiple texts.One keyword corresponds to multiple text.The key
The similarity of word and each text in multiple text can be the same or different.
For example, let the two first texts be a and b, and the two first keywords be c and d. The similarity of a and c is m1, the similarity of a and d is m2, the similarity of b and c is m3, and the similarity of b and d is m4. The values m1, m2, m3, and m4 may be equal or different.
Step 103, according between each text, each first keyword and each text and each first keyword
Similarity obtains corpus.
After the corpus construction apparatus has determined the similarity between each text in the target corpus data and the first keywords, an entity that needs to call texts, such as a customer-service robot, can determine the corresponding text from multiple keywords and the similarities between those keywords and the texts.
Therefore, the corpus construction apparatus establishes a mapping relationship between each text and each first keyword, and stores each text, each first keyword, and the mapping relationship between them in the database to obtain the corpus. The mapping relationship can be implemented specifically as the similarity between the text and the keyword. Illustratively, the corpus construction apparatus stores each text, each first keyword, and the mapping relationship between them in a MongoDB database to obtain the first corpus.
In the corpus construction method provided by the embodiments of this application, a target corpus database and a keyword database are obtained, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword. The similarity between each text in the at least one text and each first keyword in the at least one first keyword is determined, so that texts and first keywords are associated through similarity, and the corresponding text can be retrieved through multiple keywords and a similarity search. A corpus is then obtained according to each text, each first keyword, and the similarity between each text and each first keyword. This ensures that the corpus-building process is largely automated, so a corpus can be built quickly and conveniently.
With reference to Fig. 2, as shown in Fig. 3, step 102 may be specifically implemented as:
Step 104: determine at least one second keyword of the first text.
Each second keyword in the at least one second keyword has a weight value, and the weight values of the second keywords in the first text differ from one another.
In one implementation of step 104, the weight value of a second keyword in the first text can be determined by the frequency with which the second keyword occurs in the first text and the position at which it occurs in the first text.
For example, suppose the first text is an article. If a second keyword appears in the title, its weight value is 500; if it appears n times in the title, its weight value is 500 × n. If the second keyword appears in a subtitle, its weight value is 100; if it appears n times in subtitles, its weight value is 100 × n. If the second keyword appears in the body of the article, its weight value is 10; if it appears n times in the body, its weight value is 10 × n.
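This position-and-frequency weighting rule can be sketched as follows (the 500/100/10 constants come from the example above; the tokenized-section input format is an assumption for illustration):

```python
from collections import Counter

# Weight per occurrence, by the position where the keyword appears.
POSITION_WEIGHTS = {"title": 500, "subtitle": 100, "body": 10}

def keyword_weights(sections):
    """sections: {position: token list}. Returns total weight per keyword,
    summing position weight x occurrence count over all positions."""
    weights = Counter()
    for position, tokens in sections.items():
        per_occurrence = POSITION_WEIGHTS[position]
        for token, count in Counter(tokens).items():
            weights[token] += per_occurrence * count
    return dict(weights)

article = {
    "title": ["refund", "policy"],
    "subtitle": ["refund", "deadline"],
    "body": ["refund", "refund", "deadline"],
}
w = keyword_weights(article)
# "refund": 500 + 100 + 2*10 = 620; "deadline": 100 + 10 = 110
```

A title word thus dominates a body word even if the body word occurs many times more often.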
Specifically, the corpus construction apparatus selects the first text from the target corpus data and segments the first text with a segmentation tool or segmentation algorithm to obtain at least one keyword.
For example, the first text is segmented by the pattern-matching or n-gram analysis methods described above, yielding keywords that fit the usage scenario of the first text and can express its meaning.
Step 105: determine a first similarity between each second keyword and each first keyword.
After the first text has been segmented to obtain at least one keyword, the similarity between each first keyword and the first text can be determined by computing the similarity between each first keyword and each second keyword.
For example, segmenting the first text "How do I reach a human agent?" yields three second keywords: "how", "switch", and "human agent". The keyword database contains first keywords corresponding to these three second keywords ("how", "switch", "human agent") as well as other first keywords with low relevance to them, such as "buy", "product", "exchange", and "payment". The similarity of each of the three second keywords to each of the seven first keywords is computed. For example, the final results are: the similarity of the second keyword "how" to the first keyword "how" is 100%; the similarity of the second keyword "switch" to the first keyword "switch" is 100%; and the similarity of the second keyword "human agent" to the first keyword "human agent" is 100%.
The similarities between the second keywords "how", "switch", and "human agent" and the other first keywords are low and can be excluded by a preset threshold.
In one implementation of step 105, the corpus construction apparatus converts the second keyword into a second keyword vector and determines the first keyword vector. It computes the vector distance between the second keyword vector and the first keyword vector and takes that vector distance as the similarity between the two.
Specifically, after the corpus construction apparatus has segmented the first text into at least one second keyword, it selects a target second keyword from the at least one second keyword, the target second keyword being any one of the at least one second keyword. The corpus construction apparatus converts the target second keyword into a target second keyword vector, determines the distance between the target second keyword vector and the first keyword vector, and takes that distance as the similarity between the target second keyword and the first keyword.
The corpus construction apparatus determines, by this method, the similarity between each second keyword in the at least one second keyword and the first keyword; that is, the corpus construction apparatus determines the first similarity between each second keyword and each first keyword by this method.
It should be understood that the first keyword may be stored in the keyword database in vector form, or the corpus construction apparatus may convert the first keyword into a first keyword vector after obtaining the first keyword from the keyword database.
In one implementation of step 105, the first similarity may be the sum of the similarities between each second keyword and each first keyword; it may be the similarity between the second keyword and the first keyword in the keyword database most similar to that second keyword; or it may be the sum of the similarities between the second keyword and those first keywords in the keyword database whose similarity to the second keyword exceeds a preset threshold.
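The three aggregation options above can be sketched as follows (illustrative only; `sims` is assumed to map each first keyword to its similarity with one second keyword):

```python
def aggregate(sims, mode, threshold=0.5):
    """Aggregate {first_keyword: similarity} values into one first similarity.

    mode 'sum': total over all first keywords;
    mode 'max': similarity of the single most similar first keyword;
    mode 'threshold_sum': total over first keywords above the threshold."""
    if mode == "sum":
        return sum(sims.values())
    if mode == "max":
        return max(sims.values())
    if mode == "threshold_sum":
        return sum(v for v in sims.values() if v > threshold)
    raise ValueError(f"unknown mode: {mode}")

sims = {"human agent": 1.0, "switch": 0.6, "buy": 0.1}
# sum -> 1.7, max -> 1.0, threshold_sum (threshold 0.5) -> 1.6
```

The thresholded sum keeps the contribution of several strong matches while discarding noise such as "buy".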
Step 106: determine the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
Specifically, the similarity between the first text and the first keyword is computed according to a preset rule from the weight value of each second keyword in the first text determined in step 104 and the first similarity determined in step 105.
Illustratively, each weight value and each first similarity are first normalized, a first coefficient is assigned to the weight value, and a second coefficient is assigned to the first similarity. The product of the normalized weight value and the first coefficient and the product of the normalized first similarity and the second coefficient are computed, and the sum of these two products is taken as the similarity between the first text and the first keyword.
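A minimal sketch of this combination step (normalization by the maximum value and the equal coefficient values are assumptions for illustration; the patent leaves both unspecified):

```python
def text_keyword_similarity(weight, first_sim, max_weight, max_sim,
                            coef_weight=0.5, coef_sim=0.5):
    """Combine a normalized keyword weight and a normalized first similarity
    into the text-keyword similarity: c1 * w_norm + c2 * s_norm."""
    w_norm = weight / max_weight if max_weight else 0.0
    s_norm = first_sim / max_sim if max_sim else 0.0
    return coef_weight * w_norm + coef_sim * s_norm

# "human agent": a title-level weight and a perfect first similarity
# give the maximum combined score of 1.0.
sim = text_keyword_similarity(weight=500, first_sim=1.0,
                              max_weight=500, max_sim=1.0)
```

With equal coefficients, a keyword that is both prominent in the text and close to a database keyword scores highest.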
In the embodiments of this application, the first text is segmented to obtain the second keywords that represent it, and the first similarity between the second keywords and the first keywords is computed. The keywords corresponding to the first text and the similarity between them are then determined from the first similarity and the weight values of the second keywords, so the similarity between the first text and the first keywords can be obtained accurately.
With reference to Fig. 2, as shown in Fig. 4, step 103 may be specifically implemented as:
Step 107: from the first keywords, determine the target first keywords whose similarity to the first text exceeds a preset threshold. The target first keywords are the at least one first keyword whose similarity to the first text exceeds the preset threshold.
Specifically, the keyword database may contain a large number of first keywords, of which perhaps only a few to a few dozen are relevant to the first text. The higher the similarity between a first keyword and the first text, the stronger their correlation. The first keywords can therefore be filtered by setting a preset threshold, yielding the target first keywords.
Step 108: determine the first triple corresponding to each first text in the at least one first text. The first triple includes the first text, the target first keywords, and the similarity between the first text and each target first keyword.
Specifically, the first text, the target first keywords corresponding to the first text determined in step 107, and the similarity between the first text and the target first keywords are stored in triple form.
It should be understood that the triple may be a text-oriented triple or a keyword-oriented triple.
When the triple is text-oriented, it includes the first text, the target first keywords corresponding to the first text, and the similarities between the first text and the target first keywords. When the triple is keyword-oriented, it includes the first keyword, each first text in the at least one first text corresponding to the first keyword, and the similarities between the first keyword and the at least one first text.
Step 109: determine the first triple corresponding to each first text as the corpus.

The corpus establishing device stores the first triple corresponding to each first text in the first database, obtaining the corpus to be constructed in the present application.

The embodiment of the present application stores the first text, the target first keywords, and the similarity between them in the form of triples, so that the first texts corresponding to a keyword can be queried more efficiently.
In an implementation of the embodiment of the present application, the target corpus data is one, or a combination, of the following: customer service corpus data common to multiple industries, and customer service corpus data of a professional domain.

Before determining the similarity between each first text in the target corpus data and the first keywords, the corpus establishing device needs to obtain the original customer service corpus data and process it to obtain the target corpus data.
When the target corpus data is the customer service corpus data common to multiple industries, the corpus establishing device processes the original customer service corpus data common to multiple industries according to the first corpus processing rule, obtaining the customer service corpus data common to multiple industries, that is, the target corpus data.

When the target corpus data is the customer service corpus data of a professional domain, the corpus establishing device processes the original customer service corpus data of the professional domain according to the second corpus processing rule, obtaining the customer service corpus data of the professional domain, that is, the target corpus data.

When the target corpus data is a combination of the two, the corpus establishing device processes the original customer service corpus data common to multiple industries according to the first corpus processing rule to obtain the customer service corpus data common to multiple industries, processes the original customer service corpus data of the professional domain according to the second corpus processing rule to obtain the customer service corpus data of the professional domain, and combines the two to obtain the target corpus data.
Specifically, with reference to Fig. 2, as shown in Fig. 5, when in step 101 the corpus establishing device obtains the target corpus database by processing the original customer service corpus data of a professional domain according to the second corpus processing rule, step 101 may specifically be implemented as:

Step 110: obtain first corpus data including multiple texts.

The first corpus data is the above-mentioned original customer service corpus data of the professional domain.
Taking e-commerce as the professional domain for example, customer service corpus data for e-commerce is generally stored in the operation service databases of e-commerce platforms; such strongly domain-specific customer service data is difficult to obtain through channels such as internet searches. Therefore, in the embodiment of the present application, the original e-commerce customer service corpus data is obtained from an e-commerce operation service database and stored on a corresponding storage medium (such as a hard disk or a database server).
Step 111: fill the missing data in the first corpus data to obtain second corpus data.

The missing data is data absent from the first corpus data, for example data lost because the device storing the first corpus data was partially damaged, or because certain data was deleted through human error.
At present, the most common approach is to fill missing data with the most probable value, which can be determined, for example, by regression, Bayesian formal methods, or decision tree induction. Such methods infer the missing data from the existing data, so the filled values have a greater chance of preserving the relationships with the other attributes.

Other methods of handling missing data include replacing it with a global constant, filling it with the average value of the attribute, or classifying all tuples by some attribute and filling with the average value of the attribute within the same class. The application is not limited in this respect.
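Two of the filling strategies mentioned above — the overall attribute mean and the per-class attribute mean — can be sketched as follows; the record shapes are illustrative assumptions.

```python
# Fill missing values (represented as None) with the attribute mean,
# either over all records or within each class.

def mean_fill(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def class_mean_fill(rows):
    """rows: (class_label, value) pairs; fill None with the mean of its class."""
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    means = {label: sum(vs) / len(vs) for label, vs in by_class.items()}
    return [(label, means[label] if v is None else v) for label, v in rows]

print(mean_fill([1.0, None, 3.0]))                        # [1.0, 2.0, 3.0]
print(class_mean_fill([("a", 2.0), ("a", None), ("b", 4.0)]))
```

The regression and decision tree approaches mentioned first would replace the mean with a value predicted from the other attributes, at higher cost but usually with better preservation of inter-attribute relationships.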
Step 112: process the noise data in the second corpus data to obtain third corpus data.

Noise data is erroneous data, or data containing errors, in the second corpus data. Noise is a random error or deviation in a measured variable, and includes erroneous data and outlier data that deviates from expectations. The following techniques can be used to smooth noise data and to identify and delete outlier data.
1. Binning: distribute the stored data into bins and smooth each value locally using the other data in its bin. Specifically, smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries can be used.
2. Regression: find an appropriate regression function to smooth the data. Linear regression finds the "best" straight line for two variables so that one can predict the other; multiple linear regression involves several variables and fits the data to a multidimensional surface.
3. Combined computer and manual inspection: the computer compares the data with known normal data and outputs to a table the patterns whose deviation exceeds some threshold; the patterns in the table are then reviewed manually to identify the outliers.
4. Clustering: similar data are organized into groups or "clusters", and data falling outside the clusters are regarded as outliers. An outlier pattern may be junk data, or it may be significant, informative data; junk data is removed from the database.

The noise data in the second corpus data is deleted in the above ways, obtaining the third corpus data.
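The binning technique (item 1 above) can be sketched as follows, assuming equal-depth bins over sorted values; the bin size is an illustrative parameter.

```python
# Smooth sorted numeric data by bin means and by bin boundaries.

def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of its bin."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

def smooth_by_bin_boundaries(values, bin_size):
    """Snap every value to the nearer of its bin's min or max."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        lo, hi = min(bin_), max(bin_)
        out.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return out

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
print(smooth_by_bin_means(data, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

For corpus text rather than numeric fields, binning would apply to numeric attributes of the records (lengths, timestamps, counts) rather than the text itself.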
Step 113: convert the format of each text included in the third corpus data to the target text format, obtaining the target corpus database.

Since the corpus establishing device may assemble the first corpus data from multiple databases when obtaining it, the text format of each corpus item in the first corpus data may differ. Even within a single database, the formats of Chinese text, symbols, numbers, and English text may vary between items. Texts of different formats therefore need to be converted into a common format.
Illustratively, Chinese text is currently usually stored in UTF-8, and Chinese-language text accounts for most of the data in the first corpus data; the characters in each text can therefore be converted to UTF-8 for storage.
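A minimal sketch of this normalization, assuming the mixed sources use a small set of known encodings; the candidate list is an assumption, and real corpus data may need proper charset detection instead.

```python
# Normalize raw bytes of unknown encoding to a Python string, so the text
# can be re-stored uniformly as UTF-8.

def to_utf8(raw: bytes, candidates=("utf-8", "gb18030", "big5")) -> str:
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Fall back to UTF-8 with replacement characters rather than fail.
    return raw.decode("utf-8", errors="replace")

gbk_bytes = "退款".encode("gb18030")   # text stored in a legacy encoding
text = to_utf8(gbk_bytes)
utf8_bytes = text.encode("utf-8")      # uniform UTF-8 storage
```

Trying UTF-8 first is deliberate: valid UTF-8 rarely decodes incorrectly, while legacy multi-byte encodings often decode each other's bytes into mojibake without raising an error, so their order in the candidate list matters.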
With reference to Fig. 2, as shown in Fig. 6, when in step 101 the corpus establishing device obtains the target corpus database by processing the original customer service corpus data common to multiple industries according to the first corpus processing rule, step 101 may specifically be implemented as:
Step 114: obtain fourth corpus data.

The fourth corpus data is the original customer service corpus data common to multiple industries. It includes multiple texts in a variety of question-answer formats: one question with one answer (one question corresponds to one answer), one question with multiple answers (one question corresponds to multiple answers), multiple questions with one answer (multiple questions correspond to one answer), and multiple questions with multiple answers (multiple questions correspond to multiple answers).

Customer service corpus data common to multiple industries already exists in the prior art. Such data is usually structured, so the above-mentioned steps of filling missing data and processing noise data are not needed.
Step 115: process the fourth corpus data according to a preset data reduction rule, obtaining fifth corpus data.

The number of texts in the fifth corpus data is smaller than that in the fourth corpus data.

Data reduction techniques produce a reduced representation of the data set that preserves the integrity of the original data while being much smaller in volume. Compared with mining the unreduced data, mining the reduced data requires less time and memory, is more effective, and produces the same or almost the same analysis results.
The preset data reduction rule can be any of the following reduction methods:

1. Dimensionality reduction: reduce the data volume by deleting irrelevant attributes (dimensions). This not only compresses the data set but also reduces the number of attributes appearing in the discovered patterns. Attribute subset selection is generally used to find a minimal attribute set such that the probability distribution of the data classes is close to the original distribution using all attributes. Heuristic attribute subset selection techniques include: stepwise forward selection, which starts from an empty attribute set and repeatedly adds the "best" attribute from the original set; stepwise backward elimination, which starts from the full attribute set and at each step deletes the "worst" remaining attribute; a combination of forward selection and backward elimination, which at each step selects the "best" attribute and deletes the "worst"; and decision tree induction, which builds a classification tree using an information gain measure, the attributes appearing in the tree forming the reduced attribute set.
2. Data compression: apply data encoding or transformation to obtain a reduced or compressed representation of the original data. Data compression may be lossless or lossy; popular and effective lossy methods are the wavelet transform and principal component analysis. The wavelet transform compresses sparse or skewed data and data with ordered attributes well; principal component analysis is computationally cheap, can be applied to ordered or unordered attributes, and can handle sparse or skewed data.
3. Numerosity reduction: reduce the data volume by choosing alternative, smaller data representations. Numerosity reduction techniques may be parametric or non-parametric. Parametric methods fit the data with a model, so that only the model parameters need to be stored rather than the actual data. Two parametric techniques are regression (linear and multiple regression) and log-linear models (approximating multidimensional probability distributions over a discrete attribute set).
There are three non-parametric numerosity reduction techniques: histograms, a popular form of numerosity reduction that uses binning to approximate the data distribution, among which V-optimal and MaxDiff histograms are the most accurate and practical; clustering, which treats data tuples as objects and partitions them into groups or clusters such that objects within a cluster are "similar" and objects in different clusters are "dissimilar", the clusters replacing the actual data in data reduction; and sampling, which represents a large data set with a much smaller random sample, for example simple random sampling, cluster sampling, and stratified sampling.
4. Concept hierarchies: a concept hierarchy discretizes a numerical attribute by collecting low-level concepts and replacing them with higher-level concepts. Concept hierarchies can be used to reduce data: although detail is lost in this generalization, the generalized data is more meaningful, easier to interpret, and requires less space than the original data. For numerical attributes, the diversity of possible value ranges and the frequency of data value updates make it difficult to declare concept hierarchies explicitly; they can instead be constructed automatically from an analysis of the data distribution, for example using binning, histogram analysis, cluster analysis, entropy-based discretization, and natural partitioning.

Categorical data is itself discrete: a categorical attribute has a limited number of distinct values with no ordering among them. One approach is for a user or expert to explicitly declare a partial or total order over the attributes at the schema level, from which a concept hierarchy is obtained. Another is to declare a set of attributes without specifying their order, and have the system generate an attribute order from the number of distinct values of each attribute, automatically constructing a meaningful concept hierarchy.
Step 116: convert the question-answer format of each text among the multiple texts in the fifth corpus data to a preset question-answer format, obtaining the target corpus data.

In the obtained original customer service corpus data common to multiple industries, the expected question-answer format may differ across domains; the embodiment of the present application therefore unifies the question-answer format of the texts.
Illustratively, the embodiment of the present application unifies all question-answer formats into one-question-one-answer texts.

For example, data with one question and multiple answers is split into multiple records, each answer paired with a copy of the question, forming multiple one-question-one-answer records. Data with multiple questions and one answer is split likewise, each question paired with a copy of the answer. Data with multiple questions and multiple answers is split in the same way, again forming multiple one-question-one-answer records.
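The splitting just described can be sketched in a few lines; the record shape (lists of questions and answers) is an illustrative assumption.

```python
# Flatten any of the four question-answer formats into
# one-question-one-answer pairs.

def to_single_qa(questions, answers):
    """Split an m-question / n-answer record into m*n one-question-one-answer
    pairs; one-to-many, many-to-one and many-to-many all reduce to this."""
    return [(q, a) for q in questions for a in answers]

# one question, multiple answers -> multiple one-question-one-answer records
pairs = to_single_qa(["How do I return an item?"],
                     ["Use the returns page.", "Contact customer service."])
# [('How do I return an item?', 'Use the returns page.'),
#  ('How do I return an item?', 'Contact customer service.')]
```

A one-question-one-answer record is simply the m = n = 1 case, which passes through unchanged.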
In an implementation of the embodiment of the present application, with reference to Fig. 3, as shown in Fig. 7, step 104 may specifically be implemented as:
Step 117: segment the first text, obtaining at least one third keyword.

Specifically, the first text is composed of multiple words whose combination carries the content of the first text. The first text is first segmented, dividing it into multiple third keywords, each of which can represent part of the meaning of the first text.
Step 118: tag the at least one third keyword, obtaining at least one second keyword.

The part of speech of each third keyword is tagged to obtain more accurate, more representative second keywords. The part of speech of a third keyword includes, but is not limited to: pronoun, noun, adjective, verb, adverb, numeral, measure word, article, preposition, conjunction, particle, interjection, and onomatopoeia.

In Chinese, the same word may express entirely different meanings under different parts of speech. Tagging the part of speech of the third keywords makes it possible to use the part of speech to assign a weight to a third keyword, or to use it further when computing the similarity between the second keywords and the first keywords, so that keywords correspond to texts more accurately.
In step 118, the third keywords can be tagged by building a machine learning classification model or a deep learning classification model. Such automatic tagging greatly reduces tagging time and workload.

Since automatic tagging may be inaccurate, the automatic tagging results can be corrected by manual tagging after automatic tagging is completed, keeping the final results as accurate as possible.
Step 119: determine the weighted value of a second keyword in the first text according to the attributes of the second keyword.

The attributes of the second keyword include: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the tag of the second keyword.
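One way the three attributes in step 119 might be combined into a weight is sketched below; the combination formula, the position decay, and the per-tag factors are illustrative assumptions, since the patent does not fix a specific formula.

```python
# Weight a second keyword from its frequency in the text, the position of
# its first occurrence, and its part-of-speech tag.

TAG_FACTOR = {"noun": 1.0, "verb": 0.8, "adjective": 0.6}  # assumed values

def keyword_weight(tokens, keyword, tag):
    freq = tokens.count(keyword) / len(tokens)
    # Earlier occurrences count more: 1.0 at the start, decaying to ~0.5.
    first = tokens.index(keyword)
    position = 1.0 - 0.5 * first / max(len(tokens) - 1, 1)
    return freq * position * TAG_FACTOR.get(tag, 0.5)

tokens = ["refund", "request", "for", "damaged", "refund", "order"]
w = keyword_weight(tokens, "refund", "noun")
```

Under these assumed factors, a frequent noun at the start of the text outweighs a rarer adjective later in it, which matches the intent of using all three attributes together.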
In an implementation of the embodiment of the present application, a method is also provided by which a robot customer service retrieves, according to a user's request, the corresponding text from the constructed corpus and sends the text to the client.

Illustratively, this method may specifically be implemented as follows.
The robot customer service detects the sentence input by the user and segments it, obtaining three keywords: keyword one, keyword two, and keyword three. Each keyword corresponds to at least one text, with a similarity to each of its corresponding texts. Keyword one points to four texts: A, B, C, and O. Keyword two points to five texts: A, C, L, M, and N. Keyword three points to three texts: A, C, and P. The robot customer service can determine the text corresponding to the user's sentence in at least the following two ways.
Way one: the robot customer service determines that the text corresponding to the user's sentence may be A or C, since these are the texts pointed to by all three keywords. It then determines the similarities of keyword one, keyword two, and keyword three with A, and their similarities with C, and sends the user whichever of A and C has the higher total similarity. For example, suppose the similarities of keyword one, keyword two, and keyword three with A are 0.1, 0.3, and 0.5 respectively, summing to 0.9, while their similarities with C are 0.2, 0.1, and 0.3, summing to 0.6. Since the total similarity with A exceeds that with C, the robot customer service sends text A to the user.
Way two: the robot customer service determines the similarities of keyword one with A, B, C, and O, for example 0.1, 0.2, 0.2, and 0.6; the similarities of keyword two with A, C, L, M, and N, for example 0.3, 0.1, 0.3, 0.4, and 0.1; and the similarities of keyword three with A, C, and P, for example 0.5, 0.3, and 0.2. The robot customer service determines that A is the text with the highest total similarity across keyword one, keyword two, and keyword three, and sends A to the user.
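"Way two" above amounts to summing each candidate text's similarity over all query keywords and taking the maximum; a minimal sketch, reusing the numbers from the example:

```python
# Pick the corpus text with the highest total similarity over the query
# keywords.

def best_text(keyword_to_texts):
    """keyword_to_texts: {keyword: {text: similarity}} -> highest-total text."""
    totals = {}
    for sims in keyword_to_texts.values():
        for text, s in sims.items():
            totals[text] = totals.get(text, 0.0) + s
    return max(totals, key=totals.get)

answer = best_text({
    "keyword one":   {"A": 0.1, "B": 0.2, "C": 0.2, "O": 0.6},
    "keyword two":   {"A": 0.3, "C": 0.1, "L": 0.3, "M": 0.4, "N": 0.1},
    "keyword three": {"A": 0.5, "C": 0.3, "P": 0.2},
})
# totals: A = 0.9, C = 0.6, O = 0.6, M = 0.4, ... -> answer is "A"
```

"Way one" differs only in first intersecting the candidate sets (keeping A and C, the texts all three keywords point to) before summing, which prunes texts matched by a single keyword.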
It should be understood that the robot customer service can send text A to the user in various forms. For example, in a voice-call robot customer service, text A can be sent by voice broadcast or by short message; in a web or APP robot customer service, text A can be sent directly in the dialog box, or the user can be sent a link pointing to text A, etc. The application does not limit this.
In an implementation of the embodiment of the present application, if after sending text A the robot customer service learns from the user's reply that A is not the text the user wants, it sends the user the text with the next highest similarity besides A: in way one, the robot customer service then sends text C; in way two, it then sends text C and/or text O.
In an implementation of the embodiment of the present application, the similarity between the first text and the first keywords can also be determined as follows.

Input to the similarity calculation tool: the keyword database and the target corpus data.

Output: a list of triples formed of the first text, the keyword, and the similarity.
Specifically:

The input of the algorithm is the keyword database and the target corpus data, and the output is a list of triples formed from them; from this triple list, the list of keywords with higher similarity to the first text can be obtained. The algorithm operates on each keyword instance in the keyword database, taking all of its attributes as the description of that keyword, and then obtains, with the TF-IDF algorithm, the second keywords that best represent the first text. The Jaccard similarity between the second keywords and the keywords in the keyword database is computed; HowNet is then used to compute the semantic similarity between the first text and the first keyword. Finally, the Jaccard similarity and the semantic similarity are added to obtain the similarity between the first text and the first keyword. The algorithm thus produces a list of (first text, keyword, similarity) triples.
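A sketch of this triple-list algorithm: TF-IDF selects the second keywords of each text, Jaccard similarity compares them with each first keyword's attribute set, and a semantic-similarity term is added. The HowNet-based semantic similarity is stubbed out as a placeholder, and the data shapes are illustrative assumptions.

```python
import math

def tfidf_top_terms(doc_tokens, all_docs, k):
    """Pick the k terms of doc_tokens with the highest TF-IDF scores."""
    scores = {}
    for t in set(doc_tokens):
        tf = doc_tokens.count(t) / len(doc_tokens)
        df = sum(1 for d in all_docs if t in d)
        scores[t] = tf * math.log(len(all_docs) / df)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_similarity(text_tokens, keyword):
    return 0.0  # placeholder for the HowNet-based computation

def triple_list(docs, keyword_attrs, k=3):
    """keyword_attrs: {first keyword: set of attribute terms describing it}.
    Returns (text, keyword, similarity) triples."""
    triples = []
    for doc in docs:
        second = tfidf_top_terms(doc, docs, k)
        for kw, attrs in keyword_attrs.items():
            sim = jaccard(second, attrs) + semantic_similarity(doc, kw)
            triples.append((tuple(doc), kw, sim))
    return triples
```

Adding the two similarity terms, as described above, lets lexical overlap (Jaccard) and word-sense relatedness (HowNet) compensate for each other: synonymous keywords with no surface overlap can still score well once the semantic term is filled in.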
According to the above method examples, the embodiment of the present application may divide the corpus establishing device into functional modules or functional units, for example dividing each function into a corresponding functional module or unit, or integrating two or more functions into one processing module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module or unit. The division of modules or units in the embodiment of the present application is schematic and is only one kind of logical functional division; other division manners are possible in actual implementation.
The embodiment of the present application provides a corpus establishing device, applied to the corpus establishing system shown in Fig. 1 above. As shown in Fig. 8, the corpus establishing device includes:
An acquiring unit 801 for obtaining the target corpus database and the keyword database, where the target corpus database includes at least one text and the keyword database includes at least one first keyword.

A processing unit 802 for determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword.

The processing unit 802 is also used to obtain the corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
Optionally, the processing unit 802 is also used to: determine at least one second keyword of the first text, each second keyword in the at least one second keyword having a weighted value; determine the first similarity between each second keyword and each first keyword; and determine the similarity between the first text and each first keyword according to the first similarity and the weighted value of each second keyword.
Optionally, the processing unit 802 is also used to: determine, from the first keywords, the target first keywords whose similarity with the text is greater than the preset threshold; determine the first triple corresponding to each text in the at least one text, the first triple including the text, the target first keywords, and the similarity between the text and each first keyword; and determine the first triple corresponding to each text as the corpus.
Optionally, the processing unit is also used to: segment the first text to obtain at least one third keyword; tag the at least one third keyword to obtain at least one second keyword; and determine the weighted value of the second keyword in the first text according to the attributes of the second keyword, which include the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the tag of the second keyword.
Optionally, the acquiring unit 801 is also used to obtain the first corpus data including multiple texts. The processing unit 802 is also used to fill the missing data in the first corpus data to obtain the second corpus data, and to process the noise data in the second corpus data to obtain the third corpus data, the noise data being erroneous data, or data containing errors, in the second corpus data.

The processing unit 802 is also used to convert the format of each text included in the third corpus data to the target text format, obtaining the target corpus database.
Optionally, the acquiring unit 801 is also used to obtain the fourth corpus data, which includes multiple texts in a variety of question-answer formats: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers. The processing unit 802 is also used to process the fourth corpus data according to the preset data reduction rule to obtain the fifth corpus data, the number of texts in which is smaller than in the fourth corpus data, and to convert the question-answer format of each text among the multiple texts in the fifth corpus data to the preset question-answer format, obtaining the target corpus data.
When implemented in hardware, the receiving unit, acquiring unit, and sending unit in the embodiment of the present application can be integrated in a communication interface, and the processing unit 802 can be integrated on a processor. A specific implementation is shown in Fig. 9.
Fig. 9 shows another possible structural schematic diagram of the corpus establishing device involved in the above embodiment. The corpus establishing device includes a processor 902 and a communication interface 903. The processor 902 is used to control and manage the actions of the corpus establishing device, for example to execute the steps executed by the above-mentioned processing unit 802, and/or to execute other processes of the technology described herein. The communication interface 903 is used to support communication between the corpus establishing device and other network entities, for example to execute the steps executed by the above-mentioned acquiring unit 801. The corpus establishing device may also include a memory 901 and a bus 904, the memory 901 being used to store the program code and data of the corpus establishing device.
The memory 901 may be a memory in the corpus establishing device and may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above kinds of memory.
The above processor 902 can implement or execute the various illustrative logical blocks, modules, and circuits described in the present disclosure. The processor may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 904 may be an Extended Industry Standard Architecture (EISA) bus, etc. The bus 904 may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in Fig. 9, but this does not mean there is only one bus or one type of bus.
Through the above description of the embodiments, it will be clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional modules is illustrated by example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the system, apparatus, and units described above, reference may be made to the corresponding process in the foregoing method embodiments; details are not repeated here.
The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information processing server in the above method embodiments.

The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information reinforcement server in the above method embodiments.

The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information calling terminal in the above method embodiments.
The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information processing server.

The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information reinforcement server.

The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information calling terminal.
The computer-readable storage medium may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a register, an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, any suitable combination of the above, or any other form of computer-readable storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). In the embodiments of the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (14)
1. A corpus establishment method, characterized in that the method comprises:
obtaining a target corpus database and a keyword database, wherein the target corpus database comprises at least one text, and the keyword database comprises at least one first keyword;
determining a similarity between each text in the at least one text and each first keyword in the at least one first keyword; and
obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
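As a purely illustrative sketch (not part of the claimed subject matter), the flow of claim 1 might look as follows. The Jaccard character-overlap measure, the function names, and the data shapes are all assumptions standing in for whatever similarity computation an implementation would actually use:

```python
def similarity(text: str, keyword: str) -> float:
    """Toy similarity: Jaccard overlap between the character sets."""
    a, b = set(text), set(keyword)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_corpus(texts, first_keywords):
    """For each text, record its similarity to every first keyword."""
    corpus = []
    for text in texts:
        sims = {kw: similarity(text, kw) for kw in first_keywords}
        corpus.append({"text": text, "similarities": sims})
    return corpus

# One text from the target corpus database, two first keywords:
corpus = build_corpus(["how do I reset my password"], ["password", "refund"])
```

The resulting records pair each text with its per-keyword similarities, which is the raw material the later claims refine.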
2. The method according to claim 1, characterized in that determining the similarity between a first text and each first keyword in the at least one first keyword comprises:
determining at least one second keyword of the first text, wherein each second keyword in the at least one second keyword has a weight value, and the first text is any one text in the at least one text;
determining a first similarity between each second keyword and each first keyword; and
determining the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
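The weighted aggregation of claim 2 can be sketched as below. The character-overlap measure and the weighted-average formula are hypothetical stand-ins; the claim itself does not fix either:

```python
def keyword_similarity(kw_a: str, kw_b: str) -> float:
    """First similarity: Jaccard overlap between two keywords' character sets."""
    a, b = set(kw_a), set(kw_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def text_to_keyword_similarity(second_keywords: dict, first_keyword: str) -> float:
    """Weighted average over the text's second keywords.

    second_keywords maps each second keyword of the text to its weight value.
    """
    total_weight = sum(second_keywords.values())
    if total_weight == 0:
        return 0.0
    return sum(
        weight * keyword_similarity(kw, first_keyword)
        for kw, weight in second_keywords.items()
    ) / total_weight

# Second keywords of one text, each with a weight value:
seconds = {"password": 0.7, "reset": 0.3}
sim = text_to_keyword_similarity(seconds, "password")
```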
3. The method according to any one of claims 1-2, characterized in that obtaining the corpus according to each text, each first keyword, and the similarity between each text and each first keyword comprises:
determining, from each first keyword, a target first keyword whose similarity with the text is greater than a preset threshold;
determining a first triple corresponding to each text in the at least one text, wherein the first triple comprises the text, the target first keyword, and the similarity between the text and each first keyword; and
determining the first triple corresponding to each text as the corpus.
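A minimal sketch of claim 3's triple construction follows; the threshold value, the tuple layout, and the input shape are illustrative assumptions:

```python
PRESET_THRESHOLD = 0.3  # assumed value, for illustration only

def to_triples(texts_with_sims, threshold=PRESET_THRESHOLD):
    """Emit (text, target first keywords, similarity map) per text."""
    triples = []
    for text, sims in texts_with_sims:
        # Target first keywords: similarity strictly above the preset threshold.
        targets = [kw for kw, s in sims.items() if s > threshold]
        triples.append((text, targets, sims))
    return triples

triples = to_triples([
    ("how do I reset my password", {"password": 0.5, "refund": 0.18}),
])
```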
4. The method according to claim 2, characterized in that segmenting the first text to obtain the at least one second keyword comprises:
segmenting the first text to obtain at least one third keyword;
labeling the at least one third keyword to obtain the at least one second keyword; and
determining the weight value of a second keyword in the first text according to an attribute of the second keyword, wherein the attribute of the second keyword comprises: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the label of the second keyword.
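Claim 4's attribute-based weighting can be illustrated with a toy example. Everything here is a hypothetical stand-in, since the claim fixes no formula: the whitespace segmentation, the label dictionary, and the way frequency, position, and label are combined:

```python
TOY_LABELS = {"password": "noun", "reset": "verb"}  # assumed label dictionary

def segment(text: str):
    """Third keywords via a naive whitespace segmentation."""
    return text.lower().split()

def weight(keyword: str, tokens) -> float:
    """Combine the three claimed attributes into one weight value."""
    frequency = tokens.count(keyword) / len(tokens)        # frequency in the text
    position = 1.0 - tokens.index(keyword) / len(tokens)   # earlier occurrence => heavier
    label_bonus = 1.5 if TOY_LABELS.get(keyword) == "noun" else 1.0  # label attribute
    return frequency * position * label_bonus

tokens = segment("reset password then reset again")
w_reset = weight("reset", tokens)
w_password = weight("password", tokens)
```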
5. The method according to any one of claims 1-2, characterized in that obtaining the target corpus database comprises:
obtaining first corpus data comprising a plurality of texts;
filling missing data in the first corpus data to obtain second corpus data;
processing noise data in the second corpus data to obtain third corpus data, wherein the noise data is erroneous data or data containing errors in the second corpus data; and
converting the format of each text included in the third corpus data into a target text format to obtain the target corpus database.
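The preprocessing chain of claim 5 could be sketched as a three-stage pipeline. All function names, the default fill value, the noise rule, and the "Q ||| A" target format are illustrative assumptions:

```python
def fill_missing(records, default=""):
    """First -> second corpus data: fill absent fields with a default value."""
    return [{"q": r.get("q", default), "a": r.get("a", default)} for r in records]

def drop_noise(records):
    """Second -> third corpus data: drop records whose answer is empty (noise)."""
    return [r for r in records if r["a"].strip()]

def to_target_format(records):
    """Third corpus data -> target corpus database: one 'Q ||| A' line per text."""
    return [f"{r['q'].strip()} ||| {r['a'].strip()}" for r in records]

raw = [{"q": "hours?", "a": "9-5"}, {"q": "refund?"}]  # second record lacks an answer
target = to_target_format(drop_noise(fill_missing(raw)))
```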
6. The method according to any one of claims 1-2, characterized in that obtaining the target corpus database comprises:
obtaining fourth corpus data, wherein the fourth corpus data comprises a plurality of texts, the plurality of texts comprise texts in a variety of question-and-answer formats, and the question-and-answer formats comprise: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers;
processing the fourth corpus data according to a preset data reduction rule to obtain fifth corpus data, wherein the number of texts in the fifth corpus data is less than the number of texts in the fourth corpus data; and
converting the question-and-answer format of each text in the plurality of texts in the fifth corpus data into a preset question-and-answer format to obtain the target corpus database.
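A sketch of claim 6 under assumed data shapes: reduce the corpus by a rule, then convert every text, whatever its question-and-answer format, into one preset format (here, one-question-one-answer pairs). The reduction rule and the record layout are hypothetical:

```python
def reduce_corpus(records, keep=lambda r: bool(r["questions"] and r["answers"])):
    """Fourth -> fifth corpus data: drop records failing the reduction rule."""
    return [r for r in records if keep(r)]

def normalize(record):
    """Pair every question with every answer: the preset one-to-one format."""
    return [(q, a) for q in record["questions"] for a in record["answers"]]

fourth = [
    {"questions": ["hours?"], "answers": ["9-5", "weekends closed"]},  # 1 question, 2 answers
    {"questions": [], "answers": ["orphan answer"]},                   # dropped by the rule
]
fifth = reduce_corpus(fourth)
target = [pair for r in fifth for pair in normalize(r)]
```

Note that the fifth corpus (1 record) is smaller than the fourth (2 records), matching the claim's size constraint.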
7. A corpus establishment apparatus, characterized in that the apparatus comprises:
an obtaining unit, configured to obtain a target corpus database and a keyword database, wherein the target corpus database comprises at least one text, and the keyword database comprises at least one first keyword; and
a processing unit, configured to determine a similarity between each text in the at least one text and each first keyword in the at least one first keyword;
the processing unit being further configured to obtain a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
8. The apparatus according to claim 7, characterized in that the processing unit is further configured to:
determine at least one second keyword of a first text, wherein each second keyword in the at least one second keyword has a weight value, and the first text is any one text in the at least one text;
determine a first similarity between each second keyword and each first keyword; and
determine the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
9. The apparatus according to any one of claims 7-8, characterized in that the processing unit is further configured to:
determine, from each first keyword, a target first keyword whose similarity with the text is greater than a preset threshold;
determine a first triple corresponding to each text in the at least one text, wherein the first triple comprises the text, the target first keyword, and the similarity between the text and each first keyword; and
determine the first triple corresponding to each text as the corpus.
10. The apparatus according to claim 8, characterized in that the processing unit is further configured to:
segment the first text to obtain at least one third keyword;
label the at least one third keyword to obtain the at least one second keyword; and
determine the weight value of a second keyword in the first text according to an attribute of the second keyword, wherein the attribute of the second keyword comprises: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the label of the second keyword.
11. The apparatus according to any one of claims 7-8, characterized in that:
the obtaining unit is further configured to obtain first corpus data comprising a plurality of texts;
the processing unit is further configured to fill missing data in the first corpus data to obtain second corpus data;
the processing unit is further configured to process noise data in the second corpus data to obtain third corpus data, wherein the noise data is erroneous data or data containing errors in the second corpus data; and
the processing unit is further configured to convert the format of each text included in the third corpus data into a target text format to obtain the target corpus database.
12. The apparatus according to any one of claims 7-8, characterized in that:
the obtaining unit is further configured to obtain fourth corpus data, wherein the fourth corpus data comprises a plurality of texts, the plurality of texts comprise texts in a variety of question-and-answer formats, and the question-and-answer formats comprise: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers;
the processing unit is further configured to process the fourth corpus data according to a preset data reduction rule to obtain fifth corpus data, wherein the number of texts in the fifth corpus data is less than the number of texts in the fourth corpus data; and
the processing unit is further configured to convert the question-and-answer format of each text in the plurality of texts in the fifth corpus data into a preset question-and-answer format to obtain the target corpus database.
13. A corpus establishment apparatus, characterized by comprising: a processor and a communication interface; wherein the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the method according to any one of claims 1-6.
14. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are executed, the method according to any one of claims 1-6 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910420207.7A CN110222192A (en) | 2019-05-20 | 2019-05-20 | Corpus method for building up and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222192A true CN110222192A (en) | 2019-09-10 |
Family
ID=67821662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910420207.7A Pending CN110222192A (en) | 2019-05-20 | 2019-05-20 | Corpus method for building up and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222192A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103914543A (en) * | 2014-04-03 | 2014-07-09 | 北京百度网讯科技有限公司 | Search result displaying method and device |
CN105183727A (en) * | 2014-05-29 | 2015-12-23 | 上海研深信息科技有限公司 | Method and system for recommending book |
US9411878B2 (en) * | 2014-02-19 | 2016-08-09 | International Business Machines Corporation | NLP duration and duration range comparison methodology using similarity weighting |
CN105955976A (en) * | 2016-04-15 | 2016-09-21 | 中国工商银行股份有限公司 | Automatic answering system and method |
CN106610951A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Improved text similarity solving algorithm based on semantic analysis |
Non-Patent Citations (1)
Title |
---|
何阿静 (He Ajing): "自动问答系统的研究与实现" (Research and Implementation of an Automatic Question Answering System), 《信息科技辑》 (Information Science & Technology Series) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111355715A (en) * | 2020-02-21 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Processing method, system, device, medium and electronic equipment of event to be resolved |
CN111460117A (en) * | 2020-03-20 | 2020-07-28 | 平安科技(深圳)有限公司 | Dialog robot intention corpus generation method, device, medium and electronic equipment |
CN111460117B (en) * | 2020-03-20 | 2024-03-08 | 平安科技(深圳)有限公司 | Method and device for generating intent corpus of conversation robot, medium and electronic equipment |
CN112214586A (en) * | 2020-10-13 | 2021-01-12 | 华东师范大学 | Corpus accumulation method for assisting interview investigation |
CN112214586B (en) * | 2020-10-13 | 2022-06-28 | 华东师范大学 | Corpus accumulation method for assisting interview investigation |
CN112732934A (en) * | 2021-01-11 | 2021-04-30 | 国网山东省电力公司电力科学研究院 | Power grid equipment word segmentation dictionary and fault case library construction method |
CN112732934B (en) * | 2021-01-11 | 2022-05-27 | 国网山东省电力公司电力科学研究院 | Power grid equipment word segmentation dictionary and fault case library construction method |
CN112784052A (en) * | 2021-03-15 | 2021-05-11 | 中国平安人寿保险股份有限公司 | Text classification method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |