CN110222192A - Corpus construction method and apparatus - Google Patents
- Publication number
- CN110222192A CN110222192A CN201910420207.7A CN201910420207A CN110222192A CN 110222192 A CN110222192 A CN 110222192A CN 201910420207 A CN201910420207 A CN 201910420207A CN 110222192 A CN110222192 A CN 110222192A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- corpus
- data
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
This application provides a corpus construction method and apparatus, relating to the field of communications, that can quickly and conveniently build a customer-service corpus. The method comprises: obtaining a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword; and obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword. The method is used to build a customer-service corpus.
Description
Technical field
This application relates to the field of computers, and in particular to a corpus construction method and apparatus.
Background technique
With the development and spread of the Internet, electronic commerce has continued to grow: the number of users shopping online keeps increasing, and e-commerce transaction volume keeps rising. Users' demand for e-commerce services grows accordingly, which places ever greater pressure on customer-service staff. To relieve this pressure, common user questions that arise during e-commerce transactions are handed over to automated customer-service robots. A customer-service robot typically processes a question by detecting the question it receives, looking up the corresponding answer in a customer-service corpus according to the question's keywords, and returning that answer.
At present, customer-service corpora are usually built manually: staff collect common customer-service data in the e-commerce field, analyze the data, identify the common questions in the customer-service domain and the answer to each question, and store the question-answer pairs in a customer-service corpus database. When the customer-service robot needs to answer a question posed by a user, it matches the corresponding answer from the corpus database and returns it to the user.
This manual approach consumes considerable manpower, material resources, and time, and depends heavily on the experience of the staff. Moreover, as the volume of customer-service data to be analyzed grows exponentially, existing schemes for building customer-service corpora can no longer meet demand.
Summary of the invention
Embodiments of this application provide a corpus construction method and apparatus for building a customer-service corpus automatically.
To achieve the above objective, this application adopts the following technical solutions.
In a first aspect, this application provides a corpus construction method. The method comprises: obtaining a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword; and obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
In the corpus construction method provided by the embodiments of this application, a target corpus database and a keyword database are obtained, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword. The similarity between each text in the at least one text and each first keyword in the at least one first keyword is determined, so that texts and first keywords are associated through similarity, and the corresponding text can be retrieved through multiple keywords and a similarity search. A corpus is then obtained according to each text, each first keyword, and the similarity between each text and each first keyword. This ensures that the corpus-building process is largely automated, so a corpus can be built quickly and conveniently.
In a second aspect, this application provides a corpus construction apparatus. The apparatus includes: an obtaining unit, configured to obtain a target corpus database and a keyword database, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword; and a processing unit, configured to determine the similarity between each text in the at least one text and each first keyword in the at least one first keyword, and further configured to obtain a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
In a third aspect, this application provides a corpus construction system. The system includes a corpus construction apparatus, wherein the corpus construction apparatus is configured to execute the corpus construction method of the first aspect or any implementation thereof.
In a fourth aspect, this application provides a corpus construction apparatus. The apparatus includes a processor and a communication interface. The communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the corpus construction method of the first aspect or any implementation thereof.
In a fifth aspect, this application provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium; when the instructions are executed, the corpus construction method of the first aspect or any implementation thereof is implemented.
In a sixth aspect, this application provides a computer program product comprising instructions. When the computer program product runs on a computer, the computer executes the corpus construction method of the first aspect or any implementation thereof.
In a seventh aspect, an embodiment of this application provides a chip. The chip includes a processor and a communication interface; the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the corpus construction method of the first aspect or any possible implementation thereof.
Specifically, the chip provided in this embodiment of the application further includes a memory for storing the computer program or instructions.
Detailed description of the invention
Fig. 1 is an architecture diagram of a corpus construction system provided by this application;
Fig. 2 is a first flowchart of a corpus construction method provided by the embodiments of this application;
Fig. 3 is a second flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 4 is a third flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 5 is a fourth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 6 is a fifth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 7 is a sixth flowchart of the corpus construction method provided by the embodiments of this application;
Fig. 8 is a structural diagram of a corpus construction apparatus provided by the embodiments of this application;
Fig. 9 is a structural diagram of another corpus construction apparatus provided by the embodiments of this application.
Specific embodiment
The corpus construction method and apparatus provided by this application are described in detail below with reference to the accompanying drawings.
The terms "first" and "second" in the description and drawings of this application are used to distinguish different objects, not to describe a particular order of objects.
In addition, the terms "include" and "have" and any variants thereof mentioned in the description of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
It should be noted that in the embodiments of this application, words such as "illustrative" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "illustrative" or "for example" should not be construed as preferable to, or more advantageous than, other embodiments or designs. Rather, the words "illustrative" and "for example" are intended to present a concept in a concrete manner.
In the description of this application, unless otherwise specified, "plurality" means two or more.
The terms involved in this application are explained below to help the reader:
E-commerce customer-service data standardization techniques:
Pattern matching: a regular expression is commonly used to match the character strings to be processed in a text, which are then replaced, filtered, or padded according to the actual business scenario. Removing digits, English letters, punctuation marks, and the like makes the text better reflect the real business scenario.
Chinese language model n-gram analysis: n-gram analysis splits a character string, by some minimal unit, into contiguous substrings of length n, keeping the most meaningful substrings to facilitate subsequent analysis. For example, with n=1 (a unigram), taking a single letter as the minimal unit, the word "flood" is split into "f", "l", "o", "o", "d". For a larger n, such as n=5, among the contiguous five-character substrings of the word "flooding" one clearly wants to keep "flood"; but with n=4, "ding" in "flooding" might also be judged a meaningful word. For a complete sentence, words are usually used as the smallest split unit.
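As a minimal sketch of the splitting step just described (an illustration only, not taken from the patent), the contiguous n-grams of a string can be enumerated as follows:

```python
def ngrams(s, n):
    """Split a string into all contiguous substrings of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# With n=1 each character is its own unit.
print(ngrams("flood", 1))     # ['f', 'l', 'o', 'o', 'd']
# With n=5, "flood" appears among the 5-grams of "flooding".
print(ngrams("flooding", 5))  # ['flood', 'loodi', 'oodin', 'oding']
# With n=4, the substring "ding" also appears, illustrating the ambiguity.
print("ding" in ngrams("flooding", 4))  # True
```

In practice the "minimal unit" would be a character or a word rather than a Latin letter, but the enumeration is the same.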
E-commerce customer-service corpus labeling techniques:
Domain term matching: a domain lexicon is used to match the text, and words belonging to a given field are classified accordingly. If no domain lexicon is available, keyword-extraction algorithms such as term frequency-inverse document frequency (TF-IDF), left-right entropy, mutual information, or TextRank can extract the keywords from the text collection; after deduplication, these keywords serve as the domain lexicon.
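For illustration only (the patent names TF-IDF but gives no formula), a minimal TF-IDF scorer over a small tokenized document collection might look like this, assuming the common tf × log(N / df) weighting:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term of each document with tf * log(N / df).

    docs: list of token lists. Returns a list of {term: score} dicts,
    one per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["refund", "order", "refund"],
        ["order", "shipping"],
        ["refund", "policy"]]
s = tfidf(docs)
# "refund" and "order" each appear in 2 of 3 documents, but "refund"
# occurs twice in the first document, so it outscores "order" there.
```

High-scoring terms across the collection could then be deduplicated into a domain lexicon, as the paragraph above describes.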
Building a machine-learning classification model: given a large labeled text collection, a classification model is trained with a machine-learning algorithm such as a support vector machine (SVM), random forest, or k-nearest neighbors (KNN). Once the model meets the required evaluation criteria, it can be used to automatically label new, unclassified corpus data.
Building a deep-learning classification model: given a large labeled text collection, a neural-network model is trained with a deep-learning algorithm such as a convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), or text convolutional neural network (TextCNN). Once the model meets the required evaluation criteria, it can be used to automatically label new, unclassified corpus data.
Customer-service domain term extraction techniques:
Keyword-extraction algorithm TextRank:
Preprocessing: the text is segmented into words using a hidden Markov model (HMM) algorithm; this step also includes stop-word removal and part-of-speech filtering.
Constructing word windows: for each word, the four words before and after it form that word's window. If a word appears multiple times after segmentation, a window is taken at each occurrence and the results are then merged and deduplicated.
Iterative voting: each word's final score is determined by the votes of the words in its window, and the score a word casts is related to its own weight.
Tallying the voting scores: finally, the K highest-scoring words are output as the keywords.
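A minimal sketch of the windowing and voting steps above (assuming, for illustration, an undirected co-occurrence graph and the conventional damping factor 0.85; this is a simplification, not the patent's exact procedure):

```python
def textrank(words, window=4, damping=0.85, iters=50):
    """Rank words by iterative voting over a co-occurrence graph."""
    # Build undirected co-occurrence edges within the word window;
    # repeated words share one merged, deduplicated neighbor set.
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j and words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Iterative voting: each word's score is fed by its neighbors,
    # and each vote is scaled by the voter's own connectivity.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(neighbors[v])
                for v in neighbors[w] if neighbors[v])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

tokens = ["switch", "human", "agent", "human", "agent", "refund"]
ranked = textrank(tokens, window=2)
# The best-connected words ("human", "agent") rank first.
```

The top-K entries of `ranked` would then be taken as the keywords.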
Keyword-extraction algorithm latent Dirichlet allocation (LDA), a document-topic generation model:
LDA is a document-topic generation model, also called a three-layer Bayesian probability model, comprising a three-level structure of words, topics, and documents. "Generation model" means that each word of an article is obtained through the process of selecting a topic with a certain probability and then selecting this word from that topic with a certain probability. LDA is an unsupervised learning technique that can be used to identify topic information hidden in massive document collections. It uses the bag-of-words method, which represents a document as a word-frequency vector, thereby converting text information into mathematical information.
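The bag-of-words representation mentioned above can be sketched as follows (this illustrates only the word-frequency vector that LDA consumes, not LDA's probabilistic inference itself):

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each tokenized document as a word-frequency vector
    over a shared, sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words([["refund", "order", "refund"],
                            ["order", "status"]])
# vocab: ['order', 'refund', 'status']
# vecs:  [[1, 2, 0], [1, 0, 1]]  -- word order within a document is discarded
```

A topic model such as LDA would then be fit on these vectors.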
The corpus construction method provided by the embodiments of this application is applied to a corpus construction system as shown in Fig. 1. As shown in Fig. 1, the corpus construction system includes a corpus construction apparatus 101, a first database 102, and at least one second database 103.
The corpus construction apparatus 101 is configured to obtain the target corpus database and the keyword database from the second database 103. The corpus construction apparatus 101 is further configured to process the target corpus data and store the processed corpus data in the first database 102. The processed corpus data includes the texts, the keywords, and the similarities between the texts and the keywords.
The first database 102 is configured to store the processed corpus data. The first database 102 is further configured to, after receiving a keyword sent by a customer-service robot, retrieve the text related to the keyword from the first database and return the text to the customer-service robot.
The second database 103 is configured to store the target corpus data. Illustratively, the target corpus data may include customer-service corpus data common to multiple industries and customer-service corpus data of a professional domain.
The industry-general customer-service corpus data is text in question-and-answer form. For example: Q: What is the service hotline? A: 400-0000-0000. Q: How do I reach a human agent? A: Press 0 to be transferred to a human agent. And so on for similar industry-general customer-service corpus data.
The professional-domain customer-service corpus data is plain-text customer-service corpus data or dialogue-form customer-service corpus data, for example, the process of bulk commercial procurement, the after-sales application process, or the billing and settlement method.
The embodiments of this application provide a corpus construction method applied to the corpus construction system shown in Fig. 1. The method can be applied to corpus construction in many fields, such as e-commerce customer service, network-operator customer service, and bank customer service. The embodiments of this application are described in detail using the e-commerce customer-service field as an example.
As shown in Fig. 2, the corpus construction method includes:
Step 101: obtain a target corpus database and a keyword database. The target corpus database includes at least one text, and the keyword database includes at least one first keyword.
The target corpus database includes one or a combination of the following: industry-general customer-service corpus data and professional-domain customer-service corpus data. The target corpus data is plain-text corpus data (i.e., texts composed of Chinese characters). Each text can be split into multiple keywords by a segmentation algorithm or segmentation tool.
The keyword database may be an existing, previously built keyword database, or a keyword database rebuilt in the embodiments of this application; this application does not limit this. The first keywords include keywords of industry-general customer-service corpus data and/or keywords of professional-domain customer-service corpus data.
Specifically, the corpus construction apparatus obtains the target corpus database and the keyword database from the multiple second databases respectively.
Step 102: determine the similarity between each text in the at least one text and each first keyword in the at least one first keyword.
Specifically, any first text in the at least one text can be converted into a first vector, and any keyword in the at least one first keyword can be converted into a second vector. The vector distance between the first vector and the second vector is computed and taken as the similarity between the first text and the first keyword. In other words, the text and the first keyword are each converted into a mathematical vector, and the similarity between the text and the first keyword is computed from the distance between the two vectors.
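As an illustration of this step (the patent does not fix a particular distance; cosine similarity over word-count vectors is assumed here as one common choice):

```python
import math
from collections import Counter

def to_vector(tokens, vocab):
    """Convert a token list into a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["how", "switch", "human", "agent", "refund"]
text_vec = to_vector(["how", "switch", "human", "agent"], vocab)
kw_vec = to_vector(["human", "agent"], vocab)
sim = cosine_similarity(text_vec, kw_vec)
# A keyword sharing no words with the text scores 0.0.
unrelated = cosine_similarity(text_vec, to_vector(["refund"], vocab))
```

In practice the vectors could equally be word embeddings; only the distance computation matters for this step.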
Illustratively, for the text "How do I reach a human agent?", the keyword database may contain the first keywords "how", "switch", and "human agent", as well as other first keywords with low relevance to the text, such as "buy", "product", "exchange", and "payment".
The corpus construction apparatus determines, by a predetermined algorithm, the similarity of each of the first keywords "how", "switch", and "human agent" to the text. For example, the first keyword "human agent" is the item of highest importance in this text, so its similarity to the text is high. The first keyword "how" is a generic interrogative, so its similarity to the text is low. The first keyword "switch" denotes the action being performed, with importance between "human agent" and "how", so its similarity to the text is at an intermediate level.
Illustratively, in step 102, the similarity between each first keyword and each text can be computed by the keyword-extraction algorithms TextRank and LDA described above.
It should be understood that one text may correspond to multiple first keywords, and the similarities between the text and each of those first keywords may be the same or different.
Meanwhile first keyword may also correspond to multiple texts.One keyword corresponds to multiple text.The key
The similarity of word and each text in multiple text can be the same or different.
For example, let the two first texts be a and b, and the two first keywords be c and d. The similarity of a and c is m1, the similarity of a and d is m2, the similarity of b and c is m3, and the similarity of b and d is m4. The values m1, m2, m3, and m4 may be equal or different.
Step 103, according between each text, each first keyword and each text and each first keyword
Similarity obtains corpus.
After the corpus construction apparatus has determined the similarity between each text in the target corpus data and the first keywords, an entity that needs to call texts, such as a customer-service robot, can determine the corresponding text from multiple keywords and the similarities between those keywords and the texts.
Therefore, the corpus construction apparatus establishes a mapping relationship between each text and each first keyword, and stores each text, each first keyword, and the mapping relationship between them in the database to obtain the corpus. The mapping relationship can be implemented specifically as the similarity between the text and the keyword. Illustratively, the corpus construction apparatus stores each text, each first keyword, and the mapping relationship between them in a MongoDB database to obtain the first corpus.
In the corpus construction method provided by the embodiments of this application, a target corpus database and a keyword database are obtained, wherein the target corpus database includes at least one text and the keyword database includes at least one first keyword. The similarity between each text in the at least one text and each first keyword in the at least one first keyword is determined, so that texts and first keywords are associated through similarity, and the corresponding text can be retrieved through multiple keywords and a similarity search. A corpus is then obtained according to each text, each first keyword, and the similarity between each text and each first keyword. This ensures that the corpus-building process is largely automated, so a corpus can be built quickly and conveniently.
With reference to Fig. 2, as shown in Fig. 3, step 102 may be specifically implemented as:
Step 104: determine at least one second keyword of the first text.
Each second keyword in the at least one second keyword has a weight value, and the weight values of the second keywords in the first text differ from one another.
In one implementation of step 104, the weight value of a second keyword in the first text can be determined by the frequency with which the second keyword occurs in the first text and the position at which it occurs in the first text.
For example, suppose the first text is an article. If a second keyword appears in the title, its weight value is 500; if it appears n times in the title, its weight value is 500 × n. If the second keyword appears in a subtitle, its weight value is 100; if it appears n times in subtitles, its weight value is 100 × n. If the second keyword appears in the body of the article, its weight value is 10; if it appears n times in the body, its weight value is 10 × n.
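This position-and-frequency weighting rule can be sketched as follows (the 500/100/10 constants come from the example above; the tokenized-section input format is an assumption for illustration):

```python
from collections import Counter

# Weight per occurrence, by the position where the keyword appears.
POSITION_WEIGHTS = {"title": 500, "subtitle": 100, "body": 10}

def keyword_weights(sections):
    """sections: {position: token list}. Returns total weight per keyword,
    summing position weight x occurrence count over all positions."""
    weights = Counter()
    for position, tokens in sections.items():
        per_occurrence = POSITION_WEIGHTS[position]
        for token, count in Counter(tokens).items():
            weights[token] += per_occurrence * count
    return dict(weights)

article = {
    "title": ["refund", "policy"],
    "subtitle": ["refund", "deadline"],
    "body": ["refund", "refund", "deadline"],
}
w = keyword_weights(article)
# "refund": 500 + 100 + 2*10 = 620; "deadline": 100 + 10 = 110
```

A title word thus dominates a body word even if the body word occurs many times more often.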
Specifically, the corpus construction apparatus selects the first text from the target corpus data and segments the first text with a segmentation tool or segmentation algorithm to obtain at least one keyword.
For example, the first text is segmented by the pattern-matching or n-gram analysis methods described above, yielding keywords that fit the usage scenario of the first text and can express its meaning.
Step 105: determine a first similarity between each second keyword and each first keyword.
After the first text has been segmented to obtain at least one keyword, the similarity between each first keyword and the first text can be determined by computing the similarity between each first keyword and each second keyword.
For example, segmenting the first text "How do I reach a human agent?" yields three second keywords: "how", "switch", and "human agent". The keyword database contains first keywords corresponding to these three second keywords ("how", "switch", "human agent") as well as other first keywords with low relevance to them, such as "buy", "product", "exchange", and "payment". The similarity of each of the three second keywords to each of the seven first keywords is computed. For example, the final results are: the similarity of the second keyword "how" to the first keyword "how" is 100%; the similarity of the second keyword "switch" to the first keyword "switch" is 100%; and the similarity of the second keyword "human agent" to the first keyword "human agent" is 100%.
The similarities between the second keywords "how", "switch", and "human agent" and the other first keywords are low and can be excluded by a preset threshold.
In one implementation of step 105, the corpus construction apparatus converts the second keyword into a second keyword vector and determines the first keyword vector. It computes the vector distance between the second keyword vector and the first keyword vector and takes that vector distance as the similarity between the two.
Specifically, after the corpus construction apparatus has segmented the first text into at least one second keyword, it selects a target second keyword from the at least one second keyword, the target second keyword being any one of the at least one second keyword. The corpus construction apparatus converts the target second keyword into a target second keyword vector, determines the distance between the target second keyword vector and the first keyword vector, and takes that distance as the similarity between the target second keyword and the first keyword.
The corpus construction apparatus determines, by this method, the similarity between each second keyword in the at least one second keyword and the first keyword; that is, the corpus construction apparatus determines the first similarity between each second keyword and each first keyword by this method.
It should be understood that the first keyword may be stored in the keyword database in vector form, or the corpus construction apparatus may convert the first keyword into a first keyword vector after obtaining the first keyword from the keyword database.
In one implementation of step 105, the first similarity may be the sum of the similarities between each second keyword and each first keyword; it may be the similarity between the second keyword and the first keyword in the keyword database most similar to that second keyword; or it may be the sum of the similarities between the second keyword and those first keywords in the keyword database whose similarity to the second keyword exceeds a preset threshold.
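The three aggregation options above can be sketched as follows (illustrative only; `sims` is assumed to map each first keyword to its similarity with one second keyword):

```python
def aggregate(sims, mode, threshold=0.5):
    """Aggregate {first_keyword: similarity} values into one first similarity.

    mode 'sum': total over all first keywords;
    mode 'max': similarity of the single most similar first keyword;
    mode 'threshold_sum': total over first keywords above the threshold."""
    if mode == "sum":
        return sum(sims.values())
    if mode == "max":
        return max(sims.values())
    if mode == "threshold_sum":
        return sum(v for v in sims.values() if v > threshold)
    raise ValueError(f"unknown mode: {mode}")

sims = {"human agent": 1.0, "switch": 0.6, "buy": 0.1}
# sum -> 1.7, max -> 1.0, threshold_sum (threshold 0.5) -> 1.6
```

The thresholded sum keeps the contribution of several strong matches while discarding noise such as "buy".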
Step 106: determine the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
Specifically, the similarity between the first text and the first keyword is computed according to a preset rule from the weight value of each second keyword in the first text determined in step 104 and the first similarity determined in step 105.
Illustratively, each weight value and each first similarity are first normalized, a first coefficient is assigned to the weight value, and a second coefficient is assigned to the first similarity. The product of the normalized weight value and the first coefficient and the product of the normalized first similarity and the second coefficient are computed, and the sum of these two products is taken as the similarity between the first text and the first keyword.
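A minimal sketch of this combination step (normalization by the maximum value and the equal coefficient values are assumptions for illustration; the patent leaves both unspecified):

```python
def text_keyword_similarity(weight, first_sim, max_weight, max_sim,
                            coef_weight=0.5, coef_sim=0.5):
    """Combine a normalized keyword weight and a normalized first similarity
    into the text-keyword similarity: c1 * w_norm + c2 * s_norm."""
    w_norm = weight / max_weight if max_weight else 0.0
    s_norm = first_sim / max_sim if max_sim else 0.0
    return coef_weight * w_norm + coef_sim * s_norm

# "human agent": a title-level weight and a perfect first similarity
# give the maximum combined score of 1.0.
sim = text_keyword_similarity(weight=500, first_sim=1.0,
                              max_weight=500, max_sim=1.0)
```

With equal coefficients, a keyword that is both prominent in the text and close to a database keyword scores highest.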
In the embodiments of this application, the first text is segmented to obtain the second keywords that represent it, and the first similarity between the second keywords and the first keywords is computed. The keywords corresponding to the first text and the similarity between them are then determined from the first similarity and the weight values of the second keywords, so the similarity between the first text and the first keywords can be obtained accurately.
With reference to Fig. 2, as shown in Fig. 4, step 103 may be specifically implemented as:
Step 107: from the first keywords, determine the target first keywords whose similarity to the first text exceeds a preset threshold. The target first keywords are the at least one first keyword whose similarity to the first text exceeds the preset threshold.
Specifically, the keyword database may contain a large number of first keywords, of which perhaps only a few to a few dozen are relevant to the first text. The higher the similarity between a first keyword and the first text, the stronger their correlation. The first keywords can therefore be filtered by setting a preset threshold, yielding the target first keywords.
Step 108: determine the first triple corresponding to each first text in the at least one first text. The first triple includes the first text, the target first keywords, and the similarity between the first text and each target first keyword.
Specifically, the first text, the target first keywords corresponding to the first text determined in step 107, and the similarity between the first text and the target first keywords are stored in triple form.
It should be understood that the triple may be a text-oriented triple or a keyword-oriented triple.
When the triple is text-oriented, it includes the first text, the target first keywords corresponding to the first text, and the similarities between the first text and the target first keywords. When the triple is keyword-oriented, it includes the first keyword, each first text in the at least one first text corresponding to the first keyword, and the similarities between the first keyword and the at least one first text.
Step 109: determine the first triple corresponding to each first text as the corpus.

The corpus establishing device stores the first triple corresponding to each first text in the first database, obtaining the corpus to be constructed in the present application.

The embodiment of the present application stores the first text, the target first keywords, and the similarity between them in the form of triples, so that the first texts corresponding to a keyword can be queried more efficiently.
In an implementation of the embodiment of the present application, the target corpus data is one, or a combination, of the following: customer service corpus data common to multiple industries, and customer service corpus data of a professional domain.

Before determining the similarity between each first text in the target corpus data and the first keywords, the corpus establishing device needs to obtain the original customer service corpus data and process it to obtain the target corpus data.
When the target corpus data is the customer service corpus data common to multiple industries, the corpus establishing device processes the original customer service corpus data common to multiple industries according to the first corpus processing rule, obtaining the customer service corpus data common to multiple industries, that is, the target corpus data.

When the target corpus data is the customer service corpus data of a professional domain, the corpus establishing device processes the original customer service corpus data of the professional domain according to the second corpus processing rule, obtaining the customer service corpus data of the professional domain, that is, the target corpus data.

When the target corpus data is a combination of the two, the corpus establishing device processes the original customer service corpus data common to multiple industries according to the first corpus processing rule to obtain the customer service corpus data common to multiple industries, processes the original customer service corpus data of the professional domain according to the second corpus processing rule to obtain the customer service corpus data of the professional domain, and combines the two to obtain the target corpus data.
Specifically, with reference to Fig. 2, as shown in Fig. 5, when in step 101 the corpus establishing device obtains the target corpus database by processing the original customer service corpus data of a professional domain according to the second corpus processing rule, step 101 may specifically be implemented as:

Step 110: obtain first corpus data including multiple texts.

The first corpus data is the above-mentioned original customer service corpus data of the professional domain.
Taking e-commerce as the professional domain for example, customer service corpus data for e-commerce is generally stored in the operation service databases of e-commerce platforms; such strongly domain-specific customer service data is difficult to obtain through channels such as internet searches. Therefore, in the embodiment of the present application, the original e-commerce customer service corpus data is obtained from an e-commerce operation service database and stored on a corresponding storage medium (such as a hard disk or a database server).
Step 111: fill the missing data in the first corpus data to obtain second corpus data.

The missing data is data absent from the first corpus data, for example data lost because the device storing the first corpus data was partially damaged, or because certain data was deleted through human error.
At present, the most common approach is to fill missing data with the most probable value, which can be determined, for example, by regression, Bayesian formal methods, or decision tree induction. Such methods infer the missing data from the existing data, so the filled values have a greater chance of preserving the relationships with the other attributes.

Other methods of handling missing data include replacing it with a global constant, filling it with the average value of the attribute, or classifying all tuples by some attribute and filling with the average value of the attribute within the same class. The application is not limited in this respect.
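Two of the filling strategies mentioned above — the overall attribute mean and the per-class attribute mean — can be sketched as follows; the record shapes are illustrative assumptions.

```python
# Fill missing values (represented as None) with the attribute mean,
# either over all records or within each class.

def mean_fill(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def class_mean_fill(rows):
    """rows: (class_label, value) pairs; fill None with the mean of its class."""
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    means = {label: sum(vs) / len(vs) for label, vs in by_class.items()}
    return [(label, means[label] if v is None else v) for label, v in rows]

print(mean_fill([1.0, None, 3.0]))                        # [1.0, 2.0, 3.0]
print(class_mean_fill([("a", 2.0), ("a", None), ("b", 4.0)]))
```

The regression and decision tree approaches mentioned first would replace the mean with a value predicted from the other attributes, at higher cost but usually with better preservation of inter-attribute relationships.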
Step 112: process the noise data in the second corpus data to obtain third corpus data.

Noise data is erroneous data, or data containing errors, in the second corpus data. Noise is a random error or deviation in a measured variable, and includes erroneous data and outlier data that deviates from expectations. The following techniques can be used to smooth noise data and to identify and delete outlier data.
1. Binning: distribute the stored data into bins and smooth each value locally using the other data in its bin. Specifically, smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries can be used.
2. Regression: find an appropriate regression function to smooth the data. Linear regression finds the "best" straight line for two variables so that one can predict the other; multiple linear regression involves several variables and fits the data to a multidimensional surface.
3. Combined computer and manual inspection: the computer compares the data with known normal data and outputs to a table the patterns whose deviation exceeds some threshold; the patterns in the table are then reviewed manually to identify the outliers.
4. Clustering: similar data are organized into groups or "clusters", and data falling outside the clusters are regarded as outliers. An outlier pattern may be junk data, or it may be significant, informative data; junk data is removed from the database.

The noise data in the second corpus data is deleted in the above ways, obtaining the third corpus data.
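The binning technique (item 1 above) can be sketched as follows, assuming equal-depth bins over sorted values; the bin size is an illustrative parameter.

```python
# Smooth sorted numeric data by bin means and by bin boundaries.

def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of its bin."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

def smooth_by_bin_boundaries(values, bin_size):
    """Snap every value to the nearer of its bin's min or max."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        lo, hi = min(bin_), max(bin_)
        out.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return out

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
print(smooth_by_bin_means(data, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

For corpus text rather than numeric fields, binning would apply to numeric attributes of the records (lengths, timestamps, counts) rather than the text itself.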
Step 113: convert the format of each text included in the third corpus data to the target text format, obtaining the target corpus database.

Since the corpus establishing device may assemble the first corpus data from multiple databases when obtaining it, the text format of each corpus item in the first corpus data may differ. Even within a single database, the formats of Chinese text, symbols, numbers, and English text may vary between items. Texts of different formats therefore need to be converted into a common format.
Illustratively, Chinese text is currently usually stored in UTF-8, and Chinese-language text accounts for most of the data in the first corpus data; the characters in each text can therefore be converted to UTF-8 for storage.
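A minimal sketch of this normalization, assuming the mixed sources use a small set of known encodings; the candidate list is an assumption, and real corpus data may need proper charset detection instead.

```python
# Normalize raw bytes of unknown encoding to a Python string, so the text
# can be re-stored uniformly as UTF-8.

def to_utf8(raw: bytes, candidates=("utf-8", "gb18030", "big5")) -> str:
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Fall back to UTF-8 with replacement characters rather than fail.
    return raw.decode("utf-8", errors="replace")

gbk_bytes = "退款".encode("gb18030")   # text stored in a legacy encoding
text = to_utf8(gbk_bytes)
utf8_bytes = text.encode("utf-8")      # uniform UTF-8 storage
```

Trying UTF-8 first is deliberate: valid UTF-8 rarely decodes incorrectly, while legacy multi-byte encodings often decode each other's bytes into mojibake without raising an error, so their order in the candidate list matters.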
With reference to Fig. 2, as shown in Fig. 6, when in step 101 the corpus establishing device obtains the target corpus database by processing the original customer service corpus data common to multiple industries according to the first corpus processing rule, step 101 may specifically be implemented as:
Step 114: obtain fourth corpus data.

The fourth corpus data is the original customer service corpus data common to multiple industries. It includes multiple texts in a variety of question-answer formats: one question with one answer (one question corresponds to one answer), one question with multiple answers (one question corresponds to multiple answers), multiple questions with one answer (multiple questions correspond to one answer), and multiple questions with multiple answers (multiple questions correspond to multiple answers).

Customer service corpus data common to multiple industries already exists in the prior art. Such data is usually structured, so the above-mentioned steps of filling missing data and processing noise data are not needed.
Step 115: process the fourth corpus data according to a preset data reduction rule, obtaining fifth corpus data.

The number of texts in the fifth corpus data is smaller than that in the fourth corpus data.

Data reduction techniques produce a reduced representation of the data set that preserves the integrity of the original data while being much smaller in volume. Compared with mining the unreduced data, mining the reduced data requires less time and memory, is more effective, and produces the same or almost the same analysis results.
The preset data reduction rule can be any of the following reduction methods:

1. Dimensionality reduction: reduce the data volume by deleting irrelevant attributes (dimensions). This not only compresses the data set but also reduces the number of attributes appearing in the discovered patterns. Attribute subset selection is generally used to find a minimal attribute set such that the probability distribution of the data classes is close to the original distribution using all attributes. Heuristic attribute subset selection techniques include: stepwise forward selection, which starts from an empty attribute set and repeatedly adds the "best" attribute from the original set; stepwise backward elimination, which starts from the full attribute set and at each step deletes the "worst" remaining attribute; a combination of forward selection and backward elimination, which at each step selects the "best" attribute and deletes the "worst"; and decision tree induction, which builds a classification tree using an information gain measure, the attributes appearing in the tree forming the reduced attribute set.
2. Data compression: apply data encoding or transformation to obtain a reduced or compressed representation of the original data. Data compression may be lossless or lossy; popular and effective lossy methods are the wavelet transform and principal component analysis. The wavelet transform compresses sparse or skewed data and data with ordered attributes well; principal component analysis is computationally cheap, can be applied to ordered or unordered attributes, and can handle sparse or skewed data.
3. Numerosity reduction: reduce the data volume by choosing alternative, smaller data representations. Numerosity reduction techniques may be parametric or non-parametric. Parametric methods fit the data with a model, so that only the model parameters need to be stored rather than the actual data. Two parametric techniques are regression (linear and multiple regression) and log-linear models (approximating multidimensional probability distributions over a discrete attribute set).
There are three non-parametric numerosity reduction techniques: histograms, a popular form of numerosity reduction that uses binning to approximate the data distribution, among which V-optimal and MaxDiff histograms are the most accurate and practical; clustering, which treats data tuples as objects and partitions them into groups or clusters such that objects within a cluster are "similar" and objects in different clusters are "dissimilar", the clusters replacing the actual data in data reduction; and sampling, which represents a large data set with a much smaller random sample, for example simple random sampling, cluster sampling, and stratified sampling.
4. Concept hierarchies: a concept hierarchy discretizes a numerical attribute by collecting low-level concepts and replacing them with higher-level concepts. Concept hierarchies can be used to reduce data: although detail is lost in this generalization, the generalized data is more meaningful, easier to interpret, and requires less space than the original data. For numerical attributes, the diversity of possible value ranges and the frequency of data value updates make it difficult to declare concept hierarchies explicitly; they can instead be constructed automatically from an analysis of the data distribution, for example using binning, histogram analysis, cluster analysis, entropy-based discretization, and natural partitioning.

Categorical data is itself discrete: a categorical attribute has a limited number of distinct values with no ordering among them. One approach is for a user or expert to explicitly declare a partial or total order over the attributes at the schema level, from which a concept hierarchy is obtained. Another is to declare a set of attributes without specifying their order, and have the system generate an attribute order from the number of distinct values of each attribute, automatically constructing a meaningful concept hierarchy.
Step 116: convert the question-answer format of each text among the multiple texts in the fifth corpus data to a preset question-answer format, obtaining the target corpus data.

In the obtained original customer service corpus data common to multiple industries, the expected question-answer format may differ across domains; the embodiment of the present application therefore unifies the question-answer format of the texts.
Illustratively, the embodiment of the present application unifies all question-answer formats into one-question-one-answer texts.

For example, data with one question and multiple answers is split into multiple records, each answer paired with a copy of the question, forming multiple one-question-one-answer records. Data with multiple questions and one answer is split likewise, each question paired with a copy of the answer. Data with multiple questions and multiple answers is split in the same way, again forming multiple one-question-one-answer records.
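The splitting just described can be sketched in a few lines; the record shape (lists of questions and answers) is an illustrative assumption.

```python
# Flatten any of the four question-answer formats into
# one-question-one-answer pairs.

def to_single_qa(questions, answers):
    """Split an m-question / n-answer record into m*n one-question-one-answer
    pairs; one-to-many, many-to-one and many-to-many all reduce to this."""
    return [(q, a) for q in questions for a in answers]

# one question, multiple answers -> multiple one-question-one-answer records
pairs = to_single_qa(["How do I return an item?"],
                     ["Use the returns page.", "Contact customer service."])
# [('How do I return an item?', 'Use the returns page.'),
#  ('How do I return an item?', 'Contact customer service.')]
```

A one-question-one-answer record is simply the m = n = 1 case, which passes through unchanged.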
In an implementation of the embodiment of the present application, with reference to Fig. 3, as shown in Fig. 7, step 104 may specifically be implemented as:
Step 117: segment the first text, obtaining at least one third keyword.

Specifically, the first text is composed of multiple words whose combination carries the content of the first text. The first text is first segmented, dividing it into multiple third keywords, each of which can represent part of the meaning of the first text.
Step 118: tag the at least one third keyword, obtaining at least one second keyword.

The part of speech of each third keyword is tagged to obtain more accurate, more representative second keywords. The part of speech of a third keyword includes, but is not limited to: pronoun, noun, adjective, verb, adverb, numeral, measure word, article, preposition, conjunction, particle, interjection, and onomatopoeia.

In Chinese, the same word may express entirely different meanings under different parts of speech. Tagging the part of speech of the third keywords makes it possible to use the part of speech to assign a weight to a third keyword, or to use it further when computing the similarity between the second keywords and the first keywords, so that keywords correspond to texts more accurately.
In step 118, the third keywords can be tagged by building a machine learning classification model or a deep learning classification model. Such automatic tagging greatly reduces tagging time and workload.

Since automatic tagging may be inaccurate, the automatic tagging results can be corrected by manual tagging after automatic tagging is completed, keeping the final results as accurate as possible.
Step 119: determine the weighted value of a second keyword in the first text according to the attributes of the second keyword.

The attributes of the second keyword include: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the tag of the second keyword.
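One way the three attributes in step 119 might be combined into a weight is sketched below; the combination formula, the position decay, and the per-tag factors are illustrative assumptions, since the patent does not fix a specific formula.

```python
# Weight a second keyword from its frequency in the text, the position of
# its first occurrence, and its part-of-speech tag.

TAG_FACTOR = {"noun": 1.0, "verb": 0.8, "adjective": 0.6}  # assumed values

def keyword_weight(tokens, keyword, tag):
    freq = tokens.count(keyword) / len(tokens)
    # Earlier occurrences count more: 1.0 at the start, decaying to ~0.5.
    first = tokens.index(keyword)
    position = 1.0 - 0.5 * first / max(len(tokens) - 1, 1)
    return freq * position * TAG_FACTOR.get(tag, 0.5)

tokens = ["refund", "request", "for", "damaged", "refund", "order"]
w = keyword_weight(tokens, "refund", "noun")
```

Under these assumed factors, a frequent noun at the start of the text outweighs a rarer adjective later in it, which matches the intent of using all three attributes together.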
In an implementation of the embodiment of the present application, a method is also provided by which a robot customer service retrieves, according to a user's request, the corresponding text from the constructed corpus and sends the text to the client.

Illustratively, this method may specifically be implemented as follows.
The robot customer service detects the sentence input by the user and segments it, obtaining three keywords: keyword one, keyword two, and keyword three. Each keyword corresponds to at least one text, with a similarity to each of its corresponding texts. Keyword one points to four texts: A, B, C, and O. Keyword two points to five texts: A, C, L, M, and N. Keyword three points to three texts: A, C, and P. The robot customer service can determine the text corresponding to the user's sentence in at least the following two ways.
Way one: the robot customer service determines that the text corresponding to the user's sentence may be A or C, since these are the texts pointed to by all three keywords. It then determines the similarities of keyword one, keyword two, and keyword three with A, and their similarities with C, and sends the user whichever of A and C has the higher total similarity. For example, suppose the similarities of keyword one, keyword two, and keyword three with A are 0.1, 0.3, and 0.5 respectively, summing to 0.9, while their similarities with C are 0.2, 0.1, and 0.3, summing to 0.6. Since the total similarity with A exceeds that with C, the robot customer service sends text A to the user.
Way two: the robot customer service determines the similarities of keyword one with A, B, C, and O, for example 0.1, 0.2, 0.2, and 0.6; the similarities of keyword two with A, C, L, M, and N, for example 0.3, 0.1, 0.3, 0.4, and 0.1; and the similarities of keyword three with A, C, and P, for example 0.5, 0.3, and 0.2. The robot customer service determines that A is the text with the highest total similarity across keyword one, keyword two, and keyword three, and sends A to the user.
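"Way two" above amounts to summing each candidate text's similarity over all query keywords and taking the maximum; a minimal sketch, reusing the numbers from the example:

```python
# Pick the corpus text with the highest total similarity over the query
# keywords.

def best_text(keyword_to_texts):
    """keyword_to_texts: {keyword: {text: similarity}} -> highest-total text."""
    totals = {}
    for sims in keyword_to_texts.values():
        for text, s in sims.items():
            totals[text] = totals.get(text, 0.0) + s
    return max(totals, key=totals.get)

answer = best_text({
    "keyword one":   {"A": 0.1, "B": 0.2, "C": 0.2, "O": 0.6},
    "keyword two":   {"A": 0.3, "C": 0.1, "L": 0.3, "M": 0.4, "N": 0.1},
    "keyword three": {"A": 0.5, "C": 0.3, "P": 0.2},
})
# totals: A = 0.9, C = 0.6, O = 0.6, M = 0.4, ... -> answer is "A"
```

"Way one" differs only in first intersecting the candidate sets (keeping A and C, the texts all three keywords point to) before summing, which prunes texts matched by a single keyword.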
It should be understood that the robot customer service can send text A to the user in various forms. For example, in a voice-call robot customer service, text A can be sent by voice broadcast or by short message; in a web or APP robot customer service, text A can be sent directly in the dialog box, or the user can be sent a link pointing to text A, etc. The application does not limit this.
In an implementation of the embodiment of the present application, if after sending text A the robot customer service learns from the user's reply that A is not the text the user wants, it sends the user the text with the next highest similarity besides A: in way one, the robot customer service then sends text C; in way two, it then sends text C and/or text O.
In an implementation of the embodiment of the present application, the similarity between the first text and the first keywords can also be determined as follows.

Input to the similarity calculation tool: the keyword database and the target corpus data.

Output: a list of triples formed of the first text, the keyword, and the similarity.
Specifically:

The input of the algorithm is the keyword database and the target corpus data, and the output is a list of triples formed from them; from this triple list, the list of keywords with higher similarity to the first text can be obtained. The algorithm operates on each keyword instance in the keyword database, taking all of its attributes as the description of that keyword, and then obtains, with the TF-IDF algorithm, the second keywords that best represent the first text. The Jaccard similarity between the second keywords and the keywords in the keyword database is computed; HowNet is then used to compute the semantic similarity between the first text and the first keyword. Finally, the Jaccard similarity and the semantic similarity are added to obtain the similarity between the first text and the first keyword. The algorithm thus produces a list of (first text, keyword, similarity) triples.
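A sketch of this triple-list algorithm: TF-IDF selects the second keywords of each text, Jaccard similarity compares them with each first keyword's attribute set, and a semantic-similarity term is added. The HowNet-based semantic similarity is stubbed out as a placeholder, and the data shapes are illustrative assumptions.

```python
import math

def tfidf_top_terms(doc_tokens, all_docs, k):
    """Pick the k terms of doc_tokens with the highest TF-IDF scores."""
    scores = {}
    for t in set(doc_tokens):
        tf = doc_tokens.count(t) / len(doc_tokens)
        df = sum(1 for d in all_docs if t in d)
        scores[t] = tf * math.log(len(all_docs) / df)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_similarity(text_tokens, keyword):
    return 0.0  # placeholder for the HowNet-based computation

def triple_list(docs, keyword_attrs, k=3):
    """keyword_attrs: {first keyword: set of attribute terms describing it}.
    Returns (text, keyword, similarity) triples."""
    triples = []
    for doc in docs:
        second = tfidf_top_terms(doc, docs, k)
        for kw, attrs in keyword_attrs.items():
            sim = jaccard(second, attrs) + semantic_similarity(doc, kw)
            triples.append((tuple(doc), kw, sim))
    return triples
```

Adding the two similarity terms, as described above, lets lexical overlap (Jaccard) and word-sense relatedness (HowNet) compensate for each other: synonymous keywords with no surface overlap can still score well once the semantic term is filled in.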
According to the above method examples, the embodiment of the present application may divide the corpus establishing device into functional modules or functional units, for example dividing each function into a corresponding functional module or unit, or integrating two or more functions into one processing module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module or unit. The division of modules or units in the embodiment of the present application is schematic and is only one kind of logical functional division; other division manners are possible in actual implementation.
The embodiment of the present application provides a corpus establishing device, applied to the corpus establishing system shown in Fig. 1 above. As shown in Fig. 8, the corpus establishing device includes:
An acquiring unit 801 for obtaining the target corpus database and the keyword database, where the target corpus database includes at least one text and the keyword database includes at least one first keyword.

A processing unit 802 for determining the similarity between each text in the at least one text and each first keyword in the at least one first keyword.

The processing unit 802 is also used to obtain the corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
Optionally, the processing unit 802 is also used to: determine at least one second keyword of the first text, each second keyword in the at least one second keyword having a weighted value; determine the first similarity between each second keyword and each first keyword; and determine the similarity between the first text and each first keyword according to the first similarity and the weighted value of each second keyword.
Optionally, the processing unit 802 is also used to: determine, from the first keywords, the target first keywords whose similarity with the text is greater than the preset threshold; determine the first triple corresponding to each text in the at least one text, the first triple including the text, the target first keywords, and the similarity between the text and each first keyword; and determine the first triple corresponding to each text as the corpus.
Optionally, the processing unit is also used to: segment the first text to obtain at least one third keyword; tag the at least one third keyword to obtain at least one second keyword; and determine the weighted value of the second keyword in the first text according to the attributes of the second keyword, which include the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the tag of the second keyword.
Optionally, the acquiring unit 801 is also used to obtain the first corpus data including multiple texts. The processing unit 802 is also used to fill the missing data in the first corpus data to obtain the second corpus data, and to process the noise data in the second corpus data to obtain the third corpus data, the noise data being erroneous data, or data containing errors, in the second corpus data.

The processing unit 802 is also used to convert the format of each text included in the third corpus data to the target text format, obtaining the target corpus database.
Optionally, the acquiring unit 801 is also used to obtain the fourth corpus data, which includes multiple texts in a variety of question-answer formats: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers. The processing unit 802 is also used to process the fourth corpus data according to the preset data reduction rule to obtain the fifth corpus data, the number of texts in which is smaller than in the fourth corpus data, and to convert the question-answer format of each text among the multiple texts in the fifth corpus data to the preset question-answer format, obtaining the target corpus data.
When implemented in hardware, the receiving unit, acquiring unit, and sending unit in the embodiment of the present application can be integrated in a communication interface, and the processing unit 802 can be integrated on a processor. A specific implementation is shown in Fig. 9.
Fig. 9 shows another possible structural schematic diagram of the corpus establishing device involved in the above embodiment. The corpus establishing device includes a processor 902 and a communication interface 903. The processor 902 is used to control and manage the actions of the corpus establishing device, for example to execute the steps executed by the above-mentioned processing unit 802, and/or to execute other processes of the technology described herein. The communication interface 903 is used to support communication between the corpus establishing device and other network entities, for example to execute the steps executed by the above-mentioned acquiring unit 801. The corpus establishing device may also include a memory 901 and a bus 904, the memory 901 being used to store the program code and data of the corpus establishing device.
The memory 901 may be a memory in the corpus establishing device and may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above kinds of memory.
The above processor 902 can implement or execute the various illustrative logical blocks, modules, and circuits described in the present disclosure. The processor may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 904 may be an Extended Industry Standard Architecture (EISA) bus, etc. The bus 904 may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in Fig. 9, but this does not mean there is only one bus or one type of bus.
Through the above description of the embodiments, it will be clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional modules is illustrated by example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the system, apparatus, and units described above, reference may be made to the corresponding process in the foregoing method embodiments; details are not repeated here.
The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information processing server in the above method embodiments.

The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information reinforcement server in the above method embodiments.

The embodiment of the present application provides a computer program product containing instructions which, when the computer program product runs on a computer, causes the computer to execute the corpus establishing method executed by the information calling terminal in the above method embodiments.
The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information processing server.

The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information reinforcement server.

The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute, in the method flow shown in the above method embodiments, the corpus establishing method executed by the information calling terminal.
The computer-readable storage medium may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a register, an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, any suitable combination of the above, or any other form of computer-readable storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). In the embodiments of the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (14)
1. A corpus establishment method, characterized in that the method comprises:
obtaining a target corpus database and a keyword database, wherein the target corpus database comprises at least one text, and the keyword database comprises at least one first keyword;
determining a similarity between each text in the at least one text and each first keyword in the at least one first keyword; and
obtaining a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
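As a purely illustrative sketch (not part of the claimed subject matter), the flow of claim 1 might look as follows. The Jaccard character-overlap measure, the function names, and the data shapes are all assumptions standing in for whatever similarity computation an implementation would actually use:

```python
def similarity(text: str, keyword: str) -> float:
    """Toy similarity: Jaccard overlap between the character sets."""
    a, b = set(text), set(keyword)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_corpus(texts, first_keywords):
    """For each text, record its similarity to every first keyword."""
    corpus = []
    for text in texts:
        sims = {kw: similarity(text, kw) for kw in first_keywords}
        corpus.append({"text": text, "similarities": sims})
    return corpus

# One text from the target corpus database, two first keywords:
corpus = build_corpus(["how do I reset my password"], ["password", "refund"])
```

The resulting records pair each text with its per-keyword similarities, which is the raw material the later claims refine.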
2. The method according to claim 1, characterized in that determining the similarity between a first text and each first keyword in the at least one first keyword comprises:
determining at least one second keyword of the first text, wherein each second keyword in the at least one second keyword has a weight value, and the first text is any one text in the at least one text;
determining a first similarity between each second keyword and each first keyword; and
determining the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
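The weighted aggregation of claim 2 can be sketched as below. The character-overlap measure and the weighted-average formula are hypothetical stand-ins; the claim itself does not fix either:

```python
def keyword_similarity(kw_a: str, kw_b: str) -> float:
    """First similarity: Jaccard overlap between two keywords' character sets."""
    a, b = set(kw_a), set(kw_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def text_to_keyword_similarity(second_keywords: dict, first_keyword: str) -> float:
    """Weighted average over the text's second keywords.

    second_keywords maps each second keyword of the text to its weight value.
    """
    total_weight = sum(second_keywords.values())
    if total_weight == 0:
        return 0.0
    return sum(
        weight * keyword_similarity(kw, first_keyword)
        for kw, weight in second_keywords.items()
    ) / total_weight

# Second keywords of one text, each with a weight value:
seconds = {"password": 0.7, "reset": 0.3}
sim = text_to_keyword_similarity(seconds, "password")
```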
3. The method according to any one of claims 1-2, characterized in that obtaining the corpus according to each text, each first keyword, and the similarity between each text and each first keyword comprises:
determining, from each first keyword, a target first keyword whose similarity with the text is greater than a preset threshold;
determining a first triple corresponding to each text in the at least one text, wherein the first triple comprises the text, the target first keyword, and the similarity between the text and each first keyword; and
determining the first triple corresponding to each text as the corpus.
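A minimal sketch of claim 3's triple construction follows; the threshold value, the tuple layout, and the input shape are illustrative assumptions:

```python
PRESET_THRESHOLD = 0.3  # assumed value, for illustration only

def to_triples(texts_with_sims, threshold=PRESET_THRESHOLD):
    """Emit (text, target first keywords, similarity map) per text."""
    triples = []
    for text, sims in texts_with_sims:
        # Target first keywords: similarity strictly above the preset threshold.
        targets = [kw for kw, s in sims.items() if s > threshold]
        triples.append((text, targets, sims))
    return triples

triples = to_triples([
    ("how do I reset my password", {"password": 0.5, "refund": 0.18}),
])
```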
4. The method according to claim 2, characterized in that segmenting the first text to obtain the at least one second keyword comprises:
segmenting the first text to obtain at least one third keyword;
labeling the at least one third keyword to obtain the at least one second keyword; and
determining the weight value of a second keyword in the first text according to an attribute of the second keyword, wherein the attribute of the second keyword comprises: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the label of the second keyword.
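Claim 4's attribute-based weighting can be illustrated with a toy example. Everything here is a hypothetical stand-in, since the claim fixes no formula: the whitespace segmentation, the label dictionary, and the way frequency, position, and label are combined:

```python
TOY_LABELS = {"password": "noun", "reset": "verb"}  # assumed label dictionary

def segment(text: str):
    """Third keywords via a naive whitespace segmentation."""
    return text.lower().split()

def weight(keyword: str, tokens) -> float:
    """Combine the three claimed attributes into one weight value."""
    frequency = tokens.count(keyword) / len(tokens)        # frequency in the text
    position = 1.0 - tokens.index(keyword) / len(tokens)   # earlier occurrence => heavier
    label_bonus = 1.5 if TOY_LABELS.get(keyword) == "noun" else 1.0  # label attribute
    return frequency * position * label_bonus

tokens = segment("reset password then reset again")
w_reset = weight("reset", tokens)
w_password = weight("password", tokens)
```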
5. The method according to any one of claims 1-2, characterized in that obtaining the target corpus database comprises:
obtaining first corpus data comprising a plurality of texts;
filling missing data in the first corpus data to obtain second corpus data;
processing noise data in the second corpus data to obtain third corpus data, wherein the noise data is erroneous data or data containing errors in the second corpus data; and
converting the format of each text included in the third corpus data into a target text format to obtain the target corpus database.
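The preprocessing chain of claim 5 could be sketched as a three-stage pipeline. All function names, the default fill value, the noise rule, and the "Q ||| A" target format are illustrative assumptions:

```python
def fill_missing(records, default=""):
    """First -> second corpus data: fill absent fields with a default value."""
    return [{"q": r.get("q", default), "a": r.get("a", default)} for r in records]

def drop_noise(records):
    """Second -> third corpus data: drop records whose answer is empty (noise)."""
    return [r for r in records if r["a"].strip()]

def to_target_format(records):
    """Third corpus data -> target corpus database: one 'Q ||| A' line per text."""
    return [f"{r['q'].strip()} ||| {r['a'].strip()}" for r in records]

raw = [{"q": "hours?", "a": "9-5"}, {"q": "refund?"}]  # second record lacks an answer
target = to_target_format(drop_noise(fill_missing(raw)))
```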
6. The method according to any one of claims 1-2, characterized in that obtaining the target corpus database comprises:
obtaining fourth corpus data, wherein the fourth corpus data comprises a plurality of texts, the plurality of texts comprise texts in a variety of question-and-answer formats, and the question-and-answer formats comprise: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers;
processing the fourth corpus data according to a preset data reduction rule to obtain fifth corpus data, wherein the number of texts in the fifth corpus data is less than the number of texts in the fourth corpus data; and
converting the question-and-answer format of each text in the plurality of texts in the fifth corpus data into a preset question-and-answer format to obtain the target corpus database.
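A sketch of claim 6 under assumed data shapes: reduce the corpus by a rule, then convert every text, whatever its question-and-answer format, into one preset format (here, one-question-one-answer pairs). The reduction rule and the record layout are hypothetical:

```python
def reduce_corpus(records, keep=lambda r: bool(r["questions"] and r["answers"])):
    """Fourth -> fifth corpus data: drop records failing the reduction rule."""
    return [r for r in records if keep(r)]

def normalize(record):
    """Pair every question with every answer: the preset one-to-one format."""
    return [(q, a) for q in record["questions"] for a in record["answers"]]

fourth = [
    {"questions": ["hours?"], "answers": ["9-5", "weekends closed"]},  # 1 question, 2 answers
    {"questions": [], "answers": ["orphan answer"]},                   # dropped by the rule
]
fifth = reduce_corpus(fourth)
target = [pair for r in fifth for pair in normalize(r)]
```

Note that the fifth corpus (1 record) is smaller than the fourth (2 records), matching the claim's size constraint.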
7. A corpus establishment apparatus, characterized in that the apparatus comprises:
an obtaining unit, configured to obtain a target corpus database and a keyword database, wherein the target corpus database comprises at least one text, and the keyword database comprises at least one first keyword; and
a processing unit, configured to determine a similarity between each text in the at least one text and each first keyword in the at least one first keyword;
the processing unit being further configured to obtain a corpus according to each text, each first keyword, and the similarity between each text and each first keyword.
8. The apparatus according to claim 7, characterized in that the processing unit is further configured to:
determine at least one second keyword of a first text, wherein each second keyword in the at least one second keyword has a weight value, and the first text is any one text in the at least one text;
determine a first similarity between each second keyword and each first keyword; and
determine the similarity between the first text and each first keyword according to the first similarity and the weight value of each second keyword.
9. The apparatus according to any one of claims 7-8, characterized in that the processing unit is further configured to:
determine, from each first keyword, a target first keyword whose similarity with the text is greater than a preset threshold;
determine a first triple corresponding to each text in the at least one text, wherein the first triple comprises the text, the target first keyword, and the similarity between the text and each first keyword; and
determine the first triple corresponding to each text as the corpus.
10. The apparatus according to claim 8, characterized in that the processing unit is further configured to:
segment the first text to obtain at least one third keyword;
label the at least one third keyword to obtain the at least one second keyword; and
determine the weight value of a second keyword in the first text according to an attribute of the second keyword, wherein the attribute of the second keyword comprises: the frequency with which the second keyword occurs in the first text, the position of the second keyword in the first text, and the label of the second keyword.
11. The apparatus according to any one of claims 7-8, characterized in that:
the obtaining unit is further configured to obtain first corpus data comprising a plurality of texts;
the processing unit is further configured to fill missing data in the first corpus data to obtain second corpus data;
the processing unit is further configured to process noise data in the second corpus data to obtain third corpus data, wherein the noise data is erroneous data or data containing errors in the second corpus data; and
the processing unit is further configured to convert the format of each text included in the third corpus data into a target text format to obtain the target corpus database.
12. The apparatus according to any one of claims 7-8, characterized in that:
the obtaining unit is further configured to obtain fourth corpus data, wherein the fourth corpus data comprises a plurality of texts, the plurality of texts comprise texts in a variety of question-and-answer formats, and the question-and-answer formats comprise: one question with one answer, one question with multiple answers, multiple questions with one answer, and multiple questions with multiple answers;
the processing unit is further configured to process the fourth corpus data according to a preset data reduction rule to obtain fifth corpus data, wherein the number of texts in the fifth corpus data is less than the number of texts in the fourth corpus data; and
the processing unit is further configured to convert the question-and-answer format of each text in the plurality of texts in the fifth corpus data into a preset question-and-answer format to obtain the target corpus database.
13. A corpus establishment apparatus, characterized by comprising: a processor and a communication interface; wherein the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the method according to any one of claims 1-6.
14. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are executed, the method according to any one of claims 1-6 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910420207.7A CN110222192A (en) | 2019-05-20 | 2019-05-20 | Corpus method for building up and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222192A true CN110222192A (en) | 2019-09-10 |
Family
ID=67821662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910420207.7A Pending CN110222192A (en) | 2019-05-20 | 2019-05-20 | Corpus method for building up and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222192A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103914543A (en) * | 2014-04-03 | 2014-07-09 | 北京百度网讯科技有限公司 | Search result displaying method and device |
CN105183727A (en) * | 2014-05-29 | 2015-12-23 | 上海研深信息科技有限公司 | Method and system for recommending book |
US9411878B2 (en) * | 2014-02-19 | 2016-08-09 | International Business Machines Corporation | NLP duration and duration range comparison methodology using similarity weighting |
CN105955976A (en) * | 2016-04-15 | 2016-09-21 | 中国工商银行股份有限公司 | Automatic answering system and method |
CN106610951A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Improved text similarity solving algorithm based on semantic analysis |
Non-Patent Citations (1)
Title |
---|
何阿静 (He Ajing): "自动问答系统的研究与实现" (Research and Implementation of an Automatic Question Answering System), 《信息科技辑》 (Information Science & Technology Series) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111355715A (en) * | 2020-02-21 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Processing method, system, device, medium and electronic equipment of event to be resolved |
CN111460117A (en) * | 2020-03-20 | 2020-07-28 | 平安科技(深圳)有限公司 | Dialog robot intention corpus generation method, device, medium and electronic equipment |
CN111460117B (en) * | 2020-03-20 | 2024-03-08 | 平安科技(深圳)有限公司 | Method and device for generating intent corpus of conversation robot, medium and electronic equipment |
CN112214586A (en) * | 2020-10-13 | 2021-01-12 | 华东师范大学 | Corpus accumulation method for assisting interview investigation |
CN112214586B (en) * | 2020-10-13 | 2022-06-28 | 华东师范大学 | Corpus accumulation method for assisting interview investigation |
CN112732934A (en) * | 2021-01-11 | 2021-04-30 | 国网山东省电力公司电力科学研究院 | Power grid equipment word segmentation dictionary and fault case library construction method |
CN112732934B (en) * | 2021-01-11 | 2022-05-27 | 国网山东省电力公司电力科学研究院 | Power grid equipment word segmentation dictionary and fault case library construction method |
CN112784052A (en) * | 2021-03-15 | 2021-05-11 | 中国平安人寿保险股份有限公司 | Text classification method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |