
CN108595602A - Question text classification method combining a shallow model and a deep model - Google Patents

Question text classification method combining a shallow model and a deep model

Info

Publication number
CN108595602A
CN108595602A
Authority
CN
China
Prior art keywords
question sentence
model
feature
convolution
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810357603.5A
Other languages
Chinese (zh)
Inventor
黄青松
余慧
郭勃
刘利军
冯旭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201810357603.5A
Publication of CN108595602A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a question text classification method that combines a shallow model with a deep model, and belongs to the technical field of computer natural language processing. The method first extracts the feature word set of the question text; after vectorization, the normalized word vector of each feature word serves as its weight, forming one part of the input to the shallow linear model. A convolutional network applies convolution kernels of several different window sizes to the question text, and the feature vectors extracted by the different kernels that share the same convolution window length are rearranged and fed into corresponding recurrent neural networks. The outputs of the recurrent neural networks are finally concatenated into the syntactic-semantic feature vector of the question, which serves as the other part of the input to the shallow linear model. The shallow model then produces the question's classification result from the input formed by splicing the feature word vectors with the output of the deep model. The invention overcomes the shortcomings of a single deep model and effectively improves the accuracy of question classification.

Description

Question text classification method combining a shallow model and a deep model
Technical field
The present invention relates to a question text classification method combining a shallow model with a deep model, and belongs to the technical field of computer natural language processing.
Background technology
Question text classification is a form of short-text classification and plays an important role in automatic question-answering systems. It assigns a question to a category by analyzing the question's content. Early work used rule-based methods, classifying questions through correspondences between question types and the question's interrogative words or grammatical patterns. Such methods work well for questions with obvious interrogatives or class-indicating feature words, but perform poorly on more complex questions or questions whose text contains no obvious class-indicating words; they are also inflexible, labor-intensive, and highly subjective. With the development of machine learning, machine-learning-based question classification became mainstream: Zhang et al. (26th international ACM conference, 2003) used support vector machines (SVM) with extracted syntactic features of the sentence to classify questions, substantially improving accuracy over earlier methods. Rules have also been combined with machine learning: Li et al. (Journal of Chinese Information Processing, 2008) combined interrogative-word and head-word rules with SVM to further improve classification accuracy. Classification precision, however, depends on technologies such as syntactic analysis; the variable form and complex sentence structure of Chinese text make Chinese syntactic analysis difficult, and current syntactic analysis technology is not mature enough, which limits the accuracy of question text classification.
More recently, with the rise of deep learning, various deep learning frameworks have been widely applied to image processing and natural language processing, achieving breakthrough improvements over conventional methods. For sentence and document modeling and classification, the convolutional neural network (CNN) and the recurrent neural network (RNN) have become the two most common deep learning architectures. Kim (Eprint Arxiv, 2014) modeled sentence text with a convolutional neural network and used the resulting features for text classification; the model is structurally simple yet achieves good classification results and has become a baseline method for text classification. Tang et al. (natural language processing conference, 2015) likewise obtained good results with recurrent neural networks on sentiment classification tasks. Deep models effectively resolve problems of conventional machine learning methods such as complex feature extraction, poor portability, and sparse short-text feature representations. However, because of their overly strong learning capacity, Cheng et al. (arXiv.org, 2016) point out that deep models have difficulty learning effective feature-vector representations for low-frequency features, a problem that arises widely, whereas shallow models such as SVMs and linear models can learn better from features that occur rarely; Cheng et al. combined a logistic regression model with a multilayer perceptron network to improve recommendation accuracy for software applications in Google's application store. Training data in text classification tasks are often imbalanced, and a single deep model struggles to learn effective feature representations for classes with little data.
Summary of the invention
The present invention provides a question text classification method combining a shallow model with a deep model. It addresses the problems a single deep model faces with imbalanced training data by exploiting the strong feature memorization characteristic of traditional shallow models, effectively improving the accuracy of question classification.
The technical scheme of the invention is as follows. The question text classification method combining a shallow model with a deep model comprises the following steps:
Step 1: crawl question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital; then preprocess the corpus texts;
Step 2: extract the feature word set of the question texts in the corpus using the chi-square test (CHI) method, convert each feature word into word-vector form, and use the feature word's normalized word-vector value as its weight, thereby obtaining one part of the shallow linear model's input, Input1;
Step 3: increase the word-vector weight of the question keywords, then feed the question text vector formed from the word-vector matrix into the first part of the deep model, a convolutional network; convolution kernels of several different window sizes are applied to the question text to extract local phrase features of the sentence, and the feature vectors extracted by the different kernels sharing the same convolution window length are rearranged;
Step 4: feed the feature vectors produced in Step 3 into the corresponding recurrent neural networks; through its chain structure, a recurrent neural network can capture the sentence's historical information and learn the long-term dependencies of sequence data, and the output of the last time step contains the feature information of the whole sentence; the outputs of the several recurrent neural networks are concatenated as the final feature of the question, yielding the other part of the shallow linear model's input, Input2;
Step 5: splice Input1 obtained in Step 2 with Input2, the final output of the deep model in Step 4, to form the input of the shallow model; the shallow model part uses a multiple-linear-regression structure and finally produces the question's classification result.
The specific steps of Step 1 are as follows:
Step 1.1: first write a crawler by hand and crawl from Baidu Zhidao the question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital;
Step 1.2: filter and deduplicate the crawled corpus to obtain a duplicate-free question corpus, and store it in a database;
Step 1.3: preprocess the question corpus in the database by word segmentation and stop-word removal.
The specific steps of Step 2 are as follows:
Step 2.1: extract the feature word set of the question texts using the chi-square test (CHI) method;
Step 2.2: convert each feature word from Step 2.1 into word-vector form, using a distributed word-vector representation;
Step 2.3: use each feature word's normalized word-vector value as its weight, finally obtaining the non-syntactic feature representation of the question text, which serves as one part of the shallow model's input, Input1.
The specific steps of Step 3 are as follows:
Step 3.1: extract the question keywords with the tf-idf-based method provided by the jieba toolkit in Python; after every word in the question has been represented as a word vector, repeat each question keyword's word vector once on its left and once on its right, which increases the keyword's weight within the sentence and yields a word-vector matrix;
Step 3.2: feed the question text vector represented by the word-vector matrix from Step 3.1 into the first part of the deep model, the convolutional network, where the number of matrix rows is the number of words in the sentence and the number of columns is the word-vector dimension; two convolution kernels of each of the three convolution window lengths 2, 3, and 4 perform vertical convolution over the question, extracting local features at different positions in the sentence and producing several groups of feature vectors;
Step 3.3: rearrange, by position in the sequence, the feature vectors extracted by the different kernels that share the same convolution window size, so that the features obtained by different kernels at the same sentence position are spliced together.
The specific steps of Step 4 are as follows:
Step 4.1: feed the rearranged features obtained in Step 3.3 from the three convolution window lengths, in sentence order, into the three corresponding recurrent neural networks; LSTM recurrent neural networks are used here to better capture the sentence's earlier historical information and learn the long-term dependencies of the sequence data, and the output of the last time step contains the feature information of the entire question;
Step 4.2: concatenate the outputs of the three recurrent neural networks from Step 4.1 as the final feature representation of the question, obtaining the other part of the shallow linear model's input, Input2.
The specific steps of Step 5 are as follows:
Step 5.1: splice Input1 obtained in Step 2.3 with Input2, the final output from Step 4.2, to form the input of the shallow model; the shallow model here uses a multiple-linear-regression structure, i.e., an ordinary neural network with one final fully connected layer followed by a softmax function;
Step 5.2: pass the input-layer content from Step 5.1 through one hidden layer, then feed the hidden layer's output into the softmax function to obtain the final question classification result.
The deep model part consists of a convolutional network layer and recurrent neural network layers. In the convolutional layer, the text feature obtained by the k-th convolution kernel of window length h is $w_{kh} = [c_{k1}, \dots, c_{k(l-h+1)}]$, where $c_{ki}$ denotes the convolution feature of the k-th kernel at the i-th position of the question text, and $c_{ki} = \mathrm{ReLU}(o_{ki} + b)$, where $o_{ki}$ is the value computed by the convolution: $o_{ki} = [x_i, x_{i+1}, \dots, x_{i+h-1}] * f_{kh}$. Here $x_i$ is the word vector of the i-th word in the sentence, h is the kernel window length, $[x_i, x_{i+1}, \dots, x_{i+h-1}]$ is the word-vector matrix formed by the h words from the i-th to the (i+h-1)-th, $f_{kh}$ is the k-th convolution kernel of window length h, and * denotes elementwise multiplication of the two matrices followed by summation. The feature vectors produced by the convolutional layer are rearranged and then fed into three different LSTM recurrent neural network layers, forming the final feature vector $V = [v_2, v_3, v_4]$, where $v_2, v_3, v_4$ correspond to convolution window lengths 2, 3, and 4. The input layer of the whole model is formed by splicing the feature word vectors of the shallow part with the deep model's output V, giving an m-dimensional vector $X = [w_{f1}, \dots, w_{fn}, V]$.
The shallow model's final classification method is the softmax function.
The beneficial effects of the invention are as follows:
1. The invention trains word vectors with the word2vec module of gensim. Because a word's vector is computed from its neighboring words, the vector implicitly carries semantic information, making it suitable for semantic feature extraction. Representing text with word vectors as the model's input effectively improves model performance.
2. During data preprocessing, the word-vector weight of question keywords is increased for the deep model's input. Keywords in a question often contribute more to judging the sentence's category; after every word in the question is represented as a word vector, the word vector of each question keyword in the training corpus is repeated once on its left and once on its right, increasing the keyword's weight within the sentence and further improving the model's classification performance.
3. The question text classification model combining a shallow model with a deep model unites the respective advantages of deep models and traditional shallow machine-learning models. The deep model combines a convolutional neural network with LSTM recurrent neural networks; to better learn the syntactic-semantic features of the text, convolution kernels of several different window sizes are applied to the question text in the convolutional network. At the same time, because a deep model struggles to learn effective feature-vector representations for a class whose training corpus is relatively small, the invention combines a shallow model with the deep model, exploiting the traditional shallow model's strong memorization of features. Question classification accuracy improves on both imbalanced and balanced training data; on imbalanced training data in particular, performance clearly exceeds that of the other models.
In summary, this question text classification method combining a shallow model with a deep model composes a deep model from a convolutional neural network and recurrent neural networks, which learns and extracts the syntactic-semantic features of the question as one part of the shallow linear model's input; the deep model's input increases the word-vector weight of question keywords, and the convolutional network uses kernels of several different window sizes. The feature word vectors serve as the shallow model's other input, and the advantages of the shallow model overcome the deficiencies of a single deep model when the training corpus is imbalanced. The final unified model effectively improves the accuracy of question classification.
Description of the drawings
Fig. 1 shows the structure of the question classification model of the present invention;
Fig. 2 shows the structure of the deep model part of the present invention;
Fig. 3 compares question classification accuracy under different treatments of the convolutional network output in the present invention;
Fig. 4 compares how the performance of different neural network models changes as training iterations increase.
Specific embodiments
Embodiment 1: as shown in Figs. 1-4, a question text classification method combining a shallow model with a deep model comprises the following specific steps:
Step 1: crawl question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital; then preprocess the corpus texts;
Further, the specific steps of Step 1 are as follows:
Step 1.1: first write a crawler by hand and crawl from Baidu Zhidao the question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital;
Step 1.2: filter and deduplicate the crawled corpus to obtain a duplicate-free question corpus, and store it in a database;
Using the crawler, the invention crawled 5000 question corpora from Baidu Zhidao for each of the five categories: economy and finance, laws and regulations, sports, health care, and electronic digital. These form the first prepared corpus, the balanced corpus. In addition, 3000 corpora each were removed from health care and electronic digital, leaving 2000 each, with the other three categories unchanged, forming the second prepared corpus, the imbalanced corpus. For each corpus, one tenth is taken as the test set and the rest as the training set. Since the crawled question corpora may contain duplicates, which add workload without much benefit, filtering and deduplication are applied to the prepared corpora to obtain duplicate-free question texts, which are stored in a database to facilitate data management and use.
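As an illustration only, the corpus assembly and the one-tenth test split described above can be sketched in Python as follows; the category names and the load_questions() helper are hypothetical placeholders, not part of the invention.

```python
# Sketch of assembling the balanced and imbalanced corpora described above.
# load_questions() is a hypothetical helper returning one category's deduplicated
# questions from the database; the category names are illustrative only.
import random

CATEGORIES = ["economy_finance", "laws_regulations", "sports",
              "health_care", "electronic_digital"]

def build_corpus(per_class):
    corpus = []
    for cat, n in per_class.items():
        corpus += [(q, cat) for q in load_questions(cat)[:n]]
    random.shuffle(corpus)
    n_test = len(corpus) // 10          # one tenth held out as the test set
    return corpus[n_test:], corpus[:n_test]

# Corpus 1 (balanced): 5000 questions per category.
train1, test1 = build_corpus({c: 5000 for c in CATEGORIES})

# Corpus 2 (imbalanced): health care and electronic digital reduced to 2000.
sizes = {c: 5000 for c in CATEGORIES}
sizes["health_care"] = sizes["electronic_digital"] = 2000
train2, test2 = build_corpus(sizes)
```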
Step 1.3: preprocess the question corpus in the database by word segmentation and stop-word removal.
Splitting text directly into strings of individual characters would lose the linguistic information between the words of the original text, so the question corpus is preprocessed, including Chinese word segmentation with the jieba tool and stop-word removal, to facilitate subsequent work.
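A minimal preprocessing sketch with jieba follows; the stop-word file path is an assumption, and train1 refers to the corpus sketch above.

```python
# Segmentation and stop-word removal sketch using jieba.
# "stopwords.txt" (one stop word per line) is an assumed file path.
import jieba

with open("stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = {line.strip() for line in f}

def preprocess(question):
    """Segment a Chinese question and drop stop words and whitespace tokens."""
    return [t for t in jieba.lcut(question) if t.strip() and t not in STOPWORDS]

tokenized_corpus = [(preprocess(q), cat) for q, cat in train1]
```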
Step 2: extract the feature word set of the question texts in the corpus using the chi-square test (CHI) method, convert each feature word into word-vector form, and use the feature word's normalized word-vector value as its weight, thereby obtaining one part of the shallow linear model's input, Input1;
Further, the specific steps of Step 2 are as follows:
Step 2.1: extract the feature word set of the question texts using the chi-square test (CHI) method;
For feature selection in the multiple-linear-regression part, the invention takes words as the basic feature items and uses no syntactic or grammatical features. The chi-square test, an effective and widely used feature selection method, extracts the feature words of the question texts, and each question text is represented by its feature word set.
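The patent does not name an implementation of the chi-square statistic; as one possible realization, the sketch below scores words with scikit-learn's chi2 and keeps the highest-scoring ones. The cutoff of 2000 feature words is an assumption, and tokenized_corpus comes from the preprocessing sketch above.

```python
# Chi-square (CHI) feature-word selection sketch using scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [" ".join(tokens) for tokens, _ in tokenized_corpus]
labels = [cat for _, cat in tokenized_corpus]

# token_pattern r"\S+" keeps short Chinese tokens that the default pattern would drop
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)          # document-term count matrix
scores, _ = chi2(X, labels)                 # one CHI score per vocabulary word
vocab = vectorizer.get_feature_names_out()
feature_words = [vocab[i] for i in np.argsort(scores)[::-1][:2000]]  # cutoff assumed
```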
Step 2.2: convert each feature word from Step 2.1 into word-vector form, using a distributed representation of word vectors;
For text vectorization, the invention considers the limitations of the traditional one-hot representation and chooses a distributed representation. This kind of word vector not only resolves the dimensional sparsity of one-hot encoding, but also places the vectors of similar words close together, carrying a degree of semantic information; representing text with word vectors as the model's input helps improve model performance. The invention trains word vectors with the word2vec module of gensim.
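A word-vector training sketch with gensim's word2vec module follows. The hyperparameters are assumptions, and the L2 normalization shown is one reading of the "normalized word-vector value" used as a feature word's weight in Step 2.3.

```python
# word2vec training sketch with gensim (hyperparameters are assumptions).
import numpy as np
from gensim.models import Word2Vec

sentences = [tokens for tokens, _ in tokenized_corpus]   # from the preprocessing sketch
w2v = Word2Vec(sentences=sentences, vector_size=100,     # gensim >= 4; older versions use size=
               window=5, min_count=1, sg=1, epochs=10)

def weighted_feature_vector(word):
    """L2-normalize a feature word's vector: one reading of the normalized weight."""
    v = w2v.wv[word]
    return v / np.linalg.norm(v)

# Input1 sketch: splice the weighted vectors of the selected feature words.
input1 = np.concatenate([weighted_feature_vector(w) for w in feature_words
                         if w in w2v.wv])
```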
Step 2.3: use each feature word's normalized word-vector value as its weight, finally obtaining the non-syntactic feature representation of the question text, which serves as one part of the shallow model's input, Input1.
Different feature words are assigned different weights: the invention uses the simple and effective normalized word-vector value as each feature word's weight, and the non-syntactic feature representation of the question text, given by the feature word vectors, forms one part of the shallow linear model's input.
Step 3: increase the word-vector weight of the question keywords, then feed the question text vector formed from the word-vector matrix into the first part of the deep model, a convolutional network; convolution kernels of several different window sizes are applied to the question text to extract local phrase features of the sentence, and the feature vectors extracted by the different kernels sharing the same convolution window length are rearranged;
Further, the specific steps of Step 3 are as follows:
Step 3.1: extract the question keywords with the tf-idf-based method provided by the jieba toolkit in Python; after every word in the question has been represented as a word vector, repeat each question keyword's word vector once on its left and once on its right, which increases the keyword's weight within the sentence and yields a word-vector matrix;
At the deep model's input, some words in a question often contribute more to judging the sentence's category. For example, in the question "When was the sport of basketball invented?", the noun "basketball" plays the key role in identifying the question as the sports category. Therefore, after every word in the question is represented as a word vector, the word vector of each question keyword in the training corpus is repeated once on its left and once on its right; the original sentence in effect becomes "basketball basketball basketball: when was the sport invented?", and the keyword's weight within the question increases. To verify that increasing keyword weight can further improve the model's classification performance, i.e., that keywords play a key role in the classification result, a comparison experiment was run, shown in Table 1:
Table 1

	Without increased keyword weight	With increased keyword weight
Accuracy	0.9219	0.9226
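The keyword step can be sketched with jieba's tf-idf interface, jieba.analyse.extract_tags; the number of keywords per question (topK=3) is an assumption.

```python
# Keyword up-weighting sketch: extract tf-idf keywords with jieba.analyse and
# repeat each keyword once on its left and once on its right (three copies total).
import jieba.analyse

def boost_keywords(tokens, top_k=3):        # top_k is an assumption
    keywords = set(jieba.analyse.extract_tags("".join(tokens), topK=top_k))
    boosted = []
    for t in tokens:
        boosted.extend([t, t, t] if t in keywords else [t])
    return boosted

# ['篮球', '运动', '是', '什么', '时候', '发明', '的']
# becomes ['篮球', '篮球', '篮球', '运动', ...]: the keyword's in-sentence weight grows
```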
Step 3.2: feed the question text vector represented by the word-vector matrix from Step 3.1 into the first part of the deep model, the convolutional network, where the number of matrix rows is the number of words in the sentence and the number of columns is the word-vector dimension; two convolution kernels of each of the three convolution window lengths 2, 3, and 4 perform vertical convolution over the question, extracting local features at different positions in the sentence and producing several groups of feature vectors;
As shown in Fig. 2, to better learn the syntactic-semantic features of the question text, the convolutional part uses two convolution kernels for each of the three convolution window lengths 2, 3, and 4. The convolution window length is the number of words a single convolution operation covers in the sentence. Each kernel slides over the sentence, extracting local features at different positions and producing a group of feature vectors.
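The convolutional branch can be sketched as follows. The patent does not name a framework; Keras is assumed here, and the padded sentence length and embedding size are placeholders. With filters=2, the two kernels per window size slide together, and each output time step already pairs their features for that position.

```python
# Convolutional-branch sketch (Keras assumed): two kernels for each of the
# window lengths 2, 3 and 4 slide vertically over the word-vector matrix.
from tensorflow.keras import layers

SENT_LEN, EMB_DIM = 30, 100                     # assumed padded length and embedding size
words = layers.Input(shape=(SENT_LEN, EMB_DIM))
conv_outputs = {}
for h in (2, 3, 4):
    # 'valid' padding yields the l - h + 1 positions of the feature map w_kh
    conv_outputs[h] = layers.Conv1D(filters=2, kernel_size=h, padding="valid",
                                    activation="relu")(words)
# conv_outputs[h] has shape (batch, SENT_LEN - h + 1, 2): the channel axis holds
# both kernels' features at each position, matching the rearrangement of Step 3.3
```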
Step 3.3: rearrange, by position in the sequence, the feature vectors extracted by the different kernels that share the same convolution window size, so that the features obtained by different kernels at the same sentence position are spliced together.
To verify the effectiveness of the invention's treatment of the convolutional network output and its choice of convolution windows, another strategy for processing the convolutional output before the recurrent network, along with different window-size selection strategies, was compared on classification performance. The second linking method works as follows: after the convolution features are rearranged, the feature length produced by the largest convolution window is taken as the reference, the rearranged features from the other two window lengths are truncated to that length, features at the same sentence position are linked together, and the result is fed into a single LSTM recurrent network. This structure is denoted M2:cl2,3,4. The linking strategy used by the deep part of the invention's shallow-deep combined model is denoted M1:Cl2,3,4. Models with a single window length were also compared, denoted S:Cl2, S:Cl3, and S:Cl4 for window sizes 2, 3, and 4. Experiments on corpus 1 are shown in Fig. 3. The invention's M1:Cl2,3,4 strategy clearly performs best, while the classification accuracy of M2:cl2,3,4 falls below both the single-window models and M1:Cl2,3,4; the likely reason is that the truncated features disturb the final feature sequence, preventing the LSTM from capturing a high-quality sequence. Among the single window lengths, window length 3 gives the highest classification accuracy.
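For clarity, the M1 rearrangement can also be shown in isolation with numpy: the feature sequences of two kernels sharing one window length are spliced position by position. This is a sketch of one reading of the rearrangement, not the patent's literal implementation.

```python
# Rearrangement sketch: splice the features of two same-width kernels position-wise.
import numpy as np

def rearrange(feat_a, feat_b):
    """feat_a, feat_b: (positions,) feature maps from two kernels of one window length.
    Returns (positions, 2): each step carries both kernels' features at that position."""
    return np.stack([feat_a, feat_b], axis=-1)

a = np.array([0.1, 0.5, 0.3])       # kernel 1, window length 3
b = np.array([0.7, 0.2, 0.9])       # kernel 2, window length 3
print(rearrange(a, b))              # [[0.1 0.7] [0.5 0.2] [0.3 0.9]]
```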
Step 4: feed the feature vectors produced in Step 3 into the corresponding recurrent neural networks; through its chain structure, a recurrent neural network can capture the sentence's historical information and learn the long-term dependencies of sequence data, and the output of the last time step contains the feature information of the whole sentence; the outputs of the several recurrent neural networks are concatenated as the final feature of the question, yielding the other part of the shallow linear model's input, Input2;
Further, the specific steps of Step 4 are as follows:
Step 4.1: feed the rearranged features obtained in Step 3.3 from the three convolution window lengths, in sentence order, into the three corresponding recurrent neural networks; LSTM recurrent neural networks are used here to better capture the sentence's earlier historical information and learn the long-term dependencies of the sequence data, and the output of the last time step contains the feature information of the entire question;
To better learn the syntactic-semantic features of the sentence, the recurrent networks in the second part of the deep model are long short-term memory (LSTM) networks: a basic recurrent neural network loses information from the front of a longer sentence, and the LSTM recurrent neural network model was invented to overcome this shortcoming, remembering earlier historical information better than a traditional recurrent network.
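Continuing the convolutional sketch above, one LSTM per window length reads the rearranged feature sequence, and its last time step summarizes the question; the unit count is an assumption.

```python
# One LSTM branch per window length (continuing the Keras sketch above).
lstm_outputs = []
for h in (2, 3, 4):
    v_h = layers.LSTM(units=64)(conv_outputs[h])   # units=64 is an assumption;
    lstm_outputs.append(v_h)                       # LSTM returns the last time step
V = layers.Concatenate()(lstm_outputs)             # V = [v2, v3, v4], i.e. Input2
```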
Step 4.2: concatenate the outputs of the three recurrent neural networks from Step 4.1 as the final feature representation of the question, obtaining the other part of the shallow linear model's input, Input2.
The outputs of the three LSTMs are spliced together to form the final feature vector of the question, $V = [v_2, v_3, v_4]$, where $v_2, v_3, v_4$ correspond to convolution window lengths 2, 3, and 4. The multi-window convolution-recurrent combined deep model is shown in Fig. 2.
Step 5: splice Input1 obtained in Step 2 with Input2, the final output of the deep model in Step 4, to form the input of the shallow model; the shallow model part uses a multiple-linear-regression structure and finally produces the question's classification result.
Further, the specific steps of Step 5 are as follows:
Step 5.1: splice Input1 obtained in Step 2.3 with Input2, the final output from Step 4.2, to form the input of the shallow model; the shallow model here uses a multiple-linear-regression structure, i.e., an ordinary neural network with one final fully connected layer followed by a softmax function;
Step 5.2: pass the input-layer content from Step 5.1 through one hidden layer, then feed the hidden layer's output into the softmax function to obtain the final question classification result.
Further, the deep model part consists of a convolutional network layer and recurrent neural network layers. In the convolutional layer, the text feature obtained by the k-th convolution kernel of window length h is $w_{kh} = [c_{k1}, \dots, c_{k(l-h+1)}]$, where $c_{ki}$ denotes the convolution feature of the k-th kernel at the i-th position of the question text, and $c_{ki} = \mathrm{ReLU}(o_{ki} + b)$, where $o_{ki}$ is the value computed by the convolution: $o_{ki} = [x_i, x_{i+1}, \dots, x_{i+h-1}] * f_{kh}$. Here $x_i$ is the word vector of the i-th word in the sentence, h is the kernel window length, $[x_i, x_{i+1}, \dots, x_{i+h-1}]$ is the word-vector matrix formed by the h words from the i-th to the (i+h-1)-th, $f_{kh}$ is the k-th convolution kernel of window length h, and * denotes elementwise multiplication of the two matrices followed by summation. The feature vectors produced by the convolutional layer are rearranged and then fed into three different LSTM recurrent neural network layers, forming the final feature vector $V = [v_2, v_3, v_4]$, where $v_2, v_3, v_4$ correspond to convolution window lengths 2, 3, and 4. The input layer of the whole model is formed by splicing the feature word vectors of the shallow part with the deep model's output V, giving an m-dimensional vector $X = [w_{f1}, \dots, w_{fn}, V]$.
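The convolution equations above can be checked with a small numpy sketch on toy sizes:

```python
# Worked sketch of o_ki = [x_i, ..., x_{i+h-1}] * f_kh and c_ki = ReLU(o_ki + b).
import numpy as np

l, h, d = 7, 3, 4                        # toy sentence length, window length, embedding dim
X = np.random.randn(l, d)                # word-vector matrix of the question
f_kh = np.random.randn(h, d)             # k-th convolution kernel of window length h
b = 0.1

w_kh = np.empty(l - h + 1)               # feature map [c_k1, ..., c_k(l-h+1)]
for i in range(l - h + 1):
    o_ki = np.sum(X[i:i + h] * f_kh)     # elementwise product of the two matrices, summed
    w_kh[i] = max(0.0, o_ki + b)         # ReLU
print(w_kh.shape)                        # (5,), i.e. l - h + 1 positions
```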
Further, the shallow model's final classification method is the softmax function.
To compare the question classification performance of the shallow-deep combined model against conventional machine learning models, a convolutional neural network model, a recurrent neural network model, and a multi-window convolution-recurrent combined network model, three conventional machine learning methods were chosen: SVM, maximum entropy, and naive Bayes. The model of the present invention and the other three neural network models are denoted WD, CNN, RNN, and M:cnn+rnn respectively. Accuracy is compared on the balanced corpus 1 and the imbalanced corpus 2; the results are shown in Tables 2 and 3.
Table 2
Table 3
Table 2 shows clearly that on the balanced corpus the WD model achieves the highest accuracy among the conventional machine learning models; although its accuracy declines on the imbalanced corpus, the drop is small compared with the other models.
Table 3 shows that the overall performance of the deep models still exceeds the conventional models, but their accuracy drops relatively sharply on the imbalanced corpus; the reason is that when the corpus for some class is small, a deep model's overly strong learning capacity increases the difficulty of learning effective classification features.
To compare ordinary deep models with the invention's shallow-deep combined model further, Fig. 4 shows how classification accuracy changes on the imbalanced corpus 2 as the number of training iterations grows. The question classification accuracy of all four models rises steadily with training iterations and essentially stops changing at around 200 iterations. The shallow-deep combined model surpasses the other three models in final classification performance, and the figure also shows that on short text the convolutional network slightly outperforms the recurrent neural network.
In the present invention, the question text classification model combining a shallow model with a deep model consists of a shallow model part and a deep model part; the overall structure is shown in Fig. 1.
Input layer
The input layer is formed by splicing the feature word vectors of the shallow part with the deep model's output V, giving an m-dimensional vector denoted $X = [w_{f1}, \dots, w_{fn}, V]$.
Softmax layer
The softmax layer is equivalent to an ordinary fully connected neural network with one hidden layer. The input-layer content passes through one hidden layer, and the hidden layer's output is fed into the softmax function to obtain the final classification result. The hidden layer has k neural units, and the input layer and the hidden layer are fully connected, computed as $O = XW$, where W is an m-by-k matrix whose elements are randomly initialized to nonzero values and then continually updated during training. O is a one-dimensional vector of k values, each representing the output value of one class, and it is passed to the softmax function, $s_k = e^{O_k} / \sum_{j} e^{O_j}$, where $O_k$ is the network's output value for class k and $s_k$ is the probability that the text belongs to class k.
To train the whole model, a suitable loss function must be defined; the Adam optimization method (Computer Science, 2014) is used to minimize the loss function and train the whole model. For classification problems, cross-entropy is generally used as the loss function: $H_{y'}(y) = -\sum_i y_i' \log y_i$, where $y_i'$ is the true probability distribution (i.e., the class labels of the training corpus) and $y_i$ is the probability distribution predicted by the model. The whole model is trained by minimizing the value of $H_{y'}(y)$.
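Putting the pieces together, a hedged end-to-end sketch of the combined model follows, continuing the Keras sketches above: the shallow input (the weighted feature-word vectors) is spliced with the deep output V, passed through one hidden layer, and classified by softmax, trained with Adam and cross-entropy as described. The layer sizes are assumptions.

```python
# Combined shallow-deep model sketch (continuing the Keras sketches above).
from tensorflow.keras import Model, layers, optimizers

INPUT1_DIM = 1000                                   # assumed size of the flattened Input1
feat_words = layers.Input(shape=(INPUT1_DIM,))      # shallow input: weighted feature-word vectors
x = layers.Concatenate()([feat_words, V])           # X = [w_f1, ..., w_fn, V]
hidden = layers.Dense(128)(x)                       # hidden layer of k units, cf. O = XW (k=128 assumed)
probs = layers.Dense(5, activation="softmax")(hidden)  # s_k over the five question categories

model = Model(inputs=[words, feat_words], outputs=probs)
model.compile(optimizer=optimizers.Adam(),
              loss="categorical_crossentropy",      # H_y'(y) = -sum_i y'_i log y_i
              metrics=["accuracy"])
model.summary()
```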
The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; various changes can be made within the knowledge of a person skilled in the art without departing from the concept of the invention.

Claims (8)

1. A question text classification method combining a shallow model with a deep model, characterized in that the specific steps of the method are as follows:
Step 1: crawl question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital; then preprocess the corpus texts;
Step 2: extract the feature word set of the question texts in the corpus using the chi-square test (CHI) method, convert each feature word into word-vector form, and use the feature word's normalized word-vector value as its weight, thereby obtaining one part of the shallow linear model's input, Input1;
Step 3: increase the word-vector weight of the question keywords, then feed the question text vector formed from the word-vector matrix into the first part of the deep model, a convolutional network; convolution kernels of several different window sizes are applied to the question text to extract local phrase features of the sentence, and the feature vectors extracted by the different kernels sharing the same convolution window length are rearranged;
Step 4: feed the feature vectors produced in Step 3 into the corresponding recurrent neural networks; through its chain structure, a recurrent neural network can capture the sentence's historical information and learn the long-term dependencies of sequence data, and the output of the last time step contains the feature information of the whole sentence; the outputs of the several recurrent neural networks are concatenated as the final feature of the question, yielding the other part of the shallow linear model's input, Input2;
Step 5: splice Input1 obtained in Step 2 with Input2, the final output of the deep model in Step 4, to form the input of the shallow model; the shallow model part uses a multiple-linear-regression structure and finally produces the question's classification result.
2. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that the specific steps of Step 1 are as follows:
Step 1.1: first write a crawler by hand and crawl from Baidu Zhidao the question corpora of five categories: economy and finance, laws and regulations, sports, health care, and electronic digital;
Step 1.2: filter and deduplicate the crawled corpus to obtain a duplicate-free question corpus, and store it in a database;
Step 1.3: preprocess the question corpus in the database by word segmentation and stop-word removal.
3. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that the specific steps of Step 2 are as follows:
Step 2.1: extract the feature word set of the question texts using the chi-square test (CHI) method;
Step 2.2: convert each feature word from Step 2.1 into word-vector form, using a distributed word-vector representation;
Step 2.3: use each feature word's normalized word-vector value as its weight, finally obtaining the non-syntactic feature representation of the question text, which serves as one part of the shallow model's input, Input1.
4. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that the specific steps of Step 3 are as follows:
Step 3.1: extract the question keywords with the tf-idf-based method provided by the jieba toolkit in Python; after every word in the question has been represented as a word vector, repeat each question keyword's word vector once on its left and once on its right, which increases the keyword's weight within the sentence and yields a word-vector matrix;
Step 3.2: feed the question text vector represented by the word-vector matrix from Step 3.1 into the first part of the deep model, the convolutional network, where the number of matrix rows is the number of words in the sentence and the number of columns is the word-vector dimension; two convolution kernels of each of the three convolution window lengths 2, 3, and 4 perform vertical convolution over the question, extracting local features at different positions in the sentence and producing several groups of feature vectors;
Step 3.3: rearrange, by position in the sequence, the feature vectors extracted by the different kernels that share the same convolution window size, so that the features obtained by different kernels at the same sentence position are spliced together.
5. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that the specific steps of Step 4 are as follows:
Step 4.1: feed the rearranged features obtained in Step 3.3 from the three convolution window lengths, in sentence order, into the three corresponding recurrent neural networks; LSTM recurrent neural networks are used here to better capture the sentence's earlier historical information and learn the long-term dependencies of the sequence data, and the output of the last time step contains the feature information of the entire question;
Step 4.2: concatenate the outputs of the three recurrent neural networks from Step 4.1 as the final feature representation of the question, obtaining the other part of the shallow linear model's input, Input2.
6. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that the specific steps of Step 5 are as follows:
Step 5.1: splice Input1 obtained in Step 2.3 with Input2, the final output from Step 4.2, to form the input of the shallow model; the shallow model here uses a multiple-linear-regression structure, i.e., an ordinary neural network with one final fully connected layer followed by a softmax function;
Step 5.2: pass the input-layer content from Step 5.1 through one hidden layer, then feed the hidden layer's output into the softmax function to obtain the final question classification result.
7. The question text classification method combining a shallow model with a deep model according to claim 1, characterized in that: the deep model part consists of a convolutional network layer and recurrent neural network layers; in the convolutional layer, the text feature obtained by the k-th convolution kernel of window length h is $w_{kh} = [c_{k1}, \dots, c_{k(l-h+1)}]$, where $c_{ki}$ denotes the convolution feature of the k-th kernel at the i-th position of the question text, and $c_{ki} = \mathrm{ReLU}(o_{ki} + b)$, where $o_{ki}$ is the value computed by the convolution: $o_{ki} = [x_i, x_{i+1}, \dots, x_{i+h-1}] * f_{kh}$; $x_i$ is the word vector of the i-th word in the sentence, h is the kernel window length, $[x_i, x_{i+1}, \dots, x_{i+h-1}]$ is the word-vector matrix formed by the h words from the i-th to the (i+h-1)-th, $f_{kh}$ is the k-th convolution kernel of window length h, and * denotes elementwise multiplication of the two matrices followed by summation; the feature vectors produced by the convolutional layer are rearranged and then fed into three different LSTM recurrent neural network layers, forming the final feature vector $V = [v_2, v_3, v_4]$, where $v_2, v_3, v_4$ correspond to convolution window lengths 2, 3, and 4; the input layer of the whole model is formed by splicing the feature word vectors of the shallow part with the deep model's output V, giving an m-dimensional vector $X = [w_{f1}, \dots, w_{fn}, V]$.
8. The question text classification method combining a shallow model with a deep model according to claim 6, characterized in that: the shallow model's final classification method is the softmax function.
CN201810357603.5A 2018-04-20 2018-04-20 Question text classification method combining a shallow model and a deep model Pending CN108595602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810357603.5A CN108595602A (en) 2018-04-20 2018-04-20 Question text classification method combining a shallow model and a deep model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810357603.5A CN108595602A (en) 2018-04-20 2018-04-20 Question text classification method combining a shallow model and a deep model

Publications (1)

Publication Number Publication Date
CN108595602A true CN108595602A (en) 2018-09-28

Family

ID=63613629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810357603.5A Question text classification method combining a shallow model and a deep model 2018-04-20 2018-04-20 Pending CN108595602A (en)

Country Status (1)

Country Link
CN (1) CN108595602A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320374A (en) * 2008-07-10 2008-12-10 昆明理工大学 Field question classification method combining syntax structural relationship and field characteristic
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN107832312A (en) * 2017-01-03 2018-03-23 北京工业大学 A kind of text based on deep semantic discrimination recommends method
CN107038480A (en) * 2017-05-12 2017-08-11 东华大学 A kind of text sentiment classification method based on convolutional neural networks
CN107608999A (en) * 2017-07-17 2018-01-19 南京邮电大学 A kind of Question Classification method suitable for automatically request-answering system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUHAN WU et al.: "THU NGN at NAACL-2018 Metaphor Shared Task: Neural Metaphor Detecting with CNN-LSTM Model", https://www.researchgate.net *
RONG ZHANG et al.: "Deep and Shallow Model for Insurance Churn Prediction Service", 2017 IEEE 14th International Conference on Services Computing *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991161A (en) * 2018-09-30 2020-04-10 北京国双科技有限公司 Similar text determination method, neural network model obtaining method and related device
CN110991161B (en) * 2018-09-30 2023-04-18 北京国双科技有限公司 Similar text determination method, neural network model obtaining method and related device
CN109739956A (en) * 2018-11-08 2019-05-10 第四范式(北京)技术有限公司 Corpus cleaning method, device, equipment and medium
CN109739956B (en) * 2018-11-08 2020-04-10 第四范式(北京)技术有限公司 Corpus cleaning method, apparatus, device and medium
CN109271493A (en) * 2018-11-26 2019-01-25 腾讯科技(深圳)有限公司 A kind of language text processing method, device and storage medium
CN111382244B (en) * 2018-12-29 2023-04-14 深圳市优必选科技有限公司 Deep retrieval matching classification method and device and terminal equipment
CN111382244A (en) * 2018-12-29 2020-07-07 深圳市优必选科技有限公司 Deep retrieval matching classification method and device and terminal equipment
CN109857852A (en) * 2019-01-24 2019-06-07 安徽商贸职业技术学院 A kind of the screening judgment method and system of electric business online comment training set feature
CN109857852B (en) * 2019-01-24 2021-02-23 安徽商贸职业技术学院 Method and system for screening and judging characteristics of E-commerce online comment training set
CN110046233A (en) * 2019-02-12 2019-07-23 阿里巴巴集团控股有限公司 Problem distributing method and device
TWI717826B (en) * 2019-02-13 2021-02-01 開曼群島商創新先進技術有限公司 Method and device for extracting main words through reinforcement learning
CN109918507A (en) * 2019-03-08 2019-06-21 北京工业大学 One kind being based on the improved file classification method of TextCNN
CN109871904A (en) * 2019-03-11 2019-06-11 广东工业大学 Inscriptions on bones or tortoise shells word identification model and training method, system, equipment, computer media
CN110009027A (en) * 2019-03-28 2019-07-12 腾讯科技(深圳)有限公司 Comparison method, device, storage medium and the electronic device of image
CN110110372A (en) * 2019-04-09 2019-08-09 华东师范大学 A kind of user's timing behavior automatic segmentation prediction technique
CN110110372B (en) * 2019-04-09 2023-04-18 华东师范大学 Automatic segmentation prediction method for user time sequence behavior
CN110309860A (en) * 2019-06-06 2019-10-08 昆明理工大学 The method classified based on grade malignancy of the convolutional neural networks to Lung neoplasm
CN110298036A (en) * 2019-06-06 2019-10-01 昆明理工大学 A kind of online medical text symptom identification method based on part of speech increment iterative
CN110298036B (en) * 2019-06-06 2022-07-22 昆明理工大学 Online medical text symptom identification method based on part-of-speech incremental iteration
CN110245353A (en) * 2019-06-20 2019-09-17 腾讯科技(深圳)有限公司 Natural language representation method, device, equipment and storage medium
CN110245353B (en) * 2019-06-20 2022-10-28 腾讯科技(深圳)有限公司 Natural language expression method, device, equipment and storage medium
CN110442720A (en) * 2019-08-09 2019-11-12 中国电子技术标准化研究院 A kind of multi-tag file classification method based on LSTM convolutional neural networks
CN110516070A (en) * 2019-08-28 2019-11-29 上海海事大学 A kind of Chinese Question Classification method based on text error correction and neural network
CN112992356B (en) * 2021-03-30 2022-04-26 太原理工大学 Heart failure prediction method and device based on convolutional layer feature rearrangement and SVM
CN112992356A (en) * 2021-03-30 2021-06-18 太原理工大学 Heart failure prediction method and device based on convolutional layer feature rearrangement and SVM
CN112989052B (en) * 2021-04-19 2022-03-08 北京建筑大学 Chinese news long text classification method based on combination-convolution neural network
CN112989052A (en) * 2021-04-19 2021-06-18 北京建筑大学 Chinese news text classification method based on combined-convolutional neural network
CN113553844A (en) * 2021-08-11 2021-10-26 四川长虹电器股份有限公司 Domain identification method based on prefix tree features and convolutional neural network
CN113553844B (en) * 2021-08-11 2023-07-25 四川长虹电器股份有限公司 Domain identification method based on prefix tree features and convolutional neural network
CN113869458A (en) * 2021-10-21 2021-12-31 成都数联云算科技有限公司 Training method of text classification model, text classification method and related device
WO2024045247A1 (en) * 2022-08-31 2024-03-07 福建天甫电子材料有限公司 Production management and control system for ammonium fluoride production and control method therefor

Similar Documents

Publication Publication Date Title
CN108595602A (en) The question sentence file classification method combined with depth model based on shallow Model
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
Rajayogi et al. Indian food image classification with transfer learning
CN105975573B (en) A kind of file classification method based on KNN
CN108829818A (en) A kind of file classification method
CN110442684A (en) A kind of class case recommended method based on content of text
CN110188272B (en) Community question-answering website label recommendation method based on user background
CN107239529A (en) A kind of public sentiment hot category classification method based on deep learning
CN110083700A A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN107516110A (en) A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding
CN107169035A (en) A kind of file classification method for mixing shot and long term memory network and convolutional neural networks
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN107291822A (en) The problem of based on deep learning disaggregated model training method, sorting technique and device
CN109635108A (en) A kind of remote supervisory entity relation extraction method based on human-computer interaction
CN107766324A (en) A kind of text coherence analysis method based on deep neural network
CN110321563A (en) Text emotion analysis method based on mixing monitor model
CN108121702A (en) Mathematics subjective item reads and appraises method and system
CN108052504A (en) Mathematics subjective item answers the structure analysis method and system of result
CN110413791A (en) File classification method based on CNN-SVM-KNN built-up pattern
CN109101584A (en) A kind of sentence classification improved method combining deep learning with mathematical analysis
CN110263174A (en) - subject categories the analysis method based on focus
CN110298036A (en) A kind of online medical text symptom identification method based on part of speech increment iterative
CN110188195A (en) A kind of text intension recognizing method, device and equipment based on deep learning
CN107480141A It is a kind of that allocating method is aided in based on the software defect of text and developer's liveness
Qian Exploration of machine algorithms based on deep learning model and feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180928)