CN104778158B - Text representation method and device - Google Patents
- Publication number: CN104778158B
- Application number: CN201510096570.XA
- Authority: CN (China)
- Prior art keywords: term vector, word, text, vector, term
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text representation method and device that improve the accuracy of text representation and, in turn, the accuracy of text processing. The method includes: determining the words that make up the current text; determining the word vector of each word; clustering the word vectors; determining, from the clustering result, the feature words of the current text and the weight of each feature word; and determining the text vector of the current text from the word vectors and weights of the feature words. Because the feature words are selected by clustering, the selection takes into account both the semantics of a word within its sentence and the correlation between sentences, so the word vectors of the selected feature words express the meaning of the text accurately. This improves the accuracy of text representation and therefore of text processing.
Description
Technical field
The present invention relates to information processing technology, and in particular to a text representation method and device.
Background
Information processing often involves text processing. Text processing refers to operations such as text retrieval, text classification, and text analysis that are carried out on the content of a text after the text has been represented. Text representation refers to converting the original text content into an internal structure that a computer program can analyze; for example, the words and phrases in the text content can be used to form a vector structure that the computer can analyze.
The more accurate the text representation, the more accurately it expresses the meaning of the current text, and the better and more efficient the text processing. Conversely, the less accurate the text representation, the further the expressed meaning deviates from the actual meaning of the text, and the worse and less efficient the text processing.
In the prior art, text representation methods are mainly based on the vector space model. The vector space model represents a text as follows: the text is first segmented into multiple words; then, according to the frequency with which these words occur in the text, the words whose frequency exceeds a preset value are selected as the feature words representing the text, and the weight of each feature word is calculated; finally, the feature words and their weights form a text vector, which is the representation of the text. For example, if the i-th feature word of a text is f_i and its weight is w_i, the text is represented as {<f_1:w_1>, <f_2:w_2>, ..., <f_i:w_i>, ...}, where i = 1, 2, 3, ....
When selecting feature words, this prior-art method considers neither the semantics of a word within its sentence nor the correlation between sentences; it mechanically extracts from the text the words whose frequency exceeds the preset value. In addition, the feature words in the text vector are the literal words of the text, and a word taken on its own may have several meanings, so it cannot express the meaning of the text accurately. As a result, the text vector represents the text with low accuracy, and the accuracy of text processing is correspondingly low.
Summary of the invention
The embodiments of the present invention provide a text representation method and device to improve the accuracy of text representation and thereby the accuracy of text processing.
A text representation method provided by an embodiment of the present invention includes:
determining the words that make up the current text;
determining the word vector of each word;
clustering the word vectors;
determining, according to the clustering result, the feature words of the current text from among the words, and the weight of each feature word;
determining the text vector of the current text according to the word vector and weight of each feature word.
A text representation device provided by an embodiment of the present invention includes:
a first determining module, configured to determine the words that make up the current text;
a second determining module, configured to determine the word vector of each word;
a clustering module, configured to cluster the word vectors;
a third determining module, configured to determine, according to the clustering result, the feature words of the current text from among the words, and the weight of each feature word;
a fourth determining module, configured to determine the text vector of the current text according to the word vector and weight of each feature word.
In the text representation method and device provided by the embodiments of the present invention, the method determines the words that make up the current text, determines the word vector of each word, clusters the word vectors, determines the feature words of the current text and the weight of each feature word according to the clustering result, and determines the text vector of the current text according to the word vector and weight of each feature word. The words in the present invention are thus represented by word vectors. Compared with the word itself, a word vector describes the word along multiple dimensions and therefore represents its semantic information more accurately. In addition, the clustering process takes into account the semantics of the feature words within their sentences and the correlation between sentences. By clustering the word vectors to determine the feature words, the present invention therefore effectively improves the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of it. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow diagram of a text representation method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of a method for presetting a word vector library provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a text representation device provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the invention are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows the flow of a text representation method provided by an embodiment of the present invention, which includes the following steps.
S101: Determine the words that make up the current text.
In the embodiments of the present invention, the current text is the text obtained by the server that needs to be represented. The text may be a sentence, a paragraph, or a full document in Chinese, and may be in a format such as txt, doc, pdf, or wps.
The server may obtain the text from, but is not limited to, a preset storage area (such as a corpus), or it may obtain online a text that a user is currently uploading. The obtained text is used as the current text.
After obtaining the current text, the embodiment may segment it to obtain the words that make up the current text. The segmentation methods that may be used include, but are not limited to, word-by-word traversal and mechanical word segmentation. For example, suppose the server obtains an article and uses it as the current text. The server preprocesses the article content and then segments it, and the words obtained are: display, tablet, liquid crystal, illumination, device. These five words can then be determined to be the words that make up the current text.
To reduce the server's computation during segmentation and to avoid interference from certain words, the embodiment may preprocess the current text before segmentation, for example by removing Hypertext Markup Language (HTML) markup from the current text, converting traditional Chinese characters to simplified Chinese characters, and converting full-width characters to half-width characters.
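The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess` and the tag-matching regular expression are assumptions, Unicode NFKC normalization is used as one way to fold full-width characters to half-width (it also applies other compatibility mappings), and traditional-to-simplified conversion is left out because it needs a character mapping table.

```python
import html
import re
import unicodedata

# Hypothetical helper: a naive pattern that matches anything shaped like an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")

def preprocess(text: str) -> str:
    """Strip HTML markup and fold full-width characters to half-width."""
    text = TAG_RE.sub("", text)   # drop tags such as <p> or <br/>
    text = html.unescape(text)    # decode entities such as &amp;
    # NFKC maps full-width letters, digits and punctuation to their ASCII forms.
    return unicodedata.normalize("NFKC", text)

cleaned = preprocess("<p>ＬＣＤ ｄｉｓｐｌａｙ &amp; ｄｅｖｉｃｅ</p>")
print(cleaned)  # LCD display & device
```

A production pipeline would more likely use a real HTML parser than a regular expression; the regex is only there to keep the sketch short.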
In practical application scenarios, the words obtained by segmentation may include, besides words with substantive meaning, some words without substantive meaning, whereas feature words are generally words with substantive meaning. Therefore, when determining the words that make up the current text, the embodiment may segment the current text into multiple words and then determine the words of a specified type among them. To avoid selecting the same word more than once, the words of the specified type may further be de-duplicated, and the de-duplicated words are used as the words that make up the current text. The words of the specified type may be words with substantive meaning, which include but are not limited to nouns, verbs, and adjectives; words without substantive meaning are typically particles, adverbs, and function words.
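The filtering and de-duplication step might look like the sketch below. It assumes the segmenter already returns (word, part-of-speech) pairs; the tag names, the `CONTENT_POS` set, and the function `content_words` are hypothetical and chosen only for illustration.

```python
# Assumed part-of-speech labels for words with substantive meaning.
CONTENT_POS = {"noun", "verb", "adjective"}

def content_words(tagged):
    """Keep words of the specified types, dropping repeats but preserving order."""
    seen = set()
    result = []
    for word, pos in tagged:
        if pos in CONTENT_POS and word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Hypothetical segmenter output for a short text.
tagged = [
    ("liquid crystal", "noun"),
    ("of", "particle"),
    ("display", "verb"),
    ("device", "noun"),
    ("display", "verb"),
    ("very", "adverb"),
]
words = content_words(tagged)
print(words)  # ['liquid crystal', 'display', 'device']
```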
S102: Determine the word vector of each word.
In the embodiments of the present invention, to express the meaning (that is, the semantic information) of a word in more detail, a word may be represented by an N-dimensional vector of N elements, which is the word vector of that word. Each of the N elements is the weight of the word for a corresponding text category, where the text categories may include computer, traffic, education, economy, military, sports, medicine, art, politics, environment, and so on.
For example, suppose the text categories of a word vector are given by the N-dimensional vector {computer, traffic, education, economy}, where N = 4, and suppose display, tablet, liquid crystal, illumination, and device are the words that make up the current text. The word vector of "liquid crystal" may then be expressed as {0.175, 0.095, 0.185, 0.041}, meaning that the weights of "liquid crystal" for the four text categories computer, traffic, education, and economy are 0.175, 0.095, 0.185, and 0.041 respectively.
When determining the word vector of each word, the server may compute the word vectors directly online using a word vector tool. Optionally, the server may use the word2vec tool for this purpose.
To determine the word vectors more efficiently, the word vector of each word is preferably determined in advance. When the word vector of a word is needed, the word vector corresponding to that word is determined (for example, looked up) in a preset word vector library. Determining word vectors from a preset word vector library is convenient and fast, and effectively improves the processing efficiency of the server.
The word2vec tool may likewise be used when the word vectors are determined in advance.
S103: Cluster the word vectors.
After the word vector of each word has been determined in step S102, the word vectors may be clustered.
The basic principle of clustering is that word vectors in the same class are highly similar, while word vectors in different classes differ substantially. Clustering can therefore be achieved by measuring the similarity between word vectors. Specifically, the similarity between two word vectors can be determined by computing the cosine of the angle between them (cosine similarity): the larger the cosine value, the greater the similarity between the word vectors; the smaller the cosine value, the smaller the similarity.
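As a small illustration of this similarity measure, the sketch below computes the cosine of the angle between two word vectors. The function name is an assumption; the two vectors are the example values the description gives for "liquid crystal" and "display".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

liquid_crystal = [0.175, 0.095, 0.185, 0.041]
display = [0.123, 0.195, 0.085, 0.441]
sim = cosine_similarity(liquid_crystal, display)
print(round(sim, 3))
```

A value close to 1 means the two vectors point in nearly the same direction; for vectors with only non-negative components, as here, the value lies between 0 and 1.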
The clustering algorithms that may be used include, but are not limited to, the Chinese Restaurant Process (CRP) algorithm, the K-means clustering algorithm, the K-medoids clustering algorithm, the CLARANS algorithm, the BIRCH algorithm, the CLIQUE algorithm, and the DBSCAN algorithm.
Clustering the word vectors yields multiple classes of word vector sets, which together form the clustering result; each class of word vector set contains one or more word vectors.
Continuing the example above, suppose the word vectors of the five words display, tablet, liquid crystal, illumination, and device are clustered into three classes of word vector sets. The first set contains the word vectors of liquid crystal, display, and device; the second contains only the word vector of tablet; and the third contains only the word vector of illumination. This shows that the word vectors of liquid crystal, display, and device are the most similar to one another and the most strongly correlated, whereas the word vectors of tablet and illumination are only weakly correlated with each other and with those of liquid crystal, display, and device. In other words, of the three classes, the words in the first class best reflect the features of the current text, and those in the second and third classes do so to a lesser degree.
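To make the grouping in this example concrete, the sketch below uses a deliberately simple stand-in: a greedy single-pass ("leader") clustering with a cosine threshold. This is not one of the algorithms named in the description and is not presented as the patent's method. The three-dimensional vectors, the threshold of 0.9, and all function names are hypothetical values picked so that the outcome mirrors the three sets described above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def leader_cluster(vectors, threshold):
    """Assign each word to the first cluster whose leader is similar enough, else open a new one."""
    clusters = []  # each cluster is a list of words; its first word acts as the leader
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Hypothetical word vectors, for illustration only.
vectors = {
    "display": [0.9, 0.1, 0.0],
    "tablet": [0.0, 1.0, 0.1],
    "liquid crystal": [0.8, 0.2, 0.1],
    "illumination": [0.1, 0.0, 1.0],
    "device": [0.85, 0.15, 0.05],
}
clusters = leader_cluster(vectors, threshold=0.9)
print(clusters)  # [['display', 'liquid crystal', 'device'], ['tablet'], ['illumination']]
```

Unlike K-means, this approach does not need the number of clusters in advance, but its result depends on the order in which the words are visited.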
S104: According to the clustering result, determine the feature words of the current text from among the words, and the weight of each feature word.
In one approach, the feature words are determined by finding, among the word vector sets, those sets whose number of word vectors exceeds a preset threshold; the words corresponding to the word vectors in those sets are used as the feature words.
Continuing the example above, suppose the preset threshold is 2. The first set contains 3 word vectors, and the second and third sets contain 1 each, so the only set whose number of word vectors exceeds the threshold of 2 is the first set. The words corresponding to its word vectors, namely liquid crystal, display, and device, are used as the feature words of the current text.
In another approach, the word vector sets are sorted in descending order of the number of word vectors they contain, and the first m sets are selected, where m is a preset value; the words corresponding to the word vectors in the selected sets are used as the feature words.
Continuing the example above, suppose m = 1. Because the first set contains 3 word vectors and the second and third sets contain 1 each, the sorted order is: first set, second set, third set. The words corresponding to the word vectors in the first (m = 1) set, namely liquid crystal, display, and device, are used as the feature words.
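Both selection rules are short enough to sketch directly. The function names are assumptions; the clusters are the three sets from the running example, written as lists of words rather than word vectors for readability.

```python
def by_threshold(clusters, threshold):
    """Feature words = words in every cluster with more than `threshold` members."""
    return [word for cluster in clusters if len(cluster) > threshold for word in cluster]

def by_top_m(clusters, m):
    """Feature words = words in the m largest clusters (ties keep their original order)."""
    ranked = sorted(clusters, key=len, reverse=True)
    return [word for cluster in ranked[:m] for word in cluster]

clusters = [["liquid crystal", "display", "device"], ["tablet"], ["illumination"]]
print(by_threshold(clusters, 2))  # ['liquid crystal', 'display', 'device']
print(by_top_m(clusters, 1))      # ['liquid crystal', 'display', 'device']
```

The threshold rule can return no feature words at all when every cluster is small, whereas the top-m rule always returns the words of m clusters if that many exist.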
The weight w_i of each feature word of the current text may be determined by formula (1-1):
w_i = log(1 + n_i / n_m)    (1-1)
where w_i is the weight of the i-th feature word of the current text, n_i is the number of times the i-th feature word occurs in the current text (hereinafter, its word frequency), and n_m is the largest of the word frequencies of the feature words.
For example, if the word frequencies of the feature words liquid crystal, display, and device are 10, 30, and 20 respectively, then display has the largest word frequency, so n_m = 30. The weight of liquid crystal is w_1 = log(1 + 10/30), the weight of display is w_2 = log(1 + 30/30), and the weight of device is w_3 = log(1 + 20/30).
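Formula (1-1) can be evaluated as in the sketch below. The description does not state the base of the logarithm; the natural logarithm is assumed here, and a different base would only scale all weights by the same factor. The function name is hypothetical.

```python
import math

def feature_weights(frequencies):
    """Apply w_i = log(1 + n_i / n_m), where n_m is the largest word frequency.

    Assumption: natural logarithm, since the source does not specify the base.
    """
    n_max = max(frequencies.values())
    return {word: math.log(1 + n / n_max) for word, n in frequencies.items()}

weights = feature_weights({"liquid crystal": 10, "display": 30, "device": 20})
for word, weight in weights.items():
    print(word, round(weight, 4))
```

With these frequencies the most frequent feature word always receives log(2), and the other weights fall between 0 and log(2).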
S105: Determine the text vector of the current text according to the word vector and weight of each feature word.
Specifically, a multi-dimensional vector made up of multiple elements is determined from the word vectors and weights of the feature words, and this multi-dimensional vector is used as the text vector of the current text. Each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.
For example, the text vector of the current text may be expressed as {<F_1:w_1>, <F_2:w_2>, ..., <F_i:w_i>, ...}, where i = 1, 2, 3, ... and F_i is the word vector of the i-th feature word.
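One way to hold such a text vector in memory is a list of (word vector, weight) pairs, as sketched below. The representation and the function name are assumptions; the word vectors and weights reuse example values that appear elsewhere in the description.

```python
def build_text_vector(feature_words, word_vectors, weights):
    """Pair each feature word's vector F_i with its weight w_i, in feature-word order."""
    return [(word_vectors[word], weights[word]) for word in feature_words]

word_vectors = {
    "liquid crystal": [0.175, 0.095, 0.185, 0.041],
    "display": [0.123, 0.195, 0.085, 0.441],
}
weights = {"liquid crystal": 0.1, "display": 0.2}
text_vector = build_text_vector(["liquid crystal", "display"], word_vectors, weights)
print(text_vector)
```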
In the method shown in Fig. 1, the words that make up the current text are determined, the word vector of each word is determined, the word vectors are clustered, the feature words of the current text and the weight of each feature word are determined according to the clustering result, and the text vector of the current text is determined according to the word vector and weight of each feature word. The words in the present invention are thus represented by word vectors. Compared with the word itself, a word vector describes the word along multiple dimensions and therefore represents its semantic information more accurately. In addition, the clustering process takes into account the semantics of the feature words within their sentences and the correlation between sentences. By clustering the word vectors to determine the feature words, the present invention therefore effectively improves the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.
Determining (for example, looking up) the word vector of each word in a preset word vector library, as described above, requires the word vector library to be preset.
Referring to Fig. 2, in the embodiments of the present invention, the method for presetting the word vector library may include the following steps:
S201: Obtain multiple historical texts.
Multiple texts may be obtained from a corpus as the historical texts. The number of texts may be in the hundreds, thousands, and so on, and is not specifically limited here.
S202: Determine the words that make up each historical text.
The words that make up each historical text are determined in a way similar to that used for the current text; for example, each historical text may be segmented by a segmentation method to obtain its words.
Optionally, to reduce the server's computation and avoid interference from certain words, each historical text may be preprocessed before it is segmented. The preprocessing may include, but is not limited to: removing HTML markup from the historical text, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, and de-duplicating the historical texts.
When de-duplicating the historical texts, a message digest of each historical text may be computed with a message digest algorithm. For example, Message-Digest Algorithm 5 (MD5) may be applied to each obtained historical text; once the MD5 value of each historical text has been obtained, only one of the historical texts that share the same MD5 value is retained, which achieves the de-duplication.
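The MD5-based de-duplication can be sketched with Python's standard `hashlib` module. The function name and the choice of UTF-8 encoding are assumptions. Note that this removes only byte-identical texts, so it works best after the other preprocessing steps have normalized the content.

```python
import hashlib

def deduplicate(texts):
    """Keep the first text for each distinct MD5 digest, preserving input order."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

history = ["liquid crystal display", "lighting device", "liquid crystal display"]
print(deduplicate(history))  # ['liquid crystal display', 'lighting device']
```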
In practical application scenarios, the feature words used to represent a text are usually words with substantive meaning. Optionally, therefore, after each historical text has been segmented, the words of a specified type that make up each historical text may be determined, where the words of the specified type may be words with substantive meaning. This further reduces the server's computation.
S203: Represent each word in the historical texts as a multi-dimensional vector, and use that vector as the initial word vector of the word. The word2vec tool may again be used to determine the word vector of each word; the details are not repeated here.
S204: Apply digital fingerprint processing to each initial word vector to obtain the fingerprinted word vector.
Digital fingerprint processing of an initial word vector means converting it to a digital form, for example converting the initial word vector into a string of "0"s and "1"s of a certain length (such as 64 bits). The embodiment may use a locality-sensitive hashing (LSH) algorithm to convert a word vector into such a string.
For example, if the word vector of "liquid crystal" is {0.175, 0.095, 0.185, 0.041}, the fingerprinted word vector obtained by digital fingerprint processing may be <000000000010>;
if the word vector of "display" is {0.123, 0.195, 0.085, 0.441}, the fingerprinted word vector obtained by digital fingerprint processing may be <100101010010>.
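The description does not say which LSH scheme produces these bit strings. As one possible illustration, the sketch below uses sign random projection, a common LSH family for cosine similarity: each bit records on which side of a random hyperplane the vector falls. The function names, the seed, and the 12-bit length are assumptions, and the bits produced will not match the example strings above.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplane normals of dimension `dim` (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def fingerprint(vector, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on its non-negative side, else '0'."""
    return "".join(
        "1" if sum(v * h for v, h in zip(vector, plane)) >= 0 else "0"
        for plane in hyperplanes
    )

planes = make_hyperplanes(dim=4, bits=12)
fp = fingerprint([0.175, 0.095, 0.185, 0.041], planes)
print(fp)
```

The same hyperplanes must be reused for every word so that the resulting fingerprints are comparable; vectors separated by a small angle then tend to differ in few bits.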
S205: Form the preset word vector library from the fingerprinted word vectors.
Because the preset word vector library is formed from fingerprinted word vectors, the word vectors found in the library for the words of the current text are fingerprinted word vectors, and it is these fingerprinted word vectors that are clustered. When the similarity between word vectors is computed during clustering, the Hamming distance between two word vectors can be calculated and used to determine their similarity: the larger the Hamming distance between two word vectors, the smaller their similarity; conversely, the smaller the Hamming distance, the greater their similarity. Clustering the digitized word vectors greatly reduces the server's computation and effectively improves its processing efficiency.
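The Hamming distance between two fingerprints is simply the number of positions at which their bits differ. The sketch below applies it to the two example fingerprints given for "liquid crystal" and "display"; the function name is an assumption.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming_distance("000000000010", "100101010010")
print(distance)  # 4
```

When fingerprints are stored as integers rather than strings, the same quantity can be obtained by XOR-ing the two values and counting the set bits, which is part of why this comparison is cheap.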
To further prevent word vectors of words without substantive meaning from remaining among the initial word vectors, the initial word vectors may be screened. Specifically, according to attributes such as part of speech, word frequency, and a stop word list, the word vectors of words without substantive meaning are removed from the initial word vectors and only those of words with substantive meaning are retained. This effectively reduces the interference from words without substantive meaning and, in turn, the server's computation.
After the text vector of the current text has been determined from the word vectors and weights of the feature words, text processing such as text retrieval, text classification, text analysis, and text similarity calculation can be carried out based on the text vector.
To reduce the server's computation during text processing and thereby improve its processing efficiency, the method further includes: after the text vector of the current text has been determined according to the word vector and weight of each feature word, applying digital fingerprint processing to the text vector of the current text.
This digital fingerprint processing is again a digitization step. Optionally, simhash, one of the LSH algorithms, may be used to apply digital fingerprint processing to the text vector.
For example, suppose the digitized word vectors of the feature words liquid crystal, display, and device are <010>, <001>, and <110>, and their weights are 0.1, 0.2, and 0.4 respectively. The text vector is then expressed as {<liquid crystal word vector:0.1>, <display word vector:0.2>, <device word vector:0.4>}.
This text vector is digitized as follows:
In each word vector, "0" is replaced with "-1" and "1" is kept as "1", and each word vector is multiplied by its weight to obtain a new vector. The first values of the new vectors are summed to obtain a first value, the second values are summed to obtain a second value, and the third values are summed to obtain a third value.
Among the first to third values, each negative value is replaced with 0 and each positive value is replaced with 1; the resulting vector of 0s and 1s is the digitized vector.
In this example, replacing "0" with "-1" in <010>, <001>, and <110>, keeping "1" as "1", and multiplying each word vector by its weight gives the following vectors:
word vector <010> corresponds to vector 1 <-0.1, 0.1, -0.1>;
word vector <001> corresponds to vector 2 <-0.2, -0.2, 0.2>;
word vector <110> corresponds to vector 3 <0.4, 0.4, -0.4>.
Adding the first elements of vectors 1 to 3 (-0.1, -0.2, and 0.4) gives a first value of 0.1, which is positive.
Adding the second elements of vectors 1 to 3 (0.1, -0.2, and 0.4) gives a second value of 0.3, which is positive.
Adding the third elements of vectors 1 to 3 (-0.1, 0.2, and -0.4) gives a third value of -0.3, which is negative.
Replacing each negative value with 0 and each positive value with 1 among the first to third values gives the vector <110>, which is the digitized vector.
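The worked example above can be reproduced with the short sketch below. The function name and the input format (bit string, weight) are assumptions; a column sum of exactly zero is mapped to "0" here, a case the description does not cover.

```python
def simhash(weighted_fingerprints):
    """Combine weighted bit strings into one fingerprint by signed, weighted voting per position."""
    length = len(weighted_fingerprints[0][0])
    totals = [0.0] * length
    for bits, weight in weighted_fingerprints:
        for i, bit in enumerate(bits):
            # A '1' bit votes +weight, a '0' bit votes -weight.
            totals[i] += weight if bit == "1" else -weight
    # Positive totals become '1'; negative totals (and, by assumption, zero) become '0'.
    return "".join("1" if total > 0 else "0" for total in totals)

text_fingerprint = simhash([("010", 0.1), ("001", 0.2), ("110", 0.4)])
print(text_fingerprint)  # 110
```

Because the heaviest feature word (device, weight 0.4) outweighs the other two combined, the result equals its fingerprint, which matches the outcome of the example.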
It is document representation method provided in an embodiment of the present invention above, is based on same thinking, the embodiment of the present invention also carries
A kind of text representation device is supplied, as shown in figure 3, including:
First determining module 31, for determining each word for constituting current text;
Second determining module 32, the term vector for determining each word;
Cluster module 33, for being clustered to each term vector;
Third determining module 34, for according to cluster result, the Feature Words of current text being determined in each word and are somebody's turn to do
The weight of Feature Words;
4th determining module 35, the text vector for determining current text according to the term vector and weight of each Feature Words.
Optionally, described device further includes:
Processing module 36 carries out digital finger-print processing for the text vector to the current text.
Optionally, the second determining module 32 is specifically used for,
In preset term vector library, term vector corresponding with each word is determined.
Optionally, described device further includes:
Default term vector library module 37, for presetting term vector library;
The default term vector library module 37 is specifically used for, and obtains multiple history texts, determines and constitutes each history text
Each word in the history text is expressed as a multi-C vector, using the multi-C vector as institute's predicate by multiple words
Each initial term vector is carried out digital finger-print processing by the initial term vector of language respectively, obtains digital finger-print treated term vector,
It is constituted using the digital finger-print treated term vector and presets term vector library.
Optionally, the default term vector library module 37 is specifically used for, and determines the multiple specified classes for constituting each history text
The word of type.
Optionally, first determining module 31 is specifically used for, and is segmented to the current text, obtains multiple words
Language determines the word of specified type in each word, duplicate removal processing is carried out to the word of the specified type, at duplicate removal
Each word after reason is as each word for constituting current text.
Optionally, the cluster result includes multiple classes of term vector sets, and each class of term vector set includes several term vectors;
The third determining module 34 is specifically configured to: among the term vector sets, determine the term vector sets in which the number of term vectors exceeds a preset threshold; or, sort the term vector sets in descending order of the number of term vectors they contain and determine the first m term vector sets, where m is a preset value; and use the words corresponding to the term vectors in the determined term vector sets as the feature words.
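The two selection rules above can be sketched as follows. In this hedged example each cluster is given as a list of words, with each word standing in for its term vector; the function name `select_feature_words` and its parameters are assumptions made for illustration.

```python
from typing import List, Optional


def select_feature_words(clusters: List[List[str]],
                         threshold: Optional[int] = None,
                         top_m: Optional[int] = None) -> List[str]:
    """Pick feature words from clusters by size threshold or by the m largest clusters."""
    if (threshold is None) == (top_m is None):
        raise ValueError("specify exactly one of threshold or top_m")
    if threshold is not None:
        # Rule 1: keep clusters holding more than `threshold` term vectors.
        chosen = [c for c in clusters if len(c) > threshold]
    else:
        # Rule 2: sort by size, largest first, and keep the first m clusters.
        chosen = sorted(clusters, key=len, reverse=True)[:top_m]
    return [word for cluster in chosen for word in cluster]


clusters = [["bank", "loan", "credit"], ["river"], ["match", "goal"]]
by_threshold = select_feature_words(clusters, threshold=1)
by_top_m = select_feature_words(clusters, top_m=1)
```

With the sample clusters, the threshold rule keeps the two clusters that contain more than one word, while the top-m rule with m = 1 keeps only the largest cluster.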
Optionally, the fourth determining module 35 is specifically configured to determine, according to the term vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and to use this multi-dimensional vector as the text vector of the current text; each element of the multi-dimensional vector consists of the term vector of one feature word and the weight of that feature word.
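One possible reading of this structure, shown only as a sketch, is a list with one (term vector, weight) pair per feature word. The type aliases, the function name `build_text_vector`, and the sample weights below are hypothetical.

```python
from typing import Dict, List, Tuple

TermVector = List[float]
Element = Tuple[TermVector, float]  # (term vector, weight) for one feature word


def build_text_vector(feature_words: List[str],
                      term_vectors: Dict[str, TermVector],
                      weights: Dict[str, float]) -> List[Element]:
    """Return one (term vector, weight) element per feature word, in order."""
    return [(term_vectors[w], weights[w]) for w in feature_words]


term_vectors = {"bank": [1.0, 0.1], "loan": [0.9, 0.2]}
weights = {"bank": 0.69, "loan": 1.10}  # illustrative values only
text_vector = build_text_vector(["bank", "loan"], term_vectors, weights)
```

Keeping the weight next to its term vector, rather than folding it into the vector, leaves later processing free to combine the two in whatever way it needs.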
The embodiments of the present invention provide a text representation method and device. The method determines the words that constitute the current text, determines the term vector of each word, clusters the term vectors, determines the feature words of the current text and the weight of each feature word according to the cluster result, and determines the text vector of the current text according to the term vector and weight of each feature word. In the present invention, words are represented by term vectors. Compared with the word itself, a term vector describes the word along multiple dimensions and can therefore represent its semantic information more accurately. In addition, the clustering process already takes into account the semantics of the words within their sentences and the correlation between sentences. Determining the feature words by clustering the term vectors can therefore effectively improve the accuracy of the feature words determined for the current text, and in turn the accuracy of text processing.
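The clustering step summarized above is not tied to a particular algorithm in this passage. The following Python sketch assumes a simple single-pass clustering by cosine similarity to a running centroid; `cosine`, `cluster_words`, and the `min_sim` threshold are illustrative assumptions rather than the patented procedure.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_words(term_vectors: Dict[str, List[float]],
                  min_sim: float = 0.8) -> List[List[str]]:
    """Assign each word to the most similar cluster centroid, or open a new cluster."""
    clusters: List[List[str]] = []
    centroids: List[List[float]] = []
    for word, vec in term_vectors.items():
        best, best_sim = -1, min_sim
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            clusters.append([word])
            centroids.append(list(vec))
        else:
            clusters[best].append(word)
            n = len(clusters[best])
            # Update the running mean of the cluster.
            centroids[best] = [(c * (n - 1) + x) / n
                               for c, x in zip(centroids[best], vec)]
    return clusters


vectors = {
    "bank": [1.0, 0.1],
    "loan": [0.9, 0.2],
    "river": [0.1, 1.0],
    "credit": [1.0, 0.0],
}
clusters = cluster_words(vectors)
```

Words whose term vectors point in similar directions end up in the same set, so large sets indicate themes that recur in the text.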
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.
Claims (10)
1. A text representation method, characterized by comprising:
Determining the words that constitute the current text;
Determining the term vector of each word;
Clustering the term vectors to obtain multiple classes of term vector sets;
Determining, according to the cluster result, the feature words of the current text from among the words and the weight of each feature word, wherein the weight of a feature word is the logarithm of the ratio of (a) the sum of the frequency with which the feature word occurs in the current text and the maximum frequency in the text to (b) the frequency with which the feature word occurs in the current text;
Determining the text vector of the current text according to the term vector and weight of each feature word.
2. The method according to claim 1, characterized in that the method further comprises:
Performing digital fingerprint processing on the text vector of the current text.
3. The method according to claim 1, characterized in that determining the term vector of each word specifically comprises: determining, in a preset term vector library, the term vector corresponding to each word;
Wherein the method of presetting the term vector library specifically comprises:
Obtaining multiple history texts;
Determining the multiple words that constitute each history text;
Representing each word in the history texts as a multi-dimensional vector, and using the multi-dimensional vector as the initial term vector of the word;
Performing digital fingerprint processing on each initial term vector to obtain a fingerprinted term vector;
Building the preset term vector library from the fingerprinted term vectors.
4. The method according to claim 1, characterized in that the cluster result includes multiple classes of term vector sets, and each class of term vector set includes several term vectors;
Determining, according to the cluster result, the feature words of the current text from among the words specifically comprises:
Among the term vector sets, determining the term vector sets in which the number of term vectors exceeds a preset threshold; or, sorting the term vector sets in descending order of the number of term vectors they contain and determining the first m term vector sets, where m is a preset value;
Using the words corresponding to the term vectors in the determined term vector sets as the feature words.
5. The method according to claim 1, characterized in that determining the words that constitute the current text specifically comprises: segmenting the current text to obtain multiple words; determining the words of a specified type among them; deduplicating the words of the specified type; and using the deduplicated words as the words that constitute the current text;
And/or
Determining the text vector of the current text according to the term vector and weight of each feature word specifically comprises: determining, according to the term vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and using the multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector consists of the term vector of one feature word and the weight of that feature word.
6. A text representation device, characterized by comprising:
A first determining module, configured to determine the words that constitute the current text;
A second determining module, configured to determine the term vector of each word;
A cluster module, configured to cluster the term vectors to obtain multiple classes of term vector sets;
A third determining module, configured to determine, according to the cluster result, the feature words of the current text from among the words and the weight of each feature word, wherein the weight of a feature word is the logarithm of the ratio of (a) the sum of the frequency with which the feature word occurs in the current text and the maximum frequency in the text to (b) the frequency with which the feature word occurs in the current text;
A fourth determining module, configured to determine the text vector of the current text according to the term vector and weight of each feature word.
7. The device according to claim 6, characterized in that the device further comprises:
A processing module, configured to perform digital fingerprint processing on the text vector of the current text.
8. The device according to claim 6, characterized in that the second determining module is specifically configured to determine, in a preset term vector library, the term vector corresponding to each word;
The device further comprises: a preset term vector library module, configured to preset the term vector library;
The preset term vector library module is specifically configured to: obtain multiple history texts; determine the multiple words that constitute each history text; represent each word in the history texts as a multi-dimensional vector and use the multi-dimensional vector as the initial term vector of the word; perform digital fingerprint processing on each initial term vector to obtain a fingerprinted term vector; and build the preset term vector library from the fingerprinted term vectors.
9. The device according to claim 6, characterized in that the cluster result includes multiple classes of term vector sets, and each class of term vector set includes several term vectors;
The third determining module is specifically configured to: among the term vector sets, determine the term vector sets in which the number of term vectors exceeds a preset threshold; or, sort the term vector sets in descending order of the number of term vectors they contain and determine the first m term vector sets, where m is a preset value; and use the words corresponding to the term vectors in the determined term vector sets as the feature words.
10. The device according to claim 6, characterized in that the first determining module is specifically configured to: segment the current text to obtain multiple words; determine the words of a specified type among them; deduplicate the words of the specified type; and use the deduplicated words as the words that constitute the current text; and/or
The fourth determining module is specifically configured to determine, according to the term vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and to use the multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector consists of the term vector of one feature word and the weight of that feature word.
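As an illustrative aside, the feature word weight recited in claims 1 and 6 can be read as log((f + f_max) / f), where f is the frequency of the feature word in the current text and f_max is the largest frequency of any word in that text. The Python sketch below assumes raw counts as frequencies and the natural logarithm, since the claims do not state a base; `feature_weights` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List


def feature_weights(tokens: List[str], feature_words: List[str]) -> Dict[str, float]:
    """Weight of each feature word: log((f + f_max) / f), natural log assumed."""
    counts = Counter(tokens)
    max_freq = max(counts.values())  # highest frequency of any word in the text
    return {
        w: math.log((counts[w] + max_freq) / counts[w])
        for w in feature_words
        if counts[w] > 0  # a word that never occurs has no defined weight
    }


tokens = ["a", "a", "a", "b"]
weights = feature_weights(tokens, ["a", "b"])
# "a": log((3 + 3) / 3) = log 2;  "b": log((1 + 3) / 1) = log 4
```

Because the formula is a ratio, using relative frequencies instead of raw counts gives the same weights. The most frequent word always receives log 2, and rarer feature words receive larger weights.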
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510096570.XA CN104778158B (en) | 2015-03-04 | 2015-03-04 | A kind of document representation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510096570.XA CN104778158B (en) | 2015-03-04 | 2015-03-04 | A kind of document representation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104778158A (en) | 2015-07-15 |
CN104778158B (en) | 2018-07-17 |
Family
ID=53619632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510096570.XA Active CN104778158B (en) | 2015-03-04 | 2015-03-04 | A kind of document representation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104778158B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345605A (en) * | 2017-01-24 | 2018-07-31 | 苏宁云商集团股份有限公司 | A kind of text search method and device |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095444A (en) * | 2015-07-24 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Information acquisition method and device |
CN106484681B (en) | 2015-08-25 | 2019-07-09 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and electronic equipment generating candidate translation |
CN106484682B (en) | 2015-08-25 | 2019-06-25 | 阿里巴巴集团控股有限公司 | Machine translation method, device and electronic equipment based on statistics |
CN105426354B (en) * | 2015-10-29 | 2019-03-22 | 杭州九言科技股份有限公司 | The fusion method and device of a kind of vector |
CN105426356B (en) * | 2015-10-29 | 2019-05-21 | 杭州九言科技股份有限公司 | A kind of target information recognition methods and device |
CN106446264B (en) * | 2016-10-18 | 2019-08-27 | 哈尔滨工业大学深圳研究生院 | Document representation method and system |
CN106503184B (en) * | 2016-10-24 | 2019-09-20 | 海信集团有限公司 | Determine the method and device of the affiliated class of service of target text |
CN107357895B (en) * | 2017-01-05 | 2020-05-19 | 大连理工大学 | Text representation processing method based on bag-of-words model |
CN107247704B (en) * | 2017-06-09 | 2020-09-08 | 阿里巴巴集团控股有限公司 | Word vector processing method and device and electronic equipment |
CN109408797A (en) * | 2017-08-18 | 2019-03-01 | 普天信息技术有限公司 | A kind of text sentence vector expression method and system |
US11823013B2 (en) * | 2017-08-29 | 2023-11-21 | International Business Machines Corporation | Text data representation learning using random document embedding |
CN107862620A (en) * | 2017-12-11 | 2018-03-30 | 四川新网银行股份有限公司 | A kind of similar users method for digging based on social data |
CN108304480B (en) * | 2017-12-29 | 2020-08-04 | 东软集团股份有限公司 | Text similarity determination method, device and equipment |
CN110362815A (en) * | 2018-04-11 | 2019-10-22 | 北京京东尚科信息技术有限公司 | Text vector generation method and device |
CN109033307B (en) * | 2018-07-17 | 2021-08-31 | 华北水利水电大学 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
CN109101620B (en) * | 2018-08-08 | 2022-07-05 | 阿里巴巴(中国)有限公司 | Similarity calculation method, clustering method, device, storage medium and electronic equipment |
CN110874528B (en) * | 2018-08-10 | 2020-11-10 | 珠海格力电器股份有限公司 | Text similarity obtaining method and device |
CN109710845A (en) * | 2018-12-25 | 2019-05-03 | 百度在线网络技术(北京)有限公司 | Information recommended method, device, computer equipment and readable storage medium storing program for executing |
CN110083828A (en) * | 2019-03-29 | 2019-08-02 | 珠海远光移动互联科技有限公司 | A kind of Text Clustering Method and device |
CN110147449A (en) * | 2019-05-27 | 2019-08-20 | 中国联合网络通信集团有限公司 | File classification method and device |
CN110309515B (en) * | 2019-07-10 | 2023-08-11 | 北京奇艺世纪科技有限公司 | Entity identification method and device |
CN111428180B (en) * | 2020-03-20 | 2022-02-08 | 创优数字科技(广东)有限公司 | Webpage duplicate removal method, device and equipment |
CN111913912A (en) * | 2020-07-16 | 2020-11-10 | 北京字节跳动网络技术有限公司 | File processing method, file matching device, electronic equipment and medium |
CN112527971A (en) * | 2020-12-25 | 2021-03-19 | 华戎信息产业有限公司 | Method and system for searching similar articles |
CN113536763B (en) * | 2021-07-20 | 2024-11-05 | 北京中科闻歌科技股份有限公司 | Information processing method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101620596A (en) * | 2008-06-30 | 2010-01-06 | 东北大学 | Multi-document auto-abstracting method facing to inquiry |
CN101853486A (en) * | 2010-06-08 | 2010-10-06 | 华中科技大学 | Image copying detection method based on local digital fingerprint |
CN103049569A (en) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Text similarity matching method on basis of vector space model |
CN103744905A (en) * | 2013-12-25 | 2014-04-23 | 新浪网技术(中国)有限公司 | Junk mail judgment method and device |
CN104008090A (en) * | 2014-04-29 | 2014-08-27 | 河海大学 | Multi-subject extraction method based on concept vector model |
CN104182388A (en) * | 2014-07-21 | 2014-12-03 | 安徽华贞信息科技有限公司 | Semantic analysis based text clustering system and method |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345605A (en) * | 2017-01-24 | 2018-07-31 | 苏宁云商集团股份有限公司 | A kind of text search method and device |
CN108345605B (en) * | 2017-01-24 | 2022-04-05 | 苏宁易购集团股份有限公司 | Text search method and device |
Also Published As
Publication number | Publication date |
---|---|
CN104778158A (en) | 2015-07-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20230315. Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193. Patentee after: Sina Technology (China) Co.,Ltd. Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor. Patentee before: Sina.com Technology (China) Co.,Ltd. |