CN103123685B - Text mode recognition method - Google Patents
- Publication number
- CN103123685B
- Authority
- CN
- China
- Prior art keywords
- text
- weight
- keyword
- direct graph
- manifold edges
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text mode recognition method comprising: scanning the original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; simplifying the weighted directed graph with multiple edges to a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts. Compared with conventional methods, this method preserves more, and more effective, feature information from the original text file, so that better results are obtained in text classification and text similarity calculation.
Description
[technical field]
The present invention relates to the field of text recognition, and in particular to a text mode recognition method.
[background technology]
With the development of networks and the emergence of digital libraries, how to quickly obtain useful information from massive amounts of text has become an important research topic in information processing and pattern recognition. If texts can be automatically classified and labeled under a given taxonomy according to their content, and different texts can be analyzed for similarity, people can better organize and mine text information.
Prior-art implementations: keywords have long been used as the feature items of a text. Based on keyword occurrence frequency, texts are usually classified automatically with methods such as decision trees, neural networks, Bayesian methods or support vector machines. Similarity comparison between different texts is likewise usually based on keyword occurrence frequency.
Keyword occurrence frequency alone can, to some extent, divide texts into coarse broad categories, but it does not perform well when used to measure fine-grained similarity between different texts. The main reasons are: (1) relying only on keyword occurrence frequency ignores the dependencies that may exist between keywords; and (2) conventional methods do not use the structural information of the text. Both directly affect text classification results and text similarity comparison results.
It is therefore necessary to develop an improved text mode recognition method that overcomes these problems.
[summary of the invention]
One technical problem to be solved by the present invention is to provide a text mode recognition method that preserves more, and more effective, feature information from the original text file, so that better results are obtained in text classification and text similarity calculation.
To solve this problem, according to one aspect of the present invention, a text mode recognition method is provided, comprising: scanning the original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; simplifying the weighted directed graph with multiple edges to a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts.
Further, let the keyword set be $K=\{k_1, k_2, \ldots, k_n\}$ and let $f_i$ be the number of times keyword $k_i$ occurs in the text; $F=[f_1, f_2, \ldots, f_n]$ then represents the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1.
Further, each node in the weighted directed graph with multiple edges represents a keyword $k_i$. If keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$ in the text, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to the graph, and the weight of $k_i k_j$ is the distance between $p_i$ and $p_j$. If $k_i$ and $k_j$ occur multiple times in the text, their occurrences at different positions are mapped in the same way, producing multiple edges between the two nodes, where $1 \le j \le n$.
Further, simplifying the weighted directed graph with multiple edges to a simple weighted directed graph comprises:
Using the node set of the weighted directed graph with multiple edges as the node set of the simple weighted directed graph;
Denoting the directed edge from node $k_i$ to node $k_j$ in the simple weighted directed graph by $k_i k_j$, the weight $w(k_i, k_j)$ of $k_i k_j$ is:
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in the weighted directed graph with multiple edges, and the edge-weight symbol denotes the weight of directed edge $e$ in the weighted directed graph with multiple edges;
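Further, the matrix $W$ representing the simple weighted directed graph is the $n \times n$ matrix whose entry in row $i$ and column $j$ is $w(k_i, k_j)$:
$$W = \begin{bmatrix} w(k_1,k_1) & \cdots & w(k_1,k_n) \\ \vdots & \ddots & \vdots \\ w(k_n,k_1) & \cdots & w(k_n,k_n) \end{bmatrix}$$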
Further, the text feature vector $R(D)$ to which the text is mapped is:
$R(D)=[f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)]$.
Further, suppose there are texts $D_1, \ldots, D_m$ with corresponding text feature vectors $R(D_1), \ldots, R(D_m)$; the text mode recognition method further comprises:
Calculating the similarity between any two texts $D_x$ and $D_y$ using the following formula:
where $1 \le x, y \le m$.
Compared with the prior art, the present invention establishes a weighted directed graph model to describe text information. The model uses not only the occurrence frequency of keywords in the text but also the positions of the keywords in the text and the distances between them, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector. This vector contains the frequency information of the keywords and also implicitly carries the structural information of the text. Computing the similarity between different texts thus reduces to computing the similarity between their text feature vectors. Compared with conventional methods, the invention preserves more, and more effective, feature information from the original text file, so that better results are obtained in text classification and text similarity calculation.
Other objects, features and advantages of the present invention are described in detail in the specific embodiments below with reference to the accompanying drawings.
[brief description of the drawings]
The present invention will be more readily understood from the following detailed description taken together with the accompanying drawings, in which like reference numerals denote like structural components, and in which:
Fig. 1 is a schematic diagram of the text mode recognition method in one embodiment of the present invention;
Fig. 2 shows an example of a text;
Fig. 3 shows the relative position information of each keyword in the text shown in Fig. 2;
Fig. 4 shows the weighted directed graph with multiple edges of the text shown in Fig. 2; and
Fig. 5 shows the simple weighted directed graph obtained by simplifying the weighted directed graph with multiple edges shown in Fig. 4.
[detailed description of the embodiments]
To make the above objects, features and advantages of the present invention more apparent, the invention is described in further detail below with reference to the drawings and specific embodiments.
The detailed description of the present invention is presented mainly in terms of procedures, steps, logic blocks, processes or other symbolic descriptions that directly or indirectly simulate the operation of the technical solution of the invention. Those skilled in the art use such descriptions and representations to effectively convey the substance of their work to others skilled in the art.
Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one implementation of the invention. The appearances of "in one embodiment" in various places in this specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. In addition, the order of modules in the methods, flowcharts or functional block diagrams representing one or more embodiments does not inherently indicate any particular order, nor does it limit the invention.
Fig. 1 is a schematic flowchart of the text mode recognition method 100 in one embodiment of the present invention. The text mode recognition method 100 comprises the following steps.
Step 110: scan the original text file line by line and record the number of occurrences and the positions of each keyword in the text.
If a keyword occurs multiple times in the text, the specific or relative position of every occurrence is recorded, together with the number of times each keyword occurs.
Let the keyword set be $K=\{k_1, k_2, \ldots, k_n\}$ and let $f_i$ be the number of occurrences of keyword $k_i$; $F=[f_1, f_2, \ldots, f_n]$ can then represent the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1.
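Step 110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `scan_keywords`, the regular-expression tokenization, case-insensitive exact matching, and the use of a running 0-based word offset as the "position" are all choices made here for the example.

```python
import re

def scan_keywords(text, keywords):
    """Scan text line by line; return (F, positions).

    F[i] is the occurrence count of keywords[i]; positions[i] lists the
    0-based word offsets (assumed position measure) of keywords[i].
    """
    index = {kw.lower(): i for i, kw in enumerate(keywords)}
    positions = [[] for _ in keywords]
    offset = 0  # running word offset carried across lines
    for line in text.splitlines():
        for word in re.findall(r"[A-Za-z0-9]+", line):
            i = index.get(word.lower())
            if i is not None:
                positions[i].append(offset)
            offset += 1
    F = [len(p) for p in positions]
    return F, positions

# Hypothetical two-line sample text, not the text of Fig. 2.
sample = "The bank opened an account.\nTransfer the fund to the account."
F, positions = scan_keywords(sample, ["Bank", "Account", "Fund", "Transfer"])
print(F)          # occurrence counts, in keyword order
print(positions)  # word offsets of every occurrence
```

Keeping every position, rather than only the count, is what allows the later steps to derive distances between keyword occurrences.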
Step 120: map the text to a weighted directed graph with multiple edges, $G_m$.
Each node in the weighted directed graph $G_m$ represents a keyword $k_i$, so $G_m$ has $n$ nodes in total. If keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$ in the text, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to $G_m$, with weight equal to the distance between $p_i$ and $p_j$. If $k_i$ and $k_j$ occur multiple times in the text, the same rule is applied to their occurrences at different positions, which yields multiple edges in $G_m$, where $1 \le j \le n$. A keyword that occurs more than once therefore corresponds to multiple positions.
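The edge rule of step 120 can be sketched as below. This is an illustration only: it assumes that every ordered pair of occurrences (an earlier one and a later one) contributes one edge, including pairs of occurrences of the same keyword, and that distance is the difference of word offsets. The name `build_multigraph` and the dictionary-of-lists representation are hypothetical.

```python
from collections import defaultdict

def build_multigraph(positions):
    """Return E where E[(i, j)] lists the weights of all parallel edges k_i -> k_j.

    positions[i] holds the word offsets at which keyword k_i occurs.
    """
    # Flatten to (position, keyword index) pairs in reading order.
    occurrences = sorted((p, i) for i, plist in enumerate(positions) for p in plist)
    E = defaultdict(list)
    for a, (p_i, i) in enumerate(occurrences):
        for p_j, j in occurrences[a + 1:]:
            if p_j > p_i:  # edge only from an earlier to a later occurrence
                E[(i, j)].append(p_j - p_i)  # weight = distance between positions
    return dict(E)

# Hypothetical positions for four keywords (indices 0..3).
positions = [[1], [4, 10], [7], [5]]
E = build_multigraph(positions)
print(E[(0, 1)])  # two parallel edges from keyword 0 to keyword 1
```

Storing the parallel edges as a list per node pair keeps the multigraph explicit, so any later rule for collapsing them can be applied without rescanning the text.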
Step 130: simplify the weighted directed graph with multiple edges, $G_m$, to a simple weighted directed graph $G_s$.
Let $E_{ij}$ be the set of edges from node $k_i$ to node $k_j$ (that is, between keywords $k_i$ and $k_j$ in the text) in the weighted directed graph $G_m$ obtained in step 120.
$G_s$ is constructed as follows:
The node set of $G_m$ is used as the node set of $G_s$;
The directed edge from node $k_i$ to node $k_j$ in $G_s$ is denoted $k_i k_j$, and its weight $w(k_i, k_j)$ is defined as follows:
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in $G_m$, and the edge-weight symbol denotes the weight of directed edge $e$ in $G_m$;
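The simplification of step 130 collapses each set $E_{ij}$ of parallel edges into a single weight. Because the weight formula itself does not survive in this text, the sketch below takes the combining rule as a parameter. The rule used in the example, a sum of reciprocal distances, is a hypothetical stand-in chosen only so the code runs; it is not claimed to be the patent's formula. The names `simplify` and `sum_of_reciprocals` are likewise assumptions.

```python
def simplify(E, n, combine):
    """Collapse parallel edges into an n x n matrix W.

    W[i][j] = combine(weights of all edges k_i -> k_j), or 0.0 if there are none.
    """
    W = [[0.0] * n for _ in range(n)]
    for (i, j), weights in E.items():
        W[i][j] = combine(weights)
    return W

def sum_of_reciprocals(weights):
    # Hypothetical stand-in, NOT the patent's formula: closer pairs count more.
    return sum(1.0 / d for d in weights)

# Hypothetical multigraph over three keywords.
E = {(0, 1): [3, 9], (1, 1): [6], (1, 2): [3]}
W = simplify(E, 3, sum_of_reciprocals)
for row in W:
    print(row)
```

The nested list `W` is already the matrix form of the simple graph, so the same structure serves the matrix representation described in the following step.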
Step 140: describe the simple weighted directed graph $G_s$ by a matrix $W$.
Step 150: for any text $D$, map the text file $D$ to a text feature vector $R(D)$ according to the resulting matrix $W$ and the recorded keyword occurrence counts $F$.
$R(D)=[f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)]$
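The mapping of step 150 amounts to concatenating $F$ with the rows of $W$ in order. A minimal sketch follows; the function name `feature_vector` and the input values are hypothetical.

```python
def feature_vector(F, W):
    """Build R(D) = [f_1..f_n, w(k_1,k_1)..w(k_1,k_n), ..., w(k_n,k_1)..w(k_n,k_n)]."""
    n = len(F)
    if len(W) != n or any(len(row) != n for row in W):
        raise ValueError("W must be an n x n matrix matching len(F)")
    # Counts first, then W flattened row by row.
    return list(F) + [w for row in W for w in row]

# Hypothetical two-keyword example.
R = feature_vector([1, 2], [[0.0, 0.5], [0.25, 0.1]])
print(R)
```

The resulting vector always has $n + n^2$ entries, so vectors from different texts are directly comparable as long as the same keyword set and keyword order are used.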
Repeating steps 110 to 150 above yields the text feature vectors of all texts. Suppose the texts are $D_1, \ldots, D_m$; the corresponding text feature vectors are then $R(D_1), \ldots, R(D_m)$.
Let $M$ denote the matrix formed by the feature vectors of all texts.
$M$ is normalized to obtain a new matrix.
The text mode recognition method of the present invention may further comprise:
Step 160: calculate the similarity between any two texts $D_x$ and $D_y$ using the following formula:
where $1 \le x, y \le m$.
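Neither the normalization of $M$ nor the similarity formula survives in this text, so the following sketch substitutes two common choices and labels them as assumptions: each column of $M$ is scaled by its largest absolute value, and similarity is measured as the cosine of the angle between normalized rows. The names `normalize_columns` and `cosine_similarity` and the sample data are hypothetical.

```python
import math

def normalize_columns(M):
    """Scale each column of M by its largest absolute value (assumed scheme)."""
    cols = list(zip(*M))
    scale = [max(abs(v) for v in c) or 1.0 for c in cols]  # avoid dividing by 0
    return [[v / s for v, s in zip(row, scale)] for row in M]

def cosine_similarity(u, v):
    """Cosine similarity (assumed measure, not confirmed by the source)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical feature vectors of three texts, one per row.
M = [[1, 2, 0.0, 0.5],
     [2, 4, 0.0, 1.0],
     [0, 1, 0.3, 0.0]]
M_norm = normalize_columns(M)
sim_01 = cosine_similarity(M_norm[0], M_norm[1])
sim_02 = cosine_similarity(M_norm[0], M_norm[2])
print(sim_01, sim_02)
```

Normalizing before comparison keeps the raw occurrence counts, which can be much larger than the edge weights, from dominating the similarity value.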
One of the benefits and advantages of the present invention is that it establishes a weighted directed graph model to describe text information. The model uses not only the occurrence frequency of keywords in the text but also the positions of the keywords in the text and the distances between them, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector. This vector contains the frequency information of the keywords and also implicitly carries the structural information of the text. Computing the similarity between different texts thus reduces to computing the similarity between their text feature vectors. Compared with conventional methods, the invention preserves more, and more effective, feature information from the original text file, so that better results are obtained in text classification and text similarity calculation.
Fig. 2 shows an example of a text, for which the keyword set used is {Bank, Account, Fund, Transfer}. The recorded number of occurrences of each keyword in the text shown in Fig. 2 is $F=[f_1=1, f_2=2, f_3=2, f_4=2]$, where $f_1$ is the number of occurrences of Bank, $f_2$ of Account, $f_3$ of Fund, and $f_4$ of Transfer. The recorded relative position information of the keywords is shown in Fig. 3; the relative position is the word distance between two adjacent keywords. For example, the first keyword to occur, Bank, and the second keyword to occur, Fund, are 12 words apart, which is represented by 12. Fig. 4 shows the weighted directed graph with multiple edges, $G_m$, of the text shown in Fig. 2. Fig. 5 shows the simple weighted directed graph $G_s$ obtained by simplifying the weighted directed graph with multiple edges $G_m$ shown in Fig. 4.
The matrix $W$ describing the simple weighted directed graph $G_s$ is:
The text feature vector of the text shown in Fig. 2 is:
V = [1, 2, 2, 2, 0, 0.0995, 0.0705, 0.0459, 0, 0.0200, 0.3848, 0.5668, 0, 0.0227, 0.0204, 0.0884, 0, 0.0345, 0.3627, 0.0323].
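Under the definition of $R(D)$, the first $n$ entries of this vector are the occurrence counts and the remaining $n^2$ entries are the rows of $W$ in order. The short sketch below splits the vector accordingly; it assumes only that layout and the keyword order Bank, Account, Fund, Transfer, and the variable names are chosen for the example.

```python
# Text feature vector of the example text, as given in the description.
V = [1, 2, 2, 2,
     0, 0.0995, 0.0705, 0.0459,
     0, 0.0200, 0.3848, 0.5668,
     0, 0.0227, 0.0204, 0.0884,
     0, 0.0345, 0.3627, 0.0323]
keywords = ["Bank", "Account", "Fund", "Transfer"]
n = len(keywords)

F, flat = V[:n], V[n:]                            # counts, then flattened W
W = [flat[r * n:(r + 1) * n] for r in range(n)]   # row r holds w(k_r, k_1..k_n)

for kw, row in zip(keywords, W):
    print(f"{kw:>8}: {row}")
```

Read this way, the column of weights for edges ending at Bank is entirely zero, which is consistent with Bank occurring once and before the other keywords, so that no directed edge ends at it.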
The invention has been described above in sufficient detail and with a certain degree of particularity. Those of ordinary skill in the art should understand that the description of the embodiments is merely exemplary, and that all changes made without departing from the true spirit and scope of the invention fall within its scope of protection. The scope of protection claimed by the invention is defined by the claims, not by the foregoing description of the embodiments.
Claims (5)
1. A text mode recognition method, characterized in that it comprises:
Scanning the original text file line by line and recording the number of occurrences and the positions of each keyword in the text;
Mapping the text, according to the recorded occurrence counts and positions of the keywords in the text, to a weighted directed graph with multiple edges, wherein each node in the weighted directed graph with multiple edges represents a keyword;
Simplifying the weighted directed graph with multiple edges to a simple weighted directed graph;
Representing the simple weighted directed graph as a matrix; and
Mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts,
Wherein the keyword set is $K=\{k_1, k_2, \ldots, k_n\}$, keyword $k_i$ occurs $f_i$ times in the text, and $F=[f_1, f_2, \ldots, f_n]$ represents the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1,
Each node in the weighted directed graph with multiple edges represents a keyword $k_i$; if keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$ in the text, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to the weighted directed graph with multiple edges, and the weight of $k_i k_j$ is the distance between $p_i$ and $p_j$; if $k_i$ and $k_j$ occur multiple times in the text, their occurrences at different positions are mapped in the same way to multiple edges in the weighted directed graph with multiple edges, where $1 \le j \le n$.
2. The text mode recognition method according to claim 1, characterized in that simplifying the weighted directed graph with multiple edges to a simple weighted directed graph comprises:
Using the node set of the weighted directed graph with multiple edges as the node set of the simple weighted directed graph;
Denoting the directed edge from node $k_i$ to node $k_j$ in the simple weighted directed graph by $k_i k_j$, the weight $w(k_i, k_j)$ of $k_i k_j$ is:
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in the weighted directed graph with multiple edges, and the edge-weight symbol denotes the weight of directed edge $e$ in the weighted directed graph with multiple edges;
3. The text mode recognition method according to claim 2, characterized in that the matrix $W$ representing the simple weighted directed graph is:
4. The text mode recognition method according to claim 3, characterized in that the text feature vector $R(D)$ to which the text is mapped is:
$R(D)=[f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)]$.
5. The text mode recognition method according to claim 4, characterized in that, given texts $D_1, \ldots, D_m$ with corresponding text feature vectors $R(D_1), \ldots, R(D_m)$,
The text mode recognition method further comprises:
Calculating the similarity between any two texts $D_x$ and $D_y$ using the following formula,
where $1 \le x, y \le m$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110367595.0A CN103123685B (en) | 2011-11-18 | 2011-11-18 | Text mode recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110367595.0A CN103123685B (en) | 2011-11-18 | 2011-11-18 | Text mode recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103123685A (en) | 2013-05-29 |
CN103123685B (en) | 2016-03-02 |
Family
ID=48454659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110367595.0A Expired - Fee Related CN103123685B (en) | 2011-11-18 | 2011-11-18 | Text mode recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103123685B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107622048B (en) * | 2017-09-06 | 2021-06-22 | 南京硅基智能科技有限公司 | Text mode recognition method and system |
CN108255797A (en) * | 2018-01-26 | 2018-07-06 | 上海康斐信息技术有限公司 | A kind of text mode recognition method and system |
US11410446B2 (en) | 2019-11-22 | 2022-08-09 | Nielsen Consumer Llc | Methods, systems, apparatus and articles of manufacture for receipt decoding |
US11810380B2 (en) | 2020-06-30 | 2023-11-07 | Nielsen Consumer Llc | Methods and apparatus to decode documents based on images using artificial intelligence |
CN111753919A (en) * | 2020-06-30 | 2020-10-09 | 江南大学 | Image design work plagiarism detection method based on countermeasure network |
US11822216B2 (en) | 2021-06-11 | 2023-11-21 | Nielsen Consumer Llc | Methods, systems, apparatus, and articles of manufacture for document scanning |
US11625930B2 (en) | 2021-06-30 | 2023-04-11 | Nielsen Consumer Llc | Methods, systems, articles of manufacture and apparatus to decode receipts based on neural graph architecture |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101620616A (en) * | 2009-05-07 | 2010-01-06 | 北京理工大学 | Chinese similar web page de-emphasis method based on microcosmic characteristic |
CN101694670A (en) * | 2009-10-20 | 2010-04-14 | 北京航空航天大学 | Chinese Web document online clustering method based on common substrings |
CN101944099A (en) * | 2010-06-24 | 2011-01-12 | 西北工业大学 | Method for automatically classifying text documents by utilizing body |
CN102033867A (en) * | 2010-12-14 | 2011-04-27 | 西北工业大学 | Semantic-similarity measuring method for XML (Extensible Markup Language) document classification |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004361987A (en) * | 2003-05-30 | 2004-12-24 | Seiko Epson Corp | Image retrieval system, image classification system, image retrieval program, image classification program, image retrieval method, and image classification method |
Also Published As
Publication number | Publication date |
---|---|
CN103123685A (en) | 2013-05-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103123685B (en) | Text mode recognition method | |
Czerniawski et al. | 6D DBSCAN-based segmentation of building point clouds for planar object classification | |
CN106599181B (en) | A kind of hot news detection method based on topic model | |
CN103279570B (en) | A kind of matrix weights negative mode method for digging of text-oriented data base | |
CN102419778B (en) | Information searching method for discovering and clustering sub-topics of query statement | |
CN101859320B (en) | Massive image retrieval method based on multi-characteristic signature | |
CN101739430B (en) | A kind of training method of the text emotion classifiers based on keyword and sorting technique | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
Yang et al. | An effective hybrid model for opinion mining and sentiment analysis | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN106940702A (en) | Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text | |
CN106156145A (en) | The management method of a kind of address date and device | |
CN104239513A (en) | Semantic retrieval method oriented to field data | |
Lee | Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams | |
CN101950284A (en) | Chinese word segmentation method and system | |
CN101833650A (en) | Video copy detection method based on contents | |
CN107545025A (en) | Database is inquired about using morphological criteria | |
CN110502616A (en) | A kind of method, equipment and the computer storage medium of determining garbage classification | |
CN105893573A (en) | Site-based multi-modal media data subject extraction model | |
CN114881742A (en) | Graph neural network recommendation method and system based on commodity knowledge graph | |
CN106339481A (en) | Chinese compound new-word discovery method based on maximum confidence coefficient | |
CN101639837A (en) | Method and system for automatically classifying objects | |
Shri et al. | Prediction of reusability of object oriented software systems using clustering approach | |
CN108427730A (en) | It is a kind of that method is recommended based on the Social Label of random walk and condition random field | |
CN112445976A (en) | City address positioning method based on congestion index map |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160302 Termination date: 20191118 |