CN102339286A - Method for automatically identifying Chinese names - Google Patents
- Publication number: CN102339286A
- Application number: CN2010102336536A
- Authority: CN (China)
- Prior art keywords: type, double word, individual character, probability, Chinese
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method for automatically identifying Chinese names. The method comprises the following steps: collecting statistics from (training on) text annotated with Chinese names, dividing a second-order model and a third-order model according to the positions at which Chinese characters occur, calculating the probabilities of the four types of distribution in each model, and obtaining the statistical regularities of Chinese names by Bayesian probability statistics; then computing probabilities over the Chinese text to be processed using a combined strategy of double characters and single characters, and comparing the probability of each combination to judge whether a Chinese name occurs. With this method, Chinese names of 2 to 4 characters can be identified reliably, and ambiguous segmentations are resolved well.
Description
Technical field
The present invention relates to a Chinese information retrieval method, and in particular to a method for recognizing Chinese personal names.
Background art
Personal names are the proper nouns most frequently encountered in daily life, and in information retrieval a name must be treated as a single unit to obtain accurate results. Take the name "Cao Guowei" as an example: if the retrieval system segments it into the three individual characters "Cao", "Guo" (state) and "Wei" (great), that is, if the Chinese name is not correctly identified, the system can return irrelevant results in which the three characters merely occur separately, for example a passage with "Guo" in the phrase "Chinese invention patent", "Cao" as the surname of an unrelated inventor, and "Wei" inside yet another unrelated phrase.
Automatic recognition of Chinese names faces several major difficulties:
First, Chinese names have an extremely large number of possible combinations, so a dictionary cannot be used directly for mechanical segmentation. On the one hand, it is difficult to build a dictionary that exhausts all Chinese names. On the other hand, building such a dictionary leads to conflicts. For example, if the name "Wang Junhu" is added to the dictionary, then in the sentence "Wang Jun hu tou hu nao" ("Wang Jun looks strong and good-natured"), "Wang Junhu" is wrongly identified as a name.
Second, Chinese names may have a single-character surname or a two-character surname, and may be two, three or four characters long.
Third, a Chinese name may combine ambiguously with the surrounding text, which hinders correct recognition. For example, in the phrase "Chen Xiaodong Beijing concert" the adjacent characters "dong" and "bei" form the word "dongbei" (northeast), so name recognition can easily produce the wrong segmentation "Chen Xiao / northeast".
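The first difficulty can be illustrated with a minimal sketch of mechanical dictionary matching. The Chinese characters below are an assumed reconstruction of the "Wang Jun" example, and `dictionary_names` is a hypothetical helper, not part of the described method.

```python
def dictionary_names(text, name_dictionary):
    """Return every dictionary entry that occurs as a substring of text."""
    return [name for name in name_dictionary if name in text]

# Assumed characters: the person is "王军" (Wang Jun); "虎头虎脑" is an idiom
# meaning strong and good-natured; "王军虎" (Wang Junhu) is a dictionary entry.
sentence = "王军虎头虎脑"
hits = dictionary_names(sentence, ["王军虎"])
print(hits)  # the dictionary entry matches although the real name is "王军"
```

The match is a false positive: the third character belongs to the idiom, not to the name, which is why a purely dictionary-based approach is rejected here.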
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic recognition method for Chinese names that identifies Chinese names with comparatively high accuracy.
To solve this technical problem, the automatic Chinese name recognition method of the present invention comprises the following steps:
Step 1: collect statistics from text annotated with Chinese names;
Step 2: recognize Chinese names in the text to be processed;
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, classify individual characters into the following four types, an individual character being a single Chinese character;
---H1 type: appears as the first character of a Chinese name;
---M1 type: appears in the middle of a Chinese name;
---T1 type: appears as the last character of a Chinese name;
---N1 type: appears outside any Chinese name;
Classify double words into the following four types, a double word being two consecutive Chinese characters;
---H2 type: appears as the first two characters of a Chinese name and is a two-character surname;
---HM2 type: appears as the first two characters of a Chinese name and is not a two-character surname;
---MT2 type: appears as the last two characters of a three-character name without a two-character surname, or of a four-character name with a two-character surname;
---N2 type: appears outside any Chinese name;
Step 1.2: in the text annotated with Chinese names:
Count the number of distinct individual characters of each of the four types H1, M1, T1 and N1, denoted nh1, nm1, nt1 and nn1 respectively;
Count the number of distinct double words of each of the four types H2, HM2, MT2 and N2, denoted nh2, nhm2, nmt2 and nn2 respectively;
For each individual character, count its total number of occurrences, denoted z1, and the number of times it occurs as each of the four types H1, M1, T1 and N1, denoted h1, m1, t1 and n1 respectively;
For each double word, count its total number of occurrences, denoted z2, and the number of times it occurs as each of the four types H2, HM2, MT2 and N2, denoted h2, hm2, mt2 and n2 respectively;
Step 1.3: in the text annotated with Chinese names:
Calculate, for each individual character S_i, the probability P(H1|S_i) that it belongs to the H1 type;
Calculate, for each individual character S_i, the probability P(M1|S_i) that it belongs to the M1 type;
Calculate, for each individual character S_i, the probability P(T1|S_i) that it belongs to the T1 type;
Calculate, for each individual character S_i, the probability P(N1|S_i) that it belongs to the N1 type;
Calculate, for each double word D_i, the probability P(H2|D_i) that it belongs to the H2 type;
Calculate, for each double word D_i, the probability P(HM2|D_i) that it belongs to the HM2 type;
Calculate, for each double word D_i, the probability P(MT2|D_i) that it belongs to the MT2 type;
Calculate, for each double word D_i, the probability P(N2|D_i) that it belongs to the N2 type;
Step 1.4: in the text annotated with Chinese names:
Calculate the probability of occurrence P(S_i|M1) of each individual character S_i within the M1 type;
Calculate the probability of occurrence P(D_i|HM2) of each double word D_i within the HM2 type;
Calculate the probability of occurrence P(D_i|MT2) of each double word D_i within the MT2 type;
Step 2 of the method specifically comprises the following steps:
Step 2.1: in the text to be processed, examine the double words one by one in order and determine whether each belongs to the H2 type or the HM2 type; P(H2|D_i) > 0 indicates that the double word D_i belongs to the H2 type, otherwise it does not; P(HM2|D_i) > first threshold indicates that the double word D_i belongs to the HM2 type, otherwise it does not; the first threshold lies in the range 0.13 to 0.22;
If the double word belongs to the H2 type or the HM2 type, set it as d1 and go to step 2.4;
If the double word belongs to neither the H2 type nor the HM2 type, split it into two individual characters and go to step 2.2;
Step 2.2: determine whether the first of the two individual characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the first individual character belongs to the H1 type, set it as d1 and go to step 2.4;
If the first individual character does not belong to the H1 type, go to step 2.3;
Step 2.3: determine whether the second of the two individual characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the second individual character belongs to the H1 type, set it as d1 and go to step 2.4;
If the second individual character does not belong to the H1 type, then neither the double word nor either of its two individual characters is a name or part of a name; return to step 2.1 and examine the next double word after this one;
Step 2.4: determine whether the double word following d1 belongs to the MT2 type; P(MT2|D_i) > second threshold indicates that the double word D_i belongs to the MT2 type; the second threshold lies in the range 0.13 to 0.22;
If the double word following d1 belongs to the MT2 type, set it as d2 and go to step 2.6;
If the double word following d1 does not belong to the MT2 type, split it into two individual characters, set them as d2 and d3 respectively, and go to step 2.5;
Step 2.5: determine whether the combination of d1, d2 and d3 is a Chinese name; the combination is a Chinese name if all of the following 5 inequalities hold:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When d3 is an individual character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);
When d3 is a double word D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2) and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and examine the next double word after d3;
If the combination of d1, d2 and d3 is judged not to be a Chinese name, go to step 2.6;
Step 2.6: determine whether the combination of d1 and d2 is a Chinese name; the combination is a Chinese name if both of the following 2 inequalities hold:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N)
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
If the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and examine the next double word after d2;
If the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and examine the next double word after d2.
The present invention first obtains the statistical regularities of Chinese names by Bayesian probability statistics, and then applies a combined strategy of double words and individual characters to compute probabilities over the text to be processed. It can therefore reliably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.
Description of drawings
Fig. 1 is a schematic flow chart of step 2 of the method of the invention.
Embodiment
The automatic Chinese name recognition method of the present invention comprises the following steps:
Step 1: collect statistics from text annotated with Chinese names;
Step 2: recognize Chinese names in the text to be processed.
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, individual characters (single Chinese characters) are divided by position into the following four types:
---H1 type: appears as the first character of a Chinese name. An H1-type individual character may be a single-character surname or the first character of a two-character surname.
---M1 type: appears in the middle of a Chinese name (neither the first nor the last character). An M1-type individual character may be the middle character of a three-character name, or the second or third character of a four-character name.
---T1 type: appears as the last character of a Chinese name.
---N1 type: appears outside any Chinese name.
Clearly, any individual character belongs to one or more of the four types H1, M1, T1 and N1.
In the text annotated with Chinese names, double words (two consecutive Chinese characters) are divided by position into the following four types:
---H2 type: appears as the first two characters of a Chinese name and is a two-character surname. An H2-type double word may be a component of a three-character or four-character name, but cannot be a component of a two-character name.
---HM2 type: appears as the first two characters of a Chinese name and is not a two-character surname. An HM2-type double word may be a whole two-character name, or the first two characters of a three-character or four-character name without a two-character surname.
---MT2 type: appears as the last two characters of a three-character name without a two-character surname, or of a four-character name with a two-character surname.
---N2 type: appears outside any Chinese name.
Clearly, any double word belongs to one or more of the four types H2, HM2, MT2 and N2.
With the above classification of individual characters and double words, a Chinese name may take the following forms:
---A two-character name can only be HM2, which also covers the H1+T1 combination.
---A three-character name is an H1+MT2 combination (which covers the H1+M1+T1 combination), an H2+T1 combination, or an HM2+T1 combination (which also covers the H1+M1+T1 combination).
---A four-character name is an H2+MT2 combination (which covers the H2+M1+T1 combination) or an HM2+MT2 combination (which covers the three cases HM2+M1+T1, H1+M1+M1+T1 and H1+M1+MT2).
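The position types of step 1.1 can be sketched as a labelling function applied to one annotated name. This is an illustrative sketch only: `label_name` and the `compound_surnames` set are hypothetical, the example characters are assumptions, and the MT2 rule follows the definition of the MT2 type exactly as stated in step 1.1.

```python
def label_name(name, compound_surnames):
    """Label one annotated name of 2 to 4 characters.

    Returns (singles, doubles), each a list of (text, type) pairs.
    """
    n = len(name)
    if not 2 <= n <= 4:
        raise ValueError("expected a name of 2 to 4 characters")
    # Individual characters: first is H1, last is T1, anything between is M1.
    singles = [(name[0], "H1")]
    singles += [(ch, "M1") for ch in name[1:-1]]
    singles.append((name[-1], "T1"))

    # H2 cannot be part of a two-character name, so require n >= 3.
    has_compound = n >= 3 and name[:2] in compound_surnames
    doubles = [(name[:2], "H2" if has_compound else "HM2")]
    # MT2 as defined in step 1.1: the last two characters of a three-character
    # name without a two-character surname, or of a four-character name with one.
    if (n == 3 and not has_compound) or (n == 4 and has_compound):
        doubles.append((name[-2:], "MT2"))
    return singles, doubles

compound = {"欧阳"}                      # assumed two-character surname list
print(label_name("曹国伟", compound))    # three characters, single-character surname
print(label_name("欧阳修", compound))    # three characters, two-character surname
```

Every character or character pair of the annotated text that lies outside a name would receive the N1 or N2 type instead; that part is omitted here.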
Step 1.2: in the text annotated with Chinese names:
Count the total number of individual characters (that is, the total number of Chinese characters), denoted ns1. Count the number of distinct individual characters, denoted nss1. Clearly nss1 ≤ ns1. Count the number of distinct individual characters of each of the four types H1, M1, T1 and N1, denoted nh1, nm1, nt1 and nn1 respectively. Because the same individual character may occur as different types, nh1+nm1+nt1+nn1 ≥ nss1.
Count the total number of double words, denoted ns2. For example, in the sentence "Wang Jun looks strong and good-natured", every pair of adjacent characters of the original Chinese sentence is a double word: "Wang Jun", "army tiger", "tiger head", "head tiger", "tiger brain" and "brain". Clearly ns2 = ns1 - 1. Count the number of distinct double words, denoted nss2. Clearly nss2 ≤ ns2. Count the number of distinct double words of each of the four types H2, HM2, MT2 and N2, denoted nh2, nhm2, nmt2 and nn2 respectively. Because the same double word may occur as different types, nh2+nhm2+nmt2+nn2 ≥ nss2.
For each individual character, count its total number of occurrences, denoted z1, and the number of times it occurs as each of the four types H1, M1, T1 and N1, denoted h1, m1, t1 and n1 respectively. Clearly, z1 = h1+m1+t1+n1.
For each double word, count its total number of occurrences, denoted z2, and the number of times it occurs as each of the four types H2, HM2, MT2 and N2, denoted h2, hm2, mt2 and n2 respectively. Clearly, z2 = h2+hm2+mt2+n2.
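The corpus-level counts of step 1.2 can be sketched as follows. The example sentence is an assumed reconstruction of the "Wang Jun" sentence, and `double_words` is a hypothetical helper name.

```python
def double_words(text):
    """Return every pair of adjacent characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

text = "王军虎头虎脑的"              # assumed example sentence (7 characters)
pairs = double_words(text)

ns1, nss1 = len(text), len(set(text))     # total and distinct individual characters
ns2, nss2 = len(pairs), len(set(pairs))   # total and distinct double words
print(pairs, ns1, nss1, ns2, nss2)
```

Because the double words overlap by one character, a text of ns1 characters always yields ns1 - 1 double words, as stated above.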
Step 1.3: in the text annotated with Chinese names:
For each individual character S_i, calculate the probability that it belongs to the H1 type: P(H1|S_i) = h1/z1.
For each individual character S_i, calculate the probability that it belongs to the M1 type: P(M1|S_i) = m1/z1.
For each individual character S_i, calculate the probability that it belongs to the T1 type: P(T1|S_i) = t1/z1.
For each individual character S_i, calculate the probability that it belongs to the N1 type: P(N1|S_i) = n1/z1.
Clearly P(H1|S_i) + P(M1|S_i) + P(T1|S_i) + P(N1|S_i) = 1.
For each double word D_i, calculate the probability that it belongs to the H2 type: P(H2|D_i) = h2/z2.
For each double word D_i, calculate the probability that it belongs to the HM2 type: P(HM2|D_i) = hm2/z2.
For each double word D_i, calculate the probability that it belongs to the MT2 type: P(MT2|D_i) = mt2/z2.
For each double word D_i, calculate the probability that it belongs to the N2 type: P(N2|D_i) = n2/z2.
Clearly P(H2|D_i) + P(HM2|D_i) + P(MT2|D_i) + P(N2|D_i) = 1.
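A minimal sketch of the step 1.3 computation follows. It assumes a plain relative-frequency estimate, P(type | unit) = count of the unit as that type / total count of the unit, which is consistent with the stated property that the four probabilities sum to 1; the function name and the toy observations are hypothetical.

```python
from collections import Counter, defaultdict

def type_probabilities(observations):
    """observations: iterable of (unit, type) pairs from the annotated text.

    Returns {unit: {type: count / total}} for individual characters or
    double words alike.
    """
    counts = defaultdict(Counter)
    for unit, unit_type in observations:
        counts[unit][unit_type] += 1
    probs = {}
    for unit, by_type in counts.items():
        total = sum(by_type.values())      # z1 (or z2 for double words)
        probs[unit] = {t: c / total for t, c in by_type.items()}
    return probs

# Toy observations, invented for illustration only.
obs = [("王", "H1"), ("王", "H1"), ("王", "N1"),
       ("国", "M1"), ("国", "N1"), ("国", "N1"), ("国", "N1")]
p = type_probabilities(obs)
print(p["国"])
```

A type that never occurs for a unit is simply absent from its dictionary, so a lookup such as `p[unit].get("H1", 0.0)` gives the zero probability used by the "> 0" tests of step 2.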
Step 1.4: in the text annotated with Chinese names:
Referring to Fig. 1, step 2 of the method specifically comprises the following steps:
Step 2.1: in the text to be processed, determine whether each double word belongs to the H2 type or the HM2 type, for example by examining the double words one by one in order of occurrence. For a double word D_i, P(H2|D_i) > 0 indicates that D_i belongs to the H2 type, otherwise it does not. P(HM2|D_i) > first threshold indicates that D_i belongs to the HM2 type, otherwise it does not. The first threshold lies in the range 0.13 to 0.22, for example 0.2.
If the double word belongs to the H2 type or the HM2 type, set it as d1 and go to step 2.4.
If the double word belongs to neither the H2 type nor the HM2 type, split it into two individual characters and go to step 2.2.
Step 2.2: determine whether the first of the two individual characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type.
If the first individual character belongs to the H1 type, set it as d1 and go to step 2.4.
If the first individual character does not belong to the H1 type, go to step 2.3.
Step 2.3: determine whether the second of the two individual characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type.
If the second individual character belongs to the H1 type, set it as d1 and go to step 2.4.
If the second individual character does not belong to the H1 type, then neither the double word nor either of its two individual characters is a name or part of a name; return to step 2.1 and examine the next double word after this one (for example in order of occurrence).
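Steps 2.1 to 2.3 amount to choosing the candidate name head d1 from one double word. The sketch below assumes a hypothetical lookup `p_type(unit, type)` returning P(type | unit) from the statistics of step 1; the table values are invented for illustration.

```python
def choose_d1(pair, p_type, first_threshold=0.2):
    """Return the candidate head d1 for a double word, or None if there is none."""
    # Step 2.1: the whole double word is an H2 or HM2 candidate.
    if p_type(pair, "H2") > 0 or p_type(pair, "HM2") > first_threshold:
        return pair
    first, second = pair          # split into two individual characters
    # Step 2.2: the first character can start a name.
    if p_type(first, "H1") > 0:
        return first
    # Step 2.3: the second character can start a name.
    if p_type(second, "H1") > 0:
        return second
    return None                   # move on to the next double word

# Invented probabilities, for illustration only.
table = {("欧阳", "H2"): 0.9, ("张", "H1"): 0.5, ("张明", "HM2"): 0.1}
lookup = lambda unit, unit_type: table.get((unit, unit_type), 0.0)

print(choose_d1("欧阳", lookup), choose_d1("张明", lookup), choose_d1("的了", lookup))
```

With the default threshold of 0.2, "张明" fails the HM2 test (0.1) and falls through to step 2.2, where its first character is accepted as d1.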
Step 2.4: determine whether the double word following d1 belongs to the MT2 type. P(MT2|D_i) > second threshold indicates that the double word D_i belongs to the MT2 type. The second threshold lies in the range 0.13 to 0.22, for example 0.2.
If the double word following d1 belongs to the MT2 type, set it as d2 and go to step 2.6.
If the double word following d1 does not belong to the MT2 type, split it into two individual characters, set them as d2 and d3 respectively, and go to step 2.5.
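Step 2.4 can be sketched as a small branching function. `p_mt2` is a hypothetical lookup returning P(MT2 | double word), and the returned step label is only a convenience of this sketch.

```python
def choose_d2_d3(following_pair, p_mt2, second_threshold=0.2):
    """Return (d2, d3, next_step) for the double word that follows d1."""
    if p_mt2(following_pair) > second_threshold:
        # The pair is kept whole as d2; there is no d3.
        return following_pair, None, "2.6"
    # Otherwise the pair is split into two individual characters.
    return following_pair[0], following_pair[1], "2.5"

# Invented probabilities, for illustration only.
mt2_table = {"国伟": 0.6}
p_mt2 = lambda pair: mt2_table.get(pair, 0.0)

print(choose_d2_d3("国伟", p_mt2))
print(choose_d2_d3("明无", p_mt2))
```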
Step 2.5: determine whether the combination of d1, d2 and d3 is a Chinese name. The combination is a Chinese name if all of the following 5 inequalities hold:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When d3 is an individual character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);
When d3 is a double word D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2) and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and examine the next double word after d3.
If the combination of d1, d2 and d3 is judged not to be a Chinese name, go to step 2.6.
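The five inequalities of step 2.5 compare the head/middle/tail reading of d1, d2, d3 against five competing readings. A minimal sketch, assuming each argument is a dictionary from the role letters H, M, T, N to the corresponding probability (the function name and the numbers are invented for illustration):

```python
def is_three_part_name(p1, p2, p3):
    """p1 needs keys 'H','N'; p2 needs 'H','M','T','N'; p3 needs 'H','T','N'."""
    name = p1["H"] * p2["M"] * p3["T"]
    rivals = [
        p1["N"] * p2["N"] * p3["N"],   # none of the three is a name element
        p1["N"] * p2["N"] * p3["H"],   # a name only starts at d3
        p1["N"] * p2["H"] * p3["T"],   # the name is d2 + d3
        p1["H"] * p2["T"] * p3["N"],   # the name is d1 + d2
        p1["H"] * p2["T"] * p3["H"],   # d1 + d2 is a name and d3 starts another
    ]
    return all(name > rival for rival in rivals)

# Invented probabilities, for illustration only.
likely = is_three_part_name({"H": 0.8, "N": 0.2},
                            {"H": 0.1, "M": 0.6, "T": 0.2, "N": 0.1},
                            {"H": 0.05, "T": 0.7, "N": 0.25})
unlikely = is_three_part_name({"H": 0.01, "N": 0.99},
                              {"H": 0.0, "M": 0.0, "T": 0.0, "N": 1.0},
                              {"H": 0.0, "T": 0.0, "N": 1.0})
print(likely, unlikely)
```

The comments on the rival products are an interpretation of each inequality, not wording taken from the text above.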
Step 2.6: determine whether the combination of d1 and d2 is a Chinese name. The combination is a Chinese name if both of the following 2 inequalities hold:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N). This inequality means that the probability that d1 is the leading name element and d2 the trailing name element is greater than the probability that neither d1 nor d2 is a name element. For example, when d1 is "Liu" and d2 is "Xiang", both are name elements and the probability that they occur one after the other is high. As another example, when d1 is "building" and d2 is "room", the probability that neither is a name element is high.
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H). This inequality means that the probability that d1 is the leading name element and d2 the trailing name element is greater than the probability that d1 is not a name element and d2 is a leading name element. For example, take a sentence in which the character "fei" (fee) is immediately followed by the name "Zhang Ming". When d1 is "fei" and d2 is "Zhang", the inequality is not satisfied. When d1 is "Zhang" and d2 is "Ming", the inequality is satisfied.
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
If the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and examine the next double word after d2.
If the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and examine the next double word after d2 (for example in order of occurrence).
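The two inequalities of step 2.6 can be sketched in the same style, using the "fei / Zhang / Ming" example. The probability values are invented for illustration and the function name is hypothetical.

```python
def is_two_part_name(p1, p2):
    """p1 needs keys 'H','N'; p2 needs keys 'H','T','N'."""
    name = p1["H"] * p2["T"]
    return name > p1["N"] * p2["N"] and name > p1["N"] * p2["H"]

# Invented probabilities, for illustration only.
fei = {"H": 0.01, "T": 0.01, "N": 0.98}     # "fee": rarely part of a name
zhang = {"H": 0.70, "T": 0.05, "N": 0.25}   # common surname
ming = {"H": 0.02, "T": 0.50, "N": 0.30}    # common final character of a name

print(is_two_part_name(fei, zhang))   # d1 = "fei", d2 = "Zhang"
print(is_two_part_name(zhang, ming))  # d1 = "Zhang", d2 = "Ming"
```

With these numbers the first pair is rejected because "Zhang" is far more likely to start a name than to end one, while the second pair passes both tests.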
Once every double word in the text to be processed has been examined according to step 2.1 (for example in order of occurrence), the recognition of Chinese names in that text is complete. The present invention can reliably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.
Claims (2)
1. An automatic recognition method for Chinese names, characterized in that it comprises the following steps:
Step 1: collect statistics from text annotated with Chinese names;
Step 2: recognize Chinese names in the text to be processed;
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, classify individual characters into the following four types, an individual character being a single Chinese character;
---H1 type: appears as the first character of a Chinese name;
---M1 type: appears in the middle of a Chinese name;
---T1 type: appears as the last character of a Chinese name;
---N1 type: appears outside any Chinese name;
Classify double words into the following four types, a double word being two consecutive Chinese characters;
---H2 type: appears as the first two characters of a Chinese name and is a two-character surname;
---HM2 type: appears as the first two characters of a Chinese name and is not a two-character surname;
---MT2 type: appears as the last two characters of a three-character name without a two-character surname, or of a four-character name with a two-character surname;
---N2 type: appears outside any Chinese name;
Step 1.2: in the text annotated with Chinese names:
Count the number of distinct individual characters of each of the four types H1, M1, T1 and N1, denoted nh1, nm1, nt1 and nn1 respectively;
Count the number of distinct double words of each of the four types H2, HM2, MT2 and N2, denoted nh2, nhm2, nmt2 and nn2 respectively;
For each individual character, count its total number of occurrences, denoted z1, and the number of times it occurs as each of the four types H1, M1, T1 and N1, denoted h1, m1, t1 and n1 respectively;
For each double word, count its total number of occurrences, denoted z2, and the number of times it occurs as each of the four types H2, HM2, MT2 and N2, denoted h2, hm2, mt2 and n2 respectively;
Step 1.3: in the text annotated with Chinese names:
Calculate, for each individual character S_i, the probability P(H1|S_i) that it belongs to the H1 type;
Calculate, for each individual character S_i, the probability P(M1|S_i) that it belongs to the M1 type;
Calculate, for each individual character S_i, the probability P(T1|S_i) that it belongs to the T1 type;
Calculate, for each individual character S_i, the probability P(N1|S_i) that it belongs to the N1 type;
Calculate, for each double word D_i, the probability P(H2|D_i) that it belongs to the H2 type;
Calculate, for each double word D_i, the probability P(HM2|D_i) that it belongs to the HM2 type;
Calculate, for each double word D_i, the probability P(MT2|D_i) that it belongs to the MT2 type;
Calculate, for each double word D_i, the probability P(N2|D_i) that it belongs to the N2 type;
Step 1.4: in the text annotated with Chinese names:
Calculate the probability of occurrence P(D_i|H2) of each double word D_i within the H2 type;
Calculate the probability of occurrence P(D_i|HM2) of each double word D_i within the HM2 type;
Calculate the probability of occurrence P(D_i|MT2) of each double word D_i within the MT2 type;
Step 2 of the method specifically comprises the following steps:
Step 2.1: in the text to be processed, examine the double words one by one in order and determine whether each belongs to the H2 type or the HM2 type; P(H2|D_i) > 0 indicates that the double word D_i belongs to the H2 type, otherwise it does not; P(HM2|D_i) > first threshold indicates that the double word D_i belongs to the HM2 type, otherwise it does not; the first threshold lies in the range 0.13 to 0.22;
If the double word belongs to the H2 type or the HM2 type, set it as d1 and go to step 2.4;
If the double word belongs to neither the H2 type nor the HM2 type, split it into two individual characters and go to step 2.2;
Step 2.2: determine whether the first of the two individual characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the first individual character belongs to the H1 type, set it as d1 and go to step 2.4;
If the first individual character does not belong to the H1 type, go to step 2.3;
Step 2.3: determine whether the second of the two individual characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the second individual character belongs to the H1 type, set it as d1 and go to step 2.4;
If the second individual character does not belong to the H1 type, then neither the double word nor either of its two individual characters is a name or part of a name; return to step 2.1 and examine the next double word after this one;
Step 2.4: determine whether the double word following d1 belongs to the MT2 type; P(MT2|D_i) > second threshold indicates that the double word D_i belongs to the MT2 type; the second threshold lies in the range 0.13 to 0.22;
If the double word following d1 belongs to the MT2 type, set it as d2 and go to step 2.6;
If the double word following d1 does not belong to the MT2 type, split it into two individual characters, set them as d2 and d3 respectively, and go to step 2.5;
Step 2.5: determine whether the combination of d1, d2 and d3 is a Chinese name; the combination is a Chinese name if all of the following 5 inequalities hold:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When d3 is an individual character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);
When d3 is a double word D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2) and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and examine the next double word after d3;
If the combination of d1, d2 and d3 is judged not to be a Chinese name, go to step 2.6;
In the 2.6th step, judge whether the combination of d1 and d2 is Chinese name; Satisfy following 2 formula simultaneously and then represent the Chinese name of being combined as of d1 and d2:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N)
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When the combination of d1 and d2 is judged to be a Chinese name, this Chinese name is recorded, and the method returns to step 2.1, taking the next double word after d2 for judgment;
When the combination of d1 and d2 is judged not to be a Chinese name, the method returns to step 2.1, taking the next double word after d2 for judgment.
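The two-part test of step 2.6 can be sketched in the same way. Again this is only an illustration under stated assumptions: `is_two_part_name`, `TOY` and `toy_p` are hypothetical names, the probabilities are invented, and the lookup function is passed in as a parameter so that the comparison logic stays independent of how the probability tables are stored.

```python
from typing import Callable

# Role assignments on the right-hand side of the two formulas.
ALTERNATIVES_2 = [("N", "N"), ("N", "H")]


def is_two_part_name(d1: str, d2: str, p: Callable[[str, str], float]) -> bool:
    """True if the H-T reading of (d1, d2) beats both alternative readings.

    `p(token, role)` is assumed to return P(token | role) using the table
    that matches the token length (H1/T1/N1 or H2/MT2/N2).
    """
    name_score = p(d1, "H") * p(d2, "T")
    return all(name_score > p(d1, a) * p(d2, b) for a, b in ALTERNATIVES_2)


# Invented toy probabilities keyed by (token, role).
TOY = {
    ("李", "H"): 0.40, ("李", "N"): 0.02,
    ("华", "T"): 0.30, ("华", "N"): 0.05, ("华", "H"): 0.01,
}


def toy_p(token: str, role: str) -> float:
    # Unseen (token, role) pairs get 0.0; this default is an assumption.
    return TOY.get((token, role), 0.0)
```

Here `is_two_part_name("李", "华", toy_p)` holds, whereas swapping the two tokens does not, because the H-T product then drops to zero.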
2. The method for automatically identifying Chinese names according to claim 1, characterized in that said first threshold is 0.2 and said second threshold is 0.2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102336536A CN102339286A (en) | 2010-07-22 | 2010-07-22 | Method for automatically identifying Chinese names |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102336536A CN102339286A (en) | 2010-07-22 | 2010-07-22 | Method for automatically identifying Chinese names |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102339286A true CN102339286A (en) | 2012-02-01 |
Family
ID=45515023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010102336536A Pending CN102339286A (en) | 2010-07-22 | 2010-07-22 | Method for automatically identifying Chinese names |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102339286A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105988991A (en) * | 2015-02-26 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Surname language recognition method and device, as well as server |
CN105988989A (en) * | 2015-02-26 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Chinese surname recognition method and device, as well as server |
CN106354713A (en) * | 2016-08-29 | 2017-01-25 | 达而观信息科技(上海)有限公司 | Method for automatically identifying Chinese name |
CN109344233A (en) * | 2018-08-28 | 2019-02-15 | 昆明理工大学 | A method of Chinese name recognition |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105988991A (en) * | 2015-02-26 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Surname language recognition method and device, as well as server |
CN105988989A (en) * | 2015-02-26 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Chinese surname recognition method and device, as well as server |
CN105988991B (en) * | 2015-02-26 | 2019-01-18 | 阿里巴巴集团控股有限公司 | A kind of recognition methods, device and the server of the affiliated languages of surname |
CN105988989B (en) * | 2015-02-26 | 2019-02-15 | 阿里巴巴集团控股有限公司 | A kind of recognition methods, device and the server of Chinese surname |
CN106354713A (en) * | 2016-08-29 | 2017-01-25 | 达而观信息科技(上海)有限公司 | Method for automatically identifying Chinese name |
CN109344233A (en) * | 2018-08-28 | 2019-02-15 | 昆明理工大学 | A method of Chinese name recognition |
CN109344233B (en) * | 2018-08-28 | 2022-07-19 | 昆明理工大学 | A method of Chinese name recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hinrichs et al. | Recent changes in the function and frequency of Standard English genitive constructions: A multivariate analysis of tagged corpora | |
Levallois | Umigon: sentiment analysis for tweets based on lexicons and heuristics | |
CN101071418B (en) | Chat method and system | |
CN104572625A (en) | Recognition method of named entity | |
WO2002050662A3 (en) | Apparatus and method of video program classification based on syntax of transcript information | |
CN102339286A (en) | Method for automatically identifying Chinese names | |
CN103885934A (en) | Method for automatically extracting key phrases of patent documents | |
CN108920513A (en) | A kind of multimedia data processing method, device and electronic equipment | |
CN105912629A (en) | Intelligent question and answer method and device | |
CN107748745B (en) | Enterprise name keyword extraction method | |
CA2564760A1 (en) | Speech analysis using statistical learning | |
Martin et al. | The 2009 NIST Language Recognition Evaluation. | |
CN110929520A (en) | Non-named entity object extraction method and device, electronic equipment and storage medium | |
CN103869998A (en) | Method and device for sorting candidate items generated by input method | |
JP6759917B2 (en) | Sentence generator and sentence generation method | |
Idsardi | A simple proof that Optimality Theory is computationally intractable | |
Leão et al. | Evolutionary patterns in the geographic range size of Atlantic Forest plants | |
Tranter | Who really spoke when? Finding speaker turns and identities in broadcast news audio | |
CN110019958A (en) | A kind of generation method, device and the terminal device of films and television programs label | |
Westerlund | Testing for unit roots in panel time‐series models with multiple level breaks | |
CN111027322A (en) | Sentiment dictionary-based sentiment analysis method for fine-grained entities in financial news | |
CN100426376C (en) | Estimating and detecting method and system for telephone continuous speech recognition system performance | |
Martínez-Hinarejos et al. | Statistical framework for a spanish spoken dialogue corpus | |
EP3800600A1 (en) | Detection of a topic | |
CN113609279B (en) | Material model extraction method and device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120201 |