CN102339286A - Method for automatically identifying Chinese names - Google Patents

Info

Publication number
CN102339286A
Authority
CN
China
Prior art keywords
type
double word
individual character
probability
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102336536A
Other languages
Chinese (zh)
Inventor
陈运文
马飞涛
宋海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengle Information Technolpogy Shanghai Co Ltd
Original Assignee
Shengle Information Technolpogy Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technolpogy Shanghai Co Ltd filed Critical Shengle Information Technolpogy Shanghai Co Ltd
Priority to CN2010102336536A priority Critical patent/CN102339286A/en
Publication of CN102339286A publication Critical patent/CN102339286A/en
Pending legal-status Critical Current

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for automatically identifying Chinese names. The method comprises the following steps: gathering statistics from training text in which Chinese names are annotated, partitioning a second-order model and a third-order model according to the positions at which Chinese characters occur, calculating the probabilities of the four types of distribution in each model, and obtaining the statistical regularities of Chinese names by a Bayesian probability method; then performing probability calculation on the Chinese text to be processed using a strategy that combines double characters and single characters, and comparing the probabilities of the candidate combinations to judge whether a Chinese name occurs. With this method, Chinese names of 2 to 4 characters can be identified stably, and ambiguous segmentations are resolved well.

Description

Method for automatically identifying Chinese names
Technical field
The present invention relates to a Chinese information retrieval method, and in particular to a method for recognizing Chinese names.
Background
Personal names are among the proper nouns encountered most often in daily life, and in information retrieval a name must be treated as a single unit to obtain accurate results. Take the name "Cao Guowei" as an example: if the retrieval system segments it into the three single characters "Cao", "Guo" (state) and "Wei" (great), that is, the Chinese name is not correctly identified, the system can return wrong results, such as an invention-patent record in which the three characters merely occur separately, scattered across an unrelated word and the names of other inventors.
Automatic recognition of Chinese names faces several major difficulties:
First, Chinese names can be combined in extremely many ways, so mechanical segmentation directly against a dictionary is not feasible. On the one hand, it is difficult to build a dictionary that exhausts all Chinese names. On the other hand, building such a dictionary introduces conflicts. For example, if the name "Wang Junhu" is added to the dictionary, then in the sentence "Wang Jun is looking strong and good-natured", whose first three characters in Chinese coincide with "Wang Junhu", the string "Wang Junhu" is wrongly identified as a name.
Second, Chinese names may have a single-character surname or a two-character (compound) surname, and may be two, three or four characters long.
Third, a Chinese name may form an ambiguous combination with the text before or after it, which hinders correct identification. For example, in the phrase rendered in translation as "the capital concert of Chen Xiao northeast", the characters at the boundary between the name and the following text form the word "northeast", so name recognition easily mis-segments the phrase as "Chen Xiao / northeast".
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for automatically identifying Chinese names that identifies them with comparatively high accuracy.
To solve this technical problem, the method of the present invention comprises the following steps:
Step 1: gather statistics from text in which Chinese names are annotated;
Step 2: identify Chinese names in the text to be processed;
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, single characters are divided into the following four types, a single character being one Chinese character;
---H1 type: appears as the first character of a Chinese name;
---M1 type: appears in the middle of a Chinese name;
---T1 type: appears as the last character of a Chinese name;
---N1 type: appears outside any Chinese name;
Double characters are divided into the following four types, a double character being two consecutive Chinese characters;
---H2 type: appears as the first two characters of a Chinese name and is a two-character (compound) surname;
---HM2 type: appears as the first two characters of a Chinese name and is not a compound surname;
---MT2 type: appears as the last two characters of a three-character name without a compound surname, or of a four-character name with a compound surname;
---N2 type: appears outside any Chinese name;
Step 1.2: in the text annotated with Chinese names:
Count the number of distinct single characters of each of the four types H1, M1, T1, N1, denoted nh1, nm1, nt1, nn1 respectively;
Count the number of distinct double characters of each of the four types H2, HM2, MT2, N2, denoted nh2, nhm2, nmt2, nn2 respectively;
Count the total number of occurrences of each single character, denoted z1; count the number of times each single character appears as type H1, M1, T1, N1, denoted h1, m1, t1, n1 respectively;
Count the total number of occurrences of each double character, denoted z2; count the number of times each double character appears as type H2, HM2, MT2, N2, denoted h2, hm2, mt2, n2 respectively;
Step 1.3: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i: $P(S_i) = \frac{z1}{ns1}$, where ns1 is the total number of single characters in the text;
Calculate the occurrence probability of each double character D_i: $P(D_i) = \frac{z2}{ns2}$, where ns2 is the total number of double characters in the text;
Calculate the probability that each single character S_i belongs to the H1 type: $P(H1 \mid S_i) = \frac{h1}{z1}$;
Calculate the probability that each single character S_i belongs to the M1 type: $P(M1 \mid S_i) = \frac{m1}{z1}$;
Calculate the probability that each single character S_i belongs to the T1 type: $P(T1 \mid S_i) = \frac{t1}{z1}$;
Calculate the probability that each single character S_i belongs to the N1 type: $P(N1 \mid S_i) = \frac{n1}{z1}$;
Calculate the probability that each double character D_i belongs to the H2 type: $P(H2 \mid D_i) = \frac{h2}{z2}$;
Calculate the probability that each double character D_i belongs to the HM2 type: $P(HM2 \mid D_i) = \frac{hm2}{z2}$;
Calculate the probability that each double character D_i belongs to the MT2 type: $P(MT2 \mid D_i) = \frac{mt2}{z2}$;
Calculate the probability that each double character D_i belongs to the N2 type: $P(N2 \mid D_i) = \frac{n2}{z2}$;
Step 1.4: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i within the H1 type: $P(S_i \mid H1) = \frac{P(H1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nh1} P(H1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the M1 type: $P(S_i \mid M1) = \frac{P(M1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nm1} P(M1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the T1 type: $P(S_i \mid T1) = \frac{P(T1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nt1} P(T1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the N1 type: $P(S_i \mid N1) = \frac{P(N1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nn1} P(N1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each double character D_i within the H2 type: $P(D_i \mid H2) = \frac{P(H2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the HM2 type: $P(D_i \mid HM2) = \frac{P(HM2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the MT2 type: $P(D_i \mid MT2) = \frac{P(MT2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the N2 type: $P(D_i \mid N2) = \frac{P(N2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 \mid D_i) \times P(D_i)}$;
Step 2 of the method specifically comprises the following steps:
Step 2.1: in the text to be processed, judge in order whether each double character belongs to the H2 type or the HM2 type; P(H2|D_i) > 0 means that the double character D_i belongs to the H2 type, otherwise it does not; P(HM2|D_i) > first threshold means that D_i belongs to the HM2 type, otherwise it does not; the first threshold takes a value in the range 0.13 to 0.22;
If the double character belongs to the H2 type or the HM2 type, set it as d1 and go to step 2.4;
If the double character belongs to neither the H2 type nor the HM2 type, split it into two single characters and go to step 2.2;
Step 2.2: judge whether the first of the two single characters belongs to the H1 type; P(H1|S_i) > 0 means that the single character S_i belongs to the H1 type;
If the first single character belongs to the H1 type, set it as d1 and go to step 2.4;
If the first single character does not belong to the H1 type, go to step 2.3;
Step 2.3: judge whether the second of the two single characters belongs to the H1 type; P(H1|S_i) > 0 means that the single character S_i belongs to the H1 type;
If the second single character belongs to the H1 type, set it as d1 and go to step 2.4;
If the second single character does not belong to the H1 type, then neither the double character nor either of its two single characters is a name or part of a name; return to step 2.1 and judge the next double character after this one;
Step 2.4: judge whether the double character following d1 belongs to the MT2 type; P(MT2|D_i) > second threshold means that the double character D_i belongs to the MT2 type; the second threshold takes a value in the range 0.13 to 0.22;
If the double character following d1 belongs to the MT2 type, set it as d2 and go to step 2.6;
If the double character following d1 does not belong to the MT2 type, split it into two single characters, set them as d2 and d3 respectively, and go to step 2.5;
Step 2.5: judge whether the combination of d1, d2 and d3 is a Chinese name; the combination is a Chinese name if all of the following five inequalities hold:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is a single character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double character D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double character D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When d3 is a single character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);
When d3 is a double character D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2) and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d3;
If the combination of d1, d2 and d3 is judged not to be a Chinese name, go to step 2.6;
Step 2.6: judge whether the combination of d1 and d2 is a Chinese name; the combination is a Chinese name if both of the following inequalities hold:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N)
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H)
When d1 is a single character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double character D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double character D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
If the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d2;
If the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and judge the next double character after d2.
The present invention first obtains the statistical regularities of Chinese names by Bayesian probability statistics, and then performs probability calculation on the text to be processed using a strategy that combines double characters and single characters. It can thereby stably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.
Description of drawings
Fig. 1 is a schematic flowchart of step 2 of the method of the present invention.
Embodiment
The method of the present invention for automatically identifying Chinese names comprises the following steps:
Step 1: gather statistics from text in which Chinese names are annotated;
Step 2: identify Chinese names in the text to be processed.
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, single characters (single Chinese characters) are divided by position into the following four types:
---H1 type: appears as the first character of a Chinese name. A single character of the H1 type may be a single-character surname or the first character of a compound surname.
---M1 type: appears in the middle of a Chinese name (neither the first character nor the last). A single character of the M1 type may be the middle character of a three-character name, or the second or third character of a four-character name.
---T1 type: appears as the last character of a Chinese name.
---N1 type: appears outside any Chinese name.
Clearly, any single character belongs to one or more of the four types H1, M1, T1, N1.
In the text annotated with Chinese names, double characters (two consecutive Chinese characters) are divided by position into the following four types:
---H2 type: appears as the first two characters of a Chinese name and is a two-character (compound) surname. A double character of the H2 type may be part of a three-character or four-character name, but cannot be part of a two-character name.
---HM2 type: appears as the first two characters of a Chinese name and is not a compound surname. A double character of the HM2 type may be a two-character name, or the first two characters of a three-character or four-character name without a compound surname.
---MT2 type: appears as the last two characters of a three-character name without a compound surname, or of a four-character name with a compound surname.
---N2 type: appears outside any Chinese name.
Clearly, any double character belongs to one or more of the four types H2, HM2, MT2, N2.
With single and double characters typed in this way, a Chinese name falls into one of the following cases:
---A two-character name can only be the HM2 case, which also covers the H1+T1 combination.
---A three-character name is an H1+MT2 combination (which covers the H1+M1+T1 combination), an H2+T1 combination, or an HM2+T1 combination (which also covers the H1+M1+T1 combination).
---A four-character name is an H2+MT2 combination (which covers the H2+M1+T1 combination) or an HM2+MT2 combination (which covers the three cases HM2+M1+T1, H1+M1+M1+T1 and H1+M1+MT2).
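The position-based typing of step 1.1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `label_name`, the `compound_surname` flag and the sample names are hypothetical and not part of the method as described, and the double-character labels follow the H2, HM2 and MT2 definitions exactly as worded above.

```python
def label_name(name, compound_surname=False):
    """Return (single_labels, double_labels) for one annotated name of 2-4 characters.

    Hypothetical helper: `compound_surname` says whether the first two
    characters form a two-character surname.
    """
    n = len(name)
    if not 2 <= n <= 4:
        raise ValueError("this sketch only handles names of 2 to 4 characters")
    if compound_surname and n == 2:
        raise ValueError("a compound surname cannot be part of a two-character name")

    # Single characters: first -> H1, last -> T1, anything in between -> M1.
    singles = [(name[0], "H1")]
    singles += [(ch, "M1") for ch in name[1:-1]]
    singles.append((name[-1], "T1"))

    # First two characters: H2 for a compound surname, otherwise HM2.
    doubles = [(name[:2], "H2" if compound_surname else "HM2")]
    # Last two characters are MT2 only in the two cases named in the definition.
    if (n == 3 and not compound_surname) or (n == 4 and compound_surname):
        doubles.append((name[-2:], "MT2"))
    return singles, doubles


# Sample names chosen for illustration only.
print(label_name("王军虎"))
print(label_name("欧阳修", compound_surname=True))
```

Characters and character pairs outside any annotated name would receive the N1 and N2 labels; the sketch leaves them out because they need the surrounding text rather than the name itself.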
Step 1.2: in the text annotated with Chinese names:
Count the total number of single characters (that is, the total number of Chinese characters), denoted ns1. Count the number of distinct single characters, denoted nss1. Clearly nss1 ≤ ns1. Count the number of distinct single characters of each of the four types H1, M1, T1, N1, denoted nh1, nm1, nt1, nn1 respectively. Because the same single character may appear as different types, nh1+nm1+nt1+nn1 ≥ nss1.
Count the total number of double characters, denoted ns2. For example, in the sentence "Wang Jun is looking strong and good-natured", the adjacent-character pairs, rendered in translation as "Wang Jun", "army tiger", "tiger head", "head tiger", "brave brain" and "brain", are all double characters; clearly ns2 = ns1 − 1. Count the number of distinct double characters, denoted nss2. Clearly nss2 ≤ ns2. Count the number of distinct double characters of each of the four types H2, HM2, MT2, N2, denoted nh2, nhm2, nmt2, nn2 respectively. Because the same double character may appear as different types, nh2+nhm2+nmt2+nn2 ≥ nss2.
Count the total number of occurrences of each single character, denoted z1. Count the number of times each single character appears as type H1, M1, T1, N1, denoted h1, m1, t1, n1 respectively. Clearly z1 = h1+m1+t1+n1.
Count the total number of occurrences of each double character, denoted z2. Count the number of times each double character appears as type H2, HM2, MT2, N2, denoted h2, hm2, mt2, n2 respectively. Clearly z2 = h2+hm2+mt2+n2.
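The counting in step 1.2 can be sketched in Python as follows. This is a minimal sketch, not the implementation of the method: the corpus representation (a list of `(text, is_name)` segments), the function names `count_singles` and `bigrams`, and the sample sentence are all assumptions made for illustration. Only single characters are typed here; double characters would be counted analogously.

```python
from collections import Counter, defaultdict


def count_singles(segments):
    """Count z1 and the per-type counts h1, m1, t1, n1 for every character.

    `segments` is a hypothetical annotated corpus: (text, is_name) pairs.
    """
    z1 = Counter()
    by_type = defaultdict(Counter)  # by_type["H1"][ch] is h1 for character ch
    for text, is_name in segments:
        for pos, ch in enumerate(text):
            if not is_name:
                type_name = "N1"
            elif pos == 0:
                type_name = "H1"
            elif pos == len(text) - 1:
                type_name = "T1"
            else:
                type_name = "M1"
            z1[ch] += 1
            by_type[type_name][ch] += 1
    return z1, by_type


def bigrams(text):
    """All double characters (adjacent character pairs) of a text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Illustrative sample: a two-character name followed by ordinary text.
segments = [("王军", True), ("虎头虎脑", False)]
z1, by_type = count_singles(segments)
ns1 = sum(z1.values())                    # total number of single characters
nss1 = len(z1)                            # number of distinct single characters
ns2 = len(bigrams("王军虎头虎脑"))        # total number of double characters
print(ns1, nss1, ns2)
```

In this sample the identities stated above can be checked directly: each character's total z1 equals the sum of its four per-type counts, and ns2 equals ns1 − 1.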
Step 1.3: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i: $P(S_i) = \frac{z1}{ns1}$. Clearly $\sum_{i=1}^{nss1} P(S_i) = 1$.
Calculate the occurrence probability of each double character D_i: $P(D_i) = \frac{z2}{ns2}$. Clearly $\sum_{i=1}^{nss2} P(D_i) = 1$.
Calculate the probability that each single character S_i belongs to the H1 type: $P(H1 \mid S_i) = \frac{h1}{z1}$.
Calculate the probability that each single character S_i belongs to the M1 type: $P(M1 \mid S_i) = \frac{m1}{z1}$.
Calculate the probability that each single character S_i belongs to the T1 type: $P(T1 \mid S_i) = \frac{t1}{z1}$.
Calculate the probability that each single character S_i belongs to the N1 type: $P(N1 \mid S_i) = \frac{n1}{z1}$.
Clearly P(H1|S_i) + P(M1|S_i) + P(T1|S_i) + P(N1|S_i) = 1.
Calculate the probability that each double character D_i belongs to the H2 type: $P(H2 \mid D_i) = \frac{h2}{z2}$.
Calculate the probability that each double character D_i belongs to the HM2 type: $P(HM2 \mid D_i) = \frac{hm2}{z2}$.
Calculate the probability that each double character D_i belongs to the MT2 type: $P(MT2 \mid D_i) = \frac{mt2}{z2}$.
Calculate the probability that each double character D_i belongs to the N2 type: $P(N2 \mid D_i) = \frac{n2}{z2}$.
Clearly P(H2|D_i) + P(HM2|D_i) + P(MT2|D_i) + P(N2|D_i) = 1.
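Step 1.3 turns the counts of step 1.2 into relative frequencies. The following Python sketch shows one way this could be done, assuming the counts are held in plain dictionaries; the function name `occurrence_and_type_probs`, the dictionary layout and the sample counts are hypothetical and only serve as illustration.

```python
def occurrence_and_type_probs(totals, type_counts):
    """Compute P(unit) and P(type | unit) from raw counts.

    totals[u]         -- total occurrences of unit u (z1 or z2)
    type_counts[t][u] -- occurrences of unit u as type t (h1, m1, ... or h2, hm2, ...)
    """
    corpus_size = sum(totals.values())  # ns1 for single characters, ns2 for double characters
    p_unit = {u: z / corpus_size for u, z in totals.items()}
    p_type_given_unit = {
        t: {u: counts.get(u, 0) / totals[u] for u in totals}
        for t, counts in type_counts.items()
    }
    return p_unit, p_type_given_unit


# Made-up counts for three single characters; each total equals the sum over the four types.
totals = {"张": 10, "明": 8, "的": 82}
type_counts = {
    "H1": {"张": 6},
    "M1": {"明": 1},
    "T1": {"明": 3},
    "N1": {"张": 4, "明": 4, "的": 82},
}
p_unit, p_type_given_unit = occurrence_and_type_probs(totals, type_counts)
print(p_unit["张"], p_type_given_unit["H1"]["张"])
```

With these sample counts the occurrence probabilities sum to 1 over all characters, and for each character the four type probabilities sum to 1, matching the two identities stated in step 1.3.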
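Step 1.4: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i within the H1 type: $P(S_i \mid H1) = \frac{P(H1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nh1} P(H1 \mid S_i) \times P(S_i)}$. Clearly $\sum_{i=1}^{nh1} P(S_i \mid H1) = 1$.
Calculate the occurrence probability of each single character S_i within the M1 type: $P(S_i \mid M1) = \frac{P(M1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nm1} P(M1 \mid S_i) \times P(S_i)}$. Clearly $\sum_{i=1}^{nm1} P(S_i \mid M1) = 1$.
Calculate the occurrence probability of each single character S_i within the T1 type: $P(S_i \mid T1) = \frac{P(T1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nt1} P(T1 \mid S_i) \times P(S_i)}$. Clearly $\sum_{i=1}^{nt1} P(S_i \mid T1) = 1$.
Calculate the occurrence probability of each single character S_i within the N1 type: $P(S_i \mid N1) = \frac{P(N1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nn1} P(N1 \mid S_i) \times P(S_i)}$. Clearly $\sum_{i=1}^{nn1} P(S_i \mid N1) = 1$.
Calculate the occurrence probability of each double character D_i within the H2 type: $P(D_i \mid H2) = \frac{P(H2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 \mid D_i) \times P(D_i)}$. Clearly $\sum_{i=1}^{nh2} P(D_i \mid H2) = 1$.
Calculate the occurrence probability of each double character D_i within the HM2 type: $P(D_i \mid HM2) = \frac{P(HM2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 \mid D_i) \times P(D_i)}$. Clearly $\sum_{i=1}^{nhm2} P(D_i \mid HM2) = 1$.
Calculate the occurrence probability of each double character D_i within the MT2 type: $P(D_i \mid MT2) = \frac{P(MT2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 \mid D_i) \times P(D_i)}$. Clearly $\sum_{i=1}^{nmt2} P(D_i \mid MT2) = 1$.
Calculate the occurrence probability of each double character D_i within the N2 type: $P(D_i \mid N2) = \frac{P(N2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 \mid D_i) \times P(D_i)}$. Clearly $\sum_{i=1}^{nn2} P(D_i \mid N2) = 1$.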
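Step 1.4 is an application of Bayes' rule: the probability of a type given a unit is inverted into the probability of a unit given the type, normalized over all units observed in that type. A minimal Python sketch follows; the function name `unit_given_type` and the sample probabilities are assumptions for illustration, and the inputs are taken to have the dictionary layout `p_unit[u]` and `p_type_given_unit[type][u]`.

```python
def unit_given_type(p_unit, p_type_given_unit):
    """Invert P(type | unit) into P(unit | type) with Bayes' rule.

    For each type, the numerator P(type | u) * P(u) is normalized by the sum of
    the same product over every unit that occurs in that type.
    """
    result = {}
    for type_name, cond in p_type_given_unit.items():
        joint = {u: p * p_unit[u] for u, p in cond.items() if p > 0}
        norm = sum(joint.values())
        result[type_name] = {u: v / norm for u, v in joint.items()} if norm > 0 else {}
    return result


# Made-up inputs: two characters that occur as H1 and one that never does.
p_unit = {"张": 0.1, "王": 0.1, "的": 0.8}
p_type_given_unit = {
    "H1": {"张": 0.6, "王": 0.2, "的": 0.0},
    "N1": {"张": 0.4, "王": 0.8, "的": 1.0},
}
p_unit_given_type = unit_given_type(p_unit, p_type_given_unit)
print(p_unit_given_type["H1"])
```

Under the count-ratio estimates of step 1.3, the product P(H1|S_i) × P(S_i) equals h1/ns1, so the result is simply each character's share of all H1 occurrences. In the sample, "张" accounts for 0.06 of the 0.08 total H1 mass, that is, three quarters.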
Referring to Fig. 1, step 2 of the method specifically comprises the following steps:
Step 2.1: in the text to be processed, judge whether each double character belongs to the H2 type or the HM2 type, for example in the order in which the double characters appear. For a double character D_i, P(H2|D_i) > 0 means that D_i belongs to the H2 type, otherwise it does not. P(HM2|D_i) > first threshold means that D_i belongs to the HM2 type, otherwise it does not. The first threshold takes a value in the range 0.13 to 0.22, for example 0.2.
If the double character belongs to the H2 type or the HM2 type, set it as d1 and go to step 2.4.
If the double character belongs to neither the H2 type nor the HM2 type, split it into two single characters and go to step 2.2.
Step 2.2: judge whether the first of the two single characters belongs to the H1 type. P(H1|S_i) > 0 means that the single character S_i belongs to the H1 type.
If the first single character belongs to the H1 type, set it as d1 and go to step 2.4.
If the first single character does not belong to the H1 type, go to step 2.3.
Step 2.3: judge whether the second of the two single characters belongs to the H1 type. P(H1|S_i) > 0 means that the single character S_i belongs to the H1 type.
If the second single character belongs to the H1 type, set it as d1 and go to step 2.4.
If the second single character does not belong to the H1 type, then neither the double character nor either of its two single characters is a name or part of a name; return to step 2.1 and judge the next double character after this one (for example, in order).
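Steps 2.1 to 2.3 amount to a three-stage search for the leading element d1 of a candidate name. The Python sketch below illustrates that search for a single double character. It is a sketch under assumptions: the function name `find_d1`, the lookup tables (dictionaries mapping a unit to its type probability) and the sample values are hypothetical, and a unit missing from a table is treated as having probability 0.

```python
def find_d1(pair, p_h2, p_hm2, p_h1, first_threshold=0.2):
    """Return the leading name element d1 found in `pair`, or None.

    p_h2[D]  -- P(H2 | D) for double characters
    p_hm2[D] -- P(HM2 | D) for double characters
    p_h1[S]  -- P(H1 | S) for single characters
    """
    if len(pair) != 2:
        raise ValueError("a double character consists of exactly two characters")
    # Step 2.1: the pair itself is a compound surname (H2) or a name head (HM2).
    if p_h2.get(pair, 0.0) > 0 or p_hm2.get(pair, 0.0) > first_threshold:
        return pair
    first, second = pair
    # Step 2.2: the first single character can start a name.
    if p_h1.get(first, 0.0) > 0:
        return first
    # Step 2.3: the second single character can start a name.
    if p_h1.get(second, 0.0) > 0:
        return second
    # Neither the pair nor its characters belong to a name.
    return None


# Made-up probability tables for illustration.
p_h2 = {"欧阳": 0.9}
p_hm2 = {"王军": 0.5, "张明": 0.1}
p_h1 = {"张": 0.6}
print(find_d1("张明", p_h2, p_hm2, p_h1))
```

In the sample, "张明" has an HM2 probability of 0.1, which does not exceed the example threshold of 0.2, so the pair is split and its first character is returned as d1.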
Step 2.4: judge whether the double character following d1 belongs to the MT2 type. P(MT2|D_i) > second threshold means that the double character D_i belongs to the MT2 type. The second threshold takes a value in the range 0.13 to 0.22, for example 0.2.
If the double character following d1 belongs to the MT2 type, set it as d2 and go to step 2.6.
If the double character following d1 does not belong to the MT2 type, split it into two single characters, set them as d2 and d3 respectively, and go to step 2.5.
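The branching in step 2.4 can be sketched as a small Python function. The function name `take_after_d1`, the index argument `d1_end` (the position just past d1 in the text), the return convention and the sample table are assumptions for illustration. The case in which fewer than two characters remain after d1 is not addressed in the description; the sketch returns `None` there.

```python
def take_after_d1(text, d1_end, p_mt2, second_threshold=0.2):
    """Decide how the two characters after d1 are used.

    Returns (elements, next_step): a one-element tuple (d2,) with "2.6" when the
    following double character is of type MT2, or a two-element tuple (d2, d3)
    with "2.5" when it has to be split. Returns None if fewer than two
    characters follow d1 (an assumption, not covered by the description).
    """
    pair = text[d1_end:d1_end + 2]
    if len(pair) < 2:
        return None
    if p_mt2.get(pair, 0.0) > second_threshold:
        return (pair,), "2.6"
    return (pair[0], pair[1]), "2.5"


# Made-up MT2 table; d1 is the first character of the sample text.
print(take_after_d1("王军虎头", 1, {"军虎": 0.5}))
print(take_after_d1("王军虎头", 1, {}))
```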
Step 2.5: judge whether the combination of d1, d2 and d3 is a Chinese name. The combination is a Chinese name if all of the following five inequalities hold:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is a single character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double character D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double character D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
When d3 is a single character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);
When d3 is a double character D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2) and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d3.
If the combination of d1, d2 and d3 is judged not to be a Chinese name, go to step 2.6.
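The five inequalities of step 2.5 compare the "head, middle, tail" reading of d1, d2 and d3 against five rival readings. A minimal Python sketch of this test is given below. The names `ROLE_TO_TYPE`, `p`, `RIVALS` and `is_three_part_name`, the nested-dictionary layout `tables[type][unit]` holding the step 1.4 probabilities, the treatment of unseen units as probability 0, and the sample values are all assumptions for illustration.

```python
# Map a role (H, M, T, N) to the concrete type, depending on whether the
# element is a single character (length 1) or a double character (length 2).
ROLE_TO_TYPE = {
    1: {"H": "H1", "M": "M1", "T": "T1", "N": "N1"},
    2: {"H": "H2", "M": "HM2", "T": "MT2", "N": "N2"},
}

# The right-hand sides of the five inequalities, as roles of (d1, d2, d3).
RIVALS = [("N", "N", "N"), ("N", "N", "H"), ("N", "H", "T"),
          ("H", "T", "N"), ("H", "T", "H")]


def p(tables, unit, role):
    """Look up P(unit | type) for the type that `role` maps to; 0.0 if unseen."""
    type_name = ROLE_TO_TYPE[len(unit)][role]
    return tables.get(type_name, {}).get(unit, 0.0)


def is_three_part_name(tables, d1, d2, d3):
    """True if the head-middle-tail reading strictly beats all five rival readings."""
    name_score = p(tables, d1, "H") * p(tables, d2, "M") * p(tables, d3, "T")
    return all(
        name_score > p(tables, d1, a) * p(tables, d2, b) * p(tables, d3, c)
        for a, b, c in RIVALS
    )


# Made-up probability tables for illustration.
tables = {
    "H1": {"张": 0.3, "明": 0.01},
    "M1": {"小": 0.2},
    "T1": {"明": 0.25, "小": 0.05},
    "N1": {"张": 0.01, "小": 0.02, "明": 0.02, "的": 0.3},
}
print(is_three_part_name(tables, "张", "小", "明"))
print(is_three_part_name(tables, "的", "小", "明"))
```

Because the comparison is strict, a candidate whose head-middle-tail score is zero can never be accepted, which is why the second sample call rejects a combination that starts with a character never seen as a name head.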
Step 2.6: judge whether the combination of d1 and d2 is a Chinese name. The combination is a Chinese name if both of the following inequalities hold:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N). This inequality means that the probability that d1 is the leading name element and d2 the trailing name element is greater than the probability that neither d1 nor d2 is a name element. For example, when d1 is "Liu" and d2 is "Xiang", the probability that both are name elements in this order is high. By contrast, when d1 is "building" and d2 is "room", the probability that neither is a name element is high.
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H). This inequality means that the probability that d1 is the leading name element and d2 the trailing name element is greater than the probability that d1 is not a name element while d2 is a leading name element. For example, in the sentence rendered in translation as "Zhang Ming that charges does not have", the inequality is not satisfied when d1 is the character preceding the name (rendered as "money") and d2 is "Zhang"; it is satisfied when d1 is "Zhang" and d2 is "Ming".
When d1 is a single character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double character D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);
When d2 is a double character D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2) and P(d2|N) is P(D_i|N2);
If the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d2.
If the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and judge the next double character after d2 (for example, in order).
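The two-inequality test of step 2.6 can be sketched in Python in the same manner. As before, this is an illustration under assumptions: the names `ROLE_TO_TYPE_2`, `prob` and `is_two_part_name`, the `tables[type][unit]` layout, the treatment of unseen units as probability 0, and the sample characters and values are hypothetical. The sample mirrors the "Zhang Ming" example above, with an assumed preceding character.

```python
# Role-to-type mapping for single characters (length 1) and double characters (length 2).
ROLE_TO_TYPE_2 = {
    1: {"H": "H1", "T": "T1", "N": "N1"},
    2: {"H": "H2", "T": "MT2", "N": "N2"},
}


def prob(tables, unit, role):
    """Look up P(unit | type) for the type that `role` maps to; 0.0 if unseen."""
    type_name = ROLE_TO_TYPE_2[len(unit)][role]
    return tables.get(type_name, {}).get(unit, 0.0)


def is_two_part_name(tables, d1, d2):
    """True if 'd1 leads, d2 trails' strictly beats both rival readings."""
    name_score = prob(tables, d1, "H") * prob(tables, d2, "T")
    neither = prob(tables, d1, "N") * prob(tables, d2, "N")       # first inequality
    shifted = prob(tables, d1, "N") * prob(tables, d2, "H")       # second inequality
    return name_score > neither and name_score > shifted


# Made-up probability tables for illustration.
tables = {
    "H1": {"张": 0.3, "费": 0.001, "明": 0.01},
    "T1": {"明": 0.25, "张": 0.001},
    "N1": {"费": 0.05, "张": 0.01, "明": 0.02},
}
print(is_two_part_name(tables, "费", "张"))
print(is_two_part_name(tables, "张", "明"))
```

With these sample values the first pair is rejected, because the reading in which its second character starts a name scores higher, while the second pair is accepted.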
Once every double character in the text to be processed has been judged according to step 2.1 (for example, in order), the identification of Chinese names in that text is complete. The present invention can stably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.

Claims (2)

1. A method for automatically identifying Chinese names, characterized by comprising the following steps:
Step 1: gather statistics from text in which Chinese names are annotated;
Step 2: identify Chinese names in the text to be processed;
Step 1 of the method specifically comprises the following steps:
Step 1.1: in the text annotated with Chinese names, single characters are divided into the following four types, a single character being one Chinese character;
---H1 type: appears as the first character of a Chinese name;
---M1 type: appears in the middle of a Chinese name;
---T1 type: appears as the last character of a Chinese name;
---N1 type: appears outside any Chinese name;
Double characters are divided into the following four types, a double character being two consecutive Chinese characters;
---H2 type: appears as the first two characters of a Chinese name and is a two-character (compound) surname;
---HM2 type: appears as the first two characters of a Chinese name and is not a compound surname;
---MT2 type: appears as the last two characters of a three-character name without a compound surname, or of a four-character name with a compound surname;
---N2 type: appears outside any Chinese name;
Step 1.2: in the text annotated with Chinese names:
Count the number of distinct single characters of each of the four types H1, M1, T1, N1, denoted nh1, nm1, nt1, nn1 respectively;
Count the number of distinct double characters of each of the four types H2, HM2, MT2, N2, denoted nh2, nhm2, nmt2, nn2 respectively;
Count the total number of occurrences of each single character, denoted z1; count the number of times each single character appears as type H1, M1, T1, N1, denoted h1, m1, t1, n1 respectively;
Count the total number of occurrences of each double character, denoted z2; count the number of times each double character appears as type H2, HM2, MT2, N2, denoted h2, hm2, mt2, n2 respectively;
Step 1.3: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i: $P(S_i) = \frac{z1}{ns1}$, where ns1 is the total number of single characters in the text;
Calculate the occurrence probability of each double character D_i: $P(D_i) = \frac{z2}{ns2}$, where ns2 is the total number of double characters in the text;
Calculate the probability that each single character S_i belongs to the H1 type: $P(H1 \mid S_i) = \frac{h1}{z1}$;
Calculate the probability that each single character S_i belongs to the M1 type: $P(M1 \mid S_i) = \frac{m1}{z1}$;
Calculate the probability that each single character S_i belongs to the T1 type: $P(T1 \mid S_i) = \frac{t1}{z1}$;
Calculate the probability that each single character S_i belongs to the N1 type: $P(N1 \mid S_i) = \frac{n1}{z1}$;
Calculate the probability that each double character D_i belongs to the H2 type: $P(H2 \mid D_i) = \frac{h2}{z2}$;
Calculate the probability that each double character D_i belongs to the HM2 type: $P(HM2 \mid D_i) = \frac{hm2}{z2}$;
Calculate the probability that each double character D_i belongs to the MT2 type: $P(MT2 \mid D_i) = \frac{mt2}{z2}$;
Calculate the probability that each double character D_i belongs to the N2 type: $P(N2 \mid D_i) = \frac{n2}{z2}$;
Step 1.4: in the text annotated with Chinese names:
Calculate the occurrence probability of each single character S_i within the H1 type: $P(S_i \mid H1) = \frac{P(H1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nh1} P(H1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the M1 type: $P(S_i \mid M1) = \frac{P(M1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nm1} P(M1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the T1 type: $P(S_i \mid T1) = \frac{P(T1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nt1} P(T1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each single character S_i within the N1 type: $P(S_i \mid N1) = \frac{P(N1 \mid S_i) \times P(S_i)}{\sum_{i=1}^{nn1} P(N1 \mid S_i) \times P(S_i)}$;
Calculate the occurrence probability of each double character D_i within the H2 type: $P(D_i \mid H2) = \frac{P(H2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the HM2 type: $P(D_i \mid HM2) = \frac{P(HM2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the MT2 type: $P(D_i \mid MT2) = \frac{P(MT2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 \mid D_i) \times P(D_i)}$;
Calculate the occurrence probability of each double character D_i within the N2 type: $P(D_i \mid N2) = \frac{P(N2 \mid D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 \mid D_i) \times P(D_i)}$;
Step 2 of the method specifically comprises the following steps:
Step 2.1: in the written material in which Chinese names are to be identified, judge each double word in sequence to determine whether it belongs to the H2 type or the HM2 type. P(H2|D_i) > 0 indicates that the double word D_i belongs to the H2 type; otherwise it does not. P(HM2|D_i) > first threshold indicates that the double word D_i belongs to the HM2 type; otherwise it does not. The first threshold ranges from 0.13 to 0.22;
If the double word belongs to the H2 type or the HM2 type, set it as d1 and proceed to step 2.4;
If the double word belongs to neither the H2 type nor the HM2 type, split it into two individual characters and proceed to step 2.2;
Step 2.2: judge whether the first of these individual characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the first individual character belongs to the H1 type, set it as d1 and proceed to step 2.4;
If the first individual character does not belong to the H1 type, proceed to step 2.3;
Step 2.3: judge whether the second of these individual characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the individual character S_i belongs to the H1 type;
If the second individual character belongs to the H1 type, set it as d1 and proceed to step 2.4;
If the second individual character does not belong to the H1 type, then neither the double word nor either of its two individual characters is a name or part of a name; return to step 2.1 and judge the next double word after this one;
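Steps 2.1 to 2.3 amount to a search for the head candidate d1. The sketch below is one possible reading in Python, not the patent's implementation. The function name `find_head`, the dictionary-based probability tables, and the toy entries are all hypothetical. The default threshold of 0.2 is the value given in claim 2.

```python
def find_head(double_word, p_h2, p_hm2, p_h1, first_threshold=0.2):
    """Return d1 for a double word, or None if no head candidate is found.

    p_h2, p_hm2: dicts mapping a double word to P(H2|D_i) and P(HM2|D_i).
    p_h1:        dict mapping an individual character to P(H1|S_i).
    Missing entries are treated as probability 0.
    """
    # Step 2.1: the whole double word is of the H2 or HM2 type.
    if p_h2.get(double_word, 0.0) > 0:
        return double_word
    if p_hm2.get(double_word, 0.0) > first_threshold:
        return double_word
    # Steps 2.2 and 2.3: test the first character, then the second.
    for ch in double_word:
        if p_h1.get(ch, 0.0) > 0:
            return ch
    # Neither the double word nor its characters start a name.
    return None


# Hypothetical toy tables.
p_h2 = {"欧阳": 0.9}
p_hm2 = {"王建": 0.3, "李小": 0.1}
p_h1 = {"王": 0.8, "李": 0.7}
heads = [find_head(w, p_h2, p_hm2, p_h1) for w in ["欧阳", "王建", "小李", "我们"]]
```

In this toy run, "小李" falls through to step 2.3 and yields its second character as d1, while "我们" yields no candidate, so the scan would move on to the next double word.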
Step 2.4: judge whether the double word following d1 belongs to the MT2 type. P(MT2|D_i) > second threshold indicates that the double word D_i belongs to the MT2 type. The second threshold ranges from 0.13 to 0.22;
If the double word following d1 belongs to the MT2 type, set it as d2 and proceed to step 2.6;
If the double word following d1 does not belong to the MT2 type, split it into two individual characters, set them as d2 and d3 respectively, and proceed to step 2.5;
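The branching in step 2.4 can be sketched as a small helper that returns either one unit (d2) or two units (d2, d3). The function name `units_after_head`, the table layout, and the toy values are assumptions for illustration only.

```python
def units_after_head(double_word, p_mt2, second_threshold=0.2):
    """Return (d2,) if the double word is of the MT2 type, else (d2, d3).

    p_mt2 maps a double word to P(MT2|D_i); missing entries count as 0.
    """
    if p_mt2.get(double_word, 0.0) > second_threshold:
        # Kept whole as d2; the caller continues with the two-part test.
        return (double_word,)
    # Split into individual characters d2 and d3 for the three-part test.
    return tuple(double_word)


# Hypothetical toy table.
p_mt2 = {"建国": 0.4, "国的": 0.05}
whole = units_after_head("建国", p_mt2)
split = units_after_head("国的", p_mt2)
```

The length of the returned tuple tells the caller which test to run next: one unit leads to step 2.6, two units lead to step 2.5.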
Step 2.5: judge whether the combination of d1, d2 and d3 is a Chinese name. The combination of d1, d2 and d3 is a Chinese name if all 5 of the following inequalities hold simultaneously:
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|N)×P(d3|H)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|N)×P(d2|H)×P(d3|T)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|N)
P(d1|H)×P(d2|M)×P(d3|T)>P(d1|H)×P(d2|T)×P(d3|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1), and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|M) is P(D_i|HM2), P(d2|T) is P(D_i|MT2), and P(d2|N) is P(D_i|N2);
When d3 is an individual character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1), and P(d3|N) is P(S_i|N1);
When d3 is a double word D_i, P(d3|H) is P(D_i|H2), P(d3|T) is P(D_i|MT2), and P(d3|N) is P(D_i|N2);
If the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1, and judge the next double word after d3;
If the combination of d1, d2 and d3 is judged not to be a Chinese name, proceed to step 2.6;
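The five inequalities of step 2.5 compare the H-M-T reading of (d1, d2, d3) against five competing role assignments. A minimal Python sketch follows, assuming the class-conditional probabilities are stored in one dictionary per type. The names `ROLE_TO_TYPE`, `RIVALS`, `prob`, `is_three_part_name` and all toy numbers are hypothetical.

```python
# Which type table each role uses, chosen by unit length (1 or 2 characters),
# following the "When d1 is an individual character ..." substitutions.
ROLE_TO_TYPE = {
    1: {"H": "H1", "M": "M1", "T": "T1", "N": "N1"},
    2: {"H": "H2", "M": "HM2", "T": "MT2", "N": "N2"},
}

# Role assignments on the right-hand sides of the five inequalities.
RIVALS = [("N", "N", "N"), ("N", "N", "H"), ("N", "H", "T"),
          ("H", "T", "N"), ("H", "T", "H")]


def prob(tables, unit, role):
    """Look up P(unit | role); missing entries count as 0."""
    type_name = ROLE_TO_TYPE[len(unit)][role]
    return tables.get(type_name, {}).get(unit, 0.0)


def is_three_part_name(d1, d2, d3, tables):
    """True if the H-M-T reading strictly beats all five rival readings."""
    def score(r1, r2, r3):
        return prob(tables, d1, r1) * prob(tables, d2, r2) * prob(tables, d3, r3)

    as_name = score("H", "M", "T")
    return all(as_name > score(*rival) for rival in RIVALS)


# Hypothetical toy tables.
tables = {
    "H1": {"王": 0.05},
    "M1": {"建": 0.02},
    "T1": {"国": 0.03, "建": 0.001},
    "N1": {"王": 0.001, "建": 0.001, "国": 0.002},
}
name_found = is_three_part_name("王", "建", "国", tables)
no_name = is_three_part_name("我", "们", "的", tables)
```

The comparisons are strict, so a combination whose H-M-T product is zero can never be accepted, even when every rival product is also zero.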
Step 2.6: judge whether the combination of d1 and d2 is a Chinese name. The combination of d1 and d2 is a Chinese name if both of the following inequalities hold simultaneously:
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|N)
P(d1|H)×P(d2|T)>P(d1|N)×P(d2|H)
When d1 is an individual character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);
When d1 is a double word D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);
When d2 is an individual character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1), and P(d2|N) is P(S_i|N1);
When d2 is a double word D_i, P(d2|H) is P(D_i|H2), P(d2|T) is P(D_i|MT2), and P(d2|N) is P(D_i|N2);
If the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1, and judge the next double word after d2;
If the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and judge the next double word after d2.
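The two-part test of step 2.6 can be sketched in the same spirit. Here the probability lookup is passed in as a callable so that the test itself stays independent of how the tables are stored. The function name `is_two_part_name`, the `lookup` helper and the toy values are assumptions for illustration.

```python
from typing import Callable


def is_two_part_name(d1: str, d2: str, p: Callable[[str, str], float]) -> bool:
    """Step 2.6 test; p(unit, role) stands for P(unit | role), role in H, T, N."""
    as_name = p(d1, "H") * p(d2, "T")
    # Both inequalities must hold, and both are strict.
    return (as_name > p(d1, "N") * p(d2, "N")
            and as_name > p(d1, "N") * p(d2, "H"))


# Hypothetical toy probabilities keyed by (unit, role); absent pairs count as 0.
toy = {
    ("王", "H"): 0.05, ("王", "N"): 0.001,
    ("建国", "T"): 0.01, ("建国", "N"): 0.0001,
}


def lookup(unit: str, role: str) -> float:
    return toy.get((unit, role), 0.0)


result_name = is_two_part_name("王", "建国", lookup)
result_other = is_two_part_name("我", "们的", lookup)
```

In a full scanner the callable would dispatch on unit length, reading the H1, T1 and N1 tables for an individual character and the H2, MT2 and N2 tables for a double word.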
2. The method for automatically identifying Chinese names according to claim 1, characterized in that the first threshold is 0.2 and the second threshold is 0.2.
CN2010102336536A 2010-07-22 2010-07-22 Method for automatically identifying Chinese names Pending CN102339286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102336536A CN102339286A (en) 2010-07-22 2010-07-22 Method for automatically identifying Chinese names

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102336536A CN102339286A (en) 2010-07-22 2010-07-22 Method for automatically identifying Chinese names

Publications (1)

Publication Number Publication Date
CN102339286A true CN102339286A (en) 2012-02-01

Family

ID=45515023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102336536A Pending CN102339286A (en) 2010-07-22 2010-07-22 Method for automatically identifying Chinese names

Country Status (1)

Country Link
CN (1) CN102339286A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105988991A (en) * 2015-02-26 2016-10-05 阿里巴巴集团控股有限公司 Surname language recognition method and device, as well as server
CN105988989A (en) * 2015-02-26 2016-10-05 阿里巴巴集团控股有限公司 Chinese surname recognition method and device, as well as server
CN105988991B (en) * 2015-02-26 2019-01-18 阿里巴巴集团控股有限公司 A kind of recognition methods, device and the server of the affiliated languages of surname
CN105988989B (en) * 2015-02-26 2019-02-15 阿里巴巴集团控股有限公司 A kind of recognition methods, device and the server of Chinese surname
CN106354713A (en) * 2016-08-29 2017-01-25 达而观信息科技(上海)有限公司 Method for automatically identifying Chinese name
CN109344233A (en) * 2018-08-28 2019-02-15 昆明理工大学 A method of Chinese name recognition
CN109344233B (en) * 2018-08-28 2022-07-19 昆明理工大学 A method of Chinese name recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120201
