CN102339286A

CN102339286A - Method for automatically identifying Chinese names

Info

Publication number: CN102339286A
Application number: CN2010102336536A
Authority: CN
Inventors: 陈运文; 马飞涛; 宋海涛
Original assignee: Shengle Information Technology Shanghai Co Ltd
Current assignee: Shengle Information Technology Shanghai Co Ltd
Priority date: 2010-07-22
Filing date: 2010-07-22
Publication date: 2012-02-01

Abstract

The invention discloses a method for automatically identifying Chinese names. The method comprises the following steps: collecting statistics from, and training on, text in which Chinese names are annotated; partitioning a second-order model and a third-order model according to the positions at which Chinese characters occur; calculating the probabilities of the four types of distribution in each model; and obtaining the statistical regularities of Chinese names by Bayesian probability statistics. Probabilities are then calculated over the Chinese text to be identified using a strategy that combines double characters and single characters, and the probabilities of the candidate combinations are compared to judge whether a Chinese name occurs. With this method, Chinese names of 2 to 4 characters can be identified stably, and ambiguous segmentations are resolved well.

Description

Method for automatic recognition of Chinese names

Technical field

The present invention relates to a Chinese information retrieval method, and in particular to a method for recognizing Chinese names.

Background technology

Personal names are the proper nouns most frequently encountered in daily life, and in information retrieval a name must be treated as a single unit for accurate results to be obtained. Take the name "Cao Guowei" as an example: if the retrieval system segments it into the three single characters "Cao", "Guo" and "Wei", that is, fails to recognize the Chinese name correctly, it may return erroneous results in which the three characters merely occur separately, for example a patent record in which "Guo" appears in the word for "China", "Cao" is the surname of an unrelated inventor, and "Wei" appears in yet another name.

Automatic recognition of Chinese names faces several major difficulties:

First, Chinese names can be combined in a very large number of ways, so mechanical segmentation directly against a dictionary is not feasible. On the one hand, it is difficult to build a dictionary that exhaustively lists all Chinese names. On the other hand, building such a dictionary introduces conflicts. For example, if the name "Wang Junhu" is added to the dictionary, then in the sentence "Wang Jun hu tou hu nao" ("Wang Jun is looking strong and good-natured"), "Wang Junhu" is wrongly identified as the name.

Second, Chinese names may have a single-character surname or a two-character surname, and may be two, three or four characters long.

Third, a Chinese name may form ambiguous combinations with the characters before and after it, which hinders correct recognition. For example, in the sentence "Chen Xiaodong Beijing concert", the last character of the name ("dong") and the first character of "Beijing" ("bei") together form the word "dongbei" (northeast), so name recognition can easily mis-segment the sentence as "Chen Xiao / dongbei".

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for automatically recognizing Chinese names that identifies Chinese names with comparatively high accuracy.

To solve the above technical problem, the method of the present invention for automatic recognition of Chinese names comprises the following steps:

Step 1: compile statistics over text in which Chinese names are annotated;

Step 2: perform Chinese name recognition on the text to be identified;

Step 1 of the method specifically comprises the following steps:

Step 1.1: in the text annotated with Chinese names, single characters are divided into the following four types, a single character being one Chinese character;

---H1 type: appears at the position of the first character of a Chinese name;

---M1 type: appears at a middle position of a Chinese name;

---T1 type: appears at the position of the last character of a Chinese name;

---N1 type: appears at a position outside any Chinese name;

Double characters are divided into the following four types, a double character being two consecutive Chinese characters;

---H2 type: appears at the position of the first two characters of a Chinese name and is a two-character surname;

---HM2 type: appears at the position of the first two characters of a Chinese name and is not a two-character surname;

---MT2 type: appears at the position of the last two characters of a three-character name without a two-character surname or of a four-character name with a two-character surname;

---N2 type: appears at a position outside any Chinese name;

Step 1.2: in the text annotated with Chinese names:

Count the number of distinct single characters of each of the four types H1, M1, T1 and N1, denoted nh1, nm1, nt1 and nn1 respectively;

Count the number of distinct double characters of each of the four types H2, HM2, MT2 and N2, denoted nh2, nhm2, nmt2 and nn2 respectively;

Count the total number of occurrences of each single character, denoted z1; count the number of times each single character occurs as each of the four types H1, M1, T1 and N1, denoted h1, m1, t1 and n1 respectively;

Count the total number of occurrences of each double character, denoted z2; count the number of times each double character occurs as each of the four types H2, HM2, MT2 and N2, denoted h2, hm2, mt2 and n2 respectively;

Step 1.3: in the text annotated with Chinese names:

Calculate the probability P(S_i) that each single character S_i occurs;

Calculate the probability P(D_i) that each double character D_i occurs;

Calculate the probability that each single character S_i belongs to the H1 type, P(H1 | S_i) = \frac{h1}{z1};

Calculate the probability that each single character S_i belongs to the M1 type, P(M1 | S_i) = \frac{m1}{z1};

Calculate the probability that each single character S_i belongs to the T1 type, P(T1 | S_i) = \frac{t1}{z1};

Calculate the probability that each single character S_i belongs to the N1 type, P(N1 | S_i) = \frac{n1}{z1};

Calculate the probability that each double character D_i belongs to the H2 type, P(H2 | D_i) = \frac{h2}{z2};

Calculate the probability that each double character D_i belongs to the HM2 type, P(HM2 | D_i) = \frac{hm2}{z2};

Calculate the probability that each double character D_i belongs to the MT2 type, P(MT2 | D_i) = \frac{mt2}{z2};

Calculate the probability that each double character D_i belongs to the N2 type, P(N2 | D_i) = \frac{n2}{z2};

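Step 1.4: in the text annotated with Chinese names:

Calculate the probability of occurrence of each single character S_i within the H1 type

P(S_i | H1) = \frac{P(H1 | S_i) \times P(S_i)}{\sum_{i=1}^{nh1} P(H1 | S_i) \times P(S_i)};

Calculate the probability of occurrence of each single character S_i within the M1 type

P(S_i | M1) = \frac{P(M1 | S_i) \times P(S_i)}{\sum_{i=1}^{nm1} P(M1 | S_i) \times P(S_i)};

Calculate the probability of occurrence of each single character S_i within the T1 type

P(S_i | T1) = \frac{P(T1 | S_i) \times P(S_i)}{\sum_{i=1}^{nt1} P(T1 | S_i) \times P(S_i)};

Calculate the probability of occurrence of each single character S_i within the N1 type

P(S_i | N1) = \frac{P(N1 | S_i) \times P(S_i)}{\sum_{i=1}^{nn1} P(N1 | S_i) \times P(S_i)};

Calculate the probability of occurrence of each double character D_i within the H2 type

P(D_i | H2) = \frac{P(H2 | D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the HM2 type

P(D_i | HM2) = \frac{P(HM2 | D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the MT2 type

P(D_i | MT2) = \frac{P(MT2 | D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the N2 type

P(D_i | N2) = \frac{P(N2 | D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 | D_i) \times P(D_i)};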

Step 2 of the method specifically comprises the following steps:

Step 2.1: in the text to be identified, judge each double character in sequence to determine whether it belongs to the H2 type or the HM2 type; P(H2|D_i) > 0 indicates that the double character D_i belongs to the H2 type, otherwise D_i does not belong to the H2 type; P(HM2|D_i) > first threshold indicates that D_i belongs to the HM2 type, otherwise D_i does not belong to the HM2 type; the first threshold ranges from 0.13 to 0.22;

If the double character belongs to the H2 type or the HM2 type, set it as d1 and proceed to step 2.4;

If the double character belongs to neither the H2 type nor the HM2 type, split it into two single characters and proceed to step 2.2;

Step 2.2: judge whether the first of the two single characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the single character S_i belongs to the H1 type;

If the first single character belongs to the H1 type, set it as d1 and proceed to step 2.4;

If the first single character does not belong to the H1 type, proceed to step 2.3;

Step 2.3: judge whether the second of the two single characters belongs to the H1 type; P(H1|S_i) > 0 indicates that the single character S_i belongs to the H1 type;

If the second single character belongs to the H1 type, set it as d1 and proceed to step 2.4;

If the second single character does not belong to the H1 type, then neither the double character nor the two single characters into which it was split is a name or part of a name; return to step 2.1 and judge the next double character after this one;

Step 2.4: judge whether the double character following d1 belongs to the MT2 type; P(MT2|D_i) > second threshold indicates that the double character D_i belongs to the MT2 type; the second threshold ranges from 0.13 to 0.22;

If the double character following d1 belongs to the MT2 type, set it as d2 and proceed to step 2.6;

If the double character following d1 does not belong to the MT2 type, split it into two single characters, set them as d2 and d3 respectively, and proceed to step 2.5;

Step 2.5: judge whether the combination of d1, d2 and d3 is a Chinese name; the combination of d1, d2 and d3 is a Chinese name if all of the following 5 formulas are satisfied simultaneously:

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|H)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|H)×P(d3|T)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|H)

When d1 is a single character S_i, P(d1|H) is P(S_i|H1) and P(d1|N) is P(S_i|N1);

When d1 is a double character D_i, P(d1|H) is P(D_i|H2) and P(d1|N) is P(D_i|N2);

When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|M) is P(S_i|M1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);

When d3 is a single character S_i, P(d3|H) is P(S_i|H1), P(d3|T) is P(S_i|T1) and P(d3|N) is P(S_i|N1);

When the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d3;

When the combination of d1, d2 and d3 is judged not to be a Chinese name, proceed to step 2.6;

Step 2.6: judge whether the combination of d1 and d2 is a Chinese name; the combination of d1 and d2 is a Chinese name if both of the following 2 formulas are satisfied simultaneously:

P(d1|H)×P(d2|T)＞P(d1|N)×P(d2|N)

P(d1|H)×P(d2|T)＞P(d1|N)×P(d2|H)

When d2 is a single character S_i, P(d2|H) is P(S_i|H1), P(d2|T) is P(S_i|T1) and P(d2|N) is P(S_i|N1);

When the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d2;

When the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and judge the next double character after d2.

The present invention first obtains the statistical regularities of Chinese names by Bayesian probability statistics, and then calculates probabilities over the text to be identified using a strategy that combines double characters and single characters. It can therefore stably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.

Description of drawings

Fig. 1 is a schematic flow chart of step 2 of the method of the invention.

Embodiment

The method of the present invention for automatic recognition of Chinese names comprises the following steps:

Step 1: compile statistics over text in which Chinese names are annotated;

Step 2: perform Chinese name recognition on the text to be identified.

Step 1 of the method specifically comprises the following steps:

Step 1.1: in the text annotated with Chinese names, single characters (one Chinese character each) are divided by position into the following four types:

---H1 type: appears at the position of the first character of a Chinese name. A single character of the H1 type may be a single-character surname or the first character of a two-character surname.

---M1 type: appears at a middle position of a Chinese name (neither the first character nor the last). A single character of the M1 type may be the middle character of a three-character name, or the second or third character of a four-character name.

---T1 type: appears at the position of the last character of a Chinese name.

---N1 type: appears at a position outside any Chinese name.

Clearly, any single character belongs to one or more of the four types H1, M1, T1 and N1.

In the text annotated with Chinese names, double characters (two consecutive Chinese characters) are divided by position into the following four types:

---H2 type: appears at the position of the first two characters of a Chinese name and is a two-character surname. A double character of the H2 type may be part of a three-character or four-character name, but cannot be part of a two-character name.

---HM2 type: appears at the position of the first two characters of a Chinese name and is not a two-character surname. A double character of the HM2 type may be a two-character name, the first two characters of a three-character name without a two-character surname, or the first two characters of a four-character name without a two-character surname.

---MT2 type: appears at the position of the last two characters of a three-character name without a two-character surname or of a four-character name with a two-character surname.

---N2 type: appears at a position outside any Chinese name.

Clearly, any double character belongs to one or more of the four types H2, HM2, MT2 and N2.

With the above division of single characters and double characters into types, a Chinese name can take one of the following forms:

---A two-character name can only be the HM2 case, which covers the H1+T1 combination.

---A three-character name is an H1+MT2 combination (which covers the H1+M1+T1 combination), an H2+T1 combination, or an HM2+T1 combination (which also covers the H1+M1+T1 combination).

---A four-character name is either an H2+MT2 combination (which covers the H2+M1+T1 combination) or an HM2+MT2 combination (which covers three cases: the HM2+M1+T1, H1+M1+M1+T1 and H1+M1+MT2 combinations).
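The type definitions of step 1.1 can be sketched in code. The following is a minimal illustration, not the patented implementation: the helper names `single_types` and `pair_types`, the small `COMPOUND_SURNAMES` set and the example names written in Chinese characters are all assumptions made for this sketch. It labels each character of an annotated name as H1, M1 or T1, and labels the double characters inside the name as H2, HM2 or MT2 by following the definitions literally; any other position in the corpus would count as N1 or N2.

```python
# Hypothetical list of two-character (compound) surnames; a real system
# would need a much fuller list.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛"}


def single_types(name):
    """Label every character of an annotated name as H1, M1 or T1."""
    last = len(name) - 1
    return ["H1" if i == 0 else "T1" if i == last else "M1"
            for i in range(len(name))]


def pair_types(name):
    """Map the start index of each typed double character to its type."""
    n = len(name)
    # H2 cannot be part of a two-character name, so require n >= 3.
    compound = n >= 3 and name[:2] in COMPOUND_SURNAMES
    types = {0: "H2" if compound else "HM2"}
    # MT2: last two characters of a three-character name without a compound
    # surname, or of a four-character name with a compound surname.
    if (n == 3 and not compound) or (n == 4 and compound):
        types[n - 2] = "MT2"
    return types


print(single_types("王军虎"))    # ['H1', 'M1', 'T1']
print(pair_types("王军虎"))      # {0: 'HM2', 1: 'MT2'}
print(pair_types("欧阳震华"))    # {0: 'H2', 2: 'MT2'}
```

Keeping the two labelers separate mirrors the text, in which the same stretch of a name is described once in terms of single characters and once in terms of double characters.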

Step 1.2: in the text annotated with Chinese names:

Count the total number of single characters (that is, the total number of Chinese characters), denoted ns1. Count the number of distinct single characters, denoted nss1. Clearly nss1 ≤ ns1. Count the number of distinct single characters of each of the four types H1, M1, T1 and N1, denoted nh1, nm1, nt1 and nn1 respectively. Because the same single character may occur as different types, nh1+nm1+nt1+nn1 ≥ nss1.

Count the total number of double characters, denoted ns2. For example, in the sentence "Wang Jun hu tou hu nao" ("Wang Jun is looking strong and good-natured"), every pair of adjacent characters, such as "Wang Jun", "Jun hu", "hu tou", "tou hu" and "hu nao", is a double character, so clearly ns2 = ns1-1. Count the number of distinct double characters, denoted nss2. Clearly nss2 ≤ ns2. Count the number of distinct double characters of each of the four types H2, HM2, MT2 and N2, denoted nh2, nhm2, nmt2 and nn2 respectively. Because the same double character may occur as different types, nh2+nhm2+nmt2+nn2 ≥ nss2.

Count the total number of occurrences of each single character, denoted z1. Count the number of times each single character occurs as each of the four types H1, M1, T1 and N1, denoted h1, m1, t1 and n1 respectively. Clearly, z1=h1+m1+t1+n1.

Count the total number of occurrences of each double character, denoted z2. Count the number of times each double character occurs as each of the four types H2, HM2, MT2 and N2, denoted h2, hm2, mt2 and n2 respectively. Clearly, z2=h2+hm2+mt2+n2.
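The single-character counts of step 1.2 might be gathered as in the sketch below. It assumes a hypothetical corpus format, a list of `(text, is_name)` segments in reading order, which the source does not specify; the function name `count_singles` and the sample sentences are likewise illustrative. Double characters could be counted the same way with a sliding window of width two.

```python
from collections import Counter, defaultdict

# Hypothetical annotated corpus: (text, is_name) segments in reading order.
corpus = [("王军", True), ("虎头虎脑", False), ("王军虎", True), ("来了", False)]


def count_singles(segments):
    """Return counts[char][type], i.e. h1, m1, t1 and n1 for each character."""
    counts = defaultdict(Counter)
    for text, is_name in segments:
        last = len(text) - 1
        for i, ch in enumerate(text):
            if not is_name:
                kind = "N1"
            elif i == 0:
                kind = "H1"
            elif i == last:
                kind = "T1"
            else:
                kind = "M1"
            counts[ch][kind] += 1
    return counts


counts = count_singles(corpus)
z1 = {ch: sum(c.values()) for ch, c in counts.items()}   # occurrences per character
ns1 = sum(z1.values())                                    # total characters
nss1 = len(counts)                                        # distinct characters
# nh1, nm1, nt1, nn1: distinct characters seen at least once in each type.
distinct = {k: sum(1 for c in counts.values() if c[k] > 0)
            for k in ("H1", "M1", "T1", "N1")}

print(ns1, nss1, distinct)
```

On this toy corpus the per-type distinct counts sum to 9 while nss1 is 7, which illustrates the inequality nh1+nm1+nt1+nn1 ≥ nss1: two characters occur in more than one type.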

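Step 1.3: in the text annotated with Chinese names:

Calculate the probability that each single character S_i occurs, P(S_i) = \frac{z1}{ns1}. Clearly \sum_{i=1}^{nss1} P(S_i) = 1.

Calculate the probability that each double character D_i occurs, P(D_i) = \frac{z2}{ns2}. Clearly \sum_{i=1}^{nss2} P(D_i) = 1.

Calculate the probability that each single character S_i belongs to each of the four types: P(H1 | S_i) = \frac{h1}{z1}, P(M1 | S_i) = \frac{m1}{z1}, P(T1 | S_i) = \frac{t1}{z1} and P(N1 | S_i) = \frac{n1}{z1}. Clearly P(H1|S_i)+P(M1|S_i)+P(T1|S_i)+P(N1|S_i)=1.

Calculate the probability that each double character D_i belongs to the H2 type, P(H2 | D_i) = \frac{h2}{z2}, and likewise P(HM2 | D_i) = \frac{hm2}{z2} and P(MT2 | D_i) = \frac{mt2}{z2}.

Calculate the probability that each double character D_i belongs to the N2 type, P(N2 | D_i) = \frac{n2}{z2}.

Clearly P(H2|D_i)+P(HM2|D_i)+P(MT2|D_i)+P(N2|D_i)=1.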
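As a worked illustration of step 1.3, the sketch below turns counts into probabilities. It assumes the occurrence probability is the relative frequency z1/ns1 and the type probability is the count ratio (for example t1/z1), which is the reading consistent with the normalization statements above; the counts themselves are invented. Exact fractions are used so that the sums to 1 can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical counts for three characters: counts[char][type].
counts = {
    "王": {"H1": 2},
    "军": {"M1": 1, "T1": 1},
    "虎": {"T1": 1, "N1": 2},
}
TYPES = ("H1", "M1", "T1", "N1")

z1 = {ch: sum(c.values()) for ch, c in counts.items()}
ns1 = sum(z1.values())

# P(S_i) = z1 / ns1 (assumed relative frequency)
p_char = {ch: Fraction(z1[ch], ns1) for ch in counts}

# P(type | S_i) = (occurrences of S_i in that type) / z1
p_type_given_char = {
    ch: {t: Fraction(c.get(t, 0), z1[ch]) for t in TYPES}
    for ch, c in counts.items()
}

print(p_char["虎"])                     # 3/7
print(p_type_given_char["虎"]["T1"])    # 1/3
```

The same computation applies to double characters with z2, ns2 and the types H2, HM2, MT2 and N2.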

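Step 1.4: in the text annotated with Chinese names:

Calculate the probability of occurrence of each single character S_i within the H1 type

P(S_i | H1) = \frac{P(H1 | S_i) \times P(S_i)}{\sum_{i=1}^{nh1} P(H1 | S_i) \times P(S_i)}

Clearly \sum_{i=1}^{nh1} P(S_i | H1) = 1.

Calculate the probability of occurrence of each single character S_i within the M1 type

P(S_i | M1) = \frac{P(M1 | S_i) \times P(S_i)}{\sum_{i=1}^{nm1} P(M1 | S_i) \times P(S_i)}

Clearly \sum_{i=1}^{nm1} P(S_i | M1) = 1.

Calculate the probability of occurrence of each single character S_i within the T1 type

P(S_i | T1) = \frac{P(T1 | S_i) \times P(S_i)}{\sum_{i=1}^{nt1} P(T1 | S_i) \times P(S_i)}

Clearly \sum_{i=1}^{nt1} P(S_i | T1) = 1.

Calculate the probability of occurrence of each single character S_i within the N1 type

P(S_i | N1) = \frac{P(N1 | S_i) \times P(S_i)}{\sum_{i=1}^{nn1} P(N1 | S_i) \times P(S_i)}

Clearly \sum_{i=1}^{nn1} P(S_i | N1) = 1.

Calculate the probability of occurrence of each double character D_i within the H2 type

P(D_i | H2) = \frac{P(H2 | D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 | D_i) \times P(D_i)}

Clearly \sum_{i=1}^{nh2} P(D_i | H2) = 1.

Calculate the probability of occurrence of each double character D_i within the HM2 type

P(D_i | HM2) = \frac{P(HM2 | D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 | D_i) \times P(D_i)}

Clearly \sum_{i=1}^{nhm2} P(D_i | HM2) = 1.

Calculate the probability of occurrence of each double character D_i within the MT2 type

P(D_i | MT2) = \frac{P(MT2 | D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 | D_i) \times P(D_i)}

Clearly \sum_{i=1}^{nmt2} P(D_i | MT2) = 1.

Calculate the probability of occurrence of each double character D_i within the N2 type

P(D_i | N2) = \frac{P(N2 | D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 | D_i) \times P(D_i)}

Clearly \sum_{i=1}^{nn2} P(D_i | N2) = 1.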
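The Bayesian inversion of step 1.4 can be sketched generically for any type. The function name `invert` and the input numbers are assumptions for illustration; the inputs are taken to be the step 1.3 quantities. If the occurrence probability is the relative frequency z1/ns1, the product P(T1|S_i)×P(S_i) reduces to t1/ns1, so the result is simply each character's share of all T1 occurrences.

```python
from fractions import Fraction


def invert(p_type_given_item, p_item, kind):
    """Bayes' rule: P(item | kind) from P(kind | item) and P(item)."""
    joint = {item: p_type_given_item[item].get(kind, Fraction(0)) * p_item[item]
             for item in p_item}
    total = sum(joint.values())
    # Only items that actually occur in this type receive a probability.
    return {item: j / total for item, j in joint.items() if j > 0}


# Hypothetical step 1.3 outputs for three characters.
p_item = {"王": Fraction(2, 7), "军": Fraction(2, 7), "虎": Fraction(3, 7)}
p_type_given_item = {
    "王": {"H1": Fraction(1)},
    "军": {"M1": Fraction(1, 2), "T1": Fraction(1, 2)},
    "虎": {"T1": Fraction(1, 3), "N1": Fraction(2, 3)},
}

p_given_t1 = invert(p_type_given_item, p_item, "T1")
print(p_given_t1)   # 军 and 虎 each receive probability 1/2
```

The denominator is the sum over the characters of that type, so the resulting values sum to 1 within each type, as the text states.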

Referring to Fig. 1, step 2 of the method specifically comprises the following steps:

Step 2.1: in the text to be identified, judge whether each double character belongs to the H2 type or the HM2 type, for example by taking the double characters one by one in their order of occurrence. For a double character D_i, P(H2|D_i) > 0 indicates that D_i belongs to the H2 type; otherwise D_i does not belong to the H2 type. P(HM2|D_i) > first threshold indicates that D_i belongs to the HM2 type; otherwise D_i does not belong to the HM2 type. The first threshold ranges from 0.13 to 0.22 and is, for example, 0.2.

If the double character belongs to the H2 type or the HM2 type, set it as d1 and proceed to step 2.4.

If the double character belongs to neither the H2 type nor the HM2 type, split it into two single characters and proceed to step 2.2.

Step 2.2: judge whether the first of the two single characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the single character S_i belongs to the H1 type.

If the first single character belongs to the H1 type, set it as d1 and proceed to step 2.4.

If the first single character does not belong to the H1 type, proceed to step 2.3.

Step 2.3: judge whether the second of the two single characters belongs to the H1 type. P(H1|S_i) > 0 indicates that the single character S_i belongs to the H1 type.

If the second single character belongs to the H1 type, set it as d1 and proceed to step 2.4.

If the second single character does not belong to the H1 type, then neither the double character nor the two single characters into which it was split is a name or part of a name; return to step 2.1 and judge the next double character after this one (for example in sequential order).
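Steps 2.1 to 2.3 together choose the head element d1 for a two-character window. A minimal sketch of that decision follows, assuming the trained probabilities are held in plain dictionaries; the function name `choose_head`, the dictionary layout and all numbers are hypothetical, and the default threshold 0.2 is the example value given in the text.

```python
def choose_head(pair, p_type_given_pair, p_type_given_char, first_threshold=0.2):
    """Return d1 for a two-character window, or None if no name can start here."""
    pair_probs = p_type_given_pair.get(pair, {})
    # Step 2.1: two-character surname (H2) or ordinary name start (HM2).
    if pair_probs.get("H2", 0.0) > 0 or pair_probs.get("HM2", 0.0) > first_threshold:
        return pair
    # Steps 2.2 and 2.3: try the first character, then the second, as H1.
    for ch in pair:
        if p_type_given_char.get(ch, {}).get("H1", 0.0) > 0:
            return ch
    return None


# Hypothetical probabilities, for illustration only.
p_type_given_pair = {"欧阳": {"H2": 0.9}, "王军": {"HM2": 0.5}, "的王": {"HM2": 0.01}}
p_type_given_char = {"王": {"H1": 0.3}, "的": {"N1": 1.0}}

print(choose_head("欧阳", p_type_given_pair, p_type_given_char))   # 欧阳
print(choose_head("的王", p_type_given_pair, p_type_given_char))   # 王
print(choose_head("的的", p_type_given_pair, p_type_given_char))   # None
```

A `None` result corresponds to the last branch of step 2.3, where the scan moves on to the next double character.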

Step 2.4: judge whether the double character following d1 belongs to the MT2 type. P(MT2|D_i) > second threshold indicates that the double character D_i belongs to the MT2 type. The second threshold ranges from 0.13 to 0.22 and is, for example, 0.2.

If the double character following d1 belongs to the MT2 type, set it as d2 and proceed to step 2.6.

If the double character following d1 does not belong to the MT2 type, split it into two single characters, set them as d2 and d3 respectively, and proceed to step 2.5.

Step 2.5: judge whether the combination of d1, d2 and d3 is a Chinese name. The combination of d1, d2 and d3 is a Chinese name if all of the following 5 formulas are satisfied simultaneously:

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|H)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|H)×P(d3|T)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|H)

When the combination of d1, d2 and d3 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d3.

When the combination of d1, d2 and d3 is judged not to be a Chinese name, proceed to step 2.6.
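The five-way comparison of step 2.5 can be expressed as one "name reading beats every rival reading" test. The sketch below is illustrative only: the table layout keyed by `(token, type)`, the helper names and every number are assumptions. It follows the mapping given earlier in the description, where a single-character element uses the H1/M1/T1/N1 probabilities and a double-character d1 uses H2 and N2.

```python
def lik(table, token, role):
    """P(token | role): single characters use H1/M1/T1/N1, doubles use H2/N2."""
    return table.get((token, role + str(len(token))), 0.0)


def is_three_part_name(d1, d2, d3, table):
    """Step 2.5 test: the name reading must beat all five rival readings."""
    def p(token, role):
        return lik(table, token, role)

    as_name = p(d1, "H") * p(d2, "M") * p(d3, "T")
    rivals = [
        p(d1, "N") * p(d2, "N") * p(d3, "N"),
        p(d1, "N") * p(d2, "N") * p(d3, "H"),
        p(d1, "N") * p(d2, "H") * p(d3, "T"),
        p(d1, "H") * p(d2, "T") * p(d3, "N"),
        p(d1, "H") * p(d2, "T") * p(d3, "H"),
    ]
    return all(as_name > r for r in rivals)


# Hypothetical likelihoods P(token | type), for illustration only.
table = {
    ("王", "H1"): 0.02, ("王", "N1"): 0.001,
    ("军", "H1"): 0.0001, ("军", "M1"): 0.01, ("军", "T1"): 0.008, ("军", "N1"): 0.002,
    ("虎", "H1"): 0.0001, ("虎", "T1"): 0.005, ("虎", "N1"): 0.001,
}
print(is_three_part_name("王", "军", "虎", table))   # True

table[("虎", "N1")] = 0.05    # make 虎 far more likely outside names
print(is_three_part_name("王", "军", "虎", table))   # False
```

In the second call the fourth rival reading (d1 and d2 form a two-character name and d3 is ordinary text) wins, which is the kind of ambiguity the "Wang Junhu" example in the background section describes.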

Step 2.6: judge whether the combination of d1 and d2 is a Chinese name. The combination of d1 and d2 is a Chinese name if both of the following 2 formulas are satisfied simultaneously:

P(d1|H)×P(d2|T) > P(d1|N)×P(d2|N). This formula means that the probability that d1 is a leading name element and d2 is a trailing name element is greater than the probability that neither d1 nor d2 is a name element. For example, when d1 is "Liu" and d2 is "Xiang", both are name elements and the probability of their occurring in this order is high. By contrast, when d1 is the character for "building" and d2 is the character for "room", the probability that neither is a name element is high.

P(d1|H)×P(d2|T) > P(d1|N)×P(d2|H). This formula means that the probability that d1 is a leading name element and d2 is a trailing name element is greater than the probability that d1 is not a name element and d2 is a leading name element. For example, take a sentence in which the character for "money" is immediately followed by the name "Zhang Ming". When d1 is "money" and d2 is "Zhang", the formula is not satisfied. When d1 is "Zhang" and d2 is "Ming", the formula is satisfied.
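The "money / Zhang / Ming" example can be checked numerically with a small sketch of the two step 2.6 conditions. Everything concrete here is assumed for illustration: the function name `is_two_part_name`, the choice of 钱 as the "money" character, and the likelihood values, which are invented so that 张 is far more probable as a leading name element than as a trailing one.

```python
def is_two_part_name(d1, d2, p):
    """Step 2.6 test; p[(token, role)] holds P(token | role) for roles H, T, N."""
    def lik(token, role):
        return p.get((token, role), 0.0)

    as_name = lik(d1, "H") * lik(d2, "T")
    return (as_name > lik(d1, "N") * lik(d2, "N")
            and as_name > lik(d1, "N") * lik(d2, "H"))


# Hypothetical likelihoods: 钱 ("money", also usable as a surname), 张 and 明.
p = {
    ("钱", "H"): 0.002, ("钱", "N"): 0.004,
    ("张", "H"): 0.03, ("张", "T"): 0.005, ("张", "N"): 0.001,
    ("明", "H"): 0.0005, ("明", "T"): 0.02, ("明", "N"): 0.003,
}

print(is_two_part_name("钱", "张", p))   # False
print(is_two_part_name("张", "明", p))   # True
```

With these numbers the pair 钱+张 passes the first condition but fails the second, because reading 张 as the start of a following name is more probable, so the scan moves on and accepts 张+明.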

When the combination of d1 and d2 is judged to be a Chinese name, record the name, return to step 2.1 and judge the next double character after d2.

When the combination of d1 and d2 is judged not to be a Chinese name, return to step 2.1 and judge the next double character after d2 (for example in sequential order).

Once every double character in the text to be identified has been judged according to step 2.1 (for example in sequential order), the work of identifying Chinese names in that text is complete. The present invention can stably identify Chinese names of 2 to 4 characters and resolves ambiguous segmentations well.

Claims

1. A method for automatic recognition of Chinese names, characterized in that it comprises the following steps:

Step 1: compile statistics over text in which Chinese names are annotated;

Step 1 of the method specifically comprises the following steps:

---H1 type: appears at the position of the first character of a Chinese name;

---M1 type: appears at a middle position of a Chinese name;

---T1 type: appears at the position of the last character of a Chinese name;

---N1 type: appears at a position outside any Chinese name;

---N2 type: appears at a position outside any Chinese name;

Step 1.2: in the text annotated with Chinese names:

Step 1.3: in the text annotated with Chinese names:

Calculate the probability P(S_i) that each single character S_i occurs

Calculate the probability P(D_i) that each double character D_i occurs

Calculate the probability P(H2|D_i) that each double character D_i belongs to the H2 type

Calculate the probability P(HM2|D_i) that each double character D_i belongs to the HM2 type

Calculate the probability P(MT2|D_i) that each double character D_i belongs to the MT2 type

Calculate the probability P(N2|D_i) that each double character D_i belongs to the N2 type

Step 1.4: in the text annotated with Chinese names:

Calculate the probability of occurrence of each double character D_i within the H2 type

P(D_i | H2) = \frac{P(H2 | D_i) \times P(D_i)}{\sum_{i=1}^{nh2} P(H2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the HM2 type

P(D_i | HM2) = \frac{P(HM2 | D_i) \times P(D_i)}{\sum_{i=1}^{nhm2} P(HM2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the MT2 type

P(D_i | MT2) = \frac{P(MT2 | D_i) \times P(D_i)}{\sum_{i=1}^{nmt2} P(MT2 | D_i) \times P(D_i)};

Calculate the probability of occurrence of each double character D_i within the N2 type

P(D_i | N2) = \frac{P(N2 | D_i) \times P(D_i)}{\sum_{i=1}^{nn2} P(N2 | D_i) \times P(D_i)};

Step 2 of the method specifically comprises the following steps:

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|N)×P(d3|H)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|N)×P(d2|H)×P(d3|T)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|N)

P(d1|H)×P(d2|M)×P(d3|T)＞P(d1|H)×P(d2|T)×P(d3|H)

P(d1|H)×P(d2|T)＞P(d1|N)×P(d2|N)

P(d1|H)×P(d2|T)＞P(d1|N)×P(d2|H)

2. The method for automatic recognition of Chinese names according to claim 1, characterized in that the first threshold is 0.2 and the second threshold is 0.2.