CN104750779A

CN104750779A - Chinese multi-class word identification method based on conditional random field

Info

Publication number: CN104750779A
Application number: CN201510096284.3A
Authority: CN
Inventors: 费凡; 徐文超; 杨雁峰; 刘云鹏; 汤俊; 杨艳琴
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2015-03-04
Filing date: 2015-03-04
Publication date: 2015-07-01

Abstract

The invention discloses a method for recognizing Chinese multi-class words (words that have more than one part of speech) based on a conditional random field. The method comprises: obtaining entries related to a multi-class word and deriving a corpus from those entries; segmenting the corpus into chunks and, at the same time, generating a chunk feature for each character; tagging each character with its part of speech and labeling the character with both the chunk feature and the part-of-speech feature; randomly selecting part of the corpus for training and testing on the remainder to obtain a first experimental result; modifying the feature template according to the characteristics of the corpus, then retraining and retesting to obtain a second experimental result; and comparing the first and second experimental results on the evaluation metrics to improve the recognition of multi-class words. The invention applies a conditional random field to recognizing Chinese multi-class words in the e-commerce domain; after the original conditional random field feature template is modified, the precision, recall, and F value of multi-class word recognition all improve.

Description

A Method for Recognizing Chinese Multi-Class Words Based on a Conditional Random Field

Technical Field

The invention belongs to the field of text recognition for e-commerce products, and in particular relates to a method for recognizing Chinese multi-class words in the e-commerce domain based on a conditional random field.

Background Art

With the development of technology, ambiguous words have proliferated. An ambiguous word is a word or phrase that has two or more meanings; ambiguity arises from unclear word sense, flexible syntax, unclear structural hierarchy, unclear reference, and similar causes. As a result, in many contexts the same word or phrase is understood differently by different machines or people. The accuracy and efficiency of ambiguous-word recognition therefore affect the results of text processing. Ambiguous words fall roughly into polyphonic words (多音词), homophones (同音词), polysemous words (多义词), multi-class words (兼类词), and words used in opposite senses (反训词). Earlier recognition research was limited to conventional Chinese word segmentation and did not study specific domains. The present invention addresses only multi-class words, that is, words or phrases that have two or more parts of speech, as they appear in the e-commerce domain. It trains and tests on a corpus with both the original conditional random field feature template and a modified feature template, with the aim of improving the template's performance in recognizing multi-class words in the e-commerce domain.

Methods for recognizing Chinese text fall into four main categories:

1. Rule-based methods.

1a: String matching. The words or phrases to be recognized are matched against a dictionary (that is, a training set of sufficient size). By matching direction, the method is divided into forward matching, reverse matching, and bidirectional matching; by matching priority, it is divided into maximum matching and minimum matching.
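As an illustration of the string matching method described in 1a, the following is a minimal sketch of forward maximum matching. It is not taken from the patent: the function name `forward_max_match`, the `max_len` limit, and the toy dictionary are assumptions chosen for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical), using words from the tomato example.
toy_dictionary = {"新鲜", "有机", "番茄"}
print(forward_max_match("新鲜有机番茄", toy_dictionary))
# ['新鲜', '有机', '番茄']
```

Reverse matching would scan from the end of the string instead, and minimum matching would try the shortest candidate first.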

1b: Shortest-path algorithms. These use graph algorithms such as Dijkstra's algorithm, Floyd's algorithm, the k-shortest-path algorithm, and the n-shortest-path algorithm, together with their variants.

These two methods are only a small subset of rule-based methods. Every rule-based method performs recognition according to its own set of rules, so it depends on whether those rules are complete and reasonable. Such methods are relatively subjective, do not apply to every corpus, and handle ambiguous words poorly, with low accuracy.

2. Understanding-based methods. These analyze syntax and semantics together, simulating how a person understands a word or phrase, in order to recognize it. Because Chinese words, phrases, and syntax are complex, this approach requires a large amount of data, information, and knowledge.

3. Transformation-based methods. These start from a corpus that is already annotated with parts of speech, identify the best-fitting part of speech for each word or phrase in that corpus, use the result as a training set, and then learn from the existing rules to derive new rules (that is, variants of the original rules).

4. Statistical methods. These compute probabilities for each word and part of speech from the surrounding context and feature information, and select the most probable state transitions to determine the words and their parts of speech. The three most representative models are the hidden Markov model, the maximum entropy Markov model, and the conditional random field. The drawback of the hidden Markov model is that each observation depends only on its own state, so every observed element is treated as independent. In real text, a word is related not only to its immediate neighbors but also to more distant words, so the hidden Markov model achieves only a local optimum. The maximum entropy Markov model does take into account features relating the current word to more distant words, but because states have different numbers of outgoing branches and the transition probabilities are normalized per state, the probability distribution is unbalanced and decoding tends to stay in certain states; this is the label bias problem. Unlike the hidden Markov model and the maximum entropy Markov model, whose state transitions form a directed graph, the conditional random field is an undirected graphical model. It avoids the label bias problem of the maximum entropy Markov model, takes into account features relating the current word to more distant words, relaxes the strong independence assumptions of the hidden Markov model, and achieves global optimization by normalizing over the entire label sequence.
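For reference, the global normalization of a linear-chain conditional random field can be written out as follows. This is the standard textbook form, not a formula taken from the patent; the symbols $x$ (the observed character sequence of length $T$), $y$ (the label sequence), $f_k$ (feature functions), and $\lambda_k$ (their weights) are notation introduced here for illustration.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because $Z(x)$ sums over every candidate label sequence $y'$, the probabilities are normalized once per sequence rather than once per state, which is why the label bias caused by per-state normalization does not arise.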

Summary of the Invention

The present invention proposes a method for recognizing Chinese multi-class words based on a conditional random field, comprising the following steps:

Step 1: Search for a Chinese multi-class word in the e-commerce domain, obtain entries related to the multi-class word, and obtain from the entries a corpus with e-commerce domain characteristics;

Step 2: Segment the corpus into chunks and, at the same time, generate the chunk feature of each character in the chunks;

Step 3: Tag each character with its part of speech to obtain its part-of-speech feature, and label the character with the chunk feature and the part-of-speech feature;

Step 4: Randomly select part of the corpus for training in the conditional random field, and test the rest of the corpus in the conditional random field, to obtain a first experimental result;

Step 5: Modify the feature template in the conditional random field according to the characteristics of the corpus, then continue training and testing on the corpus in the conditional random field, to obtain a second experimental result;

Step 6: Compare the first experimental result and the second experimental result on the evaluation metrics, so as to improve the recognition of multi-class words.

In the method of the present invention, step 1 comprises the following steps:

Step 1a: In the e-commerce domain, search by the noun form of the multi-class word to obtain entries related to the noun form; entries that match the product name are added to the corpus directly, and entries that do not match are corrected to the corresponding product name and then added to the corpus;

Step 1b: Search by the adjective form of the multi-class word to obtain entries related to the adjective form; entries that match the product name are added to the corpus directly, and entries that do not match are corrected to the corresponding product name and then added to the corpus.

In the method of the present invention, in step 2, each entry is segmented, according to the content of products in the e-commerce domain, into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a product-name chunk, and a net-content chunk.

In the method of the present invention, in step 2, if a chunk contains two or more characters, the chunk feature of the first character is "initial" and the chunk feature of each remaining character is "following"; if a chunk contains a single character, the chunk feature of that character is "independent chunk".

In the method of the present invention, in step 3, the part-of-speech features include noun, verb, and adjective.

In the method of the present invention, step 4 comprises the following steps:

Step 4a: From the corpus, randomly select entries containing the adjective form or the noun form of one multi-class word and assign them to the training set of the conditional random field for training; assign the remaining entries containing the adjective form or the noun form of that multi-class word to the test set of the conditional random field for testing;

Step 4b: After training and testing are complete, repeat step 4a with another randomly selected portion of the corpus, until training and testing have been completed for the entire corpus.

In the method of the present invention, step 5 comprises the following steps:

Step 5a: Change the combination features associated with parts of speech in the feature template of the conditional random field;

Step 5b: Return to step 4, retrain on the training set of each multi-class word, and test on the test set of each multi-class word, to obtain the second experimental result.

In the method of the present invention, step 6 comprises the following steps:

Step 6a: Use the CoNLL-2000 evaluation script, written in Perl, to compare the first experimental result and the second experimental result on three metrics: precision, recall, and F value;

Step 6b: If the second experimental result is worse than the first experimental result, return to step 5, modify the feature template again, and obtain a new second experimental result, repeating until the second experimental result is better than the first.

In the above summary, the characteristics of the corpus include part of speech, semantics, and the relationships between words. Part-of-speech features include noun, verb, adjective, and so on.

The beneficial effect of the present invention is that the modified feature template is better suited than the general-purpose conditional random field feature template to recognizing multi-class words in the e-commerce domain.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method of the present invention for recognizing Chinese multi-class words based on a conditional random field.

Fig. 2 is a detailed flowchart of step 1.

Fig. 3 is a detailed flowchart of step 2.

Fig. 4 is a detailed flowchart of step 3.

Fig. 5 is a detailed flowchart of step 4.

Fig. 6 is a detailed flowchart of step 5.

Fig. 7 is a detailed flowchart of step 6.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for what is specifically mentioned below, the processes, conditions, and experimental methods used to implement the invention are common general knowledge in the field, and the invention places no special limitation on them.

As shown in Fig. 1, the present invention comprises the following steps:

Step 5: Modify the feature template in the conditional random field according to characteristics of the corpus such as part of speech, semantics, and the relationships between words, then continue training and testing on the corpus in the conditional random field, to obtain a second experimental result;

Step 6: Compare the experimental results on the evaluation metrics, so as to improve the recognition of multi-class words.

Each of the above steps is explained in detail below with reference to specific embodiments, to illustrate the technical solution of the invention.

As shown in Fig. 2, step 1 is carried out as follows:

Step 1a: Open the home page of Yihaodian (一号店) or Taobao and search in the product search box by the noun form of the multi-class word to obtain entries related to the noun form. Entries that match the product name shown in the product packaging image are added to the corpus directly; entries that do not match are corrected to the corresponding product name and then added to the corpus. For example, some entries add redundant attributive modifiers that do not appear in the product name, such as 新鲜有机无公害蔬菜露天自然熟不催红正宗番茄 ("fresh organic pollution-free vegetable, open-air naturally ripened, not artificially reddened, authentic tomato"), whereas the product detail page shows that the packaging reads only 新鲜有机番茄 ("fresh organic tomato").

Step 1b: Then enter the adjective form of the same multi-class word and collect all product entries returned for the adjective form. Entries that match the product name shown in the product packaging image are used directly as experimental corpus; entries that do not match are corrected to the product name shown on the packaging and then also used as experimental corpus.

Once the corpus with the required e-commerce domain characteristics has been obtained, it is segmented. Fig. 3 shows the implementation of step 2, which comprises the following steps:

Step 2a: Segment each collected product entry into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a product-name chunk, and a net-content chunk. For example, 北田台湾进口糙米果卷牛奶味儿童饼干150G is segmented by convention as: 北田/manufacturer, 台湾进口/place of origin, 糙米果卷/brand, 牛奶味儿童饼干/product name, 150G/net content.

Step 2b: Divide each chunk into individual characters. The first character of a chunk has the chunk feature "initial" and is marked B; every subsequent character has the chunk feature "following" and is marked I; if a chunk contains only one character, that character is an independent chunk and is marked O. For example, in 我们的太阳 ("our sun"), 我们 is one chunk, so 我 is marked B and 们 is marked I; 的 is a chunk on its own and is marked O; 太阳 is one chunk, so 太 is marked B and 阳 is marked I.
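The B/I/O marking of step 2b can be sketched in a few lines. This is a minimal illustration that assumes the chunks have already been segmented as in step 2a; the function name `chunk_features` is hypothetical and does not come from the patent.

```python
def chunk_features(chunks):
    """Mark each character of pre-segmented chunks with B, I, or O.

    B: first character of a multi-character chunk
    I: any later character of that chunk
    O: the only character of a single-character chunk
    """
    tagged = []
    for chunk in chunks:
        if len(chunk) == 1:
            tagged.append((chunk, "O"))
            continue
        for position, char in enumerate(chunk):
            tagged.append((char, "B" if position == 0 else "I"))
    return tagged


# The example from step 2b: 我们 / 的 / 太阳
print(chunk_features(["我们", "的", "太阳"]))
# [('我', 'B'), ('们', 'I'), ('的', 'O'), ('太', 'B'), ('阳', 'I')]
```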

After the corpus has been segmented, it is tagged with parts of speech. Fig. 4 shows step 3, as follows:

Step 3a: Tag every segmented unit with its part of speech: nouns are tagged as nouns, adjectives as adjectives, and so on. The tag set is as follows:

Verb (动词): V
Adjective (形容词): A
Shape (形状): AD
Conjunction (连接词): H
Symbol (符号): W
Taste (味道): Z
Quantity (数量): Q
Era (年代): T
Season (时令): TG
Place name (地名): NS
Specification (规格): NG
Quality (质量): NQ
Amount of money (金额): NM
Color (颜色): NY
Brand (品牌): N
Brand + category (品牌+品类): NL
Brand + category + product (品牌+品类+商品): NLC
Brand + product (品牌+商品): LC
Brand + merchant (品牌+商家): LJ
Brand + merchant + category (品牌+商家+品类): NJL
Brand + merchant + category + product (品牌+商家+品类+商品): NJLC
Brand + merchant + product (品牌+商家+商品): NJS
Brand + color (品牌+颜色): NNY
Category (品类): NP
Category + product (品类+商品): PC
Category + season (品类+时令): PT
Product (商品): NPC
Product + season (商品+时令): CT
Merchant (商家): NJ
Merchant + category (商家+品类): JL
Merchant + category + product (商家+品类+商品): JLC
Merchant + product (商家+商品): JC
Measure words (包, 包/箱, 包/组, 包/组袋, 包组, 提, 支, 条, 杯, 杯/箱, 桶, 片, 瓶, 盒, 碗, 筒, 箱, 组, 罐, 罐/组, 罐组, 袋, 袋/组, 袋包): M

For example, the characters of the entry 今麦郎骨汤弹面浓汤海鲜杯面78克 are tagged as follows:

今 N; 麦 N; 郎 N; 骨 NP; 汤 NP; 弹 NPC; 面 NPC; 浓 Z; 汤 Z; 海 Z; 鲜 Z; 杯 NPC; 面 NPC; 7 M; 8 M; 克 M.

Step 3b: For each character, combine its two attributes: its chunk feature (initial, following, or independent chunk) and its part-of-speech tag.

For the same example:

今 N B-N; 麦 N I-N; 郎 N I-N; 骨 NP B-NP; 汤 NP I-NP; 弹 NPC B-NPC; 面 NPC I-NPC; 浓 Z B-Z; 汤 Z I-Z; 海 Z I-Z; 鲜 Z I-Z; 杯 NPC B-NPC; 面 NPC I-NPC; 7 M B-M; 8 M I-M; 克 M I-M.
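The combination in step 3b can be sketched as a function that turns tagged chunks into one row per character, with the character, its part-of-speech tag, and the combined label as columns. This is an illustration under stated assumptions: the function name `to_crf_rows` is hypothetical, tab-separated columns are a choice made here, and the label for a single-character chunk is written as plain `O` because the text does not show how that case is combined with the part of speech.

```python
def to_crf_rows(tagged_chunks):
    """Turn (chunk_text, pos_tag) pairs into 'char<TAB>pos<TAB>label' rows."""
    rows = []
    for text, pos in tagged_chunks:
        for position, char in enumerate(text):
            if len(text) == 1:
                label = "O"  # assumption: independent chunk keeps a bare O
            else:
                label = ("B-" if position == 0 else "I-") + pos
            rows.append(f"{char}\t{pos}\t{label}")
    return rows


# Chunk boundaries read off the B- labels in the example above.
entry = [("今麦郎", "N"), ("骨汤", "NP"), ("弹面", "NPC"),
         ("浓汤海鲜", "Z"), ("杯面", "NPC"), ("78克", "M")]
print("\n".join(to_crf_rows(entry)))
```

Writing these rows to a file, with an empty line between entries, gives a column layout in which the last column is the label to be predicted.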

After the corpus has been tagged, it is used for training and testing. Fig. 5 shows step 4, as follows:

Step 4a: From the corpus that is already in the correct labeled format for the conditional random field, randomly select entries containing the adjective form and the noun form of a given multi-class word and assign them to the training set for training; assign the remaining entries containing the adjective form and the noun form of that multi-class word to the test set for testing;

Step 4b: Repeat the procedure of step 4a on the collected e-commerce corpus for each remaining multi-class word, with the same random training and testing, to obtain the first experimental result. In the first experimental result, precision was 66.47%, recall was 63.13%, and the F value was 65.36%.
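The random split of step 4a can be sketched as follows. The text does not state what proportion of entries goes to training, so the `train_fraction` of 0.8 is purely an assumption, as are the function name `split_by_form` and the dictionary layout keyed by word form.

```python
import random


def split_by_form(entries_by_form, train_fraction=0.8, seed=0):
    """Randomly split the entries of one multi-class word into train and test.

    entries_by_form maps a form name ("noun", "adjective") to its entries.
    Each form is split separately so both sets contain both forms.
    """
    rng = random.Random(seed)
    train, test = [], []
    for entries in entries_by_form.values():
        shuffled = list(entries)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)  # assumed ratio, see above
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Placeholder entries standing in for collected product titles.
entries = {
    "noun": [f"noun-entry-{i}" for i in range(10)],
    "adjective": [f"adjective-entry-{i}" for i in range(5)],
}
train_set, test_set = split_by_form(entries)
print(len(train_set), len(test_set))  # 12 3
```

Step 4b then amounts to calling the same function once per remaining multi-class word.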

Training is performed as follows: open a command prompt, change to the directory containing the training set, and enter the training command: ../crf_learn -[optional parameters] template train.data model. Testing is performed as follows: open a command prompt, change to the directory containing the test set, and enter the test command: ../crf_test -[optional parameters] -m model test.data.

After training and testing on the corpus, the feature template is modified. Fig. 6 shows step 5, in which the feature template is modified according to the corpus characteristics and training and testing continue, as follows:

Step 5a: Starting from the original conditional random field feature template, add, remove, or change combination features associated with parts of speech. The original feature template of the conditional random field is as follows:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,1]
U08:%x[-1,1]
U09:%x[0,1]
U10:%x[1,1]
U11:%x[2,1]
U12:%x[-2,1]/%x[-1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[1,1]/%x[2,1]
U16:%x[-2,1]/%x[-1,1]/%x[0,1]
U17:%x[-1,1]/%x[0,1]/%x[1,1]
U18:%x[0,1]/%x[1,1]/%x[2,1]
# Bigram
B
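To make the template notation concrete, the sketch below expands the `%x[row,col]` macros of one template line at a given position, where `row` is an offset relative to the current character and `col` selects a column of the data. It is an illustration only, not the tool's own implementation: the function name `expand` is hypothetical, the column order (0 for the character, 1 for the part-of-speech tag) follows the example in step 3b, and `_PAD` is a placeholder for positions outside the entry rather than the boundary marker the tool itself emits.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, rows, position, pad="_PAD"):
    """Expand every %x[row,col] macro of a template line at one position."""
    def lookup(match):
        row = position + int(match.group(1))  # offset relative to current row
        col = int(match.group(2))             # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad                            # placeholder outside the entry
    return MACRO.sub(lookup, template_line)


# (character, part-of-speech tag) rows from the example entry.
rows = [("今", "N"), ("麦", "N"), ("郎", "N"), ("骨", "NP"), ("汤", "NP")]
print(expand("U05:%x[-1,0]/%x[0,0]", rows, 3))   # U05:郎/骨
print(expand("U12:%x[-2,1]/%x[-1,1]", rows, 1))  # U12:_PAD/N
```

Read this way, U00 to U06 look at characters in a window of two positions on either side, and U07 to U18 look at part-of-speech tags and their combinations in the same window.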

After repeated experiments and repeated modification of the features, the feature template that finally performed better than the original template on the experimental corpus is as follows:

# Unigram
U00:%x[-3,0]
U01:%x[-2,0]
U02:%x[-1,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
U06:%x[3,0]
U07:%x[-2,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
U10:%x[-2,0]/%x[-1,0]
U11:%x[1,0]/%x[2,0]
U12:%x[0,0]/%x[2,0]

Step 5b: Retrain on the training set of each multi-class word and test on the test set of each multi-class word in the manner described in step 4, to obtain the second experimental result. In the second experimental result, precision was 77.86%, recall was 73.17%, and the F value was 74.59%.

After the conditional random field feature template has been modified, the results are compared. Fig. 7 shows step 6, the comparison of the experimental results on the evaluation metrics, as follows:

Step 6a: Install a Perl interpreter and use the CoNLL-2000 evaluation script, written in Perl, to compare the results obtained with the two templates on the three metrics; once the test output is passed to the script, the three metrics are generated automatically;
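The three metrics can be illustrated with a simplified chunk-level computation. This sketch is not the Perl evaluation script itself and makes simplifying assumptions: a chunk counts as correct only if its boundaries and type both match, a chunk must start with a `B-` label, and the F value is taken as the harmonic mean of precision and recall. The function names are hypothetical.

```python
def extract_chunks(labels):
    """Return the set of (start, end, type) spans in a B-X / I-X / O sequence."""
    chunks, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes last chunk
        continues = label.startswith("I-") and kind == label[2:]
        if start is not None and not continues:
            chunks.add((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return chunks


def precision_recall_f(gold, predicted):
    """Chunk-level precision, recall, and F value for one label sequence."""
    gold_chunks = extract_chunks(gold)
    predicted_chunks = extract_chunks(predicted)
    correct = len(gold_chunks & predicted_chunks)
    precision = correct / len(predicted_chunks) if predicted_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


gold = ["B-N", "I-N", "I-N", "B-NP", "I-NP"]
predicted = ["B-N", "I-N", "I-N", "B-NPC", "I-NPC"]
print(precision_recall_f(gold, predicted))  # (0.5, 0.5, 0.5)
```

In the example, the first chunk is recognized correctly and the second has the right boundaries but the wrong type, so one of two chunks is correct on both the gold and the predicted side.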

Step 6b: If the modified template is better than the original template on all three metrics, the entries used in the first experiment are removed from the corpus. From the remaining corpus, part is again randomly selected as the training set and part as the test set, and the metrics obtained with the two templates are compared again. If the modified template is again better than the original, the entries used in the second experiment are also removed, and the same training, testing, and comparison continue on the remaining corpus. If the new template outperforms the original template five times in a row (about 10,000 entries in total, that is, about 2,000 entries per round across the training and test sets), the new template is considered better suited to recognizing multi-class words in the e-commerce domain. If in any round some metric is not better than with the original template, the template is modified further and the experiment is repeated.
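The acceptance procedure of step 6b can be sketched as a loop over disjoint batches. This is only an outline under stated assumptions: `evaluate` is a hypothetical callable, supplied by the caller, that trains and tests with the named template and returns (precision, recall, F value); the even split of each batch into training and test halves is an assumption, since the text gives only the approximate total of 2,000 entries per round.

```python
import random


def validate_template(corpus, evaluate, rounds=5, batch_size=2000, seed=0):
    """Return True if the modified template beats the original on all three
    metrics in every round, drawing a fresh batch of entries each round."""
    rng = random.Random(seed)
    remaining = list(corpus)
    for _ in range(rounds):
        if len(remaining) < batch_size:
            raise ValueError("not enough entries left for another round")
        rng.shuffle(remaining)
        # Entries used in this round are removed from the remaining corpus.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        half = len(batch) // 2  # assumed train/test proportion
        train, test = batch[:half], batch[half:]
        original = evaluate("original", train, test)
        modified = evaluate("modified", train, test)
        if not all(new > old for new, old in zip(modified, original)):
            return False  # some metric did not improve: revise the template
    return True


def stub_evaluate(template_name, train, test):
    """Stand-in for real training and testing; returns fixed scores."""
    return (0.8, 0.8, 0.8) if template_name == "modified" else (0.7, 0.7, 0.7)


print(validate_template(range(10000), stub_evaluate))  # True
```

A return value of False corresponds to going back to step 5 and modifying the template again.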

The scope of protection of the present invention is not limited to the above embodiments. Variations and advantages that a person skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.

Claims

1. A method for recognizing Chinese multi-class words based on a conditional random field, characterized in that it comprises the following steps:

Step 1: Search for a Chinese multi-class word in the e-commerce domain, obtain entries related to the multi-class word, and obtain from the entries a corpus with e-commerce domain characteristics;

Step 2: Segment the corpus into chunks and, at the same time, generate the chunk feature of each character in the chunks;

Step 3: Tag each character with its part of speech to obtain its part-of-speech feature, and label the character with the chunk feature and the part-of-speech feature;

Step 4: Randomly select part of the corpus for training in the conditional random field, and test the rest of the corpus in the conditional random field, to obtain a first experimental result;

Step 5: Modify the feature template in the conditional random field according to the characteristics of the corpus, then continue training and testing on the corpus in the conditional random field, to obtain a second experimental result;

Step 6: Compare the first experimental result and the second experimental result on the evaluation metrics, so as to improve the recognition of multi-class words.

2. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that step 1 comprises the following steps:

Step 1a: In the e-commerce domain, search by the noun form of the multi-class word to obtain entries related to the noun form; entries that match the product name are added to the corpus directly, and entries that do not match are corrected to the corresponding product name and then added to the corpus;

Step 1b: Search by the adjective form of the multi-class word to obtain entries related to the adjective form; entries that match the product name are added to the corpus directly, and entries that do not match are corrected to the corresponding product name and then added to the corpus.

3. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that, in step 2, each entry is segmented, according to the content of products in the e-commerce domain, into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a product-name chunk, and a net-content chunk.

4. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that, in step 2, if a chunk contains two or more characters, the chunk feature of the first character is "initial" and the chunk feature of each remaining character is "following"; if a chunk contains a single character, the chunk feature of that character is "independent chunk".

5. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that, in step 3, the part-of-speech features include noun, verb, and adjective.

6. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that step 4 comprises the following steps:

Step 4a: From the corpus, randomly select entries containing the adjective form or the noun form of one multi-class word and assign them to the training set of the conditional random field for training; assign the remaining entries containing the adjective form or the noun form of that multi-class word to the test set of the conditional random field for testing;

Step 4b: After training and testing are complete, repeat step 4a with another randomly selected portion of the corpus, until training and testing have been completed for the entire corpus.

7. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that step 5 comprises the following steps:

Step 5a: Change the combination features associated with parts of speech in the feature template of the conditional random field;

Step 5b: Return to step 4, retrain on the training set of each multi-class word, and test on the test set of each multi-class word, to obtain the second experimental result.

8. The method for recognizing Chinese multi-class words based on a conditional random field as claimed in claim 1, characterized in that step 6 comprises the following steps:

Step 6a: Use the CoNLL-2000 evaluation script, written in Perl, to compare the first experimental result and the second experimental result on three metrics: precision, recall, and F value;

Step 6b: If the second experimental result is worse than the first experimental result, return to step 5, modify the feature template again, and obtain a new second experimental result, repeating until the second experimental result is better than the first.