
CN106951437A - Recognition processing method and device suitable for multiple Chinese sensitive words and sentences - Google Patents

Info

Publication number
CN106951437A
Authority
CN
China
Prior art keywords
sentences
sensitive words
matching
character
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710072161.5A
Other languages
Chinese (zh)
Other versions
CN106951437B (en)
Inventor
喻民
刘超
卢越
李敏
姜建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201710072161.5A
Publication of CN106951437A
Application granted
Publication of CN106951437B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a recognition processing method and device suitable for multiple Chinese sensitive words and sentences. The method includes: obtaining a plurality of preset sensitive words and sentences; building a suffix tree from the sensitive words and sentences; obtaining the Chinese text to be recognized; matching the Chinese text to be recognized against the suffix tree; and, if the matching succeeds, obtaining the sensitive words and sentences in the Chinese text to be recognized and outputting them for display. The method exploits the characteristics of Chinese to improve the matching time of pattern strings on the suffix tree from [formula missing] to [formula missing], which saves time and speeds up the matching of pattern strings on the suffix tree, and it is suitable for Chinese pattern string matching with multiple sensitive words and sentences.

Description

Recognition processing method and device suitable for multiple Chinese sensitive words and sentences

Technical field

The invention relates to the technical field of computer processing, and in particular to a recognition processing method and device suitable for multiple Chinese sensitive words and sentences.

Background

Identifying sensitive words and sentences means using a program to scan information text for specified keywords and to check whether the text violates a specified policy; it is the basis of sensitive word filtering. Finding sensitive words quickly and accurately requires a pattern matching algorithm.

Pattern matching algorithms for pattern strings include the Aho-Corasick (AC) algorithm, the BM algorithm, and the ACBM algorithm. The AC algorithm preprocesses multiple pattern strings into a tree-shaped finite state automaton (DFSA) and completes the matching of all pattern strings in a single scan of the text string, with a matching time complexity of O(n+m). The BM algorithm has a time complexity of [formula missing], but it cannot handle multi-pattern string matching. The ACBM algorithm combines the ideas of the AC and BM algorithms; on average it is more efficient than the AC algorithm, with a time complexity of [formula missing]. Although the ACBM algorithm performs well in practice, it works poorly on Chinese and does not make full use of the characteristics of the pattern strings and of Chinese text, so matching is slow.

The cause of the inefficiency is that the basic structural unit of English is the "word", whereas the basic structural unit of Chinese is the "character", and the two differ greatly for sensitive word detection. For English, detection matches sequentially over 26 letters; for Chinese, it matches sequentially over tens of thousands of characters. When the alphabet of a string matching algorithm grows from 26 letters to tens of thousands of Chinese characters, the algorithm no longer achieves its expected performance in either time or space. In addition, Chinese characters are multi-byte symbols and have attributes that English letters lack, such as pinyin, and existing algorithms do not make full use of these attributes.
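As a small illustration of the two properties mentioned above (multi-byte symbols and a very large alphabet), the following Python sketch compares an English word with a Chinese word. It assumes UTF-8 encoding, which the text does not specify, and the sample strings are chosen only for illustration.

```python
# Illustration only: UTF-8 is an assumed encoding; the source does not name one.
english = "hate"
chinese = "讨厌"

# Both strings are sequences of code points: 4 letters and 2 characters.
chars_english = len(english)
chars_chinese = len(chinese)

# In UTF-8 each ASCII letter takes one byte, while each of these
# Chinese characters takes three bytes.
utf8_len_english = len(english.encode("utf-8"))
utf8_len_chinese = len(chinese.encode("utf-8"))

# The CJK Unified Ideographs block U+4E00..U+9FFF alone spans 20,992 code
# points, compared with 26 English letters.
cjk_block_size = 0x9FFF - 0x4E00 + 1

print(chars_english, utf8_len_english)   # letters vs bytes for the English word
print(chars_chinese, utf8_len_chinese)   # characters vs bytes for the Chinese word
print(cjk_block_size)
```

A matcher that branches on every possible next symbol therefore faces a far larger fan-out for Chinese text than for English text, which is the motivation given above for exploiting pinyin.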

Summary of the invention

The invention provides a recognition processing method and device suitable for multiple Chinese sensitive words and sentences, to solve the problem in the prior art that matching Chinese sensitive words and sentences is slow.

In a first aspect, the present invention provides a recognition processing method suitable for multiple Chinese sensitive words and sentences, including:

obtaining a plurality of preset sensitive words and sentences;

building a suffix tree from the sensitive words and sentences;

obtaining the Chinese text to be recognized;

matching the Chinese text to be recognized against the suffix tree;

if the matching succeeds, obtaining the sensitive words and sentences in the Chinese text to be recognized and outputting them for display.

Optionally, building the suffix tree from the sensitive words and sentences includes:

S21. Build a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;

S22. Set a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S23. Select any sensitive word or sentence Pi in the pattern string set, the string length of Pi being m;

S24. Take the m-th character of Pi, parse it to obtain the initial letter of its pinyin, and obtain the arrangement value of the initial letter from the preset correspondence between pinyin letters and arrangement values;

S25. Determine whether the arrangement value of the initial letter is smaller than the first preset value; if so, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;

S25. Take the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeat steps S24-S25, placing the nodes for the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters respectively.

Optionally, matching the Chinese text to be recognized against the suffix tree includes: matching the Chinese text to be recognized using the BM algorithm on the suffix tree.

Optionally, the sensitive words and sentences include single characters, phrases and sentences.

Optionally, the method further includes: issuing a prompt message if the matching does not succeed.

In a second aspect, the present invention provides a recognition processing device suitable for multiple Chinese sensitive words and sentences, including:

a first obtaining module, configured to obtain a plurality of preset sensitive words and sentences;

a processing module, configured to build a suffix tree from the sensitive words and sentences;

a second obtaining module, configured to obtain the Chinese text to be recognized;

a matching module, configured to match the Chinese text to be recognized against the suffix tree;

a display module, configured to obtain the sensitive words and sentences in the Chinese text to be recognized and output them for display after the matching succeeds.

Optionally, the processing module is specifically configured to:

S21. Build a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;

S22. Set a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S23. Select any sensitive word or sentence Pi in the pattern string set, the string length of Pi being m;

S24. Take the m-th character of Pi, parse it to obtain the initial letter of its pinyin, and obtain the arrangement value of the initial letter from the preset correspondence between pinyin letters and arrangement values;

S25. Determine whether the arrangement value of the initial letter is smaller than the first preset value; if so, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;

S25. Take the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeat steps S24-S25, placing the nodes for the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters respectively.

Optionally, the matching module is specifically configured to: match the Chinese text to be recognized using the BM algorithm on the suffix tree.

Optionally, the sensitive words and sentences include single characters, phrases and sentences.

Optionally, the display module is further configured to: issue a prompt message if the matching does not succeed.

As can be seen from the above technical solution, the recognition processing method and device for multiple Chinese sensitive words and sentences of the present invention parse the plurality of preset sensitive words and sentences and build a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be recognized is obtained, it is matched against the suffix tree, branching at each step according to the arrangement value of the character's pinyin letter; when the matching succeeds, the sensitive words and sentences in the Chinese text to be recognized are obtained and output for display. By exploiting the characteristics of Chinese, the invention improves the matching time of pattern strings on the suffix tree from [formula missing] to [formula missing], which saves time and speeds up the matching of pattern strings on the suffix tree, and it is suitable for Chinese pattern string matching with multiple sensitive words and sentences.

Description of drawings

FIG. 1 is a schematic flow diagram of the recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention;

FIG. 2 is a block diagram of the suffix tree provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention.

Detailed description

Specific implementations of the present invention are described in further detail below with reference to the accompanying drawings and embodiments. The following embodiments illustrate the present invention but are not intended to limit its scope.

FIG. 1 shows a recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention, including:

S11. Obtain a plurality of preset sensitive words and sentences.

In this step, it should be noted that, in this embodiment, the sensitive words and sentences are preset in advance. They may generally include single characters, phrases and sentences: single characters such as "傻" (silly) and "笨" (stupid); phrases such as "混蛋" (bastard) and "暴力" (violence); sentences such as "我讨厌中国" (I hate China).

S12. Build a suffix tree from the sensitive words and sentences.

In this step, it should be noted that, in this embodiment, a suffix tree is built so that sensitive words and sentences can later be matched in the text, as follows:

S21. Build a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;

S22. Set a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S23. Select any sensitive word or sentence Pi in the pattern string set, the string length of Pi being m;

S24. Take the m-th character of Pi, parse it to obtain the initial letter of its pinyin, and obtain the arrangement value of the initial letter from the preset correspondence between pinyin letters and arrangement values;

S25. Determine whether the arrangement value of the initial letter is smaller than the first preset value; if so, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;

S25. Take the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeat steps S24-S25, placing the nodes for the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters respectively.
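The lookup in step S24 and the comparison in step S25 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the correspondence a=1, b=2, ..., z=26 is assumed (the text only says the correspondence is preset), `PINYIN_INITIAL` is a hypothetical hand-written table covering four characters, and the function names are illustrative. A real system would need a complete character-to-pinyin dictionary.

```python
import string

# Assumed correspondence between pinyin letters and arrangement values:
# a=1, b=2, ..., z=26. The source only states that it is preset.
ARRANGEMENT_VALUE = {
    letter: i for i, letter in enumerate(string.ascii_lowercase, start=1)
}

# Hypothetical table of pinyin initials for a few characters only.
PINYIN_INITIAL = {"笨": "b", "情": "q", "色": "s", "国": "g"}


def arrangement_value(char: str) -> int:
    """Step S24: arrangement value of the pinyin initial of one character."""
    return ARRANGEMENT_VALUE[PINYIN_INITIAL[char]]


def side_of(char: str, parent_value: int) -> str:
    """Step S25: 'left' if the value is smaller than the parent's, else 'right'."""
    return "left" if arrangement_value(char) < parent_value else "right"


# With a root attribute value of 13, "笨" (b) goes left and "情" (q) goes right.
print(side_of("笨", 13), side_of("情", 13))
```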

The steps above are illustrated with a concrete example:

As shown in Figure 2, suppose the pattern string set is P(P1, P2, P3, P4), where P1 is "笨" (stupid), P2 is "色情" (pornography), P3 is "你爱法国" (you love France), and P4 is "他讨厌法国" (he hates France).

Set a root node whose attribute value is 13.

Take the sensitive word P1, whose string length is 1. Parsing the character "笨" gives the pinyin initial "b", and the preset correspondence between pinyin letters and arrangement values gives the arrangement value 2. Since 2 is smaller than the root node's attribute value 13, the node for "笨" is placed on the left side of the root node as a child of the root node.

Take the sensitive word P2, whose string length is 2. Parsing the 2nd character "情" gives the pinyin initial "q", whose arrangement value is 17. Since 17 is greater than the root node's attribute value 13, the node for "情" is placed on the right side of the root node as a child of the root node. Parsing the 1st character "色" gives the pinyin initial "s", whose arrangement value is 19. Since 19 is greater than the attribute value 17 of the "情" node, the node for "色" is placed on the right side of the "情" node as a child of the "情" node.

Take the sensitive sentence P3, whose string length is 4. Parsing the 4th character "国" gives the pinyin initial "g", whose arrangement value is 7. Since 7 is smaller than the root node's attribute value 13, the node for "国" is placed on the left side of the root node as a child of the root node. The characters "法", "爱" and "你" are then processed in the same way in turn; the details are not repeated here, and the result is shown in Figure 2.

Take the sensitive sentence P4, whose string length is 5. Parsing the 5th character "国" gives the pinyin initial "g", whose arrangement value is 7. Since 7 is smaller than the root node's attribute value 13, the node for "国" is placed on the left side of the root node as a child of the root node. The characters "法", "厌", "讨" and "他" are then processed in the same way in turn; the details are not repeated here, and the result is shown in Figure 2.
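The worked example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `Node`, `build_tree` and `PINYIN_INITIAL` are hypothetical, the pinyin table is hand-written for the ten characters of the example, the correspondence a=1 ... z=26 is assumed, and patterns that share a trailing character are assumed to share the corresponding node. The placements of "法", "爱", "你", "厌", "讨" and "他" follow from applying the same rule and are not spelled out in the text, so they may differ from Figure 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional

# Hypothetical pinyin-initial table covering only the example characters.
PINYIN_INITIAL = {
    "笨": "b", "色": "s", "情": "q", "你": "n", "爱": "a",
    "法": "f", "国": "g", "他": "t", "讨": "t", "厌": "y",
}


def arrangement_value(char: str) -> int:
    # Assumed correspondence a=1 ... z=26 (b -> 2, q -> 17 as in the example).
    return ord(PINYIN_INITIAL[char]) - ord("a") + 1


@dataclass
class Node:
    value: int                                               # arrangement value
    left: Dict[str, "Node"] = field(default_factory=dict)    # children with smaller value
    right: Dict[str, "Node"] = field(default_factory=dict)   # children with equal or larger value
    pattern: Optional[str] = None                            # set where a whole pattern ends


def build_tree(patterns: Iterable[str], root_value: int = 13) -> Node:
    root = Node(root_value)
    for p in patterns:
        node = root
        for ch in reversed(p):            # m-th character first, then m-1, ..., 1
            v = arrangement_value(ch)
            side = node.left if v < node.value else node.right
            node = side.setdefault(ch, Node(v))
        node.pattern = p
    return root


root = build_tree(["笨", "色情", "你爱法国", "他讨厌法国"])
print(sorted(root.left), sorted(root.right))   # characters placed left and right of the root
```

Because each pattern is inserted from its last character backwards, "你爱法国" and "他讨厌法国" share the nodes for "国" and "法" in this sketch and diverge only at "爱" and "厌".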

S13. Obtain the Chinese text to be recognized.

In this step, it should be noted that, in this embodiment, the Chinese text to be recognized may be a published article, a comment message, or the like.

S14. Match the Chinese text to be recognized against the suffix tree.

In this step, it should be noted that when the length of a pattern string Pi is greater than the character length of the text, Pi cannot be found in that text. The character length of the Chinese text is therefore taken to be greater than that of the longest pattern string, i.e. len(T) > maxlen(Pi).

Matching the Chinese text to be recognized using the BM algorithm on the suffix tree may specifically include:

(1) Using the length minlen(Pi) of the shortest pattern string Pi, select position minlen(Pi) of the target string as the starting match position, and use the tree for BM matching.

(2) If a character comparison fails, apply the two heuristic rules, namely the bad character rule and the good suffix rule.

(3) If a character comparison succeeds, first compare the character to its left with the matched character, where the size of a character is the value assigned from its pinyin. If the character to the left is smaller than the matched character, search among the left child nodes; otherwise, search among the right child nodes.
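Steps (1) to (3) can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the procedure claimed in the patent: all names (`Node`, `build_tree`, `find_sensitive`, `PINYIN_INITIAL`) are hypothetical; the pinyin table is hand-written; characters missing from the table are given the value 0; and only a bad-character shift in the style of the Set Horspool algorithm is implemented, with the good suffix rule omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical pinyin-initial table covering only the example characters.
PINYIN_INITIAL = {
    "笨": "b", "色": "s", "情": "q", "你": "n", "爱": "a",
    "法": "f", "国": "g", "他": "t", "讨": "t", "厌": "y",
}


def arrangement_value(char: str) -> int:
    # Assumed: a=1 ... z=26; characters missing from the table get 0.
    initial = PINYIN_INITIAL.get(char)
    return 0 if initial is None else ord(initial) - ord("a") + 1


@dataclass
class Node:
    value: int
    left: Dict[str, "Node"] = field(default_factory=dict)
    right: Dict[str, "Node"] = field(default_factory=dict)
    pattern: Optional[str] = None


def build_tree(patterns: List[str], root_value: int = 13) -> Node:
    root = Node(root_value)
    for p in patterns:
        node = root
        for ch in reversed(p):
            v = arrangement_value(ch)
            side = node.left if v < node.value else node.right
            node = side.setdefault(ch, Node(v))
        node.pattern = p
    return root


def find_sensitive(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Return (start index, pattern) for every occurrence of a pattern in text."""
    root = build_tree(patterns)
    minlen = min(len(p) for p in patterns)
    # Bad-character table: distance from the rightmost occurrence of a character
    # (excluding the last position of each pattern) to the pattern end,
    # capped at minlen so that no occurrence is skipped.
    shift: Dict[str, int] = {}
    for p in patterns:
        for j, ch in enumerate(p[:-1]):
            shift[ch] = min(shift.get(ch, minlen), len(p) - 1 - j)
    hits: List[Tuple[int, str]] = []
    pos = minlen - 1                      # step (1): 0-based index of position minlen
    while pos < len(text):
        node: Optional[Node] = root
        i = pos
        while i >= 0:
            ch = text[i]
            v = arrangement_value(ch)
            side = node.left if v < node.value else node.right   # step (3)
            node = side.get(ch)
            if node is None:              # comparison failed
                break
            if node.pattern is not None:
                hits.append((i, node.pattern))
            i -= 1
        pos += shift.get(text[pos], minlen)   # step (2), bad character rule only
    return hits


print(find_sensitive("我说他讨厌法国和色情", ["色情", "你爱法国", "他讨厌法国"]))
```

Each alignment walks the tree from right to left through the text, choosing the left or right group of children by comparing arrangement values. In this sketch the children are stored in dictionaries, so the left/right split only mirrors the structure described above rather than providing the speed-up itself.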

S15. If the matching succeeds, obtain the sensitive words and sentences in the Chinese text to be recognized and output them for display.

In addition, if the matching does not succeed, a prompt message may be issued to tell the user that the Chinese text can be published.

The recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention parses the plurality of preset sensitive words and sentences and builds a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be recognized is obtained, it is matched against the suffix tree, branching at each step according to the arrangement value of the character's pinyin letter; when the matching succeeds, the sensitive words and sentences in the Chinese text to be recognized are obtained and output for display. By exploiting the characteristics of Chinese, the method improves the matching time of pattern strings on the suffix tree from [formula missing] to [formula missing], which saves time and speeds up the matching of pattern strings on the suffix tree, and it is suitable for Chinese pattern string matching with multiple sensitive words and sentences.

FIG. 3 shows a recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention, including a first obtaining module 21, a processing module 22, a second obtaining module 23, a matching module 24 and a display module 25, wherein:

the first obtaining module 21 is configured to obtain a plurality of preset sensitive words and sentences;

the processing module 22 is configured to build a suffix tree from the sensitive words and sentences;

the second obtaining module 23 is configured to obtain the Chinese text to be recognized;

the matching module 24 is configured to match the Chinese text to be recognized against the suffix tree;

the display module 25 is configured to obtain the sensitive words and sentences in the Chinese text to be recognized and output them for display after the matching succeeds.

The processing module is specifically configured to:

S21. Build a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;

S22. Set a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S23. Select any sensitive word or sentence Pi in the pattern string set, the string length of Pi being m;

S24. Take the m-th character of Pi, parse it to obtain the initial letter of its pinyin, and obtain the arrangement value of the initial letter from the preset correspondence between pinyin letters and arrangement values;

S25. Determine whether the arrangement value of the initial letter is smaller than the first preset value; if so, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;

S25. Take the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeat steps S24-S25, placing the nodes for the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters respectively.

Since the device of Embodiment 2 of the present invention works on the same principle as the method of the foregoing embodiment, a more detailed explanation is not repeated here.

It should be noted that, in the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.

The recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention parses the plurality of preset sensitive words and sentences and builds a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be recognized is obtained, it is matched against the suffix tree, branching at each step according to the arrangement value of the character's pinyin letter; when the matching succeeds, the sensitive words and sentences in the Chinese text to be recognized are obtained and output for display. By exploiting the characteristics of Chinese, the device improves the matching time of pattern strings on the suffix tree from [formula missing] to [formula missing], which saves time and speeds up the matching of pattern strings on the suffix tree, and it is suitable for Chinese pattern string matching with multiple sensitive words and sentences.

此外,本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如,在下面的权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the invention. and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

应该注意的是上述实施例对本发明进行说明而不是对本发明进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third, etc. does not indicate any order. These words can be interpreted as names.

Those of ordinary skill in the art can understand that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some or all of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims (10)

1. A recognition processing method suitable for multiple Chinese sensitive words and sentences, characterized by comprising:
obtaining a plurality of preset sensitive words and sentences;
building a suffix tree from the sensitive words and sentences;
obtaining the Chinese text to be recognized;
matching the Chinese text to be recognized against the suffix tree;
if the matching succeeds, obtaining the sensitive words and sentences in the Chinese text to be recognized and outputting them for display.

2. The method according to claim 1, characterized in that building the suffix tree from the sensitive words and sentences comprises:
S21. establishing a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;
S22. setting a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23. selecting any sensitive word or sentence Pi from the pattern string set, the string length of Pi being m;
S24. obtaining the m-th character of Pi, parsing the m-th character to obtain the initial letter of its pinyin, and obtaining the arrangement value of the initial letter from the initial letter and a preset correspondence between pinyin letters and arrangement values;
S25. judging whether the arrangement value of the initial letter is smaller than the first preset value; if so, placing the node corresponding to the m-th character on the left side of the root node; otherwise, placing it on the right side of the root node;
S26. obtaining the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeating steps S24-S25, so that the nodes corresponding to the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters are placed as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.

3. The method according to claim 1, characterized in that matching the Chinese text to be recognized against the suffix tree comprises: matching the Chinese text to be recognized against the suffix tree using the BM algorithm.

4. The method according to claim 1, characterized in that the sensitive words and sentences include single characters, phrases and sentences.

5. The method according to claim 1, characterized in that, if the matching does not succeed, a prompt message is issued.

6. A recognition processing device suitable for multiple Chinese sensitive words and sentences, characterized by comprising:
a first obtaining module, configured to obtain a plurality of preset sensitive words and sentences;
a processing module, configured to build a suffix tree from the sensitive words and sentences;
a second obtaining module, configured to obtain the Chinese text to be recognized;
a matching module, configured to match the Chinese text to be recognized against the suffix tree;
a display module, configured to obtain, after the matching succeeds, the sensitive words and sentences in the Chinese text to be recognized and output them for display.

7. The device according to claim 6, characterized in that the processing module is specifically configured for:
S21. establishing a pattern string set P(P1, P2, P3, P4, P5, ..., Pn) from the plurality of preset sensitive words and sentences;
S22. setting a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23. selecting any sensitive word or sentence Pi from the pattern string set, the string length of Pi being m;
S24. obtaining the m-th character of Pi, parsing the m-th character to obtain the initial letter of its pinyin, and obtaining the arrangement value of the initial letter from the initial letter and a preset correspondence between pinyin letters and arrangement values;
S25. judging whether the arrangement value of the initial letter is smaller than the first preset value; if so, placing the node corresponding to the m-th character on the left side of the root node; otherwise, placing it on the right side of the root node;
S26. obtaining the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters of Pi in turn and repeating steps S24-S25, so that the nodes corresponding to the (m-1)-th, (m-2)-th, ..., 2nd, 1st characters are placed as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.

8. The device according to claim 6, characterized in that the matching module is specifically configured to match the Chinese text to be recognized against the suffix tree using the BM algorithm.

9. The device according to claim 6, characterized in that the sensitive words and sentences include single characters, phrases and sentences.

10. The device according to claim 6, characterized in that the display module is further configured to issue a prompt message after the matching does not succeed.
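The tree-building steps recited in claims 2 and 7 can be sketched in Python as follows. This is one possible reading of the claims, not the patentee's code: the root holds a preset arrangement value, the node for a word's last character goes to the left or right of the root by comparison with that value, and the remaining characters are chained below it from back to front. The `PINYIN_INITIAL` table, the a=1 to z=26 numbering, the choice of "m" as the preset letter, and the names `Node`, `Root`, `insert`, and `build` are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical pinyin initials for a few characters (assumption: a real
# system would consult a complete pinyin dictionary).
PINYIN_INITIAL = {"敏": "m", "感": "g", "词": "c", "句": "j", "中": "z", "文": "w"}


def arrangement_value(ch):
    """Alphabet position (a=1 ... z=26) of the pinyin initial; 0 if unknown."""
    letter = PINYIN_INITIAL.get(ch)
    return ord(letter) - ord("a") + 1 if letter else 0


@dataclass
class Node:
    char: str
    children: dict = field(default_factory=dict)
    is_word_start: bool = False  # True where a whole pattern has been consumed


@dataclass
class Root:
    preset_value: int                          # the "first preset value"
    left: dict = field(default_factory=dict)   # last characters with value < preset
    right: dict = field(default_factory=dict)  # last characters with value >= preset


def insert(root, word):
    """Place the last character beside the root, then chain earlier characters."""
    last = word[-1]
    # Characters missing from the table get value 0 and therefore go left.
    side = root.left if arrangement_value(last) < root.preset_value else root.right
    node = side.setdefault(last, Node(last))
    for ch in reversed(word[:-1]):
        node = node.children.setdefault(ch, Node(ch))
    node.is_word_start = True


def build(words, preset_letter="m"):
    """Build the structure for a set of pattern strings."""
    root = Root(ord(preset_letter) - ord("a") + 1)
    for word in words:
        insert(root, word)
    return root
```

For `build(["敏感", "词句", "中文"])` with the preset letter "m" (value 13), the nodes for 感 (g, 7) and 句 (j, 10) land on the left of the root and the node for 文 (w, 23) on the right. Splitting at the root means a lookup can discard about half of the patterns after a single comparison, which appears to be the intent of the left and right placement.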
CN201710072161.5A 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese Active CN106951437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710072161.5A CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710072161.5A CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Publications (2)

Publication Number Publication Date
CN106951437A (en) 2017-07-14
CN106951437B (en) 2019-11-01

Family

ID=59465486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710072161.5A Active CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Country Status (1)

Country Link
CN (1) CN106951437B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514238A (en) * 2012-06-30 2014-01-15 重庆新媒农信科技有限公司 Sensitive word recognition processing method based on classification searching
US20150100304A1 (en) * 2013-10-07 2015-04-09 Xerox Corporation Incremental computation of repeats
CN105843950A (en) * 2016-04-12 2016-08-10 乐视控股(北京)有限公司 Sensitive word filtering method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LJSSPACE: "后缀树(Suffix Tree)的文本匹配算法", 《HTTPS://BLOG.CSDN.NET/LJSSPACE/ARTICLE/DETAILS/6571467》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062199A (en) * 2019-11-05 2020-04-24 北京中科微澜科技有限公司 Bad information identification method and device
CN111062199B (en) * 2019-11-05 2023-12-22 北京中科微澜科技有限公司 Bad information identification method and device
CN111159990A (en) * 2019-12-06 2020-05-15 国家计算机网络与信息安全管理中心 A method and system for general special word recognition based on pattern expansion
CN111159990B (en) * 2019-12-06 2022-09-30 国家计算机网络与信息安全管理中心 Method and system for identifying general special words based on pattern expansion
CN111831785A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Sensitive word detection method and device, computer equipment and storage medium
CN113157904A (en) * 2021-03-30 2021-07-23 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm
CN113157904B (en) * 2021-03-30 2024-02-09 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm

Also Published As

Publication number Publication date
CN106951437B (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN110598203B (en) A method and device for extracting entity information of military scenario documents combined with dictionaries
TWI729472B (en) Method, device and server for determining feature words
CN111522994B (en) Method and device for generating information
JP6434542B2 (en) Understanding tables for searching
CN106951437B (en) Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese
CN102930055B (en) A New Word Discovery Method Combining Internal Aggregation Degree and External Discrete Information Entropy
JP6114403B2 (en) Method and apparatus for providing input candidate item corresponding to input character string
JP7397903B2 (en) Intelligent interaction methods, devices, electronic devices and storage media
CN104252484B (en) A kind of phonetic error correction method and system
US8239349B2 (en) Extracting data
US20110153641A1 (en) System and method for regular expression matching with multi-strings and intervals
CN104572983B (en) Construction method, String searching method and the related device of hash table based on internal memory
CN101751386B (en) Identification method of unknown words
CN103605691A (en) Device and method used for processing issued contents in social network
CN110750984A (en) Command line string processing method, terminal, device and readable storage medium
JP6072922B2 (en) Character string search device, character string search method, and character string search program
US10733213B2 (en) Structuring unstructured machine-generated content
CN112883703B (en) Method, device, electronic equipment and storage medium for identifying associated text
CN106878289A (en) Regular Expression Matching Method and Device Based on Multidimensional Template Finite Automata TMFA
US11031092B2 (en) Taxonomic annotation of variable length metagenomic patterns
CN104008136A (en) Method and device for text searching
CN109710896A (en) Word attribute difference labeling method, device, storage medium and electronic equipment
CN114024701A (en) Domain name detection method, device and communication system
CN112287676A (en) New word discovery method, device, electronic equipment and medium
CN114048751B (en) Method, device, electronic device and storage medium for determining phonetic letter vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant