CN112527977B - Concept extraction method, concept extraction device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112527977B
Authority
CN
China
Prior art keywords
candidate
concept
list
candidate concept
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011241251.0A
Other languages
Chinese (zh)
Other versions
CN112527977A (en
Inventor
李涓子
王禹权
于济凡
陈凯源
孙凯
侯磊
张鹏
唐杰
许斌
孙茂松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202011241251.0A
Publication of CN112527977A
Application granted
Publication of CN112527977B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a concept extraction method, a device, an electronic device and a storage medium. The method comprises: performing term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list; and re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining a concept extraction result for the text to be extracted according to the re-ranking result, wherein the text to be extracted is unstructured text. Because the candidate concepts obtained by term extraction and entity linking are re-ranked and the concept extraction result is derived from the re-ranking result, concepts can be extracted from unstructured text more efficiently and accurately when annotated data is scarce or entirely absent.

Description

Concept extraction method, device, electronic device and storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to a concept extraction method, a device, an electronic device and a storage medium.

Background

A concept, also called a scientific concept, is a term or phrase used in a scientific corpus to denote a specific technology or an important knowledge point. For example, "binary tree" is an important concept in the computer field.

Traditional concept extraction methods fall into three main categories: key phrase and term extraction, entity linking, and concept/set expansion. Key phrase and term extraction generally obtains candidate phrases through methods such as word segmentation, ranks the candidate phrases by confidence, and selects the higher-scoring candidates as the extraction result. Entity linking finds, in a text, the different mentions of entities that exist in a background knowledge base. Concept expansion analyzes large-scale resources such as corpora and external knowledge bases to find concepts that belong to the same set as a small number of given seed concepts.

However, all three categories of methods rely on expert annotation. With plentiful annotated data they can produce fairly accurate concept extraction results, but when annotated data is scarce or absent (for example, with unstructured text), the accuracy of their concept extraction results is low.

Summary of the Invention

Embodiments of the present invention provide a concept extraction method, a device, an electronic device and a storage medium, to overcome the defect in the prior art that concept extraction accuracy is low when annotated data is scarce, and to achieve more accurate concept extraction when there is little or no annotated data.

An embodiment of the present invention provides a concept extraction method, comprising:

performing term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list;

re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining a concept extraction result for the text to be extracted according to the re-ranking result;

wherein the text to be extracted is unstructured text.

According to the concept extraction method of one embodiment of the present invention, the step of obtaining the concept extraction result for the text to be extracted according to the re-ranking result specifically comprises:

selecting a plurality of the candidate concepts as the concept extraction result for the text to be extracted, either according to the re-ranking result and a preset score threshold, or according to the re-ranking result and a preset number.

According to the concept extraction method of one embodiment of the present invention, the step of performing term extraction on the text to be extracted according to the preset vocabulary to obtain the first candidate concept list specifically comprises:

filtering the text to be extracted according to the preset vocabulary to obtain the intersection of the preset vocabulary and the text to be extracted;

performing word segmentation and part-of-speech tagging on the intersection, and taking the nouns in the intersection as candidate concepts to form the first candidate concept list.
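As an illustrative, non-limiting sketch, the two term-extraction steps above (intersecting the vocabulary with the text, then keeping nouns) might look as follows in Python. The names `VOCABULARY`, `POS_TAGS` and `extract_term_candidates` are hypothetical, and the hand-written part-of-speech table is an assumption that stands in for a real word segmenter and tagger, which the description does not name.

```python
# Hypothetical toy data: a real system would use a large vocabulary and a
# real segmenter / part-of-speech tagger instead of this lookup table.
VOCABULARY = {"binary tree", "traverse", "node", "quickly"}
POS_TAGS = {"binary tree": "n", "node": "n", "traverse": "v", "quickly": "adv"}


def extract_term_candidates(text, vocabulary, pos_tags):
    """Return vocabulary entries that occur in the text and are tagged as nouns."""
    lowered = text.lower()
    # Step 1: intersection of the preset vocabulary and the text
    # (plain substring matching, an assumption made for brevity).
    found = [term for term in vocabulary if term in lowered]
    # Step 2: keep only nouns, ordered by first appearance in the text.
    nouns = [term for term in found if pos_tags.get(term) == "n"]
    return sorted(nouns, key=lowered.index)


first_candidates = extract_term_candidates(
    "We traverse each node of the binary tree quickly.", VOCABULARY, POS_TAGS
)
```

Here `first_candidates` plays the role of the first candidate concept list; the verb and adverb that also appear in the vocabulary are filtered out by the noun check.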

According to the concept extraction method of one embodiment of the present invention, the step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list specifically comprises:

re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted;

and/or clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of preset seed concepts.

According to the concept extraction method of one embodiment of the present invention, the step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted specifically comprises:

obtaining the confidence of each candidate concept in the first candidate concept list according to the TF-IDF method;

obtaining the re-ranking result according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted, together with the second candidate concept list.
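For illustration only, a TF-IDF confidence of the kind mentioned above could be computed roughly as below. The description does not fix a particular TF-IDF variant, so the smoothed IDF used here is an assumption (one common choice), as are the function name `tf_idf_confidence`, the toy corpus, and the simplification that every candidate is a single token.

```python
import math
from collections import Counter


def tf_idf_confidence(candidates, document_tokens, corpus):
    """Score each candidate by term frequency times a smoothed inverse document frequency.

    document_tokens: tokens of the text to be extracted.
    corpus: iterable of token sets, one per reference document (hypothetical).
    """
    corpus = list(corpus)
    counts = Counter(document_tokens)
    total = len(document_tokens)
    scores = {}
    for term in candidates:
        tf = counts[term] / total if total else 0.0
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF: the +1 terms avoid division by zero for unseen terms.
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
        scores[term] = tf * idf
    return scores


tokens = ["tree", "node", "tree", "the"]
reference_corpus = [{"the", "tree"}, {"the"}, {"the", "node"}]
confidence = tf_idf_confidence(["tree", "node", "the"], tokens, reference_corpus)
```

In this toy input, "tree" is frequent in the text and rare in the corpus, so it receives the highest confidence, while "the" appears in every reference document and receives the lowest.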

According to the concept extraction method of one embodiment of the present invention, the step of clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of the preset seed concepts, specifically comprises:

obtaining the word vector of each candidate concept in the first candidate concept list and the second candidate concept list, and clustering according to these word vectors;

ranking the clusters according to the similarity between each cluster center obtained by the clustering and each preset seed concept cluster center;

for each cluster, ranking the candidate concepts belonging to the cluster according to the similarity between the word vector of each such candidate concept and the seed concept cluster center that is most similar to the center of that cluster.
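The cluster-based re-ranking above can be sketched as follows. This is a minimal sketch under several assumptions: the clusters are taken as already computed by any clustering algorithm, similarity is cosine similarity, a cluster center is the mean of its members' vectors, and a cluster is ranked by its similarity to its closest seed center. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rerank_by_seed_clusters(clusters, vectors, seed_centers):
    """clusters: list of lists of candidate names; vectors: name -> word vector."""
    scored = []
    for members in clusters:
        center = centroid([vectors[m] for m in members])
        # Seed concept cluster center most similar to this cluster's center.
        best_seed = max(seed_centers, key=lambda s: cosine(center, s))
        # Order members by similarity to that seed center.
        ordered = sorted(
            members, key=lambda m: cosine(vectors[m], best_seed), reverse=True
        )
        scored.append((cosine(center, best_seed), ordered))
    # Order clusters by how close they are to their best seed center.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, ordered in scored for name in ordered]


toy_vectors = {
    "binary tree": [1.0, 0.1],
    "linked list": [0.9, 0.3],
    "coffee": [0.0, 1.0],
    "lunch": [0.1, 1.0],
}
toy_clusters = [["coffee", "lunch"], ["linked list", "binary tree"]]
ranking = rerank_by_seed_clusters(toy_clusters, toy_vectors, [[1.0, 0.0]])
```

With a single seed center pointing along the first axis, the data-structure cluster is ranked ahead of the unrelated cluster, and within each cluster the member closest to the seed center comes first.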

According to the concept extraction method of one embodiment of the present invention, the step of obtaining the re-ranking result according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted, together with the second candidate concept list, specifically comprises:

for each candidate concept in the intersection of the first candidate concept list and the second candidate concept list, obtaining a score for that candidate concept according to its confidence and its frequency of occurrence in the text to be extracted, and setting the score of every candidate concept that does not belong to the intersection to zero;

obtaining the re-ranking result according to the scores of the candidate concepts in the first candidate concept list and the second candidate concept list.
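The intersection-based scoring above can be sketched as follows. The description states that the score depends on confidence and frequency but does not give the formula, so multiplying the two is purely an assumption made for illustration; `score_candidates` and the sample values are likewise hypothetical.

```python
def score_candidates(first_list, second_list, confidence, frequency):
    """Score the union of both lists; only candidates found in both get a non-zero score."""
    shared = set(first_list) & set(second_list)
    scores = {}
    # dict.fromkeys removes duplicates while keeping first-seen order.
    for concept in dict.fromkeys([*first_list, *second_list]):
        if concept in shared:
            # Assumed combination: confidence multiplied by frequency.
            scores[concept] = confidence[concept] * frequency[concept]
        else:
            scores[concept] = 0.0
    # Highest score first; ties keep their original order (stable sort).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


first = ["binary tree", "node", "lecture"]
second = ["node", "binary tree", "Tsinghua"]
conf = {"binary tree": 0.75, "node": 0.5, "lecture": 0.9}
freq = {"binary tree": 4, "node": 2, "lecture": 5}
ranked = score_candidates(first, second, conf, freq)
```

Note that "lecture" scores zero despite its high confidence and frequency, because the entity linker did not also propose it; this is the filtering effect the intersection rule is meant to have.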

An embodiment of the present invention further provides a concept extraction device, comprising:

an extraction module, configured to perform term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and to perform entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list;

a screening module, configured to re-rank the candidate concepts in the first candidate concept list and the second candidate concept list, and to obtain a concept extraction result for the text to be extracted according to the re-ranking result;

wherein the text to be extracted is unstructured text.

An embodiment of the present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the concept extraction methods described above.

An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the concept extraction methods described above.

The concept extraction method, device, electronic device and storage medium provided by the embodiments of the present invention perform term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, perform entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list, re-rank the candidate concepts in the two lists, and obtain the concept extraction result for the text to be extracted according to the re-ranking result. Concepts can thus be extracted from unstructured text more efficiently and accurately when there is little or no annotated data, which reduces manual effort and increases the degree of automation.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a concept extraction method provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a concept extraction device provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In the description of the embodiments of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the embodiments of the present invention. In addition, the terms "first", "second" and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.

In the description of the embodiments of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "installed", "coupled" and "connected" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal communication between two elements. A person of ordinary skill in the art can understand the specific meanings of these terms in the embodiments of the present invention according to the specific circumstances.

To overcome the above problems of the prior art, embodiments of the present invention provide a concept extraction method, a device, an electronic device and a storage medium. The inventive idea is to extract candidate concepts from unstructured text by different methods, rank the extracted candidate concepts by how likely each is to be a concept, and select the concept extraction result according to the ranking, thereby selecting the concepts that best fit the technical field.

FIG. 1 is a schematic flowchart of a concept extraction method provided by an embodiment of the present invention. The concept extraction method of the embodiment is described below with reference to FIG. 1. As shown in FIG. 1, the method includes: Step S101, performing term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list.

The text to be extracted is unstructured text.

Specifically, the text to be extracted may be unstructured text such as the result of parsing a web page or the subtitles of a teaching video. The concept extraction method provided by the embodiment of the present invention is particularly suitable for extracting the concepts of a course in a given technical field.

Preliminary concept extraction may be performed on the text to be extracted by term extraction and by entity linking respectively, yielding a plurality of candidate concepts.

Preliminary concept extraction by term extraction may include:

obtaining the nouns in the text to be extracted according to the preset vocabulary, and taking them as candidate concepts.

The preset vocabulary may be a large vocabulary containing a large number of words of the language in which the text to be extracted is written.

The candidate concepts obtained by term extraction form the first candidate concept list.

Preliminary concept extraction by entity linking may include:

performing entity linking on the text to be extracted using a linking language, and obtaining the candidate concepts from the preset knowledge graph according to the linking result.

The preset knowledge graph is a knowledge graph that contains knowledge of the technical field to which the text to be extracted belongs.

For example, XLink may be used to link the unstructured text, and the candidate concepts may be obtained from XLORE.

XLink is an entity linking system based on the cross-lingual knowledge base XLORE. XLink can recognize entities in text documents supplied by a user (such as news articles or blog posts) and link them to the corresponding entities (concepts, instances) in XLORE. XLink bridges text and the knowledge graph, providing external knowledge for text understanding. It can also help readers understand ambiguous or uncommon entities, improving text comprehension.

XLORE is the first large-scale knowledge graph with balanced Chinese and English knowledge. It is a multilingual knowledge graph built by extracting structured information from heterogeneous cross-lingual online encyclopedias, fusing the Chinese and English Wikipedias, the French Wikipedia and Baidu Baike, and structuring and cross-lingually linking the encyclopedic knowledge. It is a large-scale multilingual knowledge graph in which the volumes of Chinese and English knowledge are relatively balanced.

The candidate concepts obtained by entity linking form the second candidate concept list.
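To make the entity linking step concrete, the following is a deliberately naive sketch that matches the longest token spans against the labels of a tiny, made-up knowledge graph. It does not reproduce XLink or query XLORE; the dictionary `KNOWLEDGE_GRAPH`, the `kg:` identifiers and the function `link_entities` are all hypothetical, and a real linker would also handle punctuation, disambiguation and Chinese word segmentation.

```python
# Hypothetical mini knowledge graph: surface form -> entity identifier.
KNOWLEDGE_GRAPH = {
    "binary tree": "kg:BinaryTree",
    "tree": "kg:Tree",
    "hash table": "kg:HashTable",
}


def link_entities(text, knowledge_graph):
    """Greedy longest-match linking of whitespace tokens to knowledge graph labels."""
    if not knowledge_graph:
        return []
    tokens = text.lower().split()
    max_len = max(len(label.split()) for label in knowledge_graph)
    linked, i = [], 0
    while i < len(tokens):
        # Try the longest span first so "binary tree" wins over "tree".
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + span])
            if mention in knowledge_graph:
                linked.append((mention, knowledge_graph[mention]))
                i += span
                break
        else:
            i += 1  # No label starts at this token.
    return linked


links = link_entities("A binary tree differs from a hash table", KNOWLEDGE_GRAPH)
second_candidates = [mention for mention, _ in links]
```

Here `second_candidates` plays the role of the second candidate concept list; every entry is guaranteed to exist in the knowledge graph, which is what distinguishes it from the vocabulary-based first list.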

Step S102: re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining a concept extraction result for the text to be extracted according to the re-ranking result.

Specifically, the candidate concepts in the first candidate concept list and the second candidate concept list are the candidate concepts in the union of the two lists.

The candidate concepts in the union of the first and second candidate concept lists are compared by their scores for being a concept and sorted according to the comparison, which yields the re-ranking result.

The score reflects the probability that a candidate concept is a concept. The higher the score, the more likely the candidate is a concept; the lower the score, the less likely.

The score of each candidate concept in the union of the first and second candidate concept lists may be obtained directly, and the candidates sorted by the resulting probability of being a concept.

Alternatively, instead of directly obtaining the probability that each candidate concept in the union is a concept, these probabilities may be compared indirectly, and the candidates sorted according to the comparison.

According to the re-ranking result, a certain number of candidate concepts may be selected from the union of the first and second candidate concept lists as the concept extraction result.

需要说明的是,传统的关键短语与术语抽取方法,是从文本中抽取关键的短语,一般通过分词等方法得到候选短语,然后对它们进行置信度排序,选取分数较高的部分作为抽取结果。一般分为有监督方法和无监督方法:有监督方法通过训练分类模型来判断一个词是否是一个目标短语(即概念),而无监督方法,如TextRank等则通过上下文共现等方法构建语义图,然后进行置信度传播从而得到结果。但对于大部分实际的概念获取场景来,由于标注数据都是十分匮乏,因此有监督的方法往往难以施行,抽取的准确率极低;对于无监督的方法来说,一旦文本长度过短(如一些网页解析结果、教学中的课程字幕等),传统基于上下文的方法就难以进行建图,因而难以得出可信的置信度结果,抽取的准确率较低。并且,由于大量的概念实际是在文本中低频出现的,这也使得仅使用诸如TF-IDF等指标的统计方法难以达到很好的效果。It should be noted that the traditional key phrase and term extraction method extracts key phrases from the text. Generally, candidate phrases are obtained through word segmentation and other methods, and then they are ranked by confidence, and the parts with higher scores are selected as the extraction results. Generally, they are divided into supervised methods and unsupervised methods: supervised methods judge whether a word is a target phrase (i.e., concept) by training classification models, while unsupervised methods, such as TextRank, construct semantic graphs through context co-occurrence and other methods, and then propagate confidence to obtain results. However, for most actual concept acquisition scenarios, since the labeled data is very scarce, supervised methods are often difficult to implement, and the extraction accuracy is extremely low; for unsupervised methods, once the text length is too short (such as some web page parsing results, course subtitles in teaching, etc.), traditional context-based methods are difficult to build maps, so it is difficult to obtain reliable confidence results, and the extraction accuracy is low. In addition, since a large number of concepts actually appear at low frequencies in texts, it is difficult to achieve good results using only statistical methods such as TF-IDF.

相比传统的关键短语与术语抽取方法,本发明实施例通过大词表对待提取文本进行术语抽取,从非结构化文本中提取出名词短语作为候选概念,获取第一候选概念列表,并结合根据预设的知识图谱进行实体链接获取的第二候选概念列表进行重排序,可以在一定程度上克服传统的关键短语与术语抽取方法存在的不足,可以在标注数据较少甚至没有标注数据的情况下,获得准确率更高的概念抽取结果。Compared with traditional key phrase and term extraction methods, the embodiments of the present invention extract terms from the text to be extracted through a large vocabulary, extract noun phrases from the unstructured text as candidate concepts, obtain a first candidate concept list, and reorder the second candidate concept list obtained by entity linking according to a preset knowledge graph. This can overcome the shortcomings of traditional key phrase and term extraction methods to a certain extent, and can obtain concept extraction results with higher accuracy when there is less or even no annotated data.

传统的实体链接方法,是从文本中找出其背景知识库中存在的实体的不同提及方式。其实现需要已知一个大规模的实体清单,然后对目标文本进行匹配和筛选,从而得到链接结果。通过实体链接方法进行概念抽取的缺陷主要在于难以评估获取结果对文本的相关性和重要性:对于实体链接类的方法,它们主要的目标在于将文本中存在的实体与已有知识库进行匹配,然而实际场景中,并非所有文本中出现的实体都应该被认为是概念,这就使得仅使用实体链接的方法非常容易对概念获取引入噪声,导致抽取的准确率较低。The traditional entity linking method is to find different mentions of entities in the background knowledge base from the text. Its implementation requires a large-scale list of entities, and then matches and filters the target text to obtain the link results. The main defect of concept extraction through entity linking method is that it is difficult to evaluate the relevance and importance of the obtained results to the text: for entity linking methods, their main goal is to match the entities in the text with the existing knowledge base. However, in actual scenarios, not all entities that appear in the text should be considered as concepts. This makes it very easy to introduce noise to concept acquisition using only entity linking methods, resulting in low extraction accuracy.

相比传统的实体链接方法,本发明实施例一方面是基于知识图谱进行实体链接,相比传统的基于实体清单进行实体链接的方法,可以在一定程度上提高抽取的准确率;另一方面,将实体链接获取的第二候选概念列表,与通过大词表对待提取文本进行术语抽取获取的第一候选概念列表相结合,进行重排序,可以进一步克服传统的实体链接方法存在的不足,可以在标注数据较少甚至没有标注数据的情况下,获得准确率更高的概念抽取结果。Compared with the traditional entity linking method, the embodiment of the present invention, on the one hand, performs entity linking based on the knowledge graph, which can improve the extraction accuracy to a certain extent compared with the traditional entity linking method based on the entity list; on the other hand, the second candidate concept list obtained by entity linking is combined with the first candidate concept list obtained by term extraction of the text to be extracted through a large vocabulary, and re-sorted, which can further overcome the shortcomings of the traditional entity linking method and obtain more accurate concept extraction results when there is less annotated data or even no annotated data.

传统的概念扩展/集合扩展方法,是通过对语料、外部知识库等大规模资源进行分析,找出与给定的少量种子概念属于同一集合的概念。该方法关注于获取结果的数量及准确性,是一种在少标注资源设定下进行概念获取的方法,一般通过分析语义模板,以及集成学习的方法完成目标。概念扩展/集合扩展方法比较符合实际应用的情况:有少量的高质量标注结果,从大规模语料中找出符合条件的概念。然而,由于监督信号的缺失,概念在多轮扩展的过程中很难进行控制,从而导致严重的语义漂移问题(随着集合的扩大,它的含义逐渐与开始时的含义偏离)。并且,由于概念扩展任务依赖于语义模板的评价、新词的发现等流程,它的表现往往掣肘于预先进行的候选概念的分词、排序等结果,这使得实际应用时,扩展得到的结果往往需要耗费大量资源进行评价和标注。Traditional concept expansion/set expansion methods analyze large-scale resources such as corpus and external knowledge bases to find concepts that belong to the same set as a small number of given seed concepts. This method focuses on the number and accuracy of the results obtained. It is a method for concept acquisition under the setting of few annotated resources. It generally achieves the goal by analyzing semantic templates and ensemble learning methods. Concept expansion/set expansion methods are more in line with practical applications: there are a small number of high-quality annotation results, and qualified concepts are found from large-scale corpora. However, due to the lack of supervision signals, concepts are difficult to control during multiple rounds of expansion, resulting in serious semantic drift problems (as the set expands, its meaning gradually deviates from the initial meaning). In addition, since the concept expansion task depends on the evaluation of semantic templates, the discovery of new words and other processes, its performance is often constrained by the results of the pre-performed segmentation and sorting of candidate concepts. This makes it necessary to consume a lot of resources for evaluation and annotation of the expanded results in practical applications.

Compared with traditional concept expansion/set expansion methods, the embodiment of the present invention combines the first candidate concept list, obtained by term extraction from the text to be extracted using a large vocabulary, with the second candidate concept list, obtained by entity linking against a preset knowledge graph, and reorders the combined candidates. More accurate concept extraction results can thus be obtained when little or no annotated data is available, without consuming substantial resources for evaluation and annotation.

The embodiment of the present invention performs term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, performs entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list, reorders the candidate concepts in the two lists, and obtains the concept extraction result of the text to be extracted from the reordering result. Concepts can thus be extracted from unstructured text more efficiently and accurately when little or no annotated data is available, which reduces manual effort and increases the degree of automation.

基于上述各实施例的内容,根据重排序的结果获取待提取文本的概念抽取结果的具体步骤包括:根据重排序的结果和预设的评分阈值,或者根据重排序的结果和预设数量,选择多个候选概念,作为待提取文本的概念抽取结果。Based on the contents of the above embodiments, the specific steps of obtaining the concept extraction result of the text to be extracted according to the reordering results include: selecting multiple candidate concepts as the concept extraction result of the text to be extracted according to the reordering results and a preset scoring threshold, or according to the reordering results and a preset number.

具体地,可以根据重排序的结果,选择符合预设条件的候选概念,作为待提取文本的概念抽取结果。Specifically, according to the re-ranking result, candidate concepts meeting preset conditions may be selected as concept extraction results of the text to be extracted.

If the reordering result is obtained by sorting the candidate concepts in the union of the first candidate concept list and the second candidate concept list by their scores for being a concept, the preset condition may be that the score is greater than a preset score threshold.

If the reordering result is obtained by indirectly comparing the probabilities that the candidate concepts in the union of the first candidate concept list and the second candidate concept list are concepts, the preset condition may be that the candidate concept is among the first N entries of the reordering result (in descending order), where N is a positive integer representing the preset number.

通过预设条件,可以从候选概念中筛选出更贴合技术领域的概念。By presetting conditions, concepts that are more suitable for the technical field can be screened out from the candidate concepts.

本发明实施例根据重排序的结果和预设条件对候选概念进行筛选,获取待提取文本的概念抽取结果,能提高概念抽取的准确率。The embodiment of the present invention screens candidate concepts according to the re-ranking result and preset conditions to obtain the concept extraction result of the text to be extracted, which can improve the accuracy of concept extraction.
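The two selection rules described above (a score threshold, or a preset number N) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `select_concepts` and its parameters are hypothetical, and the input is assumed to be a list of (candidate, score) pairs already sorted in descending order of score.

```python
from typing import List, Optional, Tuple


def select_concepts(
    ranked: List[Tuple[str, float]],
    score_threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Pick candidates from a descending (candidate, score) ranking.

    Exactly one of score_threshold or top_n is expected (assumed interface).
    """
    if score_threshold is not None:
        # Rule 1: keep candidates whose score exceeds the preset threshold.
        return [candidate for candidate, score in ranked if score > score_threshold]
    if top_n is not None:
        # Rule 2: keep the first N candidates of the descending ranking.
        return [candidate for candidate, _ in ranked[:top_n]]
    raise ValueError("provide score_threshold or top_n")


ranked = [("stack", 2.1), ("queue", 1.3), ("teacher", 0.0)]
by_threshold = select_concepts(ranked, score_threshold=1.0)
by_count = select_concepts(ranked, top_n=1)
```

The threshold rule fits a ranking that carries comparable scores, while the top-N rule only needs the order, which matches the case where probabilities are compared indirectly.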

基于上述各实施例的内容,根据预设的词表对待提取文本进行术语抽取,获取第一候选概念列表的具体步骤包括:根据预设的词表对待提取文本进行过滤,获取预设的词表与待提取文本的交集。Based on the contents of the above embodiments, the specific steps of extracting terms from the text to be extracted according to the preset vocabulary and obtaining the first candidate concept list include: filtering the text to be extracted according to the preset vocabulary and obtaining the intersection of the preset vocabulary and the text to be extracted.

Specifically, vocabulary filtering is first performed on the text to be extracted: the text is filtered with the preset vocabulary to obtain the intersection of the preset vocabulary and the text, that is, the words of the text to be extracted that also appear in the vocabulary.

对交集进行分词和词性标注,获取交集中的名词作为候选概念,组成第一候选概念列表。The intersection is segmented and POS tagged, and the nouns in the intersection are obtained as candidate concepts to form a first candidate concept list.

具体地,进行词表过滤之后,可以进行词性过滤。Specifically, after the vocabulary filtering is performed, part-of-speech filtering may be performed.

可以通过自然语言处理(Natural Language Processing,NLP)的方法,对该交集进行分词和词性标注。The intersection can be segmented and POS tagged by Natural Language Processing (NLP) methods.

例如,可以通过结巴(Jieba)、Ansj或盘古分词等开源实现的分词工具,对该交集进行分词和词性标注。For example, the intersection can be segmented and POS tagged using open source segmentation tools such as Jieba, Ansj or Pangu segmentation.

该交集进行分词和词性标注之后,可以根据标注的词性对该交集进行筛选,获取该交集中的名词(含词组),作为候选概念,组成第一候选概念列表。After the intersection is segmented and POS tagged, the intersection can be screened according to the tagged POS to obtain nouns (including phrases) in the intersection as candidate concepts to form a first candidate concept list.

本发明实施例通过预设的词表对待提取文本进行词表过滤,对词表过滤的结果进行词性过滤,筛选出待提取文本中的名词作为候选概念,能提高概念抽取的准确率。The embodiment of the present invention performs vocabulary filtering on the text to be extracted through a preset vocabulary, performs part-of-speech filtering on the result of the vocabulary filtering, and selects nouns in the text to be extracted as candidate concepts, which can improve the accuracy of concept extraction.
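The vocabulary filtering and part-of-speech filtering steps can be sketched as below. This is only an illustration under stated assumptions: the tagger is passed in as a hypothetical callable that returns (token, tag) pairs, noun tags are assumed to start with "n", and a phrase is kept when its last token is a noun. The description does not specify how noun phrases are recognized, so that rule is an assumption. `toy_tagger` is a stand-in for a real segmentation tool such as Jieba.

```python
from typing import Callable, Iterable, List, Tuple

Tagger = Callable[[str], List[Tuple[str, str]]]


def vocabulary_filter(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary terms that occur in the text, ordered by first occurrence."""
    hits = [(text.find(term), term) for term in set(vocabulary) if term and term in text]
    return [term for _, term in sorted(hits)]


def pos_filter(terms: Iterable[str], tagger: Tagger, noun_prefix: str = "n") -> List[str]:
    """Keep terms whose last token is tagged as a noun (assumed rule for phrases)."""
    nouns = []
    for term in terms:
        tagged = tagger(term)
        if tagged and tagged[-1][1].startswith(noun_prefix):
            nouns.append(term)
    return nouns


def toy_tagger(term: str) -> List[Tuple[str, str]]:
    """Stand-in tagger for the demo: whitespace tokens looked up in a tiny lexicon."""
    lexicon = {"binary": "a", "tree": "n", "traverse": "v", "stack": "n"}
    return [(token, lexicon.get(token, "x")) for token in term.split()]


text = "we traverse the binary tree with a stack"
vocabulary = ["binary tree", "stack", "traverse", "heap"]
matched = vocabulary_filter(text, vocabulary)          # terms found in the text
first_candidates = pos_filter(matched, toy_tagger)     # nouns and noun phrases only
print(first_candidates)
```

Plain substring matching is used here for brevity; a real system would likely match on segmented tokens to avoid hits inside longer words.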

基于上述各实施例的内容,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序的具体步骤包括:根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序;和/或,根据第一候选概念列表和第二候选概念列表中的各候选概念的词向量进行聚类,根据聚类的结果和预设的种子概念的词向量进行重排序。Based on the contents of the above-mentioned embodiments, the specific steps of reordering the candidate concepts in the first candidate concept list and the second candidate concept list include: reordering the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted; and/or, clustering the word vectors of each candidate concept in the first candidate concept list and the second candidate concept list, and reordering according to the clustering results and the word vectors of preset seed concepts.

具体地,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序,可以单独采用公式重排序或词向量聚类重排序,还可以结合公式重排序和词向量聚类重排序。Specifically, the candidate concepts in the first candidate concept list and the second candidate concept list are reordered, and formula reordering or word vector clustering reordering may be used alone, or formula reordering and word vector clustering reordering may be combined.

公式重排序,指根据预设的公式,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序。Formula reordering refers to reordering the candidate concepts in the first candidate concept list and the second candidate concept list according to a preset formula.

预设的公式,用于根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,对候选概念进行评分。该评分,用于描述候选概念是概念的概率。The preset formula is used to score the candidate concepts according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted. The score is used to describe the probability that the candidate concept is a concept.

词向量聚类重排序,指根据第一候选概念列表和第二候选概念列表中的各候选概念的词向量进行聚类,根据聚类的结果和预设的种子概念的词向量进行重排序。The word vector clustering and re-ranking refers to clustering the word vectors of each candidate concept in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering results and the word vectors of the preset seed concepts.

种子概念,指预先确定的待提取文本所属技术领域中的概念。Seed concepts refer to pre-determined concepts in the technical field to which the text to be extracted belongs.

预设的种子概念的数量为多个。优选地,预设的种子概念的数量可以小于20个。The number of the preset seed concepts is multiple. Preferably, the number of the preset seed concepts may be less than 20.

结合公式重排序和词向量聚类重排序,是指在分别进行公式重排序和词向量聚类重排序之后,将公式重排序获得的排序结果和词向量聚类重排序获得的排序结果进行融合,综合两种排序结果,获取最终的重排序的结果。Combining formula reordering and word vector clustering reordering means that after performing formula reordering and word vector clustering reordering respectively, the ranking results obtained by formula reordering and the ranking results obtained by word vector clustering reordering are merged, and the two ranking results are combined to obtain the final reordering result.

将公式重排序获得的排序结果和词向量聚类重排序获得的排序结果进行融合,可以采用选择其中更准确的一种排序结果,作为最终的重排序的结果,也可以通过对两种结果进行加权求和等方法,获取最终的重排序的结果。The sorting results obtained by formula reordering and the sorting results obtained by word vector clustering reordering are merged. The more accurate sorting result can be selected as the final reordering result, or the final reordering result can be obtained by weighted summing of the two results.
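One way to realize the weighted-sum fusion mentioned above is sketched below. The description does not say how a ranking is turned into a number, so this sketch assumes a normalized rank score (the best candidate gets 1, the last gets 1/n); the function name `fuse_rankings` and the weight parameter are hypothetical.

```python
from typing import Dict, List


def fuse_rankings(ranking_a: List[str], ranking_b: List[str], weight_a: float = 0.5) -> List[str]:
    """Fuse two rankings (best first) by a weighted sum of normalized rank scores."""

    def rank_scores(ranking: List[str]) -> Dict[str, float]:
        n = len(ranking)
        # Assumed scoring: position 0 -> 1.0, last position -> 1/n.
        return {candidate: (n - i) / n for i, candidate in enumerate(ranking)}

    scores_a = rank_scores(ranking_a)
    scores_b = rank_scores(ranking_b)
    fused = {
        candidate: weight_a * scores_a.get(candidate, 0.0)
        + (1.0 - weight_a) * scores_b.get(candidate, 0.0)
        for candidate in set(scores_a) | set(scores_b)
    }
    # Descending fused score; ties are broken alphabetically for determinism.
    return sorted(fused, key=lambda candidate: (-fused[candidate], candidate))


formula_ranking = ["stack", "queue", "tree"]
cluster_ranking = ["queue", "stack", "tree"]
final_ranking = fuse_rankings(formula_ranking, cluster_ranking, weight_a=0.7)
```

Setting `weight_a` to 1.0 or 0.0 reduces this to choosing one of the two rankings, which corresponds to the other fusion option described above.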

本发明实施例通过对第一候选概念列表和第二候选概念列表中的各候选概念进行公式重排序和/或词向量聚类重排序,能根据重排序的结果获得更准确的概念抽取结果。The embodiment of the present invention can obtain more accurate concept extraction results according to the reordering results by performing formula reordering and/or word vector clustering reordering on each candidate concept in the first candidate concept list and the second candidate concept list.

基于上述各实施例的内容,根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序的具体步骤包括:根据TF-IDF方法,获取第一候选概念列表中各候选概念的置信度。Based on the contents of the above embodiments, the specific steps of reordering the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted include: obtaining the confidence of each candidate concept in the first candidate concept list according to the TF-IDF method.

具体地,对于第一候选概念列表中的每一候选概念,可以通过TF-IDF方法,获取该候选概念的置信度。Specifically, for each candidate concept in the first candidate concept list, the confidence of the candidate concept can be obtained by using the TF-IDF method.

TF-IDF(Term Frequency Inverse Document Frequency,词频-逆文本频率指数)是一种用于信息检索与数据挖掘的常用加权技术。TF-IDF (Term Frequency Inverse Document Frequency) is a commonly used weighting technique for information retrieval and data mining.

TF-IDF是一种统计方法,用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。字词的重要性随着它在文件中出现的次数成正比增加,但同时会随着它在语料库中出现的频率成反比下降。TF-IDF is a statistical method used to assess the importance of a word to a document in a document set or a corpus. The importance of a word increases proportionally with the number of times it appears in a document, but decreases inversely with the frequency of its occurrence in the corpus.
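As a rough illustration of the statistic, the sketch below computes one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency). The description does not state which variant is used to obtain the confidence, so the exact formula, the function name `tf_idf`, and the token-list inputs are assumptions.

```python
import math
from typing import List


def tf_idf(term: str, document: List[str], corpus: List[List[str]]) -> float:
    """TF-IDF of a term for one tokenized document within a tokenized corpus."""
    if not document:
        return 0.0
    # Term frequency: share of the document's tokens that equal the term.
    tf = document.count(term) / len(document)
    # Document frequency: number of corpus documents containing the term.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    # Inverse document frequency: rarer terms in the corpus get a larger weight.
    return tf * math.log(len(corpus) / df)


corpus = [
    ["stack", "queue", "stack"],
    ["queue", "lecture"],
    ["lecture", "queue"],
]
stack_weight = tf_idf("stack", corpus[0], corpus)
queue_weight = tf_idf("queue", corpus[0], corpus)
```

In this toy corpus "queue" appears in every document, so its weight is zero, while "stack" is frequent in one document and absent elsewhere, so its weight is positive.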

根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,以及第二候选概念列表,获取重排序的结果。The re-ranking result is obtained according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted, and the second candidate concept list.

具体地,通常越重要或越基础的概念,在待提取文本中的出现频率更高,因此,重排序并非简单地根据置信度进行排序,还考虑了在待提取文本中的出现频率,将候选概念的置信度与在待提取文本中的出现频率进行结合,获取第一候选概念列表和第二候选概念列表的并集中的每一候选概念是概念的评分。Specifically, the more important or basic the concept is, the higher the frequency of its appearance in the text to be extracted. Therefore, the re-ranking is not simply based on the confidence, but also takes into account the frequency of appearance in the text to be extracted. The confidence of the candidate concept is combined with the frequency of appearance in the text to be extracted to obtain the score of each candidate concept in the union of the first candidate concept list and the second candidate concept list.

可以根据该并集中的每一候选概念是概念的评分,将该并集中的各候选概念按照该评分从大到小的顺序进行排序,获取重排序的结果。According to the score of each candidate concept in the union, the candidate concepts in the union may be sorted in descending order according to the score to obtain a re-sorted result.

本发明实施例根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,以及第二候选概念列表,获取重排序的结果,能更准确地反映各候选概念是概念的概率,从而能根据重排序的结果获得更准确的概念抽取结果。The embodiment of the present invention obtains the re-ranking result based on the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted, as well as the second candidate concept list, which can more accurately reflect the probability of each candidate concept being a concept, thereby obtaining a more accurate concept extraction result based on the re-ranking result.

Based on the contents of the above embodiments, the specific steps of clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and reordering according to the clustering result and the word vectors of the preset seed concepts, include: obtaining the word vector of each candidate concept in the first candidate concept list and the second candidate concept list, and clustering according to these word vectors.

具体地,可以采用Word2vec模型获取第一候选概念列表和第二候选概念列表的并集中的每一候选概念的词向量。Specifically, the Word2vec model may be used to obtain the word vector of each candidate concept in the union of the first candidate concept list and the second candidate concept list.

获取该并集中的每一候选概念的词向量之后,可以采用预设的聚类方法进行聚类,得到多个类簇。After obtaining the word vector of each candidate concept in the union, a preset clustering method can be used to perform clustering to obtain multiple clusters.

例如,可以采用K均值聚类算法(K-means clustering algorithm),根据该并集中的各候选概念的词向量,对该并集中的各候选概念进行聚类。For example, a K-means clustering algorithm may be used to cluster the candidate concepts in the union according to the word vectors of the candidate concepts in the union.

对于每一类簇,可以获取属于该类簇的各候选概念的词向量的平均向量,作为该类簇中心。该类簇中心,也是一个词向量。For each cluster, the average vector of the word vectors of the candidate concepts belonging to the cluster can be obtained as the center of the cluster. The center of the cluster is also a word vector.
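The cluster center computation described above amounts to a per-cluster mean of word vectors. A minimal sketch, assuming the clustering step has already produced one integer label per candidate (the function name `cluster_centers` and the plain-list vector representation are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cluster_centers(
    labels: Sequence[int], vectors: Sequence[Sequence[float]]
) -> Dict[int, List[float]]:
    """Average the word vectors of each cluster to obtain its center."""
    groups = defaultdict(list)
    for label, vector in zip(labels, vectors):
        groups[label].append(vector)

    centers = {}
    for label, members in groups.items():
        dim = len(members[0])
        # Component-wise mean; the center has the same dimension as a word vector.
        centers[label] = [sum(v[i] for v in members) / len(members) for i in range(dim)]
    return centers


labels = [0, 0, 1]
vectors = [[1.0, 0.0], [3.0, 4.0], [5.0, 5.0]]
centers = cluster_centers(labels, vectors)
```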

The clusters are ranked according to the similarity between each cluster center obtained by clustering and each preset seed concept cluster center.

Specifically, for any cluster center and any seed concept cluster center, the similarity between the cluster center and that seed concept cluster center, which is itself a word vector, can be obtained.

相似度,可以为余弦相似度、欧氏距离或马氏距离等。Similarity can be cosine similarity, Euclidean distance, or Mahalanobis distance, etc.

It can be understood that the preset seed concepts may be clustered in advance according to their word vectors to obtain multiple seed concept clusters. The number of seed concept clusters may be the same as or different from the number of candidate concept clusters. The word vectors of the seed concepts are also obtained with the Word2vec model. A seed concept cluster center may be the average of the word vectors of the seed concepts belonging to that seed concept cluster.

The clusters may be ranked in descending order of the maximum similarity between their cluster center and the seed concept cluster centers.
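The inter-cluster ranking can be sketched as follows, using cosine similarity as the similarity measure (one of the options named above). The function names and the dictionary keyed by cluster identifier are hypothetical choices for this illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_clusters(
    centers: Dict[str, Sequence[float]], seed_centers: List[Sequence[float]]
) -> List[str]:
    """Order cluster ids by their best similarity to any seed concept cluster center."""
    best = {
        cluster_id: max(cosine_similarity(center, seed) for seed in seed_centers)
        for cluster_id, center in centers.items()
    }
    # Descending order: clusters closest to the seed concepts come first.
    return sorted(best, key=lambda cluster_id: best[cluster_id], reverse=True)


centers = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
seed_centers = [[1.0, 0.1]]
cluster_order = rank_clusters(centers, seed_centers)
```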

对于每一类簇,根据与类簇中心之间的相似度最大的种子概念类簇中心,与属于类簇的每一候选概念的词向量之间的距离,对属于类簇的各候选概念进行排序。For each cluster, the candidate concepts belonging to the cluster are ranked according to the distance between the seed concept cluster center with the greatest similarity to the cluster center and the word vector of each candidate concept belonging to the cluster.

具体地,对于聚类获得的每一类簇,可以获取与该类簇中心之间的相似度最大的种子概念类簇中心,与属于该类簇的每一候选概念的词向量之间的距离。Specifically, for each cluster obtained by clustering, the distance between the seed concept cluster center with the greatest similarity to the cluster center and the word vector of each candidate concept belonging to the cluster can be obtained.

距离,可以为余弦距离、欧氏距离或马氏距离等。Distance can be cosine distance, Euclidean distance, or Mahalanobis distance, etc.

可以按照与该类簇中心之间的相似度最大的种子概念类簇中心,与属于该类簇的每一候选概念的词向量之间的距离从小到大的顺序,对属于该类簇的各候选概念进行排序,从而获取重排序的结果。The candidate concepts belonging to the cluster can be sorted in ascending order according to the distance between the seed concept cluster center having the greatest similarity to the cluster center and the word vector of each candidate concept belonging to the cluster, thereby obtaining a re-sorting result.

It can be understood that, for a candidate concept belonging to a cluster, the smaller the distance between its word vector and the seed concept cluster center most similar to that cluster's center, the greater the probability that the candidate concept is a concept; the larger this distance, the smaller that probability.
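The intra-cluster ordering can be sketched as below, using Euclidean distance (one of the distances named above). The seed concept cluster center passed in is assumed to be the one already found to be most similar to the cluster's center; the function names and the name-to-vector dictionary are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_within_cluster(
    members: Dict[str, Sequence[float]], nearest_seed_center: Sequence[float]
) -> List[str]:
    """Order the candidates of one cluster by ascending distance to the seed center."""
    return sorted(
        members,
        key=lambda name: euclidean_distance(members[name], nearest_seed_center),
    )


members = {
    "teacher": [0.0, 1.0],
    "queue": [0.9, 0.3],
    "stack": [1.0, 0.0],
}
nearest_seed_center = [1.0, 0.1]
member_order = rank_within_cluster(members, nearest_seed_center)
```

Candidates near the seed center come first, reflecting the statement that a smaller distance indicates a higher probability of being a concept.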

For candidate concepts from any two clusters that hold the same rank within their respective clusters, the candidate concept from the higher-ranked cluster is more likely to be a concept than the one from the lower-ranked cluster.

本发明实施例通过对第一候选概念列表和第二候选概念列表中的各候选概念进行词向量聚类重排序,获取重排序的结果,能更准确地反映各候选概念是概念的概率的相对大小,从而能根据重排序的结果获得更准确的概念抽取结果。The embodiment of the present invention obtains a reordering result by clustering the word vectors of each candidate concept in the first candidate concept list and the second candidate concept list, which can more accurately reflect the relative size of the probability that each candidate concept is a concept, thereby obtaining a more accurate concept extraction result based on the reordering result.

Based on the contents of the above embodiments, the specific steps of obtaining the reordering result according to the confidence of each candidate concept in the first candidate concept list, its frequency of occurrence in the text to be extracted, and the second candidate concept list include: for each candidate concept in the intersection of the first candidate concept list and the second candidate concept list, obtaining its score according to its confidence and its frequency of occurrence in the text to be extracted, and setting the score of each candidate concept outside the intersection to zero.

具体地,可以将候选概念的置信度与在待提取文本中的出现频率进行结合,获取第一候选概念列表和第二候选概念列表的并集中的每一候选概念的评分。Specifically, the confidence of the candidate concept may be combined with the frequency of occurrence in the text to be extracted to obtain the score of each candidate concept in the union of the first candidate concept list and the second candidate concept list.

首先,获取第一候选概念列表和第二候选概念列表的交集。First, the intersection of the first candidate concept list and the second candidate concept list is obtained.

对于任一候选概念,如果该候选概念属于该交集,则该候选概念是概念的概率,高于不属于该交集的候选概念是概念的概率,可以将不属于该交集的各候选概念的评分确定为零。For any candidate concept, if the candidate concept belongs to the intersection, the probability that the candidate concept is a concept is higher than the probability that the candidate concept not belonging to the intersection is a concept, and the score of each candidate concept not belonging to the intersection can be determined to be zero.

对于属于该交集的每一候选概念,可以根据如下公式获取该候选概念的评分:For each candidate concept belonging to the intersection, the score of the candidate concept can be obtained according to the following formula:

R = ln(freq) * max(conf - c, 0)

Here, R denotes the score of the candidate concept; freq denotes the frequency of occurrence (that is, the number of occurrences) of the candidate concept in the text to be extracted; conf denotes the confidence of the candidate concept; and c denotes the confidence threshold.

根据第一候选概念列表和第二候选概念列表中各候选概念的评分,获取重排序的结果。The re-ranking result is obtained according to the score of each candidate concept in the first candidate concept list and the second candidate concept list.

具体地,可以根据第一候选概念列表和第二候选概念列表的并集中的每一候选概念是概念的评分,将该并集中的各候选概念按照该评分从大到小的顺序进行排序,获取重排序的结果。Specifically, according to the score of each candidate concept in the union of the first candidate concept list and the second candidate concept list, the candidate concepts in the union may be sorted in descending order according to the score to obtain a re-sorted result.
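The formula-based reordering can be sketched as follows. The score R = ln(freq) * max(conf - c, 0) and the zero score outside the intersection come from the description; the function name `formula_rerank`, the dictionary inputs, and the alphabetical tie-break are assumptions made for this illustration.

```python
import math
from typing import Dict, List, Set, Tuple


def formula_rerank(
    first_confidence: Dict[str, float],  # first list: candidate -> confidence
    second_list: Set[str],               # second list: candidates from entity linking
    freq: Dict[str, int],                # occurrences in the text to be extracted
    c: float,                            # confidence threshold
) -> List[Tuple[str, float]]:
    """Score every candidate in the union and sort by descending score."""
    scores = {}
    for candidate in set(first_confidence) | second_list:
        in_both = candidate in first_confidence and candidate in second_list
        count = freq.get(candidate, 0)
        if in_both and count > 0:
            # R = ln(freq) * max(conf - c, 0)
            scores[candidate] = math.log(count) * max(first_confidence[candidate] - c, 0.0)
        else:
            # Candidates outside the intersection are scored zero.
            scores[candidate] = 0.0
    # Ties are broken alphabetically (assumed, for a deterministic order).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


first_confidence = {"stack": 0.8, "queue": 0.6, "lecture": 0.9}
second_list = {"stack", "queue", "tree"}
freq = {"stack": 10, "queue": 5, "lecture": 20, "tree": 3}
ranking = formula_rerank(first_confidence, second_list, freq, c=0.5)
```

Note that a candidate occurring exactly once also scores zero, because ln(1) = 0, and so does any candidate whose confidence does not exceed the threshold c.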

本发明实施例根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,以及第二候选概念列表,获取重排序的结果,能更准确地反映各候选概念是概念的概率,从而能根据重排序的结果获得更准确的概念抽取结果。The embodiment of the present invention obtains the re-ranking result based on the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted, as well as the second candidate concept list, which can more accurately reflect the probability of each candidate concept being a concept, thereby obtaining a more accurate concept extraction result based on the re-ranking result.

With 18 manually selected preset seed concepts, and with the subtitles of the "Data Structure" course in MOOCCube as the text to be extracted, concept extraction using the methods provided by the above embodiments of the present invention can yield 138 concepts among the top 200 candidate concepts. Further, with these 138 concepts and the 18 seed concepts together used as the seed concepts, and with encyclopedia entries related to "Data Structure" as the text to be extracted, concept extraction using the same methods can yield 174 concepts among the top 200 candidate concepts.

It can be seen that the methods provided by the above embodiments of the present invention can extract concepts from unstructured text more efficiently and accurately when little or no annotated data is available.

下面对本发明实施例提供的概念抽取装置进行描述,下文描述的概念抽取装置与上文描述的概念抽取方法可相互对应参照。The concept extraction device provided by the embodiment of the present invention is described below. The concept extraction device described below and the concept extraction method described above can be referenced to each other.

图2是根据本发明实施例提供的概念抽取装置的结构示意图。基于上述各实施例的内容,如图2所示,该装置包括提取模块201和筛选模块202,其中:Fig. 2 is a schematic diagram of the structure of a concept extraction device according to an embodiment of the present invention. Based on the contents of the above embodiments, as shown in Fig. 2, the device includes an extraction module 201 and a screening module 202, wherein:

提取模块201,用于根据预设的词表对待提取文本进行术语抽取,获取第一候选概念列表,并根据预设的知识图谱对待提取文本进行实体链接,获取第二候选概念列表;The extraction module 201 is used to extract terms from the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and to perform entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list;

筛选模块202,用于对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序,根据重排序的结果获取待提取文本的概念抽取结果;A screening module 202 is used to reorder the candidate concepts in the first candidate concept list and the second candidate concept list, and obtain a concept extraction result of the text to be extracted according to the reordering result;

其中,待提取文本为非结构化文本。Among them, the text to be extracted is unstructured text.

具体地,提取模块201和筛选模块202电连接。Specifically, the extraction module 201 and the screening module 202 are electrically connected.

提取模块201可以包括第一提取子模块和第二提取子模块。The extraction module 201 may include a first extraction submodule and a second extraction submodule.

第一提取子模块,用于根据预设的词表对待提取文本进行术语抽取,获取多个候选概念,组成第一候选概念列表。The first extraction submodule is used to extract terms from the text to be extracted according to a preset vocabulary, obtain multiple candidate concepts, and form a first candidate concept list.

第二提取子模块,用于根据预设的知识图谱对待提取文本进行实体链接,获取多个候选概念,组成第二候选概念列表。The second extraction submodule is used to perform entity linking on the text to be extracted according to a preset knowledge graph, obtain multiple candidate concepts, and form a second candidate concept list.

筛选模块202对于第一候选概念列表和第二候选概念列表的并集中的各候选概念,通过比较各候选概念是概念的评分,并根据比较结果进行排序,获取重排序的结果。The screening module 202 compares the scores of the candidate concepts in the union of the first candidate concept list and the second candidate concept list, and sorts the candidate concepts according to the comparison results to obtain a re-sorting result.

该评分,可以反映候选概念是概念的概率。评分越高,说明候选概念是概念的概率越大;评分越低,说明候选概念是概念的概率越小。The score can reflect the probability that the candidate concept is a concept. The higher the score, the greater the probability that the candidate concept is a concept; the lower the score, the smaller the probability that the candidate concept is a concept.

筛选模块202可以包括重排序子模块和筛选子模块。The screening module 202 may include a reordering submodule and a screening submodule.

筛选子模块,用于根据重排序的结果,选择符合预设条件的候选概念,作为待提取文本的概念抽取结果。The screening submodule is used to select candidate concepts that meet the preset conditions according to the re-ranking results as the concept extraction results of the text to be extracted.

第一提取子模块可以包括:词表过滤单元和词性过滤单元。The first extraction submodule may include: a vocabulary filtering unit and a part-of-speech filtering unit.

词表过滤单元,用于根据预设的词表对待提取文本进行过滤,获取预设的词表与待提取文本的交集;A vocabulary filtering unit, used to filter the text to be extracted according to a preset vocabulary, and obtain the intersection of the preset vocabulary and the text to be extracted;

词性过滤单元,用于对交集进行分词和词性标注,获取交集中的名词作为候选概念,组成第一候选概念列表。The part-of-speech filtering unit is used to perform word segmentation and part-of-speech tagging on the intersection, obtain the nouns in the intersection as candidate concepts, and form a first candidate concept list.

重排序子模块,可以包括公式重排序单元和/或词向量聚类重排序单元。The reordering submodule may include a formula reordering unit and/or a word vector clustering reordering unit.

公式重排序单元,用于根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,对第一候选概念列表和第二候选概念列表中的各候选概念进行重排序;A formula reordering unit, used to reorder each candidate concept in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted;

词向量聚类重排序单元,用于根据第一候选概念列表和第二候选概念列表中的各候选概念的词向量进行聚类,根据聚类的结果和预设的种子概念的词向量进行重排序。The word vector clustering and reordering unit is used to cluster the word vectors of each candidate concept in the first candidate concept list and the second candidate concept list, and reorder them according to the clustering results and the word vectors of the preset seed concepts.

公式重排序单元可以包括置信度获取子单元和评分子单元。The formula re-ranking unit may include a confidence acquisition sub-unit and a scoring sub-unit.

置信度获取子单元,用于根据TF-IDF方法,获取第一候选概念列表中各候选概念的置信度。The confidence acquisition subunit is used to acquire the confidence of each candidate concept in the first candidate concept list according to the TF-IDF method.

评分子单元,用于根据第一候选概念列表中各候选概念的置信度和在待提取文本中的出现频率,以及第二候选概念列表,获取重排序的结果。The scoring subunit is used to obtain the re-ranking result according to the confidence of each candidate concept in the first candidate concept list and the frequency of occurrence in the text to be extracted, as well as the second candidate concept list.

词向量聚类重排序单元可以包括聚类子单元、类间排序子单元和类内排序子单元。The word vector clustering re-ranking unit may include a clustering sub-unit, an inter-class ranking sub-unit and an intra-class ranking sub-unit.

聚类子单元,用于获取对第一候选概念列表和第二候选概念列表中的各候选概念的词向量,根据第一候选概念列表和第二候选概念列表中的各候选概念的词向量进行聚类。The clustering subunit is used to obtain the word vector of each candidate concept in the first candidate concept list and the second candidate concept list, and to perform clustering according to the word vector of each candidate concept in the first candidate concept list and the second candidate concept list.

The inter-class ranking subunit is used to rank the clusters according to the similarity between each cluster center obtained by clustering and each preset seed concept cluster center.

类内排序子单元,用于对于每一类簇,根据与类簇中心之间的相似度最大的种子概念类簇中心,与属于类簇的每一候选概念的词向量之间的相似度,对属于类簇的各候选概念进行排序。The intra-class sorting subunit is used to sort the candidate concepts belonging to the cluster for each cluster according to the similarity between the seed concept cluster center with the greatest similarity to the cluster center and the word vector of each candidate concept belonging to the cluster.

评分子单元具体用于对于第一候选概念列表和第二候选概念列表的交集中的各候选概念,根据交集中的每一候选概念在的置信度和在待提取文本中的出现频率,获取交集中的每一候选概念的评分,并将不属于交集的各候选概念的评分确定为零;根据第一候选概念列表和第二候选概念列表中各候选概念的评分,获取重排序的结果。The scoring subunit is specifically used to obtain the score of each candidate concept in the intersection of the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the intersection and the frequency of occurrence in the text to be extracted, and determine the score of each candidate concept that does not belong to the intersection as zero; obtain the re-ranking result according to the score of each candidate concept in the first candidate concept list and the second candidate concept list.

本发明实施例提供的概念抽取装置,用于执行本发明上述各实施例提供的概念抽取方法,该概念抽取装置包括的各模块实现相应功能的具体方法和流程详见上述概念抽取方法的实施例,此处不再赘述。The concept extraction device provided in an embodiment of the present invention is used to execute the concept extraction method provided in the above-mentioned embodiments of the present invention. The specific methods and processes for each module included in the concept extraction device to implement the corresponding functions are detailed in the embodiments of the above-mentioned concept extraction method and will not be repeated here.

该概念抽取装置用于前述各实施例的概念抽取方法。因此,在前述各实施例中的概念抽取方法中的描述和定义,可以用于本发明实施例中各执行模块的理解。The concept extraction device is used in the concept extraction method of the above embodiments. Therefore, the description and definition of the concept extraction method in the above embodiments can be used for understanding each execution module in the embodiments of the present invention.

The embodiment of the present invention performs term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, performs entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list, reorders the candidate concepts in the two lists, and obtains the concept extraction result of the text to be extracted from the reordering result. Concepts can thus be extracted from unstructured text more efficiently and accurately when little or no annotated data is available, which reduces manual effort and increases the degree of automation.

Figure 3 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 3, the electronic device may include a processor 301, a memory 302, and a bus 303, where the processor 301 and the memory 302 communicate with each other over the bus 303. The processor 301 is configured to call computer program instructions that are stored in the memory 302 and executable on the processor 301, so as to execute the concept extraction method provided by the foregoing method embodiments. The method includes: performing term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list; and re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining the concept extraction result of the text to be extracted from the re-ranking result; where the text to be extracted is unstructured text.

In addition, the logic instructions in the memory 302 may be implemented as a software functional unit and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, an embodiment of the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the concept extraction method provided by the foregoing method embodiments. The method includes: performing term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list; and re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining the concept extraction result of the text to be extracted from the re-ranking result; where the text to be extracted is unstructured text.

In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the concept extraction method provided by the foregoing embodiments. The method includes: performing term extraction on the text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list; and re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining the concept extraction result of the text to be extracted from the re-ranking result; where the text to be extracted is unstructured text.

The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

From the description of the implementations above, those skilled in the art can clearly understand that each implementation may be realized by software together with a necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the foregoing technical solution in essence, or the part of it that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A concept extraction method, characterized by comprising:
performing term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and performing entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list;
re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list, and obtaining a concept extraction result of the text to be extracted according to the re-ranking result;
wherein the text to be extracted is unstructured text;
wherein the specific step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list comprises:
re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted;
and/or clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of preset seed concepts;
wherein the specific step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted comprises:
obtaining the confidence of each candidate concept in the first candidate concept list according to the TF-IDF method;
obtaining the re-ranking result according to the confidence of each candidate concept in the first candidate concept list, its frequency of occurrence in the text to be extracted, and the second candidate concept list;
wherein the specific step of clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of the preset seed concepts, comprises:
obtaining the word vector of each candidate concept in the first candidate concept list and the second candidate concept list, and clustering according to those word vectors;
ranking the clusters according to the similarity between each cluster center obtained by the clustering and each preset seed concept cluster center;
for each cluster, ranking the candidate concepts belonging to the cluster according to the similarity between the word vector of each candidate concept belonging to the cluster and the seed concept cluster center that is most similar to the center of the cluster.

2. The concept extraction method according to claim 1, characterized in that the specific step of obtaining the concept extraction result of the text to be extracted according to the re-ranking result comprises:
selecting a plurality of the candidate concepts as the concept extraction result of the text to be extracted according to the re-ranking result and a preset score threshold, or according to the re-ranking result and a preset number.

3. The concept extraction method according to claim 1, characterized in that the specific step of performing term extraction on the text to be extracted according to the preset vocabulary to obtain the first candidate concept list comprises:
filtering the text to be extracted according to the preset vocabulary to obtain the intersection of the preset vocabulary and the text to be extracted;
performing word segmentation and part-of-speech tagging on the intersection, and taking the nouns in the intersection as candidate concepts to form the first candidate concept list.

4. The concept extraction method according to claim 1, characterized in that the specific step of obtaining the re-ranking result according to the confidence of each candidate concept in the first candidate concept list, its frequency of occurrence in the text to be extracted, and the second candidate concept list comprises:
for each candidate concept in the intersection of the first candidate concept list and the second candidate concept list, obtaining a score of the candidate concept according to its confidence and its frequency of occurrence in the text to be extracted, and setting the score of each candidate concept that does not belong to the intersection to zero;
obtaining the re-ranking result according to the scores of the candidate concepts in the first candidate concept list and the second candidate concept list.

5. A concept extraction device, characterized by comprising:
an extraction module, configured to perform term extraction on a text to be extracted according to a preset vocabulary to obtain a first candidate concept list, and to perform entity linking on the text to be extracted according to a preset knowledge graph to obtain a second candidate concept list;
a screening module, configured to re-rank the candidate concepts in the first candidate concept list and the second candidate concept list, and to obtain a concept extraction result of the text to be extracted according to the re-ranking result;
wherein the text to be extracted is unstructured text;
wherein the specific step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list comprises:
re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted;
and/or clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of preset seed concepts;
wherein the specific step of re-ranking the candidate concepts in the first candidate concept list and the second candidate concept list according to the confidence of each candidate concept in the first candidate concept list and its frequency of occurrence in the text to be extracted comprises:
obtaining the confidence of each candidate concept in the first candidate concept list according to the TF-IDF method;
obtaining the re-ranking result according to the confidence of each candidate concept in the first candidate concept list, its frequency of occurrence in the text to be extracted, and the second candidate concept list;
wherein the specific step of clustering according to the word vectors of the candidate concepts in the first candidate concept list and the second candidate concept list, and re-ranking according to the clustering result and the word vectors of the preset seed concepts, comprises:
obtaining the word vector of each candidate concept in the first candidate concept list and the second candidate concept list, and clustering according to those word vectors;
ranking the clusters according to the similarity between each cluster center obtained by the clustering and each preset seed concept cluster center;
for each cluster, ranking the candidate concepts belonging to the cluster according to the similarity between the word vector of each candidate concept belonging to the cluster and the seed concept cluster center that is most similar to the center of the cluster.

6. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the concept extraction method according to any one of claims 1 to 4.

7. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the concept extraction method according to any one of claims 1 to 4.
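
The cluster-based re-ranking recited in claims 1 and 5 can be sketched as follows. This is a hedged illustration, not the claimed implementation: the names `cosine`, `centroid`, and `rerank_by_clusters` are hypothetical, cluster assignments are taken as given (any clustering algorithm could produce them), cosine similarity is an assumed choice of similarity measure, and ranking each cluster by its similarity to its closest seed concept cluster center is one possible reading of the cluster-ranking step.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rerank_by_clusters(vectors, labels, seed_centers):
    """vectors: concept -> word vector; labels: concept -> cluster id;
    seed_centers: list of seed concept cluster centers."""
    clusters = {}
    for concept, label in labels.items():
        clusters.setdefault(label, []).append(concept)

    ranked_clusters = []
    for members in clusters.values():
        center = centroid([vectors[c] for c in members])
        # Seed concept cluster center most similar to this cluster's center.
        best_seed = max(seed_centers, key=lambda s: cosine(center, s))
        ranked_clusters.append((cosine(center, best_seed), best_seed, members))
    # Assumption: clusters are ordered by similarity to their closest seed center.
    ranked_clusters.sort(key=lambda item: item[0], reverse=True)

    result = []
    for _, best_seed, members in ranked_clusters:
        # Within a cluster, order candidates by similarity to that seed center.
        result.extend(sorted(members, key=lambda c: cosine(vectors[c], best_seed), reverse=True))
    return result
```

Because the seed concepts stand in for labeled examples, this ordering needs only a handful of seed vectors rather than an annotated training set.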

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011241251.0A CN112527977B (en) 2020-11-09 2020-11-09 Concept extraction method, concept extraction device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112527977A (en) 2021-03-19
CN112527977B (en) 2024-06-25

Family

ID=74980703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011241251.0A Active CN112527977B (en) 2020-11-09 2020-11-09 Concept extraction method, concept extraction device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112527977B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204968B (en) * 2021-05-28 2024-09-17 平安科技(深圳)有限公司 Concept identification method, device, equipment and storage medium of medical entity
CN114741508B (en) * 2022-03-29 2023-05-30 北京三快在线科技有限公司 Concept mining method and device, electronic equipment and readable storage medium
CN116737520B (en) * 2023-06-12 2024-05-03 北京优特捷信息技术有限公司 Data braiding method, device and equipment for log data and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241282A (en) * 2020-01-14 2020-06-05 北京百度网讯科技有限公司 Text theme generation method and device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223639B2 (en) * 2017-06-22 2019-03-05 International Business Machines Corporation Relation extraction using co-training with distant supervision
CN109753664A (en) * 2019-01-21 2019-05-14 广州大学 A domain-oriented concept extraction method, terminal device and storage medium
CN110569328B (en) * 2019-07-31 2024-06-28 平安科技(深圳)有限公司 Entity linking method, electronic device and computer equipment
CN110968700B (en) * 2019-11-01 2023-04-07 数地工场(南京)科技有限公司 Method and device for constructing domain event map integrating multiple types of affairs and entity knowledge

Also Published As

Publication number Publication date
CN112527977A (en) 2021-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant