
CN115114430A - Information extraction method, device and computer readable storage medium - Google Patents

Information extraction method, device and computer readable storage medium

Info

Publication number
CN115114430A
Authority
CN
China
Prior art keywords
description
attribute name
text
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110301488.1A
Other languages
Chinese (zh)
Inventor
祝天刚
李浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202110301488.1A priority Critical patent/CN115114430A/en
Priority to PCT/CN2022/070024 priority patent/WO2022199201A1/en
Publication of CN115114430A publication Critical patent/CN115114430A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information extraction method, an information extraction device and a computer readable storage medium, and relates to the technical field of text processing. The information extraction method comprises the following steps: determining a description vector corresponding to the description text of an article; inputting the description vector into a pre-trained classification model to obtain a category output by the classification model, wherein the output category represents an attribute name of the article; and inputting the description vector and the attribute name information of the article into a pre-trained sequence labeling model to obtain a labeling result of the words in the description text output by the sequence labeling model, wherein the labeled words represent attribute values corresponding to the attribute names of the article. Embodiments of the invention improve the overall efficiency and accuracy of information extraction.

Description

Information extraction method, device and computer readable storage medium
Technical Field
The present invention relates to the field of text processing technologies, and in particular, to an information extraction method, an information extraction device, and a computer-readable storage medium.
Background
The extraction of article attribute information refers to the task of extracting an article's attribute values and the corresponding attribute names from a natural language text describing the article. For example, the text "this golden stand-up collar T-shirt is very versatile, elegant and fashionable" involves at least a color attribute and a collar attribute: the attribute value of the color attribute is "golden", and the attribute value of the collar attribute is "stand-up collar".
Item attribute information plays an important role in many scenarios, such as intelligent customer service, item recommendation, item retrieval, and the like. However, in practical applications, the attribute information of the article is often missing or uncertain, which causes many negative effects on the application of the attribute information.
In the related art, the task of extracting the product attribute information is generally divided into two independent subtasks, namely product attribute value extraction and corresponding attribute name prediction. The two subtasks are independent of each other, i.e., the extraction of the product attribute value and the prediction of the attribute name are independent of each other.
Disclosure of Invention
The inventors have found through analysis that the attribute values and attribute names of an article are interactive and complementary: the result of one task helps the other. However, in the related art, these two prediction tasks are handled independently. As a result, the extraction accuracy and extraction efficiency for attribute names and attribute values in the related art are low.
The embodiment of the invention aims to solve the technical problem that: how to improve the extraction accuracy and extraction efficiency of the attribute names and the attribute values.
According to a first aspect of some embodiments of the present invention, there is provided an information extraction method, including: determining a description vector corresponding to the description text of the article; inputting the description vector into a pre-trained classification model to obtain a class output by the classification model, wherein the output class represents an attribute name of an article; and inputting the description vector and the attribute name information of the article into a pre-trained sequence labeling model to obtain a labeling result of words in the description text output by the sequence labeling model, wherein the labeled words represent attribute values corresponding to the attribute names of the article.
In some embodiments, determining the description vector corresponding to the description text of the article includes: inputting data including a preset placeholder and the description text of the article into an encoder, and obtaining the description vector corresponding to the description text output by the encoder, where the description vector includes a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
In some embodiments, the encoder is a BERT (Bidirectional Encoder Representations from Transformers) encoder.
In some embodiments, the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
In some embodiments, the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
In some embodiments, the information extraction method further comprises: acquiring a training description vector corresponding to a training text, and attribute names and corresponding attribute values marked by the training text; inputting the training description vector into a classification model to obtain a class output by the classification model; inputting the training description vector and the category information into a sequence labeling model to obtain a labeling result of words in a training text, which is output by the sequence labeling model; determining the joint loss of the classification model and the sequence labeling model according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name marked by the training text and the corresponding attribute value; and adjusting parameters of the classification model and the sequence labeling model based on the joint loss.
In some embodiments, the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and the classification result of the sub-classification model indicates whether the text corresponding to the input description vector has the corresponding attribute name.
According to a second aspect of some embodiments of the present invention, there is provided an information extraction apparatus comprising: the description vector determining module is configured to determine a description vector corresponding to the description text of the article; the attribute name obtaining module is configured to input the description vector into a pre-trained classification model and obtain a category output by the classification model, wherein the output category represents an attribute name of an article; and the attribute value obtaining module is configured to input the description vector and the attribute name information of the article into a pre-trained sequence labeling model, and obtain a labeling result of a word in the description text, which is output by the sequence labeling model, wherein the labeled word represents an attribute value corresponding to the attribute name of the article.
In some embodiments, the description vector determining module is further configured to input data including a preset placeholder and the description text of the article into an encoder, and obtain the description vector corresponding to the description text output by the encoder, where the description vector includes a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
In some embodiments, the encoder is a BERT (Bidirectional Encoder Representations from Transformers) encoder.
In some embodiments, the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
In some embodiments, the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
In some embodiments, the information extraction device further comprises: the training module is configured to acquire a training description vector corresponding to a training text, and an attribute name and a corresponding attribute value marked by the training text; inputting the training description vector into a classification model to obtain a class output by the classification model; inputting the training description vector and the category information into a sequence labeling model to obtain a labeling result of words in a training text, which is output by the sequence labeling model; determining the joint loss of the classification model and the sequence labeling model according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name marked by the training text and the corresponding attribute value; and adjusting parameters of the classification model and the sequence labeling model based on the joint loss.
In some embodiments, the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and the classification result of the sub-classification model indicates whether the text corresponding to the input description vector has the corresponding attribute name.
According to a third aspect of some embodiments of the present invention, there is provided an information extracting apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the foregoing information extraction methods based on instructions stored in the memory.
According to a fourth aspect of some embodiments of the present invention, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements any one of the information extraction methods described above.
Some embodiments of the above invention have the following advantages or benefits. In the process of extracting the attributes in the description text, the embodiment of the invention takes the information of the attribute name predicted in the previous stage as a calculation factor, so that the end-to-end prediction of the attribute name and the attribute value can be realized. That is, the user can automatically obtain two prediction results of the attribute name and the attribute value by inputting the description text once. In addition, in the process of predicting the attribute value, the prediction result of the attribute name is used as a calculation factor, so that the efficiency and the accuracy of information extraction are improved on the whole.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 illustrates a flow diagram of an information extraction method according to some embodiments of the invention.
FIG. 2 illustrates a flow diagram of a model training method according to some embodiments of the invention.
Fig. 3 illustrates a schematic structural diagram of an information extraction apparatus according to some embodiments of the present invention.
Fig. 4 is a schematic structural diagram of an information extraction apparatus according to other embodiments of the present invention.
Fig. 5 is a schematic diagram of an information extraction apparatus according to further embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
FIG. 1 illustrates a flow diagram of an information extraction method according to some embodiments of the invention. As shown in fig. 1, the information extraction method of this embodiment includes steps S102 to S106.
In step S102, a description vector corresponding to the description text of the article is determined.
In some embodiments, data including a preset placeholder and the description text of the article is input into an encoder, and the description vector corresponding to the description text output by the encoder is obtained, where the description vector includes a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
For example, a BERT (Bidirectional Encoder Representations from Transformers) encoder is used to obtain the description vector.
For example, for the description text "this golden stand-up collar T-shirt is very versatile, elegant and fashionable", the input "[CLS] this golden stand-up collar T-shirt is very versatile, elegant and fashionable" can be fed into the BERT encoder, where [CLS] is a preset placeholder representing the whole sentence. Accordingly, the output of the encoder includes a whole-sentence vector of the description text as well as a vector for each word in the description text.
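As a minimal sketch of this encoding step (not part of the patent text), the description vector could be obtained with a pre-trained BERT encoder through the Hugging Face transformers library. The checkpoint name bert-base-chinese is an assumption, and note that this tokenizer splits Chinese text into characters rather than the word-level segments shown in Table 1 below:

```python
# Illustrative sketch only; the patent does not prescribe a library or checkpoint.
# "bert-base-chinese" is an assumed pre-trained model for Chinese product text.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "这款金色立领T恤很百搭，优雅而时尚"  # the description text from the example
inputs = tokenizer(text, return_tensors="pt")  # [CLS] is prepended automatically

with torch.no_grad():
    outputs = encoder(**inputs)

hidden = outputs.last_hidden_state   # shape: (1, sequence_length, hidden_size)
h0 = hidden[:, 0, :]                 # whole-sentence vector corresponding to [CLS]
h_tokens = hidden[:, 1:-1, :]        # per-token vectors (dropping [CLS] and [SEP])
```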
Table 1 exemplarily shows the vectors h_0, h_1, ..., h_9 output by the encoder and the input corresponding to each: h_0 represents the whole-sentence vector corresponding to the placeholder, and h_1 through h_9 represent the vectors of each word in the description text.

TABLE 1

| [CLS] | this | golden | stand-up collar | T-shirt | is very | versatile | elegant | and | fashionable |
| h_0 | h_1 | h_2 | h_3 | h_4 | h_5 | h_6 | h_7 | h_8 | h_9 |
In step S104, the description vector is input to a classification model trained in advance, and a classification output by the classification model is obtained, where the output classification represents an attribute name possessed by the article.
For example, if the classification result of the classification model for a description vector is "color" and "collar", the corresponding description text has attributes with the attribute names "color" and "collar".
In some embodiments, the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and the classification result of the sub-classification model indicates whether the text corresponding to the input description vector has the corresponding attribute name.
For example, a certain sub-classification model corresponds to the attribute name "color", and its classification result indicates whether the text corresponding to the input description vector has an attribute named "color". In this way, the multi-class classification task can be decomposed into a plurality of binary classification tasks.
In some embodiments, the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
In some embodiments, the classification model determines a classification prediction value according to formula (1), and this prediction value is used to determine the classification result, for example by comparing it with a preset threshold.
$$\hat{y}^{a} = \sigma\left(W_1 \sum_{i=1}^{n} h_i + W_2 h_0\right) \tag{1}$$

In formula (1), ŷ^a represents the classification prediction value; σ(·) denotes the sigmoid function; W_1 represents the first word parameter matrix, whose values are parameters of the classification model; i indexes the words in the description text and n is the number of words in the description text; h_i (for i ≥ 1) is the vector of word i in the description text; W_2 represents the whole-sentence parameter matrix, whose values are parameters of the classification model; h_0 is the whole-sentence vector of the description text.
By operating on the whole-sentence vector and the word vectors with their respective parameter matrices and fusing the two results, the classification takes into account both the overall meaning of the text and the meaning of each word, so the attribute name can be determined from the description text more accurately.
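A minimal PyTorch sketch of this classification head, under the assumptions that formula (1) produces one score per candidate attribute name and that W_1 and W_2 are realized as bias-free linear layers (class and variable names are illustrative, not from the patent):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of formula (1): y_a = sigmoid(W1 . sum_i h_i + W2 . h0)."""

    def __init__(self, hidden_size: int, num_attribute_names: int):
        super().__init__()
        self.W1 = nn.Linear(hidden_size, num_attribute_names, bias=False)  # first word parameter matrix
        self.W2 = nn.Linear(hidden_size, num_attribute_names, bias=False)  # whole-sentence parameter matrix

    def forward(self, h0: torch.Tensor, h_words: torch.Tensor) -> torch.Tensor:
        # h0: (batch, hidden_size); h_words: (batch, n_words, hidden_size)
        word_term = self.W1(h_words.sum(dim=1))  # fuses the meaning of each word
        sentence_term = self.W2(h0)              # fuses the whole-sentence meaning
        # One sigmoid score per attribute name, to be compared with a threshold
        return torch.sigmoid(word_term + sentence_term)
```

Each output dimension then plays the role of one binary sub-classification, matching the decomposition into binary classification tasks described above.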
In step S106, the description vector and the attribute name information of the article are input into a sequence labeling model trained in advance, and a labeling result of the words in the description text output by the sequence labeling model is obtained, where the labeled words represent the attribute value corresponding to the attribute name of the article. Examples of sequence labeling models include Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Long Short-Term Memory networks (LSTM).
For example, the description vector corresponding to "this golden stand-up collar T-shirt is very versatile, elegant and fashionable" and the information of the attribute name "color" recognized in the previous stage are input into the sequence labeling model; if "golden" carries a preset mark in the result output by the sequence labeling model, this indicates that the attribute value corresponding to the attribute name "color" in the description text is "golden".
In some embodiments, the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
In some embodiments, the sequence labeling model determines the labeling information of each word according to formula (2), and the labeling information is used to determine the labeling probability of the word (for example, by taking the labeling information directly as the labeling probability, or by substituting it into a preset formula). After the labeling probabilities of all words are obtained, the word with the maximum labeling probability, or the words whose labeling probability exceeds a preset value, are taken as the attribute value corresponding to the attribute name.
$$\hat{y}^{v}_{i} = \mathrm{softmax}\left(W_3 h_i + W_4 \hat{y}^{a}\right) \tag{2}$$

In formula (2), i is the index of a word; ŷ^v_i represents the labeling information of word i; softmax(·) denotes the normalized exponential (Softmax) function; W_3 represents the second word parameter matrix, whose values are parameters of the sequence labeling model; h_i is the vector of word i; W_4 represents the attribute name parameter matrix, whose values are parameters of the sequence labeling model; ŷ^a represents the attribute name information, obtained for example from formula (1).
When the description text corresponds to a plurality of attribute names, formula (2) can be applied once for each attribute name to calculate the labeling probability of each word, so that the attribute values corresponding to all the attribute names involved in the description text are mined.
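A companion sketch of the sequence labeling head in formula (2), assuming the labeling scheme has num_tags classes (for example BIO tags; this is an assumption) and that the attribute name information is the vector ŷ^a produced by the classification head sketched above:

```python
import torch
import torch.nn as nn

class SequenceLabelingHead(nn.Module):
    """Sketch of formula (2): y_v_i = softmax(W3 . h_i + W4 . y_a)."""

    def __init__(self, hidden_size: int, num_attribute_names: int, num_tags: int):
        super().__init__()
        self.W3 = nn.Linear(hidden_size, num_tags, bias=False)          # second word parameter matrix
        self.W4 = nn.Linear(num_attribute_names, num_tags, bias=False)  # attribute name parameter matrix

    def forward(self, h_words: torch.Tensor, y_a: torch.Tensor) -> torch.Tensor:
        # h_words: (batch, n_words, hidden_size); y_a: (batch, num_attribute_names)
        attr_term = self.W4(y_a).unsqueeze(1)  # broadcast over word positions
        # Labeling probabilities per word; marked words yield the attribute value
        return torch.softmax(self.W3(h_words) + attr_term, dim=-1)
```

Taking, for each word, the tag with the maximum labeling probability then yields the attribute value span, as described above.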
Thus, because the attribute name is predicted in advance, the attribute value is extracted within the scope of the predicted attribute name. This guarantees the accuracy of attribute value extraction and narrows the search range for attribute values, thereby improving computational efficiency.
In the above embodiment, in the process of extracting the attribute in the description text, the information of the attribute name predicted at the previous stage is also used as a calculation factor, so that the end-to-end prediction of the attribute name and the attribute value can be realized. That is, the user can automatically obtain two prediction results of the attribute name and the attribute value by inputting the description text once. In addition, in the process of predicting the attribute value, the prediction result of the attribute name is used as a calculation factor, so that the efficiency and the accuracy of information extraction are improved on the whole.
In order to further improve the accuracy of prediction, some embodiments of the present invention jointly train the classification model and the sequence labeling model in a training phase. An embodiment of the model training method of the present invention is described below with reference to fig. 2.
FIG. 2 illustrates a flow diagram of a model training method according to some embodiments of the invention. As shown in FIG. 2, the model training method of this embodiment includes steps S202 to S210.
In step S202, a training description vector corresponding to the training text, and attribute names and corresponding attribute values marked by the training text are obtained.
In step S204, the training description vector is input into the classification model, and the class output by the classification model is obtained.
In step S206, the training description vector and the category information are input into the sequence labeling model, and a labeling result of the word in the training text output by the sequence labeling model is obtained.
In step S208, the joint loss of the classification model and the sequence labeling model is determined according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name labeled by the training text, and the corresponding attribute value.
For example, the loss function of the classification model for predicting attribute names is shown in equation (3), and the loss function of the sequence labeling model for predicting attribute values is shown in equation (4).
$$\mathrm{Loss}_a = \mathrm{CrossEntropy}\left(y^{a}, \hat{y}^{a}\right) \tag{3}$$

$$\mathrm{Loss}_v = \mathrm{CrossEntropy}\left(y^{v}, \hat{y}^{v}\right) \tag{4}$$

In formulas (3) and (4), Loss_a represents the loss of the classification model; Loss_v represents the loss of the sequence labeling model; CrossEntropy denotes a cross-entropy loss (those skilled in the art may adopt other loss functions as needed); y^a represents the attribute name information labeled in the training text and ŷ^a the attribute name information predicted by the classification model; y^v represents the attribute value information labeled in the training text and ŷ^v the attribute value information predicted by the sequence labeling model.
The joint loss of the classification model and the sequence labeling model is determined, for example, by formula (5), where Loss represents the joint loss value.

$$\mathrm{Loss} = \mathrm{Loss}_a + \mathrm{Loss}_v \tag{5}$$

The parameters of the classification model and the sequence labeling model are then adjusted according to the joint loss value Loss, rather than according to the two models' respective losses.
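A sketch of the joint loss, under the assumption that Loss_a is a binary cross-entropy over the per-attribute sigmoid scores and Loss_v a cross-entropy over the per-word tag distributions (the patent only fixes the sum in formula (5), not these exact variants):

```python
import torch
import torch.nn.functional as F

def joint_loss(y_a_pred: torch.Tensor, y_a_true: torch.Tensor,
               y_v_pred: torch.Tensor, y_v_true: torch.Tensor) -> torch.Tensor:
    """Sketch of formulas (3)-(5): Loss = Loss_a + Loss_v."""
    # y_a_pred / y_a_true: (batch, num_attribute_names) sigmoid scores / 0-1 labels
    loss_a = F.binary_cross_entropy(y_a_pred, y_a_true)
    # y_v_pred: (batch, n_words, num_tags) softmax probabilities;
    # y_v_true: (batch, n_words) integer tag ids. nll_loss expects log-probs
    # with the class dimension second, hence the log() and transpose.
    # In practice one would keep logits and use F.cross_entropy for stability.
    loss_v = F.nll_loss(y_v_pred.log().transpose(1, 2), y_v_true)
    return loss_a + loss_v  # both models are updated from this single sum
```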
In step S210, parameters of the classification model and the sequence labeling model are adjusted based on the joint loss.
By performing joint training on the classification model and the sequence labeling model, an end-to-end attribute name and attribute value extraction model can be obtained, so that the accuracy and the extraction efficiency of information extraction are improved.
The attribute names and attribute values extracted by embodiments of the present invention may be applied to a variety of fields, such as knowledge graph construction, intelligent customer service, search engines, recommendation systems, and the like. For example, the extracted attribute name and attribute value are used as tags describing the article corresponding to the text, and the article and the corresponding relationship of the tags thereof are stored in the database. Thus, in downstream applications, by reading the information of the item and its tag from the database, the tag data of the item can be formed and used for the various scenarios described above.
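For instance (a hypothetical record layout; the patent does not specify a storage schema), the extraction results might be stored as:

```python
# Hypothetical item-tag record; all field names are assumptions.
item_tags = {
    "item_id": "sku-0001",  # assumed article identifier
    "tags": {
        "color": "golden",            # attribute name -> extracted attribute value
        "collar": "stand-up collar",
    },
}
# Downstream applications (search, recommendation, intelligent customer
# service) read such records back to form the tag data of the article.
```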
An embodiment of the information extracting apparatus of the present invention is described below with reference to fig. 3.
Fig. 3 illustrates a schematic structural diagram of an information extraction apparatus according to some embodiments of the present invention. As shown in fig. 3, the information extraction device 30 of this embodiment includes: a description vector determination module 310 configured to determine a description vector corresponding to a description text of the item; an attribute name obtaining module 320 configured to input the description vector into a pre-trained classification model, and obtain a class output by the classification model, wherein the output class represents an attribute name possessed by the article; the attribute value obtaining module 330 is configured to input the description vector and the attribute name information of the article into a pre-trained sequence tagging model, and obtain a tagging result of words in the description text, which is output by the sequence tagging model, where the tagged words represent attribute values corresponding to the attribute names of the article.
In some embodiments, the description vector determining module 310 is further configured to input data including a preset placeholder and the description text of the article into an encoder, and obtain the description vector corresponding to the description text output by the encoder, where the description vector includes a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
In some embodiments, the encoder is a BERT (Bidirectional Encoder Representations from Transformers) encoder.
In some embodiments, the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
In some embodiments, the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
In some embodiments, the information extraction device 30 further includes: the training module 340 is configured to obtain a training description vector corresponding to the training text, and attribute names and corresponding attribute values marked by the training text; inputting the training description vector into a classification model to obtain a class output by the classification model; inputting the training description vector and the category information into a sequence labeling model to obtain a labeling result of words in a training text, which is output by the sequence labeling model; determining the joint loss of the classification model and the sequence labeling model according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name marked by the training text and the corresponding attribute value; and adjusting parameters of the classification model and the sequence labeling model based on the joint loss.
In some embodiments, the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and the classification result of the sub-classification model indicates whether the text corresponding to the input description vector has the corresponding attribute name.
Fig. 4 is a schematic structural diagram of an information extraction apparatus according to other embodiments of the present invention. As shown in fig. 4, the information extraction device 40 of this embodiment includes: a memory 410 and a processor 420 coupled to the memory 410, the processor 420 being configured to perform the information extraction method of any of the previous embodiments based on instructions stored in the memory 410.
Memory 410 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
Fig. 5 shows a schematic structural diagram of an information extraction apparatus according to further embodiments of the present invention. As shown in fig. 5, the information extraction device 50 of this embodiment includes a memory 510 and a processor 520, and may further include an input/output interface 530, a network interface 540, a storage interface 550, and the like. These interfaces 530, 540, 550, the memory 510, and the processor 520 may be connected, for example, through a bus 560. The input/output interface 530 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 540 provides a connection interface for various networking devices. The storage interface 550 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program is configured to implement any one of the foregoing information extraction methods when executed by a processor.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (16)

1. An information extraction method, comprising:
determining a description vector corresponding to a description text of an article;
inputting the description vector into a pre-trained classification model to obtain a class output by the classification model, wherein the output class represents an attribute name of the article;
and inputting the description vector and the attribute name information of the article into a pre-trained sequence labeling model to obtain a labeling result of a word in the description text, which is output by the sequence labeling model, wherein the labeled word represents an attribute value corresponding to the attribute name of the article.
2. The information extraction method according to claim 1, wherein the determining a description vector corresponding to a description text of the article comprises:
inputting data comprising a preset placeholder and the description text of the article into an encoder, and obtaining the description vector corresponding to the description text output by the encoder, wherein the description vector comprises a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
3. The information extraction method of claim 2, wherein the encoder is a BERT (Bidirectional Encoder Representations from Transformers) encoder.
4. The information extraction method according to claim 2, wherein the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
5. The information extraction method according to claim 1 or 2, wherein the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
6. The information extraction method of claim 1, further comprising:
acquiring a training description vector corresponding to a training text, and an attribute name and a corresponding attribute value marked by the training text;
inputting the training description vector into a classification model to obtain a class output by the classification model;
inputting the training description vector and the category information into a sequence labeling model to obtain a labeling result of words in the training text, which is output by the sequence labeling model;
determining the joint loss of the classification model and the sequence labeling model according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name marked by the training text and the corresponding attribute value;
and adjusting parameters of the classification model and the sequence labeling model based on the joint loss.
7. The information extraction method according to claim 1, wherein the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and the classification result of the sub-classification model indicates whether the text corresponding to the input description vector has the corresponding attribute name.
8. An information extraction apparatus comprising:
the description vector determining module is configured to determine a description vector corresponding to the description text of the article;
the attribute name obtaining module is configured to input the description vector into a pre-trained classification model and obtain a class output by the classification model, wherein the output class represents an attribute name of the article;
and the attribute value obtaining module is configured to input the description vector and the attribute name information of the article into a pre-trained sequence labeling model, and obtain a labeling result of words in the description text, which is output by the sequence labeling model, wherein the labeled words represent attribute values corresponding to the attribute names of the article.
9. The information extraction apparatus according to claim 8, wherein the description vector determining module is further configured to input data comprising a preset placeholder and the description text of the article into an encoder, and obtain the description vector corresponding to the description text output by the encoder, wherein the description vector comprises a whole-sentence vector of the description text corresponding to the placeholder, and a vector of each word in the description text.
10. The information extraction apparatus of claim 9, wherein the encoder is a BERT (Bidirectional Encoder Representations from Transformers) encoder.
11. The information extraction apparatus according to claim 9, wherein the classification model has a whole-sentence parameter matrix and a first word parameter matrix, and the category is determined according to the result of operating the first word parameter matrix on the vectors of the words in the description text and the result of operating the whole-sentence parameter matrix on the whole-sentence vector of the description text.
12. The information extraction apparatus according to claim 8 or 9, wherein the sequence labeling model has a second word parameter matrix and an attribute name parameter matrix, and is configured to determine, from the product of the vector of each word in the description text with the second word parameter matrix and the product of the attribute name information with the attribute name parameter matrix, the labeling probability of each word given the attribute name, and to determine the attribute value corresponding to the attribute name from the description text according to the labeling probability.
13. The information extraction apparatus according to claim 8, further comprising:
the training module is configured to acquire a training description vector corresponding to a training text, and an attribute name and a corresponding attribute value marked by the training text; inputting the training description vector into a classification model to obtain a class output by the classification model; inputting the training description vector and the category information into a sequence labeling model to obtain a labeling result of words in the training text, which is output by the sequence labeling model; determining the joint loss of the classification model and the sequence labeling model according to the class output by the classification model, the labeling result of the sequence labeling model, the attribute name marked by the training text and the corresponding attribute value; and adjusting parameters of the classification model and the sequence labeling model based on the joint loss.
14. The information extraction apparatus according to claim 8, wherein the classification model includes one or more sub-classification models, each sub-classification model corresponds to an attribute name, and a classification result of the sub-classification model indicates whether text corresponding to the input description vector has the corresponding attribute name.
15. An information extraction apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the information extraction method of any of claims 1-7 based on instructions stored in the memory.
16. A computer-readable storage medium on which a computer program is stored, which program, when executed by a processor, implements the information extraction method of any one of claims 1 to 7.
CN202110301488.1A 2021-03-22 2021-03-22 Information extraction method, device and computer readable storage medium Pending CN115114430A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110301488.1A CN115114430A (en) 2021-03-22 2021-03-22 Information extraction method, device and computer readable storage medium
PCT/CN2022/070024 WO2022199201A1 (en) 2021-03-22 2022-01-04 Information extraction method and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110301488.1A CN115114430A (en) 2021-03-22 2021-03-22 Information extraction method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115114430A true CN115114430A (en) 2022-09-27

Family

ID=83323836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110301488.1A Pending CN115114430A (en) 2021-03-22 2021-03-22 Information extraction method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115114430A (en)
WO (1) WO2022199201A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073568B (en) * 2016-11-10 2020-09-11 腾讯科技(深圳)有限公司 Keyword extraction method and device
CN110110054B (en) * 2019-03-22 2021-06-08 北京中科汇联科技股份有限公司 Method for acquiring question-answer pairs from unstructured text based on deep learning
CN111797622B (en) * 2019-06-20 2024-04-09 北京沃东天骏信息技术有限公司 Method and device for generating attribute information
CN111368079B (en) * 2020-02-28 2024-06-25 腾讯科技(深圳)有限公司 Text classification method, model training method, device and storage medium
CN111611799B (en) * 2020-05-07 2023-06-02 北京智通云联科技有限公司 Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model

Also Published As

Publication number Publication date
WO2022199201A1 (en) 2022-09-29

Similar Documents

Publication Publication Date Title
CN109271521B (en) Text classification method and device
CN111198948A (en) Text classification correction method, device and equipment and computer readable storage medium
CN111046656A (en) Text processing method and device, electronic equipment and readable storage medium
CN111461164B (en) Sample data set capacity expansion method and model training method
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN111461301A (en) Serialized data processing method and device, and text processing method and device
CN111666766A (en) Data processing method, device and equipment
CN114329225A (en) Search method, device, equipment and storage medium based on search statement
CN115600109A (en) Sample set optimization method and device, equipment, medium and product thereof
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN108205524B (en) Text data processing method and device
CN115357699A (en) Text extraction method, device, equipment and storage medium
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
CN113298559A (en) Commodity applicable crowd recommendation method, system, device and storage medium
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN115455151A (en) AI emotion visual identification method and system and cloud platform
CN114676705B (en) Dialogue relation processing method, computer and readable storage medium
CN114840642A (en) Event extraction method, device, equipment and storage medium
CN114139537A (en) Word vector generation method and device
CN113553838A (en) Commodity file generation method and device
CN112329440A (en) Relation extraction method and device based on two-stage screening and classification
CN113688243B (en) Method, device, equipment and storage medium for labeling entities in sentences
CN115114430A (en) Information extraction method, device and computer readable storage medium
CN112860860A (en) Method and device for answering questions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination