CN110347802B - Text analysis method and device - Google Patents
- Publication number
- CN110347802B (application number CN201910649742.XA)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- unit
- question
- answered
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The application provides a text analysis method and device. The text analysis method comprises the following steps: inputting the text to be analyzed and the question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered; performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered; combining the first word vector with the second word vector, and inputting the first word vector and the second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit; based on the third word vector corresponding to each word unit, obtaining the probability of each word unit serving as an answer starting position and an answer ending position corresponding to the question to be answered; and determining the answer of the question to be answered based on the probability that each word unit is used as the answer starting position and the answer ending position. The text analysis method and the text analysis device can improve the accuracy of answers.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text analysis method, an apparatus, a computing device, a computer-readable storage medium, and a chip.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable efficient communication between humans and computers using natural language. Broadly, the application scenarios of natural language processing involve the intelligent processing of language and text, including reading comprehension, question-answering dialogue, writing, translation and the like. These application scenarios can be further subdivided into tasks, including recognizing words from a sequence of characters, recognizing phrases from a sequence of words, recognizing sentence constituents such as predicates, attributives and adverbials from sentences, recognizing mood from sentences, extracting an abstract from an entire article, and finding answers in an entire article according to a question, i.e., reading comprehension and question answering.
For reading comprehension and question-answering tasks, a bidirectional attention neural network model (BERT) is usually selected for processing. However, the BERT model cannot sufficiently extract the interdependence and information between the article and the question, so the effect of the model needs to be improved.
Disclosure of Invention
In view of this, embodiments of the present application provide a text analysis method, a text analysis device, a computing device, a computer-readable storage medium, and a chip, so as to solve technical defects in the prior art.
The embodiment of the application discloses a text analysis method, which comprises the following steps:
inputting a text to be analyzed and a question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
combining the first word vector with the second word vector, and inputting the first word vector and the second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit;
obtaining the probability of each word unit serving as an answer starting position and an answer ending position corresponding to the question to be answered based on the third word vector corresponding to each word unit;
and determining the answer of the question to be answered based on the probability that each word unit is used as the answer starting position and the answer ending position.
Further, before the step of inputting the text to be analyzed and the question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered, the method further includes:
and respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit.
Further, before the step of inputting the text to be analyzed and the question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered, the method further includes:
dividing the text to be analyzed into at least one input unit;
performing word segmentation processing on the input unit and the question to be answered respectively to obtain the word unit;
wherein the inputting of the text to be analyzed and the question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered comprises:
and inputting each input unit and the question to be answered into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
Further, the inputting each input unit and the question to be answered as an input set into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the input set includes:
pre-embedding the input unit and the question to be answered to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered;
and inputting the word vector, the sentence vector and the position vector into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
Further, part-of-speech tagging is performed on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered, and the method includes:
respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit;
performing part-of-speech tagging on each word unit to obtain a word unit carrying part-of-speech information;
and performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
Further, the combining the first word vector and the second word vector, and inputting the combined first word vector and second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit includes:
splicing the first word vector and the second word vector of each word unit to obtain a spliced vector corresponding to the word unit;
and inputting the spliced vector into the answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
Further, the answer obtaining model comprises a first sub-layer, a second sub-layer and a third sub-layer;
the inputting of the spliced vector into the answer obtaining model for processing to obtain a third word vector corresponding to each word unit comprises:
inputting the spliced vector into the first sublayer for processing to obtain an output vector of the first sublayer;
inputting the output vector of the first sublayer into the second sublayer for processing to obtain the output vector of the second sublayer;
and inputting the output vector of the second sublayer into the third sublayer for processing to obtain the third word vector.
Further, the obtaining, based on the third word vector corresponding to each word unit, a probability that each word unit serves as an answer start position and an answer end position corresponding to the question to be answered includes:
and performing linear mapping and nonlinear transformation on the third word vector corresponding to each word unit to respectively obtain the probability of each word unit serving as the answer starting position and the answer ending position corresponding to the question.
A text analysis apparatus comprising:
the first processing module is configured to input a text to be analyzed and a question to be answered into a text analysis model for processing, and obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
the second processing module is configured to perform part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
the third processing module is configured to combine the first word vector with the second word vector, and input the combined first word vector and second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit;
the probability obtaining module is configured to obtain the probability that each word unit is used as the answer starting position and the answer ending position corresponding to the question to be answered based on the third word vector corresponding to each word unit;
an answer determination module configured to determine an answer to the question to be answered based on a probability that each of the word units is used as an answer start position and an answer end position.
Optionally, the text analysis apparatus further includes:
and the word segmentation processing module is configured to perform word segmentation processing on the text to be analyzed and the question to be answered respectively to obtain the word unit.
Optionally, the text analysis apparatus further includes:
a dividing module configured to divide the text to be analyzed into at least one input unit;
performing word segmentation processing on the input unit and the question to be answered respectively to obtain the word unit;
the first processing module is further configured to:
and inputting each input unit and the question to be answered as an input set into the text analysis model for processing to obtain a first word vector corresponding to each word unit in the input set.
Optionally, the first processing module is further configured to:
pre-embedding the input unit and the question to be answered to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered;
and inputting the word vector, the sentence vector and the position vector into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
Optionally, the second processing module is further configured to:
respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit;
performing part-of-speech tagging on each word unit to obtain a word unit carrying part-of-speech information;
and performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
Optionally, the third processing module is further configured to:
splicing the first word vector and the second word vector of each word unit to obtain a spliced vector corresponding to the word unit;
and inputting the spliced vector into the answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
Optionally, the answer obtaining model includes a first sub-layer, a second sub-layer, and a third sub-layer;
the third processing module further configured to:
inputting the spliced vector into the first sublayer for processing to obtain an output vector of the first sublayer;
inputting the output vector of the first sublayer into the second sublayer for processing to obtain the output vector of the second sublayer;
and inputting the output vector of the second sublayer into the third sublayer for processing to obtain the third word vector.
Optionally, the probability obtaining module is further configured to:
and performing linear mapping and nonlinear transformation on the third word vector corresponding to each word unit to respectively obtain the probability of each word unit serving as the answer starting position and the answer ending position corresponding to the question.
A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the text analysis method when executing the instructions.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the text analysis method.
A chip storing computer instructions which, when executed by a processor, implement the steps of the text analysis method.
According to the text analysis method, the text analysis device, the computing device, the computer-readable storage medium and the chip described above, the first word vector obtained through the text analysis model and the second word vector obtained through part-of-speech tagging are combined and input into the answer obtaining model for further extraction and analysis, so that the information between the text to be analyzed and the question to be answered can be extracted more deeply, and the accuracy of the answer is effectively improved.
Drawings
FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of a text analysis method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the generation of an input set of a text analysis model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a BERT model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text analysis apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "upon", "when" or "in response to a determination", depending on the context.
First, the terms to which one or more embodiments of the present invention relate are explained.
BERT model: a bidirectional attention neural network model proposed by Google in October 2018. The BERT model splices the question and the article and uses an attention mechanism to obtain the article information and the interdependence between the question and the article, thereby obtaining an interdependent representation vector for each word unit of the question and the article, and finally obtains the probability of each word unit being the start position and the end position of an answer through linear mapping and nonlinear transformation.
Word unit (token): before any actual processing of the input text, it needs to be segmented into language units such as words, punctuation, numbers or pure alphanumerics, which are called word units. For an English text, a word unit may be a word, a punctuation mark, a number, etc., and for a Chinese text, the smallest word unit may be a character, a punctuation mark, a number, etc.
Word embedding: embedding a high-dimensional space, whose dimension equals the number of all words, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers.
Long Short-Term Memory network (LSTM) model: a recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time sequence. The LSTM model may be used to connect previous information to the current task, for example using past statements to infer understanding of the current statement.
Normalized exponential function Softmax: is a generalization of the logistic function that can "compress" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real vector, so that each element ranges between [0,1] and the sum of all elements is 1.
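The Softmax definition above can be illustrated with a short sketch. The function name and the sample scores are illustrative only; subtracting the maximum before exponentiation is a common numerical safeguard and is not something the description itself specifies.

```python
import math

def softmax(scores):
    """Map a list of K real numbers to K values in [0, 1] that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Larger input scores receive larger output values, and the outputs can be read as a probability distribution over the K positions.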
In the present application, a text analysis method, an apparatus, a computing device, a computer-readable storage medium, and a chip are provided, which are described in detail in the following embodiments one by one.
Fig. 1 is a block diagram illustrating a configuration of a computing device 100 according to an embodiment of the present specification. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, through a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the method shown in FIG. 2.
As shown in fig. 2, a text analysis method includes steps S210 to S250.
Step S210: inputting a text to be analyzed and a question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
In practical application, the text to be analyzed may be divided into at least one input unit in advance, and the word segmentation processing may be performed on the input unit and the question to be answered, respectively, to obtain the word unit.
Specifically, the text to be analyzed may be divided into input units based on the number of characters in the text and a preset maximum number of characters that each input unit can accommodate. For example, assume that an input unit can contain at most a characters and the text to be analyzed contains b characters in total, where a and b are both positive integers. If a is greater than or equal to b, the whole text to be analyzed can be taken as one input unit. If a is less than b, it is determined whether b/a is an integer: if so, the text to be analyzed can be divided into b/a input units; if not, the text to be analyzed can be divided into ⌊b/a⌋ + 1 input units, where ⌊b/a⌋ denotes b/a rounded down.
For example, assuming that each input unit can accommodate 100 characters at most, the text A to be analyzed includes 220 characters, the text B to be analyzed includes 80 characters, and the text C to be analyzed includes 100 characters, the text A to be analyzed may be divided into three input units, the whole text B to be analyzed may be used as one input unit, and the whole text C to be analyzed may also be used as one input unit.
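The division rule can be sketched as follows. The function names are hypothetical, and cutting the text into consecutive fixed-size slices is an assumption: the description fixes only the number of input units, not where the boundaries fall.

```python
def count_input_units(b, a):
    """Number of input units for a text of b characters with capacity a per unit."""
    # One unit if the text fits; otherwise the ceiling of b / a.
    return 1 if a >= b else -(-b // a)

def split_into_input_units(text, max_chars):
    """Divide text into consecutive input units of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be a positive integer")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Texts A, B and C from the example: 220, 80 and 100 characters, capacity 100.
units_a = split_into_input_units("x" * 220, 100)
units_b = split_into_input_units("x" * 80, 100)
units_c = split_into_input_units("x" * 100, 100)
```

Each resulting unit would then be paired with the question to be answered to form one input set.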
Further, each input unit and the question to be answered are respectively used as an input set and input into the text analysis model for processing, and a first word vector corresponding to each word unit in the input set is obtained.
In particular, each input set includes one input unit of the text to be analyzed and the question to be answered.
For example, assume that the text A to be analyzed includes three input units, namely input unit A_1, input unit A_2 and input unit A_3. Input unit A_1 and the question to be answered form input set A_1, which is input into the text analysis model for processing; input unit A_2 and the question to be answered form input set A_2, which is input into the text analysis model for processing; and input unit A_3 and the question to be answered form input set A_3, which is input into the text analysis model for processing.
Further, the input unit and the question to be answered may be subjected to pre-embedding processing to obtain a word vector, a sentence vector, and a position vector of each word unit in the input unit and the question to be answered; and inputting the word vector, sentence vector and position vector of each word unit in the input unit and the question to be answered into the text analysis model as an input vector set for processing to obtain a first word vector corresponding to each word unit in the input set.
Specifically, the pre-embedding processing refers to inputting the input unit and the question to be answered into an embedding layer to perform word embedding processing in advance, so as to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered. The word vector is the vector corresponding to each word unit, the sentence vector is the vector of the sentence to which each word unit belongs, and the position vector is the vector generated from the position of each word unit.
For example, FIG. 3 is a schematic diagram of the generation of an input set of the text analysis model. Assume that the input set includes two sentences, "my dog likes bones" and "what does dog like". "my dog likes bones" is used as the text to be analyzed (input unit), "what does dog like" is used as the question to be answered, and the input unit and the question to be answered are subjected to word embedding processing to obtain the input set shown in FIG. 3. Each word unit in the input set carries a letter subscript indicating the sentence to which the word unit belongs and an Arabic numeral subscript indicating the position of the word unit in the input set.
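A minimal sketch of the pre-embedding step is given below. Everything in it is illustrative: the tokens, the tiny dimension, and the randomly filled lookup tables standing in for learned embeddings are all assumptions. Summing the three vectors element-wise follows the published BERT design and is not stated in this description.

```python
import random

def make_table(num_rows, dim, seed):
    """Hypothetical embedding table: one random vector per row id."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]

def pre_embed(text_tokens, question_tokens, dim=4):
    """Return the word, sentence and position vector of every word unit."""
    tokens = text_tokens + question_tokens
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    # Sentence id: 0 for the text to be analyzed, 1 for the question.
    sentence_ids = [0] * len(text_tokens) + [1] * len(question_tokens)
    word_table = make_table(len(vocab), dim, seed=0)
    sentence_table = make_table(2, dim, seed=1)
    position_table = make_table(len(tokens), dim, seed=2)
    rows = []
    for pos, (tok, sent) in enumerate(zip(tokens, sentence_ids)):
        word_vec = word_table[vocab[tok]]
        sent_vec = sentence_table[sent]
        pos_vec = position_table[pos]
        # Assumption: the three vectors are summed element-wise (BERT convention).
        combined = [w + s + p for w, s, p in zip(word_vec, sent_vec, pos_vec)]
        rows.append({"token": tok, "word": word_vec, "sentence": sent_vec,
                     "position": pos_vec, "input": combined})
    return rows

rows = pre_embed(["my", "dog", "likes", "bones"], ["what", "does", "dog", "like"])
```

In this sketch the two occurrences of "dog" share the same word vector but differ in sentence vector and position vector, which is how the model can tell them apart.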
The text analysis model may be a BERT model. As shown in FIG. 4, the BERT model may include n stack layers connected in sequence. Each stack layer further comprises a self-attention layer, a first normalization layer, a feedforward layer, and a second normalization layer. An input set consisting of word vectors, sentence vectors and position vectors is input into the 1st stack layer to obtain the output vector of the 1st stack layer, the output vector of the 1st stack layer is input into the 2nd stack layer, and so on, until the output vector of the last stack layer is obtained. The output vector of the last stack layer is taken as the first word vector of each word unit.
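The data flow through the stack layers can be sketched in a heavily simplified form. This is not the BERT implementation: the attention uses identity projections instead of learned query, key and value weights, the feedforward step is a bare ReLU, and the residual connections are an assumption borrowed from the published Transformer design. The sketch only shows the order self-attention, normalization, feedforward, normalization, repeated n times.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def self_attention(seq):
    """Scaled dot-product self-attention with identity projections (no learned weights)."""
    scale = math.sqrt(len(seq[0]))
    out = []
    for query in seq:
        weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in seq])
        out.append([sum(w * value[d] for w, value in zip(weights, seq))
                    for d in range(len(query))])
    return out

def feed_forward(vec):
    """Placeholder feedforward step: ReLU only, no learned weights."""
    return [max(0.0, v) for v in vec]

def stack_layer(seq):
    attended = self_attention(seq)
    # First normalization layer, applied after an assumed residual connection.
    normed = [layer_norm([x + a for x, a in zip(row, att)]) for row, att in zip(seq, attended)]
    # Feedforward layer followed by the second normalization layer.
    return [layer_norm([x + f for x, f in zip(row, feed_forward(row))]) for row in normed]

def encode(seq, n_layers):
    """Feed the output of each stack layer into the next; return the last output."""
    for _ in range(n_layers):
        seq = stack_layer(seq)
    return seq

inputs = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.8, 0.1, -0.2]]
first_word_vectors = encode(inputs, n_layers=2)
```

The output keeps one vector per word unit with the same dimension as the input, matching the role of the first word vector.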
Step S220: performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
In practical application, the text to be analyzed and the question to be answered are respectively subjected to word segmentation processing to obtain the word unit; performing part-of-speech tagging on each word unit to obtain a word unit carrying part-of-speech information; and performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
Specifically, part of speech refers to the characteristics of a word that serve as the basis for classifying words. Performing part-of-speech tagging on each word unit means tagging each word unit with corresponding part-of-speech information based on its characteristics. The part-of-speech information may be part-of-speech category information, including nouns, verbs, adjectives, place words, orientation words, distinction words, status words, pronouns, numerals, quantifiers, prepositions, adverbs, mood words, character strings, punctuation marks, and the like. For example, word units representing persons, things, places or abstract concepts may be labeled with the part-of-speech information "noun", word units representing actions or states may be labeled with the part-of-speech information "verb", word units representing behavior or state features may be labeled with the part-of-speech information "adverb", and so on; the remaining cases are analogous and are not described again. Word embedding processing is then performed on the word units carrying the part-of-speech information to obtain a second word vector, carrying the part-of-speech information, corresponding to each word unit.
For example, assuming that the text to be analyzed includes "my dog likes bones", word segmentation processing yields the word units "my", "dog", "likes" and "bones". Part-of-speech tagging is performed on each word unit: the part-of-speech information of the word unit "my" is "adjective (adj)", the part-of-speech information of the word units "dog" and "bones" is "noun (n)", and the part-of-speech information of the word unit "likes" is "verb (v)". Word embedding processing is performed on the word units carrying the part-of-speech information to obtain the second word vectors shown in Table 1.
TABLE 1
Word unit | my | dog | likes | bones |
Part-of-speech information | Adjective | Noun | Verb | Noun |
Second word vector | E_(my,adj) | E_(dog,n) | E_(likes,v) | E_(bones,n) |
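The path from word units to the second word vectors in Table 1 can be sketched as below. The tag lookup is a hypothetical stand-in for a real part-of-speech tagger, and the seeded random vectors stand in for a learned embedding of each (word unit, tag) pair; the 64-dimensional size is chosen for illustration only.

```python
import random

# Hypothetical tag lookup standing in for a real part-of-speech tagger.
POS_LOOKUP = {"my": "adj", "dog": "n", "likes": "v", "bones": "n"}

def tag_word_units(word_units):
    """Attach part-of-speech information to each word unit."""
    return [(w, POS_LOOKUP[w]) for w in word_units]

def second_word_vector(word, tag, dim=64):
    """Deterministic stand-in for a learned embedding of a (word unit, tag) pair."""
    rng = random.Random(f"{word}|{tag}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

tagged = tag_word_units(["my", "dog", "likes", "bones"])
second_vectors = {pair: second_word_vector(*pair) for pair in tagged}
```

Because the vector is keyed on both the word unit and its tag, the same word tagged with a different part of speech would receive a different second word vector.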
Step S230: combining the first word vector with the second word vector, and inputting the combined first word vector and second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
In practical application, the first word vector and the second word vector of each word unit can be spliced to obtain a spliced vector corresponding to the word unit; and inputting the spliced vector into the answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
Specifically, if the first word vector of a word unit is an a-dimensional vector and the second word vector is a b-dimensional vector, the a-dimensional first word vector and the b-dimensional second word vector are spliced to obtain a c-dimensional spliced vector, where c = a + b, and the c-dimensional spliced vector is input into the answer obtaining model for processing to obtain the third word vector corresponding to the word unit.
For example, assume that the first word vector E_1 of the word unit "bones" is 768-dimensional and the second word vector E_2 is 64-dimensional. The first word vector E_1 and the second word vector E_2 of the word unit "bones" are spliced to obtain an 832-dimensional spliced vector E_12, and the spliced vector E_12 is input into the answer obtaining model to obtain the third word vector corresponding to the word unit "bones".
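The splicing step can be sketched as plain concatenation. This is a minimal illustration assuming vectors are stored as Python lists; the helper name `splice` and the placeholder values are hypothetical, and only the dimensions 768, 64 and 832 come from the example above.

```python
def splice(first_vec, second_vec):
    """Concatenate an a-dimensional and a b-dimensional vector into an (a + b)-dimensional one."""
    return list(first_vec) + list(second_vec)


e1 = [0.0] * 768  # placeholder first word vector (values are made up)
e2 = [1.0] * 64   # placeholder second word vector (values are made up)
e12 = splice(e1, e2)
print(len(e12))  # 832
```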
Further, the answer obtaining model comprises a first sub-layer, a second sub-layer and a third sub-layer. The spliced vector may be input into the first sub-layer for processing to obtain the output vector of the first sub-layer; the output vector of the first sub-layer is input into the second sub-layer for processing to obtain the output vector of the second sub-layer; and the output vector of the second sub-layer is input into the third sub-layer for processing to obtain the third word vector.
The answer obtaining model is a three-layer bidirectional LSTM structure, and the first sub-layer, the second sub-layer and the third sub-layer are all bidirectional LSTM network models.
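To make the stacking of three bidirectional LSTM sub-layers concrete, here is a small pure-Python sketch with untrained, randomly initialized weights. The class names, the tiny dimensions and the weight initialization are all assumptions for illustration; a practical system would use a deep-learning framework and trained parameters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A single LSTM cell with randomly initialized (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        rows = 4 * hidden_dim          # input, forget, candidate and output gates
        cols = input_dim + hidden_dim  # weights act on the concatenation [x; h_prev]
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.b = [0.0] * rows

    def step(self, x, h_prev, c_prev):
        z = x + h_prev  # list concatenation
        pre = [sum(wi * zi for wi, zi in zip(row, z)) + b
               for row, b in zip(self.w, self.b)]
        n = self.hidden_dim
        i = [sigmoid(v) for v in pre[0:n]]
        f = [sigmoid(v) for v in pre[n:2 * n]]
        g = [math.tanh(v) for v in pre[2 * n:3 * n]]
        o = [sigmoid(v) for v in pre[3 * n:4 * n]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c


class BiLSTMLayer:
    """Runs one LSTM left-to-right and one right-to-left, then concatenates per position."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.fwd = LSTMCell(input_dim, hidden_dim, rng)
        self.bwd = LSTMCell(input_dim, hidden_dim, rng)

    def _run(self, cell, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in sequence:
            h, c = cell.step(x, h, c)
            outputs.append(h)
        return outputs

    def __call__(self, sequence):
        forward = self._run(self.fwd, sequence)
        backward = self._run(self.bwd, sequence[::-1])[::-1]
        return [f + b for f, b in zip(forward, backward)]


def build_answer_model(input_dim, hidden_dim, seed=0):
    """First, second and third sub-layers; each later layer reads 2 * hidden_dim inputs."""
    rng = random.Random(seed)
    return [
        BiLSTMLayer(input_dim, hidden_dim, rng),
        BiLSTMLayer(2 * hidden_dim, hidden_dim, rng),
        BiLSTMLayer(2 * hidden_dim, hidden_dim, rng),
    ]


def run_answer_model(layers, spliced_vectors):
    out = spliced_vectors
    for layer in layers:  # the output of each sub-layer feeds the next one
        out = layer(out)
    return out


layers = build_answer_model(input_dim=6, hidden_dim=3)
spliced = [[0.1 * (t + k) for k in range(6)] for t in range(4)]  # 4 toy word units
third_word_vectors = run_answer_model(layers, spliced)
print(len(third_word_vectors), len(third_word_vectors[0]))
```

Each word unit keeps its own output vector through all three sub-layers, so the result still has one third word vector per word unit.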
Step S240: obtaining, based on the third word vector corresponding to each word unit, the probability of each word unit serving as the answer start position and the answer end position corresponding to the question to be answered.
In practical application, the third word vector corresponding to each word unit may be subjected to linear mapping and nonlinear transformation, so as to obtain probabilities that each word unit is used as an answer start position and an answer end position corresponding to a question.
In particular, a linear mapping is a mapping from one vector space V to another vector space W. The conversion from the dimension of the word vector to the dimension of the sentence vector is realized through linear mapping. The nonlinear transformation applies a nonlinear transformation to the original feature vector to obtain a new feature vector; using the new feature vector for linear classification is equivalent to performing nonlinear classification in the original feature space.
In practical application, this can be implemented in various manners, such as Softmax, which is not limited in the present application.
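One possible reading of this step is a per-word linear score followed by a Softmax over all word units. The sketch below assumes that reading; the weight vectors, biases and toy third word vectors are made-up values, and the function names are hypothetical.

```python
import math


def linear_map(vector, weights, bias):
    """Map a vector to a single score: w . v + b."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


def softmax(scores):
    """Nonlinear transformation turning scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_probabilities(third_word_vectors, w_start, b_start, w_end, b_end):
    """Return (start probabilities, end probabilities), one value per word unit."""
    start_scores = [linear_map(v, w_start, b_start) for v in third_word_vectors]
    end_scores = [linear_map(v, w_end, b_end) for v in third_word_vectors]
    return softmax(start_scores), softmax(end_scores)


# Toy third word vectors for four word units (values are made up).
vectors = [[0.1, 0.2], [0.0, -0.3], [0.4, 0.1], [0.9, 0.8]]
start_probs, end_probs = position_probabilities(
    vectors, w_start=[1.0, 1.0], b_start=0.0, w_end=[0.5, 2.0], b_end=0.0
)
print(start_probs)
print(end_probs)
```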
Step S250: determining the answer to the question to be answered based on the probability of each word unit serving as the answer start position and the answer end position.
Specifically, the starting position of the answer to the question to be answered in the text to be analyzed may be determined by comparing the probability magnitude of each word unit as the starting position of the answer, and similarly, the ending position of the answer to the question to be answered in the text to be analyzed may be determined by comparing the probability magnitude of each word unit as the ending position of the answer.
For example, assume that the text to be analyzed includes "my dog likes bones" and the question to be answered includes "what does dog like". The probability of the word unit "my" as the answer start position is m_1 and as the answer end position is n_1; for the word unit "dog" the probabilities are m_2 and n_2; for the word unit "likes" they are m_3 and n_3; and for the word unit "bones" they are m_4 and n_4, where m_4 > m_3 > m_2 > m_1 and n_4 > n_3 > n_2 > n_1. The probability that the word unit "bones" is the answer start position and the probability that it is the answer end position are therefore both the maximum, so the answer to the question "what does dog like" is "bones".
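The comparison described above amounts to taking the position with the largest start probability and the position with the largest end probability. A minimal sketch follows; the numeric values are made up and only respect the ordering m_4 > m_3 > m_2 > m_1 and n_4 > n_3 > n_2 > n_1, and the function name `pick_answer` is hypothetical.

```python
def pick_answer(word_units, start_probs, end_probs):
    """Return (start index, end index, answer text) using 0-based, inclusive indices."""
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    return start, end, " ".join(word_units[start:end + 1])


words = ["my", "dog", "likes", "bones"]
m = [0.05, 0.10, 0.15, 0.70]  # made-up start probabilities
n = [0.02, 0.08, 0.20, 0.70]  # made-up end probabilities
start, end, answer = pick_answer(words, m, n)
print(answer)  # bones
```

The description does not say what happens when the chosen end position precedes the chosen start position; in this sketch that case simply yields an empty string.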
The above embodiments are further described below with reference to specific examples.
For example, assume that the text to be analyzed is "The Silk Road is divided into the Land Silk Road and the Sea Silk Road, the Land Silk Road originated in the Western Han Dynasty and the Sea Silk Road originated in the Qin and Han Dynasties", and the question to be answered is "which period did the Land Silk Road originate from". Word segmentation processing is performed on the text to be analyzed and the question to be answered to obtain the word units "The", "Silk", "Road", "is", "divided", "into", and so on. The number of characters of the text to be analyzed is smaller than the maximum number of characters that one input unit can hold, so the text to be analyzed as a whole is taken as one input unit, placed in front of the question to be answered, and input together with it.
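The handling of input units can be sketched as follows. The description here only states that a text shorter than the capacity of one input unit is used whole and placed in front of the question; splitting longer texts at word boundaries, as done below, is an assumption, and both function names are hypothetical.

```python
def divide_into_input_units(text, max_chars):
    """Split text into input units of at most max_chars characters.

    Assumed strategy: break at spaces. A single word longer than max_chars
    is kept as its own (over-long) unit rather than being cut.
    """
    if len(text) <= max_chars:
        return [text]
    units, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            units.append(current)
            current = word
        else:
            current = candidate
    if current:
        units.append(current)
    return units


def build_input_sets(text, question, max_chars):
    """Pair every input unit with the question; the unit comes first."""
    return [(unit, question) for unit in divide_into_input_units(text, max_chars)]


whole = build_input_sets("my dog likes bones", "what does dog like", max_chars=512)
pieces = divide_into_input_units("my dog likes bones", max_chars=10)
print(whole)
print(pieces)
```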
Pre-embedding processing is performed on the input unit and the question to be answered to obtain the word vector, sentence vector and position vector corresponding to each word unit in the input set, forming an input vector set. Taking "Land" as an example, the word unit "Land" appears twice in the text to be analyzed and once in the question to be answered. Denote the first occurrence in the text as "Land 1", the second as "Land 2", and the occurrence in the question as "Land 3". After pre-embedding processing, the three "Land" word units obtain the word vectors, sentence vectors and position vectors shown in Table 2; other word units are processed in the same way and are not described again.
TABLE 2
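The pre-embedding of repeated word units can be sketched with integer ids in place of actual vectors. The sketch assumes a BERT-style scheme in which identical words share a word id, the text and the question receive different sentence ids, and every position receives its own position id; special tokens are omitted, and the shortened word lists are made up for illustration.

```python
def pre_embed(text_units, question_units):
    """Assign a word id, sentence id and position id to every word unit.

    Sentence id 0 marks the text to be analyzed and 1 marks the question.
    """
    vocab = {}
    rows = []
    sequence = [(w, 0) for w in text_units] + [(w, 1) for w in question_units]
    for position, (word, sentence_id) in enumerate(sequence):
        word_id = vocab.setdefault(word, len(vocab))
        rows.append({
            "word": word,
            "word_id": word_id,
            "sentence_id": sentence_id,
            "position_id": position,
        })
    return rows


# Shortened, made-up word lists in which "Land" occurs twice in the text and once in the question.
text_units = "Land Road and Sea Road Land Road originated".split()
question_units = "which period did Land Road originate".split()
rows = pre_embed(text_units, question_units)
lands = [r for r in rows if r["word"] == "Land"]
for r in lands:
    print(r)
```

Under these assumptions the three "Land" occurrences share one word id, the first two share a sentence id, and all three differ in position id.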
The word vectors, sentence vectors and position vectors corresponding to the word units are input into the text analysis model (BERT model) as an input set for processing to obtain the first word vectors E_x1 to E_x46.
Part-of-speech tagging is performed on each word unit to obtain word units carrying part-of-speech information: the part-of-speech information of the word unit "The" is a definite article, that of "Silk" is an adjective, that of "Road" is a noun, that of "is" is a verb, that of "divided" is a verb, and that of "into" is a preposition; the part-of-speech information of the other word units follows by analogy and is not described again. Word embedding processing is performed on the word units carrying the part-of-speech information to obtain the second word vectors E_y1 to E_y46 corresponding to the word units.
The first word vector (E_x1 to E_x46) of each word unit is spliced with the second word vector (E_y1 to E_y46) to obtain the spliced vector (E_x1+E_y1 to E_x46+E_y46) corresponding to the word unit. The spliced vectors (E_x1+E_y1 to E_x46+E_y46) are input into the answer obtaining model for processing to obtain the third word vector (E_z1 to E_z46) corresponding to each word unit, and through linear mapping and nonlinear transformation, the probabilities of each word unit serving as the answer start position and the answer end position are obtained as follows.
Answer start position probability: [0.28,0.27,0.55,0.23,0.12,0.40,0.33,0.60,0.11,0.22,0.61,0.65,0.40,0.29,0.44,0.38,0.60,0.35,0.39,0.16,0.97,0.57,0.10,0.11,0.31,0.22,0.31,0.18,0.62,0.07,0.52,0.33,0.51,0.77,0.10,0.40,0.40,0.29,0.28,0.28,0.46,0.91,0.15,0.27,0.14,0.09]
Answer end position probability: [0.26,0.11,0.54,0.72,0.27,0.64,0.41,0.14,0.78,0.87,0.66,0.27,0.16,0.21,0.05,0.39,0.66,0.27,0.28,0.11,0.13,0.39,0.51,0.57,1.83,0.26,0.25,0.50,0.18,0.13,0.10,0.98,0.62,0.50,0.48,0.50,0.50,0.50,0.30,0.15,0.33,0.25,0.61,1.12,1.25,0.5]
Therefore, the word unit with the highest answer start position probability is the 22nd word unit, and the word unit with the highest answer end position probability is the 25th word unit, so the answer to the question to be answered is "the Western Han Dynasty".
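The mapping from the 22nd and 25th word units to the answer text can be checked with a short sketch. The word list below is an assumed segmentation of the example sentence and question (whitespace splitting, punctuation dropped), which yields the 46 word units implied by the indices x1 to x46; the function name `extract_span` is hypothetical.

```python
text = (
    "The Silk Road is divided into the Land Silk Road and the Sea Silk Road "
    "the Land Silk Road originated in the Western Han Dynasty "
    "and the Sea Silk Road originated in the Qin and Han Dynasties"
)
question = "which period did the Land Silk Road originate from"

# Assumed segmentation: the text comes first, followed by the question.
word_units = text.split() + question.split()


def extract_span(units, start_position, end_position):
    """Join the word units between two 1-based, inclusive positions."""
    return " ".join(units[start_position - 1:end_position])


answer = extract_span(word_units, 22, 25)
print(len(word_units), answer)
```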
According to the text analysis method described above, the vector obtained through BERT model processing and the vector carrying part-of-speech information obtained through part-of-speech tagging are combined and input into the answer obtaining model formed by three layers of bidirectional LSTM models for further processing. Starting from a vector that combines the output information of the BERT model with the part-of-speech information, more and deeper computation can extract more semantic information and sentence meaning information. This effectively improves the richness and depth of the extracted semantic information and sentence meaning information, thereby improving the question-answering effect and the accuracy of the answer.
As shown in fig. 5, a text analysis apparatus includes:
the first processing module 510 is configured to input a text to be analyzed and a question to be answered into a text analysis model for processing, so as to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
A second processing module 520, configured to perform part-of-speech tagging on the text to be analyzed and the question to be answered, so as to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
A third processing module 530, configured to combine the first word vector with the second word vector, and input the combined first word vector and second word vector into an answer obtaining model for processing, so as to obtain a third word vector corresponding to each word unit.
A probability obtaining module 540 configured to obtain, based on the third word vector corresponding to each word unit, a probability that each word unit serves as an answer start position and an answer end position corresponding to the question to be answered.
An answer determining module 550 configured to determine an answer to the question to be answered based on the probability that each of the word units is the answer start position and the answer end position.
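The cooperation of the five modules can be sketched as a small pipeline object. The class, the field names and all five toy functions below are hypothetical stand-ins chosen so that the example runs; they are not the models described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class TextAnalysisApparatus:
    first_processing: Callable[[Sequence[str]], List[Vector]]          # module 510
    second_processing: Callable[[Sequence[str]], List[Vector]]         # module 520
    third_processing: Callable[[List[Vector], List[Vector]], List[Vector]]  # module 530
    probability_obtaining: Callable[[List[Vector]], Tuple[List[float], List[float]]]  # module 540
    answer_determining: Callable[[Sequence[str], List[float], List[float]], str]      # module 550

    def analyze(self, word_units: Sequence[str]) -> str:
        first = self.first_processing(word_units)
        second = self.second_processing(word_units)
        third = self.third_processing(first, second)
        start_probs, end_probs = self.probability_obtaining(third)
        return self.answer_determining(word_units, start_probs, end_probs)


# Toy stand-ins (assumptions): real modules would wrap the models described above.
def toy_first(units):
    return [[float(len(u))] for u in units]


def toy_second(units):
    return [[float(i)] for i, _ in enumerate(units)]


def toy_third(first, second):
    return [a + b for a, b in zip(first, second)]  # splice only, no further model


def toy_probs(third):
    scores = [sum(v) for v in third]
    total = sum(scores)
    probs = [s / total for s in scores]
    return probs, probs


def toy_answer(units, start_probs, end_probs):
    s = max(range(len(units)), key=start_probs.__getitem__)
    e = max(range(len(units)), key=end_probs.__getitem__)
    return " ".join(units[s:e + 1])


apparatus = TextAnalysisApparatus(toy_first, toy_second, toy_third, toy_probs, toy_answer)
result = apparatus.analyze(["my", "dog", "likes", "bones"])
print(result)
```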
Optionally, the text analysis apparatus further includes:
a word segmentation processing module configured to perform word segmentation processing on the text to be analyzed and the question to be answered respectively to obtain the word units.
Optionally, the text analysis apparatus further includes:
a dividing module configured to divide the text to be analyzed into at least one input unit.
And performing word segmentation processing on the input unit and the question to be answered respectively to obtain the word unit.
The first processing module 510 is further configured to:
and inputting each input unit and the question to be answered into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
Optionally, the first processing module 510 is further configured to:
and performing pre-embedding processing on the input unit and the question to be answered to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered.
And inputting the word vector, the sentence vector and the position vector into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
Optionally, the second processing module 520 is further configured to:
and respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit.
And performing part-of-speech tagging on each word unit to obtain the word unit carrying part-of-speech information.
And performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
Optionally, the third processing module 530 is further configured to:
and splicing the first word vector and the second word vector of each word unit to obtain a spliced vector corresponding to the word unit.
And inputting the spliced vector into the answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
Optionally, the answer obtaining model includes a first sub-layer, a second sub-layer, and a third sub-layer.
The third processing module 530, further configured to:
and inputting the splicing vector into the first sublayer for processing to obtain an output vector of the first sublayer.
And inputting the output vector of the first sublayer into the second sublayer for processing to obtain the output vector of the second sublayer.
And inputting the output vector of the second sublayer into the third sublayer for processing to obtain the third word vector.
Optionally, the probability obtaining module 540 is further configured to:
and performing linear mapping and nonlinear transformation on the third word vector corresponding to each word unit to respectively obtain the probability of each word unit serving as the answer starting position and the answer ending position corresponding to the question.
The text analysis device according to an embodiment of the present application can effectively improve the question and answer effect and the accuracy of the answer by combining vectors containing different information and performing further deeper processing.
An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor executes the instructions to implement the following steps:
inputting a text to be analyzed and a question to be answered into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
And performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered.
And combining the first word vector with the second word vector, and inputting the combined first word vector and second word vector into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit.
And obtaining the probability of each word unit serving as the answer starting position and the answer ending position corresponding to the question to be answered based on the third word vector corresponding to each word unit.
And determining the answer of the question to be answered based on the probability that each word unit is used as the answer starting position and the answer ending position.
An embodiment of the present application further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the text analysis method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text analysis method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text analysis method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art will appreciate that the present application is not limited by the described order of acts, since some steps may, in accordance with the present application, be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in this specification are all preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.
Claims (13)
1. A method of text analysis, comprising:
dividing a text to be analyzed into at least one input unit;
performing word segmentation processing on the input unit and the question to be answered respectively to obtain word units;
inputting each input unit and the question to be answered into a text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set;
performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
splicing the first word vector and the second word vector of each word unit to obtain a spliced vector corresponding to the word unit;
inputting the spliced vectors into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit;
performing linear mapping and nonlinear transformation on the third word vector corresponding to each word unit to respectively obtain the probability of each word unit serving as the answer starting position and the answer ending position corresponding to the question;
and comparing the probability of each word unit as the initial position of the answer, determining the initial position of the answer of the question to be answered, and comparing the probability of each word unit as the end position of the answer, and determining the end position of the answer of the question to be answered.
2. The method according to claim 1, before the inputting the text to be analyzed and the question to be answered into a text analysis model for processing to obtain the first word vector corresponding to each word unit in the text to be analyzed and the question to be answered, further comprising:
and respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit.
3. The method according to claim 1, wherein the step of inputting each input unit and the question to be answered as an input set into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the input set comprises:
pre-embedding the input unit and the question to be answered to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered;
and inputting the word vector, the sentence vector and the position vector into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
4. The text analysis method according to claim 1, wherein performing part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered comprises:
respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit;
performing part-of-speech tagging on each word unit to obtain a word unit carrying part-of-speech information;
and performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
5. The text analysis method of claim 1, wherein the answer obtaining model comprises a first sub-layer, a second sub-layer, and a third sub-layer;
inputting the stitching vector into the answer obtaining model for processing, and obtaining a third word vector corresponding to each word unit, wherein the step of inputting the stitching vector into the answer obtaining model for processing comprises the following steps:
inputting the splicing vector into the first sublayer for processing to obtain an output vector of the first sublayer;
inputting the output vector of the first sublayer into the second sublayer for processing to obtain the output vector of the second sublayer;
and inputting the output vector of the second sublayer into the third sublayer for processing to obtain the third word vector.
6. A text analysis apparatus, comprising:
a first processing module configured to divide a text to be analyzed into at least one input unit; performing word segmentation processing on the input unit and the question to be answered respectively to obtain word units; inputting each input unit and the question to be answered as an input set into a text analysis model for processing to obtain a first word vector corresponding to each word unit in the input set;
the second processing module is configured to perform part-of-speech tagging on the text to be analyzed and the question to be answered to obtain a second word vector corresponding to each word unit in the text to be analyzed and the question to be answered;
the third processing module is configured to splice the first word vector and the second word vector of each word unit to obtain a spliced vector corresponding to the word unit; inputting the spliced vectors into an answer obtaining model for processing to obtain a third word vector corresponding to each word unit;
the probability obtaining module is configured to perform linear mapping and nonlinear transformation on the third word vector corresponding to each word unit, and obtain the probability that each word unit is used as the answer starting position and the answer ending position corresponding to the question;
and the answer determining module is configured to compare the probability of each word unit as the answer starting position, determine the starting position of the answer of the question to be answered, compare the probability of each word unit as the answer ending position, and determine the ending position of the answer of the question to be answered.
7. The text analysis apparatus according to claim 6, further comprising:
and the word segmentation processing module is configured to perform word segmentation processing on the text to be analyzed and the question to be answered respectively to obtain the word unit.
8. The text analysis apparatus of claim 6, wherein the first processing module is further configured to:
pre-embedding the input unit and the question to be answered to obtain a word vector, a sentence vector and a position vector of each word unit in the input unit and the question to be answered;
and inputting the word vector, the sentence vector and the position vector into the text analysis model as an input set for processing to obtain a first word vector corresponding to each word unit in the input set.
9. The text analysis apparatus of claim 6, wherein the second processing module is further configured to:
respectively carrying out word segmentation processing on the text to be analyzed and the question to be answered to obtain the word unit;
performing part-of-speech tagging on each word unit to obtain a word unit carrying part-of-speech information;
and performing word embedding processing on the word units carrying the part of speech information to obtain the second word vector corresponding to each word unit.
10. The text analysis device of claim 6, wherein the answer obtaining model comprises a first sub-layer, a second sub-layer, and a third sub-layer;
the third processing module further configured to:
inputting the splicing vector into the first sublayer for processing to obtain an output vector of the first sublayer;
inputting the output vector of the first sublayer into the second sublayer for processing to obtain the output vector of the second sublayer;
and inputting the output vector of the second sublayer into the third sublayer for processing to obtain the third word vector.
11. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-5 when executing the instructions.
12. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 5.
13. A chip storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910649742.XA CN110347802B (en) | 2019-07-17 | 2019-07-17 | Text analysis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110347802A (en) | 2019-10-18 |
CN110347802B (en) | 2022-09-02 |
Family
ID=68178782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910649742.XA Active CN110347802B (en) | 2019-07-17 | 2019-07-17 | Text analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347802B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781663B (en) * | 2019-10-28 | 2023-08-29 | 北京金山数字娱乐科技有限公司 | Training method and device of text analysis model, text analysis method and device |
CN110837558B (en) * | 2019-11-07 | 2022-04-15 | 成都星云律例科技有限责任公司 | Judgment document entity relation extraction method and system |
CN111241244B (en) * | 2020-01-14 | 2024-10-11 | 平安科技(深圳)有限公司 | Answer position acquisition method, device, equipment and medium based on big data |
CN113127729B (en) * | 2020-01-16 | 2024-08-09 | 深圳绿米联创科技有限公司 | Household scheme recommendation method and device, electronic equipment and storage medium |
CN113535887B (en) * | 2020-04-15 | 2024-04-02 | 北京金山数字娱乐科技有限公司 | Formula similarity detection method and device |
CN114648022A (en) * | 2020-12-17 | 2022-06-21 | 北京金山数字娱乐科技有限公司 | Text analysis method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109753661A (en) * | 2019-01-11 | 2019-05-14 | 国信优易数据有限公司 | A kind of machine reads understanding method, device, equipment and storage medium |
CN109766423A (en) * | 2018-12-29 | 2019-05-17 | 上海智臻智能网络科技股份有限公司 | Answering method and device neural network based, storage medium, terminal |
WO2019106965A1 (en) * | 2017-12-01 | 2019-06-06 | 日本電信電話株式会社 | Information processing device, information processing method, and program |
CN109977428A (en) * | 2019-03-29 | 2019-07-05 | 北京金山数字娱乐科技有限公司 | A kind of method and device that answer obtains |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10909329B2 (en) * | 2015-05-21 | 2021-02-02 | Baidu Usa Llc | Multilingual image question answering |
Also Published As
Publication number | Publication date |
---|---|
CN110347802A (en) | 2019-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11288295B2 (en) | Utilizing word embeddings for term matching in question answering systems | |
CN110347802B (en) | Text analysis method and device | |
CN109522553B (en) | Named entity identification method and device | |
CN109977428B (en) | Answer obtaining method and device | |
CN110765244A (en) | Method and device for acquiring answering, computer equipment and storage medium | |
CN112131366A (en) | Method, device and storage medium for training text classification model and text classification | |
CN113127624B (en) | Question-answer model training method and device | |
Shen et al. | Kwickchat: A multi-turn dialogue system for aac using context-aware sentence generation by bag-of-keywords | |
CN110609886A (en) | Text analysis method and device | |
CN111930914A (en) | Question generation method and device, electronic equipment and computer-readable storage medium | |
CN112287085B (en) | Semantic matching method, system, equipment and storage medium | |
CN113536801A (en) | Reading understanding model training method and device and reading understanding method and device | |
CN112632258A (en) | Text data processing method and device, computer equipment and storage medium | |
Lyu et al. | Deep learning for textual entailment recognition | |
CN115878752A (en) | Text emotion analysis method, device, equipment, medium and program product | |
CN114282592A (en) | Deep learning-based industry text matching model method and device | |
CN110705310B (en) | Article generation method and device | |
Dündar et al. | A Hybrid Approach to Question-answering for a Banking Chatbot on Turkish: Extending Keywords with Embedding Vectors. | |
CN113590768B (en) | Training method and device for text relevance model, question answering method and device | |
CN113342944B (en) | Corpus generalization method, apparatus, device and storage medium | |
Ling | Coronavirus public sentiment analysis with BERT deep learning | |
CN115292492A (en) | Method, device and equipment for training intention classification model and storage medium | |
CN115795007A (en) | Intelligent question-answering method, intelligent question-answering device, electronic equipment and storage medium | |
CN114691716A (en) | SQL statement conversion method, device, equipment and computer readable storage medium | |
JP2018010481A (en) | Deep case analyzer, deep case learning device, deep case estimation device, method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||