CN112906380B - Character recognition method and device in text, readable medium and electronic equipment - Google Patents
- Publication number
- CN112906380B (CN202110145123.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The disclosure relates to a method and a device for recognizing characters in text, a readable medium and electronic equipment, and relates to the technical field of electronic information processing. The method comprises the following steps: acquiring each word included in a text to be recognized and a word vector corresponding to each word; determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word; combining the word vector corresponding to each word and the word vectors corresponding to the associated words corresponding to the word into a combined vector corresponding to the word, so as to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises the combined vector corresponding to each word in the text to be recognized; and determining character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model. The method and the device can improve the accuracy of character entity recognition.
Description
Technical Field
The disclosure relates to the technical field of electronic information processing, and in particular relates to a method and a device for identifying characters in text, a readable medium and electronic equipment.
Background
With the continuous development of electronic information technology, people's entertainment and daily life have become increasingly rich, and reading electronic books has become a mainstream way of reading. So that a user can obtain the information in an electronic book by listening when reading is inconvenient, or can read and listen at the same time and take in the information through both sight and hearing, corresponding audio is often prerecorded for the electronic book. To make the audio more expressive, the dialogue of different characters in the electronic book can be recorded in different timbres, which requires first identifying the different characters in the electronic book. In general, each character in the electronic book has to be labeled manually, so processing efficiency and accuracy are low.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for identifying a character in text, the method including:
acquiring each word included in a text to be recognized and a word vector corresponding to each word;
determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word;
combining a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combined vector corresponding to the word to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises a combined vector corresponding to each word in the text to be recognized;
and determining the character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model.
In a second aspect, the present disclosure provides an apparatus for recognizing characters in text, the apparatus comprising:
an acquisition module, configured to acquire each word included in a text to be recognized and a word vector corresponding to each word;
a determining module, configured to determine word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word;
a processing module, configured to combine the word vector corresponding to each word and the word vectors corresponding to the associated words corresponding to the word into a combined vector corresponding to the word, so as to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises the combined vector corresponding to each word in the text to be recognized;
and a recognition module, configured to determine the character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
With the above technical solution, each word in the text to be recognized and its corresponding word vector are obtained first. The word vectors corresponding to the associated words of each word are then determined, where the associated words are determined according to the combined words corresponding to the word. Next, the word vector corresponding to each word and the word vectors corresponding to its associated words are combined into a combined vector corresponding to the word, which yields a combined vector sequence corresponding to the text to be recognized that includes the combined vector for each word. Finally, the character entities included in the text to be recognized are determined according to the combined vector sequence and the pre-trained recognition model. Because the recognition process considers both each word in the text to be recognized and the associated words of each word, the accuracy of character entity recognition is improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart illustrating a method of identifying characters in text according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating another method of identifying characters in text according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating another method of identifying characters in text according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating another method of identifying characters in text according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating a training recognition model, according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating an in-text character recognition apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram illustrating another in-text character recognition apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram illustrating another in-text character recognition apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram illustrating another in-text character recognition apparatus according to an exemplary embodiment;
FIG. 10 is a block diagram of an electronic device, according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
FIG. 1 is a flowchart illustrating a method of recognizing characters in text according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:
Step 101, each word included in the text to be recognized and a word vector corresponding to each word are obtained.
For example, a text to be recognized that may include character entities is obtained first. The text to be recognized may be, for example, one or more sentences, one or more paragraphs, or one or more chapters in a text file specified by the user (which may be understood as the specified total text mentioned later). The text file may be, for example, an electronic book, or another type of file, such as a news article, an official account article, or a blog post. Each word included in the text to be recognized and the word vector corresponding to each word are then extracted. For example, the word vector corresponding to each word may be looked up in turn in a pre-trained word vector table, or generated by a pre-trained Word2vec model.
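The table lookup described for step 101 can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the table contents, the three-dimensional vectors, and the zero-vector fallback for words missing from the table (`UNKNOWN_VECTOR`) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical pre-trained word vector table (toy 3-dimensional vectors).
VECTOR_TABLE: Dict[str, List[float]] = {
    "a": [0.1, 0.2, 0.3],
    "b": [0.4, 0.5, 0.6],
}

# Assumed fallback for words absent from the table; the source does not
# say how unseen words are handled.
UNKNOWN_VECTOR: List[float] = [0.0, 0.0, 0.0]


def lookup_word_vectors(words: List[str]) -> List[List[float]]:
    """Return one vector per word, in the order the words appear in the text."""
    return [VECTOR_TABLE.get(word, UNKNOWN_VECTOR) for word in words]


vectors = lookup_word_vectors(["a", "b", "z"])
```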
Step 102, determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word.
For each word in the text to be recognized, the combined words corresponding to the word may be determined first, and the associated words may then be determined according to those combined words; there may be one associated word or several. Finally, the word vectors corresponding to the associated words are determined. For example, the word vector corresponding to each associated word may be looked up in turn in a pre-trained word vector table, or generated by a pre-trained Word2vec model. A combined word consists of the word and a preset number of words adjacent to it, that is, the word together with its context. For example, if the preset number is two, the combined words are the words formed by the word with the one word before it, with the two words before it, with the one word after it, and with the two words after it in the text to be recognized. The associated words may be understood as all of the combined words, or as those combined words that meet a specified requirement (for example, matching a preset dictionary). For example, the preset number is two, and the text to be recognized is: "there is still no message from mobile phone". The fifth word in the text to be recognized then has four combined words: the one formed with the preceding word, the one formed with the two preceding words, the one formed with the following word, and the one formed with the two following words. All four combined words may be used as associated words, or only those of the four that match a preset dictionary may be used as associated words; for example, "mobile phone" may be used as the associated word.
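The construction of combined words can be sketched as below, under the reading that each combined word joins the word with its 1 to N preceding words or with its 1 to N following words, where N is the preset number. The function name `combined_words` and the placeholder letters standing in for the words of a text are hypothetical.

```python
from typing import List


def combined_words(words: List[str], index: int, preset_number: int) -> List[str]:
    """Build the combined words for words[index].

    Each combined word joins the word with k preceding words or with
    k following words, for k = 1..preset_number, skipping spans that
    would run past either end of the text.
    """
    result: List[str] = []
    # Extend to the left: one word before, two words before, ...
    for k in range(1, preset_number + 1):
        if index - k >= 0:
            result.append("".join(words[index - k:index + 1]))
    # Extend to the right: one word after, two words after, ...
    for k in range(1, preset_number + 1):
        if index + k < len(words):
            result.append("".join(words[index:index + k + 1]))
    return result


# Placeholder text of seven single-letter "words"; index 4 is the fifth word.
example = combined_words(list("abcdefg"), index=4, preset_number=2)
```

With a preset number of two, a word in the middle of the text yields four combined words, while a word at the start of the text yields only the two that extend to the right.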
Step 103, combining the word vector corresponding to each word and the word vectors corresponding to the associated words corresponding to the word into a combined vector corresponding to the word, so as to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises the combined vector corresponding to each word in the text to be recognized.
Step 104, determining the character entities included in the text to be recognized according to the combined vector sequence and the pre-trained recognition model.
For each word in the text to be recognized, the word vector corresponding to the word and the word vectors corresponding to the associated words of the word may be combined to obtain a combined vector corresponding to the word, thereby obtaining a combined vector sequence corresponding to the text to be recognized. That is, the combined vector corresponding to each word includes the word vector corresponding to the word and the word vectors corresponding to its associated words. For example, suppose the text to be recognized includes 20 words and each word corresponds to a 1×100 word vector. If a certain word has two associated words, each corresponding to a 1×100 word vector, then the combined vector corresponding to that word is 1×300. Finally, the combined vector sequence is input to the pre-trained recognition model, and the character entities included in the text to be recognized are determined from the output of the recognition model. There may be zero character entities (i.e., no character entity exists in the text to be recognized), one, or more. A character entity may be, for example, a name, a pronoun (e.g., you, me, she, he, your), an anthropomorphic animal, or an anthropomorphic object included in the text to be recognized. Specifically, the recognition model may directly output the character entities, or it may output a label for each word in the text to be recognized, and the character entities are then determined according to the labels. In the latter case, the recognition model labels each word in the text to be recognized to indicate whether the word belongs to a character entity.
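The dimension arithmetic in this example (one 1×100 word vector plus two 1×100 associated-word vectors giving a 1×300 combined vector) is consistent with simple concatenation, which the sketch below assumes. The function name `combine` is hypothetical, and the sketch does not address how combined vectors of different lengths are aligned before being fed to a model, which the text does not specify.

```python
from typing import List


def combine(word_vector: List[float],
            associated_vectors: List[List[float]]) -> List[float]:
    """Concatenate a word's vector with the vectors of its associated words."""
    combined = list(word_vector)
    for vector in associated_vectors:
        combined.extend(vector)
    return combined


word_vector = [0.0] * 100                       # 1 x 100 vector for the word
associated = [[1.0] * 100, [2.0] * 100]         # two associated words
combined_vector = combine(word_vector, associated)
no_association = combine(word_vector, [])       # word without associated words
```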
The recognition model may be a deep learning model trained in advance on a large number of training samples. Its structure may be, for example, a combination of Transformer + CRF (Conditional Random Field), which has a higher recognition efficiency than a combination of BLSTM (Bidirectional Long Short-Term Memory network) + CRF. For example, the combined vector sequence may be used as the input of the Transformer to obtain feature vectors, output by the Transformer, that represent the combined vector sequence. The feature vectors are then used as the input of the CRF to obtain the label, output by the CRF, corresponding to each combined vector in the combined vector sequence, i.e., the label of each word in the text to be recognized. Because the combined vector sequence includes not only the word vector corresponding to each word but also the word vectors corresponding to the associated words of each word, the recognition model can learn the relationship between each word and its associated words. This avoids missed words and extra words in recognized character entities and improves the accuracy with which the recognition model labels character entities.
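The text does not say how the CRF chooses its output labels. A common choice for linear-chain CRFs is Viterbi decoding, sketched below purely as an illustration: the per-word emission scores stand in for the feature vectors produced by the upstream encoder, and the scores, the label set, and the function name `viterbi` are all made up for the example.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[i][y] is the score of label y at position i, and
    transitions[p][y] is the score of moving from label p to label y.
    """
    if not emissions:
        return []
    # best[i][y]: best score of any label path ending in y at position i.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "1"]
EMISSIONS = [{"O": 1.0, "1": 0.0}, {"O": 0.0, "1": 2.0}, {"O": 3.0, "1": 0.0}]
FREE = {"O": {"O": 0.0, "1": 0.0}, "1": {"O": 0.0, "1": 0.0}}
STICKY = {"O": {"O": 0.0, "1": -10.0}, "1": {"O": -10.0, "1": 0.0}}

free_path = viterbi(EMISSIONS, FREE, LABELS)
sticky_path = viterbi(EMISSIONS, STICKY, LABELS)
```

The second call shows why transition scores matter: when switching labels is heavily penalized, the decoder prefers a consistent label sequence even where a single emission score disagrees.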
In summary, the present disclosure firstly obtains each word and a corresponding word vector in a text to be recognized, then determines a word vector corresponding to an associated word corresponding to each word, where the associated word is determined according to a combined word corresponding to the word, and then composes the word vector corresponding to each word and the word vector corresponding to the associated word into a combined vector corresponding to the word, thereby obtaining a combined vector sequence corresponding to the text to be recognized, including the combined vector corresponding to each word, and finally determines a character entity included in the text to be recognized according to the combined vector sequence and a recognition model trained in advance. In the process of identifying the character entity, each word included in the text to be identified is considered, and the associated word associated with each word is considered, so that the accuracy of identifying the character entity is improved.
Fig. 2 is a flowchart illustrating another method of identifying characters in text, according to an exemplary embodiment, as shown in fig. 2, step 102 may include:
Step 1021, for each word, obtaining a combined word composed of the word and a preset number of words adjacent to the word.
Step 1022, taking the combined word matched with the preset word dictionary as the associated word corresponding to the word, and obtaining the word vector corresponding to the associated word.
For example, for each word in the text to be recognized, the combined words formed by the word and a preset number of words adjacent to the word may be determined first. Taking a preset number of three as an example, the combined words corresponding to the word are the words formed by the word with up to three words before it and with up to three words after it in the text to be recognized. Each combined word is then matched in turn against a preset word dictionary. If it matches, the combined word is determined to be an associated word corresponding to the word, and the word vector corresponding to the associated word is obtained. The number of associated words may be zero (i.e., none of the combined words corresponding to the word matches the word dictionary), one, or more. If a word has no associated word, the combined vector corresponding to the word is simply the word vector corresponding to the word. The word dictionary can be understood as a dictionary in which a large number of character entities have been collected in advance. Screening the combined words against the word dictionary removes a large amount of semantically meaningless interference, ensures that each word is genuinely related to its associated words, and thereby improves the accuracy of character entity recognition. For example, the text to be recognized is: "The weather is good today; is Miss going out?". The first word has no preceding words, so it is combined with the one, two, and three words that follow it to obtain three combined words, among them "today" and "today's weather". The three combined words are matched in turn against the word dictionary, and the matched associated word is: "today".
For another example, the eighth word may be combined with the three words before it and the three words after it to obtain six combined words: the word with the one, two, and three words before it, and the word with the one, two, and three words after it. The six combined words are matched in turn against the word dictionary, and the matched associated word is "Miss". It should be noted that, in the above embodiment, punctuation marks included in the text to be recognized may each be treated as one word.
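Steps 1021 and 1022 can be sketched together as follows: generate the combined words for a word, then keep only those found in the word dictionary. The function name `associated_words`, the placeholder letters, and the toy dictionary are assumptions; punctuation is treated as an ordinary word, as the text allows.

```python
from typing import List, Set


def associated_words(words: List[str], index: int, preset_number: int,
                     dictionary: Set[str]) -> List[str]:
    """Return the combined words of words[index] that match the dictionary."""
    candidates: List[str] = []
    for k in range(1, preset_number + 1):
        if index - k >= 0:                      # word plus k preceding words
            candidates.append("".join(words[index - k:index + 1]))
        if index + k < len(words):              # word plus k following words
            candidates.append("".join(words[index:index + k + 1]))
    return [candidate for candidate in candidates if candidate in dictionary]


# Placeholder text; "," and "?" are punctuation marks treated as words.
TEXT = list("ab,cde?")
WORD_DICTIONARY = {"cd", "b,c"}

matches = associated_words(TEXT, index=3, preset_number=3,
                           dictionary=WORD_DICTIONARY)
no_matches = associated_words(TEXT, index=0, preset_number=3,
                              dictionary=WORD_DICTIONARY)
```

When the returned list is empty, the word has no associated word, and its combined vector reduces to its own word vector.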
FIG. 3 is a flowchart illustrating another method of identifying characters in text, according to an exemplary embodiment, as shown in FIG. 3, step 104 may include:
Step 1041, inputting the combined vector sequence into the recognition model to obtain an attribute tag corresponding to each word in the text to be recognized output by the recognition model, where the attribute tag is used to indicate whether the word belongs to a character entity.
Step 1042, determining the character entity included in the text to be recognized according to the attribute tags corresponding to each word in the text to be recognized.
For example, the combined vector sequence may first be input into the recognition model to obtain the attribute tag, output by the recognition model, corresponding to each word in the text to be recognized, where the attribute tag indicates whether the corresponding word belongs to a character entity. For example, an attribute tag of 1 indicates that the corresponding word belongs to a character entity, and an attribute tag of 0 indicates that it does not. The recognition model may be understood as a seq2seq model: its input is the combined vector sequence, which includes the combined vector corresponding to each word in the text to be recognized, and its output is an attribute tag set, which includes the attribute tag corresponding to each word in the text to be recognized. For example, the text to be recognized is: "there is still no message from mobile phone". The combined vector sequence corresponding to the text to be recognized is input into the recognition model, and the attribute tag set output by the recognition model is: 00001100000. It can then be determined that the character entity is: "mobile phone".
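Reading character entities off a 0/1 attribute tag set can be sketched as below. Treating each run of adjacent 1 tags as one entity is an assumption consistent with the 00001100000 example; the function name and the placeholder letters are hypothetical.

```python
from itertools import groupby
from typing import List


def entities_from_binary_tags(words: List[str], tags: str) -> List[str]:
    """Join each run of consecutive words tagged "1" into one entity."""
    entities: List[str] = []
    position = 0
    for tag, run in groupby(tags):
        length = len(list(run))
        if tag == "1":
            entities.append("".join(words[position:position + length]))
        position += length
    return entities


# Eleven placeholder words; the fifth and sixth are tagged as an entity.
found = entities_from_binary_tags(list("abcdefghijk"), "00001100000")
```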
In one application scenario, the attribute tag may indicate, in addition to whether the corresponding word belongs to a character entity, whether the word belongs to a single-word character entity (i.e., the word by itself is a character entity) or a multi-word character entity (i.e., the character entity includes multiple words). If the word belongs to a multi-word character entity, the attribute tag may also indicate the position of the word in the character entity, i.e., whether that position is a start position, an end position, or an intermediate position.
An attribute tag indicating a start position means the word is the first word of the character entity; an attribute tag indicating an end position means the word is the last word of the character entity; and an attribute tag indicating an intermediate position means the word is any word in the middle of the character entity. For example, the attribute tag O indicates that the corresponding word does not belong to a character entity, S indicates that the word is a single-word character entity, B indicates that the word is at the start position of a multi-word character entity, M indicates that the word is at an intermediate position of a multi-word character entity, and E indicates that the word is at the end position of a multi-word character entity.
Accordingly, the implementation manner of step 1042 may be:
if the attribute label corresponding to the target word indicates that the target word belongs to the character entity, determining the character entity comprising the target word according to the position of the target word indicated by the attribute label in the character entity, wherein the target word is any word in the text to be recognized.
For example, if the attribute tag corresponding to the target word indicates that the target word belongs to a character entity and is a single-word character entity, the target word may be used directly as a character entity. If the attribute tag indicates that the target word belongs to a multi-word character entity, the character entity containing the target word can be further determined according to the position of the target word in the character entity indicated by the attribute tag.
If the attribute tag indicates that the target word is at the start position of a character entity, the attribute tags of the words after the target word are examined in turn until the attribute tag of some word indicates the end position. The words from the target word through that word then form one character entity. For example, the text to be recognized is: "Miss, do you want to go out", and the attribute tags output by the recognition model for the words are: BEOSOOOOO. It can be determined that the two words with attribute tags B and E form one character entity, "Miss", and that the word with attribute tag S forms another character entity, "you". That is, the text to be recognized includes two character entities: "Miss" and "you".
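The B/M/E/S/O decoding described above can be sketched as follows. The function name `decode_entities` and the placeholder letters are hypothetical, and silently discarding malformed spans (for example, an E with no preceding B) is an assumption the text does not address.

```python
from typing import List, Optional


def decode_entities(words: List[str], tags: str) -> List[str]:
    """Recover character entities from per-word B/M/E/S/O attribute tags."""
    entities: List[str] = []
    start: Optional[int] = None      # index of the currently open B tag
    for i, tag in enumerate(tags):
        if tag == "S":               # single-word entity
            entities.append(words[i])
            start = None
        elif tag == "B":             # start of a multi-word entity
            start = i
        elif tag == "E" and start is not None:
            entities.append("".join(words[start:i + 1]))
            start = None
        elif tag == "O":             # outside any entity: drop an open span
            start = None
        # "M" keeps the current span open.
    return entities


# Nine placeholder words tagged like the example above.
decoded = decode_entities(list("abcdefghi"), "BEOSOOOOO")
```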
After determining the character entities included in the text to be recognized, the occurrence times of different character entities can be counted, so that the character entity with the largest occurrence times can be used as the main character entity in the text to be recognized. Further, a target character entity to which the dialogue sentence included in the text to be recognized belongs may be determined from among the recognized character entities. A specific description of how the target character entity is determined is provided below.
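Selecting the main character entity by occurrence count can be sketched with a simple counter. The function name `main_character_entity` is hypothetical, and the text does not say how ties are broken, so this sketch makes no promise about them.

```python
from collections import Counter
from typing import List, Optional


def main_character_entity(entities: List[str]) -> Optional[str]:
    """Return the most frequently recognized entity, or None if there are none."""
    if not entities:
        return None
    return Counter(entities).most_common(1)[0][0]


main_entity = main_character_entity(["Miss", "you", "Miss"])
```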
Fig. 4 is a flowchart illustrating another method for recognizing characters in text according to an exemplary embodiment. As shown in fig. 4, the text to be recognized includes a first text to be recognized, which corresponds to any dialogue sentence in a specified total text, and a second text to be recognized, which corresponds to a sentence in the specified total text whose distance from the dialogue sentence corresponding to the first text to be recognized satisfies a preset condition. After step 104, the method may further include:
Step 105, determining attribute features corresponding to each character entity included in the text to be recognized, where the attribute features include one or more of: a first positional relationship between the character entity and the first text to be recognized, a second positional relationship between the text to which the character entity belongs and the first text to be recognized, and a dialogue attribute of the text to which the character entity belongs.
For example, the text to be recognized may include a first text to be recognized, corresponding to any dialogue sentence in the specified total text, and a second text to be recognized, corresponding to a sentence in the specified total text whose distance from that dialogue sentence satisfies a preset condition. The specified total text includes the text corresponding to each of a plurality of sentences; for example, it may be an electronic book specified by a user, or a chapter or section of an electronic book. The sentences included in the specified total text may be classified into two types according to whether they contain a dialogue symbol: dialogue sentences and non-dialogue sentences. The dialogue symbol identifies a sentence as a dialogue sentence and may be, for example, a pair of double quotation marks or another quotation symbol, which is not particularly limited in this disclosure.
The text corresponding to any dialogue sentence included in the specified total text can then be used as the first text to be recognized, and the second text to be recognized corresponding to it is determined. In the specified total text, the distance between the sentence corresponding to the first text to be recognized and the sentence corresponding to the second text to be recognized satisfies the preset condition, and the second text to be recognized may correspond to one or more sentences. In other words, the second text to be recognized is associated with the first text to be recognized and can be understood as its context. A sentence corresponding to the second text to be recognized may be a dialogue sentence or a non-dialogue sentence. Taking a preset condition of at most three sentences as an example, the second text to be recognized may be the text corresponding to the three sentences before and the three sentences after the first text to be recognized in the specified total text (six sentences in total).
The attribute features corresponding to each character entity are then determined. An attribute feature can be understood as a feature that reflects the relationship between the character entity and the first text to be recognized. The attribute features may include, for example, one or more of the following: the first positional relationship between the character entity and the first text to be recognized, the second positional relationship between the text to which the character entity belongs and the first text to be recognized, and the dialogue attribute of the text to which the character entity belongs. The first positional relationship may indicate whether the character entity belongs to the first text to be recognized. The second positional relationship may indicate whether, in the specified total text, the text to which the character entity belongs is located before or after the first text to be recognized. The dialogue attribute may indicate whether the sentence corresponding to the text to which the character entity belongs is a dialogue sentence.
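The sentence classification and context selection described above can be sketched as follows. The set of dialogue symbols in `DIALOGUE_MARKS`, the helper names, and the use of list indices to measure sentence distance are assumptions made for illustration.

```python
from typing import List

# Assumed dialogue symbols: straight and curly double quotation marks.
DIALOGUE_MARKS = ('"', "\u201c", "\u201d")

def is_dialogue(sentence: str) -> bool:
    """A sentence counts as a dialogue sentence if it contains a dialogue symbol."""
    return any(mark in sentence for mark in DIALOGUE_MARKS)

def context_sentences(sentences: List[str], index: int,
                      max_distance: int = 3) -> List[str]:
    """Sentences at most max_distance away from sentences[index], excluding it."""
    start = max(0, index - max_distance)
    end = min(len(sentences), index + max_distance + 1)
    return sentences[start:index] + sentences[index + 1:end]

# Placeholder total text: ten sentences, the sixth being a dialogue sentence.
sentences = [f"s{i}" for i in range(10)]
sentences[5] = '"Where are you going?"'

first_text = sentences[5]
second_text = context_sentences(sentences, 5)
print(is_dialogue(first_text), second_text)
```

Near the beginning or end of the total text, fewer than six context sentences are available; this sketch simply returns the ones that exist.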
Step 106, for each character entity, inputting the first text to be recognized, the second text to be recognized, the character entity and the attribute features corresponding to the character entity into a pre-trained attribution identification model, so as to obtain the matching degree, output by the attribution identification model, between the character entity and the first text to be recognized.
Step 107, determining, according to the matching degree between each character entity and the first text to be recognized, the target character entity to which the dialogue sentence corresponding to the first text to be recognized belongs.
For example, the first text to be recognized, the second text to be recognized, each character entity and the attribute features corresponding to that character entity may be used as the input of a pre-trained attribution identification model, which outputs the matching degree between the character entity and the first text to be recognized. The matching degree can be understood as the probability that the dialogue sentence corresponding to the first text to be recognized belongs to the character entity. The attribution identification model may be a deep learning model trained in advance on a large number of training samples, and its structure may be, for example, a combination of BLSTM + Dense_layer + softmax. Specifically, the first and second texts to be recognized may be converted into a corresponding text feature sequence (i.e. a Text Embedding), and the character entity may be converted into a corresponding word vector. The text feature sequence, the word vector and the attribute features corresponding to the character entity are then spliced together and used as the input of the BLSTM, which outputs a feature vector that comprehensively represents the first text to be recognized, the second text to be recognized, the character entity and its attribute features. This feature vector is used as the input of the Dense_layer, the output of the Dense_layer is used as the input of the softmax, and the probability value output by the softmax is taken as the matching degree between the character entity and the first text to be recognized.
For example, if the first text to be recognized includes 20 words, the second text to be recognized includes 50 words, and each word corresponds to a 1×300 word vector, the first and second texts to be recognized are converted into a 70×300 text feature sequence. If the character entity corresponds to a 1×300 word vector and the attribute features corresponding to the character entity form a 1×11 vector, the input to the attribution identification model is a 70×(300+300+11) vector.
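The dimension arithmetic in this example can be checked with a short sketch. It assumes that "splicing" means appending the same entity vector and attribute features to every row of the text feature sequence, which is one reading consistent with the stated 70×(300+300+11) shape; the function name `build_model_input` and the zero-valued placeholder vectors are illustrative only.

```python
from typing import List

def build_model_input(text_vectors: List[List[float]],
                      entity_vector: List[float],
                      attribute_features: List[float]) -> List[List[float]]:
    """Append the entity vector and attribute features to every word row."""
    return [row + entity_vector + attribute_features for row in text_vectors]

first_len, second_len, dim = 20, 50, 300
# Placeholder values; real inputs would be learned embeddings.
text_vectors = [[0.0] * dim for _ in range(first_len + second_len)]
entity_vector = [0.0] * dim
attribute_features = [0.0] * 11

model_input = build_model_input(text_vectors, entity_vector, attribute_features)
print(len(model_input), len(model_input[0]))  # 70 611
```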
Further, after the matching degree between each character entity and the first text to be recognized is obtained, the target character entity to which the dialogue sentence corresponding to the first text to be recognized belongs may be determined among the character entities; that is, the dialogue sentence corresponding to the first text to be recognized is attributed to the target character entity (it is determined to be spoken by the target character entity). For example, the character entity with the highest matching degree may be used as the target character entity. Alternatively, the character entities may be sorted in descending order of matching degree, a specified number (for example, three) of the top-ranked character entities may be presented to the user, and the user may select the target character entity. After the target character entity is determined, it may be associated with the first text to be recognized as a tag. When the first text to be recognized is reached while recording the audio corresponding to the specified total text, the target character entity can then be determined from this tag, and the recording can use the timbre allocated to the target character entity in advance.
In this way, when the attribution of the dialogue sentence corresponding to the first text to be recognized is determined, the second text to be recognized, which is associated with the first text to be recognized, is considered in addition to the first text itself, so that the attribution identification model can learn the association between the two. By also using the attribute features of the character entities extracted from the first and second texts to be recognized, the attribution identification model can further learn the association between each character entity and the first text to be recognized. The target character entity to which the dialogue sentence corresponding to the first text to be recognized belongs is determined on this basis, which improves the efficiency and accuracy of dialogue attribution.
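Both selection strategies described above (taking the best match, or offering the top few candidates to a user) can be sketched as follows. The function names and the matching degrees in the example are hypothetical values chosen for illustration, not results from the disclosure.

```python
from typing import Dict, List, Tuple

def target_entity(match_degree: Dict[str, float]) -> str:
    """Pick the character entity with the highest matching degree."""
    return max(match_degree, key=match_degree.get)

def top_candidates(match_degree: Dict[str, float],
                   count: int = 3) -> List[Tuple[str, float]]:
    """Entities in descending order of matching degree, cut to `count`."""
    ranked = sorted(match_degree.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:count]

# Hypothetical matching degrees output for one dialogue sentence.
scores = {"Miss": 0.15, "you": 0.70, "maid": 0.10, "guard": 0.05}
print(target_entity(scores))
print(top_candidates(scores))
```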
In an application scenario, the attribute features corresponding to each character entity may include multiple features. The first positional relationship may be determined according to the character entity and the first text to be recognized, the second positional relationship according to the distance between the text to which the character entity belongs and the first text to be recognized, and the dialogue attribute according to the text to which the character entity belongs. For example, the attribute features may include the following 11 features:
Feature a indicates whether the character entity belongs to the first text to be recognized. If the character entity belongs to the first text to be recognized in the specified total text, feature a may be represented as 0. If it does not, feature a may be represented as 1 when the character entity is located after the first text to be recognized, and as -1 when the character entity is located before the first text to be recognized.
Feature b indicates whether the character entity belongs to a target paragraph, where the target paragraph is the paragraph to which the first text to be recognized belongs; that is, whether the character entity and the first text to be recognized are in the same paragraph. For example, feature b may be represented as 1 if the character entity belongs to the target paragraph, and as 0 if it does not.
Feature c indicates the distance between the character entity and the first text to be recognized, expressed as a rank: the distances between the text to which each character entity belongs and the first text to be recognized are sorted, and feature c is the position of the character entity's distance in that order. For example, suppose it is determined in step 104 that the text to be recognized includes 4 character entities, a, b, c and t, whose distances from the first text to be recognized are 2, 4, 3 and 2 respectively. After sorting, their ranks are 1, 3, 2 and 1, so the feature c corresponding to character entity b may be represented as 3.
Feature d indicates the distance between the text to which the character entity belongs and the first text to be recognized. For example, if the text to which the character entity belongs is 2 sentences away from the first text to be recognized, feature d may be represented as 2.
Feature e indicates whether the sentence corresponding to the text to which the character entity belongs is a dialogue sentence. For example, feature e may be represented as 1 if that sentence is a dialogue sentence, and as 0 if it is not.
Feature f indicates whether the text to which the character entity belongs includes a first dialogue template.
Feature g indicates whether the text to which the character entity belongs includes a second dialogue template.
Feature h indicates whether the text to which the character entity belongs includes a third dialogue template.
For example, the first dialogue template may include patterns such as "XX says:", "XX remarks:" and "XX smiles:", which indicate the start of a dialogue. The second dialogue template may include patterns such as "XX says.", "XX remarks." and "XX smiles.", which indicate the end of a dialogue. The third dialogue template may include words such as "says", "remarks" and "smiles", which indicate that a dialogue may occur. For each of these features, the value may be represented as 1 if the corresponding template is included and as 0 if it is not.
Feature i indicates the position of the character entity in the text to which it belongs, that is, its ordinal position among the character entities in that text. For example, if a text includes 3 character entities a, b and c in left-to-right order, the feature i corresponding to character entity a may be represented as 1, the feature i corresponding to character entity b as 2, and the feature i corresponding to character entity c as 3.
Feature j indicates whether, in the specified total text, the sentence immediately before the dialogue sentence corresponding to the first text to be recognized is a dialogue sentence. For example, feature j may be represented as 1 if that preceding sentence is a dialogue sentence, and as 0 if it is not.
Feature k indicates whether, in the specified total text, the sentence immediately after the dialogue sentence corresponding to the first text to be recognized is a dialogue sentence. For example, feature k may be represented as 1 if that following sentence is a dialogue sentence, and as 0 if it is not.
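Three of the features listed above can be sketched in a few lines of Python. The sketch assumes that positions are given as sentence indices in the specified total text and that the rank in feature c is a dense rank (equal distances share a rank), which is what the 2, 4, 3, 2 example implies; the function names are hypothetical.

```python
from typing import List

def feature_a(entity_sentence: int, first_sentence: int) -> int:
    """0 if the entity is in the first text, 1 if after it, -1 if before it."""
    if entity_sentence == first_sentence:
        return 0
    return 1 if entity_sentence > first_sentence else -1

def feature_c(distances: List[int]) -> List[int]:
    """Dense rank of each distance (1 = closest); equal distances tie."""
    rank = {d: r for r, d in enumerate(sorted(set(distances)), start=1)}
    return [rank[d] for d in distances]

def feature_d(entity_sentence: int, first_sentence: int) -> int:
    """Distance in sentences between the entity's text and the first text."""
    return abs(entity_sentence - first_sentence)

# Distances of character entities a, b, c and t from the example above.
print(feature_c([2, 4, 3, 2]))  # [1, 3, 2, 1]
```

With these ranks, the entity b (distance 4) receives feature c = 3, matching the example in the text.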
Fig. 5 is a flowchart illustrating the training of a recognition model according to an exemplary embodiment. As shown in fig. 5, the recognition model is trained as follows:
Step A, obtaining a word vector corresponding to each training word in a training text, a word vector corresponding to the training associated word of each training word in the training text, and labeling data corresponding to the training text, where the training associated word is determined according to the training combination word corresponding to the training word, the training combination word consists of the training word and a preset number of training words adjacent to it, and the labeling data includes the labeled character entities included in the training text.
Step B, for each training word, combining the word vector corresponding to the training word and the word vector corresponding to its training associated word into a training combination vector corresponding to the training word, so as to obtain a training combination vector sequence corresponding to the training text, where the training combination vector sequence includes the training combination vector corresponding to each training word.
Step C, inputting the training combination vector sequence into the recognition model, and training the recognition model according to the output of the recognition model and the labeling data.
For example, training the recognition model requires training text and the corresponding labeling data to be acquired in advance. The training text includes a plurality of training words, and the labeling data marks the labeled character entities included in the training text. For example, if the training text is "Miss, do you want to go out", the corresponding labeling data may be BEOSOOOOO, so that the labeled character entities are "Miss" and "you". A word vector corresponding to each training word and a word vector corresponding to the training associated word of each training word are then acquired.
For each training word, the training combination word corresponding to the training word can be determined first, and the corresponding training associated word is then determined according to that training combination word. There may be one training associated word or several. Finally, the word vector corresponding to each training associated word is obtained. The training combination word corresponding to a training word consists of the training word and a preset number of training words adjacent to it, and can be understood as a word formed by the training word and its context. For example, if the preset number is two, the training combination word is formed from the training word, the two training words before it and the two training words after it in the training text. The training associated words may be all of the training combination words, or only those training combination words that meet a specified requirement (for example, matching a preset dictionary).
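One possible reading of the combination-word and associated-word steps is sketched below. It assumes that the candidate combination words are all contiguous multi-unit spans inside the window that contain the current unit, and that associated words are the candidates found in a preset dictionary; both the enumeration rule and the names `combination_words` and `associated_words` are assumptions, since the text does not spell out how the candidates are formed.

```python
from typing import List, Set

def combination_words(units: List[str], index: int, window: int = 2) -> List[str]:
    """Contiguous spans within `window` units of units[index] that contain it."""
    lo = max(0, index - window)
    hi = min(len(units) - 1, index + window)
    spans = []
    for start in range(lo, index + 1):
        for end in range(index, hi + 1):
            if end > start:  # skip the single unit itself
                spans.append("".join(units[start:end + 1]))
    return spans

def associated_words(units: List[str], index: int,
                     dictionary: Set[str], window: int = 2) -> List[str]:
    """Combination words that match the preset dictionary."""
    return [w for w in combination_words(units, index, window) if w in dictionary]

units = list("abcde")               # placeholder text of five units
dictionary = {"bc", "cde", "xy"}    # hypothetical preset dictionary
print(associated_words(units, 2, dictionary))  # ['bc', 'cde']
```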
For each training word in the training text, the word vector corresponding to the training word and the word vector corresponding to its training associated word can be combined to obtain the training combination vector corresponding to the training word, which yields the training combination vector sequence corresponding to the training text, including the training combination vector of each training word. Finally, the training combination vector sequence is input into the recognition model, and the recognition model is trained according to its output and the labeling data. It will be appreciated that the recognition model outputs a label for each training word in the training text. The difference between the labels actually output by the recognition model and the labeling data can therefore be used as the loss function of the recognition model, and the parameters of the neurons in the recognition model, such as their weights and biases, can be corrected with the back propagation algorithm with the aim of reducing the loss function. These steps are repeated until the loss function satisfies a preset condition, for example until the loss function is smaller than a preset loss threshold.
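The training loop described above (compute a loss from predicted and labeled tags, adjust weights and biases by gradient, stop once the loss falls below a threshold) can be illustrated with a deliberately simplified stand-in. The sketch below trains a single-layer softmax classifier in plain Python rather than the recognition model of the disclosure; the toy one-hot inputs, the learning rate and the threshold are all assumed values.

```python
import math
from typing import List, Tuple

def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(xs: List[List[float]], ys: List[int], n_labels: int, lr: float = 0.5,
          loss_threshold: float = 0.1, max_epochs: int = 2000
          ) -> Tuple[List[List[float]], List[float], List[float]]:
    """Full-batch gradient descent on cross-entropy until loss < threshold."""
    n_feat = len(xs[0])
    W = [[0.0] * n_feat for _ in range(n_labels)]  # weights
    b = [0.0] * n_labels                           # biases
    history: List[float] = []
    for _ in range(max_epochs):
        gW = [[0.0] * n_feat for _ in range(n_labels)]
        gb = [0.0] * n_labels
        loss = 0.0
        for x, y in zip(xs, ys):
            logits = [sum(w * v for w, v in zip(W[k], x)) + b[k]
                      for k in range(n_labels)]
            p = softmax(logits)
            loss -= math.log(p[y])
            for k in range(n_labels):
                err = p[k] - (1.0 if k == y else 0.0)  # dLoss/dlogit_k
                gb[k] += err
                for j in range(n_feat):
                    gW[k][j] += err * x[j]
        loss /= len(xs)
        history.append(loss)
        if loss < loss_threshold:  # preset stopping condition
            break
        for k in range(n_labels):
            b[k] -= lr * gb[k] / len(xs)
            for j in range(n_feat):
                W[k][j] -= lr * gW[k][j] / len(xs)
    return W, b, history

# Toy data: four one-hot "combination vectors", labels 0..3 standing for B, E, O, S.
xs = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
ys = [0, 1, 2, 3]
W, b, history = train(xs, ys, n_labels=4)
print(len(history), history[0] > history[-1])
```

In a multi-layer model the same error signal would be propagated backwards through every layer; here there is only one layer, so the gradient is applied directly.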
The structure of the recognition model may be, for example, a combination of Transformer + CRF. The Transformer is based on a multi-head self-attention mechanism and can learn the degree of correlation between the individual combined vectors in the combined vector sequence. The input size of the recognition model may be 300. The number of neurons of the FFN (Feed Forward Network) included in the Transformer may be 256, and the number of neurons of the preprocessing network (pre-net) included in the Transformer may be 256. The Transformer may include 8 multi-head self-attention structures, and may include one Encoder block and one Decoder block. The maximum length that the recognition model can handle may be 150; that is, the combined vector sequence may include at most 150 combined vectors (the text to be recognized may include at most 150 words).
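The example hyperparameters above can be collected in a small configuration object, which makes the length limit easy to check before a sequence is passed to the model. The class name, field names and the `check_length` helper are hypothetical; in particular, mapping the "8 multi-head self-attention structures" onto a single integer field is an interpretation, and the text does not say how longer inputs are handled.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class RecognitionModelConfig:
    input_size: int = 300          # size of the model input
    ffn_units: int = 256           # neurons in the feed-forward network
    prenet_units: int = 256        # neurons in the pre-net
    multi_head_attention: int = 8  # "8 multi-head self-attention structures"
    encoder_blocks: int = 1
    decoder_blocks: int = 1
    max_length: int = 150          # maximum number of combined vectors

def check_length(sequence: Sequence, config: RecognitionModelConfig) -> None:
    """Raise if the combined vector sequence exceeds the supported length."""
    if len(sequence) > config.max_length:
        raise ValueError(
            f"sequence has {len(sequence)} items, limit is {config.max_length}")

config = RecognitionModelConfig()
check_length(range(150), config)   # at the limit: accepted
print(config.max_length)           # 150
```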
In summary, the present disclosure first obtains each word in a text to be recognized and the word vector corresponding to each word, and then determines the word vector corresponding to the associated word of each word, where the associated word is determined according to the combined word corresponding to the word. The word vector corresponding to each word and the word vector corresponding to its associated word are then combined into a combined vector corresponding to the word, which yields a combined vector sequence corresponding to the text to be recognized that includes the combined vector of each word. Finally, the character entities included in the text to be recognized are determined according to the combined vector sequence and a pre-trained recognition model. Because the recognition process considers both each word in the text to be recognized and the associated words of each word, the accuracy of character entity recognition is improved.
Fig. 6 is a block diagram illustrating an apparatus for recognizing characters in text according to an exemplary embodiment, and as shown in fig. 6, the apparatus 200 may include:
the obtaining module 201 is configured to obtain each word included in the text to be recognized and a word vector corresponding to each word.
The determining module 202 is configured to determine a word vector corresponding to an associated word corresponding to each word in the text to be recognized, where the associated word is determined according to a combined word corresponding to the word, and the combined word is composed of the word and a preset number of words adjacent to the word.
The processing module 203 is configured to combine the word vector corresponding to each word and the word vector corresponding to the associated word corresponding to the word into a combined vector corresponding to the word, so as to obtain a combined vector sequence corresponding to the text to be recognized, where the combined vector sequence includes the combined vector corresponding to each word in the text to be recognized.
The recognition module 204 is configured to determine a character entity included in the text to be recognized according to the combination vector sequence and the pre-trained recognition model.
Fig. 7 is a block diagram illustrating another in-text character recognition apparatus according to an exemplary embodiment, and as shown in fig. 7, the determining module 202 includes:
an obtaining sub-module 2021 is configured to obtain, for each word, a combined word composed of the word and a preset number of words adjacent to the word.
The determining submodule 2022 is configured to take, as an associated word corresponding to the word, a combined word matched with a preset word dictionary in the combined words, and obtain a word vector corresponding to the associated word.
Fig. 8 is a block diagram illustrating another recognition apparatus of characters in text according to an exemplary embodiment, and as shown in fig. 8, the recognition module 204 may include:
The recognition submodule 2041 is configured to input the combined vector sequence into a recognition model to obtain an attribute tag corresponding to each word in the text to be recognized output by the recognition model, where the attribute tag is used to indicate whether the word belongs to a character entity.
The processing submodule 2042 is configured to determine a character entity included in the text to be recognized according to the attribute tag corresponding to each word in the text to be recognized.
In one application scenario, the attribute tag is further used to indicate that the position of the word in the character entity is a start position, or an end position, or an intermediate position.
In another application scenario, the processing submodule 2042 may be used to:
if the attribute label corresponding to the target word indicates that the target word belongs to the character entity, determining the character entity comprising the target word according to the position of the target word indicated by the attribute label in the character entity, wherein the target word is any word in the text to be recognized.
Fig. 9 is a block diagram of another apparatus for recognizing characters in text according to an exemplary embodiment. As shown in fig. 9, the text to be recognized includes a first text to be recognized, which corresponds to any dialogue sentence in a specified total text, and a second text to be recognized, which corresponds to a sentence in the specified total text whose distance from the dialogue sentence corresponding to the first text to be recognized satisfies a preset condition. The apparatus 200 may further include:
the attribute determining module 205 is configured to determine, after determining the character entities included in the text to be recognized according to the combination vector sequence and the pre-trained recognition model, attribute features corresponding to each character entity included in the text to be recognized, where the attribute features include: one or more of a first positional relationship between the character entity and a first text to be identified, a second positional relationship between the text to which the character entity belongs and the first text to be identified, and a dialogue attribute of the text to which the character entity belongs.
The input module 206 is configured to input, for each character entity, the first text to be recognized, the second text to be recognized, the character entity and the attribute features corresponding to the character entity into a pre-trained attribution identification model, so as to obtain the matching degree, output by the attribution identification model, between the character entity and the first text to be recognized.
The attribution determining module 207 is configured to determine, according to the matching degree between each character entity and the first text to be identified, a target character entity to which the dialogue sentence corresponding to the first text to be identified belongs.
In the above embodiment, the recognition model is obtained by training in the following manner:
Step A, obtaining a word vector corresponding to each training word in a training text, a word vector corresponding to the training associated word of each training word in the training text, and labeling data corresponding to the training text, where the training associated word is determined according to the training combination word corresponding to the training word, the training combination word consists of the training word and a preset number of training words adjacent to it, and the labeling data includes the labeled character entities included in the training text.
Step B, for each training word, combining the word vector corresponding to the training word and the word vector corresponding to its training associated word into a training combination vector corresponding to the training word, so as to obtain a training combination vector sequence corresponding to the training text, where the training combination vector sequence includes the training combination vector corresponding to each training word.
Step C, inputting the training combination vector sequence into the recognition model, and training the recognition model according to the output of the recognition model and the labeling data.
The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and is not repeated here.
In summary, the present disclosure first obtains each word in a text to be recognized and the word vector corresponding to each word, and then determines the word vector corresponding to the associated word of each word, where the associated word is determined according to the combined word corresponding to the word. The word vector corresponding to each word and the word vector corresponding to its associated word are then combined into a combined vector corresponding to the word, which yields a combined vector sequence corresponding to the text to be recognized that includes the combined vector of each word. Finally, the character entities included in the text to be recognized are determined according to the combined vector sequence and a pre-trained recognition model. Because the recognition process considers both each word in the text to be recognized and the associated words of each word, the accuracy of character entity recognition is improved.
Referring now to fig. 10, a schematic diagram of a structure of an electronic device 300 suitable for use in implementing embodiments of the present disclosure (e.g., an execution body of a method of recognizing characters in text in the above-described embodiments) is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 10 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 10, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the terminal device and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring each word included in a text to be identified and a word vector corresponding to each word; determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word; combining a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combined vector corresponding to the word to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises a combined vector corresponding to each word in the text to be recognized; and determining the character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not limit the module itself; for example, the acquisition module may also be described as "a module that acquires each word and a word vector corresponding to each word".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a method of identifying a character in text, the method comprising: acquiring each word included in a text to be identified and a word vector corresponding to each word; determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word; combining a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combined vector corresponding to the word to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises a combined vector corresponding to each word in the text to be recognized; and determining the character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model.
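The combining step of example 1 can be illustrated with a minimal Python sketch. The disclosure here does not state how the two vectors are merged, so the sketch assumes concatenation, and it assumes that several associated words are mean-pooled and that a zero vector stands in when a word has no associated word. All function and parameter names (`mean_vector`, `combined_vector_sequence`, `assoc_dim`) are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average a list of vectors; return a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def combined_vector_sequence(
    tokens: Sequence[str],
    token_vectors: Dict[str, Vector],
    associated: Sequence[Sequence[str]],
    assoc_vectors: Dict[str, Vector],
    assoc_dim: int,
) -> List[Vector]:
    """Build one combined vector per token.

    Assumption: the combined vector is the token's own vector concatenated
    with the mean of the vectors of its associated words.
    """
    sequence = []
    for token, assoc_words in zip(tokens, associated):
        pooled = mean_vector([assoc_vectors[w] for w in assoc_words], assoc_dim)
        sequence.append(token_vectors[token] + pooled)  # list concatenation
    return sequence
```

The resulting list has one entry per word of the text to be recognized, which is the shape a sequence-labeling recognition model would consume.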
According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein determining a word vector corresponding to an associated word corresponding to each word in the text to be identified includes: for each word, acquiring the combined words composed of the word and a preset number of words adjacent to the word; and taking the combined word matched with a preset word dictionary as the associated word corresponding to the word in the combined word, and acquiring a word vector corresponding to the associated word.
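A rough sketch of the dictionary matching in example 2 follows. It assumes that each "word" is a single token, that a combined word is any contiguous span containing the token together with up to a preset number of adjacent tokens, and that the preset word dictionary is a plain set of strings. The names `combined_words`, `associated_words`, and `max_neighbors` are hypothetical.

```python
from typing import List, Sequence, Set


def combined_words(tokens: Sequence[str], index: int, max_neighbors: int) -> List[str]:
    """Return every contiguous span that covers tokens[index] and adds
    between 1 and max_neighbors adjacent tokens."""
    spans = []
    for length in range(2, max_neighbors + 2):
        # The span may start anywhere as long as it still covers `index`.
        for start in range(index - length + 1, index + 1):
            end = start + length
            if start >= 0 and end <= len(tokens):
                spans.append("".join(tokens[start:end]))
    return spans


def associated_words(
    tokens: Sequence[str], dictionary: Set[str], max_neighbors: int = 2
) -> List[List[str]]:
    """For each token, keep only the combined words found in the dictionary."""
    return [
        [w for w in combined_words(tokens, i, max_neighbors) if w in dictionary]
        for i in range(len(tokens))
    ]
```

Each token ends up with zero or more associated words, whose vectors can then be looked up in a word-vector table.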
According to one or more embodiments of the present disclosure, example 3 provides the method of example 1, the determining, according to the combined vector sequence and a pre-trained recognition model, a character entity included in the text to be recognized, including: inputting the combined vector sequence into the recognition model to obtain attribute labels corresponding to each word in the text to be recognized output by the recognition model, wherein the attribute labels are used for indicating whether the word belongs to the character entity or not; and determining the character entity included in the text to be recognized according to the attribute label corresponding to each word in the text to be recognized.
According to one or more embodiments of the present disclosure, example 4 provides the method of example 3, wherein the attribute tag is further used to indicate that the position of the word in the character entity is a start position, an end position, or an intermediate position.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 4, wherein the determining the character entity included in the text to be recognized according to the attribute tag corresponding to each word in the text to be recognized includes: if the attribute tag corresponding to a target word indicates that the target word belongs to the character entity, determining the character entity comprising the target word according to the position of the target word in the character entity indicated by the attribute tag, wherein the target word is any word in the text to be recognized.
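Examples 3 to 5 describe recovering character entities from per-word attribute tags. The sketch below assumes a hypothetical tag alphabet that the source does not name: "B" for a start position, "I" for an intermediate position, "E" for an end position, and "O" for a word outside any character entity. It also assumes that malformed tag sequences are simply dropped.

```python
from typing import List, Sequence


def decode_entities(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect character entities from position tags (assumed B/I/E/O scheme)."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            current = [token]  # a start tag always opens a new span
        elif tag == "I" and current:
            current.append(token)  # extend the open span
        elif tag == "E" and current:
            current.append(token)  # close the span and emit the entity
            entities.append("".join(current))
            current = []
        else:
            current = []  # "O" or a malformed sequence drops any partial span
    return entities
```

Single-token entities are not representable with only start, intermediate, and end positions, so this sketch ignores them; a real tag set might add a dedicated label for that case.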
According to one or more embodiments of the present disclosure, example 6 provides the method of example 1, wherein the text to be recognized includes a first text to be recognized and a second text to be recognized, the first text to be recognized corresponds to any dialogue sentence in a specified total text, and the second text to be recognized corresponds to a sentence in the specified total text whose distance from the dialogue sentence corresponding to the first text to be recognized meets a preset condition; after the determining of the character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model, the method further comprises: determining attribute features corresponding to each character entity included in the text to be recognized, wherein the attribute features comprise one or more of: a first position relation between the character entity and the first text to be recognized, a second position relation between the text to which the character entity belongs and the first text to be recognized, and a dialogue attribute of the text to which the character entity belongs; for each character entity, inputting the first text to be recognized, the second text to be recognized, the character entity, and the attribute features corresponding to the character entity into a pre-trained attribution identification model, so as to obtain the matching degree between the character entity and the first text to be recognized, which is output by the attribution identification model; and determining the target character entity to which the dialogue sentence corresponding to the first text to be recognized belongs according to the matching degree between each character entity and the first text to be recognized.
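The attribution step of example 6 can be sketched as follows. This is an illustration under several assumptions rather than the disclosed model: the preset distance condition is taken to be "within `max_distance` sentences of the dialogue sentence", the attribute features use an invented encoding, and the pre-trained attribution model is replaced by a caller-supplied `score` function that sees only the candidate (the disclosed model also receives both texts). All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Sentence:
    text: str
    is_dialogue: bool


@dataclass
class Candidate:
    entity: str
    in_first_text: bool  # assumed encoding of the first position relation
    offset: int          # assumed second position relation: sentence index minus dialogue index
    from_dialogue: bool  # dialogue attribute of the sentence the entity came from


def context_indices(n_sentences: int, dialogue_idx: int, max_distance: int) -> List[int]:
    """Indices of sentences within max_distance of the dialogue sentence."""
    lo = max(0, dialogue_idx - max_distance)
    hi = min(n_sentences - 1, dialogue_idx + max_distance)
    return [i for i in range(lo, hi + 1) if i != dialogue_idx]


def build_candidates(
    sentences: Sequence[Sentence],
    dialogue_idx: int,
    entities_by_sentence: Dict[int, List[str]],
    max_distance: int,
) -> List[Candidate]:
    """Attach attribute features to every entity in the first and second texts."""
    indices = [dialogue_idx] + context_indices(len(sentences), dialogue_idx, max_distance)
    return [
        Candidate(entity, i == dialogue_idx, i - dialogue_idx, sentences[i].is_dialogue)
        for i in indices
        for entity in entities_by_sentence.get(i, [])
    ]


def pick_speaker(
    candidates: Sequence[Candidate], score: Callable[[Candidate], float]
) -> Optional[str]:
    """Return the entity with the highest matching degree, or None if there is none."""
    if not candidates:
        return None
    return max(candidates, key=score).entity
```

Selecting the maximum matching degree is one plausible reading of "according to the matching degree"; a threshold or tie-breaking rule could be layered on top.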
In accordance with one or more embodiments of the present disclosure, example 7 provides the method of examples 1-6, the recognition model is trained by: acquiring word vectors corresponding to each training word in a training text, word vectors corresponding to training associated words corresponding to the training word in the training text and labeling data corresponding to the training text, wherein the training associated words are determined according to training combined words corresponding to the training word, the training combined words consist of the training word and a preset number of training words adjacent to the training word, and the labeling data comprise labeled character entities included in the training text; aiming at each training word, combining a word vector corresponding to the training word and a word vector corresponding to the training associated word corresponding to the training word into a training combination vector corresponding to the training word to obtain a training combination vector sequence corresponding to the training text, wherein the training combination vector sequence comprises training combination vectors corresponding to each training word; and inputting the training combination vector sequence into the recognition model, and training the recognition model according to the output of the recognition model and the labeling data.
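To make the training signal of example 7 concrete, the sketch below converts labeled character-entity spans into per-word target tags and scores a model's output against them. The tag set ("O", "B", "I", "E") and the use of mean cross-entropy as the training loss are assumptions; the source only states that the model is trained according to its output and the labeling data. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

TAGS = ["O", "B", "I", "E"]  # hypothetical tag set: outside, start, intermediate, end


def spans_to_tags(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Convert labeled entity spans (start, end inclusive) into per-token tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end <= start:
            continue  # single-token entities are outside this simplified scheme
        tags[start] = "B"
        tags[end] = "E"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def sequence_cross_entropy(probs: Sequence[Sequence[float]], gold: Sequence[str]) -> float:
    """Mean negative log-likelihood of the gold tags under per-token distributions."""
    total = 0.0
    for dist, tag in zip(probs, gold):
        total -= math.log(max(dist[TAGS.index(tag)], 1e-12))  # clamp to avoid log(0)
    return total / len(gold)
```

In a full training loop, `probs` would be the recognition model's output for the training combination vector sequence, and the loss would drive a gradient update of the model parameters.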
Example 8 provides an apparatus for recognizing characters in text, according to one or more embodiments of the present disclosure, the apparatus comprising: the acquisition module is used for acquiring each word included in the text to be identified and a word vector corresponding to each word; the determining module is used for determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word; the processing module is used for forming a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combination vector corresponding to the word so as to obtain a combination vector sequence corresponding to the text to be recognized, wherein the combination vector sequence comprises combination vectors corresponding to each word in the text to be recognized; and the recognition module is used for determining the character entity included in the text to be recognized according to the combination vector sequence and the pre-trained recognition model.
According to one or more embodiments of the present disclosure, example 9 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the methods described in examples 1 to 7.
In accordance with one or more embodiments of the present disclosure, example 10 provides an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to realize the steps of the method described in examples 1 to 7.
The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, it covers embodiments formed by mutually substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Claims (9)
1. A method for identifying characters in text, the method comprising:
Acquiring each word included in a text to be identified and a word vector corresponding to each word;
Determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word;
combining a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combined vector corresponding to the word to obtain a combined vector sequence corresponding to the text to be recognized, wherein the combined vector sequence comprises a combined vector corresponding to each word in the text to be recognized;
Determining character entities included in the text to be recognized according to the combined vector sequence and a pre-trained recognition model;
the text to be identified comprises a first text to be identified and a second text to be identified, wherein the first text to be identified corresponds to any dialogue sentence in a specified total text, and the second text to be identified corresponds to a sentence in the specified total text whose distance from the dialogue sentence corresponding to the first text to be identified meets a preset condition;
After said determining of character entities comprised in said text to be recognized from said combined vector sequence and a pre-trained recognition model, said method further comprises:
determining attribute characteristics corresponding to each character entity included in the text to be identified, wherein the attribute characteristics comprise: one or more of a first position relation between the character entity and the first text to be identified, a second position relation between the text to which the character entity belongs and the first text to be identified, and a dialogue attribute of the text to which the character entity belongs;
for each character entity, inputting the first text to be identified, the second text to be identified, the character entity and the attribute characteristics corresponding to the character entity into a pre-trained attribution identification model, so as to obtain the matching degree of the character entity and the first text to be identified, which is output by the attribution identification model;
and determining the target character entity to which the dialogue sentence corresponding to the first text to be identified belongs according to the matching degree of each character entity and the first text to be identified.
2. The method of claim 1, wherein the determining a word vector corresponding to the associated word corresponding to each word in the text to be recognized comprises:
For each word, acquiring the combined words composed of the word and a preset number of words adjacent to the word;
and taking the combined word matched with a preset word dictionary as the associated word corresponding to the word in the combined word, and acquiring a word vector corresponding to the associated word.
3. The method according to claim 1, wherein said determining a character entity included in said text to be recognized from said combined vector sequence and a pre-trained recognition model comprises:
Inputting the combined vector sequence into the recognition model to obtain attribute labels corresponding to each word in the text to be recognized output by the recognition model, wherein the attribute labels are used for indicating whether the word belongs to the character entity or not;
And determining the character entity included in the text to be recognized according to the attribute label corresponding to each word in the text to be recognized.
4. A method according to claim 3, wherein the attribute tag is further used to indicate that the position of the word in the character entity is a start position, or an end position, or an intermediate position.
5. The method according to claim 4, wherein the determining the character entity included in the text to be recognized according to the attribute tag corresponding to each word in the text to be recognized includes:
if the attribute tag corresponding to a target word indicates that the target word belongs to the character entity, determining the character entity comprising the target word according to the position of the target word in the character entity indicated by the attribute tag, wherein the target word is any word in the text to be identified.
6. The method according to any one of claims 1-5, wherein the recognition model is trained by:
Acquiring word vectors corresponding to each training word in a training text, word vectors corresponding to training associated words corresponding to the training word in the training text and labeling data corresponding to the training text, wherein the training associated words are determined according to training combined words corresponding to the training word, the training combined words consist of the training word and a preset number of training words adjacent to the training word, and the labeling data comprise labeled character entities included in the training text;
Aiming at each training word, combining a word vector corresponding to the training word and a word vector corresponding to the training associated word corresponding to the training word into a training combination vector corresponding to the training word to obtain a training combination vector sequence corresponding to the training text, wherein the training combination vector sequence comprises training combination vectors corresponding to each training word;
and inputting the training combination vector sequence into the recognition model, and training the recognition model according to the output of the recognition model and the labeling data.
7. An apparatus for recognizing characters in text, the apparatus comprising:
The acquisition module is used for acquiring each word included in the text to be identified and a word vector corresponding to each word;
The determining module is used for determining word vectors corresponding to associated words corresponding to each word in the text to be recognized, wherein the associated words are determined according to combined words corresponding to the word, and the combined words consist of the word and a preset number of words adjacent to the word;
The processing module is used for forming a word vector corresponding to each word and a word vector corresponding to the associated word corresponding to the word into a combination vector corresponding to the word so as to obtain a combination vector sequence corresponding to the text to be recognized, wherein the combination vector sequence comprises combination vectors corresponding to each word in the text to be recognized;
The recognition module is used for determining character entities included in the text to be recognized according to the combination vector sequence and a pre-trained recognition model;
the text to be identified comprises a first text to be identified and a second text to be identified, wherein the first text to be identified corresponds to any dialogue sentence in a specified total text, and the second text to be identified corresponds to a sentence in the specified total text whose distance from the dialogue sentence corresponding to the first text to be identified meets a preset condition; the apparatus further comprises:
The attribute determining module is configured to determine attribute features corresponding to each character entity included in the text to be recognized after determining the character entity included in the text to be recognized according to the combined vector sequence and the pre-trained recognition model, where the attribute features include: one or more of a first position relation between the character entity and the first text to be identified, a second position relation between the text to which the character entity belongs and the first text to be identified, and a dialogue attribute of the text to which the character entity belongs;
The input module is used for inputting the first text to be identified, the second text to be identified, the character entity and the attribute characteristics corresponding to the character entity into a pre-trained attribution identification model aiming at each character entity so as to obtain the matching degree of the character entity and the first text to be identified, which is output by the attribution identification model;
and the attribution determining module is used for determining a target character entity to which the dialogue sentence corresponding to the first text to be identified belongs according to the matching degree of each character entity and the first text to be identified.
8. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-6.
9. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110145123.4A CN112906380B (en) | 2021-02-02 | 2021-02-02 | Character recognition method and device in text, readable medium and electronic equipment |
PCT/CN2022/073126 WO2022166613A1 (en) | 2021-02-02 | 2022-01-21 | Method and apparatus for recognizing role in text, and readable medium and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110145123.4A CN112906380B (en) | 2021-02-02 | 2021-02-02 | Character recognition method and device in text, readable medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112906380A (en) | 2021-06-04 |
CN112906380B (en) | 2024-09-27 |
Family
ID=76121552
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110145123.4A Active CN112906380B (en) | 2021-02-02 | 2021-02-02 | Character recognition method and device in text, readable medium and electronic equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112906380B (en) |
WO (1) | WO2022166613A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112906380B (en) * | 2021-02-02 | 2024-09-27 | 北京有竹居网络技术有限公司 | Character recognition method and device in text, readable medium and electronic equipment |
CN113312358A (en) * | 2021-06-23 | 2021-08-27 | 北京有竹居网络技术有限公司 | Method and device for constructing character library, storage medium and electronic equipment |
CN113658458B (en) * | 2021-08-20 | 2024-02-13 | 北京得间科技有限公司 | Reading processing method, computing device and storage medium for dialogue novels |
CN114783403B (en) * | 2022-02-18 | 2024-08-13 | 腾讯科技(深圳)有限公司 | Method, apparatus, device, storage medium and program product for generating audio reading material |
CN115034226B (en) * | 2022-06-17 | 2024-07-23 | 北京有竹居网络技术有限公司 | Method, apparatus, device and storage medium for determining speaker in text |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695345A (en) * | 2020-06-12 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Method and device for recognizing entity in text |
CN112270167A (en) * | 2020-10-14 | 2021-01-26 | 北京百度网讯科技有限公司 | Role labeling method and device, electronic equipment and storage medium |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388795B (en) * | 2017-08-07 | 2022-11-08 | 芋头科技(杭州)有限公司 | Named entity recognition method, language recognition method and system |
US10515625B1 (en) * | 2017-08-31 | 2019-12-24 | Amazon Technologies, Inc. | Multi-modal natural language processing |
JPWO2020021845A1 (en) * | 2018-07-24 | 2021-02-15 | 株式会社Nttドコモ | Document classification device and trained model |
CN111368535B (en) * | 2018-12-26 | 2024-01-16 | 珠海金山数字网络科技有限公司 | Sensitive word recognition method, device and equipment |
WO2020133039A1 (en) * | 2018-12-27 | 2020-07-02 | 深圳市优必选科技有限公司 | Entity identification method and apparatus in dialogue corpus, and computer device |
CN110222330B (en) * | 2019-04-26 | 2024-01-30 | 平安科技(深圳)有限公司 | Semantic recognition method and device, storage medium and computer equipment |
CN110334340B (en) * | 2019-05-06 | 2021-08-03 | 北京泰迪熊移动科技有限公司 | Semantic analysis method and device based on rule fusion and readable storage medium |
CN111160033B (en) * | 2019-12-18 | 2024-02-27 | 车智互联(北京)科技有限公司 | Named entity identification method based on neural network, computing equipment and storage medium |
CN111104800B (en) * | 2019-12-24 | 2024-01-23 | 东软集团股份有限公司 | Entity identification method, entity identification device, entity identification equipment, storage medium and program product |
CN111428493B (en) * | 2020-03-06 | 2024-08-30 | 中国平安人寿保险股份有限公司 | Entity relationship acquisition method, device, equipment and storage medium |
CN111767715A (en) * | 2020-06-10 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Method, device, equipment and storage medium for person identification |
CN111669757B (en) * | 2020-06-15 | 2023-03-14 | 国家计算机网络与信息安全管理中心 | Terminal fraud call identification method based on conversation text word vector |
CN112270198B (en) * | 2020-10-27 | 2021-08-17 | 北京百度网讯科技有限公司 | Role determination method and device, electronic equipment and storage medium |
CN112906380B (en) * | 2021-02-02 | 2024-09-27 | 北京有竹居网络技术有限公司 | Character recognition method and device in text, readable medium and electronic equipment |
- 2021-02-02: CN application CN202110145123.4A, published as CN112906380B, status: Active
- 2022-01-21: WO application PCT/CN2022/073126, published as WO2022166613A1, status: Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695345A (en) * | 2020-06-12 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Method and device for recognizing entity in text |
CN112270167A (en) * | 2020-10-14 | 2021-01-26 | 北京百度网讯科技有限公司 | Role labeling method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2022166613A1 (en) | 2022-08-11 |
CN112906380A (en) | 2021-06-04 |
Similar Documents
Publication | Title |
---|---|
CN112906380B (en) | Character recognition method and device in text, readable medium and electronic equipment |
CN111177393B (en) | Knowledge graph construction method and device, electronic equipment and storage medium |
CN113470619B (en) | Speech recognition method, device, medium and equipment |
CN112634876B (en) | Speech recognition method, device, storage medium and electronic equipment |
CN112883968B (en) | Image character recognition method, device, medium and electronic equipment |
WO2022166621A1 (en) | Dialog attribution recognition method and apparatus, readable medium and electronic device |
CN111767740B (en) | Sound effect adding method and device, storage medium and electronic equipment |
CN109933217B (en) | Method and device for pushing sentences |
CN111489735B (en) | Voice recognition model training method and device |
CN113158656B (en) | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium |
CN111667810B (en) | Method and device for acquiring polyphone corpus, readable medium and electronic equipment |
CN111753551A (en) | Information generation method and device based on word vector generation model |
CN112364653A (en) | Text analysis method, apparatus, server and medium for speech synthesis |
CN114444508A (en) | Date identification method and device, readable medium and electronic equipment |
CN111555960A (en) | Method for generating information |
CN110827085A (en) | Text processing method, device and equipment |
CN111460214B (en) | Classification model training method, audio classification method, device, medium and equipment |
CN115129877B (en) | Punctuation mark prediction model generation method and device and electronic equipment |
CN117312845A (en) | Sample labeling method, medium and electronic equipment |
CN111914535B (en) | Word recognition method and device, computer equipment and storage medium |
CN110728137B (en) | Method and device for word segmentation |
CN114429629A (en) | Image processing method and device, readable storage medium and electronic equipment |
CN116821327A (en) | Text data processing method, apparatus, device, readable storage medium and product |
CN114677668A (en) | Character recognition method and device, computer readable medium and electronic equipment |
CN112699687A (en) | Content cataloging method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||