CN111695345A - Method and device for recognizing entity in text - Google Patents
- Publication number: CN111695345A (Application CN202010533173.5A)
- Authority: CN (China)
- Prior art keywords: text, word, character, entity, vector
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a method and device for recognizing entities in text, an electronic device, and a computer-readable storage medium. The method includes: performing feature extraction on a text to obtain a character feature vector corresponding to each character in the text; determining a dictionary vector corresponding to each character according to the entity dictionary corresponding to the text; performing word segmentation on the text to obtain the word corresponding to each character and determining a word vector corresponding to each word; splicing the character feature vector, the dictionary vector, and the word vector corresponding to each character to obtain a splicing vector for that character; and determining a label for each character according to its splicing vector, then determining the entities in the text and their types according to the labels. The method and device improve both the efficiency and the accuracy of entity recognition.
Description
Technical Field
The invention relates to natural language processing technology in the field of artificial intelligence, in particular to a method and a device for identifying entities in texts, electronic equipment and a computer readable storage medium.
Background
Artificial Intelligence (AI) refers to the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. Natural Language Processing (NLP) is an important direction within artificial intelligence; it mainly studies the theories and methods that enable effective communication between humans and computers in natural language.
Entity recognition is a branch of natural language processing that identifies entities with specific meaning in text, such as song names, person names, and place names. In the schemes provided by the related art, features of the text to be recognized are generally constructed manually, a machine learning model then labels those features with tag categories, and entities are finally recognized from the labeled categories. Manual feature construction makes entity recognition inefficient.
Disclosure of Invention
The embodiment of the invention provides a method and a device for identifying an entity in a text, electronic equipment and a computer readable storage medium, which can improve the efficiency and accuracy of entity identification.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a method for identifying entities in texts, which comprises the following steps:
performing feature extraction processing on a text to obtain a character feature vector corresponding to each character in the text;
determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
performing word segmentation processing on the text to obtain words corresponding to each character in the text, and determining word vectors corresponding to each word;
splicing the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
determining the label corresponding to each character according to the splicing vector corresponding to each character; and
determining the entity in the text and the type of the entity according to the label corresponding to each character.
The embodiment of the invention provides a device for identifying entities in texts, which comprises:
the feature extraction module is used for performing feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text;
the dictionary module is used for determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
the word segmentation module is used for carrying out word segmentation processing on the text to obtain words corresponding to each character in the text and determining word vectors corresponding to each word;
the splicing module is used for splicing the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
and the identification module is used for determining a label corresponding to each character according to the splicing vector corresponding to each character, and determining an entity in the text and the type of the entity according to the label corresponding to each character.
In the above scheme, the feature extraction module is further configured to query a mapping dictionary for a numeric identifier corresponding to each word in the text; and converting the digital identifier corresponding to each character into a vector form to obtain a character feature vector corresponding to each character.
In the above scheme, the dictionary module is further configured to determine a type to which the text belongs; determining an entity dictionary corresponding to the type of the text; and in the entity dictionary, inquiring dictionary vectors corresponding to each character in the text.
In the above scheme, the word segmentation module is further configured to invoke a word vector model to perform the following operations on the text: intercepting a plurality of words with preset word number and length in the text; coding each intercepted word to obtain a plurality of coding sequences in one-to-one correspondence; mapping the coding sequence corresponding to each of the words to a word vector for the corresponding word.
In the above scheme, the splicing module is further configured to determine the character feature vector, dictionary vector, and word vector corresponding to the same character; and stack every dimension contained in the character feature vector, the dictionary vector, and the word vector, filling each stacked dimension with the corresponding scalar, to obtain the splicing vector corresponding to the character.
In the above scheme, the identification module is further configured to map the concatenation vector corresponding to each character into probabilities of respectively belonging to different candidate tags; the candidate label is used for indicating the type of an entity to which the word belongs and the position of the word in the belonging entity, or indicating that the word belongs to an irrelevant character; and determining the candidate label corresponding to the maximum probability as the label corresponding to the character.
In the foregoing solution, the identification module is further configured to map the splicing vector corresponding to each character to a plurality of different candidate tags and determine, for each candidate tag, the transfer score of mapping the splicing vector to that tag, where a candidate tag indicates the type of the entity to which the character belongs and the position of the character within that entity, or indicates that the character is an unrelated character; and to determine the label corresponding to each character according to the plurality of candidate tags corresponding to each character and their transfer scores.
In the above scheme, the identification module is further configured to select candidate tags multiple times from multiple candidate tags corresponding to each word according to an appearance sequence of each word in the text, and combine the candidate tags selected each time to obtain multiple different candidate tag sequences; a plurality of candidate tags contained in the candidate tag sequence selected each time belong to different characters, and the number of the contained candidate tags is the same as the number of words of the text; accumulating the transfer scores corresponding to each candidate tag in the candidate tag sequence to obtain an overall transfer score; and determining a plurality of candidate labels contained in the candidate label sequence with the maximum integral transfer score as label categories corresponding to the corresponding characters.
In the foregoing solution, the identification module is further configured to identify, as the same entity, the characters that are located consecutively in the text and have corresponding labels indicating the same entity type, and identify the entity type indicated by the labels as the type of the same entity.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the entity identification method in the text provided by the embodiment of the invention when the processor executes the executable instructions stored in the memory.
The embodiment of the invention provides a computer-readable storage medium, which stores executable instructions and is used for causing a processor to execute the method for recognizing the entity in the text.
The embodiment of the invention has the following beneficial effects:
the features in the text are automatically extracted without manually constructing the features, so that the workload of feature engineering is simplified, and the efficiency of entity identification is improved; the character feature vector, the dictionary vector and the word vector corresponding to each character in the text are spliced, and label marking is carried out on the spliced vectors, so that errors of recognizing entities and entity types in the text are reduced, and accuracy of entity recognition is improved.
Drawings
Fig. 1 is a schematic diagram of an architecture of a system 100 for recognizing entities in text according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device 500 according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for identifying an entity in a text according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for identifying an entity in a text according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for identifying an entity in a text according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an example of model inputs provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a model architecture provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
fig. 9 is a schematic diagram of a model structure provided in the embodiment of the present invention.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained; the following explanations apply to those terms and expressions.
1) Natural language processing: an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
2) BIO labeling scheme: a way of labeling the elements of a text (or text sentence). Each element is labeled "B-X", "I-X", or "O", where "B" in "B-X" indicates that the element is at the beginning of an entity, "I" in "I-X" indicates that the element is inside an entity but not at its beginning, "X" in "B-X" and "I-X" indicates that the entity type of the element is X, and "O" indicates that the element does not belong to any type, i.e., it is an unrelated element. The elements may be the characters of a text sentence.
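The following minimal sketch illustrates the BIO scheme on a hypothetical tokenized query; the tokens and tag set are illustrative, not taken from the patent.

```python
# Minimal illustration of the BIO labeling scheme described above.
# The sentence and tags are hypothetical examples.
tokens = ["play", "Forget", "Love", "Water"]
tags = ["O", "B-song", "I-song", "I-song"]  # "B-" begins an entity, "I-" continues it, "O" is unrelated

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```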
3) Short text Query: a request statement input by the user that carries the user's intended expectation, for example: "play XXX's Ice Rain"; "give me the story of fool-watch mountain"; "I want to see the movie without a break"; and so on.
4) Entity, or named entity: a basic unit of a knowledge graph and an important language unit carrying information in text, for example: person names, place names, organization names, product names, and so on.
In a task-based dialog system, entities represent the important information in a user's input Query. For example, when the Query is "play XXX's Ice Rain", the Query itself expresses the user's intention to listen to a song, and the entities are XXX (artist type) and Ice Rain (song type).
5) A media asset class entity: refers to media information entities, such as song entities in music skills, movies, television shows, or cartoons entities in video skills, and album entities in FM radio skills. There is some similarity between entities and there may also be an intersection of entity contents.
6) Named Entity Recognition (NER): identifying the entities in a text.
7) Entity dictionary: when a new skill intention is designed, a set of entity instances involved in the new skill, i.e., an entity dictionary, is generally provided for reference so as to define the boundaries and rules of that set of entities.
8) Speech recognition, or Automatic Speech Recognition (ASR): aims at converting the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the vocabulary content of the speech.
9) Training samples, or training data: data sets with relatively stable and accurate feature descriptions obtained after preprocessing, which participate in the training of the entity recognition model in the form of samples.
10) Artificial intelligence cloud service (AIaaS, AI as a Service): a service mode of artificial intelligence platforms in which several common types of AI services are split up and provided, individually or as packages, in the cloud. This service model is similar to an AI-themed mall: all developers can access one or more of the platform's artificial intelligence services through an Application Programming Interface (API), and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services.
According to the embodiment of the invention, the training of the entity recognition model can be realized by calling the artificial intelligence cloud service.
The embodiment of the invention provides a method and a device for identifying an entity in a text, electronic equipment and a computer readable storage medium, which can effectively improve the efficiency and accuracy of entity identification. An exemplary application of the method for identifying an entity in a text provided by the embodiment of the present invention is described below, and the method for identifying an entity in a text provided by the embodiment of the present invention may be implemented by various electronic devices, for example, may be implemented by a terminal, may also be implemented by a server or a server cluster, or may be implemented by cooperation of a terminal and a server.
In the following, an embodiment of the present invention is described by taking a terminal and a server as an example, referring to fig. 1, fig. 1 is a schematic structural diagram of an entity recognition system 100 in text provided by the embodiment of the present invention. The system 100 for recognizing entities in text includes: the server 200, the network 300, the terminal 400, and the client 410 operating in the terminal 400 will be described separately.
The server 200 is a background server of the client 410 and is used for receiving the text sent by the client 410; and also for identifying entities and types of entities in the text (a process of identifying entities and types of entities in the text will be described in detail below), and acquiring corresponding resources (e.g., answer sentences, songs or movies, etc.) from a database or a network according to the identified entities and types of entities, and transmitting the resources to the client 410.
The network 300 is used as a medium for communication between the server 200 and the terminal 400, and may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used for running a client 410, and the client 410 is various Applications (APP) with entity identification function, such as a voice assistant APP, a music APP, or a video APP. The client 410 is configured to send a text to the server 200, obtain a resource of the text sent by the server 200, and display the resource to the user (e.g., present a response sentence, play a song, play a video, or the like).
It should be noted that the client 410 can determine the entity and the type of the entity in the text by calling the entity identification service of the server 200; the entities and types of entities in the text may also be determined by invoking an entity identification service of the terminal 400.
As an example, the client 410 invokes the entity recognition service of the terminal 400 to recognize an entity and a type of the entity in the text; and sending a corresponding request to the server 200 according to the identified entity and the type of the entity, so that the server 200 acquires a corresponding resource (for example, answer sentence, song or movie, etc.) from a database or a network according to the request, receives the resource returned by the server 200, and displays the resource to the user (for example, presents answer sentence, plays song or plays video, etc.).
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present invention.
The embodiment of the invention can be applied to various scenes, such as a response scene, a video playing scene or a music playing scene.
Taking the answer scenario as an example, the client 410 may be a voice assistant APP. The client 410 calls a microphone of the terminal 400 to collect a voice query sentence of the user and performs voice recognition on the voice query sentence to obtain a corresponding text; the client 410 sends the text to the server 200; the server 200 identifies the text, obtains a corresponding entity and an entity type, queries the entity type in the knowledge graph to obtain a response sentence, and sends the response sentence to the client 410; the client 410 broadcasts the answer sentence in the form of voice broadcasting.
Or, the client 410 calls a microphone of the terminal 400 to collect a voice query sentence of the user, and performs voice recognition on the voice query sentence to obtain a corresponding text; the client 410 calls the entity recognition service of the terminal 400 to recognize the text, obtains the corresponding entity and entity type, queries in the knowledge graph to obtain the response sentence, and broadcasts the response sentence in a voice broadcast manner.
Taking the music playing scenario as an example, the client 410 may be a music APP. The client 410 calls the microphone of the terminal 400 to collect the user's voice instruction, such as "play lubinghua", and performs speech recognition on the instruction to obtain the corresponding text; the client 410 sends the text to the server 200; the server 200 recognizes the text and obtains the corresponding entity, "lubinghua", and its type, the song-name type; the server 200 searches a database or the network for the song "lubinghua", obtains the corresponding song resource, and sends it to the client 410; the client 410 plays the song "lubinghua".
Alternatively, the client 410 calls the microphone of the terminal 400 to collect the user's voice instruction, for example "play lubinghua", and performs speech recognition on the instruction to obtain the corresponding text; the client 410 calls the entity recognition service of the terminal 400 to recognize the text and obtains the corresponding entity, "lubinghua", and its type, the song-name type; the client 410 sends a song acquisition request to the server 200, so that the server 200 searches a database or the network for the song "lubinghua", obtains the corresponding song resource, and returns it to the client 410; the client 410 plays the song "lubinghua".
Next, a structure of an electronic device for entity identification according to an embodiment of the present invention is described, where the electronic device may be the server 200 or the terminal 400 shown in fig. 1. The following describes a structure of the electronic device by taking the electronic device as the server 200 shown in fig. 1 as an example, referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present invention, and the electronic device 500 shown in fig. 2 includes: at least one processor 510, memory 540, and at least one network interface 520. The various components in the electronic device 500 are coupled together by a bus system 530. It is understood that the bus system 530 is used to enable communications among the components. The bus system 530 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 530 in FIG. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 540 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 540 described in connection with embodiments of the present invention is intended to comprise any suitable type of memory. Memory 540 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 540 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
An operating system 541 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and for handling hardware-based tasks;
a network communication module 542 for reaching other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
in some embodiments, the in-text entity recognition apparatus provided in the embodiments of the present invention may be implemented in software, and fig. 2 shows the in-text entity recognition apparatus 543 stored in the memory 540, which may be software in the form of programs and plug-ins, and includes the following software modules: a feature extraction module 5431, a dictionary module 5432, a segmentation module 5433, a concatenation module 5434, and a recognition module 5435. These modules may be logical functional modules and thus may be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the in-text entity identifying Device 543 provided by the embodiments of the present invention may be implemented by a combination of hardware and software, and as an example, the Device provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the in-text entity identifying method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In the following, the method for recognizing an entity in text provided by the embodiment of the present invention implemented by the server 200 in fig. 1 is described as an example. Referring to fig. 3, fig. 3 is a flowchart illustrating a method for identifying an entity in text according to an embodiment of the present invention, and will be described with reference to the steps shown in fig. 3.
In step S101, feature extraction is performed on the text to obtain the character feature vector corresponding to each character in the text.
In some embodiments, a mapping dictionary is queried for the numeric identifier (ID) corresponding to each character in the text, and the numeric identifier of each character is converted into vector form to obtain the character feature vector of that character.
As an example, the server 200 includes a mapping dictionary that maps characters to numeric IDs and can convert each Chinese character in the text into a corresponding numeric ID; the numeric ID corresponding to each character in the text is queried in the mapping dictionary through a feature extraction network, and the queried numeric ID is converted into vector form to obtain the character feature vector of each character in the text. By accurately extracting the character feature vector of each character according to the mapping dictionary, the embodiment of the present invention improves the accuracy of subsequent entity recognition.
Here, the above feature extraction network can extract not only the character feature vector of each character in the text but also features representing the type of the text (i.e., features representing the user's intention), and classifying the features representing the type of the text yields a classification result (the classification process is described in detail in step S102). For example, when the text is "play Ice Rain", the features representing the type of the text are extracted through the feature extraction network, and classifying those features yields the result that the text belongs to the music type. In this way, the entity dictionary corresponding to the type of the text can be selected according to the extracted features, which facilitates determining the dictionary vector of each character in the text in step S102.
Taking the text "forgetting to water first coming" as an example, the text is subjected to feature extraction processing to obtain a word feature vector corresponding to "coming", a word feature vector corresponding to "one", a word feature vector corresponding to "first", a word feature vector corresponding to "forgetting", a word feature vector corresponding to "feeling", and a word feature vector corresponding to "water".
The embodiment of the present invention automatically extracts the character feature vector of each character from the text without manually constructing features, which automates the feature engineering and, compared with the related art, greatly reduces its workload.
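A minimal sketch of this step follows, assuming a toy mapping dictionary and a randomly initialized embedding table that stands in for whatever vectorization the feature extraction network applies; all names, dimensions, and values are hypothetical.

```python
import numpy as np

# Step S101 sketch: map each character to a numeric ID via a mapping dictionary,
# then look the ID up in an embedding table to obtain a character feature vector.
mapping_dict = {"<unk>": 0, "come": 1, "one": 2, "first": 3,
                "forget": 4, "love": 5, "water": 6}
embed_dim = 8
rng = np.random.default_rng(0)
char_embeddings = rng.normal(size=(len(mapping_dict), embed_dim))  # placeholder embedding table

def char_feature_vectors(chars):
    ids = [mapping_dict.get(c, mapping_dict["<unk>"]) for c in chars]  # character -> numeric ID
    return char_embeddings[ids]                                        # numeric ID -> vector

text = ["come", "one", "first", "forget", "love", "water"]
vectors = char_feature_vectors(text)   # shape (6, 8): one feature vector per character
```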
In step S102, a dictionary vector corresponding to each character in the text is determined according to the entity dictionary corresponding to the text.
Here, the entity dictionary may be an unambiguous named entity dictionary, i.e. named entities comprised in the entity dictionary have only unique meanings. Entity instances contained in the entity dictionary support user customization.
In some embodiments, the type to which the text belongs is determined, where the type includes at least one of: a music class; a video class; a radio station class; a place name class; a character class. The entity dictionary corresponding to the type of the text is then determined, and the dictionary vector corresponding to each character in the text is queried in that entity dictionary.
As an example, the specific process of determining the type to which the text belongs is: extracting the characteristics representing the type of the text through a characteristic extraction network; mapping the characteristics representing the types of the texts into probabilities respectively belonging to the types of different texts; and determining the type of the text corresponding to the maximum probability as the type of the text. Therefore, the entity dictionary corresponding to the type of the text can be accurately obtained, so that the dictionary vector corresponding to each character in the text can be accurately extracted, and the accuracy of subsequently classifying the extracted features is improved.
For example, extracting the features representing the type of the text through a feature extraction network; receiving the characteristics representing the type of the text through an input layer of the classification network, and transmitting the characteristics to a hidden layer of the classification network; mapping the characteristics of the type of the representation text through an activation function of a hidden layer of the classification network, and continuously carrying out forward propagation on the vector obtained by mapping in the hidden layer; receiving vectors propagated by the hidden layer through an output layer of the classification network, and mapping the vectors into confidence coefficients belonging to different text types through an activation function of the output layer; and determining the type corresponding to the maximum confidence coefficient as the type of the text.
In some embodiments, the server 200 stores a plurality of entity dictionaries of different types, for example an entity dictionary of the music class, of the video class, of the radio station class, of the place name class, and of the character class. The entity dictionary corresponding to the text is selected from these dictionaries; the dictionary feature corresponding to each character in the text is looked up in the selected entity dictionary; and the queried dictionary features are converted into vector form to obtain the dictionary vector of each character in the text. By selecting the entity dictionary consistent with the type of the text, the embodiment of the present invention can accurately extract the dictionary vector of each character, which improves the accuracy of subsequent entity recognition; it also supports user-defined entity dictionaries, which further improves the accuracy of recognizing cold-start entities.
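A minimal sketch of this lookup follows, assuming toy entity dictionaries and a simple per-character membership flag as the dictionary feature; the patent does not specify the exact encoding, so this is only an illustrative assumption.

```python
# Step S102 sketch: pick the entity dictionary that matches the predicted text
# type, then derive a per-character dictionary feature (here, a flag marking
# whether the character occurs in any entry of the selected dictionary).
entity_dicts = {
    "music": ["forget love water", "ice rain"],   # hypothetical entries
    "video": ["some movie title"],
}

def dictionary_vectors(chars, text_type):
    entries = entity_dicts.get(text_type, [])
    dict_chars = {c for entry in entries for c in entry.split()}
    # 1.0 if the character appears in the selected entity dictionary, else 0.0
    return [[1.0] if c in dict_chars else [0.0] for c in chars]

print(dictionary_vectors(["come", "one", "first", "forget", "love", "water"], "music"))
# [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
```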
In step S103, word segmentation is performed on the text to obtain the word corresponding to each character in the text, and the word vector corresponding to each word is determined.
In some embodiments, a word vector model is invoked to perform the following operations on the text: intercepting, from the text, a plurality of words of a preset length; encoding each intercepted word to obtain a plurality of coding sequences in one-to-one correspondence with the words; and mapping the coding sequence corresponding to each word to the word vector of the corresponding word.
Here, the word vector model may be any language model capable of converting words into corresponding word vectors, such as a Word2vec model, a GloVe model, or a Bidirectional Encoder Representations from Transformers (BERT) model. The preset value may be user-defined, or determined according to the number of characters in the text, the number of characters being proportional to the preset value.
Taking the text as "first-come forgetting water" and the preset value is 2 as an example, the words in the text are "first", "forgetting", "case" and "water", respectively, and words respectively corresponding to "first", "forgetting", "case" and "water" are intercepted in the text, for example, a word corresponding to "first" is "first", a word corresponding to "first" is "first forgetting", a word corresponding to "forgetting" is "forgetting", a word corresponding to "case" is "case water" and a word corresponding to "water" is "water #" (here, since "water" is the last word in the text, words are grouped by "water" and wildcard "#" to obtain "water #"); encoding each intercepted word, for example, One-Hot (One-Hot) encoding, to obtain encoding sequences corresponding to "first", "first forgetting", "water-loving", and "water #", respectively; the code sequences corresponding to "come", "one", "head", "forget", "love", and "water #" are mapped to corresponding word vectors, so that word vectors corresponding to each word (i.e., "come", "one", "head", "forget", "love", and "water") in the text can be obtained.
It should be noted that the word corresponding to a character need not be intercepted as that character plus the next adjacent character; any word containing the character and having a preset length may be intercepted as the word corresponding to it. For example, the word corresponding to "forget" could be "first forget", "forget love", or "forget love water", and so on. In this way, the diversity of the intercepted words is improved, the training samples of the model are increased, and overfitting of the model is avoided.
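A minimal sketch of the word interception and encoding described above follows, with a random projection standing in for a trained word vector model such as Word2vec; the preset length, padding symbol, and dimensions are illustrative assumptions.

```python
import numpy as np

# Step S103 sketch: for each character, intercept a word of a preset length
# starting at that character (padding the tail with "#"), then map each word
# to a word vector via one-hot coding sequences and a projection.
def intercept_words(chars, preset_len=2, pad="#"):
    padded = chars + [pad] * (preset_len - 1)
    return [" ".join(padded[i:i + preset_len]) for i in range(len(chars))]

words = intercept_words(["come", "one", "first", "forget", "love", "water"])
# ['come one', 'one first', 'first forget', 'forget love', 'love water', 'water #']

vocab = {w: i for i, w in enumerate(sorted(set(words)))}
one_hot = np.eye(len(vocab))[[vocab[w] for w in words]]   # coding sequences, one per word
rng = np.random.default_rng(0)
projection = rng.normal(size=(len(vocab), 16))            # stand-in for the trained word vector model
word_vectors = one_hot @ projection                       # one 16-dim word vector per character
```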
In step S104, the character feature vector, dictionary vector, and word vector corresponding to each character are spliced to obtain the splicing vector corresponding to each character.
In some embodiments, the character feature vector, dictionary vector, and word vector corresponding to the same character are determined and concatenated end to end to obtain the splicing vector of that character; the dimension of the splicing vector is the sum of the dimensions of the character feature vector, the dictionary vector, and the word vector.
As an example, the character feature vector, dictionary vector, and word vector corresponding to the same character are determined, together with the dimensions of each vector and the scalar in each dimension; every dimension contained in the character feature vector, the dictionary vector, and the word vector is stacked, and each stacked dimension is filled with the corresponding scalar, yielding the splicing vector of the character. Classifying the splicing vectors then labels the characters in the text with categories that integrate multiple kinds of information, which improves both the accuracy and the efficiency of entity recognition.
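A minimal sketch of the splicing step follows, assuming illustrative dimensions for the three vectors.

```python
import numpy as np

# Step S104 sketch: concatenate, for each character, its character feature
# vector, dictionary vector, and word vector end to end; the result's dimension
# is the sum of the three input dimensions.
def concat_per_char(char_vecs, dict_vecs, word_vecs):
    return [np.concatenate([c, d, w]) for c, d, w in zip(char_vecs, dict_vecs, word_vecs)]

char_vecs = [np.ones(8)] * 6    # e.g. 8-dim character feature vectors
dict_vecs = [np.ones(1)] * 6    # e.g. 1-dim dictionary vectors
word_vecs = [np.ones(16)] * 6   # e.g. 16-dim word vectors
joint = concat_per_char(char_vecs, dict_vecs, word_vecs)
print(joint[0].shape)           # (25,) = 8 + 1 + 16
```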
In step S105, a label corresponding to each character is determined according to the concatenation vector corresponding to each character.
Here, the tag is used to indicate the type of an entity to which the text belongs and the position of the text in the belonging entity, or to indicate that the text belongs to an unrelated character. The type of entity includes at least one of: a music class; a video class; radio stations; a place name class; a character class. For example, the label "B-song" characterizes that the position of the word in the belonging entity is the first character and the type of the entity to which the word belongs is the music class; the label "I-per" indicates that the position of the word in the belonging entity is a non-initial character, and the type of the belonging entity of the word is a character class; the label "O" characterizes that the word does not belong to any type and does not belong to any entity, i.e. an unrelated character.
In some embodiments, referring to fig. 4, fig. 4 is a flowchart illustrating a method for identifying an entity in text according to an embodiment of the present invention, and step S105 shown in fig. 3 may be further specifically implemented by steps S1051 to S1052.
In step S1051, the concatenation vector corresponding to each character is mapped to probabilities of belonging to different candidate tags, respectively.
Here, the candidate tag is used to indicate the type of the entity to which the text belongs and the position of the text in the belonging entity, or to indicate that the text belongs to an unrelated character.
In some embodiments, the concatenation vector corresponding to each word is received by an input layer of the classification network and propagated to a hidden layer of the classification network; mapping the splicing vector corresponding to each character through an activation function of a hidden layer of a classification network, and continuously carrying out forward propagation on the vector obtained by mapping in the hidden layer; vectors propagated by the hidden layer are received by an output layer of the classification network and are mapped to confidence degrees belonging to different candidate tags by an activation function of the output layer.
In step S1052, the candidate label corresponding to the maximum probability is determined as the label corresponding to the character.
In some embodiments, the candidate label corresponding to the maximum confidence level is determined as the label corresponding to the text.
As an example, when the splicing vector corresponding to "forget" in the text "come one first forget love water" is mapped to the candidate tag "B-song" with probability 0.5, to the candidate tag "I-movie" with probability 0.3, and to the candidate tag "O" with probability 0.2, the candidate tag "B-song" with the highest probability is determined to be the label of "forget". In this way, the label of every character in the text can be determined: the label of "come" is "O", the label of "one" is "O", the label of "first" is "O", the label of "forget" is "B-song", the label of "love" is "I-song", and the label of "water" is "I-song". The entity in the text and its type can then be determined from the label of each character.
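A minimal sketch of this per-character labeling follows, using a random linear layer plus softmax as a stand-in for the trained classification network; the tag set, dimensions, and weights are illustrative assumptions.

```python
import numpy as np

# Steps S1051-S1052 sketch: map each splicing vector to probabilities over the
# candidate tags and pick the tag with the highest probability.
tags = ["O", "B-song", "I-song"]
rng = np.random.default_rng(0)
W = rng.normal(size=(25, len(tags)))   # 25 = dimension of the splicing vector
b = np.zeros(len(tags))

def label_characters(splicing_vectors):
    labels = []
    for v in splicing_vectors:
        logits = v @ W + b
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax over candidate tags
        labels.append(tags[int(np.argmax(probs))])      # take the most probable tag
    return labels

demo = [rng.normal(size=25) for _ in range(6)]   # stand-ins for six splicing vectors
print(label_characters(demo))
```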
The embodiment of the present invention labels the category of each character in the text through a classifier; the classification process is simple and fast, which improves the efficiency of entity recognition. However, because the classifier labels each character in isolation without considering the labeling of the text as a whole, two labels both marking the first character of the same entity type may appear in the same text — for example, the labels of both "forget" and "love" could be "B-song" — which would produce errors in entity recognition. For this, the embodiment of the present invention provides an alternative labeling method, described in detail below.
In other embodiments, referring to fig. 5, fig. 5 is a flowchart illustrating a method for identifying an entity in a text according to an embodiment of the present invention, and step S105 shown in fig. 3 may be further implemented through step S1053 to step S1054.
In step S1053, the splicing vector corresponding to each character is mapped to a plurality of different candidate tags, and the transfer score corresponding to mapping the splicing vector to each candidate tag is determined.
Here, the transfer score characterizes the degree of match between the character and the corresponding candidate tag: the larger the transfer score, the greater the probability that the character belongs to that candidate tag.
Taking the text "come one first forget love water" as an example, "forget" is mapped to the candidate tags "B-song", "O", and "I-song"; the transfer score of mapping "forget" to "B-song" is 0.5, the transfer score of mapping "forget" to "I-song" is 0.3, and the transfer score of mapping "forget" to "O" is 0.2, indicating that "forget" most probably belongs to the tag "B-song" and least probably to the tag "O". In the same way, the transfer scores of mapping "come", "one", "first", "forget", "love", and "water" to the candidate tags "B-song", "O", and "I-song" can be determined one by one.
Here, the candidate tags are not limited to "B-song", "O", and "I-song", and may include any type of entity, for example, "B-movie", "B-per", and "I-fm", and the like, and the embodiment of the present invention is not limited thereto.
In step S1054, a label corresponding to each character is determined according to the plurality of candidate labels corresponding to each character and the corresponding transfer score.
In some embodiments, according to the appearance sequence of each character in the text, selecting candidate tags from a plurality of candidate tags corresponding to each character for a plurality of times, and combining the candidate tags selected each time to obtain a plurality of different candidate tag sequences; a plurality of candidate tags contained in the candidate tag sequence selected each time belong to different characters, and the number of the contained candidate tags is the same as the number of words of the text; accumulating the transfer scores corresponding to each candidate label in the candidate label sequence to obtain an integral transfer score; and determining a plurality of candidate labels contained in the candidate label sequence with the maximum integral transfer score as the label categories corresponding to the corresponding characters.
It should be noted that the larger the overall transfer score corresponding to the candidate tag sequence is, the higher the matching degree between the candidate tag sequence and the characters in the text is, and the higher the association degree between the candidate tag corresponding to the characters in the text and the candidate tag corresponding to the adjacent characters is.
Taking as an example the text "come one first forget love water" with the candidate tags "B-song", "O", and "I-song", each of the characters "come", "one", "first", "forget", "love", and "water" corresponds to three candidate tags. Candidate tags are selected repeatedly from the 3 candidate tags of the 6 different characters, and the tags selected each time are combined, giving 3^6 = 729 different candidate tag sequences. The overall transfer scores of the 729 candidate tag sequences are computed, and the sequence with the largest overall transfer score, namely {"O", "O", "O", "B-song", "I-song", "I-song"}, is selected; the candidate tags contained in that sequence are determined as the labels of the corresponding characters, i.e., the label of "come" is "O", the label of "one" is "O", the label of "first" is "O", the label of "forget" is "B-song", the label of "love" is "I-song", and the label of "water" is "I-song".
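A minimal sketch of this sequence scoring follows, with made-up transfer scores and a brute-force enumeration of the 3^6 = 729 candidate tag sequences; in practice the scores come from the model, and a CRF-style decoder such as Viterbi would avoid the brute force.

```python
from itertools import product

# Steps S1053-S1054 sketch: enumerate every candidate tag sequence, accumulate
# the transfer scores of its tags, and keep the sequence with the highest
# overall score. The score table below is illustrative only.
tags = ["O", "B-song", "I-song"]
chars = ["come", "one", "first", "forget", "love", "water"]
scores = [  # scores[i][t] = transfer score of assigning tag t to the i-th character
    {"O": 0.9, "B-song": 0.05, "I-song": 0.05},   # come
    {"O": 0.9, "B-song": 0.05, "I-song": 0.05},   # one
    {"O": 0.8, "B-song": 0.1,  "I-song": 0.1},    # first
    {"O": 0.2, "B-song": 0.5,  "I-song": 0.3},    # forget
    {"O": 0.1, "B-song": 0.2,  "I-song": 0.7},    # love
    {"O": 0.1, "B-song": 0.2,  "I-song": 0.7},    # water
]

best_seq = max(product(tags, repeat=len(chars)),              # all 3**6 = 729 sequences
               key=lambda seq: sum(scores[i][t] for i, t in enumerate(seq)))
print(best_seq)   # ('O', 'O', 'O', 'B-song', 'I-song', 'I-song')
```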
When each character in the text is labeled in this way, not only the probability that each character belongs to its label is considered but also the overall optimum for the whole text, which avoids the label bias problem. Specifically, the embodiment of the present invention considers the association between the label of each character and the labels of adjacent characters, so that, within one entity type in the labeling of a text, a label marking the first character of an entity cannot appear after a label marking a non-first character of that entity; it also considers the association between each character's label and the text as a whole, so that two labels both marking the first character of the same entity cannot appear in the labeling of the same text.
In step S106, the entity and the type of the entity in the text are determined according to the label corresponding to each word.
Here, the type of the entity includes at least one of: a music class; a video class; a radio station class; a place name class; a character class. Each of these types is a major category, and each major category may include minor categories; for example, the music class includes a singer class, a song class, an album class, and so on, and the video class includes a movie class, a television series class, an actor class, a cartoon class, and so on. Step S106 can identify not only the major category to which an entity belongs but also the minor category.
In some embodiments, words in the text, which have consecutive positions and corresponding labels indicated as the same entity type, are identified as the same entity, and the entity type indicated by the labels is identified as the same entity type.
As an example, the character whose label indicates the beginning position of an entity is determined as the first character of that entity in the text; the characters whose labels indicate the same entity type as the first character are then traversed; when the label of a traversed character indicates a position inside the entity, the traversed character is determined to be a non-first character of the entity; the first character and the non-first characters together are determined as the entity, and the entity type jointly indicated by their labels is determined as the type of the entity.
Taking the text "come one first forget love water" as an example, if the label of "come" is "O", the label of "one" is "O", the label of "first" is "O", the label of "forget" is "B-song", the label of "love" is "I-song", and the label of "water" is "I-song", it can be determined that the entity in the text is "forget love water" and that its type is the song class within the music category.
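A minimal sketch of this decoding step follows, grouping consecutive characters whose labels share an entity type into one entity; the character glosses and labels follow the running example.

```python
# Step S106 sketch: merge a "B-X" label and the following "I-X" labels into one
# entity of type X; "O" or an inconsistent tag ends the current entity.
def decode_entities(chars, labels):
    entities, current, current_type = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [ch], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == current_type:
            current.append(ch)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

chars = ["come", "one", "first", "forget", "love", "water"]
labels = ["O", "O", "O", "B-song", "I-song", "I-song"]
print(decode_entities(chars, labels))   # [('forget love water', 'song')]
```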
The embodiment of the present invention automatically extracts the feature vectors of the characters in the text without manually constructing features, which improves the efficiency of entity recognition. The character feature vector, dictionary vector, and word vector of each character are spliced and labels are assigned on the basis of the spliced vectors, which reduces errors in recognizing entities and entity types in the text and improves the accuracy of recognition.
Continuing with the description of the structure of the electronic device 500 in conjunction with FIG. 2, in some embodiments, as shown in FIG. 2, the software modules stored in the text entity recognition device 543 of the memory 540 may include: a feature extraction module 5431, a dictionary module 5432, a segmentation module 5433, a concatenation module 5434, and a recognition module 5435.
A feature extraction module 5431, configured to perform feature extraction processing on a text to obtain a word feature vector corresponding to each word in the text;
the dictionary module 5432 is configured to determine, according to the entity dictionary corresponding to the text, a dictionary vector corresponding to each word in the text;
a word segmentation module 5433, configured to perform word segmentation processing on the text to obtain a word corresponding to each word in the text, and determine a word vector corresponding to each word;
a concatenation module 5434, configured to perform concatenation processing on the word feature vector, the dictionary vector, and the word vector corresponding to each word to obtain a concatenation vector corresponding to each word;
an identifying module 5435, configured to determine, according to the concatenation vector corresponding to each word, a label corresponding to each word, and determine, according to the label corresponding to each word, an entity in the text and a type of the entity.
In some embodiments, the feature extraction module 5431 is further configured to look up a mapping dictionary for a numeric identifier corresponding to each word in the text; and converting the digital identifier corresponding to each character into a vector form to obtain a character feature vector corresponding to each character.
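As an illustration only, the lookup-and-convert step could be sketched as follows in Python; the mapping dictionary, the unknown-character entry "[UNK]" and the randomly initialized embedding table are hypothetical stand-ins for the trained parameters of the embodiment.

```python
import numpy as np

# Hypothetical mapping dictionary and embedding table; in practice the dictionary
# would cover all characters seen in the training set.
char_to_id = {"[UNK]": 0, "来": 1, "一": 2, "首": 3, "忘": 4, "情": 5, "水": 6}
embedding_dim = 768
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(char_to_id), embedding_dim))

def char_feature_vectors(text):
    """Look up the numeric ID of each character, then convert it to a vector."""
    ids = [char_to_id.get(ch, char_to_id["[UNK]"]) for ch in text]
    return embedding_table[ids]          # shape: (len(text), embedding_dim)

vectors = char_feature_vectors("来一首忘情水")
print(vectors.shape)                     # (6, 768)
```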
In some embodiments, the dictionary module 5432 is further configured to determine a type to which the text belongs; determining an entity dictionary corresponding to the type of the text; and in the entity dictionary, inquiring dictionary vectors corresponding to each character in the text.
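The exact form of the dictionary vector is not fixed here; as one hedged sketch, it could be a per-character indicator of which entity types in the selected entity dictionary cover that character. The dictionary contents and type names below are hypothetical.

```python
import numpy as np

# Hypothetical entity dictionary for the music domain; the real dictionary is
# selected according to the type of the text.
entity_dictionary = {"song": {"忘情水", "青花瓷"}, "singer": {"刘德华"}}
dict_types = sorted(entity_dictionary)    # fixed ordering of sub-types

def dictionary_vectors(text):
    """Mark, per character, the entity types whose dictionary entries cover it."""
    vectors = np.zeros((len(text), len(dict_types)))
    for t_idx, etype in enumerate(dict_types):
        for entry in entity_dictionary[etype]:
            start = text.find(entry)
            while start != -1:
                vectors[start:start + len(entry), t_idx] = 1.0
                start = text.find(entry, start + 1)
    return vectors                        # shape: (len(text), number of types)

print(dictionary_vectors("来一首忘情水"))
```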
In some embodiments, the word segmentation module 5433 is further configured to invoke a word vector model to perform the following operations on the text: intercepting a plurality of words with preset word number and length in the text; coding each intercepted word to obtain a plurality of coding sequences in one-to-one correspondence; mapping the coding sequence corresponding to each of the words to a word vector for the corresponding word.
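A minimal sketch of this interception-and-mapping step is given below, assuming a preset word length of two characters (bi-grams) and a hypothetical pretrained table bigram_vectors; unknown bi-grams simply fall back to zero vectors in this illustration.

```python
import numpy as np

# Hypothetical pretrained bi-gram word-vector table (left empty here); the preset
# length of 2 corresponds to the Bi-gram features mentioned later in the text.
bigram_dim = 200
bigram_vectors = {}                       # e.g. {"忘情": np.array([...]), ...}

def word_vectors(text, n=2):
    """Intercept the n-character word covering each position and map it to a vector."""
    out = np.zeros((len(text), bigram_dim))
    for i in range(len(text)):
        word = text[max(0, i - n + 1):i + 1]      # the n-gram ending at character i
        out[i] = bigram_vectors.get(word, np.zeros(bigram_dim))
    return out                            # one word vector per character

print(word_vectors("来一首忘情水").shape)   # (6, 200)
```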
In some embodiments, the concatenation module 5434 is further configured to determine a word feature vector, a dictionary vector, and a word vector corresponding to the same word; and superposing the character feature vector, the dictionary vector and each dimension of the word vector, and filling a scalar corresponding to the dimension in the superposed dimension to obtain a splicing vector corresponding to the character.
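This can be read as stacking the dimensions of the three vectors for the same character end to end, each dimension of the result being filled with the corresponding scalar. A minimal sketch follows; the 768/40/200 dimensions are taken from the exemplary application later in this document and are illustrative here.

```python
import numpy as np

char_vec = np.zeros(768)   # character feature vector for one character
dict_vec = np.zeros(40)    # dictionary vector for the same character
word_vec = np.zeros(200)   # word vector for the same character

# Stack the dimensions of the three vectors; each dimension of the result carries
# the scalar from the corresponding source vector.
concat_vec = np.concatenate([char_vec, dict_vec, word_vec])
print(concat_vec.shape)    # (1008,)
```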
In some embodiments, the identifying module 5435 is further configured to map the concatenation vector corresponding to each word to probabilities of belonging to different candidate tags, respectively; the candidate label is used for indicating the type of an entity to which the word belongs and the position of the word in the belonging entity, or indicating that the word belongs to an irrelevant character; and determining the candidate label corresponding to the maximum probability as the label corresponding to the character.
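One common way to realize such a mapping is a linear layer followed by a softmax over the candidate labels, taking the label with the largest probability. The sketch below uses hypothetical candidate labels, an illustrative input dimension and randomly initialized weights; it is only an illustration, not the classifier of the embodiment.

```python
import numpy as np

candidate_labels = ["O", "B-song", "I-song", "B-singer", "I-singer"]
input_dim = 1008                                           # illustrative only
rng = np.random.default_rng(0)
W = rng.normal(size=(input_dim, len(candidate_labels)))    # hypothetical weights
b = np.zeros(len(candidate_labels))

def label_of(concat_vec):
    """Map a spliced vector to per-label probabilities and pick the largest."""
    scores = concat_vec @ W + b
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                   # softmax
    return candidate_labels[int(np.argmax(probs))], probs

label, probs = label_of(np.zeros(input_dim))
print(label)
```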
In some embodiments, the identifying module 5435 is further configured to map the concatenation vector corresponding to each word into a plurality of different candidate labels, and determine a transfer score corresponding to mapping the concatenation vector into each candidate label; the candidate label is used for indicating the type of an entity to which the word belongs and the position of the word in the belonging entity, or indicating that the word belongs to an irrelevant character; and determining the label corresponding to each character according to the plurality of candidate labels corresponding to each character and the corresponding transfer score.
In some embodiments, the identifying module 5435 is further configured to select candidate tags multiple times from multiple candidate tags corresponding to each word according to an appearance sequence of each word in the text, and combine the candidate tags selected each time to obtain multiple different candidate tag sequences; a plurality of candidate tags contained in the candidate tag sequence selected each time belong to different characters, and the number of the contained candidate tags is the same as the number of words of the text; accumulating the transfer scores corresponding to each candidate tag in the candidate tag sequence to obtain an overall transfer score; and determining a plurality of candidate labels contained in the candidate label sequence with the maximum integral transfer score as label categories corresponding to the corresponding characters.
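The selection described above can be illustrated by the following brute-force sketch, which enumerates candidate label sequences, accumulates the per-character scores and the transfer scores between adjacent labels, and keeps the highest-scoring sequence. The scores are hypothetical, and a practical implementation would use dynamic programming (Viterbi decoding) rather than enumeration.

```python
from itertools import product

labels = ["O", "B-song", "I-song"]

# Hypothetical scores: emission[i][label] is the score of assigning `label` to the
# i-th character; transition[(a, b)] is the score of label b following label a.
emission = [
    {"O": 2.0, "B-song": 0.1, "I-song": 0.1},
    {"O": 0.3, "B-song": 1.5, "I-song": 0.4},
    {"O": 0.2, "B-song": 0.1, "I-song": 1.8},
]
transition = {(a, b): 0.0 for a in labels for b in labels}
transition[("O", "I-song")] = -5.0          # discourage an I- label right after O
transition[("B-song", "B-song")] = -5.0     # discourage two consecutive first-character labels

def best_sequence(emission, transition):
    """Enumerate all label sequences and return the one with the highest total score."""
    best, best_score = None, float("-inf")
    for seq in product(labels, repeat=len(emission)):
        score = sum(emission[i][tag] for i, tag in enumerate(seq))
        score += sum(transition[(seq[i], seq[i + 1])] for i in range(len(seq) - 1))
        if score > best_score:
            best, best_score = seq, score
    return best, best_score

print(best_sequence(emission, transition))  # ('O', 'B-song', 'I-song'), 5.3
```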
In some embodiments, the identification module 5435 is further configured to identify, as one entity, characters in the text that occupy consecutive positions and whose labels indicate the same entity type, and to identify the entity type indicated by those labels as the type of that entity.
Embodiments of the present invention provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform a method for recognizing an entity in text, such as the method for recognizing an entity in text shown in fig. 3, 4 or 5.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions can correspond, but do not necessarily correspond, to files in a file system, and can be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts stored in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
Next, an exemplary application in an actual application scenario will be explained.
For entity extraction (i.e., the entity recognition described above), the related art generally employs a Conditional Random Field (CRF) model. In the related art, a training corpus (i.e., training data) is organized into the model input example shown in fig. 6 and input into a CRF model for training to obtain a trained CRF model; entity recognition is then performed based on the trained CRF model.
Fig. 6 is a schematic diagram of an example of model input according to an embodiment of the present invention. In fig. 6, the first column is a word feature, the second column is a Bi-gram feature (or binary feature), the third column is a part-of-speech feature, the fourth column is an entity information feature, and the fifth column is the prediction tag. It should be noted that the training data carries the tag in the fifth column, whereas the prediction data (or test data) does not, since the purpose of prediction is precisely to predict that fifth-column tag. Here, the tags may be annotated using the BIO labeling scheme.
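Since fig. 6 itself is not reproduced here, the following rows only illustrate the described column layout with hypothetical feature values (the part-of-speech tags and the entity-information column in particular are made up for the illustration); the fifth column is the BIO label that prediction data would lack.

```python
# One training example, one character per row: the first four columns are input
# features, the fifth column is the BIO label (absent at prediction time).
rows = [
    # char  bi-gram  part-of-speech  entity-dict feature  label
    ("来",  "来一",   "v",            "O",                 "O"),
    ("一",  "一首",   "m",            "O",                 "O"),
    ("首",  "首忘",   "q",            "O",                 "O"),
    ("忘",  "忘情",   "v",            "DICT-song",         "B-song"),
    ("情",  "情水",   "n",            "DICT-song",         "I-song"),
    ("水",  "水",     "n",            "DICT-song",         "I-song"),
]
for row in rows:
    print("\t".join(row))
```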
Referring to fig. 7, fig. 7 is a schematic diagram of a model architecture provided in the embodiment of the present invention.
In fig. 7, the model provided in the embodiment of the present invention includes a Bidirectional Long Short-Term Memory (BiLSTM) module, which mainly uses a bidirectional LSTM for deep-learning feature extraction. Since feature capture is a strength of the front-end layers of a deep learning network, the embodiment of the present invention reduces the workload of manual feature engineering compared with the related art. However, BiLSTM has inherent limitations: when the Query is too long, attention to distant characters decreases, and because of the sequential nature of the input, parameters must be trained serially, which makes model training time-consuming.
As can be seen from the above, in the related art the user must manually construct feature engineering, decide which specific features to use, and settle on a configuration only after multiple rounds of testing. The related art may use single-character features, two-character features, part-of-speech features, and the like; during model development and tuning, this feature engineering is very time-consuming and imposes a certain skill threshold on the person tuning the model. The embodiment of fig. 7 introduces a BiLSTM as the feature extraction module, which solves the technical problems in the related art to a certain extent; however, due to the characteristics of the LSTM, it is still limited in feature capture capability and requires serial training, so the improvement in the entity recognition effect is not very obvious.
In view of the above problems, an embodiment of the present invention further provides a method for recognizing an entity in a text, which can improve the entity recognition effect and reduce the workload of feature engineering.
Referring to fig. 8, fig. 8 is a schematic view of an application scenario provided in the embodiment of the present invention. In fig. 8, when designing skill intents, a set of entities related to the relevant skill intents (i.e., the entity dictionary mentioned above) can be defined and imported according to requirements; the entities also support alias configuration to accommodate the diversity of entity expressions. Fig. 8 illustrates the definition and examples of the cartoon entity type (sys.video.cartoon) involved in the Video domain (i.e., the video class described above).
Referring to fig. 9, fig. 9 is a schematic diagram of a model structure provided by an embodiment of the present invention, which will be described with reference to fig. 9.
(1) Feature extraction layer
In some embodiments, the inputs tok1, tok2, …, tokn to the feature extraction layer are the numeric IDs of the corresponding characters of the Query (i.e., the text described above). Here, the feature extraction layer has a mapping dictionary from characters to numeric IDs that contains all the characters in the training set, so each character of the Query can be converted into its corresponding numeric ID.
Here, the feature extraction layer can also construct a Cls symbol representing the whole sentence; its output vector can represent low-dimensional information (i.e., Embedding information, equivalent to the above-mentioned feature characterizing the type of the text) of the whole Query and is generally used for sentence-level text classification.
In some embodiments, the output of the feature extraction layer is the word vector (i.e., the character feature vector described above) for each character of the Query together with the information of the Cls portion. In the entity recognition scenario, only the word vector corresponding to each character of the Query is needed; in a text sentence classification scenario, the output information of the Cls portion can be used directly.
Here, the word vector corresponding to each character and the information of the Cls section are both 768-dimensional.
(2) Intermediate feature layer
In some embodiments, based on the customized entity dictionary, 40-dimensional customized dictionary information corresponding to each character (i.e., the 4th column of features in fig. 6, the entity dictionary feature, equivalent to the dictionary vector described above) is spliced onto the 768-dimensional word vector output by the feature extraction layer. The splicing of the output of the feature extraction layer with the customized dictionary vector is therefore completed in the intermediate feature layer.
In some embodiments, based on the Word2vec algorithm, a Bi-gram word vector (or Bi-gram feature) corresponding to each character of the Query is obtained, where the dimension of the Bi-gram word vector is 200. The intermediate feature layer can then splice the output of the feature extraction layer, the customized dictionary vector and the Bi-gram word vector.
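One way to obtain such 200-dimensional Bi-gram word vectors, assuming a recent version of the gensim library, is to train a Word2vec model over the corpus split into character bi-grams; the corpus and hyper-parameters below are illustrative only and not the training pipeline of the embodiment.

```python
from gensim.models import Word2Vec

# Represent each Query as a sequence of overlapping character bi-grams.
def to_bigrams(query):
    return [query[i:i + 2] for i in range(len(query) - 1)]

corpus = ["来一首忘情水", "我想听忘情水", "播放青花瓷"]
sentences = [to_bigrams(q) for q in corpus]

# Train 200-dimensional bi-gram vectors (hyper-parameters are illustrative only).
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)
print(model.wv["忘情"].shape)   # (200,)
```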
(3) CRF decoding layer
In some embodiments, a 768-dimensional word vector, a 40-dimensional dictionary vector, and a 200-dimensional Bi-gram word vector corresponding to each character of the Query are spliced to obtain a 768 + 40 + 200 = 1008-dimensional vector (i.e., the above-mentioned spliced vector); and the spliced vector is labeled through the CRF decoding layer.
When the CRF decoding layer performs probability labeling on the spliced vector corresponding to each character, it simultaneously considers the vector information of each character and the transition matrix information between labels, so that two "B" labels, each representing the first character of an entity, can be prevented from appearing at the same time; in addition, the CRF decoding layer considers the globally optimal probability of the whole sentence, which solves the label bias problem.
In some embodiments, the training corpus is a crowd-sourced entity corpus, and the test corpus consists of log data from real users, whose distribution matches the real user distribution.
The following explains a comparison of effects between the embodiment of the present invention and the related art.
Referring to tables 1 and 2, tables 1 and 2 are tables for comparing effects between examples of the present invention and the related art. In case one of table 1, the features input to the CRF model are word vectors corresponding to each word of Query output by the feature extraction layer; in the second case, the features input into the CRF model are obtained by splicing the word vector and the dictionary vector of each character corresponding to Query output by the feature extraction layer; in case three, the feature input to the CRF model is a vector obtained by splicing the word vector, the dictionary vector, and the Bi-gram word vector of each character corresponding to Query output by the feature extraction layer.
As can be seen from table 1, although the precision (P value) of the embodiment of the present invention decreases to a certain extent compared with the related art, the recall (R value) increases greatly and the overall comprehensive F value increases significantly. Therefore, the overall effect of the embodiment of the invention is clearly better than that of the related art.
Table 1 comparison of effects between the embodiment of the present invention and the related art
Table 2 comparison of effects between embodiment of the present invention and related art
Compared with the related art, the embodiment of the invention reduces the workload of many feature engineering steps, which helps improve the iteration efficiency of model development.
Based on progress in multi-task learning, similar tasks can be combined to improve the overall effect to a certain extent. Based on this principle, the embodiment of the present invention exploits the similarity of the Query corpora of media-resource intents: it combines the corpora of multiple intents such as the Music class, the Video class and the radio station (FM) class, and uses the similarity of Query expressions among similar intents for corpus enhancement, finding that the entity recognition effect is improved. Meanwhile, increasing the amount of training corpus avoids model overfitting, thereby improving the accuracy of entity recognition.
In some embodiments, for deep learning scenarios that require a large amount of training data, data enhancement techniques may further be added to the embodiments of the present invention. For example, the labeled entity parts of the training data may be swapped with one another to create more training data; the corpus outside the entity labels may be augmented with techniques such as synonym replacement; and additional training data may be mined from user logs. All of these increase the amount of sample data and thus continue to optimize the model.
In summary, the embodiments of the present invention have the following beneficial effects:
1) The character feature vector corresponding to each character can be automatically extracted from the text without manually constructing features, realizing automated feature engineering; compared with the related art, this greatly reduces the workload.
2) Selecting an entity dictionary consistent with the type of the text allows the dictionary vector corresponding to each character in the text to be extracted accurately, which improves the accuracy of subsequent entity recognition.
3) Classifying the spliced vectors labels each character in the text on the basis of multiple combined dimensions, which improves both the accuracy and the efficiency of entity recognition.
4) Labeling each character in the text through a classifier keeps the classification process simple and fast, which improves the efficiency of entity recognition.
5) When each character in the text is labeled, both the probability that the character belongs to each candidate label and the globally optimal probability over the whole text are considered, which avoids the label bias problem. Specifically, the embodiment of the present invention considers the degree of association between the label of each character and the labels of its adjacent characters, so as to avoid a label indicating the first character of an entity appearing after a label indicating a non-first character of an entity of the same type within the same labeled text; the degree of association between each character's label and the text as a whole is also considered, so as to avoid two labels that both indicate the first character of the same entity appearing within the same labeled text. This reduces errors in recognizing entities and entity types in the text and improves recognition accuracy.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.
Claims (10)
1. A method for recognizing an entity in text, the method comprising:
performing feature extraction processing on a text to obtain a character feature vector corresponding to each character in the text;
determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
performing word segmentation processing on the text to obtain words corresponding to each character in the text, and determining word vectors corresponding to each word;
splicing the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
determining the label corresponding to each character according to the splicing vector corresponding to each character, and
and determining the entity in the text and the type of the entity according to the label corresponding to each character.
2. The method of claim 1, wherein the performing feature extraction on the text to obtain a word feature vector corresponding to each word in the text comprises:
querying a mapping dictionary for a digital identifier corresponding to each character in the text;
and converting the digital identifier corresponding to each character into a vector form to obtain a character feature vector corresponding to each character.
3. The method of claim 1, wherein determining a dictionary vector corresponding to each word in the text from the entity dictionary corresponding to the text comprises:
determining a type to which the text belongs;
determining an entity dictionary corresponding to the type of the text;
and in the entity dictionary, inquiring dictionary vectors corresponding to each character in the text.
4. The method of claim 1, wherein the performing word segmentation on the text to obtain a word corresponding to each word in the text and determining a word vector corresponding to each word comprises:
intercepting a plurality of words with preset word number and length in the text;
coding each intercepted word to obtain a plurality of coding sequences in one-to-one correspondence;
mapping the coded sequence corresponding to each of the words to a word vector corresponding to the word.
5. The method of claim 1, wherein the concatenating the word feature vector, the dictionary vector, and the word vector corresponding to each word to obtain a concatenated vector corresponding to each word comprises:
determining character feature vectors, dictionary vectors and word vectors corresponding to the same character;
and superposing the character feature vector, the dictionary vector and each dimension contained by the word vector, and filling a scalar corresponding to the dimension in the superposed dimension to obtain a splicing vector corresponding to the character.
6. The method of claim 1, wherein the determining the label corresponding to each word according to the concatenation vector corresponding to each word comprises:
mapping the splicing vector corresponding to each character into probabilities respectively belonging to different candidate labels;
the candidate label is used for indicating the type of an entity to which the word belongs and the position of the word in the belonging entity, or indicating that the word belongs to an irrelevant character;
and determining the candidate label corresponding to the maximum probability as the label corresponding to the character.
7. The method of claim 1, wherein the determining the label corresponding to each word according to the concatenation vector corresponding to each word comprises:
mapping the splicing vector corresponding to each character into a plurality of different candidate labels respectively, and determining a transfer score corresponding to mapping the splicing vector to each candidate label respectively;
the candidate label is used for indicating the type of an entity to which the word belongs and the position of the word in the belonging entity, or indicating that the word belongs to an irrelevant character;
and determining the label corresponding to each character according to the plurality of candidate labels corresponding to each character and the corresponding transfer score.
8. The method of claim 7, wherein determining the label corresponding to each word according to the plurality of candidate labels corresponding to each word and the corresponding transfer score comprises:
selecting candidate labels from a plurality of candidate labels corresponding to each character for a plurality of times according to the appearance sequence of each character in the text, and combining the candidate labels selected each time to obtain a plurality of different candidate label sequences;
a plurality of candidate tags contained in the candidate tag sequence selected each time belong to different characters, and the number of the contained candidate tags is the same as the number of words of the text;
accumulating the transfer scores corresponding to each candidate tag in the candidate tag sequence to obtain an overall transfer score;
and determining a plurality of candidate labels contained in the candidate label sequence with the maximum integral transfer score as label categories corresponding to the corresponding characters.
9. The method of any one of claims 1 to 8, wherein the determining the entity in the text and the type of the entity according to the label corresponding to each word comprises:
identifying, as the same entity, characters in the text that occupy consecutive positions and whose labels indicate the same entity type, and
identifying the entity type indicated by the labels as the type of that entity.
10. An apparatus for in-text entity recognition, the apparatus comprising:
the feature extraction module is used for performing feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text;
the dictionary module is used for determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
the word segmentation module is used for carrying out word segmentation processing on the text to obtain words corresponding to each character in the text and determining word vectors corresponding to each word;
the splicing module is used for splicing the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
and the identification module is used for determining a label corresponding to each character according to the splicing vector corresponding to each character, and determining an entity in the text and the type of the entity according to the label corresponding to each character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010533173.5A CN111695345B (en) | 2020-06-12 | 2020-06-12 | Method and device for identifying entity in text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111695345A true CN111695345A (en) | 2020-09-22 |
CN111695345B CN111695345B (en) | 2024-02-23 |
Family
ID=72480580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010533173.5A Active CN111695345B (en) | 2020-06-12 | 2020-06-12 | Method and device for identifying entity in text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111695345B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364656A (en) * | 2021-01-12 | 2021-02-12 | 北京睿企信息科技有限公司 | Named entity identification method based on multi-dataset multi-label joint training |
CN112487813A (en) * | 2020-11-24 | 2021-03-12 | 中移(杭州)信息技术有限公司 | Named entity recognition method and system, electronic equipment and storage medium |
CN112906381A (en) * | 2021-02-02 | 2021-06-04 | 北京有竹居网络技术有限公司 | Recognition method and device of conversation affiliation, readable medium and electronic equipment |
CN112906380A (en) * | 2021-02-02 | 2021-06-04 | 北京有竹居网络技术有限公司 | Method and device for identifying role in text, readable medium and electronic equipment |
CN112988979A (en) * | 2021-04-29 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Entity identification method, entity identification device, computer readable medium and electronic equipment |
CN113505587A (en) * | 2021-06-23 | 2021-10-15 | 科大讯飞华南人工智能研究院(广州)有限公司 | Entity extraction method, related device, equipment and storage medium |
CN113673249A (en) * | 2021-08-25 | 2021-11-19 | 北京三快在线科技有限公司 | Entity identification method, device, equipment and storage medium |
CN113705232A (en) * | 2021-03-03 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Text processing method and device |
CN113868419A (en) * | 2021-09-29 | 2021-12-31 | 中国平安财产保险股份有限公司 | Text classification method, device, equipment and medium based on artificial intelligence |
WO2022078102A1 (en) * | 2020-10-14 | 2022-04-21 | 腾讯科技(深圳)有限公司 | Entity identification method and apparatus, device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165280A (en) * | 2018-09-13 | 2019-01-08 | 安徽倍思特教育科技有限公司 | A kind of information consulting system of educational training |
WO2019024704A1 (en) * | 2017-08-03 | 2019-02-07 | 阿里巴巴集团控股有限公司 | Entity annotation method, intention recognition method and corresponding devices, and computer storage medium |
CN109388795A (en) * | 2017-08-07 | 2019-02-26 | 芋头科技(杭州)有限公司 | A kind of name entity recognition method, language identification method and system |
CN109543181A (en) * | 2018-11-09 | 2019-03-29 | 中译语通科技股份有限公司 | A kind of name physical model combined based on Active Learning and deep learning and system |
CN110502738A (en) * | 2018-05-18 | 2019-11-26 | 阿里巴巴集团控股有限公司 | Chinese name entity recognition method, device, equipment and inquiry system |
CN111079418A (en) * | 2019-11-06 | 2020-04-28 | 科大讯飞股份有限公司 | Named body recognition method and device, electronic equipment and storage medium |
US20200302118A1 (en) * | 2017-07-18 | 2020-09-24 | Glabal Tone Communication Technology Co., Ltd. | Korean Named-Entity Recognition Method Based on Maximum Entropy Model and Neural Network Model |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022078102A1 (en) * | 2020-10-14 | 2022-04-21 | 腾讯科技(深圳)有限公司 | Entity identification method and apparatus, device and storage medium |
CN112487813A (en) * | 2020-11-24 | 2021-03-12 | 中移(杭州)信息技术有限公司 | Named entity recognition method and system, electronic equipment and storage medium |
CN112487813B (en) * | 2020-11-24 | 2024-05-10 | 中移(杭州)信息技术有限公司 | Named entity recognition method and system, electronic equipment and storage medium |
CN112364656A (en) * | 2021-01-12 | 2021-02-12 | 北京睿企信息科技有限公司 | Named entity identification method based on multi-dataset multi-label joint training |
CN112906381A (en) * | 2021-02-02 | 2021-06-04 | 北京有竹居网络技术有限公司 | Recognition method and device of conversation affiliation, readable medium and electronic equipment |
CN112906380A (en) * | 2021-02-02 | 2021-06-04 | 北京有竹居网络技术有限公司 | Method and device for identifying role in text, readable medium and electronic equipment |
CN112906380B (en) * | 2021-02-02 | 2024-09-27 | 北京有竹居网络技术有限公司 | Character recognition method and device in text, readable medium and electronic equipment |
CN112906381B (en) * | 2021-02-02 | 2024-05-28 | 北京有竹居网络技术有限公司 | Dialog attribution identification method and device, readable medium and electronic equipment |
CN113705232B (en) * | 2021-03-03 | 2024-05-07 | 腾讯科技(深圳)有限公司 | Text processing method and device |
CN113705232A (en) * | 2021-03-03 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Text processing method and device |
CN112988979A (en) * | 2021-04-29 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Entity identification method, entity identification device, computer readable medium and electronic equipment |
CN113505587B (en) * | 2021-06-23 | 2024-04-09 | 科大讯飞华南人工智能研究院(广州)有限公司 | Entity extraction method, related device, equipment and storage medium |
CN113505587A (en) * | 2021-06-23 | 2021-10-15 | 科大讯飞华南人工智能研究院(广州)有限公司 | Entity extraction method, related device, equipment and storage medium |
CN113673249A (en) * | 2021-08-25 | 2021-11-19 | 北京三快在线科技有限公司 | Entity identification method, device, equipment and storage medium |
CN113868419A (en) * | 2021-09-29 | 2021-12-31 | 中国平安财产保险股份有限公司 | Text classification method, device, equipment and medium based on artificial intelligence |
CN113868419B (en) * | 2021-09-29 | 2024-05-31 | 中国平安财产保险股份有限公司 | Text classification method, device, equipment and medium based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN111695345B (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111695345A (en) | Method and device for recognizing entity in text | |
US11308937B2 (en) | Method and apparatus for identifying key phrase in audio, device and medium | |
CN110807332A (en) | Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium | |
CN110795552A (en) | Training sample generation method and device, electronic equipment and storage medium | |
CN115952272B (en) | Method, device and equipment for generating dialogue information and readable storage medium | |
CN111739520B (en) | Speech recognition model training method, speech recognition method and device | |
CN110619051A (en) | Question and sentence classification method and device, electronic equipment and storage medium | |
CN111353035B (en) | Man-machine conversation method and device, readable storage medium and electronic equipment | |
CN116738233A (en) | Method, device, equipment and storage medium for training model online | |
CN107908743B (en) | Artificial intelligence application construction method and device | |
CN111399629A (en) | Operation guiding method of terminal equipment, terminal equipment and storage medium | |
CN111401034B (en) | Semantic analysis method, semantic analysis device and terminal for text | |
CN110795547A (en) | Text recognition method and related product | |
CN115394321A (en) | Audio emotion recognition method, device, equipment, storage medium and product | |
WO2022160445A1 (en) | Semantic understanding method, apparatus and device, and storage medium | |
CN115148212A (en) | Voice interaction method, intelligent device and system | |
CN116522905B (en) | Text error correction method, apparatus, device, readable storage medium, and program product | |
CN114792092B (en) | Text theme extraction method and device based on semantic enhancement | |
CN115115432B (en) | Product information recommendation method and device based on artificial intelligence | |
CN113763947B (en) | Voice intention recognition method and device, electronic equipment and storage medium | |
CN112017647B (en) | Semantic-combined voice recognition method, device and system | |
CN112632962B (en) | Method and device for realizing natural language understanding in man-machine interaction system | |
CN111489742B (en) | Acoustic model training method, voice recognition device and electronic equipment | |
CN114239601A (en) | Statement processing method and device and electronic equipment | |
CN113705163A (en) | Entity extraction method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |