CN107291700A - Entity word recognition method and device - Google Patents
- Publication number
- CN107291700A (application CN201710580718.6A)
- Authority
- CN
- China
- Prior art keywords
- entity word
- dictionary
- instance
- field
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an entity word recognition method comprising the steps of: collecting structured data, and generating corpora for several fields after preliminarily filtering and simplifying the structured data; training on the corpus of each field to generate a first entity dictionary for the corresponding field; and verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary. The method effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art: new words are mined without manual collection, which reduces labor cost, and entity words can be recognized automatically and the dictionary updated.
Description
Technical field
The present invention relates to the computer field, and in particular to an entity word recognition method and device.
Background art
With the rapid development of science, technology and the Internet, computer and network technologies have penetrated every aspect of people's work and life. People increasingly use computers to obtain the information they need, for example through information retrieval queries, computer-aided translation and automatic question answering. The database of a computer server stores a number of entity words, such as product names, models, company names and brand names. If a sentence entered by a user through a client contains an entity word in the database, the corresponding result, such as a translation result, a question-answering result or a retrieval result, can be looked up directly in the server's database and fed back to the client. In this way, for an existing entity word the server can return the corresponding result to the client quickly, which improves the response speed of the system. This approach also ensures the accuracy of the returned data and the effectiveness of data transmission, and prevents the user from repeatedly sending retrieval, translation and similar requests through the client, which reduces the amount of data the server transmits to the client.
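The dictionary lookup described in this paragraph can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of any particular server: the `ENTITY_RESULTS` table, its entries and the `lookup` function are hypothetical, and substring matching stands in for whatever matching a real system would use.

```python
from typing import Dict

# Hypothetical server-side table: entity word -> precomputed result.
ENTITY_RESULTS: Dict[str, str] = {
    "ThinkPad": "laptop brand",
    "Lenovo": "company name",
}

def lookup(sentence: str, table: Dict[str, str] = ENTITY_RESULTS) -> Dict[str, str]:
    """Return the stored result for every known entity word found in the sentence."""
    return {word: result for word, result in table.items() if word in sentence}
```

A sentence that contains no stored entity word yields an empty result, which is the case the following paragraph is concerned with.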
The entity words in a typical server database are mostly obtained by manual collection. As technology develops, and particularly in certain specialized fields, new entity words appear continually, and manual collection often cannot update the entity words in the database in time. When a user sends a retrieval, translation or similar request to the server through the client, the server then cannot respond quickly and accurately, which lowers the response speed. When users cannot obtain an accurate or desired result, they tend to keep sending new requests, which increases both the load on the server and the amount of data it transmits. In addition, mining new entity words by manual collection requires a large amount of work and increases labor cost.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an entity word recognition method and device that can effectively solve the problems of low efficiency and high cost of entity word recognition in the prior art.
To achieve the above object, an embodiment of the present invention provides an entity word recognition method, comprising the steps of:
collecting structured data, and generating corpora for several fields after preliminarily filtering and simplifying the structured data;
training on the corpus of each field to generate a first entity dictionary for the corresponding field;
verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
Compared with the prior art, the entity word recognition method disclosed by the present invention collects structured data and generates corpora for several fields after preliminarily filtering and simplifying the structured data; trains on the corpus of each field to generate a first entity dictionary for the corresponding field; verifies the first entity dictionary of each field against a large number of articles to generate a second entity dictionary; and recognizes entity words according to the second entity dictionary. This effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized automatically and the dictionary to be updated.
As an improvement of the above scheme, the categories of the entity words include person names, place names, companies and brands.
As an improvement of the above scheme, recognizing an entity word includes identifying the category, the weight and the field of the entity word.
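One way to picture the result of such a recognition is a small record holding the word together with its category, weight and field. The sketch below is an assumption made for illustration only: the class name `RecognizedEntity`, the attribute names, the English category labels and the weight scale are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

# The four categories named in the text; the English labels are illustrative.
CATEGORIES = {"person", "place", "company", "brand"}

@dataclass(frozen=True)
class RecognizedEntity:
    word: str       # surface form of the entity word
    category: str   # one of CATEGORIES
    weight: float   # relative importance; the scale is an assumption
    domain: str     # field the word belongs to, e.g. "electronics"

    def __post_init__(self) -> None:
        # Reject categories outside the four listed above.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
```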
As an improvement of the above scheme, recognizing entity words according to the second entity dictionary is specifically:
recognizing the entity words by linear mapping technology according to the second entity dictionary.
As an improvement of the above scheme, generating corpora for several fields after preliminarily filtering and simplifying the structured data is specifically:
generating corpora for several fields after preliminarily filtering and simplifying the structured data by big-data ETL technology.
As an improvement of the above scheme, verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary is specifically:
according to the first entity dictionary of each field, training on the co-occurrence rate between entity words over a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
As an improvement of the above scheme, the method further comprises, after recognizing entity words according to the second entity dictionary, the step of:
performing secondary verification on the recognized entity words by a part-of-speech semantic engine.
An embodiment of the present invention further provides an entity word recognition device, comprising:
a collection module, configured to collect structured data and to generate corpora for several fields after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module, configured to train on the corpus of each field and generate a first entity dictionary for the corresponding field;
a recognition module, configured to verify the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and to recognize entity words according to the second entity dictionary.
Compared with the prior art, in the entity word recognition device disclosed by the present invention, the collection module collects structured data and generates corpora for several fields after preliminarily filtering and simplifying the structured data; the first entity dictionary generation module then trains on the corpus of each field and generates a first entity dictionary for the corresponding field; the second entity dictionary generation module then verifies the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and entity words are recognized according to the second entity dictionary. This effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized automatically and the dictionary to be updated.
As an improvement of the above scheme, the categories of the entity words include person names, place names, companies and brands.
As an improvement of the above scheme, recognizing an entity word includes identifying the category, the weight and the field of the entity word.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an entity word recognition method provided by Embodiment 1 of the present invention.
Fig. 2 is a schematic flowchart of an entity word recognition method provided by Embodiment 2 of the present invention.
Fig. 3 is a schematic structural diagram of an entity word recognition device provided by Embodiment 3 of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, which is a schematic flowchart of an entity word recognition method provided by Embodiment 1 of the present invention, the method comprises the steps of:
S1, collecting structured data, and generating corpora for several fields after preliminarily filtering and simplifying the structured data;
S2, training on the corpus of each field to generate a first entity dictionary for the corresponding field;
S3, verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
In step S3, recognizing an entity word includes identifying the category, the weight and the field of the entity word.
In a specific implementation, structured data is collected, and corpora for several fields are generated after the structured data is preliminarily filtered and simplified; the corpus of each field is trained on to generate a first entity dictionary for the corresponding field; the first entity dictionary of each field is verified against a large number of articles to generate a second entity dictionary, and entity words are recognized according to the second entity dictionary. This effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art: new words are mined without manual collection, which reduces labor cost, and entity words can be recognized automatically and the dictionary updated.
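The flow of steps S1 to S3 can be sketched end to end as below. The patent does not specify the training or verification algorithms at this point, so every function here is a deliberately simple stand-in chosen for illustration: the function names, the `field` and `text` record keys, the capitalised-token heuristic and the count thresholds are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

def build_corpora(records: Iterable[dict]) -> Dict[str, List[str]]:
    """S1: drop incomplete records, then group the remaining text by field."""
    corpora: Dict[str, List[str]] = {}
    for rec in records:
        text = (rec.get("text") or "").strip()
        domain = rec.get("field")
        if text and domain:
            corpora.setdefault(domain, []).append(text)
    return corpora

def first_dictionary(corpus: List[str], min_count: int = 2) -> Set[str]:
    """S2 stand-in: keep capitalised tokens seen at least min_count times."""
    counts = Counter(tok for line in corpus for tok in line.split() if tok[:1].isupper())
    return {tok for tok, n in counts.items() if n >= min_count}

def second_dictionary(candidates: Set[str], articles: List[str], min_articles: int = 2) -> Set[str]:
    """S3 stand-in: keep candidates that appear in at least min_articles articles."""
    return {w for w in candidates if sum(w in a for a in articles) >= min_articles}

def recognize(text: str, dictionary: Set[str]) -> List[str]:
    """Return the dictionary words found in the text, sorted alphabetically."""
    return sorted(w for w in dictionary if w in text)
```

The point of the sketch is the data flow: each field gets its own candidate dictionary, and only candidates confirmed by independent articles reach the dictionary used for recognition.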
It should be understood that the categories of the entity words include person names, place names, companies and brands.
Preferably, in step S3, recognizing entity words according to the second entity dictionary is specifically:
recognizing the entity words by linear mapping technology according to the second entity dictionary.
Each attribute class of an entry has a corresponding self-updating dictionary, so the attribute class of an entry to be recognized can be determined by matching analysis of its correlation with the words in the dictionary (association rules and similarity discrimination). Recognition by linear mapping technology therefore reduces the dependence on the dictionary and gives better recognition of newly emerging words.
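The text does not define "linear mapping technology" further. One possible reading, offered here purely as an assumption, is that a word is turned into a feature vector and a linear map (one weight vector per category) scores it against each category dictionary, so that unseen words can still be classified by similarity. In the sketch below, the character-bigram features and the names `bigrams`, `build_weights` and `classify` are hypothetical choices, not the patent's method.

```python
from collections import Counter
from typing import Dict, List

def bigrams(word: str) -> Counter:
    """Character-bigram counts, used here as a simple feature vector."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def build_weights(dictionaries: Dict[str, List[str]]) -> Dict[str, Counter]:
    """One weight vector per category: the summed bigram counts of its dictionary words."""
    weights: Dict[str, Counter] = {}
    for category, words in dictionaries.items():
        total: Counter = Counter()
        for w in words:
            total.update(bigrams(w))
        weights[category] = total
    return weights

def classify(word: str, weights: Dict[str, Counter]) -> str:
    """Score the word against every category with a dot product and return the best one."""
    x = bigrams(word)
    scores = {cat: sum(x[g] * wv[g] for g in x) for cat, wv in weights.items()}
    return max(scores, key=scores.get)
```

Because the score depends on shared features rather than exact dictionary membership, a word absent from every dictionary still receives a category, which matches the stated goal of handling newly emerging words.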
Preferably, in step S1, generating corpora for several fields after preliminarily filtering and simplifying the structured data is specifically:
generating corpora for several fields after preliminarily filtering and simplifying the structured data by big-data ETL technology.
ETL is the abbreviation of Extract-Transform-Load and describes the process of extracting data from a source, transforming it and loading it into a destination. ETL is an important part of building a data warehouse: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
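The three ETL stages can be sketched as a chain of small functions. This is a minimal in-memory illustration under assumptions: the row keys `field` and `text`, the markup-stripping rule and the per-field corpus layout are hypothetical, and a real big-data ETL job would read from and write to external systems.

```python
import re
from typing import Dict, Iterable, Iterator, List

def extract(rows: Iterable[dict]) -> Iterator[dict]:
    """Extract: pull rows from a source (here, an in-memory iterable)."""
    yield from rows

def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: strip markup, collapse whitespace, drop rows lacking text or field."""
    for row in rows:
        text = re.sub(r"<[^>]+>", " ", row.get("text") or "")
        text = re.sub(r"\s+", " ", text).strip()
        if text and row.get("field"):
            yield {"field": row["field"], "text": text}

def load(rows: Iterable[dict]) -> Dict[str, List[str]]:
    """Load: group the cleaned text into one corpus per field."""
    corpora: Dict[str, List[str]] = {}
    for row in rows:
        corpora.setdefault(row["field"], []).append(row["text"])
    return corpora
```

The stages compose as `load(transform(extract(rows)))`, so filtering and simplification happen before anything is written to the per-field corpora.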
Preferably, in step S3, verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary is specifically:
according to the first entity dictionary of each field, training on the co-occurrence rate between entity words over a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
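To make the idea of a co-occurrence rate concrete, the sketch below computes, for each candidate word, the share of the articles mentioning it that also mention another candidate, and keeps the words above a threshold. It is explicitly not a conditional random field: it is a plain counting stand-in, and the function names, the definition of the rate and the threshold value are assumptions made for illustration.

```python
from typing import Dict, List, Set

def cooccurrence_rates(candidates: Set[str], articles: List[str]) -> Dict[str, float]:
    """For each candidate, the share of its articles that also mention another candidate."""
    rates: Dict[str, float] = {}
    for word in candidates:
        containing = [a for a in articles if word in a]
        if not containing:
            rates[word] = 0.0
            continue
        shared = sum(any(o in a for o in candidates if o != word) for a in containing)
        rates[word] = shared / len(containing)
    return rates

def verify(candidates: Set[str], articles: List[str], threshold: float = 0.5) -> Set[str]:
    """Keep the candidates whose co-occurrence rate reaches the threshold."""
    rates = cooccurrence_rates(candidates, articles)
    return {w for w, r in rates.items() if r >= threshold}
```

A candidate that only ever appears on its own, away from the other words of its field, is filtered out, which is the intuition behind using co-occurrence as a verification signal.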
Referring to Fig. 2, which is a schematic flowchart of an entity word recognition method provided by Embodiment 2 of the present invention, the method further comprises, on the basis of Embodiment 1, the step of:
S4, performing secondary verification on the recognized entity words by a part-of-speech semantic engine.
The secondary verification in this step is performed by identifying the part of speech and analyzing the semantics.
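The part-of-speech half of such a secondary verification might look like the sketch below, which keeps only recognized words tagged as nouns. The `POS_TAGS` table is a hypothetical stand-in for a real tagger, the noun-only rule is an assumption, and the semantic analysis mentioned in the text is not modelled here.

```python
from typing import Dict, Iterable, List

# Hypothetical part-of-speech lookup standing in for a real tagger.
POS_TAGS: Dict[str, str] = {
    "Lenovo": "NOUN",
    "Shanghai": "NOUN",
    "quickly": "ADV",
    "running": "VERB",
}

def secondary_check(recognized: Iterable[str], pos_tags: Dict[str, str] = POS_TAGS) -> List[str]:
    """Keep only the recognized words tagged as nouns; untagged words are rejected."""
    return [w for w in recognized if pos_tags.get(w) == "NOUN"]
```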
Referring to Fig. 3, which is a schematic structural diagram of an entity word recognition device provided by Embodiment 3 of the present invention, the device comprises:
a collection module 101, configured to collect structured data and to generate corpora for several fields after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module 102, configured to train on the corpus of each field and generate a first entity dictionary for the corresponding field;
a recognition module 103, configured to verify the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and to recognize entity words according to the second entity dictionary.
In a specific implementation, the collection module first collects structured data and generates corpora for several fields after preliminarily filtering and simplifying the structured data; the first entity dictionary generation module then trains on the corpus of each field and generates a first entity dictionary for the corresponding field; the second entity dictionary generation module then verifies the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and entity words are recognized according to the second entity dictionary. This effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized automatically and the dictionary to be updated.
In a preferred embodiment, the categories of the entity words include person names, place names, companies and brands.
In a preferred embodiment, the recognition of entity words by the recognition module includes identifying the category, the weight and the field of the entity word.
In summary, the embodiments of the present invention provide an entity word recognition method and device, in which structured data is collected and corpora for several fields are generated after the structured data is preliminarily filtered and simplified; the corpus of each field is trained on to generate a first entity dictionary for the corresponding field; the first entity dictionary of each field is verified against a large number of articles to generate a second entity dictionary; and entity words are recognized according to the second entity dictionary. This effectively solves the problems of low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized automatically and the dictionary to be updated.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications are also regarded as falling within the scope of protection of the present invention.
Claims (10)
1. An entity word recognition method, characterized by comprising the steps of:
collecting structured data, and generating corpora for several fields after preliminarily filtering and simplifying the structured data;
training on the corpus of each field to generate a first entity dictionary for the corresponding field;
verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
2. The entity word recognition method according to claim 1, characterized in that the categories of the entity words include person names, place names, companies and brands.
3. The entity word recognition method according to claim 1, characterized in that recognizing an entity word includes identifying the category, the weight and the field of the entity word.
4. The entity word recognition method according to claim 1, characterized in that recognizing entity words according to the second entity dictionary is specifically:
recognizing the entity words by linear mapping technology according to the second entity dictionary.
5. The entity word recognition method according to claim 1, characterized in that generating corpora for several fields after preliminarily filtering and simplifying the structured data is specifically:
generating corpora for several fields after preliminarily filtering and simplifying the structured data by big-data ETL technology.
6. The entity word recognition method according to claim 1, characterized in that verifying the first entity dictionary of each field against a large number of articles to generate a second entity dictionary is specifically:
according to the first entity dictionary of each field, training on the co-occurrence rate between entity words over a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
7. The entity word recognition method according to claim 1, characterized in that the method further comprises, after recognizing entity words according to the second entity dictionary, the step of:
performing secondary verification on the recognized entity words by a part-of-speech semantic engine.
8. An entity word recognition device, characterized by comprising:
a collection module, configured to collect structured data and to generate corpora for several fields after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module, configured to train on the corpus of each field and generate a first entity dictionary for the corresponding field;
a recognition module, configured to verify the first entity dictionary of each field against a large number of articles to generate a second entity dictionary, and to recognize entity words according to the second entity dictionary.
9. The entity word recognition device according to claim 8, characterized in that the categories of the entity words include person names, place names, companies and brands.
10. The entity word recognition device according to claim 8, characterized in that recognizing an entity word includes identifying the category, the weight and the field of the entity word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710580718.6A CN107291700A (en) | 2017-07-17 | 2017-07-17 | Entity word recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710580718.6A CN107291700A (en) | 2017-07-17 | 2017-07-17 | Entity word recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107291700A true CN107291700A (en) | 2017-10-24 |
Family
ID=60101558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710580718.6A Pending CN107291700A (en) | 2017-07-17 | 2017-07-17 | Entity word recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107291700A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090254334A1 (en) * | 2002-01-29 | 2009-10-08 | International Business Machines Corporation | Translation method, translation output method and storage medium, program, and computer used therewith |
CN101477518A (en) * | 2009-01-09 | 2009-07-08 | 昆明理工大学 | Tour field named entity recognition method based on condition random field |
CN103268339A (en) * | 2013-05-17 | 2013-08-28 | 中国科学院计算技术研究所 | Recognition method and system of named entities in microblog messages |
CN106528863A (en) * | 2016-11-29 | 2017-03-22 | 中国国防科技信息中心 | Training and technology of CRF recognizer and method for extracting attribute name relation pairs of CRF recognizer |
CN106815293A (en) * | 2016-12-08 | 2017-06-09 | 中国电子科技集团公司第三十二研究所 | System and method for constructing knowledge graph for information analysis |
CN106649272A (en) * | 2016-12-23 | 2017-05-10 | 东北大学 | Named entity recognizing method based on mixed model |
Non-Patent Citations (1)
Title |
---|
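Chen Lei: "Research on Patent Information Query Expansion Based on Semantics and Context", China Master's Theses Full-text Database (Information Science and Technology) * |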
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108108350A (en) * | 2017-11-29 | 2018-06-01 | 北京小米移动软件有限公司 | Name word recognition method and device |
CN108108350B (en) * | 2017-11-29 | 2021-09-14 | 北京小米移动软件有限公司 | Noun recognition method and device |
CN108595430A (en) * | 2018-04-26 | 2018-09-28 | 携程旅游网络技术(上海)有限公司 | Boat becomes information extracting method and system |
CN109189900A (en) * | 2018-08-03 | 2019-01-11 | 北京捷易迅信息技术有限公司 | A kind of entity abstracting method for BOT system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105468605B (en) | Entity information map generation method and device | |
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
Su et al. | Automatic detection and interpretation of nominal metaphor based on the theory of meaning | |
CN105653590B (en) | A kind of method that Chinese literature author duplication of name disambiguates | |
CN102314519B (en) | Information searching method based on public security domain knowledge ontology model | |
CN109271626A (en) | Text semantic analysis method | |
CN107609052A (en) | A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
CN107330111A (en) | The search method and device of domain body based on common version body | |
CN107291700A (en) | Entity word recognition method and device | |
CN107480197A (en) | Entity word recognition method and device | |
CN111597349B (en) | Rail transit standard entity relation automatic completion method based on artificial intelligence | |
CN106777048A (en) | Enterprise-quality credit data acquisition methods and system | |
Kwatra et al. | Extractive and abstractive summarization for hindi text using hierarchical clustering | |
Kang et al. | A short texts matching method using shallow features and deep features | |
Mohnot et al. | Hybrid approach for Part of Speech Tagger for Hindi language | |
Nguyen et al. | A vietnamese question answering system | |
Al-Qawasmeh et al. | Arabic named entity disambiguation using linked open data | |
Yao et al. | An automatic semantic extraction method for web data interchange | |
Kang et al. | An Analysis of Research Trends on Language Model Using BERTopic | |
CN109685590A (en) | A kind of system and method for intelligent medicine purchase | |
Saleh et al. | Semantic kernels for semantic parsing | |
Paşca | Acquisition of open-domain classes via intersective semantics | |
CN109543182A (en) | A kind of electric power enterprise based on solr engine takes turns interactive semantic analysis method more | |
Zhang et al. | Design and implementation of power question answering and visualization system based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171024 |