CN103942347A - Word separating method based on multi-dimensional comprehensive lexicon - Google Patents
- Publication number
- CN103942347A (application number CN201410212388.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- dictionary
- commodity
- center
- word
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/2448—Query languages for particular applications; for extensibility, e.g. user defined types
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for building a multi-dimensional comprehensive lexicon. The method includes selecting data sources and compiling usage statistics, selecting keywords according to constraint conditions, creating multi-dimensional maintenance fields for the keywords, obtaining synonyms of the original keywords and the singular forms of plural English keywords to complete the lexicon content, and formulating a center keyword recognition rule to find the center keywords contained in the original keywords. The invention also discloses a search word segmentation method based on the multi-dimensional comprehensive lexicon and a center keyword recognition method. Building the multi-dimensional comprehensive lexicon, applying semantic recognition within it, and recognizing the center keywords of commodities lay a good foundation for matching. Word segmentation accuracy is improved by combining string-matching segmentation with statistics-based and lexicon-based segmentation, and by maintaining and upgrading the lexicon through both automatic and manual means.
Description
Technical field
The present invention relates to word segmentation in search engine technology, and in particular to word segmentation and the understanding of commodity information in e-commerce search.
Background technology
With the rapid development of e-commerce, more and more suppliers offer and display large numbers of commodities on e-commerce platforms. Buyers who want to find products that meet their needs among so many commodities cannot do without an e-commerce search engine: only by searching can they locate and select products and then browse detailed product information.
Buyers who search for commodities want the results to be both comprehensive and accurate, which places higher demands on the precision and recall of search. Word segmentation is a core search technology. It is not merely the splitting of text but also involves understanding commodity information, so the accuracy of the segmentation result also affects the accuracy of the search results.
General-purpose search engines already implement segmentation of Chinese and English strings. The following techniques are commonly used at present:
The first is segmentation based on string matching, including forward matching, reverse matching, and bidirectional matching. Depending on which length is matched first, it can be further divided into maximum matching and minimum matching;
The second is segmentation based on statistical analysis. In terms of statistical models, it includes conditional random fields and hidden Markov models, among others. Formally, a word is made up of characters; if a combination of characters is popular and widespread enough, it becomes fixed as a word. Forward and reverse matching may produce different segmentation results, and other methods may produce still more. The co-occurrence frequency between the characters or words in each candidate result is then computed: the higher the co-occurrence, the tighter the association, and the more likely that candidate is the best segmentation result;
The third is segmentation based on a specific dictionary. Different domains usually have their own domain dictionaries, in which each word can be annotated with information such as part of speech and pinyin. The segmentation result is drawn from words that exist in the dictionary. Dictionary-based segmentation cannot stand alone: once the dictionary is determined, a segmentation algorithm must be chosen to work with it.
The fourth is segmentation based on language understanding. Grammar and syntax rule bases are built for each language, and the text to be segmented is then analyzed against those rules to identify its different components and, to some extent, resolve ambiguity.
Among the open-source segmenters in common use, the IK segmenter, for example, combines the string-matching and dictionary-based approaches described above. It packages its dictionary inside the jar file, so maintaining the dictionary data requires repackaging. Maintenance cost is high, the dictionary cannot be adjusted when inaccurate results are found, and the effect of segmentation is hard to observe. Rule-based segmentation, such as 2-4 gram segmentation, does not understand the text and is a form of string matching.
English is normally segmented on spaces. In the English commodity names on an e-commerce platform, however, some names are semantically atomic and cannot be split: "hair color" means hair dye, and "car cover" means a protective cover for an automobile. Such atomic phrases must be recognized in English commodity names, so the names have to be understood, and extracting atomic phrases becomes a requirement when segmenting English input strings.
In e-commerce, different search strategies are often designed to meet business needs. For example, when searching for "television set", one may want every commodity whose title contains "TV" or "tv" to be retrieved as well. In this case "TV" and "tv" can be regarded as synonyms of "television set". Segmentation is therefore not just the literal cutting of text but also involves understanding, so that users can find the products they are looking for.
In addition, ambiguous words, misspelled words, and English singular and plural forms all require an understanding of the text. Existing segmentation based on language understanding can resolve ambiguity to some extent, but its computational complexity is high and segmentation errors are hard to correct once they occur, so it struggles to meet the real-time requirements of e-commerce search.
There is therefore an urgent need for a segmentation method that reduces errors in understanding text in e-commerce search, improves search accuracy, and makes segmentation results easy to maintain.
Summary of the invention
In the scheme provided by the present invention, a comprehensive dictionary with multiple dimensions is built. The dictionary is generated in bulk by a program and can be edited and maintained manually, so accuracy is improved by operating on the dictionary. Semantic recognition is applied within the dictionary to recognize the center keyword of a commodity, which gives matching a good foundation. The invention combines string-matching segmentation with statistics-based and dictionary-based segmentation, and maintains and upgrades the dictionary through both automatic and manual means, further improving segmentation accuracy.
The technical solution adopted by the present invention is a method for building a multi-dimensional comprehensive dictionary, comprising:
Step 1: select data sources and compile usage statistics;
From the search log of the e-commerce platform, select the search keywords that users used within a period of time and deduplicate the search keywords of each user for each day. Then count the number of users of each search keyword on each day and sum these daily counts over the period to obtain the user usage of each search keyword for the period. This user usage reflects the distribution of currently popular search keywords;
Take the commodity keyword information of the e-commerce platform as a data source and deduplicate the commodity keywords of each supplier. Then count how many suppliers use a given commodity keyword when describing commodities and how many commodities use it. The more suppliers use a keyword to describe commodities, the more popular the keyword and the fiercer the competition for it; the more commodities use a keyword, the fiercer the competition among merchants selling those commodities;
Step 2: select keywords according to constraint conditions;
The usage statistics produce a large candidate set of keywords. From this candidate data, the keywords that satisfy the given constraint conditions are selected and entered into the dictionary;
Step 3: create multi-dimensional maintenance fields for the keywords;
Once the keywords to be maintained have been selected, fields to be maintained are created for them, and the dictionary is laid out in a fixed format according to the principles by which these fields are created;
Step 4: according to co-occurrence relations, obtain synonyms of the original keywords and the singular forms of plural English keywords, and complete the dictionary content;
Based on the number of co-occurrences between each keyword and the other keywords, keywords with a high co-occurrence count are selected as synonyms and as the singular forms of plural keywords;
Step 5: formulate a center keyword recognition rule and find the center keyword contained in each original keyword;
Given the characteristics of commodities sold in the e-commerce industry, in addition to building an e-commerce domain dictionary by the above process and using it during segmentation, a method for recognizing the center keyword of a commodity is proposed for the case where what is traded is a sellable article. The recognized center keyword is made part of the segmentation result, and a marker is added in front of it to distinguish it from ordinary segmentation results.
The invention also discloses a search word segmentation method based on the multi-dimensional comprehensive dictionary and a center keyword recognition method.
Beneficial effects of the present invention compared with the prior art:
1. The invention uses statistical methods to build the multi-dimensional comprehensive dictionary and provides a way to maintain the dictionary manually. From the standpoint of understanding the text, the dictionary is extended with information along multiple dimensions. For original keywords that are segmented inaccurately, the correct segmentation is supplied and submitted to the dictionary, so that more reasonable segmentation results are obtained with little investment.
2. The described method is easy to understand and highly maintainable, and the algorithm is efficient and feasible to implement. It is especially suited to e-commerce commodity search but is not limited to e-commerce search.
3. The invention addresses the high computational load, the reliance on a single method, and the poor manual maintainability of current segmentation methods.
4. The segmentation method extends well across languages and can be applied to word segmentation in other languages, including English, Japanese, and Korean.
Brief description of the drawings
Fig. 1 is a flowchart of building the multi-dimensional comprehensive dictionary of the present invention.
Fig. 2 is a flowchart of word segmentation based on the multi-dimensional comprehensive dictionary of the present invention.
Fig. 3 shows the method of recognizing the center keyword of a commodity name of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The method of this embodiment for building a multi-dimensional comprehensive dictionary for the e-commerce domain comprises:
(1) Select data sources and compile usage statistics;
Every day a large number of users look for products on the e-commerce platform by searching. From the search log, select the search keywords that users used within a period of time and deduplicate the search keywords of each user for each day. Then count the number of users of each search keyword on each day and sum these daily counts over the period to obtain the user usage of each search keyword for the period. This user usage reflects the distribution of currently popular search keywords;
For online marketing purposes, every commodity on the e-commerce platform carries commodity keyword information. Select these commodity keywords as a data source and deduplicate the commodity keywords of each supplier. Then count how many suppliers use a given commodity keyword when describing commodities and how many commodities use it. The more suppliers use a keyword to describe commodities, the more popular the keyword and the fiercer the competition for it; the more commodities use a keyword, the fiercer the competition among merchants selling those commodities.
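The two statistics described in step (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layouts `(day, user_id, keyword)` and `(supplier_id, commodity_id, keyword)` and both function names are assumptions made for the example.

```python
from collections import defaultdict

def search_keyword_usage(log):
    """log: iterable of (day, user_id, keyword) records (assumed layout)."""
    # Deduplicate so a user counts once per keyword per day.
    unique = {(day, user, keyword) for day, user, keyword in log}
    usage = defaultdict(int)
    for _day, _user, keyword in unique:
        usage[keyword] += 1  # sum of daily user counts over the period
    return dict(usage)

def commodity_keyword_usage(listings):
    """listings: iterable of (supplier_id, commodity_id, keyword) records."""
    suppliers, commodities = defaultdict(set), defaultdict(set)
    for supplier, commodity, keyword in listings:
        suppliers[keyword].add(supplier)      # sets deduplicate per supplier
        commodities[keyword].add(commodity)
    # keyword -> (number of suppliers, number of commodities)
    return {k: (len(suppliers[k]), len(commodities[k])) for k in suppliers}

# Hypothetical sample data.
log = [("d1", "u1", "tv"), ("d1", "u1", "tv"), ("d1", "u2", "tv"),
       ("d2", "u1", "tv"), ("d2", "u3", "car cover")]
listings = [("s1", "c1", "tv"), ("s1", "c2", "tv"), ("s2", "c3", "tv")]
print(search_keyword_usage(log))
print(commodity_keyword_usage(listings))
```

In the sample, user u1 searching "tv" twice on day d1 is counted once, so "tv" receives a usage of 3 over the two days.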
(2) Select keywords according to constraint conditions;
The above processing produces a large candidate set of keywords, not all of which qualify for the dictionary. The keywords that satisfy the given constraint conditions must be selected from the candidate data and entered into the dictionary. The constraint conditions include:
● When the search count of a keyword, the number of suppliers using it, and the number of commodities using it exceed certain thresholds, the keyword has greater analysis and use value and is added to the dictionary as a keyword to be maintained;
● Filter out original keywords that are obviously wrong, such as single characters, single words, and misspellings with a small search volume.
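A minimal sketch of this selection step is shown below. The threshold values, the `known_errors` set, and the choice to require all three counts to exceed their thresholds are assumptions for illustration; the text does not fix the thresholds or say whether one count exceeding its threshold would suffice.

```python
def select_keywords(stats, known_errors=frozenset(), min_searches=100,
                    min_suppliers=5, min_commodities=20):
    """stats maps keyword -> (search_users, suppliers, commodities)."""
    selected = []
    for keyword, (searches, suppliers, commodities) in stats.items():
        # Filter obvious noise: single characters and known misspellings.
        if len(keyword) == 1 or keyword in known_errors:
            continue
        # One reading of the constraint: every count must exceed its threshold.
        if (searches > min_searches and suppliers > min_suppliers
                and commodities > min_commodities):
            selected.append(keyword)
    return sorted(selected)

# Hypothetical candidate statistics.
stats = {
    "tv": (500, 40, 300),
    "x": (900, 50, 400),         # single character, filtered out
    "televison": (120, 6, 25),   # misspelling, filtered out
    "car cover": (150, 3, 10),   # too few suppliers and commodities
}
print(select_keywords(stats, known_errors={"televison"}))
```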
(3) Create multi-dimensional maintenance fields for the keywords;
Once the keywords to be maintained have been selected, fields to be maintained are created for them. The principles for creating these fields include:
● Whether the keyword is correct and, if it is wrong, what the corresponding correct keyword is;
● Whether the keyword is sellable; if so, it can serve as the core keyword of a product;
● What the core keyword of the keyword is; for example, the core keyword of "bicycle" is "vehicle" (the Chinese word for bicycle ends with the character for vehicle);
● Whether the keyword is an atomic keyword; for example, "Tieguanyin" (a tea) is an atomic keyword, and splitting it is meaningless;
● For English, what the base form (prototype) of the word is, so that its singular and plural forms can be expressed clearly;
● When the segmentation result does not meet actual needs, the correct segmentation is saved through "manual cutting".
The format of the dictionary is therefore as shown in the table.
The table lists attributes of the original keywords such as "whether correct", "correct word", "synonym", "root", "prototype", "keep original", "manual cutting", and "modification time". The "manual cutting" attribute gives the correct, manually cut result when the segmentation of an original keyword is unsatisfactory. "Modification time" is used to record differences between the data held by the segmenter and the dictionary. A "-" in the table indicates that the attribute has no value.
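One way to picture a dictionary row is as a record with one field per maintained attribute. The sketch below is an assumption about layout only: the field names are English renderings of the listed attributes, the `sellable` flag comes from the field principles rather than the attribute list, and the sample rows are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DictionaryEntry:
    keyword: str                            # the original keyword
    is_correct: bool = True                 # "whether correct"
    correct_word: Optional[str] = None      # "correct word", set when is_correct is False
    sellable: bool = False                  # can serve as a product's core keyword
    synonyms: List[str] = field(default_factory=list)   # "synonym"
    root: Optional[str] = None              # "root": core keyword of this keyword
    prototype: Optional[str] = None         # "prototype": English base form
    keep_original: bool = False             # "keep original": do not split further
    manual_cut: Optional[List[str]] = None  # "manual cutting" result
    modified_at: Optional[str] = None       # "modification time"

# Hypothetical rows; None plays the role of "-" in the table.
rows = [
    DictionaryEntry("car cover", sellable=True, root="cover", keep_original=True),
    DictionaryEntry("machines", sellable=True, prototype="machine"),
    DictionaryEntry("televison", is_correct=False, correct_word="television"),
]
entries = {row.keyword: row for row in rows}
print(entries["machines"].prototype)
```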
(4) According to co-occurrence relations, obtain synonyms of the original keywords and the base forms of plural English keywords, and complete the dictionary content;
Based on the number of co-occurrences between each keyword and the other keywords, keywords with a high co-occurrence count are selected as the source of attribute data such as synonyms, singular and plural forms, and correct words. For example, if statistics show that the commodity keywords "TV" and "television set" co-occur frequently, analysis shows that the reason is that they are synonyms. In practice a threshold can be set, and keywords whose co-occurrence frequency reaches it are treated as synonyms by default. Likewise, if xxx co-occurs frequently with xxxs or xxxes, xxx can be regarded as the base form of the plurals xxxs and xxxes. This handles what would otherwise be a labor-intensive part of initializing the dictionary. After the data is imported into the dictionary, the dictionary should preferably still be maintained manually to ensure its correctness.
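The co-occurrence step can be sketched as below. The input (one keyword list per commodity), the threshold value, and the function names are assumptions for illustration; the proposals it produces are candidates only and would still go through the manual maintenance described above.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(keyword_sets):
    """Count how often each unordered keyword pair appears on the same commodity."""
    counts = Counter()
    for keywords in keyword_sets:
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

def propose_relations(keyword_sets, threshold=2):
    """Return (synonym candidate pairs, {plural: base form}) above the threshold."""
    synonyms, prototypes = [], {}
    for (a, b), n in cooccurrence(keyword_sets).items():
        if n < threshold:
            continue
        short, long_ = sorted((a, b), key=len)
        if long_ in (short + "s", short + "es"):
            prototypes[long_] = short   # xxx is the base form of xxxs / xxxes
        else:
            synonyms.append((a, b))
    return sorted(synonyms), prototypes

# Hypothetical commodity keyword lists.
keyword_sets = [["tv", "television", "tvs"], ["tv", "television"],
                ["tv", "tvs", "led"], ["television", "led"]]
print(propose_relations(keyword_sets))
```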
The dictionary building process is shown in Fig. 1. Fig. 1 is the flowchart of building the dictionary in the present invention.
(5) Formulate a center keyword recognition rule and find the center keyword contained in each original keyword;
Given the characteristics of commodities sold in the e-commerce industry, in addition to building an e-commerce domain dictionary by the above process and using it during segmentation, a method for recognizing the center keyword of a commodity is proposed for the case where what is traded is a sellable article. The recognized center keyword is made part of the segmentation result, and a "#" marker is added in front of it to distinguish it from ordinary segmentation results. Recognition of the commodity center keyword follows these rules:
● First analyze the syntactic structure of each language to determine whether the center of a commodity name lies toward the left or the right. Analysis of Chinese and English commodity names shows that the center keyword generally appears at the right of the name.
● The dictionary built above records which keywords are sellable and which are modifier keywords. The input commodity name is scanned from right to left. Modifier keywords are skipped. When brackets are encountered (braces, square brackets, or parentheses), the text inside them is regarded as a supplementary note on the commodity and is also treated as a modifier. When a keyword is identified that is in the dictionary and is sellable, then, taking English as an example, if it is not preceded by a semantic conversion word such as for, with, without, in, or made of, that keyword is the recognized center keyword. If a semantic conversion word is present, the recognizer jumps directly to the text before the conversion word and continues, until a commodity center keyword is found or, because of the way the commodity name is worded, none can be found.
● The commodity center keyword must come from the sellable keywords in the dictionary.
After the above processing, the center keyword content of the dictionary is complete.
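The rules above can be sketched for English titles as follows. This is one reading of the rule, with several assumptions: a conversion word anywhere to the left of a candidate triggers the jump, the tokenizer is a simple regular expression, and the `sellable` and `modifiers` sets stand in for lookups in the dictionary.

```python
import re

# Semantic conversion words named in the text; "made of" is kept as one token.
CONVERSION_WORDS = {"for", "with", "without", "in", "made_of"}
BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}")

def find_center_keyword(title, sellable, modifiers=frozenset()):
    text = BRACKETED.sub(" ", title.lower())        # bracketed notes act as modifiers
    text = re.sub(r"\bmade of\b", "made_of", text)
    tokens = re.findall(r"[a-z0-9_]+", text)
    i = len(tokens) - 1
    while i >= 0:                                    # scan right to left
        token = tokens[i]
        if token in sellable and token not in modifiers:
            before = [j for j in range(i) if tokens[j] in CONVERSION_WORDS]
            if not before:
                return token                         # no conversion word to its left
            i = before[-1] - 1                       # resume before the conversion word
            continue
        i -= 1
    return None                                      # no center keyword can be found

sellable = {"paper", "machine", "toy", "baby", "bag", "leather"}
print(find_center_keyword("Paper Cutting Machine (SQZ-115CTN)", sellable))
print(find_center_keyword("toy for baby, children", sellable))
```

With these hypothetical sets, the first title yields "machine" because the scan starts from the right, and the second yields "toy" because "baby" is preceded by "for".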
A search word segmentation method based on the multi-dimensional comprehensive dictionary comprises:
(1) Split the received input string to be segmented into minimum semantic units;
After the input string to be segmented is received, it is cut according to the writing conventions of each language. The minimum semantic unit of Chinese is the Chinese character. In English, a run of consecutive letters or consecutive digits is treated as one unit, and the space acts as a separator and does not occupy a position.
The input sentence is split into minimum semantic units. The scanning direction does not matter when generating the units; they can be identified either from left to right or from right to left.
For example, take an input string to be segmented that is a Chinese product title, translated here as "smart dazzling 46-inch high-brightness LED LCD screen (900 brightness)".
Its minimum semantic units, with each Chinese character shown by an English gloss, are: "smart/dazzling/46/inch/high/bright/led/liquid/crystal/display/show/screen/(/900/bright/degree/)"
The slashes are separators, and the uppercase "LED" has been converted to lowercase. Meaningless punctuation marks can be discarded during segmentation.
(2) On the basis of the minimum semantic units, run the reverse maximum matching algorithm against the comprehensive dictionary;
Once the minimum semantic units have been generated, the reverse maximum matching algorithm is run against the comprehensive dictionary, as illustrated by the following example.
Scanning from right to left proceeds as follows. Suppose the longest original keyword in the dictionary is 5 units long. First check whether "(900 brightness)", which spans five units, exists in the dictionary. If it does, one segment has been identified. Otherwise the length is reduced by 1 and "900 brightness)" is checked, and so on, following the reverse maximum matching procedure.
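A minimal sketch of reverse maximum matching over minimum semantic units follows. The function name, the fallback of emitting a single unit when nothing matches, and the tiny dictionary are assumptions for illustration; brackets are left out of the sample units for brevity.

```python
def reverse_max_match(units, dictionary, max_len=5):
    """Segment a list of minimum semantic units, scanning right to left."""
    result = []
    end = len(units)
    while end > 0:
        # Try the longest window first, then shrink it by one unit at a time.
        for size in range(min(max_len, end), 0, -1):
            candidate = "".join(units[end - size:end])
            if size == 1 or candidate in dictionary:
                result.append(candidate)   # a single unit is kept as a fallback
                end -= size
                break
    result.reverse()
    return result

# Hypothetical dictionary: 液晶显示屏 (LCD screen), 液晶 (liquid crystal), 亮度 (brightness).
dictionary = {"液晶显示屏", "液晶", "亮度"}
units = ["液", "晶", "显", "示", "屏", "900", "亮", "度"]
print("/".join(reverse_max_match(units, dictionary)))
```

The longer entry wins over the shorter "液晶" because the five-unit window is tested before any smaller one.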
(3) Handle keywords in the segmentation that have synonyms;
When "LCD screen" has been identified, the segmentation result at that point is "LCD screen/900/brightness".
Because "LCD screen" has the synonym "liquid crystal display" in the dictionary, the current segmentation result is optimized to "LCD screen/liquid crystal display/900/brightness".
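Synonym handling amounts to emitting each synonym right after the segment it belongs to. In this sketch the `synonyms` mapping stands in for the synonym attribute of the dictionary, and the function name is an assumption.

```python
def expand_synonyms(segments, synonyms):
    """Insert the dictionary synonyms of each segment directly after it."""
    expanded = []
    for segment in segments:
        expanded.append(segment)
        expanded.extend(synonyms.get(segment, []))
    return expanded

synonyms = {"lcd screen": ["liquid crystal display"]}
print("/".join(expand_synonyms(["lcd screen", "900", "brightness"], synonyms)))
```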
(4) Handle segmentation errors;
Suppose a cutting error is found. For example, a Chinese phrase meaning "makeup and clothing" is segmented as "makeup/and/clothing", while the desired result is "makeup/kimono/dress" (in Chinese, the character for "and" followed by the first character of "clothing" spells "kimono"). In this case the phrase and its desired segmentation result are added to the dictionary together, so that the correct result can be obtained from the dictionary the next time the phrase is segmented.
The dictionary is normally maintained at the minimum semantic granularity. When an ambiguity cannot be recognized automatically, each industry configures the manual cutting result according to its own understanding of the ambiguity, which yields higher segmentation accuracy. Misspelled words are handled in a similar way. For example, a sentence meaning "I work at the steel plant", written with a misspelling of "steel", is split as "I/at/[misspelled steel]/plant/work". Because the misspelling is marked as a wrong word in the dictionary, it can be converted to "steel", giving the more correct segmentation result "I/at/steel/plant/work".
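Both corrections can be layered on top of any base segmenter. In the sketch below, `manual_cuts` and `corrections` stand in for the "manual cutting" and "correct word" attributes of the dictionary, `fallback` is whatever segmenter is in use (plain whitespace splitting here), and the English sample data is hypothetical.

```python
def segment_with_overrides(text, manual_cuts, corrections, fallback):
    """Prefer a stored manual cut; otherwise segment and fix known wrong words."""
    if text in manual_cuts:
        return list(manual_cuts[text])
    return [corrections.get(segment, segment) for segment in fallback(text)]

manual_cuts = {"red car cover": ["red", "car cover"]}
corrections = {"stell": "steel"}   # wrong word -> correct word

print(segment_with_overrides("red car cover", manual_cuts, corrections, str.split))
print(segment_with_overrides("stell plant", manual_cuts, corrections, str.split))
```

Keeping the overrides as data rather than code is what lets an operator fix a bad result by editing the dictionary.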
English has common segmentation errors as well. If "red car cover" is segmented on spaces, the result is "red/car/cover", and "car seat cover" becomes "car/seat/cover". A search for car cover then hits both commodities, because both contain car and cover. Segmenting directly on spaces is therefore bad for query matching in this case. With the method of the present invention, car cover can be added to the dictionary as a single unit. "red car cover" is then segmented as "red/car cover", and a search for car cover matches only commodities such as red car cover, giving higher search accuracy. In e-commerce search in particular, segmentation belongs to the understanding of text and is only a means; query matching and ranking are the purpose of search.
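The car cover example can be reproduced by applying reverse maximum matching to whole words instead of characters. The function names, the three-word window, and the all-terms matching rule are assumptions for illustration.

```python
def segment_english(title, atomic_phrases, max_words=3):
    """Reverse maximum matching over space-separated words."""
    words = title.lower().split()
    result, end = [], len(words)
    while end > 0:
        for size in range(min(max_words, end), 0, -1):
            phrase = " ".join(words[end - size:end])
            if size == 1 or phrase in atomic_phrases:
                result.append(phrase)
                end -= size
                break
    return result[::-1]

def matches(query, title, atomic_phrases):
    """A title matches when it contains every segment of the query."""
    title_segments = segment_english(title, atomic_phrases)
    return all(s in title_segments for s in segment_english(query, atomic_phrases))

atomic = {"car cover"}
print(segment_english("red car cover", atomic))
print(segment_english("car seat cover", atomic))
print(matches("car cover", "red car cover", atomic),
      matches("car cover", "car seat cover", atomic))
```

Without the atomic phrase in the dictionary, both titles would match the query, which is the false hit described above.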
Once a reasonably correct segmentation result has been obtained, the needs of e-commerce information matching make it desirable to also recognize the center keyword of the commodity information during segmentation and to include it as an important part of the segmentation result.
Fig. 2 is a flowchart of word segmentation based on the multi-dimensional comprehensive dictionary in the present invention.
For example, the segmentation result of "Paper Cutting Machine (SQZ-115CTN)" is "paper/cutting/machine/sqz/115/ctn". What this commodity sells is a "machine", not "paper", so the center keyword recognition rule identifies "machine" as the center keyword, and the segmentation result is further processed into "paper/cutting/machine/#machine/sqz/115/ctn". Here "#machine" indicates that the center keyword of the commodity is "machine"; all the other keywords modify it. When matching uses the center keyword in the segmentation result, a search for "paper" will not rank commodities whose center keyword is "machine" near the top.
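Adding the marker is a small post-processing step on the segmentation result. The sketch below assumes the center keyword has already been recognized and places the "#" entry right after its rightmost occurrence; the helper names are hypothetical.

```python
def mark_center(segments, center):
    """Insert '#<center>' after the rightmost occurrence of the center keyword."""
    if center is None or center not in segments:
        return list(segments)
    idx = len(segments) - 1 - segments[::-1].index(center)
    return segments[:idx + 1] + ["#" + center] + segments[idx + 1:]

def center_of(marked):
    """Read the center keyword back from a marked segmentation result."""
    for segment in marked:
        if segment.startswith("#"):
            return segment[1:]
    return None

marked = mark_center(["paper", "cutting", "machine", "sqz", "115", "ctn"], "machine")
print("/".join(marked))
# A ranking stage could then check whether the query equals the center keyword.
print(center_of(marked) == "paper")
```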
Another case is when semantic conversion words such as for and with are encountered, as in the following example.
For a commodity named "toy for baby, children", what is sold cannot be "baby" but "toy". Because of the word "for", the center keyword recognition rule places the center keyword before "for", so "toy" is taken as the center keyword and becomes part of the segmentation. The segmentation result of "toy for baby, children" is thus "toy/#toy/for/baby/children/".
A keyword found in this way is checked against the dictionary. If the dictionary has no corresponding center keyword, the keyword to its left is selected and matched against the dictionary, and so on until the first keyword is reached.
Fig. 3 shows the method of recognizing the center keyword of a commodity name in the present invention.
The above embodiments describe only some of the functions of the present invention; the embodiments and drawings are not intended to limit it. Any equivalent change or modification made without departing from the spirit and scope of the invention likewise falls within its scope of protection. The scope of protection of the present invention is therefore defined by the claims of this application.
Claims (10)
1. A method for building a multi-dimensional comprehensive dictionary, characterized by comprising:
Step 1: select data sources and compile usage statistics;
From the search log of the e-commerce platform, select the search keywords that users used within a period of time and deduplicate the search keywords of each user for each day; then count the number of users of each search keyword on each day and sum these daily counts over the period to obtain the user usage of each search keyword for the period, this user usage reflecting the distribution of currently popular search keywords;
Take the commodity keyword information of the e-commerce platform as a data source and deduplicate the commodity keywords of each supplier; then count how many suppliers use a given commodity keyword when describing commodities and how many commodities use it; the more suppliers use a keyword to describe commodities, the more popular the keyword and the fiercer the competition for it; the more commodities use a keyword, the fiercer the competition among merchants selling those commodities;
Step 2: select keywords according to constraint conditions;
The usage statistics produce a large candidate set of keywords; from this candidate data, the keywords that satisfy the given constraint conditions are selected and entered into the dictionary;
Step 3: create multi-dimensional maintenance fields for the keywords;
Once the keywords to be maintained have been selected, fields to be maintained are created for them, and the dictionary is laid out in a fixed format according to the principles by which these fields are created;
Step 4: according to co-occurrence relations, obtain synonyms of the original keywords and the singular forms of plural English keywords, and complete the dictionary content;
Based on the number of co-occurrences between each keyword and the other keywords, keywords with a high co-occurrence count are selected as synonyms and as the singular forms of plural keywords;
Step 5: formulate a center keyword recognition rule and find the center keyword contained in each original keyword;
Given the characteristics of commodities sold in the e-commerce industry, in addition to building an e-commerce domain dictionary by the above process and using it during segmentation, a method for recognizing the center keyword of a commodity is proposed for the case where what is traded is a sellable article; the recognized center keyword is made part of the segmentation result, and a marker is added in front of it to distinguish it from ordinary segmentation results.
2. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in step 2, the constraint conditions comprise:
● when the search volume of a keyword, the number of suppliers using the keyword, and the number of commodities using the keyword exceed certain thresholds, the keyword has greater analysis and use value and is added to the dictionary as a keyword to be maintained;
● original keywords that are obviously erroneous are filtered out.
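A minimal sketch of these constraint conditions follows. It assumes that all three counts must exceed their thresholds (the claim could also be read as any one of them), and it takes the test for "obviously erroneous" keywords as a caller-supplied predicate, since the claim does not say how such keywords are detected. All names and threshold values are hypothetical.

```python
def select_keywords(stats, min_searches, min_suppliers, min_commodities,
                    is_obviously_wrong=lambda keyword: False):
    """Return the candidate keywords that pass the constraint conditions.

    `stats` maps keyword -> {"searches", "suppliers", "commodities"}.
    Assumption: every count must exceed its threshold.
    """
    selected = []
    for keyword, counts in stats.items():
        if is_obviously_wrong(keyword):
            continue  # filter out obviously erroneous original keywords
        if (counts["searches"] > min_searches
                and counts["suppliers"] > min_suppliers
                and counts["commodities"] > min_commodities):
            selected.append(keyword)
    return sorted(selected)


candidate_stats = {
    "led lamp": {"searches": 500, "suppliers": 40, "commodities": 300},
    "ledd lamp": {"searches": 900, "suppliers": 80, "commodities": 700},
    "rare widget": {"searches": 3, "suppliers": 1, "commodities": 2},
}
to_maintain = select_keywords(
    candidate_stats, 100, 10, 50,
    is_obviously_wrong=lambda keyword: keyword == "ledd lamp",
)
```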
3. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in step 3, the principles for creating the fields comprise:
● whether the keyword is correct and, if it is wrong, what the corresponding correct keyword is;
● whether the keyword is saleable; if it is saleable, it can serve as the core keyword of a product;
● what the core keyword of the keyword is;
● whether the keyword is an atomic keyword;
● for English, what the base form of the word is;
● where a segmentation result does not meet actual needs, the correct segmentation must be saved through "manual segmentation".
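One way to picture a dictionary entry carrying these maintenance fields is a small record type. The sketch below is illustrative only: the class name, field names, and default values are assumptions, and the patent does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One keyword with its multi-dimensional maintenance fields."""
    keyword: str
    is_correct: bool = True                  # is the keyword spelled correctly?
    corrected_keyword: Optional[str] = None  # correct form when is_correct is False
    is_saleable: bool = False                # can it name a product that is sold?
    core_keyword: Optional[str] = None       # core keyword contained in this keyword
    is_atomic: bool = True                   # can it not be split any further?
    base_form: Optional[str] = None          # English base form, e.g. singular
    manual_segmentation: List[str] = field(default_factory=list)


entry = DictionaryEntry(
    keyword="led lamps",
    is_saleable=True,
    core_keyword="lamp",
    is_atomic=False,
    base_form="led lamp",
    manual_segmentation=["led", "lamps"],
)
```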
4. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in step 4, a threshold is set, and keywords whose co-occurrence frequency reaches that threshold are treated by default as synonyms and as the base forms of plural English keywords.
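The thresholding on co-occurrence can be sketched as follows. The sketch assumes that two keywords "co-occur" when they appear in the keyword list of the same commodity; the claim does not define the co-occurrence window, and the function names and the threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_lists):
    """Count how often each unordered keyword pair shares a keyword list."""
    counts = Counter()
    for keywords in keyword_lists:
        # sorted(set(...)) gives each pair one canonical (a, b) order.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


def synonym_candidates(counts, threshold):
    """Treat pairs whose co-occurrence count reaches `threshold` as synonyms."""
    synonyms = {}
    for (a, b), n in counts.items():
        if n >= threshold:
            synonyms.setdefault(a, set()).add(b)
            synonyms.setdefault(b, set()).add(a)
    return synonyms


keyword_lists = [
    ["cell phone", "mobile phone", "case"],
    ["cell phone", "mobile phone"],
    ["mobile phone", "cell phone", "charger"],
]
synonyms = synonym_candidates(cooccurrence_counts(keyword_lists), threshold=3)
```

The same thresholding could be applied to pair a plural English keyword with its singular form; candidates produced this way are defaults that the maintained dictionary can still override.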
5. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in step 5, the center keyword recognition rules are specifically:
● first analyze the syntactic structure of each language to determine whether the center of a commodity name lies to the left or to the right; analysis of Chinese and English commodity names shows that the center keyword of a Chinese or English commodity generally appears at the right of the commodity name;
● the constructed dictionary records which keywords are saleable and which are modifier keywords. The input commodity name is scanned from right to left. A modifier keyword is skipped directly. When brackets are encountered, the information inside them is regarded as a supplementary description of the commodity itself and is treated as a modifier keyword. When a keyword is recognized that is in the dictionary and is saleable, then, if there is no semantic conversion word, that keyword is the recognized center keyword; if there is a semantic conversion word, the device for recognizing the center keyword of the commodity name jumps directly to the position before the semantic conversion word and continues recognition, until a commodity center keyword is found or, because of the way the commodity name is worded, none can be found;
● the commodity center keyword must come from the saleable keywords in the dictionary.
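The right-to-left scan described in these rules can be sketched in Python as below. The sketch assumes the commodity name is already segmented into tokens, that brackets arrive as separate tokens, and that a semantic conversion word (for example "with" or "for" in English) applies whenever it occurs somewhere to the left of the saleable keyword just found. The function names and the sample word sets are hypothetical.

```python
OPEN_BRACKETS = {"(", "（", "[", "【"}
CLOSE_BRACKETS = {")", "）", "]", "】"}


def strip_brackets(tokens):
    """Drop bracketed spans, which are treated as modifier information."""
    kept, depth = [], 0
    for token in tokens:
        if token in OPEN_BRACKETS:
            depth += 1
        elif token in CLOSE_BRACKETS:
            depth = max(0, depth - 1)
        elif depth == 0:
            kept.append(token)
    return kept


def find_center_keyword(tokens, saleable, modifiers, conversion_words):
    """Scan right to left and return the center keyword, or None."""
    tokens = strip_brackets(tokens)
    i = len(tokens) - 1
    while i >= 0:
        token = tokens[i]
        if token in modifiers or token not in saleable:
            i -= 1  # skip modifiers and anything that is not saleable
            continue
        # Nearest semantic conversion word to the left of the candidate.
        j = next((k for k in range(i - 1, -1, -1)
                  if tokens[k] in conversion_words), None)
        if j is None:
            return token  # saleable and no conversion word: center keyword
        i = j - 1         # jump to just before the conversion word
    return None


saleable = {"charger", "cable", "wallet", "box"}
modifiers = {"usb", "red", "leather", "gift"}
conversion_words = {"with", "for"}
center = find_center_keyword(
    ["usb", "charger", "with", "cable"], saleable, modifiers, conversion_words
)
```

In "usb charger with cable", the scan first meets the saleable word "cable", sees the conversion word "with" to its left, jumps before it, and settles on "charger".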
6. A search word segmentation method based on the multi-dimensional comprehensive dictionary, characterized by comprising:
Step 1: split the received input string to be segmented into minimum semantic units;
Step 2: on the basis of the minimum semantic units, apply the reverse maximum matching algorithm in combination with the comprehensive dictionary;
Step 3: process the keywords in the segmentation that have synonyms;
Step 4: handle errors in the segmentation result.
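Step 2 names reverse maximum matching, a standard dictionary-based segmentation technique. The sketch below shows the textbook form of the algorithm operating on a list of minimum semantic units. The `max_len` window and the `joiner` parameter (empty for Chinese characters, a space for English words) are assumptions added for illustration, not parameters given in the claim.

```python
def reverse_maximum_match(units, dictionary, max_len=6, joiner=""):
    """Segment `units` by scanning from the right end.

    At each position the longest run of units (up to `max_len`) that
    forms a dictionary word is taken; a single unit is the fallback.
    """
    result = []
    end = len(units)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = joiner.join(units[end - size:end])
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                end -= size
                break
    result.reverse()  # words were collected right to left
    return result


chinese_words = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
segments = reverse_maximum_match(list("南京市长江大桥"), chinese_words)
```

Scanning from the right matches "长江大桥" first and then "南京市", which avoids the spurious reading "市长" that a left-to-right scan with short matches could produce.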
7. The search word segmentation method based on the multi-dimensional comprehensive dictionary according to claim 6, characterized in that, in step 1:
After the input string to be segmented is received, it is split according to the punctuation conventions of each language; the minimum semantic unit of Chinese is a single Chinese character, whereas in English a run of consecutive letters or consecutive digits is treated as a whole, and a space serves as a separator and does not occupy a position;
The input sentence is split into minimum semantic units; the scanning direction does not matter when the minimum semantic units are generated, that is, they can be identified either from left to right or from right to left.
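The splitting into minimum semantic units can be sketched with a single regular expression. The sketch assumes ASCII letters and digits for English and the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) for Chinese, and it simply drops spaces and punctuation; the function name is hypothetical.

```python
import re

# A run of letters, a run of digits, or one Chinese character.
_UNIT = re.compile(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]")


def minimum_semantic_units(text):
    """Split `text` into minimum semantic units.

    Spaces and punctuation match none of the alternatives, so they act
    as separators and do not appear in the output.
    """
    return _UNIT.findall(text)


units = minimum_semantic_units("苹果 iPhone5 手机壳")
```

Because each unit is defined purely by character class, the same units result whichever direction the string is scanned in.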
8. The search word segmentation method based on the multi-dimensional comprehensive dictionary according to claim 6, characterized in that, in step 4:
Because the multi-dimensional comprehensive dictionary is maintained at minimum semantic granularity, when an ambiguity cannot be resolved, a manually specified segmentation result is configured, thereby achieving higher segmentation accuracy;
Once a reasonably correct segmentation result has been obtained, the recognized center keyword of the commodity information is also included as an important component of the segmentation result during segmentation.
9. A center keyword recognition method based on the multi-dimensional comprehensive dictionary, characterized in that:
A segmentation result is obtained by the search word segmentation method based on the multi-dimensional comprehensive dictionary according to any one of claims 6 to 8, and the center keyword of the commodity is recognized based on defined center keyword recognition rules. The center keyword is then extracted from the segmentation result and matched against the dictionary: if the multi-dimensional comprehensive dictionary contains no corresponding center keyword, the keyword to its left is selected and matched against the dictionary, and so on until the first keyword is reached.
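A minimal sketch of this leftward fallback follows, combined with the marking of the center keyword in the segmentation result. It assumes the matching starts at the rightmost segment and that the marked center keyword is appended as an extra element; the mark string `"*"` and the function name are hypothetical, since the patent does not name the mark.

```python
CENTER_MARK = "*"  # hypothetical mark; the patent does not specify one


def tag_center_keyword(segments, center_keywords, mark=CENTER_MARK):
    """Append the marked center keyword to a copy of `segments`.

    Segments are tried from right to left; the first one found among
    `center_keywords` wins. If none matches, the segments are returned
    unchanged.
    """
    for i in range(len(segments) - 1, -1, -1):
        if segments[i] in center_keywords:
            return list(segments) + [mark + segments[i]]
    return list(segments)


tagged = tag_center_keyword(["led", "desk lamp", "2014"], {"desk lamp", "lamp"})
```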
10. The center keyword recognition method based on the multi-dimensional comprehensive dictionary according to claim 9, characterized in that:
The multi-dimensional comprehensive dictionary is built by the construction method based on the multi-dimensional comprehensive dictionary according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410212388.1A CN103942347B (en) | 2014-05-19 | 2014-05-19 | A kind of segmenting method based on various dimensions synthesis dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103942347A (en) | 2014-07-23 |
CN103942347B (en) | 2017-04-05 |
Family
ID=51190015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410212388.1A Active CN103942347B (en) | 2014-05-19 | 2014-05-19 | A kind of segmenting method based on various dimensions synthesis dictionary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103942347B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079066A (en) * | 2007-06-29 | 2007-11-28 | 深圳市中科新业信息科技发展有限公司 | Network data analysis method and system in network auditing |
CN101246492A (en) * | 2008-02-26 | 2008-08-20 | 华中科技大学 | Full text retrieval system based on natural language |
CN102479191A (en) * | 2010-11-22 | 2012-05-30 | 阿里巴巴集团控股有限公司 | Method and device for providing multi-granularity word segmentation result |
CN102929937A (en) * | 2012-09-28 | 2013-02-13 | 福州博远无线网络科技有限公司 | Text-subject-model-based data processing method for commodity classification |
CN102955857A (en) * | 2012-11-09 | 2013-03-06 | 北京航空航天大学 | Class center compression transformation-based text clustering method in search engine |
CN102999534A (en) * | 2011-09-19 | 2013-03-27 | 北京金和软件股份有限公司 | Chinese word segmentation algorithm based on reverse maximum matching |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104408173A (en) * | 2014-12-11 | 2015-03-11 | 焦点科技股份有限公司 | Method for automatically extracting kernel keyword based on B2B platform |
CN104408173B (en) * | 2014-12-11 | 2016-12-07 | 焦点科技股份有限公司 | A kind of kernel keyword extraction method based on B2B platform |
CN105653517A (en) * | 2015-11-05 | 2016-06-08 | 乐视致新电子科技(天津)有限公司 | Recognition rate determining method and apparatus |
CN107784019A (en) * | 2016-08-30 | 2018-03-09 | 苏宁云商集团股份有限公司 | Word treatment method and system are searched in a kind of searching service |
CN106570058A (en) * | 2016-09-29 | 2017-04-19 | 山东浪潮商用系统有限公司 | Searching method and search engine |
CN107943786B (en) * | 2017-11-16 | 2021-12-07 | 广州市万隆证券咨询顾问有限公司 | Chinese named entity recognition method and system |
CN107943786A (en) * | 2017-11-16 | 2018-04-20 | 广州市万隆证券咨询顾问有限公司 | A kind of Chinese name entity recognition method and system |
CN108304484A (en) * | 2017-12-29 | 2018-07-20 | 北京城市网邻信息技术有限公司 | Key word matching method and device, electronic equipment and readable storage medium storing program for executing |
WO2019136855A1 (en) * | 2018-01-12 | 2019-07-18 | 平安科技(深圳)有限公司 | Method and apparatus for implementing multidimensional analysis on insurance policy, terminal device, and storage medium |
CN108491518A (en) * | 2018-03-26 | 2018-09-04 | 广州虎牙信息科技有限公司 | Audit method, apparatus, electronic equipment and the storage medium of text |
CN108491518B (en) * | 2018-03-26 | 2021-02-26 | 广州虎牙信息科技有限公司 | Method and device for auditing text, electronic equipment and storage medium |
CN112364153A (en) * | 2020-11-10 | 2021-02-12 | 中数通信息有限公司 | Keyword identification method and device based on interference characteristics |
CN113032683A (en) * | 2021-04-28 | 2021-06-25 | 玉米社(深圳)网络科技有限公司 | Method for quickly segmenting words in network popularization |
CN113032683B (en) * | 2021-04-28 | 2021-12-24 | 玉米社(深圳)网络科技有限公司 | Method for quickly segmenting words in network popularization |
CN113793193A (en) * | 2021-08-13 | 2021-12-14 | 唯品会(广州)软件有限公司 | Data search accuracy verification method, device, equipment and computer readable medium |
CN113793193B (en) * | 2021-08-13 | 2024-02-02 | 唯品会(广州)软件有限公司 | Data search accuracy verification method, device, equipment and computer readable medium |
CN117953875A (en) * | 2024-03-27 | 2024-04-30 | 成都启英泰伦科技有限公司 | Offline voice command word storage method based on semantic understanding |
Also Published As
Publication number | Publication date |
---|---|
CN103942347B (en) | 2017-04-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||