CN103942347A - Word separating method based on multi-dimensional comprehensive lexicon - Google Patents

Word separating method based on multi-dimensional comprehensive lexicon

Info

Publication number
CN103942347A
Authority
CN
China
Prior art keywords
keyword
dictionary
commodity
center
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410212388.1A
Other languages
Chinese (zh)
Other versions
CN103942347B (en)
Inventor
李仁勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN201410212388.1A
Publication of CN103942347A
Application granted
Publication of CN103942347B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/2448 Query languages for particular applications; for extensibility, e.g. user defined types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for building a multi-dimensional comprehensive lexicon. The method comprises selecting data sources, compiling usage statistics, selecting keywords according to constraints, creating multi-dimensional maintenance fields for the keywords, obtaining synonyms of the original keywords and the singular forms of plural English keywords to complete the lexicon content, formulating center keyword recognition rules, and finding the center keywords contained in the original keywords. The invention also discloses a search word segmentation method and a center keyword recognition method based on the multi-dimensional comprehensive lexicon. Building the lexicon, applying semantic recognition within it, and recognizing the center keywords of products lays a sound basis for matching. Combining string-matching segmentation with statistics-based and lexicon-based segmentation, and maintaining and upgrading the lexicon through both automatic and manual means, improves segmentation accuracy.

Description

A word segmentation method based on a multi-dimensional comprehensive dictionary
Technical field
The present invention relates to word segmentation in search engine technology, and in particular to word segmentation methods and techniques for understanding product information in e-commerce search.
Background art
With the rapid development of e-commerce, more and more suppliers list large numbers of products on e-commerce platforms. Buyers who want to find a product that meets their needs among so many listings depend on the e-commerce search engine: only by searching can they locate and select products and then browse detailed product information.
Buyers therefore expect search results to be both complete and accurate, which places high demands on the precision and recall of search. Word segmentation is a core search technology. It is more than simple splitting, because it also involves understanding product information, so the accuracy of the segmentation result directly affects the accuracy of the search results.
General-purpose search engines already implement segmentation of Chinese and English strings. The following techniques are in common use:
The first is segmentation based on string matching, which includes forward matching, reverse matching, and bidirectional matching. Depending on which length is matched first, it can be further divided into maximum matching and minimum matching.
The second is segmentation based on statistical analysis; the statistical models used include conditional random fields and hidden Markov models. Words are made up of characters, and a character sequence that is popular and widespread enough becomes a fixed word. Forward and reverse matching may produce different segmentation results, and other methods can produce still more. The co-occurrence frequency between the characters or words in each result is computed; the higher the co-occurrence, the tighter the association and the more likely that result is the best segmentation.
The third is segmentation based on a specific dictionary. Different domains usually have their own domain dictionaries, in which information such as part of speech and pinyin can be annotated. The segmentation result is drawn from words that exist in the dictionary. Dictionary-based segmentation cannot stand alone: once the dictionary is fixed, a segmentation algorithm must be chosen to work with it.
The fourth is segmentation based on language understanding. A library of grammar and syntax rules is built for each language, and the text to be segmented is then analyzed against these rules to identify its constituents and resolve ambiguity to some extent.
Among commonly used open-source segmenters, the IK segmenter, for example, combines the string-matching and dictionary-based approaches described above. It packages its dictionary inside a jar file, so maintaining the dictionary data requires repackaging, which makes maintenance costly. When a result is found to be inaccurate, the dictionary cannot be adjusted on the spot, and the effect of a change on segmentation is hard to observe. Rule-based segmenters, such as 2-4 gram segmentation, do not interpret the text at all and are a form of string matching.
English is normally segmented on spaces. In the English product names on an e-commerce platform, however, some phrases are semantically atomic and cannot be split: "hair color" means hair dye, and "car cover" means a protective cover for a car. Such atomic phrases must be recognized, which requires understanding the product name. Extracting them is therefore a requirement for segmenting English input strings.
In e-commerce, business needs often call for different search strategies. For example, a search for "television" should retrieve every product whose title contains "TV" as well as every product whose title contains "tv". In this case "TV" and "tv" can be treated as synonyms of "television". Segmentation is therefore not only a literal splitting of the text; it also involves understanding, so that users can find the products they are looking for.
Ambiguous words, misspelled words, and English singular and plural forms likewise cannot be handled without understanding the text. Existing segmentation based on language understanding can resolve ambiguity to some extent, but it is computationally expensive and its errors are hard to correct once they occur, so it struggles to meet the real-time requirements of e-commerce search.
There is therefore an urgent need for a segmentation approach that reduces misinterpretation of text in e-commerce search, improves search accuracy, and makes segmentation results easy to maintain.
Summary of the invention
The solution provided by the present invention builds a comprehensive dictionary with multiple dimensions. The dictionary is generated in bulk by a program and can be edited and maintained manually, so accuracy is improved by operating on the dictionary. Semantic recognition is applied within the dictionary to identify the center keyword of a product, which gives matching a sound basis. The invention combines string-matching segmentation with statistics-based and dictionary-based segmentation, and maintains and upgrades the dictionary through both automatic and manual means, further improving segmentation accuracy.
The technical solution adopted by the present invention is a method for constructing a multi-dimensional comprehensive dictionary, comprising:
Step 1: select data sources and compile usage statistics.
From the search logs of the e-commerce platform, take the search keywords that users entered over a period of time. Deduplicate the search keywords of each user for each day, count the number of users of each search keyword per day, and sum the daily counts over the period to obtain the user usage of each search keyword for that period. This usage reflects the current distribution of popular search keywords.
Also take the product keyword information of the e-commerce platform as a data source. Deduplicate the product keywords of each supplier, then count how many suppliers use each product keyword when describing products and how many products use it. The more suppliers use a keyword to describe products, the more popular the keyword and the fiercer the competition for it. The more products use a keyword, the fiercer the competition among merchants selling those products.
Step 2: select keywords according to constraints.
The usage statistics produce a large set of candidate keywords. From these candidates, the keywords that satisfy the given constraints are selected for entry into the dictionary.
Step 3: create multi-dimensional maintenance fields for the keywords.
Once the keywords to be maintained have been selected, maintenance fields are created for them, and the dictionary is laid out in a fixed format according to the principles by which these fields are created.
Step 4: use co-occurrence to obtain synonyms of the original keywords and the singular forms of plural English keywords, completing the dictionary content.
Based on the number of co-occurrences between each keyword and other keywords, keywords with high co-occurrence counts are selected as synonyms and as the singular forms of plural keywords.
Step 5: formulate center keyword recognition rules and find the center keyword contained in each original keyword.
Given the nature of selling goods in e-commerce, the dictionary built by the above process is used for segmentation. In addition, because what is traded in e-commerce is a sellable item, a method for identifying the center keyword of a product is proposed. The identified center keyword becomes part of the segmentation result and is prefixed with a marker to distinguish it from ordinary segmentation tokens.
The invention also discloses a search word segmentation method and a center keyword recognition method based on the multi-dimensional comprehensive dictionary.
Beneficial effects of the present invention compared with the prior art:
1. The invention uses statistical methods to build the multi-dimensional comprehensive dictionary and provides a way to maintain it manually. The dictionary is extended along multiple dimensions from the standpoint of understanding the text. For original keywords that are segmented inaccurately, the correct segmentation is supplied and committed to the dictionary, so a more reasonable segmentation result is obtained with little effort.
2. The method is easy to understand and highly maintainable, and the algorithm is efficient and practical to implement. It is especially suited to e-commerce product search but is not limited to it.
3. The invention addresses the high computational load, reliance on a single method, and poor manual maintainability of existing segmentation methods.
4. The segmentation method extends readily to other languages and can be applied to English, Japanese, Korean, and others.
Brief description of the drawings
Fig. 1 is a flowchart of constructing the multi-dimensional comprehensive dictionary of the present invention.
Fig. 2 is a flowchart of word segmentation based on the multi-dimensional comprehensive dictionary of the present invention.
Fig. 3 shows the method of the present invention for recognizing the center keyword of a product name.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The method of this embodiment for constructing a multi-dimensional comprehensive dictionary for the e-commerce domain comprises:
(1) Select data sources and compile usage statistics.
Every day a large number of users find products on the e-commerce platform through search. From the search logs, take the search keywords that users entered over a period of time. Deduplicate the search keywords of each user for each day, count the number of users of each search keyword per day, and sum the daily counts over the period to obtain the user usage of each search keyword for that period. This usage reflects the current distribution of popular search keywords.
On an e-commerce platform, each product carries product keyword information for online marketing purposes. These product keywords are taken as a data source. Deduplicate the product keywords of each supplier, then count how many suppliers use each product keyword when describing products and how many products use it. The more suppliers use a keyword to describe products, the more popular the keyword and the fiercer the competition for it. The more products use a keyword, the fiercer the competition among merchants selling those products.
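The two statistics in step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the row layouts `(day, user, keyword)` and `(supplier, product_id, keyword)`, the function names, and the lowercasing of keywords are all assumptions made for the example.

```python
from collections import Counter

def search_keyword_usage(log_rows):
    """Count, per keyword, the distinct (day, user) pairs that searched for it.

    log_rows: iterable of (day, user, keyword) tuples (assumed layout).
    """
    seen = set()
    usage = Counter()
    for day, user, keyword in log_rows:
        kw = keyword.strip().lower()  # normalization is an assumption
        key = (day, user, kw)
        if key in seen:
            continue  # same user, same day, same keyword: count once
        seen.add(key)
        usage[kw] += 1
    return usage

def product_keyword_usage(product_rows):
    """Return {keyword: (supplier_count, product_count)}.

    product_rows: iterable of (supplier, product_id, keyword) tuples (assumed layout).
    """
    suppliers, products = {}, {}
    for supplier, product_id, keyword in product_rows:
        kw = keyword.strip().lower()
        suppliers.setdefault(kw, set()).add(supplier)   # sets deduplicate per supplier
        products.setdefault(kw, set()).add(product_id)
    return {kw: (len(suppliers[kw]), len(products[kw])) for kw in suppliers}
```

Summing per-day user counts over the period is equivalent to counting distinct `(day, user)` pairs per keyword, which is why a single set of seen triples is enough here.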
(2) Select keywords according to constraints.
The processing above produces a large set of candidate keywords, not all of which qualify for the dictionary. The keywords that satisfy the following constraints are selected from the candidates for entry into the dictionary:
● When the search count of a keyword, the number of suppliers using it, and the number of products using it exceed given thresholds, the keyword has greater value for analysis and use, and it is added to the dictionary as a keyword to be maintained.
● Original keywords that are obviously wrong are filtered out, for example single characters, single words, and misspelled words with low search volume.
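The selection step might look like the sketch below. The text does not give threshold values or say whether all three counts must exceed their thresholds, so the default values and the "all three" reading are assumptions; dropping every single-character keyword is likewise a simplification of the "obviously wrong" filter.

```python
def select_keywords(stats, min_searches=100, min_suppliers=5, min_products=10):
    """Pick candidate keywords that pass the usage thresholds.

    stats: {keyword: (search_count, supplier_count, product_count)}.
    Threshold defaults are hypothetical values chosen for illustration.
    """
    selected = []
    for keyword, (searches, suppliers, products) in stats.items():
        if len(keyword) < 2:
            continue  # simplified stand-in for "obviously wrong" keywords
        if (searches > min_searches
                and suppliers > min_suppliers
                and products > min_products):
            selected.append(keyword)
    return sorted(selected)
```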
(3) Create multi-dimensional maintenance fields for the keywords.
Once the keywords to be maintained have been selected, maintenance fields are created for them according to the following principles:
● Whether the keyword is correct and, if it is wrong, what the corresponding correct keyword is.
● Whether the keyword is sellable; if so, it can serve as the core keyword of a product.
● What the core keyword of the keyword is; for example, the core keyword of "bicycle" (自行车) is "vehicle" (车).
● Whether the keyword is atomic; for example, "Tieguanyin" (铁观音, a variety of tea) is an atomic keyword, and splitting it is meaningless.
● For English, what the base form of the word is, so that its singular and plural forms can be represented clearly.
● Where the segmentation result does not meet actual needs, the correct split is stored under "manual cutting".
The format of the dictionary is therefore as shown in the following table.
The table lists attributes of each original keyword, including "whether correct", "correct word", "synonym", "root", "prototype", "keep original", "manual cutting", and "modification time". The "manual cutting" attribute supplies the correct, manually specified split when the segmentation result for the original keyword is unsatisfactory. The "modification time" attribute is used to keep the data of the segmenter and the dictionary synchronized. A "-" in the table means that the attribute has no value.
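One way to picture a dictionary row is as a record with one field per attribute. The sketch below is only an illustration: the field names and types are assumptions, `None` stands in for the "-" placeholder, and the `sellable` flag comes from the field-creation principles rather than from the attribute list.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LexiconEntry:
    """One dictionary row; field names are hypothetical."""
    keyword: str                              # the original keyword
    is_correct: bool = True                   # "whether correct"
    correct_word: Optional[str] = None        # "correct word", set when is_correct is False
    synonyms: List[str] = field(default_factory=list)  # "synonym"
    root: Optional[str] = None                # "root": core keyword of this keyword
    prototype: Optional[str] = None           # "prototype": base form of an English plural
    keep_original: bool = False               # "keep original": assumed to mean do not split
    sellable: bool = False                    # can act as a product center keyword
    manual_cut: Optional[List[str]] = None    # "manual cutting": hand-specified split
    modified_at: Optional[str] = None         # "modification time"

# Example rows built from cases mentioned in the text.
plural = LexiconEntry(keyword="tvs", prototype="tv")
atomic = LexiconEntry(keyword="car cover", keep_original=True, sellable=True)
```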
(4) Use co-occurrence to obtain synonyms of the original keywords and the base forms of plural English keywords, completing the dictionary content.
Based on the number of co-occurrences between each keyword and other keywords, keywords with high co-occurrence counts are taken as the source of attribute data such as synonyms, singular and plural forms, and correct words. For example, statistics over product keywords show that "TV" and "television" co-occur frequently; the reason is that they are synonyms. In practice a threshold can be set, and keywords whose co-occurrence frequency reaches it are treated as synonyms by default. Likewise, if xxx co-occurs frequently with xxxs or xxxes, xxx can be taken as the base form of the plurals xxxs and xxxes. This removes a labor-intensive step from dictionary initialization. After the data is imported into the dictionary, manual maintenance is still advisable to ensure that the dictionary is correct.
The dictionary construction process is shown in Fig. 1.
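The co-occurrence step can be sketched as below, under the assumption that two keywords "co-occur" when they appear in the keyword list of the same product. The function names and the rule that a pair is a plural/base-form pair when one keyword equals the other plus "s" or "es" follow the xxx/xxxs/xxxes description; everything else is illustrative, and the output is meant as a proposal for manual review.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(keyword_sets):
    """Count how often each unordered keyword pair appears together."""
    counts = Counter()
    for keywords in keyword_sets:
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

def propose_relations(counts, threshold):
    """Split frequent pairs into synonym candidates and plural -> base-form candidates."""
    synonyms, prototypes = [], {}
    for (a, b), n in counts.items():
        if n < threshold:
            continue
        short, long_ = sorted((a, b), key=len)
        if long_ in (short + "s", short + "es"):
            prototypes[long_] = short   # e.g. "tvs" -> "tv"
        else:
            synonyms.append((a, b))     # candidate only; a person should confirm it
    return synonyms, prototypes
```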
(5) Formulate center keyword recognition rules and find the center keyword contained in each original keyword.
Given the nature of selling goods in e-commerce, the dictionary built by the above process is used for segmentation. In addition, because what is traded in e-commerce is a sellable item, a method for identifying the center keyword of a product is proposed. The identified center keyword becomes part of the segmentation result and is prefixed with the marker "#" to distinguish it from ordinary segmentation tokens. Recognition of a product's center keyword follows these rules:
● First analyze the syntactic structure of each language to determine whether the center of a product name lies toward the left or the right. Analysis of Chinese and English product names shows that the center keyword generally appears at the right of the name.
● The dictionary built above records which keywords are sellable and which are modifier keywords. The input product name is scanned from right to left. Modifier keywords are skipped. Text inside brackets (curly, square, or round) is regarded as supplementary information about the product and is also treated as a modifier. When a keyword is found that is in the dictionary and is sellable, then, taking English as an example, if it is not preceded by a semantic conversion word such as "for", "with", "without", "in", or "made of", it is the center keyword. If a semantic conversion word is present, the recognizer jumps to the text before that word and continues, until it finds a product center keyword or fails to find one because of the way the product name is worded.
● The product center keyword must come from the sellable keywords in the dictionary.
After this processing, the center keyword content of the dictionary is complete.
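The recognition rules can be sketched for English product names as follows. This is one reading of the rules, not the patent's code: it assumes a conversion word anywhere to the left of a sellable candidate triggers the jump, it works on single-word tokens only, and the names `find_center_keyword`, `CONVERSION_WORDS`, and `sellable` are hypothetical.

```python
import re

# Semantic conversion words listed in the text; "made of" is merged into one token.
CONVERSION_WORDS = {"for", "with", "without", "in", "made_of"}

def find_center_keyword(name, sellable):
    """Scan a product name right to left and return its center keyword, or None."""
    text = name.lower()
    # Bracketed text is supplementary information, so drop it like a modifier.
    text = re.sub(r"[\(\[\{][^\)\]\}]*[\)\]\}]", " ", text)
    text = re.sub(r"\bmade of\b", "made_of", text)
    tokens = re.findall(r"[a-z0-9_]+", text)
    i = len(tokens) - 1
    while i >= 0:
        if tokens[i] in sellable:
            earlier = [j for j in range(i) if tokens[j] in CONVERSION_WORDS]
            if not earlier:
                return tokens[i]      # sellable and no conversion word before it
            i = earlier[-1] - 1       # jump to just before the nearest conversion word
            continue
        i -= 1                        # modifier or unknown token: skip it
    return None
```

With `sellable = {"machine", "toy", "car", "cover"}`, the name "cover for car" first meets "car", sees "for" to its left, jumps past it, and settles on "cover".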
A search word segmentation method based on the multi-dimensional comprehensive dictionary comprises:
(1) Split the received input string to be segmented into minimum semantic units.
After the input string is received, it is split according to the conventions of its language. The minimum semantic unit of Chinese is a single character. In English, a run of consecutive letters or consecutive digits is treated as a whole, and a space acts as a separator that does not itself occupy a position.
The input sentence is split into minimum semantic units without regard to scanning direction; the same units are identified whether the scan runs left to right or right to left.
For example, take the following input string to be segmented (a Chinese product title, glossed here in English): "smart dazzle 46-inch high-brightness LED liquid crystal display screen (900 brightness)"
The minimum semantic units, with each Chinese character glossed separately, are: "smart/dazzle/46/inch/high/bright/led/liquid/crystal/display/show/screen/(/900/bright/degree/)"
The slashes are separators, and the uppercase "LED" has been converted to lowercase. Meaningless punctuation can be discarded during segmentation.
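A minimal sketch of the splitting rule: runs of Latin letters and runs of digits stay whole, every other non-space character (including each Chinese character and each bracket) becomes its own unit, and letters are lowercased. The regular expression and the function name are assumptions, and the Chinese sample string is an illustrative title, not the original example from the patent.

```python
import re

# Letters run | digits run | any single non-space character.
_UNIT = re.compile(r"[A-Za-z]+|[0-9]+|\S")

def minimal_units(text):
    """Split text into minimum semantic units; spaces separate but produce no unit."""
    return [m.group().lower() for m in _UNIT.finditer(text)]

units = minimal_units("46寸LED液晶显示屏(900亮度)")
# units: ['46', '寸', 'led', '液', '晶', '显', '示', '屏', '(', '900', '亮', '度', ')']
```

Because the pattern is applied to non-overlapping matches, the same units come out regardless of which end a later stage starts scanning from.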
(2) On the basis of the minimum semantic units, apply the reverse maximum matching algorithm with the comprehensive dictionary.
Once the minimum semantic units have been generated, reverse maximum matching is performed against the comprehensive dictionary, as the following example illustrates.
The right-to-left scan proceeds as follows. Suppose the longest original keyword in the dictionary has length 5. First check whether "(900 brightness)" exists in the dictionary. If it does, one segment has been identified. Otherwise reduce the length by 1 and check whether "900 brightness)" is in the dictionary, and so on, following the reverse maximum matching procedure.
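Reverse maximum matching over minimum semantic units can be sketched as below. The function name, the `joiner` parameter (empty for Chinese, a space for English), and the rule that an unmatched single unit is emitted as its own token are assumptions for illustration.

```python
def reverse_max_match(units, lexicon, max_len=5, joiner=""):
    """Segment a list of minimum units by reverse maximum matching.

    units:   minimum semantic units, in reading order
    lexicon: set of known keywords
    max_len: length (in units) of the longest keyword in the lexicon
    joiner:  string used to glue units back together
    """
    result = []
    end = len(units)
    while end > 0:
        # Try the longest window ending at `end` first, then shrink it by one unit.
        for size in range(min(max_len, end), 0, -1):
            candidate = joiner.join(units[end - size:end])
            if size == 1 or candidate in lexicon:
                result.append(candidate)
                end -= size
                break
    result.reverse()  # tokens were collected right to left
    return result
```

For English input the same routine keeps an atomic phrase together: `reverse_max_match(["red", "car", "cover"], {"car cover"}, joiner=" ")` yields `["red", "car cover"]`.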
(3) Process keywords in the segmentation that have synonyms.
Once "LCD screen" has been identified, the segmentation result is "LCD screen/900/brightness".
Because "LCD screen" has the synonym "liquid crystal display" in the dictionary, the result is expanded to "LCD screen/liquid crystal display/900/brightness".
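Synonym expansion amounts to inserting each token's dictionary synonyms right after the token. In the sketch below the function name and the `{keyword: [synonyms]}` mapping are assumptions, and the English strings are stand-ins for the Chinese tokens of the example.

```python
def expand_synonyms(tokens, synonyms):
    """Insert dictionary synonyms directly after the token they belong to."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        for synonym in synonyms.get(token, []):
            if synonym not in expanded:   # avoid adding the same synonym twice
                expanded.append(synonym)
    return expanded

tokens = ["lcd screen", "900", "brightness"]
synonyms = {"lcd screen": ["liquid crystal display"]}
expanded = expand_synonyms(tokens, synonyms)
```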
(4) Handle segmentation errors.
Suppose a segmentation error is found: for example, a Chinese phrase glossed as "makeup and clothing" is segmented as "makeup/and/clothing", while the desired split is "makeup/kimono/garment". In this case the phrase and its desired segmentation are added to the dictionary together. The next time the phrase is segmented, the correct result is obtained from the dictionary.
The dictionary is normally maintained at the minimum semantic granularity. When an ambiguity cannot be resolved automatically, each industry configures manual segmentation results according to its own understanding of the ambiguous case, which yields higher segmentation accuracy. Misspelled words are handled in a similar way. For example, a Chinese sentence meaning "I work at the steel plant" that contains a misspelling of "steel" (glossed literally as "just iron") is split into "I/at/just iron/plant/work". Because "just iron" is marked as a wrong word in the dictionary, it can be converted to "steel", giving the more accurate result "I/at/steel/plant/work".
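The two corrections in this step, a stored manual split for a whole phrase and a wrong-word replacement for a single token, can be layered on top of any segmenter. The sketch below assumes a caller-supplied `segment` function and two hypothetical lookup tables; the sample strings are English placeholders rather than the Chinese examples.

```python
def segment_with_overrides(text, segment, manual_cuts, corrections):
    """Apply dictionary overrides around a base segmenter.

    segment:     callable mapping text to a list of tokens (any segmenter)
    manual_cuts: {phrase: [tokens]} with hand-specified splits
    corrections: {wrong_word: correct_word} for keywords marked as incorrect
    """
    if text in manual_cuts:
        return list(manual_cuts[text])   # the stored manual split wins outright
    return [corrections.get(token, token) for token in segment(text)]

manual_cuts = {"a b c": ["a", "b c"]}
corrections = {"stel": "steel"}
fixed = segment_with_overrides("i work at stel plant", str.split, manual_cuts, corrections)
```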
English has common segmentation errors as well. Segmented on spaces, "red car cover" becomes "red/car/cover" and "car seat cover" becomes "car/seat/cover". A search for "car cover" then hits both products, since both contain "car" and "cover". Segmenting directly on spaces is thus unfavorable to query matching. With the method of the present invention, "car cover" can be added to the dictionary as a whole. "red car cover" is then segmented as "red/car cover", and a search for "car cover" matches only products such as "red car cover", giving higher search accuracy. In e-commerce search in particular, segmentation belongs to text understanding and is only a means; query matching and ranking are the goal of search.
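The effect on query matching can be reproduced with a small sketch. It assumes a title matches when every query token appears among the title tokens, and it only handles two-word atomic phrases; the function names and the `phrases` set are hypothetical.

```python
def segment(title, phrases):
    """Split on spaces, scanning right to left and keeping known two-word phrases whole."""
    words = title.lower().split()
    tokens, i = [], len(words)
    while i > 0:
        pair = " ".join(words[i - 2:i]) if i >= 2 else None
        if pair in phrases:
            tokens.append(pair)
            i -= 2
        else:
            tokens.append(words[i - 1])
            i -= 1
    return tokens[::-1]

def matches(query, title, phrases):
    """True when every query token occurs among the title tokens."""
    title_tokens = set(segment(title, phrases))
    return all(token in title_tokens for token in segment(query, phrases))

plain = [t for t in ("red car cover", "car seat cover") if matches("car cover", t, set())]
atomic = [t for t in ("red car cover", "car seat cover") if matches("car cover", t, {"car cover"})]
```

Without the atomic phrase, `plain` contains both titles; with "car cover" in the dictionary, `atomic` contains only "red car cover".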
Once a reasonably correct segmentation has been obtained, e-commerce matching further requires that the center keyword of the product information be identified during segmentation and included as an important part of the segmentation result.
Fig. 2 is a flowchart of word segmentation based on the multi-dimensional comprehensive dictionary in the present invention.
For example, the segmentation of "Paper Cutting Machine (SQZ-115CTN)" is "paper/cutting/machine/sqz/115/ctn". What this product sells is a "machine", not "paper", so the center keyword recognition rules identify "machine" as the center keyword, and the segmentation result is further processed into "paper/cutting/machine/#machine/sqz/115/ctn". Here "#machine" indicates that the center keyword of the product is "machine"; the other keywords all modify it. When matching uses the center keyword in the segmentation result, a search for "paper" will not rank products whose center keyword is "machine" near the top.
Another case involves semantic conversion words such as "for" and "with", as in the following example.
For a product named "toy for baby, children", what is sold is not "baby" but "toy". Because of the word "for", the center keyword recognition rules place the center keyword before "for", so "toy" is the center keyword and becomes part of the segmentation. The segmentation result of "toy for baby, children" is therefore "toy/#toy/for/baby/children/".
The keyword found in this way is checked against the dictionary. If the dictionary has no corresponding center keyword, the keyword to its left is selected and matched against the dictionary, and this continues until the first keyword is reached.
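The leftward fallback and the "#" marker can be sketched as two small helpers. Placing the marked token immediately after the center keyword follows the two examples in the text; the function names and the `sellable` set are assumptions, and conversion-word handling is left out to keep the sketch short.

```python
def find_sellable_left(tokens, start, sellable):
    """Walk left from `start` and return the index of the first sellable token, or None."""
    for i in range(start, -1, -1):
        if tokens[i] in sellable:
            return i
    return None

def mark_center(tokens, index):
    """Insert '#<keyword>' right after the center keyword at `index`."""
    if index is None:
        return list(tokens)   # no center keyword found: leave the tokens unchanged
    return tokens[:index + 1] + ["#" + tokens[index]] + tokens[index + 1:]

tokens = ["paper", "cutting", "machine", "sqz", "115", "ctn"]
index = find_sellable_left(tokens, len(tokens) - 1, {"machine", "paper"})
marked = mark_center(tokens, index)
```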
Fig. 3 shows the method for recognizing the center keyword of a product name in the present invention.
The above embodiments describe only part of the functionality of the present invention; the embodiments and drawings are not intended to limit it. Any equivalent change or refinement made without departing from the spirit and scope of the invention likewise falls within its scope of protection, which shall be as defined by the claims of this application.

Claims (10)

1. A construction method based on a multi-dimensional comprehensive dictionary, characterized by comprising:
Step 1: select the data sources and compute usage statistics;
From the search logs of an e-commerce platform, take the search keywords that users entered over a period of time. De-duplicate the search keywords per user per day, count the number of users of each search keyword on each day, and sum these daily counts over the period to obtain the user usage of each search keyword for that period. This user usage reflects the current popularity distribution of the search keywords;
Take the commodity keyword information of the e-commerce platform as a second data source and de-duplicate the commodity keywords of each supplier. Then count how many suppliers use each commodity keyword when describing commodities, and how many commodities use it. The more suppliers use a commodity keyword, the more popular the keyword is and the fiercer the competition; the more commodities use a commodity keyword, the fiercer the competition among the merchants selling those commodities;
Step 2: select keywords according to constraint conditions;
The usage statistics yield a large candidate set of keywords. From this candidate data, the keywords that satisfy the specified constraint conditions are selected and entered into the dictionary;
Step 3: create multi-dimensional maintenance fields for the keywords;
Once the keywords to be maintained have been selected, create the fields to be maintained for them and record them in the dictionary in a fixed format, according to the principles by which the fields are created;
Step 4: use co-occurrence relations to obtain the synonyms of the original keywords and the singular forms of plural English keywords, thereby enriching the dictionary;
Using the number of co-occurrences between each keyword and the other keywords, select the keywords with higher co-occurrence counts as synonyms and as the singular forms of plural keywords;
Step 5: formulate center keyword recognition rules and find the center keyword contained in each original keyword;
E-commerce trades in merchandise. In addition to constructing an e-commerce domain dictionary by the above process and using it during word segmentation, a method of recognizing the center keyword of a commodity is proposed for the case where the object traded is a sellable article. The recognized center keyword is made part of the segmentation result, and a mark is prepended to it to distinguish it from ordinary segmentation results.
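The usage statistics of Step 1 can be sketched as follows. This is only an illustration: the tuple layouts `(day, user_id, keyword)` and `(supplier_id, commodity_id, keyword)` and both function names are assumptions, not something the claim specifies.

```python
from collections import Counter


def keyword_user_usage(search_log):
    """Sum, over all days, the number of distinct users per search keyword.

    search_log: iterable of (day, user_id, keyword) tuples (assumed layout).
    """
    # Per-day, per-user de-duplication: repeating a keyword on the same day
    # by the same user counts only once.
    unique_events = {(day, user, kw) for day, user, kw in search_log}
    usage = Counter()
    for _day, _user, kw in unique_events:
        usage[kw] += 1
    return usage


def supplier_and_commodity_counts(listings):
    """Count distinct suppliers and distinct commodities per commodity keyword.

    listings: iterable of (supplier_id, commodity_id, keyword) tuples
    (assumed layout).
    """
    suppliers, commodities = {}, {}
    for supplier, commodity, kw in listings:
        # Sets de-duplicate keywords repeated by the same supplier/commodity.
        suppliers.setdefault(kw, set()).add(supplier)
        commodities.setdefault(kw, set()).add(commodity)
    supplier_counts = {kw: len(s) for kw, s in suppliers.items()}
    commodity_counts = {kw: len(c) for kw, c in commodities.items()}
    return supplier_counts, commodity_counts
```

Using sets makes the de-duplication explicit and keeps each count independent of how often a record is repeated in the raw data.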
2. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in Step 2, the constraint conditions comprise:
● when the search count of a keyword, the number of suppliers using it, and the number of commodities using it exceed certain thresholds, the keyword has greater analysis and usage value and is added to the dictionary as a keyword to be maintained;
● original keywords that are obviously wrong are filtered out.
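A minimal sketch of this selection step is given below. The claim does not say whether all three counts must exceed their thresholds or whether one suffices; the sketch assumes all three, and the function name, parameter names, and the `known_errors` set are hypothetical.

```python
def select_keywords(search_users, supplier_counts, commodity_counts,
                    min_users, min_suppliers, min_commodities,
                    known_errors=frozenset()):
    """Return the candidate keywords that pass the constraint conditions.

    The three count arguments map keyword -> count. Assumption: a keyword
    must exceed all three thresholds to be kept.
    """
    selected = []
    for kw in sorted(search_users):
        if kw in known_errors:
            # Obviously wrong original keywords are filtered out.
            continue
        if (search_users[kw] > min_users
                and supplier_counts.get(kw, 0) > min_suppliers
                and commodity_counts.get(kw, 0) > min_commodities):
            selected.append(kw)
    return selected
```

If the intended reading is that any single count suffices, the `and` operators would become `or`.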
3. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in Step 3, the principles for creating the fields comprise:
● whether the keyword is correct and, if it is wrong, what the corresponding correct keyword is;
● whether the keyword is sellable; if it is, it can serve as the core keyword of a product;
● what the core keyword of the keyword is;
● whether the keyword is an atomic keyword;
● for English, what the base form of the word is;
● where the segmentation result does not meet actual needs, the correct segmentation is stored through "manual segmentation".
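One possible in-memory shape for such a dictionary entry is sketched below. The claim names the kinds of information to record, not a storage format, so the class name and every field name here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One keyword with its maintenance fields (hypothetical layout)."""
    keyword: str
    is_correct: bool = True
    corrected_form: Optional[str] = None   # set only when is_correct is False
    is_sellable: bool = False              # sellable keywords may act as core keywords
    core_keyword: Optional[str] = None     # core keyword of this keyword
    is_atomic: bool = True                 # cannot be split into smaller keywords
    base_form: Optional[str] = None        # English only, e.g. singular of a plural
    manual_segmentation: Optional[List[str]] = None  # hand-maintained correct cut


# Example entry for a plural English keyword.
entry = DictionaryEntry(
    keyword="led lamps",
    is_sellable=True,
    core_keyword="lamp",
    is_atomic=False,
    base_form="led lamp",
    manual_segmentation=["led", "lamps"],
)
```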
4. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in Step 4, a threshold is set, and keywords whose co-occurrence frequency reaches a certain level are treated by default as synonyms and as the base forms of plural English keywords.
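The thresholded co-occurrence rule can be illustrated with the sketch below. It assumes that "co-occurrence" means two keywords appearing in the same group (for example, the keyword list of one commodity); the claim does not define the unit of co-occurrence, and `min_count` is a hypothetical parameter.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_synonyms(keyword_groups, min_count):
    """Map each keyword to the keywords it co-occurs with at least min_count times.

    keyword_groups: iterable of keyword lists; two keywords co-occur when
    they appear in the same list (assumed definition).
    """
    pair_counts = Counter()
    for group in keyword_groups:
        # Sort and de-duplicate so each unordered pair is counted once per group.
        for a, b in combinations(sorted(set(group)), 2):
            pair_counts[(a, b)] += 1

    synonyms = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            synonyms.setdefault(a, set()).add(b)
            synonyms.setdefault(b, set()).add(a)
    return synonyms
```

The output is a candidate list; a pair that merely co-occurs often is not necessarily a true synonym, which is presumably why the dictionary also carries manually maintained fields.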
5. The construction method based on the multi-dimensional comprehensive dictionary according to claim 1, characterized in that, in Step 5, the center keyword recognition rules are specifically:
● first analyze the syntactic structure of each language to determine whether the center of a commodity name lies to the left or to the right; analysis of Chinese and English commodity names shows that the center keyword generally appears at the right of the commodity name;
● the constructed dictionary records which keywords are sellable and which are modifier keywords. The input commodity name is scanned from right to left. A modifier keyword is skipped directly. When brackets are encountered, the information inside them is regarded as a supplementary description of the commodity itself and is treated as a modifier keyword. When a keyword is recognized that is in the dictionary and is sellable, then, if there is no semantic conversion word, that keyword is the recognized center keyword. If there is a semantic conversion word, the device for recognizing the center keyword of the commodity name jumps directly to the position in front of the semantic conversion words and continues recognition there, until a commodity center keyword is found or, owing to the way the commodity name is expressed, none can be found;
● the commodity center keyword must come from the sellable keywords in the dictionary.
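The right-to-left scan described in these rules might look like the sketch below. It assumes the commodity name has already been split into tokens, that brackets arrive as separate tokens, and that "in front of the semantic conversion words" means to the left of the leftmost one (for example, "for" in "case for iphone"). The function and parameter names are hypothetical.

```python
def find_center_keyword(tokens, sellable, modifiers, conversion_words):
    """Return the center keyword of a tokenized commodity name, or None."""
    # Bracketed content is supplementary description: treat it like a
    # modifier by dropping it before scanning.
    kept, depth = [], 0
    for tok in tokens:
        if tok in ("(", "\uff08"):
            depth += 1
        elif tok in (")", "\uff09"):
            depth = max(0, depth - 1)
        elif depth == 0:
            kept.append(tok)

    # If a semantic conversion word is present, only the part in front of
    # (left of) the leftmost one is scanned.
    end = len(kept)
    for i, tok in enumerate(kept):
        if tok in conversion_words:
            end = i
            break

    # Scan right to left, skipping modifiers; the first sellable keyword wins.
    for tok in reversed(kept[:end]):
        if tok in modifiers:
            continue
        if tok in sellable:
            return tok
    return None  # the name's wording offers no sellable keyword
```

Because the result is drawn only from `sellable`, the last rule (the center keyword must be a sellable dictionary keyword) holds by construction.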
6. A search word segmentation method based on a multi-dimensional comprehensive dictionary, characterized by comprising:
Step 1: split the received input string to be segmented into minimum semantic units;
Step 2: on the basis of the minimum semantic units, apply the reverse maximum matching algorithm in combination with the comprehensive dictionary;
Step 3: process the keywords in the segmentation that have synonyms;
Step 4: handle errors in the segmentation result.
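Step 2 names the reverse maximum matching algorithm. A generic textbook sketch over minimum semantic units follows; the `max_len` window and the `sep` joiner (empty for Chinese, a space for English) are illustrative assumptions, and the patent's own implementation may differ.

```python
def reverse_maximum_match(units, dictionary, max_len=4, sep=""):
    """Segment a list of minimum semantic units by reverse maximum matching.

    units: minimum semantic units in reading order.
    dictionary: set of known words, written as units joined by `sep`.
    """
    result = []
    end = len(units)
    while end > 0:
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = sep.join(units[end - size:end])
            # A single unit is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                end -= size
                break
    result.reverse()  # words were collected right to left
    return result
```

For the units of "研究生命起源" and a dictionary containing "研究", "研究生", "生命" and "起源", scanning from the right yields "研究 / 生命 / 起源", whereas a left-to-right longest match would have taken "研究生" first.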
7. The search word segmentation method based on the multi-dimensional comprehensive dictionary according to claim 6, characterized in that, in Step 1:
After the input string to be segmented is received, it is split according to the punctuation conventions of each language. The minimum semantic unit of Chinese is a single Chinese character; in English, a run of consecutive letters or a run of consecutive digits is treated as a whole, and a space serves only as a delimiter and does not occupy a position;
The input sentence is split into minimum semantic units. Generating the minimum semantic units does not depend on the scanning direction; that is, the same minimum semantic units are recognized whether the sentence is scanned from left to right or from right to left.
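The splitting rule can be approximated with a single regular expression, as in the sketch below. It assumes that Chinese characters fall in the basic CJK range U+4E00 to U+9FFF, that letter runs and digit runs are separate units, and that punctuation is discarded along with spaces; the claim itself does not fix these details.

```python
import re

# One CJK character, or a run of ASCII letters, or a run of digits.
# Spaces and punctuation match nothing, so they act purely as delimiters.
_UNIT_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[0-9]+")


def split_minimum_units(text):
    """Split an input string into minimum semantic units."""
    return _UNIT_PATTERN.findall(text)


units = split_minimum_units("红色iPhone 5手机壳")
# units == ['红', '色', 'iPhone', '5', '手', '机', '壳']
```

Since each unit is defined by character class alone, the same units result regardless of scanning direction, which matches the direction independence stated in the claim.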
8. The search word segmentation method based on the multi-dimensional comprehensive dictionary according to claim 6, characterized in that, in Step 4:
Because the multi-dimensional comprehensive dictionary is maintained at the minimum semantic granularity, a manual segmentation result is configured whenever an ambiguity cannot be resolved, thereby achieving higher segmentation accuracy;
Once a reasonably correct segmentation result has been obtained, the recognized center keyword of the commodity information is also included as an important part of the segmentation result.
9. A center keyword recognition method based on a multi-dimensional comprehensive dictionary, characterized in that:
A segmentation result is obtained by the search word segmentation method based on the multi-dimensional comprehensive dictionary according to any one of claims 6 to 8, and the center keyword of the commodity is recognized on the basis of given center keyword recognition rules. The center keyword in the segmentation result is then extracted and matched against the multi-dimensional comprehensive dictionary; if the dictionary contains no corresponding center keyword, the keyword to its left is selected and matched against the dictionary, and so on until the first keyword is reached.
10. The center keyword recognition method based on the multi-dimensional comprehensive dictionary according to claim 9, characterized in that:
The multi-dimensional comprehensive dictionary is built by the construction method based on the multi-dimensional comprehensive dictionary according to any one of claims 1 to 5.
CN201410212388.1A 2014-05-19 2014-05-19 A kind of segmenting method based on various dimensions synthesis dictionary Active CN103942347B (en)

Publications (2)

Publication Number Publication Date
CN103942347A true CN103942347A (en) 2014-07-23
CN103942347B CN103942347B (en) 2017-04-05

Family

ID=51190015

Country Status (1)

Country Link
CN (1) CN103942347B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant