
CN108268438A - Page content extraction method, device and client - Google Patents

Page content extraction method, device and client

Info

Publication number
CN108268438A
Authority
CN
China
Prior art keywords
alternative word
character
word
alternative
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611260567.8A
Other languages
Chinese (zh)
Other versions
CN108268438B (en
Inventor
李洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201611260567.8A priority Critical patent/CN108268438B/en
Publication of CN108268438A publication Critical patent/CN108268438A/en
Application granted granted Critical
Publication of CN108268438B publication Critical patent/CN108268438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/235 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on user input or interaction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a page content extraction method, device and client. The method includes: obtaining the selected region in a page; identifying the characters in the selected region one by one, obtaining the first sentence containing each character, and splitting the first sentence to obtain alternative words; sorting the alternative words using at least one attribute of the alternative words to obtain a ranking result; and choosing the highest-ranked alternative word in the ranking result as the target alternative word and extracting the target alternative word. By splitting first sentences and sorting by alternative-word attributes, the invention quickly and effectively extracts the page content the user selected. The extracted content is more accurate, the user no longer needs to adjust the selection manually afterwards, time is saved, and user experience improves.

Description

Page content extraction method, device and client
Technical field
The present invention relates to the field of Internet technology, and in particular to a page content extraction method, device and client.
Background technology
With the rapid development of the mobile Internet, daily life has become closely tied to the Internet. The Internet now generates massive amounts of data, has become the main source of information, and reaches into every field of network use.
People's demand for information analysis and information processing keeps growing. When reading web page text on a client device, users often need to copy text for further operations, for example searching for it or pasting it into a dialog box for further editing. Because people demand ever greater accuracy and timeliness in information analysis, users want to copy text efficiently and accurately.
In the prior art, several problems occur when a user selects and copies text. In some cases marking is slow, so the operation takes too long. In some cases the content the user wants to copy is not in the default selection, so the wanted content cannot be selected correctly and user experience is poor. In some cases the selection cursor must be adjusted repeatedly before the content can be selected, and even after repeated adjustment the words the user wants still cannot be copied correctly, so operating efficiency is low.
Summary of the invention
To solve the above technical problems, the present invention proposes a page content extraction method, device and client.
In a first aspect, a page content extraction method is provided. The method includes: obtaining the selected region in a page; identifying the characters in the selected region one by one, obtaining the first sentence containing each character, and splitting the first sentence to obtain alternative words; sorting the alternative words using at least one attribute of the alternative words to obtain a ranking result; and choosing a target alternative word according to the ranking result and extracting the target alternative word.
In a second aspect, a page content extraction device is provided. The device includes: a region acquisition module, configured to obtain the selected region in the page; an alternative word generation module, configured to identify the characters in the selected region one by one, obtain the first sentence containing each character, and split the first sentence into alternative words; an attribute sorting module, configured to sort the alternative words according to multiple attributes of the alternative words to obtain a ranking result; and a page content extraction module, configured to choose a target alternative word according to the ranking result and extract the target alternative word.
In a third aspect, a client is provided. The client includes the aforementioned page content extraction device and is installed in a user terminal to extract page content according to the user's input.
The beneficial effects of the technical solutions provided by the embodiments of the present invention include: by splitting first sentences into alternative words and sorting the alternative words using at least one of their attributes, the content in the region the user selected can be extracted quickly and accurately, which makes operations such as copying and searching convenient and greatly improves user experience.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 5 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 6 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 7 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 8 is a flow chart of the page content extraction method provided by an embodiment of the present invention;
Fig. 9 is a block diagram of the page content extraction device provided by an embodiment of the present invention;
Fig. 10 is a block diagram of the page content extraction device provided by an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.
Specific embodiment
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a page content extraction method. Fig. 1 shows a schematic structural diagram of the implementation environment involved in the page content extraction method provided by the embodiment of the present invention. The implementation environment includes a user equipment 101. The user equipment 101 can display the page to be extracted, and the user can perform an operation to select page content. The user equipment can display the selected content according to the user's selection.
In one embodiment of the present invention, a page content extraction method is provided. As shown in Fig. 2, the method includes:
S210. Obtain the selected region in the page.
Specifically, the client can obtain, through the human-machine interface, the region the user selects in the page. For example, the selected region may be the region the user selects by pressing a finger on a touch interface. The selected region may also be a region selected in the interface by drawing or clicking with an input tool such as a stylus.
S220. Identify the characters in the selected region one by one, obtain the first sentence containing each character, and split the first sentence into alternative words.
Specifically, the client identifies the characters contained in the selected region. A character may be a complete character, lying entirely inside the selected region, or an incomplete character, lying only partly inside it. The selected region is the area formed on the user interface when the user presses, touches or slides. If a character lies entirely within the selected region, it is complete with respect to that region. If a character sits on the boundary of the region, partly inside and partly outside, it is incomplete with respect to that region. Both complete and incomplete characters are identified as characters in the selected region, and a flag bit distinguishes them during identification.
In one example, a complete character entirely contained in the selected region is represented by flag bit 1, and an incomplete character is represented by flag bit 0.
In another example, a quantized flag value can represent how complete a character is. A complete character entirely contained in the selected region is represented by flag value 1. An incomplete character is represented by flag value X, a number between 0 and 1 that gives the fraction of the complete character's area lying inside the selected region.
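The quantized flag value can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the character and the selected region are axis-aligned rectangles; the patent does not fix a geometry, and the names `Box` and `completeness` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle (an assumed geometry, not taken from the patent)."""
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # A box with inverted edges has no area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def completeness(char_box: Box, region: Box) -> float:
    """Return X in [0, 1]: the share of the character's area inside the region."""
    if char_box.area() == 0:
        return 0.0
    overlap = Box(
        max(char_box.left, region.left),
        max(char_box.top, region.top),
        min(char_box.right, region.right),
        min(char_box.bottom, region.bottom),
    )
    return overlap.area() / char_box.area()


region = Box(0, 0, 10, 10)
print(completeness(Box(2, 2, 4, 4), region))    # character fully inside the region
print(completeness(Box(8, 0, 12, 10), region))  # character straddling the right edge
```

A value of 1 corresponds to the complete-character flag, and any value below 1 marks an incomplete character. The binary flag of the previous example would simply test whether the value equals 1.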
In one example, the first sentence containing a character is obtained by retrieval at the corresponding position in the page content. A first sentence is the character string in which the character is located, delimited by adjacent punctuation marks. Take the page content "AAAAAA, BBBBBBB, CCCCCCCCCCC, DDDDDDD、EEE、FF;G, HHHHHHH;IIIIIIIIII". The first sentences it contains are "AAAAAA", "BBBBBBB", "CCCCCCCCCCC", "DDDDDDD", "EEE", "FF", "G", "HHHHHHH" and "IIIIIIIIII". Here A, B, C, D, E, F, G, H and I stand for the characters in each first sentence, and the characters may be the same or different.
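As a sketch of this delimiting step, the snippet below splits page text into first sentences at punctuation marks. The delimiter set and the function name `split_first_sentences` are assumptions made for illustration, and the sample string assumes a delimiter between the C run and the D run.

```python
import re

# Assumed delimiter set: common ASCII and full-width punctuation plus line breaks.
DELIMITERS = r"[,.;:!?,。;:!?、\n]+"


def split_first_sentences(page_text):
    """Split page text into punctuation-delimited first sentences."""
    parts = re.split(DELIMITERS, page_text)
    # Drop surrounding whitespace and the empty pieces left by trailing delimiters.
    return [part.strip() for part in parts if part.strip()]


sample = "AAAAAA, BBBBBBB, CCCCCCCCCCC, DDDDDDD、EEE、FF;G, HHHHHHH;IIIIIIIIII"
print(split_first_sentences(sample))
```

On the sample string this yields the nine first sentences listed in the text, from "AAAAAA" to "IIIIIIIIII".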
Specifically, the client splits the first sentence into alternative words. Different word segmentation techniques can be used: a segmentation technique from the prior art, or an improved one such as that of this embodiment. A word is the smallest meaningful language unit that can be used independently. English words are naturally delimited by spaces, whereas Chinese takes the character as its basic written unit and has no explicit separator between words, so Chinese lexical analysis is the foundation and key of Chinese information processing. A mature word segmentation technique is therefore needed when processing Chinese text. The client of this embodiment uses a mature word segmentation technique to split each sentence of the target sentences into an alternative word group, where each alternative word group contains multiple alternative words.
In one example, splitting into alternative words includes: setting the maximum granularity of the split alternative words, where granularity is the number of characters an alternative word contains; reading the continuous character string in the first sentence; matching the continuous character string against a preset vocabulary in left-to-right order; when a character string of a first length in the continuous character string matches the preset vocabulary, judging whether the character string of the first length plus 1 also matches the preset vocabulary; if not, taking the character string of the first length as an alternative word, cutting it from the continuous character string, and continuing to match with the remaining continuous character string; if so, updating the first length to the first length plus 1 and repeating the judgment of whether the character string of the first length plus 1 matches the preset vocabulary.
In one example, splitting into alternative words follows the same steps in the opposite direction: the maximum granularity is set and the continuous character string in the first sentence is read as before, but the continuous character string is matched against the preset vocabulary in right-to-left order. When a character string of a first length matches the preset vocabulary, the client judges whether the character string of the first length plus 1 also matches; if not, the character string of the first length is taken as an alternative word and cut from the continuous character string, and matching continues with the remainder; if so, the first length is updated to the first length plus 1 and the judgment is repeated.
In another example, the split processes of the previous two examples are both performed, and the split result to output is selected according to the principle of maximum granularity and the principle of the minimum number of split words. Take the first sentence "we play at Safari Park" (the example sentence is Chinese, and each split word is shown by its English gloss). Matching from left to right splits it into "we / out of office / lively / object / garden / play", whereas matching from right to left splits it into "we / at / Safari Park / play". According to the maximum granularity principle, "we / at / Safari Park / play" is selected as the output result.
The alternative word splitting method of this embodiment therefore improves the accuracy of alternative word selection, and thus the accuracy of page content extraction.
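The two matching directions and the selection between them can be sketched as follows. Several points are assumptions rather than statements from the patent: the Chinese string is a back-translation of the example sentence, the vocabulary is hypothetical, the "first length" is read as the shortest string that matches at the current position, a character that matches nothing is emitted on its own, and the selection rule compares the longest word first and the word count second.

```python
def split_words(sentence, vocab, max_granularity=5, left_to_right=True):
    """Greedy extend-by-one split of a sentence against a preset vocabulary."""
    # Right-to-left matching is done by reversing the text and the vocabulary.
    text = sentence if left_to_right else sentence[::-1]
    words = set(vocab) if left_to_right else {w[::-1] for w in vocab}
    result, pos = [], 0
    while pos < len(text):
        limit = min(max_granularity, len(text) - pos)
        # "First length": the shortest string at this position found in the vocabulary.
        length = next(
            (k for k in range(1, limit + 1) if text[pos:pos + k] in words), None
        )
        if length is None:
            length = 1  # assumption: an unmatched character becomes its own word
        else:
            # Extend by one character for as long as the longer string still matches.
            while length < limit and text[pos:pos + length + 1] in words:
                length += 1
        result.append(text[pos:pos + length])
        pos += length
    return result if left_to_right else [w[::-1] for w in reversed(result)]


def choose_split(candidates):
    """Prefer the split with the longest word, then the split with fewer words."""
    return max(candidates, key=lambda ws: (max(len(w) for w in ws), -len(ws)))


VOCAB = {"我们", "在", "在野", "生动", "野生动物园", "玩"}  # hypothetical vocabulary
SENTENCE = "我们在野生动物园玩"  # assumed back-translation of the example sentence

forward = split_words(SENTENCE, VOCAB, left_to_right=True)
backward = split_words(SENTENCE, VOCAB, left_to_right=False)
print(" / ".join(forward))
print(" / ".join(backward))
print(" / ".join(choose_split([forward, backward])))
```

With this vocabulary the left-to-right pass produces six short words, the right-to-left pass keeps the five-character word for "Safari Park" intact, and the selection returns the right-to-left split. A different vocabulary can change both results, so the sketch shows the mechanism rather than a guaranteed outcome.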
S230. Sort the alternative words according to multiple attributes of the alternative words to obtain a ranking result.
An alternative word has a variety of attributes, for example its usage popularity, its part of speech, and the number of characters it contains. Using word attributes, the content the user selected can be distinguished and ranked by importance, which makes it easier to identify and extract the page content the user selected.
In this embodiment, the split alternative words are sorted using the integrity of the characters in the alternative word, the popularity of the alternative word, and the part of speech of the alternative word, hereinafter called the first attribute, the second attribute and the third attribute.
When alternative words are sorted by attribute, the first sort uses the integrity attribute value of the alternative word. The integrity attribute reflects the position the user chose when selecting page content, and is the most important indicator for page content extraction. As described above, a quantized flag value can represent the integrity of a character: a complete character entirely contained in the selected region is represented by flag value 1, and an incomplete character is represented by flag value X, a number between 0 and 1 that gives the fraction of the complete character's area lying inside the selected region. The integrity attribute value of a word is then the average of the integrity values of its characters. For example, if the five characters of the alternative word "Safari Park" (five characters in the original Chinese) have integrity X1, X2, X3, X4 and X5, the integrity of the word is (X1+X2+X3+X4+X5) divided by 5. In general, the integrity formula is:
$$\text{integrity} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where i is the ordinal of a character in the alternative word, n is the number of characters in the alternative word, and X_i is the integrity of the i-th character.
In the above example, if X1 to X5 are 0.6, 1, 1, 1 and 0.8 respectively, the integrity is:
$$\frac{0.6 + 1 + 1 + 1 + 0.8}{5} = 0.88$$
After the alternative words are sorted, the method further includes judging whether the integrity of each alternative word exceeds a first preset threshold. If the integrity of an alternative word is too low, the word lies away from the center of the region the user selected, so the threshold screens out words the user did not select. In one example, if the first preset threshold is set to 50% of the area of the selected region, then every alternative word whose integrity exceeds 50% of the area of the selected region is stored by the client in the first alternative phrase. The alternative words in the first alternative phrase are the objects of the second sort.
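The averaging and the threshold screening can be sketched as below. The function names and the sample words are hypothetical, and the threshold is modelled as a plain ratio of 0.5, which is one reading of the 50% figure.

```python
def word_integrity(char_values):
    """Average the per-character integrity values X_1..X_n of one alternative word."""
    return sum(char_values) / len(char_values)


def first_sort(candidates, threshold=0.5):
    """Sort words by integrity (highest first) and keep those above the threshold.

    `candidates` maps each alternative word to the list of its character values.
    The 0.5 default is an assumed reading of the first preset threshold.
    """
    scored = [(word, word_integrity(values)) for word, values in candidates.items()]
    kept = [(word, score) for word, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Character values for "Safari Park" are the ones used in the worked example.
sample = {"Safari Park": [0.6, 1, 1, 1, 0.8], "play": [0.2], "at": [1.0]}
for word, score in first_sort(sample):
    print(word, round(score, 2))
```

The word "play" falls below the threshold and is dropped, which mirrors how the first alternative phrase is formed before the second sort.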
In one example, the client obtains the integrity ranking result of the alternative words and takes the highest-ranked alternative word as the target alternative word.
In one example, the client sorts the first alternative phrase again using the popularity and part of speech of the alternative words, obtaining the ranking result. The alternative words in the ranking result have priorities, and the word with the highest priority is the target word. The popularity of an alternative word is the number of times it has been searched in a hot word service. The part of speech of an alternative word is the grammatical category of the word, which is the basis on which words are classified. A hot word service is a service related to hot words, such as a search engine or an input method.
S240. Choose a target alternative word according to the ranking result, and extract the target alternative word.
Specifically, the alternative words in the ranking result are necessarily ordered, and this order can be used to choose, as target alternative words, one or several alternative words that have higher coverage and satisfy the popularity and part-of-speech criteria.
In one example, the ranking result yields one target alternative word. The client highlights the target alternative word on the page and copies it at the same time. The user can then perform target-word operations on the copied word, for example pasting it into a chat dialog box for editing, or running a related search on it.
In one example, the ranking result yields multiple target alternative words. The client highlights the multiple target alternative words on the page and waits for the user to choose among them, then copies the target alternative word the user chooses. The user can then perform target-word operations on the copied word, for example pasting it into a chat dialog box for editing, or running a related search on it.
The client highlights and marks the alternative word in the text. The highlighted word is the user's target word, which the client then copies.
In summary, this embodiment splits first sentences and sorts by alternative-word attributes, so the page content the user selects can be extracted quickly and effectively. The extracted content is more accurate, the user no longer needs to adjust the selection manually afterwards, time is saved, and user experience improves.
Referring to Fig. 3, this embodiment proposes a page content extraction method that includes the following steps:
S310. Obtain the selected region in the page.
For example, suppose the user operates a mobile phone client and, while browsing a web page, needs to copy text. The user operates on the touch screen of the mobile phone client, and the contact between the user's finger and the touch screen yields an annular selected region. As shown in Fig. 3, the annular region is the selected region in the text.
S320. Identify the characters in the selected region, obtain all the sentences corresponding to the characters, and delete the repeated sentences among them to obtain the target sentences.
Step S320 includes the following sub-steps:
S3201. Identify the characters in the selected region. Referring to Fig. 5, this step includes:
S32011. Identify the complete characters in the selected region, and add a complete-character flag bit to each complete character.
S32012. Identify the incomplete characters in the selected region, and add an incomplete-character flag bit to each incomplete character.
In step S320, the client identifies all characters in the selected region by a character acquisition technique. Referring to Fig. 4, the characters belonging to the selected region (each given here by the English gloss of one Chinese character) are:
【two, ten, state, collect, out, color】
Here "ten" is a complete character in the selected region, while "two", "state", "collect", "out" and "color" are incomplete characters in the selected region. A character flag bit is added to each of these characters to indicate whether the character is complete, or how complete it is.
S3202. Obtain all the first sentences corresponding to the characters. Referring to Fig. 6, this step includes:
S32021. Retrieve the characters in the page content to obtain the multiple first sentences corresponding to each character in the selected region.
S32022. Query the multiple first sentences to judge whether they contain repeated first sentences.
S32023. If so, delete the repeated first sentences.
Specifically, to determine the first sentence a character belongs to, the client takes the punctuation marks between sentences as boundaries and identifies, in turn, the first sentence corresponding to every character in the selected region. The client removes repeated sentences by a deduplication technique. For example, still referring to Fig. 4, the first sentence corresponding to "two" is "The Group of Twenty leaders' Antalya summit was held very successfully", and the first sentence corresponding to "ten" is the same sentence. Because the first sentences of "two" and "ten" are identical, only one copy of the repeated sentence is kept and the other identical copies are deleted. Identifying and deduplicating in this way finally yields the sentences corresponding to the characters, that is, the target sentences:
【The Group of Twenty leaders' Antalya summit was held very successfully.
Thanks again to the presiding country, Turkey, for its outstanding work last year and the positive results achieved.】
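The lookup and deduplication of sub-steps S32021 to S32023 can be sketched as follows. This is a minimal illustration under stated assumptions: selected characters are given as indices into the page text, the punctuation set is assumed, and the English sample page and the names `enclosing_sentence` and `target_sentences` are hypothetical.

```python
# Assumed sentence delimiters (ASCII and full-width punctuation, line breaks).
PUNCTUATION = set(",.;:!?,。;:!?、\n")


def enclosing_sentence(page, index):
    """Return the (start, end) span of the first sentence containing page[index]."""
    start = index
    while start > 0 and page[start - 1] not in PUNCTUATION:
        start -= 1
    end = index
    while end < len(page) and page[end] not in PUNCTUATION:
        end += 1
    return start, end


def target_sentences(page, selected_indices):
    """Collect the first sentence of every selected character, without repeats."""
    # dict.fromkeys keeps one entry per distinct span, which removes duplicates.
    spans = dict.fromkeys(enclosing_sentence(page, i) for i in selected_indices)
    return [page[start:end].strip() for start, end in sorted(spans)]


page = "The summit was a success. Thanks again to the host."
# Indices 4 and 5 fall in the first sentence, index 30 in the second.
print(target_sentences(page, [4, 5, 30]))
```

Two of the three selected characters share a sentence, so only two target sentences remain, in page order.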
S3203. Split the first sentences to obtain alternative words. Referring to Fig. 7, this step includes the following sub-steps:
S32031. Read the continuous character string in the first sentence;
S32032. Match the continuous character string against the preset vocabulary in left-to-right order;
S32033. When a character string of a first length in the continuous character string matches the preset vocabulary, judge whether the character string of the first length plus 1 matches the preset vocabulary;
S32034. If not, take the character string of the first length as an alternative word, cut it from the continuous character string, and continue matching with the remaining continuous character string;
S32035. If so, update the first length to the first length plus 1, and repeat the judgment of whether the character string of the first length plus 1 matches the preset vocabulary.
Specifically, the sentences selected in Fig. 4 are split as follows.
"The Group of Twenty leaders' Antalya summit was held very successfully." is split into:
【twenty states, group, leaders, Antalya, summit, held, (particle), very, success】
"Thanks again to the presiding country, Turkey, for its outstanding work last year and the positive results achieved." is split into:
【again, thank, last year, presiding, state, Turkey, outstanding, work, and, achieved, positive, results】
S330. Sort the alternative words using at least one attribute of the alternative words to obtain a ranking result. An alternative word has a variety of attributes, for example its usage popularity, its part of speech, and the number of characters it contains. Using word attributes, the content the user selected can be distinguished and ranked by importance, which makes it easier to identify and extract the page content the user selected.
In one example, the alternative words are sorted using a single attribute, yielding the ranking result of the alternative words. For example, they can be sorted by the integrity attribute of the characters in each alternative word. Because the integrity of each character is obtained when the characters are acquired, the integrity of an alternative word follows from the formula:
$$\text{integrity} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where i is the ordinal of a character in the alternative word, n is the number of characters in the alternative word, and X_i is the integrity of the i-th character. This gives an integrity value for each alternative word, and the alternative words can then be sorted by that value.
In another example, the alternative words are sorted by popularity. The popularity of an alternative word can be looked up from the hot-word labels in the lexicon; these labels are derived from big-data collection over Internet search engines or instant messaging tools. For example, if the heat values of "roast duck", "park" and "caravan" are 3,700,000 searches, 1,500,000 searches and 80 searches respectively, the popularity order of the three is "roast duck-park-caravan".
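Sorting by a single attribute amounts to ordering the words by one numeric key. The short sketch below uses the heat values quoted above; the function name `sort_by_heat` and the dictionary layout are illustrative assumptions.

```python
def sort_by_heat(heat_values):
    """Return the words ordered from the highest to the lowest heat value."""
    return sorted(heat_values, key=heat_values.get, reverse=True)


# Heat values taken from the example in the text.
heat_values = {"park": 1_500_000, "caravan": 80, "roast duck": 3_700_000}
print("-".join(sort_by_heat(heat_values)))  # roast duck-park-caravan
```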
In one example, the alternative words are sorted by two or three attributes: they are first sorted by the first attribute, and the ranking is then corrected using the second attribute and/or the third attribute. In this case, step S330 can include the following sub-steps:
S3301, sort the multiple alternative words by priority according to their first attribute values, obtaining a first ranking result;
S3302, judge whether the first attribute value of an alternative word is greater than a first preset threshold, and if so, store the alternative word in a first alternative phrase;
S3303, sort the alternative words in the first alternative phrase again according to their second attribute values or third attribute values, obtaining the ranking result.
Suppose the first attribute is the completeness of the alternative word, the second attribute is its popularity, and the third attribute is its part of speech. The alternative words are first sorted by completeness. Each word's completeness is then compared with the preset threshold, and the words whose completeness exceeds the threshold form the first alternative phrase, which is then sorted by popularity. If a unique alternative word still cannot be determined after sorting by popularity, the words are sorted again by part of speech.
The attributes are not limited to the three mentioned above; the number of characters in an alternative word, for example, can also take part in the sorting. The order of the attributes can also be permuted. For example, popularity can be chosen as the first attribute so that the words are sorted by popularity first, which favors the direct selection of network hot words and improves the efficiency and accuracy of content extraction.
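Sub-steps S3301 to S3303 can be sketched as below. This is a minimal illustration under assumptions: the dictionary fields `completeness` and `heat`, the sample values and the threshold of 0.5 are hypothetical and do not come from the patent. The attributes are passed in as key functions, so their order can be swapped as described above.

```python
def rank_alternative_words(words, first_key, first_threshold, second_key):
    """Sort by a first attribute, filter by a threshold, then sort again by a second one."""
    # S3301: priority ranking by the first attribute value (highest first).
    first_ranking = sorted(words, key=first_key, reverse=True)
    # S3302: keep the words whose first attribute value exceeds the threshold.
    first_group = [w for w in first_ranking if first_key(w) > first_threshold]
    # S3303: sort the retained group again by the second attribute value.
    return sorted(first_group, key=second_key, reverse=True)


# Hypothetical sample data: completeness in [0, 1], heat as a search count.
words = [
    {"text": "summit", "completeness": 1.0, "heat": 5000},
    {"text": "Antalya", "completeness": 0.9, "heat": 8000},
    {"text": "very", "completeness": 0.3, "heat": 20000},
]
ranked = rank_alternative_words(
    words,
    first_key=lambda w: w["completeness"],
    first_threshold=0.5,
    second_key=lambda w: w["heat"],
)
print([w["text"] for w in ranked])  # ['Antalya', 'summit']
```

Here "very" is dropped by the completeness threshold even though its heat value is the highest, which is the effect of filtering before the second sort.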
In one example, referring to Fig. 8, step S3303 can further include the following sub-steps:
S33031, obtain the second attribute value of each alternative word in the first alternative phrase, and compare the second attribute value of the alternative word with a second preset threshold;
S33032, if there is an alternative word whose second attribute value is greater than the second preset threshold, sort the alternative words in the first alternative phrase again according to their second attribute values;
S33033, if there is no alternative word whose second attribute value is greater than the second preset threshold, sort the alternative words in the first alternative phrase again according to their third attribute.
Specifically, if checking the alternative words in the first ranking result one by one shows that none of them is a hot word, or that their network popularity is too low to serve as a sorting basis, the alternative words in the first alternative phrase are sorted again according to the third attribute of the alternative words to obtain the ranking result.
The third attribute includes the part of speech of the alternative word. Statistics on the copying behavior of a large number of users show that users are more likely to want to copy nouns, adjectives and verbs, with nouns the most likely. The alternative words are therefore sorted in the following order:
Noun > Adjective > Verb > Other words
Here, the other words include numerals, quantifiers, pronouns and so on. Since words of these parts of speech are very unlikely to be the content a user wants to copy by default, they need not be distinguished from one another.
For example, in Fig. 3, consider the popularity of the alternative words. If "20 state" is recognized and has been searched 10,000 times, it is a hot word with a heat value of 10,000. If "outstanding" is input and a lookup in the hot-word lexicon shows that it is also a hot word, searched 5,000 times, its heat value is 5,000. Sorting the two by heat value then ranks "20 state" above "outstanding". However, if the preset heat threshold is above 10,000, the heat value is not used as the sorting reference, and part of speech is used as the sorting criterion instead.
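The choice between the second and third attribute in S33031 to S33033 can be sketched as follows. This is an illustration, not the patent's code: the part-of-speech tags assigned to each word, the word "obtain" with its heat value of 7,000, and the two thresholds are assumptions; only the heat values of "20 state" and "outstanding" come from the example above.

```python
# Part-of-speech priority from the text: noun > adjective > verb > other words.
POS_RANK = {"noun": 0, "adjective": 1, "verb": 2}
OTHER_RANK = 3  # numerals, quantifiers, pronouns, ... are not distinguished


def resort_first_group(group, heat_threshold):
    """Sort by heat if any word exceeds the threshold, otherwise by part of speech."""
    # S33031/S33032: some heat value exceeds the threshold, so heat is usable.
    if any(w["heat"] > heat_threshold for w in group):
        return sorted(group, key=lambda w: w["heat"], reverse=True)
    # S33033: no heat value exceeds the threshold, so fall back to part of speech.
    return sorted(group, key=lambda w: POS_RANK.get(w["pos"], OTHER_RANK))


group = [
    {"text": "outstanding", "heat": 5000, "pos": "adjective"},  # pos tag assumed
    {"text": "obtain", "heat": 7000, "pos": "verb"},            # hypothetical entry
    {"text": "20 state", "heat": 10000, "pos": "noun"},         # pos tag assumed
]
by_heat = [w["text"] for w in resort_first_group(group, heat_threshold=8000)]
by_pos = [w["text"] for w in resort_first_group(group, heat_threshold=20000)]
print(by_heat)  # ['20 state', 'obtain', 'outstanding']
print(by_pos)   # ['20 state', 'outstanding', 'obtain']
```

With the higher threshold no heat value qualifies, so the adjective moves ahead of the verb even though the verb has been searched more often.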
S340, select a target alternative word according to the ranking result, and extract the target alternative word.
According to the ranking result, the client can take the highest-ranked alternative word as the target alternative word and extract it. Specifically, extraction can cover two operations: copying the alternative word, or copying the alternative word into memory in advance.
Specifically, the client highlights, in the text, the word that has first priority in the second ranking result. The highlighted word is the user's target alternative word, and the client then copies it. The highlighting can be background highlighting, color highlighting, shape highlighting and so on. Background highlighting changes the background color of the target alternative word so that the region containing the word is displayed as highlighted; color highlighting changes the font color of the word so that it stands out; shape highlighting changes the font of the alternative word or the shape of the region containing it.
In summary, the page content extraction method provided in this embodiment sorts and filters by multiple attributes, which can greatly improve the efficiency and accuracy of content extraction. For example, after the alternative words are sorted by completeness, the alternative words in the first ranking result are examined further and sorted again by popularity or part of speech, so that the content the user intends to operate on is copied more efficiently.
Referring to Fig. 9, this embodiment provides a page content extraction device, the device comprising:
a region acquisition module, which performs step S210 and is used to obtain the selected region in the page;
an alternative word generation module, which performs step S220 and is used to identify the characters in the selected region one by one, obtain the first sentence containing the characters, and split the first sentence into alternative words;
an attribute sorting module, which performs step S230 and is used to sort the alternative words according to multiple attributes of the alternative words, obtaining a ranking result;
a page content extraction module, which performs step S240 and is used to take the highest-ranked alternative word in the ranking result as the target alternative word and extract the target alternative word.
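The way the four modules hand data to one another can be pictured with a small sketch. Everything here is an assumption made for illustration: the class name `PageContentExtractor`, the choice of plain callables as modules, and the representation of the selected region as a string. The stand-in modules in the usage example are deliberately trivial.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PageContentExtractor:
    """Chains the modules in the order S210 -> S220 -> S230 -> S240."""

    acquire_region: Callable[[], str]             # region acquisition module
    generate_words: Callable[[str], List[str]]    # alternative word generation module
    sort_words: Callable[[List[str]], List[str]]  # attribute sorting module

    def extract(self) -> Optional[str]:
        """Page content extraction module: return the highest-ranked word, if any."""
        region_text = self.acquire_region()
        words = self.generate_words(region_text)
        ranked = self.sort_words(words)
        return ranked[0] if ranked else None


# Trivial stand-ins for the real modules, used only to show the data flow.
extractor = PageContentExtractor(
    acquire_region=lambda: "roast duck park",
    generate_words=lambda text: text.split(),
    sort_words=lambda words: sorted(words, key=len, reverse=True),
)
print(extractor.extract())  # roast
```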
Referring to Fig. 10, this embodiment provides a page content extraction device, the device comprising:
A region acquisition module, which performs step S310 and is used to obtain the selected region in the page.
An alternative word generation module, which performs step S320 and is used to identify the characters in the selected region, obtain all sentences corresponding to the characters, and delete the repeated sentences among them to obtain the target sentence.
The alternative word generation module includes the following submodules:
A character recognition submodule, which performs step S3201 and is used to identify the characters in the selected region.
The character recognition submodule includes:
A complete character recognition submodule, which performs step S32011 and is used to identify the complete characters in the selected region and add a complete-character flag to each complete character.
An incomplete character recognition submodule, which performs step S32012 and is used to identify the incomplete characters in the selected region and add an incomplete-character flag to each incomplete character.
A first-sentence acquisition submodule, which performs step S3202 and is used to obtain all first sentences corresponding to the characters.
The first-sentence acquisition submodule includes the following submodules:
A first-sentence retrieval submodule, which performs step S32021 and is used to retrieve the characters in the page content so as to obtain the multiple first sentences corresponding to each character in the selected region.
A query submodule, which performs step S32022 and is used to query the multiple first sentences to judge whether repeated first sentences exist among them.
A deduplication submodule, which performs step S32023 and is used to delete the repeated first sentences when repeated first sentences exist.
A first-sentence splitting submodule, which performs step S3203 and is used to split the first sentence to obtain alternative words. This submodule includes the following submodules:
A character string reading submodule, which performs step S32031 and is used to read a continuous character string in the first sentence;
A matching submodule, which performs step S32032 and is used to match the continuous character string against the preset vocabulary in left-to-right order;
A first matching judgment submodule, which performs step S32033 and is used to judge, when a character string of the first length in the continuous character string matches the preset vocabulary, whether the character string of the first length plus 1 matches the preset vocabulary;
A first logic judgment submodule, which performs step S32034 and is used, when the judgment result of the first matching judgment submodule is no, to take the character string of the first length as an alternative word, cut it from the continuous character string, and continue the matching with the remaining continuous character string;
A second logic judgment submodule, which performs step S32035 and is used, when the judgment result of the first matching judgment submodule is yes, to increase the first length by 1 as the updated first length and continue with the step of judging whether the character string of the first length plus 1 matches the preset vocabulary.
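The retrieval, query and deduplication submodules for first sentences can be illustrated with a short sketch. The patent does not say how sentence boundaries are found or how selected characters are addressed, so the sentence-ending punctuation set, the use of character offsets into the page text, and the function name `first_sentences` are all assumptions.

```python
import re

# Assumption: a sentence runs up to and including one of these end marks.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def first_sentences(page_text, selected_offsets):
    """Return the distinct sentences containing the selected character offsets."""
    spans = [m.span() for m in SENTENCE.finditer(page_text)]
    found = []
    for offset in selected_offsets:
        # Retrieval: find the sentence whose span covers this character.
        for start, end in spans:
            if start <= offset < end:
                found.append(page_text[start:end].strip())
                break
    # Query and deduplication: drop repeats while keeping first-seen order.
    return list(dict.fromkeys(found))


page = "The summit was a success. Thanks to Turkey for its work."
# Offsets 4 and 5 fall in the first sentence, offset 30 in the second.
print(first_sentences(page, [4, 5, 30]))
```

Two selected characters from the same sentence produce that sentence twice during retrieval, which is why the deduplication step is needed before splitting.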
An attribute sorting module, which performs step S330 and is used to sort the alternative words by at least one attribute of the alternative words, obtaining a ranking result. An alternative word has several attributes, for example its popularity, its part of speech, and the number of characters it contains. These attributes make it possible to distinguish the content the user selected and to rank it by importance, so that the page content selected by the user is easier to identify and extract.
In one example, the attribute sorting module can include the following submodules:
A first attribute sorting submodule, which performs step S3301 and sorts the multiple alternative words by priority according to their first attribute values, obtaining a first ranking result;
A first attribute judgment submodule, which performs step S3302 and judges whether the first attribute value of an alternative word is greater than the first preset threshold, and if so, stores the alternative word in the first alternative phrase;
A secondary sorting submodule, which performs step S3303 and sorts the alternative words in the first alternative phrase again according to their second attribute values or third attribute values, obtaining the ranking result.
Suppose the first attribute is the completeness of the alternative word, the second attribute is its popularity, and the third attribute is its part of speech. The alternative words are first sorted by completeness. Each word's completeness is then compared with the preset threshold, and the words whose completeness exceeds the threshold form the first alternative phrase, which is then sorted by popularity. If a unique alternative word still cannot be determined after sorting by popularity, the words are sorted again by part of speech.
The attributes are not limited to the three mentioned above; the number of characters in an alternative word, for example, can also take part in the sorting. The order of the attributes can also be permuted. For example, popularity can be chosen as the first attribute so that the words are sorted by popularity first, which favors the direct selection of network hot words and improves the efficiency and accuracy of content extraction.
In one example, the secondary sorting submodule can further include the following submodules:
A second attribute threshold comparison submodule, which performs step S33031, obtains the second attribute value of each alternative word in the first alternative phrase, and compares the second attribute value of the alternative word with the second preset threshold;
A first logic sorting submodule, which performs step S33032 and, when there is an alternative word whose second attribute value is greater than the second preset threshold, sorts the alternative words in the first alternative phrase again according to their second attribute values;
A second logic sorting submodule, which performs step S33033 and, when there is no alternative word whose second attribute value is greater than the second preset threshold, sorts the alternative words in the first alternative phrase again according to their third attribute.
A page content extraction module, which performs step S340 and is used to select the highest-ranked alternative word in the ranking result as the target alternative word and extract the target alternative word.
Referring to Fig. 11, this embodiment provides a terminal that can be used to implement the page content extraction method provided in the above embodiments. Specifically:
Terminal 700 can include an RF (Radio Frequency) circuit 110, a memory 120 including one or more computer-readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a WiFi (wireless fidelity) module 170, a processor 180 including one or more processing cores, a power supply 190 and other components. Those skilled in the art will understand that the terminal structure shown in the figure does not limit the terminal, which can include more or fewer components than illustrated, combine certain components, or arrange the components differently. Specifically:
RF circuit 110 can be used to receive and send signals while information is being sent and received or during a call. In particular, after receiving downlink information from a base station, it passes the information to one or more processors 180 for processing; it also sends uplink data to the base station. In general, RF circuit 110 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer and the like. RF circuit 110 can also communicate with networks and other devices by wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service) and the like.
Memory 120 can be used to store software programs and modules; processor 180 performs various functional applications and data processing by running the software programs and modules stored in memory 120. Memory 120 can mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area can store data created through the use of terminal 700 (such as audio data or a phone book). In addition, memory 120 can include high-speed random access memory and can also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, memory 120 can also include a memory controller to provide processor 180 and input unit 130 with access to memory 120.
Input unit 130 can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control. Specifically, input unit 130 may include a touch-sensitive surface 131 and other input devices 132. Touch-sensitive surface 131, also called a touch display screen or trackpad, collects touch operations by the user on or near it (for example, operations performed on or near touch-sensitive surface 131 with a finger, a stylus or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, touch-sensitive surface 131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to processor 180, and can receive and execute commands sent by processor 180. Touch-sensitive surface 131 can be implemented in several types, such as resistive, capacitive, infrared and surface acoustic wave. Besides touch-sensitive surface 131, input unit 130 can also include other input devices 132. Specifically, other input devices 132 can include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and an on/off key), a trackball, a mouse, a joystick and the like.
Display unit 140 can be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of terminal 700; these graphical user interfaces can be composed of graphics, text, icons, video and any combination thereof. Display unit 140 may include a display panel 141, which can optionally be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display or the like. Further, touch-sensitive surface 131 can cover display panel 141. When touch-sensitive surface 131 detects a touch operation on or near it, it passes the operation to processor 180 to determine the type of touch event, and processor 180 then provides the corresponding visual output on display panel 141 according to the type of touch event. Although in Fig. 11 touch-sensitive surface 131 and display panel 141 implement the input and output functions as two separate components, in some embodiments touch-sensitive surface 131 and display panel 141 can be integrated to implement the input and output functions.
Terminal 700 may also include at least one sensor 150, such as an optical sensor, a motion sensor or other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of display panel 141 according to the ambient light, and the proximity sensor can turn off display panel 141 and/or the backlight when terminal 700 is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used in applications that recognize the terminal's posture (such as landscape/portrait switching, related games and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that terminal 700 can also be equipped with, such as a gyroscope, barometer, hygrometer, thermometer and infrared sensor, are not described further here.
Audio circuit 160, loudspeaker 161 and microphone 162 can provide an audio interface between the user and terminal 700. Audio circuit 160 can convert received audio data into an electrical signal and transmit it to loudspeaker 161, which converts it into a sound signal for output. Conversely, microphone 162 converts a collected sound signal into an electrical signal, which audio circuit 160 receives and converts into audio data; after the audio data is output to processor 180 for processing, it is sent through RF circuit 110 to, for example, another terminal, or output to memory 120 for further processing. Audio circuit 160 may also include an earphone jack to provide communication between a peripheral earphone and terminal 700.
WiFi is a short-range wireless transmission technology. Through WiFi module 170, terminal 700 can help the user send and receive e-mail, browse web pages, access streaming media and so on; it provides the user with wireless broadband Internet access. Although Fig. 11 shows WiFi module 170, it is understood that the module is not an essential part of terminal 700 and can be omitted as needed without changing the essence of the invention.
Processor 180 is the control center of terminal 700. It connects the parts of the whole terminal through various interfaces and lines, and performs the functions of terminal 700 and processes data by running or executing the software programs and/or modules stored in memory 120 and calling the data stored in memory 120, thereby monitoring the terminal as a whole. Optionally, processor 180 may include one or more processing cores. Preferably, processor 180 can integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It is understood that the modem processor need not be integrated into processor 180.
Terminal 700 further includes a power supply 190 (such as a battery) that powers the components. Preferably, the power supply can be logically connected to processor 180 through a power management system, so that charging, discharging, power consumption management and similar functions are handled by the power management system. Power supply 190 can also include any of the following components: one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator and the like.
Although not shown, terminal 700 can also include a camera, a Bluetooth module and the like, which are not described further here. Specifically, in this embodiment the display unit of the terminal is a touch-screen display, and the terminal further includes a memory and one or more programs. The one or more programs are stored in the memory and configured to be executed by one or more processors, and they contain instructions for the following operations:
obtain the selected region in the text;
identify the characters in the selected region, and obtain the sentence corresponding to the characters;
split the sentence into multiple alternative words;
sort the multiple alternative words by priority according to the alternative word attributes, obtaining a ranking result;
mark a target word according to the ranking result, and copy the target word.
Further, the processor of the terminal is also configured to execute instructions for the following operations: identify all sentences corresponding to the characters in the selected region; delete the repeated sentences among them to obtain the sentence corresponding to the characters.
Further, the processor of the terminal is also configured to execute instructions for the following operation: split the sentence using the forward maximum matching algorithm to obtain multiple alternative words.
Further, the processor of the terminal is also configured to execute instructions for the following operations: sort the multiple alternative words by priority according to the first attribute of the alternative words, obtaining a first ranking result; judge whether the first attribute of an alternative word is greater than the first preset threshold, and if so, store the alternative word in the first alternative phrase; sort the alternative words in the first alternative phrase again according to the second attribute or the third attribute of the alternative words, obtaining the ranking result.
Specifically, the first attribute includes the completeness of the alternative word, and the completeness of the alternative word is the area that the alternative word occupies in the selected region.
Further, the processor of the terminal is also configured to execute instructions for the following operations: obtain the second attribute of each alternative word in the first alternative phrase, and compare the second attribute of the alternative word with the second preset threshold; if there is an alternative word whose second attribute is greater than the second preset threshold, sort the alternative words in the first alternative phrase again according to their second attribute; if there is no alternative word whose second attribute is greater than the second preset threshold, sort the alternative words in the first alternative phrase again according to their third attribute.
Further, the second attribute of the alternative word includes the popularity of the alternative word, and the third attribute includes the part of speech of the alternative word.
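The completeness attribute described above, that is, how much of a character or word lies inside the selected region, can be illustrated geometrically. The sketch below is an assumption-laden illustration: it treats both the selected region and each character as axis-aligned rectangles `(x0, y0, x1, y1)`, and the names `overlap_fraction` and `flag_characters` are hypothetical. How per-character values are combined into a word-level value is not covered here.

```python
def overlap_fraction(char_box, region):
    """Fraction of the character box's area that lies inside the selected region."""
    cx0, cy0, cx1, cy1 = char_box
    rx0, ry0, rx1, ry1 = region
    width = max(0.0, min(cx1, rx1) - max(cx0, rx0))
    height = max(0.0, min(cy1, ry1) - max(cy0, ry0))
    area = (cx1 - cx0) * (cy1 - cy0)
    return (width * height) / area if area > 0 else 0.0


def flag_characters(char_boxes, region):
    """Flag each touched character as complete or incomplete, with its fraction."""
    flagged = []
    for char, box in char_boxes:
        fraction = overlap_fraction(box, region)
        if fraction > 0:  # characters outside the region are ignored
            flag = "complete" if fraction == 1.0 else "incomplete"
            flagged.append((char, flag, fraction))
    return flagged


region = (0, 0, 10, 10)
char_boxes = [("A", (0, 0, 4, 4)), ("B", (8, 0, 12, 4)), ("C", (20, 0, 24, 4))]
print(flag_characters(char_boxes, region))
# [('A', 'complete', 1.0), ('B', 'incomplete', 0.5)]
```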
In summary, the terminal provided in this embodiment obtains the complete and incomplete characters in the selected region, splits the sentences corresponding to those characters, and sorts the resulting alternative words several times. It can thereby correctly mark the target content the user wants to copy and reduce the number of user operations; by combining peripheral components, it further improves the user's experience of copying text.
The technical solution of this embodiment in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause one or more terminal devices to perform all or part of the steps of the methods described in the embodiments of the present invention.
The division into modules/units described in this embodiment is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not performed. Some or all of the modules/units can be selected according to actual needs to achieve the purpose of the solution of the present invention.
In addition, the modules/units in the embodiments of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (19)

1. A page content extraction method, characterized in that the method comprises the following steps:
obtaining a selected region in a page;
identifying the characters in the selected region one by one, obtaining a first sentence containing the characters, and splitting the first sentence to obtain alternative words;
sorting the alternative words by at least one attribute of the alternative words to obtain a ranking result;
selecting a target alternative word according to the ranking result, and extracting the target alternative word.
2. The method according to claim 1, characterized in that identifying the characters in the selected region one by one comprises:
identifying the complete characters in the selected region;
identifying the incomplete characters in the selected region;
adding a flag to the complete characters and the incomplete characters, the flag being used to identify the completeness of the complete characters and the incomplete characters.
3. The method according to claim 1, characterized in that obtaining the first sentence containing the characters comprises:
retrieving the characters in the page content to obtain multiple first sentences corresponding to each character in the selected region;
querying the multiple first sentences to judge whether repeated first sentences exist among them;
if so, deleting the repeated first sentences.
4. The method according to claim 1, characterized in that splitting the first sentence into at least one alternative word comprises:
reading a continuous character string in the first sentence;
matching the continuous character string against a default vocabulary in left-to-right order;
when a character string of a first length in the continuous character string matches the default vocabulary, determining whether the character string of the first length plus 1 matches the default vocabulary;
if not, taking the character string of the first length as an alternative word, cutting the character string of the first length off from the continuous character string, and continuing the matching with the remaining continuous character string;
if so, updating the first length to the first length plus 1, and continuing with the step of determining whether the character string of the first length plus 1 matches the default vocabulary.
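The left-to-right matching of claim 4 can be sketched as below. The sketch starts with a first length of 1 and, as an assumption not covered by the claim, emits a single character on its own when it does not match the vocabulary. The name `segment` and the sample vocabulary are hypothetical.

```python
def segment(text: str, vocabulary: set[str]) -> list[str]:
    """Split a continuous character string into alternative words.

    A prefix is extended one character at a time for as long as the longer
    prefix still matches the vocabulary; it is then cut off and matching
    restarts on the remainder.
    """
    words = []
    rest = text
    while rest:
        length = 1
        if rest[:length] in vocabulary:
            # Keep growing while the prefix of (length + 1) characters matches.
            while length < len(rest) and rest[:length + 1] in vocabulary:
                length += 1
        # Assumption: an unmatched single character is emitted on its own.
        words.append(rest[:length])
        rest = rest[length:]
    return words


vocabulary = {"页", "页面", "内", "内容", "提", "提取"}
words = segment("页面内容提取", vocabulary)
```

Because the extension stops at the first prefix that fails to match, a long entry is only reached when every shorter prefix is also in the vocabulary, which is why the sample vocabulary lists the single-character prefixes as well.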
5. The method according to claim 1, characterized in that sorting the alternative words according to multiple attributes of the alternative words comprises:
performing priority sorting on the multiple alternative words according to the first attribute values of the alternative words to obtain a first ranking result;
determining whether the first attribute value of an alternative word is greater than a first predetermined threshold, and if so, storing the alternative word in a first alternative phrase;
re-sorting the alternative words in the first alternative phrase according to the second attribute values or the third attribute values of the alternative words to obtain the ranking results.
6. The method according to claim 5, characterized in that the first attribute value includes the integrity of the alternative word, and the integrity of the alternative word is calculated by the following formula:
where X denotes the integrity of the alternative word, i denotes the ordinal number of a character in the alternative word, n denotes the number of characters in the alternative word, and X_i denotes the integrity of the i-th character.
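The formula referred to in claim 6 is not reproduced in this text. One reading that is consistent with the symbols X, i, n, and X_i is the arithmetic mean of the per-character integrities; the sketch below adopts that reading purely as an assumption, and the function name is hypothetical.

```python
def word_integrity(char_integrities: list[float]) -> float:
    """Integrity X of an alternative word from its character integrities X_i.

    Assumption: X = (X_1 + ... + X_n) / n. The source formula is missing,
    so this is only one plausible reading of the symbol definitions.
    """
    if not char_integrities:
        raise ValueError("an alternative word must contain at least one character")
    return sum(char_integrities) / len(char_integrities)


x = word_integrity([1.0, 0.5])
```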
7. The method according to claim 5, characterized in that re-sorting the alternative words in the first alternative phrase according to the second attribute or the third attribute of the alternative words comprises:
obtaining the second attribute values of the alternative words in the first alternative phrase, and comparing the second attribute values of the alternative words with a second predetermined threshold;
if there is an alternative word whose second attribute value is greater than the second predetermined threshold, re-sorting the alternative words in the first alternative phrase according to the second attribute values of the alternative words;
if there is no alternative word whose second attribute value is greater than the second predetermined threshold, re-sorting the alternative words in the first alternative phrase according to the third attribute of the alternative words.
8. The method according to claim 7, characterized in that the second attribute value of the alternative word includes the hot value of the alternative word, and the third attribute includes the part of speech of the alternative word.
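The two-stage sorting of claims 5, 7, and 8 can be illustrated with the following sketch. The threshold defaults and the part-of-speech ordering are hypothetical, since the claims do not fix them, and the names `Candidate` and `rank` are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    word: str
    integrity: float      # first attribute value
    hot_value: float      # second attribute value
    part_of_speech: str   # third attribute

# Assumption: nouns rank before verbs, which rank before adjectives.
POS_PRIORITY = {"noun": 0, "verb": 1, "adjective": 2}


def rank(candidates: list[Candidate],
         integrity_threshold: float = 0.5,
         hot_threshold: float = 100.0) -> list[Candidate]:
    # Priority sort on the first attribute gives the first ranking result.
    first_pass = sorted(candidates, key=lambda c: c.integrity, reverse=True)
    # Words above the first threshold form the first alternative phrase.
    group = [c for c in first_pass if c.integrity > integrity_threshold]
    # Re-sort by hot value if any word exceeds the second threshold.
    if any(c.hot_value > hot_threshold for c in group):
        return sorted(group, key=lambda c: c.hot_value, reverse=True)
    # Otherwise fall back to the part of speech; unknown tags go last.
    return sorted(group,
                  key=lambda c: POS_PRIORITY.get(c.part_of_speech, len(POS_PRIORITY)))


candidates = [Candidate("提取", 1.0, 50.0, "noun"),
              Candidate("内容", 0.9, 500.0, "verb"),
              Candidate("页", 0.3, 900.0, "noun")]
by_hot = [c.word for c in rank(candidates)]
by_pos = [c.word for c in rank(candidates, hot_threshold=1000.0)]
```

Python's `sorted` is stable, so words that tie on the second or third attribute keep the order produced by the first sort on integrity.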
9. The method according to claim 1, characterized in that the target alternative word is highlighted and/or the target alternative word is copied.
10. A page content extraction device, characterized in that the device comprises the following modules:
a region acquisition module, configured to obtain a selected region in a page;
an alternative word generation module, configured to identify the characters in the selected region one by one, obtain first sentences that contain the characters, and split the first sentences into alternative words;
an attribute sorting module, configured to sort the alternative words according to multiple attributes of the alternative words to obtain ranking results;
a page content extraction module, configured to choose a target alternative word according to the ranking results and extract the target alternative word.
11. The device according to claim 10, characterized in that the alternative word generation module includes a character recognition submodule, the character recognition submodule being configured to: identify the complete characters in the selected region; identify the incomplete characters in the selected region; and add a flag to each complete character and each incomplete character, the flag being used to indicate the integrity of the complete character or the incomplete character.
12. The device according to claim 10, characterized in that the alternative word generation module includes a first sentence acquisition submodule, the first sentence acquisition submodule being configured to: retrieve the characters in the page content to obtain multiple first sentences corresponding to each character in the selected region; query the multiple first sentences to determine whether any first sentence among them is a duplicate; and if so, delete the duplicate first sentence.
13. The device according to claim 10, characterized in that the alternative word generation module includes a word segmentation submodule, the word segmentation submodule being configured to: read a continuous character string in the first sentence; match the continuous character string against a default vocabulary in left-to-right order; when a character string of a first length in the continuous character string matches the default vocabulary, determine whether the character string of the first length plus 1 matches the default vocabulary; if not, take the character string of the first length as an alternative word, cut the character string of the first length off from the continuous character string, and continue the matching with the remaining continuous character string; if so, update the first length to the first length plus 1, and continue with the step of determining whether the character string of the first length plus 1 matches the default vocabulary.
14. The device according to claim 10, characterized in that the attribute sorting module includes:
a first attribute sorting submodule, configured to perform priority sorting on the multiple alternative words according to the first attribute values of the alternative words to obtain a first ranking result;
a first attribute threshold judging submodule, configured to determine whether the first attribute value of an alternative word is greater than a first predetermined threshold, and if so, store the alternative word in a first alternative phrase;
a secondary sorting submodule, configured to re-sort the alternative words in the first alternative phrase according to the second attribute values or the third attribute values of the alternative words to obtain the ranking results.
15. The device according to claim 14, characterized in that the first attribute value includes the integrity of the alternative word, and the integrity of the alternative word is calculated by the following formula:
where X denotes the integrity of the alternative word, i denotes the ordinal number of a character in the alternative word, n denotes the number of characters in the alternative word, and X_i denotes the integrity of the i-th character.
16. The device according to claim 14, characterized in that the secondary sorting submodule includes:
a second attribute value acquisition submodule, configured to obtain the second attribute values of the alternative words in the first alternative phrase;
a second attribute threshold judging submodule, configured to compare the second attribute values of the alternative words with a second predetermined threshold; if there is an alternative word whose second attribute value is greater than the second predetermined threshold, re-sort the alternative words in the first alternative phrase according to the second attribute values of the alternative words; and if there is no alternative word whose second attribute value is greater than the second predetermined threshold, re-sort the alternative words in the first alternative phrase according to the third attribute of the alternative words.
17. The device according to claim 16, characterized in that the second attribute value of the alternative word includes the hot value of the alternative word, and the third attribute includes the part of speech of the alternative word.
18. The device according to claim 10, characterized in that the page content extraction module includes:
a highlighting submodule, configured to highlight the target alternative word;
a copying submodule, configured to copy the target alternative word.
19. A client, comprising the device according to any one of claims 10-18.
CN201611260567.8A 2016-12-30 2016-12-30 Page content extraction method and device and client Active CN108268438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611260567.8A CN108268438B (en) 2016-12-30 2016-12-30 Page content extraction method and device and client

Publications (2)

Publication Number Publication Date
CN108268438A true CN108268438A (en) 2018-07-10
CN108268438B CN108268438B (en) 2021-10-22

Family

ID=62755020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611260567.8A Active CN108268438B (en) 2016-12-30 2016-12-30 Page content extraction method and device and client

Country Status (1)

Country Link
CN (1) CN108268438B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201841A (en) * 2007-02-15 2008-06-18 刘二中 Convenient method and system for electronic text-processing and searching
CN101377855A (en) * 2007-08-27 2009-03-04 富士施乐株式会社 Document image processing apparatus, and information processing method
CN102301366A (en) * 2008-11-18 2011-12-28 夏普株式会社 Information processing device
US20150199091A1 (en) * 2010-05-15 2015-07-16 Roddy McKee Bullock Enhanced E-Book and Enhanced E-Book Reader
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
US20140157168A1 (en) * 2012-11-30 2014-06-05 International Business Machines Corporation Copy and paste experience
CN104462085A (en) * 2013-09-12 2015-03-25 腾讯科技(深圳)有限公司 Method and device for correcting search keywords
US20150104764A1 (en) * 2013-10-15 2015-04-16 Apollo Education Group, Inc. Adaptive grammar instruction for commas
CN104750661A (en) * 2013-12-30 2015-07-01 腾讯科技(深圳)有限公司 Method and device for selecting words and sentences of text
CN103778200A (en) * 2014-01-09 2014-05-07 中国科学院计算技术研究所 Method for extracting information source of message and system thereof
US20160147879A1 (en) * 2014-11-24 2016-05-26 Qiurong Huang Fuzzy Search and Highlighting of Existing Data Visualization
CN104699809A (en) * 2015-03-20 2015-06-10 广东睿江科技有限公司 Method and device for controlling optimized word bank
CN105446955A (en) * 2015-11-27 2016-03-30 贺惠新 Adaptive word segmentation method
CN105550170A (en) * 2015-12-14 2016-05-04 北京锐安科技有限公司 Chinese word segmentation method and apparatus
CN105808512A (en) * 2016-03-04 2016-07-27 北京奇虎科技有限公司 Editing method and editing apparatus for encyclopedic entries

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020029210A1 (en) * 2018-08-09 2020-02-13 深圳市柔宇科技有限公司 Copy content selection method, terminal and storage medium
CN111475093A (en) * 2019-08-02 2020-07-31 广州三星通信技术研究有限公司 Word selection method and electronic equipment
CN110691028A (en) * 2019-09-16 2020-01-14 腾讯科技(深圳)有限公司 Message processing method, device, terminal and storage medium
CN110691028B (en) * 2019-09-16 2022-07-08 腾讯科技(深圳)有限公司 Message processing method, device, terminal and storage medium
CN113220191A (en) * 2020-01-21 2021-08-06 佳能株式会社 Image processing system for computerizing document, control method thereof and storage medium
CN111796952A (en) * 2020-08-12 2020-10-20 Oppo(重庆)智能科技有限公司 Content operation method and device and computer readable storage medium
CN112181167A (en) * 2020-10-27 2021-01-05 维沃移动通信有限公司 Input method candidate word processing method and electronic equipment
CN112181167B (en) * 2020-10-27 2024-11-15 维沃移动通信有限公司 Candidate word processing method for input method and electronic equipment

Also Published As

Publication number Publication date
CN108268438B (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN106227774B (en) Information search method and device
CN104239535B (en) A kind of method, server, terminal and system for word figure
CN108541310B (en) Method and device for displaying candidate words and graphical user interface
CN109783798A (en) Method, apparatus, terminal and the storage medium of text information addition picture
CN109309751B (en) Voice recording method, electronic device and storage medium
CN104123937B (en) Remind method to set up, device and system
CN108268438A (en) A kind of content of pages extracting method, device and client
CN104866511B (en) A kind of method and apparatus of addition multimedia file
KR20170047268A (en) Orphaned utterance detection system and method
CN111368063B (en) Information pushing method based on machine learning and related device
CN103605656A (en) Music recommendation method and device and mobile terminal
WO2014176750A1 (en) Reminder setting method, apparatus and system
CN109815363A (en) Generation method, device, terminal and the storage medium of lyrics content
CN108563965A (en) Character input method and device, computer readable storage medium, terminal
WO2024036616A1 (en) Terminal-based question and answer method and apparatus
CN110069769B (en) Application label generation method and device and storage device
CN110278141A (en) A kind of processing method of instant communication information, device and storage medium
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN114631094A (en) Intelligent e-mail headline suggestion and remake
CN103366010A (en) Method and device for searching audio file
CN108427761B (en) News event processing method, terminal, server and storage medium
CN106534528A (en) Processing method and device of text information and mobile terminal
US10140265B2 (en) Apparatuses and methods for phone number processing
CN108549681A (en) Data processing method and device, electronic equipment, computer readable storage medium
CN110020429B (en) Semantic recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant