Method for automatically recommending microblog homepage data
Technical field
The present invention relates to recommendation methods, and in particular to a method for automatically recommending microblog homepage data.
Background art
Microblogging is a network service that has emerged in recent years. It is a platform for sharing, disseminating, and obtaining information based on user relationships. Users can post text through the web, mobile phones, and various networked smart clients and share it instantly. Microblogging is simple and convenient to use, supports open multi-platform access, and updates and spreads information quickly. It attracted hundreds of millions of users worldwide in only five years; by the end of the first half of 2011, the number of microblog users in China had reached 195 million. Compared with traditional social networks, microblogging has stronger capabilities for spreading information and organizing members. This advantage has quickly made it one of the main social media today, and, as an important information source and channel of dissemination, it plays a key role in more and more social events.
Vertical services of all kinds that aggregate microblog content have sprung up like mushrooms after rain. The quality of a homepage depends on the quality of its data. A good homepage raises the quality of the whole service, directly presents the content orientation of the vertical service, stimulates user interest, and improves the page click-through rate, so a good homepage is essential. Current methods of recommending homepage data rely mainly on manual recommendation: the newest and hottest data is found by manual reading, and the pictures and text that fit the homepage design are then selected or produced by hand.
The shortcomings of manual recommendation are high cost, poor timeliness, slow updates, and a narrow range of content. When the newest and hottest data is found manually, the number of people assigned and the breadth and speed of their reading determine the speed and quality of discovery. Obtaining newer and better homepage data and shortening the update cycle therefore requires a large amount of manpower, which greatly increases cost.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a method for automatically recommending microblog homepage data. Based on the characteristics of microblogs and on user demand, the method analyzes statistical data and automatically recommends, in rotation, pictures of different sizes and microblog summaries for the different channels of the homepage, which saves manpower and maintenance cost.
The object of the present invention is achieved by the following technical solution:
A method for automatically recommending microblog homepage data, the improvement being that the method comprises:
(1) filtering a microblog list out of a massive volume of microblog posts;
(2) extracting microblog topic sentences, a topic sentence of corresponding length being extracted from each post according to the picture size;
(3) selecting the picture closest to the target picture size and cropping it automatically.
Preferably, step (1) comprises: according to the configuration template and the granularity and outer diameter of the data volume, reading from the database the microblog data with pictures for each channel to obtain a data set for each channel; and sorting the data set by publication time and repost count and taking the newest and hottest top N entries to obtain the microblog list TopN of each channel.
Further, each microblog post is stored in one node, and the node content includes the post text, the picture, the publication time of the post, and the repost count of the post.
Preferably, step (2) comprises iterating over the microblog list in order, taking the post text out of each node, and extracting the topic sentence of the post.
Preferably, step (2) comprises:
(2.1) preprocessing the post text;
(2.2) splitting the text into sentences, ranking the sentences according to the characteristics of posts in the respective channel, and choosing the top-ranked sentence, denoted s;
(2.3) calculating the sentence length, denoted len, and truncating s if len > wordi, where wordi is the length of topic sentence i;
(2.4) judging whether the resulting topic sentence is meaningful;
(2.5) choosing the next node and repeating steps (2.1)-(2.4);
(2.6) ending.
Further, step (2.3) comprises truncating at punctuation marks, the priority of the punctuation marks being:
(a)“。”
(b)“!”、“?”
(c)“;”
(d)“:”
(e)“,”
The integrity of symbols that occur in pairs is ensured; if only half of a pair appears, it is removed.
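The truncation rule can be illustrated with a minimal Python sketch. The source does not say how a cut point is chosen within one priority level, whether the mark itself is kept, or which paired symbols are checked, so the following are assumptions: the rightmost mark of the highest available level inside the length limit is used, the mark is dropped, both full-width and single-byte forms are accepted, the second level-(b) mark is read as a question mark, and the names `PRIORITY_LEVELS`, `PAIRS`, `truncate`, and `drop_unpaired` are hypothetical.

```python
# Punctuation marks grouped by priority, highest first (levels a-e).
# Assumption: full-width and single-byte forms are treated alike, and the
# second mark of level (b) is taken to be a question mark.
PRIORITY_LEVELS = ["。", "!!??", ";;", "::", ",,"]

# Assumption: an illustrative set of paired symbols (opener -> closer).
PAIRS = {"(": ")", "(": ")", "《": "》", "“": "”"}


def truncate(s: str, max_len: int) -> str:
    """Shorten s to at most max_len characters, cutting at punctuation."""
    if len(s) <= max_len:
        return s
    window = s[:max_len]
    for marks in PRIORITY_LEVELS:
        # Rightmost mark of this priority level inside the window.
        cut = max(window.rfind(m) for m in marks)
        if cut > 0:
            return window[:cut]  # assumption: the mark itself is dropped
    return window  # no punctuation found: hard cut at the length limit


def drop_unpaired(s: str) -> str:
    """Remove any opener or closer whose partner is missing."""
    closers = {close: open_ for open_, close in PAIRS.items()}
    keep = [True] * len(s)
    stack = []  # indices of openers still waiting for their closer
    for i, ch in enumerate(s):
        if ch in PAIRS:
            stack.append(i)
        elif ch in closers:
            if stack and s[stack[-1]] == closers[ch]:
                stack.pop()
            else:
                keep[i] = False  # closer without a matching opener
    for i in stack:
        keep[i] = False  # opener that was never closed
    return "".join(ch for ch, k in zip(s, keep) if k)
```

Calling `drop_unpaired(truncate(s, wordi))` applies both rules in sequence: the sentence is first shortened at the best punctuation mark, and any half of a pair left behind by the cut is then removed.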
Further, step (2.4) comprises judging whether the topic sentence is meaningful by means of a word-count check, a Chinese/English check, and a modal-particle check; a meaningless topic sentence is discarded.
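The three checks are named in the source but not specified, so the sketch below is only one possible reading. The thresholds, the particle list, and the decision to reject sentences that are mostly non-Chinese are all assumptions made for illustration, and `is_meaningful` is a hypothetical name.

```python
# All constants below are hypothetical; the source gives no values.
MODAL_PARTICLES = set("啊呀吧呢嘛哦哈嗯呵")
MIN_CHARS = 5              # word-count check
MIN_CHINESE_RATIO = 0.5    # Chinese/English check
MAX_PARTICLE_RATIO = 0.3   # modal-particle check


def is_chinese(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_meaningful(sentence: str) -> bool:
    """Apply the three checks in turn; any failure rejects the sentence."""
    text = "".join(ch for ch in sentence if not ch.isspace())
    if len(text) < MIN_CHARS:
        return False  # too short to carry a topic
    chinese = [ch for ch in text if is_chinese(ch)]
    if len(chinese) / len(text) < MIN_CHINESE_RATIO:
        return False  # assumption: mostly non-Chinese text is rejected
    particles = sum(ch in MODAL_PARTICLES for ch in chinese)
    if particles / len(chinese) > MAX_PARTICLE_RATIO:
        return False  # dominated by modal particles such as "哈哈哈"
    return True
```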
Preferably, step (3) comprises: from the data set obtained in step (2), taking the picture out of a node and passing it to the automatic filter; if the picture meets the requirements, cropping it automatically to the size given in the template; otherwise setting the picture aside and continuing to screen with the next picture.
Preferably, step (3) comprises:
(3.1) calculating the size of the picture, denoted size;
(3.2) judging whether the number of pictures matched to template picture i has reached the maximum number maxNumi; if not, proceeding to step (3.3); if so, moving on to the next template picture and repeating step (3.2); if the maximum numbers of all template pictures have been reached, jumping to step (3.6);
(3.3) calculating the matching degree between size and the size of template picture i, denoted d;
(3.4) judging whether the matching degree d meets the requirement: when T1 < d < T2, where T1 and T2 are thresholds, cropping the picture automatically, adding 1 to the number of pictures matched to template picture i, and jumping to step (3.6); otherwise repeating steps (3.2) and (3.3) until the picture has been compared with every kind of picture in the template; if the requirement is still not met, continuing to step (3.5);
(3.5) setting the picture aside, taking the next picture, and performing steps (3.1) to (3.4);
(3.6) ending.
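The control flow of steps (3.1) to (3.6) can be sketched in Python as follows. The source does not define the matching degree d or the values of T1 and T2; here d is assumed to be the ratio of the picture's aspect ratio to the template's aspect ratio, and the thresholds, the `Template` class, and the function names are hypothetical. The actual cropping is left out; the function only reports which picture would be cropped for which template.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Template:
    width: int    # Widthi
    height: int   # Heighti
    max_num: int  # maxNumi


T1, T2 = 0.8, 1.25  # hypothetical thresholds


def matching_degree(size: Tuple[int, int], tpl: Template) -> float:
    """Assumed definition of d: picture aspect ratio over template aspect ratio."""
    width, height = size  # both are assumed to be positive
    return (width / height) / (tpl.width / tpl.height)


def screen_node(
    sizes: Sequence[Tuple[int, int]],
    templates: Sequence[Template],
    counts: List[int],
) -> Optional[Tuple[int, int]]:
    """Return (picture index, template index) of the first match, or None.

    counts[i] holds how many pictures have already been matched to
    template i and is updated in place.
    """
    for p, size in enumerate(sizes):  # steps 3.1 and 3.5
        if all(counts[i] >= t.max_num for i, t in enumerate(templates)):
            return None  # every quota is full: step 3.6
        for i, tpl in enumerate(templates):  # step 3.2
            if counts[i] >= tpl.max_num:
                continue  # quota reached, try the next template picture
            d = matching_degree(size, tpl)  # step 3.3
            if T1 < d < T2:  # step 3.4
                counts[i] += 1
                return p, i
    return None  # no picture in this node fits any template
```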
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention automatically recommends the newest and hottest post pictures and summaries to the homepage to meet user demand. Filling the homepage data by automatic statistical screening improves the freshness, breadth, and update cycle of the data and saves manpower and cost. According to manual inspection, the quality of picture screening and cropping reaches 99.9%, and the accuracy of recommended microblog summaries reaches 98% or more. This is embodied in the following:
1. Several different sizes are designed to accommodate pictures of all kinds with differing width-to-height proportions;
2. The granularity and outer diameter of the data volume are flexibly configurable, which raises the probability that every channel has a picture and a summary to recommend;
3. Multiple strategies are combined to extract post summaries which, together with the pictures, are recommended to the homepage automatically;
4. An automatic picture filter is designed that compresses and crops pictures into high-quality pictures with a prominent focus and a clear image.
Brief description of the drawings
Fig. 1 is a flow chart of the method for automatically recommending microblog homepage data provided by the present invention.
Fig. 2 is a flow chart of the processing of a single data item provided by the present invention.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The structure of the present invention is shown in Fig. 1 and consists mainly of three modules. The first module filters the top several entries out of the massive volume of microblog posts to obtain the newest and hottest microblog list (TopN). The second module extracts microblog topic sentences, extracting from each post a topic sentence whose length corresponds to the picture size (because the topic sentence is displayed embedded in the picture, the size of the picture determines the length of the topic sentence). The third module, the automatic picture filter, selects the picture closest to the target picture size and crops it automatically.
The flow chart for processing a single data item is shown in Fig. 2. The implementation steps are as follows:
Configuration template:
zdpCfg --- path of the downloader initialization file
Haarcascades --- path of the initialization file for the automatic picture cropping class
IntervalSec --- interval between rotations of the system's recommendations
DisRptH --- time window within which data is not repeated
urlbak --- index file of URLs
tweetbak --- index file of posts
DBLoop --- outer diameter of the data volume
DBCount --- granularity of the data volume
OutPath --- storage path of the generated static homepage
PicType --- number of picture categories
Widthi --- width of picture i (i is the number of a picture category; it starts at 1 and increases by one up to the number of picture categories; the same applies below)
Heighti --- height of picture i
wordi --- length of topic sentence i
maxNumi --- maximum number of pictures of category i
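To show how the indexed keys (Widthi, Heighti, wordi, maxNumi) relate to PicType, here is a minimal Python sketch of loading such a template. The source does not give the file format, so a plain `key = value` text layout is assumed; every value in `SAMPLE` is invented for illustration, and `PictureTemplate`, `parse_config`, and `parse_templates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PictureTemplate:
    width: int    # Widthi
    height: int   # Heighti
    word: int     # wordi, length of the topic sentence shown in picture i
    max_num: int  # maxNumi


# Hypothetical file contents; the real format and values are not given.
SAMPLE = """
IntervalSec = 600
DBLoop = 5
DBCount = 1000
PicType = 2
Width1 = 440
Height1 = 220
word1 = 20
maxNum1 = 3
Width2 = 200
Height2 = 200
word2 = 12
maxNum2 = 6
"""


def parse_config(text: str) -> dict[str, str]:
    """Read 'key = value' lines, skipping blank lines and '#' comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg


def parse_templates(cfg: dict[str, str]) -> list[PictureTemplate]:
    """Collect the per-category keys for i = 1 .. PicType."""
    count = int(cfg["PicType"])
    return [
        PictureTemplate(
            width=int(cfg[f"Width{i}"]),
            height=int(cfg[f"Height{i}"]),
            word=int(cfg[f"word{i}"]),
            max_num=int(cfg[f"maxNum{i}"]),
        )
        for i in range(1, count + 1)
    ]
```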
Module one:
Compute the newest and hottest microblog list. According to the configuration template and the granularity and outer diameter of the data volume, the microblog data with pictures is read from the database for each channel to obtain a data set for each channel. Each microblog post is stored in one node, and the node content includes the post text, the picture, the publication time of the post, the repost count of the post, and so on. The data set is sorted by publication time and repost count, and the newest and hottest top N entries are taken to obtain the microblog list TopN of each channel.
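A minimal Python sketch of the TopN selection follows. The source says only that the data set is sorted by publication time and repost count; how the two are combined is not stated. The sketch assumes that newer publication days rank first and that, within one day, higher repost counts rank first. The `Node` fields, including `channel`, and the function name `top_n_per_channel` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Node:
    channel: str         # assumption: each node records its channel
    text: str            # post text
    picture: str         # picture location
    published: datetime  # publication time
    reposts: int         # repost count


def top_n_per_channel(nodes: list[Node], n: int) -> dict[str, list[Node]]:
    """Group nodes by channel and keep the n newest, hottest posts of each."""
    by_channel = defaultdict(list)
    for node in nodes:
        by_channel[node.channel].append(node)
    result = {}
    for channel, items in by_channel.items():
        # Assumed ordering: newest day first, then most reposted.
        ranked = sorted(
            items,
            key=lambda x: (x.published.date(), x.reposts),
            reverse=True,
        )
        result[channel] = ranked[:n]
    return result
```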
Module two:
The microblog list is iterated over in order; the post text is taken out of each node and its topic sentence is extracted. The topic sentence is selected according to importance. The steps are as follows:
1. Preprocess the post text. The processing consists of:
(1) decoding some HTML-escaped characters, such as "&lt;";
(2) removing noise, such as "@Li Xiaoming" mentions, emoticons, and extra spaces;
(3) converting double-byte punctuation marks into single-byte punctuation marks, with the exception of the full stop;
2. Split the text into sentences, rank the sentences according to the characteristics of posts in the respective channel, and choose the top-ranked sentence, denoted s;
3. Calculate the sentence length, denoted len; if len > wordi, truncate s. Truncation is performed at punctuation marks, whose priority is as follows:
(1)“。”
(2)“!”、“?”
(3)“;”
(4)“:”
(5)“,”
And ensure the integrality of the symbol occurred in pairs as far as possible, such as " () ", "《》" etc., such as there is half of symbol, then cut
It goes
Whether the theme line the 4th, judged is significant, and the method that can be taken such as number of words judges, Chinese and English judges, modal particle is sentenced
Break, if meaningless, abandon
5th, next node is chosen, repeats step 1-4
6th, terminate
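Steps 1 and 2 above (preprocessing and sentence splitting) can be sketched in Python as follows. The regular expressions are assumptions: mentions are taken to be "@" followed by word characters, emoticons are assumed to appear as short bracketed names such as "[哈哈]", and sentences are assumed to end at "。", "!" or "?". The ranking of sentences by channel characteristics is not described in the source and is not attempted here. `preprocess` and `split_sentences` are hypothetical names.

```python
import html
import re

# Double-byte punctuation mapped to single-byte forms; the full stop
# "。" is deliberately left out, as the exception in step 1 (3) requires.
FULL_TO_HALF = str.maketrans({
    "!": "!", "?": "?", ";": ";", ":": ":", ",": ",",
    "(": "(", ")": ")",
})


def preprocess(text: str) -> str:
    """Decode entities, strip noise, and normalize punctuation."""
    text = html.unescape(text)                    # (1) "&lt;" becomes "<"
    text = re.sub(r"@[\w\-]+", "", text)          # (2) drop @mentions
    text = re.sub(r"\[[^\[\]]{1,8}\]", "", text)  # (2) drop [emoticon] codes
    text = text.translate(FULL_TO_HALF)           # (3) double- to single-byte
    return re.sub(r"\s+", " ", text).strip()      # (2) collapse extra spaces


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending marks, keeping each mark with its sentence."""
    parts = re.findall(r"[^。!?]+[。!?]?", text)
    return [p.strip() for p in parts if p.strip()]
```

Because the mention pattern stops only at a non-word character, a mention that runs straight into following Chinese text without a space would be removed together with that text; a production filter would need the platform's actual username rules.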
Module three:
An automatic filter is designed. From the data set obtained in module two, the picture in a node is taken out and passed to the automatic filter. If it meets the requirements, it is cropped automatically to the size of the corresponding picture in the template; otherwise it is set aside and screening continues with the next picture. The screening of pictures for one node proceeds as follows:
1. Calculate the size of the picture, denoted size.
2. Judge whether the number of pictures matched to template picture i has reached the maximum number maxNumi. If not, proceed to step 3; if so, move on to the next template picture and repeat step 2; if the maximum numbers of all template pictures have been reached, jump to step 6.
3. Calculate the matching degree between size and the size of template picture i, denoted d.
4. Judge whether the matching degree d meets the requirement. When T1 < d < T2 (T1 and T2 are thresholds), crop the picture automatically, add 1 to the number of pictures matched to template picture i, and jump to step 6. Otherwise, repeat steps 2 and 3 until the picture has been compared with every kind of picture in the template; if the requirement is still not met, continue to step 5.
5. Set the picture aside, take the next picture, and perform steps 1 to 4.
6. End.
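The source does not describe how the automatic cropping in step 4 chooses the region to keep. As a stand-in, the Python sketch below computes a plain center crop: the largest centered box with the template's aspect ratio, which could then be scaled to Widthi by Heighti. This is a swapped-in illustration under that assumption, not the cropping method of the invention, and `center_crop_box` is a hypothetical name.

```python
def center_crop_box(width: int, height: int, target_w: int, target_h: int):
    """Return (left, top, right, bottom) of the largest centered box
    that has the same aspect ratio as the target size."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        # Picture is too wide: keep the full height and trim both sides.
        crop_w = round(height * target_ratio)
        left = (width - crop_w) // 2
        return left, 0, left + crop_w, height
    # Picture is too tall (or already matches): keep the full width.
    crop_h = round(width / target_ratio)
    top = (height - crop_h) // 2
    return 0, top, width, top + crop_h
```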
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may, with reference to the above embodiments, still modify the specific embodiments of the present invention or make equivalent substitutions; any such modification or equivalent substitution that does not depart from the spirit and scope of the present invention falls within the claims of the present invention for which application is pending.