CN102855282B - A kind of document recommendation method and device - Google Patents
A document recommendation method and device
- Publication number
- CN102855282B (application CN201210272764.7A)
- Authority
- CN
- China
- Prior art keywords
- document
- cluster
- content
- recommendation
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document recommendation method and device. The document recommendation method includes: in a preset document collection, clustering documents around a document A according to the similarity of document content; determining, according to currently existing document association information, the documents associated with the documents in the same cluster as document A; and using the determined associated documents to form a first recommendation result for document A. Compared with the prior art, the technical solution provided by the invention does not require newly published documents to be preprocessed manually, which saves labor cost. Recommendation results can thus be generated even for a newly published document, which effectively addresses the cold-start problem and the data-sparsity problem.
Description
Technical field
The present invention relates to the field of computer application technology, and in particular to a document recommendation method and device.
Background art
With the development of Internet technology, the amount of information on the Internet is growing explosively. To help users obtain this information more conveniently and quickly, recommendation technology is widely used in information systems. Related recommendation has become an important part of recommendation technology. Its basic idea is to find correlations between different pieces of information based on one or more of their features, and then to establish associations between them. When a user browses a piece of information, the recommendation system can also recommend to the user the information associated with it.
Research on related recommendation focuses not only on mining more features that can be used for recommendation, but also on how to establish relationships between pieces of information from these features in practical applications. At present, a common approach is to establish the relationships according to user behavior. Taking document recommendation as an example, the interests of users can be analyzed from records of their historical behavior, such as browsing and searching for documents; associations between documents are then established according to the similarity of interests of one or more users, and documents are finally recommended according to the established relationships.
However, existing related recommendation methods suffer from serious cold-start and data-sparsity problems. Cold start refers to newly published information; data sparsity means that a piece of information has very few associated user behavior records (or none), so it is difficult to generate recommendation results from user behavior. The solution currently in common use is manual intervention: recommendation results are preset for newly published information. This approach consumes labor and requires operators to have rich prior knowledge; the recommendation results are also limited and subjective, and in practice often fail to meet the actual needs of the people browsing the information.
Summary of the invention
To solve the above technical problems, embodiments of the present invention provide a document recommendation method and device that address the cold-start problem and the data-sparsity problem in related document recommendation. The specific technical solution is as follows:
A document recommendation method, including:
in a preset document collection, clustering documents around document A according to the similarity of document content;
determining, according to currently existing document association information, the documents associated with the documents in the same cluster as document A;
using the determined documents, which are associated with the documents in the same cluster as document A, to form a first recommendation result for document A.
In one specific embodiment of the present invention, the document association information is:
association information between different documents, established according to user behavior records related to the documents.
In one specific embodiment of the present invention, the document association information is:
association information between different documents, established according to the categories to which the documents belong.
In one specific embodiment of the present invention, clustering documents around document A according to the similarity of document content includes:
performing duplicate detection on document content, and gathering the documents whose content duplication with document A exceeds a preset threshold into one document cluster.
In one specific embodiment of the present invention, clustering documents according to the similarity of document content includes:
performing a retrieval using document A and, according to the retrieval result, gathering the documents whose content relevance to document A exceeds a preset threshold into one document cluster.
In one specific embodiment of the present invention, the method further includes:
using the documents in the same cluster as document A to form a second recommendation result for document A.
A document recommendation device, including:
a clustering unit, configured to cluster documents around document A, in a preset document collection, according to the similarity of document content;
an association unit, configured to determine, according to currently existing document association information, the documents associated with the documents in the same cluster as document A;
a recommendation unit, configured to use the determined documents, which are associated with the documents in the same cluster as document A, to form a first recommendation result for document A.
In one specific embodiment of the present invention, the document association information is:
association information between different documents, established according to user behavior records related to the documents.
In one specific embodiment of the present invention, the document association information is:
association information between different documents, established according to the categories to which the documents belong.
In one specific embodiment of the present invention, the clustering unit is specifically configured to:
perform duplicate detection on document content, and gather the documents whose content duplication with document A exceeds a preset threshold into one document cluster.
In one specific embodiment of the present invention, the clustering unit is specifically configured to:
perform a retrieval using document A and, according to the retrieval result, gather the documents whose content relevance to document A exceeds a preset threshold into one document cluster.
In one specific embodiment of the present invention, the recommendation unit is further configured to:
use the documents in the same cluster as document A to form a second recommendation result for document A.
The technical solution provided by the embodiments of the present invention clusters documents based on the similarity of their actual content and then recommends documents according to the clustering result. In effect, several documents with similar content are treated as a single point. In this way, recommendation results can be generated even for a newly published document; moreover, for a document that already has recommendation results, those results can be further optimized according to the clustering.
Compared with the prior art, the technical solution provided by the present invention does not require newly published documents to be preprocessed manually, which saves labor cost. Moreover, assuming that the existing associations between documents are reasonable, the recommendation results obtained after clustering by content similarity remain reasonable. That is, during recommendation the solution can provide high-confidence recommendation results for newly published documents without introducing the influence of operators or of individual subjective factors, which further improves the performance of the recommendation system.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of a document recommendation method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a document recommendation device according to an embodiment of the present invention.
Detailed description
A document recommendation method provided by an embodiment of the present invention is described first. The method may include the following steps:
in a preset document collection, clustering documents around document A according to the similarity of document content;
determining, according to currently existing document association information, the documents associated with the documents in the same cluster as document A;
using the determined documents, which are associated with the documents in the same cluster as document A, to form a first recommendation result for document A.
The documents in the embodiments of the present invention can take many forms, for example files such as TXT, DOC, or PDF, or web pages; this does not affect the implementation of the solution.
The document recommendation method provided by the embodiments of the present invention is carried out within a certain range of documents; that is, for each application environment there is a preset document collection. For example, for recommendation in an online document library, the files uploaded by all users of the library constitute the preset document collection; for recommendation on a knowledge platform, all knowledge topics on the platform constitute the preset document collection; for recommendation on a news website, all news pages on the website constitute the preset document collection. Depending on the actual application, the recommendation range can be set flexibly, from as small as a specific document subject category to as large as the entire Internet; the present invention does not limit this.
The technical solution provided by the embodiments of the present invention first clusters documents based on the similarity of their actual content, and then recommends documents according to the clustering result. In effect, several documents with similar content are treated as a single point.
Suppose A is a newly published document. After clustering around document A, documents B, C, and D, whose content is similar to that of A, are gathered into the same cluster. If B, C, and D have associated documents of their own, those associated documents can be fed back to the user as the recommendation results for A.
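The overall flow described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function name `first_recommendation`, the word-set Jaccard similarity, the threshold of 0.6, and the sample data are all hypothetical choices made for the example.

```python
def first_recommendation(doc_a, corpus, similarity, threshold, associations):
    """Cluster around doc_a, then collect the documents associated with its cluster."""
    # Step 1: gather every other document whose similarity to doc_a exceeds the threshold.
    cluster = [d for d in corpus if d != doc_a and similarity(doc_a, d) > threshold]
    # Steps 2 and 3: the associated documents of the cluster members form the result.
    result = []
    for member in cluster:
        for assoc in associations.get(member, []):
            if assoc != doc_a and assoc not in result:
                result.append(assoc)
    return cluster, result


# Hypothetical sample data: A is newly published and has no associations of its own.
contents = {
    "A": "exam reading chinese practice",
    "B": "exam reading chinese practice notes",
    "C": "exam reading chinese",
    "E": "football league results",
}


def jaccard(x, y):
    """Assumed similarity measure: Jaccard overlap of the two documents' word sets."""
    sx, sy = set(contents[x].split()), set(contents[y].split())
    return len(sx & sy) / len(sx | sy)


associations = {"B": ["B1", "B2"], "C": ["C1"], "E": ["E1"]}
cluster, recs = first_recommendation("A", list(contents), jaccard, 0.6, associations)
print(cluster, recs)
```

In this example B and C are close enough to A to join its cluster, so their associated documents become candidates for A, while E and its associations are left out.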
Compared with the prior art, the technical solution provided by the present invention does not require newly published documents to be preprocessed manually, which saves labor cost. Moreover, assuming that the existing associations between documents are reasonable, the recommendation results obtained after clustering by content similarity remain reasonable. That is, during recommendation the solution can provide high-confidence recommendation results for newly published documents without introducing the influence of operators or of individual subjective factors, which further improves the performance of the recommendation system.
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention shall fall within the scope of protection of the present invention.
Fig. 1 shows a flowchart of a document recommendation method provided by an embodiment of the present invention. The method may include the following steps:
S101: in a preset document collection, cluster documents around document A according to the similarity of document content.
The amount of information on the Internet is currently very large, but studies show that much of it is similar or even completely duplicated content. For example, there may be many news reports with similar content about the same hot event, and different users may upload documents with identical content to a document library platform. Documents with similar content may nevertheless have different amounts of associated document data, for reasons such as the time of publication, the resources available to the publisher, or the way the document was published. For example, documents A and B have identical content; document A has just been published and has no data from which associations could be established, while document B has already accumulated a large amount of association data. From the standpoint of content similarity, it is entirely reasonable to use the documents associated with document B as recommendation results for document A as well.
Following this principle, for any document A the present invention clusters around document A according to the similarity of document content, finds all documents in the document collection whose content is similar to that of document A, and then uses the documents associated with the other members of the cluster as recommendation candidates for document A, from which the recommendation results for document A are generated.
In one specific embodiment of the present invention, text duplicate-detection techniques can be used to cluster documents.
Given the nature of the Internet, a large number of documents with duplicated content inevitably exist. Many text duplicate-detection techniques have been developed to manage such documents effectively, for example duplicate detection based on document-level signature algorithms; commonly used algorithms include MD5 and simhash. The solution provided here can directly use these mature techniques to run duplicate detection over the different documents in the preset document collection and group documents with the same content together.
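As a rough illustration of a document-level signature, the following is a simplified, unweighted simhash sketch. The text only names simhash as one usable algorithm; the 64-bit width, the use of MD5 as the per-token hash, whitespace tokenization, and the helper names `simhash` and `hamming_distance` are assumptions made for this example. Two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below some chosen limit, which the text does not specify.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a simhash fingerprint for a list of tokens (every token has weight 1)."""
    counts = [0] * bits
    for token in tokens:
        # Take the first 8 bytes of the MD5 digest as a 64-bit hash of the token.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        # A bit is set when more tokens voted 1 than 0 at that position.
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash("exam reading chinese practice".split())
fp2 = simhash("practice chinese reading exam".split())
print(hamming_distance(fp1, fp2))  # same tokens in a different order give distance 0
```

Because the fingerprint depends only on the multiset of tokens, reordering the text does not change it; small edits change only a few bits, which is what makes the distance usable as a similarity signal.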
In a specific implementation, the document can first be split into sentences, for example at separators such as line breaks, periods, exclamation marks, and question marks. The resulting sentences are then normalized, for example by converting between full-width and half-width characters, between upper and lower case, and between traditional and simplified Chinese characters, removing noise characters, and collapsing runs of whitespace. Finally, each sentence is signed, and the common length or the similarity of the two documents' signature vectors is computed and used to represent the degree of content overlap.
It should be understood that the duplicate-detection procedure given above is only an illustration and should not be construed as limiting the solution of the present invention.
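The split, normalize, and sign procedure described above might look like the following sketch. It makes several assumptions that the text does not fix: the separator set in `SENTENCE_BREAKS`, Unicode NFKC normalization as the full-width to half-width conversion, MD5 as the sentence signature, and Jaccard overlap of the signature sets as the overlap measure. Traditional/simplified conversion and noise-character removal are omitted because they would need extra resources. Note that splitting on "." also splits decimal numbers, which a real implementation would need to handle.

```python
import hashlib
import re
import unicodedata

# Assumed separators: line break, Chinese and Western period, exclamation and question marks.
SENTENCE_BREAKS = re.compile(r"[\n。.!?！？]+")


def sentence_signatures(text):
    """Split text into sentences, normalize each one, and return the set of MD5 signatures."""
    signatures = set()
    for sentence in SENTENCE_BREAKS.split(text):
        # NFKC folds full-width letters and digits to half-width; casefold removes case differences.
        normalized = unicodedata.normalize("NFKC", sentence).casefold()
        # Collapse runs of whitespace into a single space and trim the ends.
        normalized = " ".join(normalized.split())
        if normalized:
            signatures.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return signatures


def content_overlap(text_a, text_b):
    """Jaccard overlap of the two documents' sentence-signature sets, between 0.0 and 1.0."""
    sig_a, sig_b = sentence_signatures(text_a), sentence_signatures(text_b)
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


doc1 = "Hello World. How are you?\nFine!"
doc2 = "ｈｅｌｌｏ   world。How are you? Bad!"
print(content_overlap(doc1, doc2))  # 2 shared sentences out of 4 distinct ones
```

The resulting value can be compared against a preset duplication threshold to decide whether two documents belong in the same cluster.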
In practical applications, because of user modifications and similar causes, the content of some documents may differ in certain details while remaining largely consistent overall. Since the purpose of the present invention is to recommend based on the content similarity of documents, a content duplication threshold (for example 80% or 90%) can be preset. During duplicate detection, if the similarity between documents exceeds this threshold, the difference between them is considered very small and they can be gathered into the same document cluster; the associated documents can then be shared among the members of the same cluster.
In another specific embodiment of the present invention, retrieval techniques can also be used to cluster documents.
The basic function of a search engine is to find, for a given search keyword, other network resources whose content is identical or similar to the keyword. Accordingly, in the present invention a search keyword can be formed from the content of document A (the cluster center) and submitted to a search engine, the search is run within the preset document collection, and the members of the cluster are then determined from the search results.
A most basic implementation is to submit the title of document A directly to the search engine as the search keyword; if the title of a search result is identical or similar to that of document A, the result can be gathered into the document cluster centered on A. For example, document A is titled "High school entrance exam reading (Chinese)", and retrieval returns another document B titled "High school entrance exam Chinese reading"; document B can then be gathered directly into the cluster.
Of course, in practical applications, a search result whose body content is similar to the title of document A can also be considered to satisfy the clustering condition; the condition need not be limited to similar titles. In theory, parts of document A other than the title, such as the author or the abstract, can also be used for retrieval. Preprocessing such as word segmentation and stop-word removal can also be performed when forming the search keyword. In addition, many current search engines perform preprocessing such as word segmentation and stop-word removal automatically, and search results are generally sorted by their relevance (similarity) to the keyword, so the top n search results (n being a positive integer) can be taken directly as the same-cluster members of A. In short, those skilled in the art can flexibly choose the specific strategy for clustering from search results according to the practical requirements and application scenario; the present invention does not limit this.
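The search-based clustering above can be sketched as follows. A real deployment would call an existing search engine; here the function `search` is a hypothetical toy stand-in that ranks documents by the number of title words they share with the query, and `cluster_by_search` and the sample titles are likewise assumptions made for the example.

```python
def search(query, titles):
    """Toy stand-in for a search engine: rank document ids by shared title words."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, title in titles.items():
        score = len(query_terms & set(title.lower().split()))
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id to keep the order stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def cluster_by_search(doc_a, titles, n):
    """Use the title of doc_a as the query and keep the top n hits other than doc_a itself."""
    hits = search(titles[doc_a], titles)
    return [d for d in hits if d != doc_a][:n]


titles = {
    "A": "exam reading chinese",
    "B": "exam chinese reading",
    "C": "exam chinese composition",
    "D": "football results",
}
print(cluster_by_search("A", titles, 2))
```

Taking a fixed top n is the simplest strategy mentioned in the text; a relevance threshold could be applied to the scores instead.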
Compared with clustering based on duplicate detection, clustering based on search is less accurate in judging similarity, but it can directly use existing search engines, so its implementation cost is lower. In practical applications the two schemes can be used independently or in combination. Of course, without departing from the basic idea of the present invention, those skilled in the art may also use other clustering methods, either independently or in combination with the methods provided by the embodiments of the present invention.
S102: according to currently existing document association information, determine the documents associated with the documents in the same cluster as document A.
After the documents whose content is similar to that of document A have been obtained by clustering, the documents associated with those similar documents must first be determined in order to recommend for document A.
The solution of the present invention rests on the following assumption: in the preset document collection there are some documents that already have association information of their own. If such documents are gathered into the same cluster as document A, their existing association information can be used to generate the recommendation results for document A.
In one specific embodiment of the present invention, association information between different documents can be established according to user behavior records related to the documents.
For documents B and B1, if a correlation between them is reflected in users' access, an association between document B and document B1 can be established. Here, user access may include behaviors such as browsing, searching, and actively recommending. For example, if a user, during one browsing session, first browses document B, "High school entrance exam Chinese reading", and then browses document B1, "High school entrance exam Chinese composition", an association between document B and document B1 can be established.
In one specific embodiment, the preset document collection can be initialized as a graph. Each document in the collection is a vertex of the graph; when a new document is later added to the collection, a vertex is added to the graph accordingly.
The initial edge set of the graph is empty (that is, the edge weight between any two vertices is 0). For any two vertices, if a correlation between them is reflected in the access behavior of one user, an edge is added between them; if the correlation is also reflected in the access behavior of another user, the weight of the existing edge is increased, and so on. By analyzing the historical behavior records of a large number of users, the number and weights of the edges are built up step by step, finally yielding the association information of all documents in the collection.
In practical applications, different weights can also be assigned to different user behaviors. For example, a correlation reflected by a search behavior is given a weight of 0.5 units, a correlation reflected by a browsing behavior a weight of 1 unit, and a correlation reflected by a user's active recommendation a weight of 2 units.
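The weighted association graph described above can be sketched with nested dictionaries. The weights 0.5, 1, and 2 come from the example in the text; the event format `(doc_x, doc_y, behavior)`, the behavior labels, and the function names are assumptions introduced for illustration.

```python
from collections import defaultdict

# Weights per behavior, following the example values given in the text.
BEHAVIOR_WEIGHTS = {"search": 0.5, "browse": 1.0, "recommend": 2.0}


def build_association_graph(events):
    """Accumulate undirected edge weights from (doc_x, doc_y, behavior) observations."""
    graph = defaultdict(lambda: defaultdict(float))
    for doc_x, doc_y, behavior in events:
        weight = BEHAVIOR_WEIGHTS[behavior]
        # The first observation creates the edge; later ones increase its weight.
        graph[doc_x][doc_y] += weight
        graph[doc_y][doc_x] += weight
    return graph


def associated_documents(graph, doc):
    """Neighbours of doc, sorted by descending edge weight (ties broken by id)."""
    neighbours = graph.get(doc, {})
    return sorted(neighbours, key=lambda d: (-neighbours[d], d))


events = [
    ("B", "B1", "browse"),
    ("B", "B1", "search"),
    ("B", "B2", "recommend"),
    ("B", "B3", "search"),
]
graph = build_association_graph(events)
print(associated_documents(graph, "B"))
```

A document with no recorded behavior, such as a newly published one, simply has no neighbours in this graph, which is the cold-start situation the clustering step is meant to work around.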
In one specific embodiment of the present invention, association information between different documents can also be established according to the categories to which the documents belong.
Document classification means assigning a category to each document in the collection according to the attributes or content of the document. This not only lets users browse documents conveniently within a specific category, but also makes documents easier to find by narrowing the search scope.
For documents B and B1, if both belong to the same category, an association between them can be established. For example, if document B, "High school entrance exam Chinese reading", and document B1, "High school entrance exam Chinese composition", both belong to the category "High school entrance exam Chinese", an association between document B and document B1 can be established.
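Category-based association can be sketched by grouping documents by category and linking each document to the others in its group. The function name `associations_by_category`, the category labels, and the assumption that each document has exactly one category are illustrative only.

```python
from collections import defaultdict


def associations_by_category(categories):
    """Map each document id to the sorted list of other documents in the same category."""
    members = defaultdict(list)
    for doc_id, category in categories.items():
        members[category].append(doc_id)
    return {
        doc_id: sorted(d for d in members[category] if d != doc_id)
        for doc_id, category in categories.items()
    }


categories = {"B": "exam-chinese", "B1": "exam-chinese", "Z": "sports"}
by_category = associations_by_category(categories)
print(by_category["B"])
```

Large categories would produce many associations per document, so in practice this signal would probably be combined with a weight or a cap.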
It should be understood that the existing association information of documents can be obtained in any manner; the two schemes above are only illustrations. In practical applications the two schemes can be used independently or in combination, for example by assigning a certain weight to belonging to the same category and letting it act together with the correlation reflected by user access behavior. Of course, without departing from the basic idea of the present invention, those skilled in the art may also use other methods of establishing association information, either independently or in combination with the methods provided by the embodiments of the present invention.
S103: use the determined documents, which are associated with the documents in the same cluster as document A, to form the first recommendation result for document A.
For document A, suppose that after clustering around document A, documents B, C, and D, whose content is similar to that of A, are gathered into the same cluster, and that B, C, and D have the following associated documents:
the associated documents of B are B1, B2, B3, B4 (sorted by association weight; the same applies below);
the associated documents of C are C1, C2, C3;
the associated documents of D are D1, D2.
Then, as the associated documents of A's same-cluster members, B1, B2, B3, B4, C1, C2, C3, D1, and D2 constitute the recommendation candidate set of A, and the recommendation results for document A can be generated from this set.
Depending on actual needs, different strategies can be used to generate recommendation results from the recommendation candidate set. For example, the top N associated documents of each same-cluster member can be selected to generate the recommendation results.
Alternatively, different numbers of associated documents can be selected according to the distance from each cluster member to the cluster center. For example, 3 associated documents are selected from the nearest cluster member, 2 from the second-nearest cluster member, and 1 from each of the remaining cluster members.
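Both selection strategies can be sketched over the B, C, D example. The quotas 3, 2, 1 follow the example in the text; the function names, the assumption that cluster members are already ordered from nearest to farthest, and the de-duplication of the output are illustrative choices.

```python
def top_n_per_member(associated, n):
    """Take the first n associated documents of every cluster member."""
    result = []
    for docs in associated.values():
        for doc in docs[:n]:
            if doc not in result:
                result.append(doc)
    return result


def quota_by_distance(members_by_distance, associated, quotas=(3, 2), default_quota=1):
    """Nearer cluster members contribute more documents.

    members_by_distance lists the cluster members from nearest to farthest.
    """
    result = []
    for rank, member in enumerate(members_by_distance):
        quota = quotas[rank] if rank < len(quotas) else default_quota
        for doc in associated.get(member, [])[:quota]:
            if doc not in result:
                result.append(doc)
    return result


# Associated documents of each member, already sorted by association weight.
associated = {
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3"],
    "D": ["D1", "D2"],
}
print(top_n_per_member(associated, 2))
print(quota_by_distance(["B", "C", "D"], associated))
```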
In addition, if during the generation of recommendation results the same associated document is found for different cluster members, the association confidence of that document is considered higher and it can be added to the recommendation results preferentially. For example:
the associated documents of B are B1, B2, B3, B4;
the associated documents of C are C1, C2, C3, X;
the associated documents of D are D1, D2, X.
According to the existing association information, document X is an associated document of both document C and document D, so when the recommendation results are generated, document X can be given additional ranking weight according to its degree of co-occurrence.
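One simple way to realize the co-occurrence preference is to count, for each candidate, how many cluster members list it, and to rank by that count. The text does not say how the additional ranking weight is computed, so the pure count used here, the tie-break by first appearance, and the function name `rank_by_cooccurrence` are assumptions.

```python
from collections import Counter


def rank_by_cooccurrence(associated):
    """Rank candidates by how many cluster members list them, then by first appearance."""
    counts = Counter()
    order = {}
    for docs in associated.values():
        # dict.fromkeys removes repeats within one member while keeping the order.
        for doc in dict.fromkeys(docs):
            counts[doc] += 1
            order.setdefault(doc, len(order))
    return sorted(counts, key=lambda d: (-counts[d], order[d]))


associated = {
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3", "X"],
    "D": ["D1", "D2", "X"],
}
ranked = rank_by_cooccurrence(associated)
print(ranked[0])  # X is listed by both C and D, so it ranks first
```

A fuller scheme might add the co-occurrence count to the association weights instead of letting it dominate the ordering.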
Furthermore, since B, C, and D are themselves documents whose content is similar to that of A, adding B, C, and D to the recommendation results can also be considered in actual recommendation.
With the above technical solution, if A is a newly published document, the associated documents of B, C, and D can be fed back to the user as recommendation results for A. On the other hand, if document A already had some associated documents for recommendation, clustering gives A more recommendation candidates, which also helps to further optimize the recommendation results.
Corresponding to the above method embodiment, the present invention also provides a document recommendation device. As shown in Fig. 2, the device includes:
a clustering unit 110, configured to cluster documents around document A, in a preset document collection, according to the similarity of document content.
For any document A, the present invention clusters around document A according to the similarity of document content, finds all documents in the document collection whose content is similar to that of document A, and then uses the documents associated with the other members of the cluster as recommendation candidates for document A, from which the recommendation results for document A are generated.
In one specific embodiment of the present invention, text duplicate-detection techniques can be used to cluster documents.
Given the nature of the Internet, a large number of documents with duplicated content inevitably exist. Many text duplicate-detection techniques have been developed to manage such documents effectively, for example duplicate detection based on document-level signature algorithms; commonly used algorithms include MD5 and simhash. The solution provided here can directly use these mature techniques to run duplicate detection over the different documents in the preset document collection and group documents with the same content together.
In a specific implementation, the document can first be split into sentences, for example at separators such as line breaks, periods, exclamation marks, and question marks. The resulting sentences are then normalized, for example by converting between full-width and half-width characters, between upper and lower case, and between traditional and simplified Chinese characters, removing noise characters, and collapsing runs of whitespace. Finally, each sentence is signed, and the common length or the similarity of the two documents' signature vectors is computed and used to represent the degree of content overlap.
It should be understood that the duplicate-detection procedure given above is only an illustration and should not be construed as limiting the solution of the present invention.
In practical applications, because of user edits and similar causes, the content of some documents may differ in certain details while remaining consistent on the whole. Since the purpose of the present invention is to make recommendations based on the degree of content similarity between documents, a content overlap threshold can be preset (for example 80% or 90%). During deduplication, if the similarity between documents exceeds this threshold, the difference between them is considered very small and they can be gathered into the same document cluster; associated documents can then be shared among members of the same cluster.
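As a rough sketch of the threshold rule, the function below gathers every document whose similarity to the center exceeds a preset value. The names `cluster_around` and `jaccard` are hypothetical, and token-level Jaccard similarity is only a stand-in for whatever duplicate-detection measure a real system would use.

```python
def jaccard(a, b):
    # Stand-in similarity: Jaccard overlap of whitespace-separated tokens.
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_around(center_id, docs, threshold=0.8, similarity=jaccard):
    """Return ids of documents whose similarity to the center exceeds threshold.

    docs maps a document id to its text; the center itself is excluded.
    """
    center_text = docs[center_id]
    return [
        doc_id
        for doc_id, text in docs.items()
        if doc_id != center_id and similarity(center_text, text) > threshold
    ]
```

Passing the similarity function as a parameter keeps the clustering rule independent of the particular measure, so a signature-based overlap could be substituted without changing the rule.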
In another specific embodiment of the present invention, documents can also be clustered using retrieval technology.
The basic function of a search engine is to find, for a given search keyword, network resources whose content is identical or similar to that keyword. Based on this function, in the present invention the content of document A (the cluster center) can be used to form a search keyword, which is input into a search engine and searched within the scope of the preset document set; the members of the cluster are then determined from the search results.
A most basic implementation is as follows: the title of document A is input directly into the search engine as the search keyword, and if the title of a search result is identical or similar to the title of document A, that result is gathered into the document cluster centered on A. For example, if document A is titled "High School Entrance Exam Reading (Chinese)" and retrieval returns another document B titled "High School Entrance Exam Chinese Reading", document B can be gathered directly into the cluster.
Of course, in practical applications, if the body content of a search result is similar to the title of document A, the clustering condition can also be considered satisfied; it need not be limited to "similar titles". In theory, parts of document A other than the title, such as the author or the abstract, can also be used for retrieval. While composing the search keyword, preprocessing such as word segmentation and stop-word removal can also be performed. In addition, many current search engines are quite intelligent: the engine itself can automatically perform preprocessing such as word segmentation and stop-word removal, and search results are generally sorted by their degree of relevance (similarity) to the keyword. The top n search results (n being a positive integer) can therefore be taken directly as the same-cluster members of A. In short, those skilled in the art can flexibly set the specific strategy for clustering with search results according to actual application requirements and scenarios, and the present invention does not need to limit this.
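The "take the top n search results" strategy might look like the sketch below. The `search` function here is a toy in-memory ranker invented for illustration (it scores titles by shared tokens); a real deployment would call an existing search engine instead, and `cluster_by_search` is likewise a hypothetical name.

```python
def search(query, titles):
    """Toy search: rank documents by how many query tokens their title shares."""
    q = set(query.lower().split())
    scored = []
    for doc_id, title in titles.items():
        hits = len(q & set(title.lower().split()))
        if hits:
            scored.append((hits, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def cluster_by_search(center_id, titles, n=3):
    """Use the center's title as the query and keep the top n other results."""
    results = search(titles[center_id], titles)
    return [d for d in results if d != center_id][:n]
```

Because the center document matches its own title, it is filtered out before the top n results are taken.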
Compared with the clustering method based on deduplication technology, the clustering method based on search technology is less accurate in judging similarity, but it can directly use an existing search engine, so its implementation cost is lower. In practical applications, the two schemes can be used independently or in combination. Of course, without departing from the basic idea of the present invention, those skilled in the art can also use other clustering methods, and these methods can be used independently or in combination with the methods provided by the embodiments of the present invention.
Association unit 120, configured to determine, according to currently existing document association information, the documents associated with the documents in the same cluster as document A;
After documents with content similar to document A have been obtained by clustering, in order to make recommendations for document A, the documents associated with those similar documents must first be determined.
The scheme of the present invention is based on the following assumption: in the preset document set there are some documents that already have association information of their own. If such documents are gathered into the same cluster as document A, this existing association information can be used to generate the recommendation results for document A.
In one specific embodiment of the present invention, association information between different documents can be established from document-related user behavior records.
For document B and document B1, if a correlation between them is exhibited in the course of user access, an association between document B and document B1 can be established. Here "user access" may include behaviors such as browsing, searching, and active recommendation. For example, if during one browsing session a user first browses document B "High School Entrance Exam Chinese Reading" and then browses document B1 "High School Entrance Exam Chinese Composition", an association between document B and document B1 can be established.
In one specific implementation, the preset document set can be initialized as a graph. Each document in the set forms a vertex of the graph; whenever a new document is subsequently added, a vertex is added accordingly. The initial edge set of the graph is empty. For any two vertices, if a correlation between them is exhibited in one user's access behavior, an edge is added between the two vertices; if a correlation is also exhibited in another user's access behavior, the weight of the existing edge is increased; and so on. By analyzing the historical behavior records of a large number of users, the number and weights of edges are increased step by step, finally yielding the association information of all documents in the document set.
In practical applications, different weight values can also be assigned to different user behaviors. For example, a correlation exhibited by "search" behavior is assigned a weight of 0.5 units; a correlation exhibited by "browse" behavior is assigned a weight of 1 unit; a correlation exhibited by "active user recommendation" behavior is assigned a weight of 2 units; and so on.
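The weighted association graph can be sketched as a nested dictionary of edge weights. The event format (a triple of two document ids and a behavior label) and the function names are assumptions made for this illustration; the weights are the example values from the text.

```python
from collections import defaultdict

# Example weights from the text; a real system would tune these.
BEHAVIOR_WEIGHTS = {"search": 0.5, "browse": 1.0, "recommend": 2.0}


def build_association_graph(events):
    """Build an undirected weighted graph from observed correlations.

    events is an iterable of (doc_x, doc_y, behavior) triples, one per
    correlation observed in some user's access behavior.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for x, y, behavior in events:
        if x == y:
            continue  # a document is not associated with itself
        w = BEHAVIOR_WEIGHTS[behavior]
        # The first observation creates the edge; later ones add weight.
        graph[x][y] += w
        graph[y][x] += w
    return graph


def associated_documents(graph, doc_id):
    """Neighbors of doc_id, sorted by descending edge weight."""
    neighbors = graph.get(doc_id, {})
    return sorted(neighbors, key=lambda d: (-neighbors[d], d))
```

Each repeated observation of the same pair increases the edge weight, which matches the step-by-step accumulation described above.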
In one specific embodiment of the present invention, association information between different documents can also be established according to the categories to which the documents belong.
Document classification means determining a category for each document in the document set according to the document's attributes or content. In this way, users can not only conveniently browse documents within a specific category, but can also find documents more easily by restricting the search scope.
For document B and document B1, if the two belong to the same category, an association between document B and document B1 can be established. For example, if document B "High School Entrance Exam Chinese Reading" and document B1 "High School Entrance Exam Chinese Composition" both belong to the category "High School Entrance Exam Chinese", an association between document B and document B1 can be established.
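Category-based association amounts to grouping documents by category and linking each document to the others in its group. The sketch below assumes each document has exactly one category; `associate_by_category` is a hypothetical name.

```python
from collections import defaultdict


def associate_by_category(categories):
    """Map each document id to the other documents sharing its category.

    categories maps a document id to a single category name.
    """
    members = defaultdict(list)
    for doc_id, category in categories.items():
        members[category].append(doc_id)
    return {
        doc_id: sorted(d for d in members[category] if d != doc_id)
        for doc_id, category in categories.items()
    }
```

A document that is alone in its category simply gets an empty association list.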
It should be understood that the "existing association information" of documents can be obtained in any manner, and the two schemes above are only schematic illustrations. In practical applications, the two schemes can be used independently or in combination; for example, "belonging to the same category" can be assigned a certain weight value that acts together with "the correlation exhibited by user access behavior". Of course, without departing from the basic idea of the present invention, those skilled in the art can also use other methods of establishing association information, and these methods can be used independently or in combination with the methods provided by the embodiments of the present invention.
Recommendation unit 130, configured to use the determined documents associated with the documents in the same cluster as document A to form the first recommendation result of document A.
For document A, suppose that after clustering centered on document A, documents B, C, and D, whose content approximates that of document A, are gathered into the same cluster, and that B, C, and D have the following associated documents:
The associated documents of B are B1, B2, B3, B4 (sorted by association weight; the same applies below);
The associated documents of C are C1, C2, C3;
The associated documents of D are D1, D2;
Then the associated documents of A's same-cluster members, namely B1, B2, B3, B4, C1, C2, C3, D1, and D2, constitute the recommendation candidate set of A, and the recommendation results of document A can be generated from this set.
Depending on actual requirements, different strategies can be used to generate recommendation results from the recommendation candidate set. For example, the top N associated documents of each same-cluster member can be selected to generate the recommendation results.
Alternatively, a different number of associated documents can be selected according to each cluster member's distance from the cluster center. For example, for the nearest cluster member, 3 associated documents are selected and added to the recommendation results; for the second-nearest cluster member, 2 associated documents are selected; for each remaining cluster member, 1 associated document is selected; and so on.
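The distance-based quota strategy can be sketched as follows, using the 3/2/1 quotas from the example. The function name `pick_by_rank` and its parameters are hypothetical, and the cluster members are assumed to be already ordered from nearest to farthest from the cluster center.

```python
def pick_by_rank(members_by_distance, associated, quotas=(3, 2), default_quota=1):
    """Select associated documents with a quota that shrinks with distance.

    members_by_distance lists cluster members, nearest to the center first.
    associated maps a member id to its associated documents, ordered by
    descending association weight.
    """
    picked = []
    for rank, member in enumerate(members_by_distance):
        quota = quotas[rank] if rank < len(quotas) else default_quota
        for doc in associated.get(member, [])[:quota]:
            if doc not in picked:  # avoid recommending a document twice
                picked.append(doc)
    return picked
```

Setting `quotas=()` and `default_quota=N` reduces this to the simpler "top N per member" strategy.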
In addition, if during the generation of recommendation results it is found that different cluster members share the same associated document, the association confidence of that document is considered higher, and it can be added to the recommendation results preferentially. For example:
The associated documents of B are B1, B2, B3, B4;
The associated documents of C are C1, C2, C3, X;
The associated documents of D are D1, D2, X;
According to the existing association information, document X is an associated document of both document C and document D, so during the generation of recommendation results document X can be given an additional ranking weight according to its degree of co-occurrence.
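One possible way to apply a co-occurrence bonus is sketched below. The scoring formula is an assumption made for illustration, since the text does not specify one: each candidate gets a positional score of 1/(position + 1) from its best-ranked appearance, plus a fixed bonus for every additional cluster member that also lists it. The name `rank_candidates` is hypothetical.

```python
from collections import Counter


def rank_candidates(associated, bonus=1.0):
    """Rank candidates, boosting documents shared by several cluster members.

    associated maps a member id to its associated documents, ordered by
    descending association weight.
    """
    best = {}
    seen_in = Counter()
    for docs in associated.values():
        for pos, doc in enumerate(docs):
            score = 1.0 / (pos + 1)
            best[doc] = max(best.get(doc, 0.0), score)
        # Count each member once per document, even if a list has repeats.
        seen_in.update(set(docs))
    scores = {d: best[d] + bonus * (seen_in[d] - 1) for d in best}
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With the example lists above, X appears under both C and D, so under this assumed formula its bonus lifts it ahead of the top-ranked documents of each individual member.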
Furthermore, since B, C, and D are themselves documents whose content approximates that of A, in the actual recommendation process B, C, and D can also be added to the recommendation results.
For convenience of description, the above device is described in terms of various units divided by function. Of course, when implementing the present invention, the functions of the units can be realized in one or more pieces of software and/or hardware.
With the above technical solution, if A is a newly published document, the associated documents of B, C, and D can be fed back to the user as A's recommendation results. On the other hand, if document A already had some associated documents available for recommendation, then after clustering A has more recommendation candidates, which also helps to further optimize the recommendation results.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. The device embodiment in particular is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the corresponding description of the method embodiment. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of this embodiment, for example in systems, devices, or distributed computing environments.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are executed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The above are only specific embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A document recommendation method, characterized by comprising:
in a preset document set, clustering documents around document A as the center according to the degree of similarity of document content;
determining, according to currently existing document association information, the documents associated with the documents in the same cluster as document A, wherein the document association information is association information between different documents established according to document-related user behavior records, or association information between different documents established according to the categories to which the documents belong;
using the determined documents associated with the documents in the same cluster as document A to form a first recommendation result of document A.
2. The method according to claim 1, characterized in that clustering documents around document A as the center according to the degree of similarity of document content comprises:
performing duplicate detection on document content, and aggregating documents whose content overlap with document A exceeds a preset threshold into one document cluster.
3. The method according to claim 1, characterized in that clustering documents according to the degree of similarity of document content comprises:
performing retrieval using document A and, according to the retrieval results, aggregating documents whose content relevance to document A exceeds a preset threshold into one document cluster.
4. The method according to claim 1, characterized in that the method further comprises:
using the documents in the same cluster as document A to form a second recommendation result of document A.
5. A document recommendation device, characterized by comprising:
a clustering unit, configured to cluster documents in a preset document set around document A as the center according to the degree of similarity of document content;
an association unit, configured to determine, according to currently existing document association information, the documents associated with the documents in the same cluster as document A, wherein the document association information is association information between different documents established according to document-related user behavior records, or association information between different documents established according to the categories to which the documents belong;
a recommendation unit, configured to use the determined documents associated with the documents in the same cluster as document A to form a first recommendation result of document A.
6. The device according to claim 5, characterized in that the clustering unit is specifically configured to:
perform duplicate detection on document content, and aggregate documents whose content overlap with document A exceeds a preset threshold into one document cluster.
7. The device according to claim 5, characterized in that the clustering unit is specifically configured to:
perform retrieval using document A and, according to the retrieval results, aggregate documents whose content relevance to document A exceeds a preset threshold into one document cluster.
8. The device according to claim 5, characterized in that the recommendation unit is further configured to:
use the documents in the same cluster as document A to form a second recommendation result of document A.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210272764.7A CN102855282B (en) | 2012-08-01 | 2012-08-01 | A kind of document recommendation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102855282A CN102855282A (en) | 2013-01-02 |
CN102855282B true CN102855282B (en) | 2018-10-16 |
Family
ID=47401870
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210272764.7A Active CN102855282B (en) | 2012-08-01 | 2012-08-01 | A kind of document recommendation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102855282B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103425748B (en) * | 2013-07-19 | 2017-06-06 | 百度在线网络技术(北京)有限公司 | A kind of document resources advise the method for digging and device of word |
CN104640092B (en) * | 2015-01-27 | 2016-10-19 | 北京奇虎科技有限公司 | Identify the method for refuse messages, client, cloud server and system |
JP6623547B2 (en) * | 2015-05-12 | 2019-12-25 | 富士ゼロックス株式会社 | Information processing apparatus and information processing program |
CN105512300B (en) * | 2015-12-11 | 2019-01-22 | 宁波中青华云新媒体科技有限公司 | information filtering method and system |
CN107491423B (en) * | 2016-06-12 | 2021-03-30 | 北京云量数盟科技有限公司 | Chinese document gene quantization and characterization method based on numerical value-character string mixed coding |
CN107844493B (en) * | 2016-09-19 | 2020-12-29 | 博彦泓智科技(上海)有限公司 | File association method and system |
CN110019811B (en) * | 2018-01-02 | 2024-01-09 | 深圳市雅阅科技有限公司 | Article recommendation method, device and equipment |
CN109189913B (en) * | 2018-08-01 | 2021-10-22 | 昆明理工大学 | Novel recommendation method based on content |
CN110162752B (en) * | 2019-05-13 | 2023-06-27 | 百度在线网络技术(北京)有限公司 | Article judging and re-processing method and device and electronic equipment |
CN110888981B (en) * | 2019-10-30 | 2022-11-01 | 深圳价值在线信息科技股份有限公司 | Title-based document clustering method and device, terminal equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000627A (en) * | 2007-01-15 | 2007-07-18 | 北京搜狗科技发展有限公司 | Method and device for issuing correlation information |
JP2008040875A (en) * | 2006-08-08 | 2008-02-21 | Canon Inc | Document printing apparatus, document preparation device, document printing method, and document preparation method |
CN101689183A (en) * | 2007-04-30 | 2010-03-31 | 谷歌公司 | Program guide user interface |
CN101976259A (en) * | 2010-11-03 | 2011-02-16 | 百度在线网络技术(北京)有限公司 | Method and device for recommending series documents |
Non-Patent Citations (1)
Title |
---|
基于内容的相关书籍推荐技术研究;商雪晶;《中国优秀硕士论文全文数据库 信息科技辑》;20110515;第1-45页 * |
Also Published As
Publication number | Publication date |
---|---|
CN102855282A (en) | 2013-01-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |