CN102693335B - Method for classifying user interest patterns based on a Gini coefficient measure - Google Patents
- Publication number
- CN102693335B
- Authority
- CN
- China
- Prior art keywords
- interest
- user
- degree
- gini coefficient
- user interest
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for classifying user interest patterns based on a Gini coefficient measure, and relates to the field of computer modeling. The method comprises: S1: building a user interest model on the vector space model; S2: sorting the user interest model in ascending order of interest degree to obtain the sorted interest degrees; S3: converting the sorted interest degrees from step S2 into percentages of the user's own total interest degree; S4: from these percentages, computing for each user the vector of cumulative percentages used to generate the Lorenz curve; S5: plotting the Lorenz curve of the user's interest pattern, with the values of that vector as ordinates and the fields ordered from low to high interest degree as abscissa, calculating the Gini coefficient, and classifying the user's interest pattern by the Gini coefficient. The invention achieves an accurate classification of user interest patterns.
Description
Technical field
The present invention relates to the field of computer modeling, and in particular to a method for classifying user interest patterns based on a Gini coefficient measure.
Background art
User interest patterns are diverse. Some users have broad interests, and their liking is spread fairly evenly across fields; others are devoted to only a few categories, so their liking is distributed unevenly across fields. Users often like different things yet share the same interest pattern. For example, one user is a music fan who likes only music and another is a military enthusiast who likes only military topics. Music and military affairs are very different, but both users are single-interest users and therefore have the same interest pattern. Current research on user interest lacks a way to measure a user's interest pattern, so a method for measuring user interest patterns is needed.
The Gini coefficient from economics is a valuable reference for classifying user interest patterns. It is an internationally used measure of the gap between rich and poor: it describes how a society's total income is distributed over the whole population. This closely resembles the problem studied here, namely measuring how a user's interest is distributed across fields.
To study the unequal distribution of social wealth, the American statistician M. O. Lorenz (Max Otto Lorenz) proposed the well-known Lorenz curve in 1905. In economics, the Lorenz curve (the solid line in Fig. 1) is a graphical representation of the cumulative distribution of wealth.
Any point (x%, y%) on the Lorenz curve in Fig. 1 means that, with the population ordered from poor to rich, the poorest x% of the population together receive y% of the society's total income. The "line of absolute equality" (curve of absolute equality) in the figure is the income distribution curve when the total income is shared perfectly equally, and is the straight line y = x. The "line of absolute inequality" (curve of absolute inequality) is the income distribution curve when a single person holds all of the society's income, and is a straight line perpendicular to the x-axis. A Lorenz curve generally lies between the line of absolute equality and the line of absolute inequality.
In 1912 the Italian economist Corrado Gini proposed the Gini coefficient on the basis of the Lorenz curve. The Gini coefficient is an index of how concentrated (how unequal) the distribution of a variable is, and modern economics often uses it to measure the gap between rich and poor. As shown in Fig. 1, let A be the area between the Lorenz curve and the line of absolute equality, and let B be the area enclosed by the Lorenz curve, the line of absolute inequality and the x-axis. The degree of inequality, that is the Gini coefficient, is the quotient of A and A + B, as expressed by formula (1):
$$G = \frac{A}{A + B} \qquad (1)$$
This value is called the Gini coefficient, or Lorenz coefficient, and lies between 0 and 1. The smaller the area A between the Lorenz curve and the line of absolute equality, the more equal the income distribution, the flatter the Lorenz curve and the smaller the Gini coefficient. Conversely, the more unequal the income distribution, the more bowed the Lorenz curve and the larger the Gini coefficient.
When the Gini coefficient is used to measure the gap between rich and poor, it essentially assumes that the society's total income is homogeneous and measures how that income is distributed over the whole population.
The prior art concentrates on studying user preference through the similarity of user interests. It does not study user interest from the perspective of interest patterns, and therefore cannot classify user interest patterns accurately.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to classify user interest patterns accurately.
(2) Technical solution
To solve the above technical problem, the present invention provides a method for classifying user interest based on a Gini coefficient measure, comprising the following steps:
S1: build a user interest model on the vector space model. The set of user interest fields is T = {interest 1, interest 2, ..., interest N}; for any user, the user interest model can be expressed as U = {<interest 1, interest degree 1>, ..., <interest N, interest degree N>};
S2: sort the user interest model in ascending order of interest degree; the sorted user interest degrees are:
$$U' = \{w'_1, w'_2, \ldots, w'_n\}, \qquad w'_1 \le w'_2 \le \cdots \le w'_n$$
where n = N is the number of interest fields;
S3: convert the interest degrees $w'_i$ from step S2 into percentages of the user's own total interest degree:
$$U^{\%} = \{p_1, p_2, \ldots, p_n\}, \qquad p_i = \frac{w'_i}{\sum_{j=1}^{n} w'_j}$$
S4: from $U^{\%}$, compute for each user the vector $U^{Gini} = \{w_1, w_2, \ldots, w_n\}$ used to generate the Lorenz curve, wherein:
$$w_i = \sum_{k=1}^{i} p_k, \qquad 1 \le i \le n$$
S5: with the values of $U^{Gini}$ as ordinates and the fields ordered from low to high interest degree as abscissa, draw the Lorenz curve of the user's interest pattern, calculate the Gini coefficient, and classify the user's interest pattern by the Gini coefficient.
In step S5, the Gini coefficient is calculated as follows:
$$G = 1 - \frac{1}{n}\left(2\sum_{i=1}^{n-1} w_i + 1\right)$$
The smaller the difference between the Gini coefficients of two users, the more similar their interest patterns.
(3) Beneficial effects
By using the Lorenz curve and the Gini coefficient to measure user interest patterns both qualitatively and quantitatively, the present invention classifies user interest patterns more accurately.
Brief description of the drawings
Fig. 1 is a schematic diagram of the Lorenz curve;
Fig. 2 is a flowchart of a method for classifying user interest patterns based on a Gini coefficient measure according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the Lorenz curves representing the interests of users A, B and C in the embodiment;
Fig. 4 is a schematic diagram of the trapezoidal approximation of the Lorenz curve representing the interest of user A in the embodiment;
Fig. 5 is a schematic diagram of the Lorenz curves of the 943 users in the MovieLens data set;
Fig. 6 is a schematic diagram of the frequency distribution of the Gini coefficients of the 943 users in the MovieLens data set;
Fig. 7 shows the distribution of Gini coefficients in the MovieLens data set, divided according to the reference users.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.
The flow of the method of the present invention for classifying user interest patterns based on a Gini coefficient measure is shown in Fig. 2 and comprises:
Step S201: build a user interest model on the vector space model (Vector Space Model, VSM). The set of user interest fields is T = {interest 1, interest 2, ..., interest N}; for any user, the user interest can be expressed as U = {<interest 1, interest degree 1>, ..., <interest N, interest degree N>}, that is:
$$U_{user} = \{\langle theme_1, weight_1\rangle, \ldots, \langle theme_N, weight_N\rangle\}$$
Here $theme_i$ is a theme in the set T and $weight_i$ is the size of the user's interest degree in the corresponding field, a weight set in advance that expresses how interested the user is in that field. The model can also be written directly as:
$$U_{user} = \{weight_1, weight_2, \ldots, weight_N\}$$
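The vector form of the user interest model can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: the field names in `FIELDS` and the helper `interest_vector` are hypothetical, and fields a user never touched are assumed to have interest degree 0.

```python
from typing import Dict, List

# Hypothetical field set T; the names are illustrative only.
FIELDS: List[str] = ["sports", "finance", "military", "health"]


def interest_vector(weights: Dict[str, float], fields: List[str] = FIELDS) -> List[float]:
    """Return the weights in the fixed field order of T; missing fields get 0."""
    return [float(weights.get(field, 0.0)) for field in fields]


user = {"sports": 8, "finance": 10, "military": 3}
print(interest_vector(user))
```

Keeping one fixed field order for all users is what allows the model to drop the theme labels and work with the bare weight vector.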
Step S202: sort the user interest model in ascending order of interest degree; the sorted user interest is:
$$U'_{user} = \{w'_1, w'_2, \ldots, w'_n\}$$
where $w'_1 \le w'_2 \le \cdots \le w'_n$ and n = N is the number of fields.
Step S203: convert the interest degrees $w'_i$ from step S202 into percentages of the user's own total interest degree, giving $U^{\%}_{user}$. Clearly, after this conversion the values in each user's vector sum to 100%:
$$U^{\%}_{user} = \{p_1, p_2, \ldots, p_n\}$$
where:
$$p_i = \frac{w'_i}{\sum_{j=1}^{n} w'_j}, \qquad \sum_{i=1}^{n} p_i = 100\%$$
Step S204: from $U^{\%}_{user}$ computed in S203, compute for each user the vector $U^{Gini}_{user} = \{w_1, w_2, \ldots, w_n\}$ used to generate the Lorenz curve. Here $w_i$ (1 ≤ i ≤ n) is the sum of the percentages from group 1 to group i of the sorted interest degrees, that is, their combined share of the total interest degree:
$$w_i = \sum_{k=1}^{i} p_k$$
Step S205: with the values of $U^{Gini}_{user}$ as ordinates and the fields ordered from low to high interest as abscissa, plot the points to draw the Lorenz curve of the user's interest pattern. The Gini coefficient is:
$$G = 1 - \frac{1}{n}\left(2\sum_{i=1}^{n-1} w_i + 1\right)$$
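The sorting, percentage and accumulation steps above, followed by the Gini coefficient, can be sketched in Python as follows. The closed form used here, G = 1 - (2 * (w_1 + ... + w_{n-1}) + 1) / n, is an assumption reconstructed from the trapezoid description later in this document; the function names are illustrative.

```python
from itertools import accumulate
from typing import List, Sequence


def lorenz_vector(weights: Sequence[float]) -> List[float]:
    """Sort ascending, convert to shares of the total, then accumulate."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("total interest degree must be positive")
    shares = [w / total for w in sorted(weights)]
    return list(accumulate(shares))


def gini(weights: Sequence[float]) -> float:
    """Assumed closed form: G = 1 - (2 * sum(w_1..w_{n-1}) + 1) / n."""
    w = lorenz_vector(weights)
    n = len(w)
    return 1.0 - (2.0 * sum(w[:-1]) + 1.0) / n


print(round(gini([1, 1, 1, 1]), 3))  # evenly spread interest -> 0.0
print(round(gini([0, 0, 0, 1]), 3))  # all interest in one of 4 fields -> 0.75
```

With this discrete form, a user whose interest sits entirely in one of n fields scores 1 - 1/n rather than exactly 1, so coefficients are only comparable between users measured over the same number of fields.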
The derivation of the Gini coefficient is illustrated by the following example.
To derive how the Gini coefficient measures user interest patterns, three users of known type are first simulated and the calculation of their Gini coefficients is given.
As shown in Table 1, a user named "rich" is denoted user A; Table 1 shows directly that he likes every field. Another user, "specialist one", is denoted user B; he cares only about sports, finance and military affairs and has never read news from other fields. A third user, "specialist two", is denoted user C; user C basically follows only fashion, health and education. The values in Table 1 are the numbers of articles each user has read in the corresponding field; as stated earlier, the number of news items a user has read in a field is taken as the user's interest degree in that field.
Table 1: Interest degree distributions of three typical users
After modeling the user interest with the vector space model, the interests of the three users can be expressed in vector form as:
U_A = {5, 3, 4, 7, 6, 2, 8, 4}
U_B = {8, 10, 0, 3, 0, 0, 0, 0}
U_C = {0, 0, 1, 0, 0, 5, 12, 7}
The three vectors already show that the users' interest patterns differ: the interest degrees of user A are clearly more evenly distributed than those of users B and C. The Gini coefficient is used next to measure the users' interest patterns quantitatively, and the Lorenz curve allows a qualitative view.
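The Gini coefficient of a user interest pattern is calculated as follows.
First step: sort each user's interest degrees in ascending order to obtain the sorted vector models of users A, B and C:
$$U'_A = \{2, 3, 4, 4, 5, 6, 7, 8\},\quad U'_B = \{0, 0, 0, 0, 0, 3, 8, 10\},\quad U'_C = \{0, 0, 0, 0, 1, 5, 7, 12\}$$
Second step: convert the interest degrees in the first-step vectors into percentages of the user's own total interest degree (the totals are 39, 21 and 25 respectively). Clearly, after this conversion the values in each user's vector sum to 100%, that is:
$$U^{\%}_A = \tfrac{1}{39}\{2, 3, 4, 4, 5, 6, 7, 8\},\quad U^{\%}_B = \tfrac{1}{21}\{0, 0, 0, 0, 0, 3, 8, 10\},\quad U^{\%}_C = \tfrac{1}{25}\{0, 0, 0, 0, 1, 5, 7, 12\}$$
Third step: to calculate the Gini coefficient, compute from the second step each user's vector $U^{Gini} = \{w_1, \ldots, w_n\}$ used to generate the Lorenz curve, where n is the total number of fields (here n = 8) and $w_i$ (1 ≤ i ≤ n) is the cumulative percentage up to group i, that is:
$$U^{Gini}_A = \tfrac{1}{39}\{2, 5, 9, 13, 18, 24, 31, 39\},\quad U^{Gini}_B = \tfrac{1}{21}\{0, 0, 0, 0, 0, 3, 11, 21\},\quad U^{Gini}_C = \tfrac{1}{25}\{0, 0, 0, 0, 1, 6, 13, 25\}$$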
Then, with the values of the vector $U^{Gini}$ as ordinates and all fields as abscissa, the curves obtained by plotting the points are the Lorenz curves of users A, B and C. As shown in Fig. 3, the dot-dash line closest to the "line of absolute equality" is the Lorenz curve of user A, the dotted line is the Lorenz curve of user B, and the dashed line between those of users A and B is the Lorenz curve of user C. Note in particular that, for each of the three users, the fields on the abscissa are ordered by that user's own degree of liking, increasing from left to right, so a given position need not denote the same field for different users. For example, field 8 on the abscissa of Fig. 3 only denotes each user's favorite field: for user A, field 8 is health, and for user B, field 8 is finance. The content of field 8 differs between the two users, but health and finance are the favorite fields of users A and B respectively, so field 8 holds the same rank among all fields for both users.
To calculate the Gini coefficients of the three users, take user A as an example. As shown in Fig. 4, the area between user A's Lorenz curve and the line of absolute equality is defined as $S_A$, and the area enclosed by the curve, the abscissa and the line of absolute inequality is $S_B$. From formula (1), the Gini coefficient of user A is given by formula (2):
$$G_A = \frac{S_A}{S_A + S_B} \qquad (2)$$
On the abscissa, all interest fields together take up 100%, so the area $S_A + S_B$ clearly equals 0.5.
$S_B$ cannot be computed directly, because the actual Lorenz curve is a curved line, so it has to be approximated. As shown in Fig. 4, $S_B$ is approximated here by the sum of the areas of n trapezoids. The i-th trapezoid has as its height the share of all fields lying between group i-1 and group i, which is 1/n, and as its two parallel sides the cumulative interest degrees $w_{i-1}$ and $w_i$ of groups i-1 and i (with $w_0 = 0$). In the general case with n fields, for a user with $U^{Gini} = \{w_1, w_2, \ldots, w_n\}$, the Gini coefficient is calculated by formula (3):
$$G = \frac{S_A}{S_A + S_B} = 1 - 2 S_B = 1 - \frac{1}{n}\sum_{i=1}^{n}\left(w_{i-1} + w_i\right) = 1 - \frac{1}{n}\left(2\sum_{i=1}^{n-1} w_i + 1\right) \qquad (3)$$
From formula (3), the Gini coefficients of users A, B and C are respectively:
G_A = 0.22, G_B = 0.73, G_C = 0.69.
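The trapezoid approximation described above can be checked numerically with a short sketch. It assumes n trapezoids of height 1/n with parallel sides w_{i-1} and w_i (taking w_0 = 0) and uses S_A + S_B = 0.5; the function name is illustrative. With the three vectors as printed earlier, it gives about 0.22 for user A and values near 0.71 and 0.68 for users B and C, close to but not identical with the figures quoted above, so it should be read as an illustration of the technique rather than a reproduction of the original calculation.

```python
from itertools import accumulate


def trapezoid_gini(weights):
    """Gini coefficient as 1 - 2 * S_B, with S_B summed from n trapezoids."""
    total = sum(weights)
    # Cumulative shares with w_0 = 0 prepended.
    cumulative = [0.0] + list(accumulate(w / total for w in sorted(weights)))
    n = len(weights)
    # Each trapezoid has height 1/n and parallel sides w_{i-1} and w_i.
    s_b = sum((cumulative[i - 1] + cumulative[i]) / (2 * n) for i in range(1, n + 1))
    return 1.0 - 2.0 * s_b  # equals S_A / (S_A + S_B) because S_A + S_B = 0.5


users = {
    "A": [5, 3, 4, 7, 6, 2, 8, 4],
    "B": [8, 10, 0, 3, 0, 0, 0, 0],
    "C": [0, 0, 1, 0, 0, 5, 12, 7],
}
for name, vector in users.items():
    print(name, round(trapezoid_gini(vector), 3))
```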
It can be seen that the interest Gini coefficient of user A ("rich") is 0.22, close to 0.2. This small value shows that the user's interest is distributed fairly evenly and that the user belongs to the group with broad interests. The Gini coefficient of user B ("specialist one") is 0.73, a fairly large value, which shows that this user's interest is distributed unevenly and that the user belongs to the group with narrow, focused interests. User C's interest is also distributed unevenly, so user C likewise belongs to the group with narrow interests. The Lorenz curves in Fig. 3 show the same thing directly: the curve of user A is closer to the line of absolute equality than those of users B and C, which means that user A's interest is spread more evenly over the fields than user B's. The Lorenz curves of users B and C lie very close together, which shows that they have similar interest patterns.
The method of the present invention was verified by experimental simulation, which demonstrates its feasibility. The procedure and the simulation results are as follows.
To verify in practice how well the Gini coefficient measures user interest patterns, the MovieLens data set was chosen for the experiment. The MovieLens data set is film rating data collected by the GroupLens group from the MovieLens website. The data set used contains 100,000 ratings (1 to 5 points) by 943 users on 1682 films, and each user has rated at least 20 films.
Every film in the MovieLens data set belongs to one or more of 18 fields (comedy, action, romance and so on). These 18 film types are taken as the whole space of user interest fields, and the distribution of each user's liking over the 18 film types is examined. To calculate the user interest degree, the rating a user gives a film is added to every field the film belongs to, which yields the user's interest degree for each of the 18 film types.
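The accumulation rule just described (add each rating to every field the rated film belongs to) can be sketched as follows. The tuple layout, the `film_genres` mapping and the toy data are assumptions for illustration and do not reflect the actual MovieLens file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def interest_degrees(
    ratings: Iterable[Tuple[int, int, int]],  # (user_id, film_id, rating 1..5)
    film_genres: Dict[int, List[str]],        # film_id -> genres of that film
) -> Dict[int, Dict[str, int]]:
    """Add each rating to every genre of the rated film, per user."""
    degrees: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for user_id, film_id, rating in ratings:
        for genre in film_genres.get(film_id, []):
            degrees[user_id][genre] += rating
    return {user: dict(by_genre) for user, by_genre in degrees.items()}


# Toy data, not taken from the MovieLens files.
film_genres = {1: ["Comedy", "Romance"], 2: ["Action"]}
ratings = [(7, 1, 4), (7, 2, 5), (7, 1, 3)]
print(interest_degrees(ratings, film_genres))
```

A film with several genres contributes its full rating to each of them, so a user's total interest degree can exceed the sum of the ratings given.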
The experiment found that when all 18 types are used to examine the distribution of user preferences, the Gini coefficients of almost all users are around 0.8 and discriminate poorly between users. This shows that almost no user is interested in all 18 film types, which agrees with everyday experience: even people with broad interests rarely cover every category.
For this reason, the number of fields used for the Gini coefficient is determined by the 80/20 principle. Assume that a user's main interests account for 80% of the user's total interest degree. The number of fields occupied by the top 80% of interest was computed for each of the 943 users, and its mean is 7.14. Therefore, when drawing the Lorenz curve and calculating the interest Gini coefficient, only the 7 film types with the highest interest are examined for each user. Fig. 5 shows the Lorenz curves of the 943 users drawn in this way (the abscissa is as in Fig. 3: the rightmost field is the one the user is most interested in, and interest increases from left to right), and Fig. 6 shows the corresponding frequency distribution of the users' Gini coefficients.
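One way to read the 80/20 step is: for each user, count how many of the highest-ranked fields are needed to reach 80% of the total interest degree, average that count over all users, and then keep only that many top fields. The sketch below follows this reading; the function names and the toy vectors are assumptions for illustration.

```python
from typing import List, Sequence


def fields_for_share(weights: Sequence[float], share: float = 0.8) -> int:
    """Smallest number of top fields whose interest degrees reach `share` of the total."""
    total = sum(weights)
    running = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= share * total:
            return count
    return len(weights)


def top_fields(weights: Sequence[float], k: int = 7) -> List[float]:
    """Keep only the k largest interest degrees."""
    return sorted(weights, reverse=True)[:k]


# Toy interest vectors, not MovieLens data.
toy_users = [[60, 25, 10, 5], [25, 25, 20, 15, 15]]
counts = [fields_for_share(u) for u in toy_users]
print(counts, sum(counts) / len(counts))
```

In the experiment described above the corresponding mean is reported as 7.14, which is why the 7 highest-ranked film types are kept.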
The frequency distribution of the 943 users' Gini coefficients in Fig. 6 is approximately normal. To give these values a more precise meaning, the Gini coefficients of 5 reference users are provided for comparison. Assume that a user's interest over the 7 fields of greatest interest follows the 80/20 principle: the fields the user is interested in take up 80% of the user's total interest, and the remaining fields share 20%. For example, a double-interest user has only two main interests; each of the two fields takes half of the 80%, that is 40%, and the remaining five fields of no interest share the remaining 20% equally, that is 4% each.
In this way the following five classes of reference users and their interest distributions are obtained, as shown in Table 2:
Table 2: Interest degree distributions and Gini coefficients of the five classes of reference users
Reference user | Interest degree distribution | Gini coefficient |
---|---|---|
Single-interest user | {3.3%, 3.3%, 3.3%, 3.3%, 3.3%, 3.3%, 80%} | 0.14 |
Double-interest user | {4%, 4%, 4%, 4%, 4%, 40%, 40%} | 0.24 |
Three-interest user | {5%, 5%, 5%, 5%, 26.6%, 26.6%, 26.6%} | 0.38 |
Four-interest user | {6.6%, 6.6%, 6.6%, 20%, 20%, 20%, 20%} | 0.52 |
Five-interest user | {10%, 10%, 16%, 16%, 16%, 16%, 16%} | 0.63 |
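The reference distributions in Table 2 follow one rule: k main fields share 80% equally and the remaining 7 - k fields share 20% equally. A minimal sketch of that rule, with a hypothetical helper name and default parameters chosen to match the text:

```python
from typing import List


def reference_distribution(main: int, fields: int = 7, main_share: float = 0.8) -> List[float]:
    """Ascending shares: `main` fields split main_share, the rest split the remainder."""
    if not 0 < main < fields:
        raise ValueError("need 0 < main < fields")
    minor = fields - main
    return [(1.0 - main_share) / minor] * minor + [main_share / main] * main


for k in range(1, 6):
    print(k, [round(100 * share, 1) for share in reference_distribution(k)])
```

The shares are returned in ascending order so that they can be accumulated directly into a Lorenz vector.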
The users' Gini coefficient frequencies in Fig. 6 are then regrouped according to the five reference-user standards above, and the resulting distribution of user Gini coefficients is shown in Fig. 7.
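Regrouping by the reference standards amounts to counting how many users fall between consecutive reference Gini coefficients. The sketch below uses `bisect` for that; the cutoff values are the reference coefficients read from Table 2, the sample coefficients are made up, and the buckets are deliberately left unlabeled because the mapping from intervals to interest counts is not restated here.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Sequence


def bucket_counts(ginis: Sequence[float], cutoffs: Sequence[float]) -> Dict[int, int]:
    """Count values per interval; bucket i holds cutoffs[i-1] <= g < cutoffs[i]."""
    ordered = sorted(cutoffs)
    counts = Counter(bisect_right(ordered, g) for g in ginis)
    return {i: counts.get(i, 0) for i in range(len(ordered) + 1)}


cutoffs = [0.14, 0.24, 0.38, 0.52, 0.63]  # reference Gini coefficients read from Table 2
sample = [0.10, 0.30, 0.35, 0.45, 0.70]   # made-up Gini coefficients
print(bucket_counts(sample, cutoffs))
```

Five cutoffs produce six buckets, which matches the six user groups reported for Fig. 7.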
It can be seen that the Gini coefficient classifies user interest patterns effectively. Of the 943 users examined, the largest group, about 44.5%, has 3 to 4 interest fields; 24.4% of users have 4 to 5 main interests, 22.6% have 2 to 3 interests and 6.5% have only 1 to 2 main interests; 1.2% are of the absolute single-interest type, and only a very small group, 0.8%, has more than 5 fields of interest.
The experimental data above show that the Gini coefficient discriminates well between user interest patterns: the patterns can be observed qualitatively through the Lorenz curve and analysed quantitatively through the Gini coefficient.
The embodiments above only illustrate the present invention and do not limit it. A person of ordinary skill in the relevant art may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.
Claims (1)
1. A method for classifying user interest patterns based on a Gini coefficient measure, characterized by comprising the following steps:
S1: build a user interest model on the vector space model. The set of user interest fields is T = {interest 1, interest 2, ..., interest N}; for any user, the user interest model can be expressed as U = {<interest 1, interest degree 1>, ..., <interest N, interest degree N>};
S2: sort the user interest model in ascending order of interest degree; the sorted user interest degrees are:
$$U' = \{w'_1, w'_2, \ldots, w'_n\}, \qquad w'_1 \le w'_2 \le \cdots \le w'_n$$
where n = N is the number of interest fields;
S3: convert the interest degrees $w'_i$ from step S2 into percentages of the user's own total interest degree:
$$U^{\%} = \{p_1, p_2, \ldots, p_n\}, \qquad p_i = \frac{w'_i}{\sum_{j=1}^{n} w'_j}$$
S4: from $U^{\%}$, compute for each user the vector $U^{Gini} = \{w_1, w_2, \ldots, w_n\}$ used to generate the Lorenz curve, wherein:
$$w_i = \sum_{k=1}^{i} p_k, \qquad 1 \le i \le n$$
S5: with the values of $U^{Gini}$ as ordinates and the fields ordered from low to high interest degree as abscissa, draw the Lorenz curve of the user's interest pattern, calculate the Gini coefficient, and classify the user's interest pattern by the Gini coefficient;
in step S5, the Gini coefficient is calculated as follows:
$$G = 1 - \frac{1}{n}\left(2\sum_{i=1}^{n-1} w_i + 1\right)$$
the smaller the difference between the Gini coefficients of two users, the more similar their interest patterns.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210133502.2A CN102693335B (en) | 2012-04-28 | 2012-04-28 | Method for classifying user interest patterns based on a Gini coefficient measure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210133502.2A CN102693335B (en) | 2012-04-28 | 2012-04-28 | Method for classifying user interest patterns based on a Gini coefficient measure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102693335A CN102693335A (en) | 2012-09-26 |
CN102693335B (en) | 2016-10-05 |
Family
ID=46858767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210133502.2A Expired - Fee Related CN102693335B (en) | 2012-04-28 | 2012-04-28 | Method for classifying user interest patterns based on a Gini coefficient measure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102693335B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105792006B (en) * | 2016-03-04 | 2019-10-08 | 广州酷狗计算机科技有限公司 | Interactive information display methods and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU4057600A (en) * | 1999-03-31 | 2000-10-16 | Rosetta Inpharmatics, Inc. | Methods for the identification of reporter and target molecules using comprehensive gene expression profiles |
-
2012
- 2012-04-28 CN CN201210133502.2A patent/CN102693335B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN102693335A (en) | 2012-09-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20161005 |