Article

Robust methodologies for modeling web click distributions

Published: 08 May 2007

Abstract

Metrics such as click counts are vital to online businesses, but their measurement has been problematic because of the inclusion of high-variance robot traffic. We posit that by applying statistical methods more rigorous than those employed to date, we can build a robust model of the distribution of clicks, from which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power-law distributions, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten data sets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website.
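As a rough illustration of the model-selection idea described above, the following sketch compares a pure Zipf fit with a single-component Zipf-Mandelbrot fit using Schwarz's BIC. It is not the authors' procedure: the truncated support, the grid-search fit, the synthetic histogram, and all function names are assumptions made for the example, and it fits single components only, not the mixture the abstract reports.

```python
import math

def zipf_mandelbrot_pmf(n_max, s, q):
    """Truncated Zipf-Mandelbrot pmf on k = 1..n_max, p(k) proportional to (k + q)^(-s).
    Setting q = 0 gives a plain (truncated) Zipf distribution."""
    weights = [(k + q) ** (-s) for k in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(counts, pmf):
    """counts[k-1] is the number of users observed with exactly k clicks."""
    return sum(c * math.log(p) for c, p in zip(counts, pmf) if c > 0)

def bic(loglik, n_params, n_obs):
    """Schwarz's criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * loglik

def fit_by_grid(counts, s_grid, q_grid):
    """Crude maximum-likelihood fit by exhaustive grid search (assumption:
    a real analysis would use a numerical optimizer instead)."""
    best = None
    for s in s_grid:
        for q in q_grid:
            ll = log_likelihood(counts, zipf_mandelbrot_pmf(len(counts), s, q))
            if best is None or ll > best[0]:
                best = (ll, s, q)
    return best

# Synthetic histogram (hypothetical data, not from the paper): expected counts
# for 10,000 users drawn from a Zipf-Mandelbrot law with s = 2, q = 3.
true_pmf = zipf_mandelbrot_pmf(50, 2.0, 3.0)
counts = [round(10000 * p) for p in true_pmf]
n_obs = sum(counts)

s_grid = [0.5 + 0.05 * i for i in range(51)]  # exponents 0.5 .. 3.0
q_grid = [0.25 * i for i in range(25)]        # shifts 0.0 .. 6.0

ll_zipf, s_zipf, _ = fit_by_grid(counts, s_grid, [0.0])  # 1 free parameter
ll_zm, s_zm, q_zm = fit_by_grid(counts, s_grid, q_grid)  # 2 free parameters

bic_zipf = bic(ll_zipf, 1, n_obs)
bic_zm = bic(ll_zm, 2, n_obs)
print("Zipf BIC:", bic_zipf, "Zipf-Mandelbrot BIC:", bic_zm)
```

The extra parameter q is only worth keeping when the gain in log-likelihood outweighs the ln(n) penalty, which is the sense in which BIC guards against overly complex models.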
Backing up our quantitative analysis is a graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers. We plan to apply the results of this work to robot removal in web metrics business intelligence systems.
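The graphical check can be sketched as follows, assuming the cumulative plot is taken in its complementary form, P(X >= k), which the abstract does not specify. The click counts and function names are hypothetical; the resulting log-log points would be passed to any plotting tool and overlaid with the same transform of a fitted theoretical distribution.

```python
import math
from collections import Counter

def empirical_ccdf(samples):
    """Return sorted (value, P(X >= value)) pairs for positive samples."""
    n = len(samples)
    freq = Counter(samples)
    points = []
    remaining = n  # number of samples >= the current value
    for value in sorted(freq):
        points.append((value, remaining / n))
        remaining -= freq[value]
    return points

def to_log_log(points):
    """Map (value, probability) pairs to log10 coordinates for plotting."""
    return [(math.log10(v), math.log10(p)) for v, p in points]

# Hypothetical per-user click counts, including one large outlier.
clicks = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
ccdf = empirical_ccdf(clicks)
log_points = to_log_log(ccdf)

for (v, p), (lv, lp) in zip(ccdf, log_points):
    print(v, p, round(lv, 3), round(lp, 3))
```

Because each point aggregates every observation at or above its value, a single extreme sample such as the 40 here only adds one point at the lowest probability rather than distorting the rest of the curve.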


Information

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

  1. distribution fitting

Qualifiers

  • Article

Conference

WWW'07: 16th International World Wide Web Conference
May 8 - 12, 2007
Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2021) A complex systems perspective of news recommender systems: Guiding emergent outcomes with feedback models. PLOS ONE 16:1 (e0245096). DOI: 10.1371/journal.pone.0245096. Online publication date: 7-Jan-2021
  • (2016) Collaborative content caching in wireless edge with SDN. Proceedings of the 1st Workshop on Content Caching and Delivery in Wireless Networks (1-7). DOI: 10.1145/2836183.2836189. Online publication date: 1-Dec-2016
  • (2016) Exploiting Path Diversity for Thwarting Pollution Attacks in Named Data Networking. IEEE Transactions on Information Forensics and Security 11:9 (2077-2090). DOI: 10.1109/TIFS.2016.2574307. Online publication date: 1-Sep-2016
  • (2015) Aggregation Bias in Sponsored Search Data. Marketing Science 34:1 (59-77). DOI: 10.1287/mksc.2014.0884. Online publication date: 1-Jan-2015
  • (2013) Empirical Insights on the Effect of User-Generated Website Features on Micro-Conversions. International Journal of E-Business Research 9:4 (33-46). DOI: 10.4018/ijebr.2013100103. Online publication date: 1-Oct-2013
  • (2013) Optimal Bidding in Multi-Item Multislot Sponsored Search Auctions. Operations Research 61:4 (855-873). DOI: 10.1287/opre.2013.1187. Online publication date: Aug-2013
  • (2013) Ads by whom? ads about what? Proceedings of the first ACM conference on Online social networks (155-164). DOI: 10.1145/2512938.2512950. Online publication date: 7-Oct-2013
  • (2013) Revenue maximizing itemset construction for online shopping services. Industrial Management & Data Systems 113:1 (96-116). DOI: 10.1108/02635571311289683. Online publication date: 25-Jan-2013
  • (2008) Characterizing typical and atypical user sessions in clickstreams. Proceedings of the 17th international conference on World Wide Web (885-894). DOI: 10.1145/1367497.1367617. Online publication date: 21-Apr-2008
  • (2008) Finding high-quality content in social media. Proceedings of the 2008 International Conference on Web Search and Data Mining (183-194). DOI: 10.1145/1341531.1341557. Online publication date: 11-Feb-2008
