DOI: 10.1145/1150402.1150433

Hierarchical topic segmentation of websites

Published: 20 August 2006

Abstract

In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.
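
The abstract does not spell out the cost measures or the algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes each page carries a single hard topic label, charges a page the negative log-probability of its label under the label distribution of the subtree that serves it (the KL divergence from a point mass), and adds a fixed, hypothetical `segment_cost` for every segment root that is opened. The function name `segment_site` and the example URLs are likewise illustrative assumptions. A dynamic program over (directory, nearest chosen ancestor) pairs then picks the cheapest set of segment roots.

```python
import math
from collections import Counter, defaultdict
from functools import lru_cache


def segment_site(pages, segment_cost=1.0):
    """Pick segment roots in a URL tree (illustrative, not the paper's method).

    pages: dict mapping a URL path such as "/sports/nba/a.html" to one topic label.
    Returns (total_cost, sorted list of directory paths chosen as segment roots).
    """
    children = defaultdict(set)    # directory -> set of child directories
    labels_at = defaultdict(list)  # directory -> labels of pages directly inside it
    for path, label in pages.items():
        node = ()                  # the site root is the empty tuple
        for part in [p for p in path.split("/") if p][:-1]:
            children[node].add(node + (part,))
            node = node + (part,)
        labels_at[node].append(label)

    subtree = {}                   # directory -> label counts over its whole subtree

    def count(node):
        total = Counter(labels_at[node])
        for child in children[node]:
            total += count(child)
        subtree[node] = total
        return total

    count(())

    def page_cost(node, server):
        # Cost of the pages directly in `node` when segment root `server` serves them.
        dist = subtree[server]
        total = sum(dist.values())
        return sum(-math.log(dist[label] / total) for label in labels_at[node])

    def served(node, server):
        # Cost of the subtree at `node` if `node`'s own pages are served by `server`.
        cost, roots = page_cost(node, server), ()
        for child in sorted(children[node]):
            child_cost, child_roots = best(child, server)
            cost += child_cost
            roots += child_roots
        return cost, roots

    @lru_cache(maxsize=None)
    def best(node, server):
        # Either keep `node` inside the segment of `server`, or open a new segment here.
        keep_cost, keep_roots = served(node, server)
        open_cost, open_roots = served(node, node)
        open_cost += segment_cost
        if open_cost < keep_cost:
            return open_cost, (node,) + open_roots
        return keep_cost, keep_roots

    cost, roots = served((), ())   # the site root is always a segment root
    paths = sorted("/" + "/".join(node) for node in ((),) + roots)
    return segment_cost + cost, paths


pages = {
    "/sports/nba/a.html": "sports", "/sports/nba/b.html": "sports",
    "/sports/nfl/c.html": "sports", "/sports/nfl/d.html": "sports",
    "/news/world/e.html": "news", "/news/world/f.html": "news",
    "/news/local/g.html": "news", "/news/local/h.html": "news",
}
total, roots = segment_site(pages, segment_cost=1.0)
print(roots)  # ['/', '/news', '/sports']
```

In this toy site, opening `/sports` and `/news` makes every segment topically pure, while the deeper directories are not worth a segment of their own. Raising `segment_cost` far enough collapses the result to the single root segment, which is the trade-off a choice of cost measure controls.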



    Published In

    KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2006
    986 pages
    ISBN: 1595933395
    DOI: 10.1145/1150402

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. KL-distance
    2. classification
    3. facility location
    4. gain ratio
    5. tree partitioning
    6. website hierarchy
    7. website segmentation
