Article
DOI: 10.1145/775047.775084

Web site mining: a new way to spot competitors, customers and suppliers in the world wide web

Published: 23 July 2002

Abstract

When automatically extracting information from the World Wide Web, most established methods focus on spotting single HTML documents. However, the problem of spotting complete web sites has not yet been handled adequately, in spite of its importance for various applications. Therefore, this paper discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing a very intuitive approach and its weaknesses. This approach treats a web site as one large HTML document and applies the well-known methods for page classification. Next, we show how accuracy can be improved by employing a preprocessing step which assigns each occurring web page to its most likely topic. The determined topics then represent the information the web site contains and can be used to classify it more accurately. We pursue this in two directions. First, we apply well-established classification algorithms to a feature space of occurring topics. The second direction treats a site as a tree of occurring topics and uses a Markov tree model for further classification. To improve the efficiency of this approach, we additionally introduce a powerful pruning method that reduces the number of considered web pages. Our experiments show the superiority of the Markov tree approach regarding classification accuracy. In particular, we demonstrate that the use of our pruning method not only reduces the processing time, but also improves the classification accuracy.
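
To make the two directions in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page has already been assigned a topic by some page classifier, represents a site as a hypothetical `(topic, children)` tuple, and models the tree with first-order parent-to-child topic transitions, Laplace smoothing, and uniform class priors. All of these choices, and every name in the code, are illustrative assumptions; the pruning method is not sketched because the abstract gives no details about it.

```python
import math
from collections import Counter

# Hypothetical site representation: (topic, [child sites]).
# Page topics are assumed to come from a separate page classifier.

def edges(site, parent="ROOT"):
    """Yield (parent_topic, child_topic) pairs; a virtual ROOT precedes the start page."""
    topic, children = site
    yield parent, topic
    for child in children:
        yield from edges(child, topic)

def topic_histogram(site):
    """First direction: relative topic frequencies, usable as a feature vector."""
    counts = Counter(child for _, child in edges(site))
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}

class MarkovTreeSketch:
    """Second direction (assumed form): per-class parent-to-child topic transitions."""

    def __init__(self, topics, alpha=1.0):
        self.n_topics = len(topics)
        self.alpha = alpha      # Laplace smoothing constant (assumption)
        self.counts = {}        # label -> {parent_topic: Counter(child_topic)}

    def fit(self, sites, labels):
        for site, label in zip(sites, labels):
            table = self.counts.setdefault(label, {})
            for parent, child in edges(site):
                table.setdefault(parent, Counter())[child] += 1
        return self

    def log_likelihood(self, site, label):
        table = self.counts[label]
        total = 0.0
        for parent, child in edges(site):
            row = table.get(parent, Counter())
            numerator = row[child] + self.alpha
            denominator = sum(row.values()) + self.alpha * self.n_topics
            total += math.log(numerator / denominator)
        return total

    def predict(self, site):
        # Uniform class priors are assumed, so the likelihood alone decides.
        return max(self.counts, key=lambda label: self.log_likelihood(site, label))

# Toy data, invented purely for illustration.
topics = ["home", "products", "order", "contact",
          "research", "publications", "teaching"]
shop = ("home", [("products", [("order", [])]), ("contact", [])])
shop2 = ("home", [("products", []), ("order", [])])
uni = ("home", [("research", [("publications", [])]), ("teaching", [])])
uni2 = ("home", [("teaching", []), ("research", [])])

model = MarkovTreeSketch(topics).fit(
    [shop, shop2, uni, uni2],
    ["company", "company", "university", "university"],
)
unseen = ("home", [("products", [("order", [])])])
print(model.predict(unseen))
```

The histogram corresponds to the "feature space of occurring topics" and discards link structure, whereas the transition model scores which topics follow which along the site tree. That structural signal is the part the abstract credits with better accuracy.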


        Published In

        KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
        July 2002
        719 pages
        ISBN: 158113567X
        DOI: 10.1145/775047
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Markov classifiers
        2. web content mining
        3. web site classification
        4. web site mining

        Qualifiers

        • Article

        Conference

        KDD02

        Acceptance Rates

        KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
        Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

        Cited By

        • (2017) Mining the information architecture of the WWW using automated website boundary detection. Web Intelligence 15:4 (269-290). DOI: 10.3233/WEB-170365. Online publication date: 20-Nov-2017
        • (2017) Quantitative evaluation of web metrics for automatic genre classification of web pages. International Journal of System Assurance Engineering and Management 8:S2 (1567-1579). DOI: 10.1007/s13198-017-0629-1. Online publication date: 27-May-2017
        • (2015) Enhancing Predictive Analytics for Anti-Phishing by Exploiting Website Genre Information. Journal of Management Information Systems 31:4 (109-157). DOI: 10.1080/07421222.2014.1001260. Online publication date: 15-Apr-2015
        • (2015) A Framework to Harvest Page Views of Web for Banner Advertising. Proceedings of the 4th International Conference on Big Data Analytics - Volume 9498 (57-68). DOI: 10.1007/978-3-319-27057-9_4. Online publication date: 15-Dec-2015
        • (2014) Techniques for data-driven curriculum analysis. Proceedings of the Fourth International Conference on Learning Analytics And Knowledge (148-157). DOI: 10.1145/2567574.2567591. Online publication date: 24-Mar-2014
        • (2014) A Dynamic Approach to the Website Boundary Detection Problem Using Random Walks. Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02 (9-14). DOI: 10.1109/WI-IAT.2014.74. Online publication date: 11-Aug-2014
        • (2014) Identifying website communities in mobile internet based on affinity measurement. Computer Communications 41 (22-30). DOI: 10.1016/j.comcom.2013.12.008. Online publication date: 1-Mar-2014
        • (2012) Detecting Fake Medical Web Sites Using Recursive Trust Labeling. ACM Transactions on Information Systems 30:4 (1-36). DOI: 10.1145/2382438.2382441. Online publication date: 1-Nov-2012
        • (2012) Classifying websites into non-topical categories. Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery (364-377). DOI: 10.1007/978-3-642-32584-7_30. Online publication date: 3-Sep-2012
        • (2011) Query-Sets++. Proceedings of the 18th international conference on String processing and information retrieval (129-134). DOI: 10.5555/2051073.2051086. Online publication date: 17-Oct-2011
