[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/276304.276332acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free access

Enhanced hypertext categorization using hyperlinks

Published: 01 June 1998 Publication History

Abstract

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo!1 and the US Patent Database2. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.

References

[1]
K, All. Learning relational rules from relational data. Data Mining Seminar, IBM Almaden, San Jose, CA, Sept. 1997.
[2]
C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. A CM Transactions on information Systems, 1994. IBM Research Report RC18879.
[3]
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole, 1984. ISBN: 0-534-98054-6.
[4]
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases, in VLDB, Athens, Greece, Aug. 1997. Invlted submission to VLDB Journal, 1998.
[5]
C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Sixth World Wide Web Oor~ferer, ce, San Jose, CA, 1996.
[6]
R. Chellappa and A. fain. Markov Random Fields: Theory and Applications. Academic Press, 1993.
[7]
W. B. Croft and H. R. Turtle. A retrieval model for incorporating hypertext links. In A OM Hypertext, pages 213-224. ACM, 1989.
[8]
W. B. Croft and H. R. Turtle. Retrieval strategies for hypertext. Information Processing and Management, 29(3):313-324, 1993.
[9]
w. Deng and S. S. Iyengar. A new probabilistic relaxation scheme and its application to edge detection. IEEE Transactions of Pattern Analysis and Machine Intelligence, 18(4):432-437, Apr. 1996.
[10]
R. Duds and P. Hart. Pattern Classification and Scene Analysis. Wiley. New York 1973.
[11]
D. Bppsteln. Fmd,ng the k shortest paths. In Symposium on the Foundations of Computer Science. IEEE, 1994.
[12]
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In VLDB, Athens, Greece, 1997.
[13]
H. P. Frei and D. Steiger. Making use of hypertext links when retrieving information. In A CM European Conference on Hypertezt (ECHT), pages 102-111. ACM, Nov. 1992.
[14]
H. P. Frei and D. Steiger. The use of semantic links in hypertext information retrieval. Information Processing and Management, 31(1):1-13, 1995.
[15]
Y. Freund and D. Pen. Learning to model sequences generated by switching distributions. In Proceedings of the Eighth Annual A CM Conference on Computational Learning Theory (COLT), pages 41-50, 1995.
[16]
M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of A CM SIGIR 97, pages 246--255. ACM Press, 1997.
[17]
R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IBEE Transactions of Patter~n Analysis and Machine Intelligence, PAMI-5(3):267-287, May 1983.
[18]
D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14. Morgan-Kaufmann, July 1997. To appear.
[19]
K. L. Kwok. The use of titles and cited titles as document representations for automatic classification. Information Processing and Management, 11:201-206, 1975.
[20]
K. L. Kwok. A document-document similarity measure based on cited titles and probability theory and its application to relevance feedback retrieval. In C. J. van Ri~sbergen, editor, Research and Development in Information Retrieval, pages 221- 232. Cambridge University Press, 1984.
[21]
K. L. Kwok. A probabitistic theory of indexing and similarity measure based on cited and citing documents. Journal of the American Society for Information Science, 36(342-351), 1985.
[22]
K. L. Kwok. On the use of bibliographically related titles for the enhancements of document representations. Information Processing and Management, 24(2):123-131, 1988.
[23]
N. Lavrac and S. Dzeroski. Inductive Logic Programming - Techniques and Applications. Chichester, 1994.
[24]
D. Lucarella and A. Zanzi. Information retrieval from hypertext' an approach using plausible inference. Information Processing and Management, 29(3):P.99-312, 1993.
[25]
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast sea|able classifier for data mining. In Pr0c. of the Fifth Int'I Conference on E~tending Database Technology, Avignon, France, March 1996.
[26]
S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the Workshop on Algorithmic Learning Theory. Japanese Society for Artificial Intelligence, 1990.
[27]
M. Pazzani, C. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), Ithaca, NY, 1991. Morgan Kaufmann.
[28]
L. Pelkowit~. A continuous relaxation labeling algorithm for markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3):709-715, May 1990.
[29]
R. Quinlan. Learning logical definitions from relations. Machine Learning, 5, 3:239-266, 1990.
[30]
M. Sahami. Web classification using Bayesian nets. Personal communication, Oct. 1997.
[31]
G. Salton. Associative document retrieval techniques using bibliographic information. Journal o/the A CM, 10(4):440-457, Oct. 1963.
[32]
G. Salton and M. J. McGill. Introduction to ModeT'n Information Retrieval. McGraw-Hill, 1983.
[33]
G. Salton and Y. Zhang. Enhancement of text representations using related document titles. Information Processing and Management, 22(5):385-394, 1986.
[34]
j. Savoy. A learning scheme for information retrieval in hypertext. Information Processing and Management, 30(4):515-533, 1994.
[35]
J. Savoy. A new probabilistic scheme for information retrieval in hypertext. The New Review of I-Iype~.rned~a and Multi~'nedia, pages 107-134, 1995.
[36]
J. Savoy. An extended vector processing scheme for searching information in hypertext. Information Processing and Management, 32(2):155-170, Mar. 1996.
[37]
J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalableparallel classifier for data mining. In Proc. of the 2Zth Int'l Conference on Very Large Databases, Bombay, India, Sept 1996.
[38]
Y. Singer and W. W. Cohen. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996.
[39]
J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia, 4(3):12-20, 1997.
[40]
R. Weiss, B. Velez~ M. A. Sheldon, C. Nemprempre, P. Szilagyi, A. Duds, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In A OM Hyperte~t, Washington, DC, Mar. 1996.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1998

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS98
SIGMOD/PODS98: Special Interest Group on Management of Data
June 1 - 4, 1998
Washington, Seattle, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)207
  • Downloads (Last 6 weeks)27
Reflects downloads up to 01 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Graph-Assisted Bayesian Node ClassifiersIEEE Access10.1109/ACCESS.2023.324286611(23989-24002)Online publication date: 2023
  • (2022)Fair Collective Classification in Networked Data2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020610(1415-1424)Online publication date: 17-Dec-2022
  • (2022)The Road AheadLearning to Quantify10.1007/978-3-031-20467-8_7(121-123)Online publication date: 29-Dec-2022
  • (2022)The Quantification LandscapeLearning to Quantify10.1007/978-3-031-20467-8_6(103-120)Online publication date: 29-Dec-2022
  • (2022)Advanced TopicsLearning to Quantify10.1007/978-3-031-20467-8_5(87-101)Online publication date: 29-Dec-2022
  • (2022)Methods for Learning to QuantifyLearning to Quantify10.1007/978-3-031-20467-8_4(55-85)Online publication date: 29-Dec-2022
  • (2022)Evaluation of Quantification AlgorithmsLearning to Quantify10.1007/978-3-031-20467-8_3(33-54)Online publication date: 29-Dec-2022
  • (2022)Applications of QuantificationLearning to Quantify10.1007/978-3-031-20467-8_2(19-31)Online publication date: 29-Dec-2022
  • (2022)The Case for QuantificationLearning to Quantify10.1007/978-3-031-20467-8_1(1-17)Online publication date: 29-Dec-2022
  • (2020)Towards Efficient Retrieval of Top-k Entities in Systems of EngagementWeb Information Systems Engineering – WISE 202010.1007/978-3-030-62008-0_4(52-67)Online publication date: 21-Oct-2020
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media