DOI: 10.1145/775152.775178
Article

SemTag and seeker: bootstrapping the semantic web via automated semantic annotation

Published: 20 May 2003

Abstract

This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date.

We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.

Published In

WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN:1581136803
DOI:10.1145/775152
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automated semantic tagging
  2. data mining
  3. information retrieval
  4. large text datasets
  5. text analytics

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 4

Reflects downloads up to 25 Dec 2024.

Cited By

  • (2024) A Multilevel Interaction Network Framework for Multimodal Entity Linking. Natural Language Processing and Chinese Computing, pp. 343-355. DOI: 10.1007/978-981-97-9437-9_27. Online publication date: 1-Nov-2024
  • (2024) CAW: Confidence-Based Adaptive Weighted Model for Multi-modal Entity Linking. Artificial Neural Networks and Machine Learning – ICANN 2024, pp. 34-51. DOI: 10.1007/978-3-031-72347-6_3. Online publication date: 17-Sep-2024
  • (2023) Ad-Hoc Monitoring of COVID-19 Global Research Trends for Well-Informed Policy Making. ACM Transactions on Intelligent Systems and Technology 14:2, pp. 1-28. DOI: 10.1145/3576901. Online publication date: 21-Feb-2023
  • (2023) GDOM: An Immersive Experience of Intangible Heritage through Spatial Storytelling. Journal on Computing and Cultural Heritage 15:4, pp. 1-18. DOI: 10.1145/3498329. Online publication date: 14-Feb-2023
  • (2023) Learning Individualized Automatic Content Magnification in Gaze-based Interaction. 2023 IEEE International Symposium on Multimedia (ISM), pp. 282-286. DOI: 10.1109/ISM59092.2023.00054. Online publication date: 11-Dec-2023
  • (2022) Learning to Ask: Conversational Product Search via Representation Learning. ACM Transactions on Information Systems 41:2, pp. 1-27. DOI: 10.1145/3555371. Online publication date: 21-Dec-2022
  • (2022) A Generic Federated Recommendation Framework via Fake Marks and Secret Sharing. ACM Transactions on Information Systems 41:2, pp. 1-37. DOI: 10.1145/3548456. Online publication date: 21-Dec-2022
  • (2022) A Revisiting Study of Appropriate Offline Evaluation for Top-N Recommendation Algorithms. ACM Transactions on Information Systems 41:2, pp. 1-41. DOI: 10.1145/3545796. Online publication date: 21-Dec-2022
  • (2022) A Relative Information Gain-based Query Performance Prediction Framework with Generated Query Variants. ACM Transactions on Information Systems 41:2, pp. 1-31. DOI: 10.1145/3545112. Online publication date: 21-Dec-2022
  • (2022) Concept Annotation from Users Perspective: A New Challenge. Companion Proceedings of the Web Conference 2022, pp. 1180-1188. DOI: 10.1145/3487553.3524933. Online publication date: 25-Apr-2022
