Computing similarity between items in a digital library of cultural heritage

Published: 09 January 2013

Abstract

Large amounts of cultural heritage content have now been digitized and are available in digital libraries. However, these collections are often unstructured and difficult to navigate. Automatic techniques for identifying similar items in these collections could improve navigation, since they would allow items that are implicitly connected to be linked together and sets of similar items to be clustered. Europeana is a large digital library containing more than 20 million digital objects from a set of cultural heritage providers throughout Europe. The diverse nature of this collection means that the items do not have standard metadata to assist navigation.
A range of methods for computing the similarity between pairs of texts are applied to metadata records in Europeana in order to estimate the similarity between items. Various methods for computing similarity have been proposed and can be classified into two main approaches: (1) knowledge-based approaches, which make use of external knowledge sources, and (2) corpus-based approaches, which rely on analyzing the frequency distributions of words in documents. Both techniques are evaluated against manual judgements obtained for this study and against a multiple-choice test created from manually generated categories in cultural heritage collections. We find that a combination of corpus-based and knowledge-based approaches provides the best results in both experiments.
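
To make the corpus-based idea concrete, the following is a minimal sketch of one common word-frequency measure: cosine similarity between the term-frequency vectors of two short texts. It is an illustration only and is not claimed to be the method evaluated in the article. The function names (`tokenize`, `cosine_similarity`) and the sample metadata strings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    freq_a = Counter(tokenize(text_a))
    freq_b = Counter(tokenize(text_b))
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical metadata snippets, invented for illustration.
record_a = "Oil painting of a harbour at sunset"
record_b = "Watercolour painting of a harbour"
record_c = "Bronze coin from the Roman period"

print(cosine_similarity(record_a, record_b))  # records sharing several words
print(cosine_similarity(record_a, record_c))  # records sharing no words
```

A measure of this kind sees only surface word overlap, so two records that describe the same object with different vocabulary would score low. That limitation is one reason to also consider knowledge-based measures that draw on external sources.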

Published In

Journal on Computing and Cultural Heritage, Volume 5, Issue 4
December 2012
87 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/2399180
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2013
Accepted: 01 September 2012
Revised: 01 July 2012
Received: 01 December 2011
Published in JOCCH Volume 5, Issue 4

Author Tags

  1. Digital libraries
  2. Europeana
  3. semantic similarity

Qualifiers

  • Research-article
  • Research
  • Refereed
