A hybrid system for German encyclopedia alignment

Roman Kern¹,
Christin Seifert¹ &
Michael Granitzer¹,²

Abstract

Collaboratively created online encyclopedias have become increasingly popular. In completeness especially, they have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge and started an initiative to merge their corpora into a single, more complete encyclopedia. The crucial step in this merging process is the alignment of articles. We have developed a two-step hybrid system that provides highly accurate alignments with low manual effort. First, we apply an automatic alignment algorithm based on information retrieval. Second, articles with a low confidence score are revised using a manual alignment scheme carefully designed for quality assurance. Our evaluation shows that a combination of weighting and ranking techniques that use different facets of the encyclopedia articles effectively reduces the number of necessary manual alignments. Further, the setup of the manual alignment turned out to be robust against inter-indexer inconsistencies. As a result, the developed system enabled us to align four encyclopedias with high accuracy and low effort.
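
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not specify the scoring function or the confidence measure, so the TF-IDF cosine similarity, the `threshold` value, and all names used here (`align`, `tfidf_vectors`, the article ids) are assumptions made purely for illustration. Each source article is matched to its best-scoring target article; matches below the threshold go to a manual review queue together with a few ranked candidates.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use proper German NLP.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def align(source, target, threshold=0.5, n_candidates=3):
    """Split source articles into automatic alignments and a manual queue.

    source, target: dicts mapping article id -> article text.
    threshold: hypothetical confidence cut-off (assumption, not from the paper).
    """
    s_ids, t_ids = list(source), list(target)
    vecs = tfidf_vectors([source[i] for i in s_ids] + [target[j] for j in t_ids])
    s_vecs, t_vecs = vecs[:len(s_ids)], vecs[len(s_ids):]

    automatic, manual = {}, []
    for sid, sv in zip(s_ids, s_vecs):
        # Rank all target articles by similarity, best first.
        ranked = sorted(
            ((cosine(sv, tv), tid) for tid, tv in zip(t_ids, t_vecs)),
            reverse=True,
        )
        if ranked and ranked[0][0] >= threshold:
            automatic[sid] = ranked[0][1]
        else:
            # Low confidence: hand the top candidates to a human indexer.
            manual.append((sid, [tid for _, tid in ranked[:n_candidates]]))
    return automatic, manual


if __name__ == "__main__":
    source = {
        "a1": "Graz city in Austria on the river Mur",
        "a2": "quantum physics particle",
    }
    target = {
        "b1": "Graz Austria city river Mur",
        "b2": "Berlin capital of Germany",
    }
    automatic, manual = align(source, target)
    print("automatic:", automatic)
    print("manual queue:", manual)
```

Raising the threshold in such a setup trades more manual work for fewer wrong automatic alignments, which is the balance the abstract describes.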

Author information

Authors and Affiliations

Graz University of Technology, Knowledge Management Institute, Inffeldgasse 21a, 8010, Graz, Austria
Roman Kern, Christin Seifert & Michael Granitzer
Know-Center GmbH and Graz University of Technology, Knowledge Management Institute, Inffeldgasse 21a, 8010, Graz, Austria
Michael Granitzer

Corresponding author

Correspondence to Roman Kern.

Additional information

This article is a substantially revised and extended version of an article titled “German Encyclopedia Alignment Based on Information Retrieval Techniques”, which originally appeared in the Proceedings of the 14th European Conference on Digital Libraries (ECDL 2010).

About this article

Cite this article

Kern, R., Seifert, C. & Granitzer, M. A hybrid system for German encyclopedia alignment. Int J Digit Libr 11, 75–89 (2010). https://doi.org/10.1007/s00799-011-0069-5

Published: 01 June 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s00799-011-0069-5
