Abstract
Collaboratively created online encyclopedias have become increasingly popular. Especially in terms of completeness, they have begun to surpass their printed counterparts. Two German publishers of traditional encyclopedias have reacted to this challenge and started an initiative to merge their corpora into a single, more complete encyclopedia. The crucial step in this merging process is the alignment of articles. We have developed a two-step hybrid system that provides highly accurate alignments with low manual effort. First, we apply an automatic alignment algorithm based on information retrieval. Second, articles with a low confidence score are revised using a manual alignment scheme carefully designed for quality assurance. Our evaluation shows that a combination of weighting and ranking techniques utilizing different facets of the encyclopedia articles effectively reduces the number of necessary manual alignments. Further, the setup of the manual alignment turned out to be robust against inter-indexer inconsistencies. As a result, the developed system enabled us to align four encyclopedias with high accuracy and low effort.
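The two-step idea described above (automatic alignment first, manual revision only for low-confidence cases) can be illustrated with a minimal Python sketch. It is not the system evaluated in the article: plain TF-IDF cosine similarity stands in for the weighting and ranking techniques, the best similarity score is used directly as the confidence score, and the names `align`, `tokenize`, `tfidf_vectors`, `cosine`, the threshold of 0.3, and the sample articles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (term -> weight) per token list."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF so that every term keeps a positive weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(source, target, threshold=0.3):
    """Step 1: pick the best-scoring target article for each source article.
    Step 2: queue every source article whose best score is below the
    (hypothetical) threshold for manual alignment."""
    src_titles, tgt_titles = list(source), list(target)
    vectors = tfidf_vectors(
        [tokenize(source[t]) for t in src_titles]
        + [tokenize(target[t]) for t in tgt_titles]
    )
    src_vecs = vectors[:len(src_titles)]
    tgt_vecs = vectors[len(src_titles):]

    auto, manual = {}, []
    for title, vec in zip(src_titles, src_vecs):
        best_title, best_score = max(
            ((t, cosine(vec, tv)) for t, tv in zip(tgt_titles, tgt_vecs)),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if best_score >= threshold:
            auto[title] = (best_title, best_score)
        else:
            manual.append(title)
    return auto, manual


# Invented sample data, for illustration only.
encyclopedia_a = {
    "Berlin": "capital city of germany on the river spree",
    "Quark": "elementary particle in physics",
}
encyclopedia_b = {
    "Berlin (Stadt)": "berlin is the capital city of germany",
    "Goethe": "german poet and writer",
}
auto, manual = align(encyclopedia_a, encyclopedia_b)
```

In this sketch, raising the threshold sends more articles to the manual queue, while lowering it accepts more automatic alignments; where that trade-off should sit would need to be determined empirically.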
Additional information
This article is a substantially revised and extended version of an article titled “German Encyclopedia Alignment Based on Information Retrieval Techniques”, which originally appeared in the Proceedings of the 14th European Conference on Digital Libraries (ECDL 2010).
Cite this article
Kern, R., Seifert, C. & Granitzer, M. A hybrid system for German encyclopedia alignment. Int J Digit Libr 11, 75–89 (2010). https://doi.org/10.1007/s00799-011-0069-5
DOI: https://doi.org/10.1007/s00799-011-0069-5