Abstract
Scatter/Gather systems are increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their clustering and labeling methods do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, using comparable corpora instead. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tholpadi, G., Das, M.K., Bhattacharyya, C., Shevade, S. (2012). Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_33
DOI: https://doi.org/10.1007/978-3-642-28997-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science (R0)