Abstract
Document clustering is a popular method for discovering useful information from text data. This paper proposes an innovative hybrid document clustering method based on the novel concepts of ranking, density and shared neighborhood. We utilize ranked documents generated from a search engine to effectively build a graph of shared relevant documents. The high density regions in the graph are processed to form initial clusters. The clustering decisions are further refined using the shared neighborhood information. Empirical analysis shows that the proposed method is able to produce accurate and efficient solution as compared to relevant benchmarking methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Anastasiu, D.C., Tagarelli, A., Karypis, G.: Document clustering: the next frontier. In: Aggarwal, C.C., Reddy, C.K. (eds.) Data Clustering: Algorithms and Applications, pp. 305–328 (2013)
Zhao, W., He, Q., Ma, H., Shi, Z.: Effective semi-supervised document clustering via active learning with instance-level constraints. KAIS 30, 569–587 (2012)
Tomašev, N., Radovanović, M., Mladenić, D., Ivanović, M.: Hubness-based clustering of high-dimensional data. In: Celebi, M.Emre (ed.) Partitional Clustering Algorithms, pp. 353–386. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09259-1_11
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
Ertöz, L., Steinbach, M., Kumar, V.: Finding topics in collections of documents: a shared nearest neighbor approach. Clustering and Information Retrieval. Network Theory and Applications, vol. 11, pp. 83–103. Springer, Boston (2003). https://doi.org/10.1007/978-1-4613-0227-8_3
Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. 100, 1025–1034 (1973)
Ertöz, L., Steinbach, M., Kumar, V.: Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: SIAM, pp. 47–58. SIAM (2003)
Sutanto, T., Nayak, R.: Semi-supervised document clustering via loci. In: Wang, J., Cellary, W., Wang, D., Wang, H., Chen, S.-C., Li, T. (eds.) WISE 2015. LNCS, vol. 9419, pp. 208–215. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26187-4_16
Broder, A., Garcia-Pueyo, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S.: Scalable k-means by ranked retrieval. In: 7th WSDM, pp. 233–242. ACM (2014)
Fuhr, N., Lechtenfeld, M., Stein, B., Gollub, T.: The optimum clustering framework: implementing the cluster hypothesis. Inf. Retr. 15, 93–115 (2012)
Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in search engines. In: 17th WWW, pp. 387–396. ACM (2008)
Jardine, N., van Rijsbergen, C.J.: The use of hierarchic clustering in information retrieval. Inf. Storage Retr. 7, 217–240 (1971)
Zhang, B., Li, H., Liu, Y., Ji, L., Xi, W., Fan, W., Chen, Z., Ma, W.-Y.: Improving web search results using affinity graph. In: 28th ACM SIGIR, pp. 504–511. ACM (2005)
Hou, J., Nayak, R.: The heterogeneous cluster ensemble method using hubness for clustering text documents. In: Lin, X., Manolopoulos, Y., Srivastava, D. (eds.) WISE 2013. LNCS, vol. 8180, pp. 102–110. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41230-1_9
Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007)
Hajek, B.: Adaptive transmission strategies and routing in mobile radio networks. Urbana 51, 61801 (1983)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Mohotti, W.A., Nayak, R. (2018). An Efficient Ranking-Centered Density-Based Document Clustering Method. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_35
Download citation
DOI: https://doi.org/10.1007/978-3-319-93040-4_35
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer ScienceComputer Science (R0)