Abstract
Extracting valuable insights from large volumes of unstructured data such as text through clustering analysis is central to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of the data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that covers heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages a search engine's ability to handle big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are made efficiently by selecting the best option from a small subset of loci instead of from the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR produces insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.
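The idea of deciding cluster membership from a short ranked list rather than from every cluster can be illustrated with a minimal sketch. It is not the FGCR algorithm itself: the class `TinyRanker`, the function `cluster_via_ranking`, and the parameters `top_k` and `min_score` are hypothetical names introduced here, the TF-IDF overlap score is a simple stand-in for a real search engine's ranking function, and the single-pass assignment is an assumed simplification.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: no stemming or stop words)."""
    return text.lower().split()


class TinyRanker:
    """Hypothetical stand-in for a search engine: IDF-weighted term overlap."""

    def __init__(self, docs):
        self.docs = [Counter(tokenize(d)) for d in docs]
        n_docs = len(docs)
        doc_freq = Counter(term for doc in self.docs for term in doc)
        self.idf = {t: math.log(1 + n_docs / c) for t, c in doc_freq.items()}

    def search(self, query_id, top_k):
        """Return up to top_k (score, doc_id) pairs ranked against one document."""
        query = self.docs[query_id]
        hits = []
        for doc_id, doc in enumerate(self.docs):
            if doc_id == query_id:
                continue
            score = sum(
                self.idf[t] * min(query[t], doc[t]) for t in query if t in doc
            )
            if score > 0:
                hits.append((score, doc_id))
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return hits[:top_k]


def cluster_via_ranking(docs, top_k=3, min_score=1.0):
    """Assign each document using only the clusters seen in its ranked list."""
    ranker = TinyRanker(docs)
    labels = {}
    next_label = 0
    for doc_id in range(len(docs)):
        # Candidate clusters ("loci" in this sketch): clusters of the
        # top-ranked neighbours that already have a label, weighted by
        # the sum of their ranking scores.
        loci = defaultdict(float)
        for score, other_id in ranker.search(doc_id, top_k):
            if other_id in labels:
                loci[labels[other_id]] += score
        best = max(loci.items(), key=lambda kv: kv[1], default=None)
        if best is not None and best[1] >= min_score:
            labels[doc_id] = best[0]
        else:
            # No sufficiently relevant candidate: open a new cluster.
            labels[doc_id] = next_label
            next_label += 1
    return [labels[i] for i in range(len(docs))]
```

In this sketch each document is compared against at most `top_k` ranked neighbours, so the cost of one assignment does not grow with the number of clusters, which is the property the abstract contrasts with centroid-based clustering.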
Cite this article
Sutanto, T., Nayak, R. Fine-grained document clustering via ranking and its application to social media analytics. Soc. Netw. Anal. Min. 8, 29 (2018). https://doi.org/10.1007/s13278-018-0508-z
DOI: https://doi.org/10.1007/s13278-018-0508-z