Fine-grained document clustering via ranking and its application to social media analytics

Taufik Sutanto¹ &
Richi Nayak²

Abstract

Extracting valuable insights from a large volume of unstructured data such as text through clustering analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of the data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that represents heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages the capability of search engines to handle big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are made efficiently based on an optimal selection from a small subset of loci instead of the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.
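The general idea of clustering via ranking can be illustrated with a minimal sketch. It is not the FGCR algorithm: the abstract does not specify how loci are computed, so the code below makes several assumptions. A toy in-memory TF-IDF cosine ranker stands in for a real search engine, and the summed ranking scores of already-clustered neighbours stand in for loci. The names `build_index`, `search`, `cluster_by_ranking`, `top_k`, and `min_score` are all hypothetical. What the sketch does show is the key efficiency idea: each document is compared only against the few clusters that appear in its top-ranked results, not against every cluster.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build normalised TF-IDF vectors (a toy stand-in for a search engine index)."""
    term_counts = [Counter(d.lower().split()) for d in docs]
    doc_freq = Counter(term for tf in term_counts for term in tf)
    n = len(docs)
    vectors = []
    for tf in term_counts:
        # Smoothed IDF; always positive because (n + 1) > (df + 0.5).
        vec = {t: c * math.log((n + 1) / (doc_freq[t] + 0.5)) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def search(vectors, query_id, top_k):
    """Rank all other documents against the query document by cosine score."""
    query = vectors[query_id]
    hits = []
    for doc_id, vec in enumerate(vectors):
        if doc_id == query_id:
            continue
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if score > 0:
            hits.append((doc_id, score))
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits[:top_k]


def cluster_by_ranking(docs, top_k=3, min_score=0.1):
    """Assign each document using only the clusters found in its top-k ranked hits."""
    vectors = build_index(docs)
    labels = {}
    next_label = 0
    for doc_id in range(len(docs)):
        # Assumption: a cluster's candidate score is the summed ranking score
        # of its members that appear in this document's result list.
        candidates = defaultdict(float)
        for hit_id, score in search(vectors, doc_id, top_k):
            if hit_id in labels:
                candidates[labels[hit_id]] += score
        best = max(candidates, key=lambda c: (candidates[c], -c), default=None)
        if best is not None and candidates[best] >= min_score:
            labels[doc_id] = best
        else:
            # No sufficiently similar cluster among the hits: open a new one.
            labels[doc_id] = next_label
            next_label += 1
    return [labels[i] for i in range(len(docs))]


docs = [
    "flood warning river brisbane",
    "river flood brisbane evacuation",
    "pizza delivery discount coupon",
    "coupon pizza discount tonight",
]
print(cluster_by_ranking(docs))  # [0, 0, 1, 1]
```

In this toy version the number of clusters is not fixed in advance; it grows whenever a document's ranked results contain no suitable cluster. With a real search engine the `search` call would be an index lookup, which is where the scalability described in the abstract would come from.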

Author information

Authors and Affiliations

Syarif Hidayatullah State Islamic University Jakarta, Jakarta, Indonesia
Taufik Sutanto
Queensland University of Technology (QUT), Brisbane, Australia
Richi Nayak

Corresponding authors

Correspondence to Taufik Sutanto or Richi Nayak.

About this article

Cite this article

Sutanto, T., Nayak, R. Fine-grained document clustering via ranking and its application to social media analytics. Soc. Netw. Anal. Min. 8, 29 (2018). https://doi.org/10.1007/s13278-018-0508-z

Received: 12 December 2017
Revised: 05 March 2018
Accepted: 31 March 2018
Published: 07 April 2018
DOI: https://doi.org/10.1007/s13278-018-0508-z
