Fine-grained document clustering via ranking and its application to social media analytics

  • Original Article
Social Network Analysis and Mining

Abstract

Extracting valuable insights from large volumes of unstructured data, such as text, through clustering analysis is central to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of the data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that covers heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages the ability of search engines to handle big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are made efficiently by selecting the best match from a small subset of loci rather than from the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.
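
The general idea of deciding a cluster from ranked search results, rather than from a comparison against every cluster, can be illustrated with a minimal sketch. This is not the authors' FGCR implementation (their code is linked in the Notes), and it does not reproduce their definition of loci. It assumes a toy in-memory inverted index with a simple TF-IDF score standing in for a real search engine, and the function names (build_index, search, cluster_by_ranking) and the top_k parameter are hypothetical. Each document is issued as a query, and only the clusters of its top-ranked, already-labelled neighbours are considered as candidates.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a tiny inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query, top_k):
    """Rank documents for a query with a simple TF-IDF score.

    Toy stand-in for a search engine's ranking function (an assumption).
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def cluster_by_ranking(docs, top_k=3):
    """Assign each document to the cluster favoured by its top-ranked hits.

    Only clusters of the top-ranked, already-labelled documents are
    candidates, so the decision never scans the full cluster set.
    """
    index = build_index(docs)
    labels = {}
    next_label = 0
    for doc_id, text in enumerate(docs):
        # Ask for one extra hit because the document usually retrieves itself.
        hits = search(index, len(docs), text, top_k + 1)
        votes = defaultdict(float)
        for hit_id, score in hits:
            if hit_id != doc_id and hit_id in labels:
                votes[labels[hit_id]] += score
        if votes:
            labels[doc_id] = max(votes, key=votes.get)
        else:
            # No labelled neighbour among the hits: open a new cluster.
            labels[doc_id] = next_label
            next_label += 1
    return [labels[i] for i in range(len(docs))]
```

In this sketch the number of candidate clusters per document is bounded by top_k, which is the property the abstract attributes to selecting from a small subset of loci. A real deployment would replace the toy index with an actual search engine and a proper ranking function.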

Notes

  1. https://github.com/taufikedys/FGCR.

  2. https://www.elastic.co.

Author information

Correspondence to Taufik Sutanto or Richi Nayak.

About this article

Cite this article

Sutanto, T., Nayak, R. Fine-grained document clustering via ranking and its application to social media analytics. Soc. Netw. Anal. Min. 8, 29 (2018). https://doi.org/10.1007/s13278-018-0508-z

  • DOI: https://doi.org/10.1007/s13278-018-0508-z
