AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks

422 Accesses
7 Citations
Explore all metrics

Abstract

Despite the recent growth in large-scale, distributed streaming data processing, there are currently limited options for flexible clustering of streaming data on industrial production frameworks such as Mapreduce and Spark. The clustering methods being used by practitioners on such systems do not respond rapidly to new data, or do not adjust the number of clusters appropriately as more data are processed. This problem is particularly acute for unstructured data like text and other non-enumerated types that are common in log and message streams and not analyzed at scale for precisely this reason. To address such issues, this paper proposes a method for hierarchical clustering using adaptive hash (AdaHash) values that can be re-calculated during a periodic batch process and used for subsequent streaming processing at the speed of data arrival, assuming sufficient distributed compute resources. We demonstrate that this method is as fast as other (optimal) hashing methods while enabling an adaptive hash function on each batch cycle within a lambda architecture computing framework.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Getting Started with Spark

Performance Analysis of Adapting a MapReduce Framework to Dynamically Accommodate Heterogeneity

Scalable Blocking for Very Large Databases

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Apache spark mllib. http://spark.apache.org/docs/latest/ml-guide.html (2017). Accessed 15 Jan 2017
Marz, N., Warren, J.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st edn. Manning Publications Co., Greenwich (2015)
Google Scholar
Fagin, R., Nievergelt, J., Pippenger, N., Strong, H.R.: Extendible hashing—a fast access method for dynamic files. ACM Trans. Database Syst. 4(3), 315 (1979). https://doi.org/10.1145/320083.320092
Article Google Scholar
Lu, P., Chen, G., Ooi, B.C., Vo, H.T., Wu, S.: ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]. Proc. VLDB Endow. 7(14), 1797 (2014). https://doi.org/10.14778/2733085.2733087
Article Google Scholar
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, 6–8 Dec 2004, pp. 137–150. http://www.usenix.org/events/osdi04/tech/dean.html (2004)
Zaharia, M.: An architecture for fast and general data processing on large clusters. Ph.D. thesis, EECS Department, University of California, Berkeley. (2014). http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html. Accessed 11 Sept 2017
Apache spark. http://spark.apache.org/ (2017). Accessed 20 Jan 2017
Zadeh, J., Apostolopoulos, G., Tryfonas, C., Sudhakar, M.: Defense at scale: building a central nervous system for the SOC. Blackhat 2015, 1–8 (2015)
Google Scholar
Databricks spark scikit-learn. https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html (2016). Accessed 22 Jan 2017
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design and Implementation—Volume 6 (USENIX Association, Berkeley, CA, USA, 2004), OSDI’04, pp. 10–10. (2004). http://dl.acm.org/citation.cfm?id=1251254.1251264. Accessed 11 Oct 2017
Okasaki, C.: Purely Functional Data Structures. Cambridge University Press, Cambridge (1999)
MATH Google Scholar
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23–26 May 1998, pp. 604–613. https://doi.org/10.1145/276698.276876 (1998)
Apache spark mllib lsh. https://spark.apache.org/docs/2.1.0/ml-features.html (2017). Accessed 21 Nov 2017
Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for Similarity Search: A Survey. CoRR arXiv:1408.2927 (2014)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Upper Saddle River (1988)
MATH Google Scholar
Aggarwal, C.C., Reddy, C.K. (eds.): Data Clustering: Algorithms and Applications. CRC Press, Boca Raton (2014). http://www.crcpress.com/product/isbn/9781466558212. Accessed 11 Oct 2017
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 103–114 (1996)
Ghesmoune, M., Lebbah, M., Azzag, H.: A new growing neural gas for clustering data streams. Neural Netw. 78, 36 (2016). https://doi.org/10.1016/j.neunet.2016.02.003
Article Google Scholar
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.P.L.F., Gama, J.A.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–13 (2013). https://doi.org/10.1145/2522968.2522981
Article MATH Google Scholar
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases—Volume 29 (VLDB Endowment, 2003), VLDB ’03, pp. 81–92. (2003). http://dl.acm.org/citation.cfm?id=1315451.1315460. Accessed 11 Sept 2017
Chapter Google Scholar
Cao, F., Estert, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 328–339. https://doi.org/10.1137/1.9781611972764.29 (2006)
Backhoff, O., Ntoutsi, E.: Scalable online-offline stream clustering in apache spark. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 37–44 (2016)
Thornthwaite, W.: Dimensional modeling basics get flexible real-world dimensions with analysis services 2005. SQL Server Pro (2006)
Apache common log format. https://httpd.apache.org/docs/1.3/logs.html (2004). Accessed 13 Mar 2017
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable K-means++. Proc. VLDB Endow. 5(7), 622 (2012)
Article Google Scholar

Download references

Acknowledgements

The authors would like to thank Ravi Srinivasan for providing valuable feedback on the mathematical framing of the model.

Author information

Authors and Affiliations

JASK Labs, San Francisco, USA
Dean Teffer
Applied Research Laboratories, The University of Texas at Austin, Austin, USA
Ravi Srinivasan
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, USA
Joydeep Ghosh

Authors

Dean Teffer
View author publications
You can also search for this author in PubMed Google Scholar
Ravi Srinivasan
View author publications
You can also search for this author in PubMed Google Scholar
Joydeep Ghosh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dean Teffer.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Teffer, D., Srinivasan, R. & Ghosh, J. AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks. Int J Data Sci Anal 8, 257–267 (2019). https://doi.org/10.1007/s41060-018-0145-7

Download citation

Received: 26 May 2017
Accepted: 23 July 2018
Published: 01 August 2018
Issue Date: October 2019
DOI: https://doi.org/10.1007/s41060-018-0145-7

Abstract

Access this article

Subscribe and save

Buy Now