Abstract
Despite the recent growth in large-scale, distributed streaming data processing, there are currently limited options for flexible clustering of streaming data on industrial production frameworks such as Mapreduce and Spark. The clustering methods being used by practitioners on such systems do not respond rapidly to new data, or do not adjust the number of clusters appropriately as more data are processed. This problem is particularly acute for unstructured data like text and other non-enumerated types that are common in log and message streams and not analyzed at scale for precisely this reason. To address such issues, this paper proposes a method for hierarchical clustering using adaptive hash (AdaHash) values that can be re-calculated during a periodic batch process and used for subsequent streaming processing at the speed of data arrival, assuming sufficient distributed compute resources. We demonstrate that this method is as fast as other (optimal) hashing methods while enabling an adaptive hash function on each batch cycle within a lambda architecture computing framework.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Apache spark mllib. http://spark.apache.org/docs/latest/ml-guide.html (2017). Accessed 15 Jan 2017
Marz, N., Warren, J.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st edn. Manning Publications Co., Greenwich (2015)
Fagin, R., Nievergelt, J., Pippenger, N., Strong, H.R.: Extendible hashing—a fast access method for dynamic files. ACM Trans. Database Syst. 4(3), 315 (1979). https://doi.org/10.1145/320083.320092
Lu, P., Chen, G., Ooi, B.C., Vo, H.T., Wu, S.: ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]. Proc. VLDB Endow. 7(14), 1797 (2014). https://doi.org/10.14778/2733085.2733087
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, 6–8 Dec 2004, pp. 137–150. http://www.usenix.org/events/osdi04/tech/dean.html (2004)
Zaharia, M.: An architecture for fast and general data processing on large clusters. Ph.D. thesis, EECS Department, University of California, Berkeley. (2014). http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html. Accessed 11 Sept 2017
Apache spark. http://spark.apache.org/ (2017). Accessed 20 Jan 2017
Zadeh, J., Apostolopoulos, G., Tryfonas, C., Sudhakar, M.: Defense at scale: building a central nervous system for the SOC. Blackhat 2015, 1–8 (2015)
Databricks spark scikit-learn. https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html (2016). Accessed 22 Jan 2017
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design and Implementation—Volume 6 (USENIX Association, Berkeley, CA, USA, 2004), OSDI’04, pp. 10–10. (2004). http://dl.acm.org/citation.cfm?id=1251254.1251264. Accessed 11 Oct 2017
Okasaki, C.: Purely Functional Data Structures. Cambridge University Press, Cambridge (1999)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23–26 May 1998, pp. 604–613. https://doi.org/10.1145/276698.276876 (1998)
Apache spark mllib lsh. https://spark.apache.org/docs/2.1.0/ml-features.html (2017). Accessed 21 Nov 2017
Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for Similarity Search: A Survey. CoRR arXiv:1408.2927 (2014)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Upper Saddle River (1988)
Aggarwal, C.C., Reddy, C.K. (eds.): Data Clustering: Algorithms and Applications. CRC Press, Boca Raton (2014). http://www.crcpress.com/product/isbn/9781466558212. Accessed 11 Oct 2017
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 103–114 (1996)
Ghesmoune, M., Lebbah, M., Azzag, H.: A new growing neural gas for clustering data streams. Neural Netw. 78, 36 (2016). https://doi.org/10.1016/j.neunet.2016.02.003
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.P.L.F., Gama, J.A.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–13 (2013). https://doi.org/10.1145/2522968.2522981
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases—Volume 29 (VLDB Endowment, 2003), VLDB ’03, pp. 81–92. (2003). http://dl.acm.org/citation.cfm?id=1315451.1315460. Accessed 11 Sept 2017
Cao, F., Estert, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 328–339. https://doi.org/10.1137/1.9781611972764.29 (2006)
Backhoff, O., Ntoutsi, E.: Scalable online-offline stream clustering in apache spark. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 37–44 (2016)
Thornthwaite, W.: Dimensional modeling basics get flexible real-world dimensions with analysis services 2005. SQL Server Pro (2006)
Apache common log format. https://httpd.apache.org/docs/1.3/logs.html (2004). Accessed 13 Mar 2017
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable K-means++. Proc. VLDB Endow. 5(7), 622 (2012)
Acknowledgements
The authors would like to thank Ravi Srinivasan for providing valuable feedback on the mathematical framing of the model.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Teffer, D., Srinivasan, R. & Ghosh, J. AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks. Int J Data Sci Anal 8, 257–267 (2019). https://doi.org/10.1007/s41060-018-0145-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s41060-018-0145-7