research-article

Distributed Classification of Text Streams: Limitations, Challenges, and Solutions

Authors:

Artem Trofimov,

Nikita Sokolov,

Mikhail Shavkunov,

Igor Kuralenok,

Boris Novikov

BIRTE 2019: Proceedings of Real-Time Business Intelligence and Analytics

Article No.: 2, Pages 1 - 6

https://doi.org/10.1145/3350489.3350491

Published: 26 August 2019

Abstract

Text stream classification is an important problem that is difficult to solve at scale. Batch processing systems, widely adopted for text classification tasks, cannot provide low latency. Distributed stream processing systems can offer low latency, but they do not support the same level of fault tolerance and determinism as batch systems. In this work, we demonstrate how the features of distributed stream processing can affect the results of a typical text classification data flow. Our analysis reveals trade-offs between fault tolerance and reproducibility on the one hand and performance on the other. We outline potential ways to resolve the revealed issues and to handle streaming features.
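
The determinism concern raised in the abstract can be illustrated with a minimal sketch. It is not the authors' data flow: the function names, the toy documents, and the smoothed IDF weighting are all assumptions chosen for illustration. The streaming variant scores each document using only the documents that arrived before it, so a different arrival order yields different features for the same text, while the batch variant sees the whole corpus first and is order-independent.

```python
import math
from collections import Counter


def streaming_idf_scores(docs):
    """Score each document on arrival, using only documents seen so far.

    Hypothetical illustration: the score is the sum of smoothed IDF
    weights of the document's distinct terms.
    """
    doc_freq = Counter()
    seen = 0
    scores = {}
    for doc_id, text in docs:
        terms = sorted(set(text.split()))
        seen += 1
        doc_freq.update(terms)
        scores[doc_id] = sum(
            math.log((1 + seen) / (1 + doc_freq[t])) + 1.0 for t in terms
        )
    return scores


def batch_idf_scores(docs):
    """Score documents after collecting statistics over the whole corpus."""
    doc_freq = Counter()
    for _, text in docs:
        doc_freq.update(set(text.split()))
    total = len(docs)
    return {
        doc_id: sum(
            math.log((1 + total) / (1 + doc_freq[t])) + 1.0
            for t in sorted(set(text.split()))
        )
        for doc_id, text in docs
    }


# Toy corpus (assumed data, not taken from the paper).
docs = [
    ("a", "stream processing"),
    ("b", "batch processing"),
    ("c", "stream latency"),
]
reordered_docs = [docs[2], docs[0], docs[1]]

in_order = streaming_idf_scores(docs)
reordered = streaming_idf_scores(reordered_docs)

# The same document gets a different streaming score under reordering,
# while the batch score does not depend on arrival order.
print(in_order["a"], reordered["a"])
print(batch_idf_scores(docs)["a"], batch_idf_scores(reordered_docs)["a"])
```

In this sketch, any downstream classifier fed with the streaming scores would receive different inputs for document `a` depending on delivery order, which is one way in which non-deterministic delivery could affect reproducibility.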

Published In

BIRTE 2019: Proceedings of Real-Time Business Intelligence and Analytics

August 2019

45 pages

ISBN: 9781450376600

DOI: 10.1145/3350489

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

VLDB Endowment: Very Large Database Endowment

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

BIRTE 2019

BIRTE 2019: Real-Time Business Intelligence and Analytics

August 26, 2019

Los Angeles, CA, USA

Acceptance Rates

BIRTE 2019 Paper Acceptance Rate 6 of 10 submissions, 60%;

Overall Acceptance Rate 12 of 21 submissions, 57%
