DOI: 10.1145/3350489.3350491
research-article

Distributed Classification of Text Streams: Limitations, Challenges, and Solutions

Published: 26 August 2019

Abstract

Text stream classification is an important problem that is difficult to solve at scale. Batch processing systems, widely adopted for text classification tasks, cannot provide low latency. Distributed stream processing systems can offer low latency, but they do not support the same level of fault tolerance and determinism as batch systems. In this work, we demonstrate how the features of distributed stream processing can affect the results of a typical text classification data flow. Our analysis reveals trade-offs between fault tolerance and reproducibility on the one hand and performance on the other. We outline potential ways to resolve the revealed issues and to handle these streaming features.
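
As a rough illustration of the determinism concern raised in the abstract, the sketch below shows how an order-dependent feature extractor can assign different features to the same document when the arrival order of the stream changes. The abstract does not specify the data flow, so everything here is an assumption made for illustration: the incremental TF-IDF step, the smoothing formula, and the names `streaming_tfidf`, `stream`, `in_order`, and `reordered` are all hypothetical.

```python
import math
from collections import Counter


def streaming_tfidf(docs):
    """Yield (doc_id, {term: weight}) using only the documents seen so far.

    Hypothetical incremental TF-IDF: the IDF part is computed from running
    counts, so the weights depend on which documents arrived earlier.
    """
    doc_freq = Counter()  # term -> number of seen documents containing it
    seen = 0              # number of documents seen so far
    for doc_id, text in docs:
        terms = text.lower().split()
        seen += 1
        doc_freq.update(set(terms))
        tf = Counter(terms)
        yield doc_id, {
            t: (tf[t] / len(terms)) * math.log(1 + seen / doc_freq[t])
            for t in tf
        }


# The same three documents, delivered in two different orders.
stream = [
    ("a", "stream processing"),
    ("b", "batch processing"),
    ("c", "stream stream latency"),
]
in_order = dict(streaming_tfidf(stream))
reordered = dict(streaming_tfidf([stream[2], stream[0], stream[1]]))

# Document "a" gets different weights depending on what arrived before it.
print(in_order["a"])
print(reordered["a"])
```

In this toy setting a downstream classifier would receive different inputs for document "a" in the two runs, which is one way in which reordering or replay after a failure could make results hard to reproduce.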

Published In

BIRTE 2019: Proceedings of Real-Time Business Intelligence and Analytics
August 2019
45 pages
ISBN:9781450376600
DOI:10.1145/3350489
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • VLDB Endowment: Very Large Database Endowment

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data streams
  2. exactly-once
  3. reproducibility
  4. text classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

BIRTE 2019

Acceptance Rates

BIRTE 2019 Paper Acceptance Rate 6 of 10 submissions, 60%;
Overall Acceptance Rate 12 of 21 submissions, 57%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 105
  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Dec 2024
