research-article

Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis

Authors:

Michal Podhoranyi,

Lukas Vojacek

CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things

Pages 1 - 6

https://doi.org/10.1145/3361821.3361825

Published: 20 September 2019

Abstract

Social media provide continuous data streams that contain information with different levels of sensitivity, validity and accuracy. Therefore, this type of information has to be properly filtered, extracted and processed to avoid noisy and inaccurate results. The main goal of this work is to propose an architecture and workflow able to process Twitter social network data in near real-time. The primary design of the introduced architecture covers all processing aspects, from data ingestion and storage to data processing and analysis. This paper presents an Apache Spark and Hadoop implementation. The secondary objective is to analyse tweets on a defined topic: floods. The word frequency method (Word Clouds) is shown as a major tool to analyse the content of the input dataset. The experimental architecture confirmed the usefulness of many well-known functions of Spark and Hadoop in the social data domain. The platforms used provided effective tools for data ingestion and storage as well as processing and analysis. Based on the analytical part, it was observed that the word frequency method (n-grams) can effectively reveal the content of the tweets. According to the results of this study, the tweets proved their high informative potential regarding data quality and quantity.
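To make the word frequency method (n-grams) mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' Spark implementation. The tokenizer, the small stop-word list, the function names and the sample tweets are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer settings; the paper does not specify its preprocessing.
TOKEN_RE = re.compile(r"[a-z0-9']+")
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def tokenize(text):
    """Lowercase a tweet, drop URLs, and return its non-stop-word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]


def ngram_counts(tweets, n=1):
    """Count n-grams (joined by spaces) across an iterable of tweet strings.

    Stop words are removed before n-grams are formed, so an n-gram may join
    words that were separated by a stop word in the original tweet.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Made-up sample data, not taken from the paper's datasets.
sample_tweets = [
    "Flood warning issued for the river valley",
    "Heavy rain and flood warning in the valley https://example.org/x",
    "Roads closed after flood",
]

unigrams = ngram_counts(sample_tweets, n=1)
bigrams = ngram_counts(sample_tweets, n=2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The resulting counts are the kind of input a word cloud needs: each term's frequency determines its display size.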

Index Terms

Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Data flow architectures

Reviews

Reviewer: Dominik Strzalka

What would we do without social media? What would the world look like if there weren't continuous data streams? If we refer back to our history, the first big breakthrough was around 1440, when Johannes Gutenberg introduced his printing technology. This was the beginning of information spreading around the world. At first, this was done very slowly, but then in the early 17th century (from 1605), a new and interesting idea appeared: the newspaper. Information could now spread more quickly. And when Gauss, Weber, Wheatstone, and Morse introduced their various telegraphs, it was clear that information spreads even faster with a pair of wires. After the discoveries of Tesla and Marconi, with their new invention called the radio, the speed of information spreading reached almost the speed of light. However, this communication was rather one-directional, and its flow was very limited: the reader or listener had little or no opportunity to create, respond to, or quickly spread news. For example, citizens band (CB) radio never became very popular. Still, all of these inventions were only a prelude to what we have today. An increase in information flow intensity has been observed since the 1970s, when it became clear that the idea of a global market was not a dream but a reality. Starting from Sydney, through Tokyo, Bombay, Frankfurt, Paris, and London, and reaching the New York Stock Exchange, the markets worked almost the whole day with constant data flow about changes in share prices. It was only a matter of time before this situation became the norm, though in different dimensions. This is possible thanks to the Internet: one of its applications, social media, has taken over the world, generating flows of information. Different social media services generate data streams of information with different levels of sensitivity, validity, and accuracy. The main contribution of this paper is an architecture that is able to process Twitter's data streams.
The authors propose a five-component system: (1) data ingestion based on Apache Flume; (2) data storage on the Hadoop Distributed File System (HDFS), where tweets are broken into separate blocks and distributed to nodes; (3) a data warehouse, Apache Hive with HiveQL, to store data in the form of a table for further analysis; (4) a resource manager for job scheduling, Yet Another Resource Negotiator (YARN); and (5) the Spark processing engine. Data from Twitter is easily available through application programming interface (API) access. As an experiment, the authors apply the word frequency method (n-grams) to two datasets: 1,000 tweets with the keyword flood (completed on April 10, 2018) and 10,000 tweets with the keyword flood (completed on April 25, 2018). The proposed architecture works very well to uncover the content ... in the tweets. It should be noted that the processing of social media data is not trivial, and that the paper is a novel attempt to show how the Twitter data stream can be processed by the Apache Spark big data platform.
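The storage and processing steps described above (tweets broken into blocks, blocks distributed to nodes, then processed in parallel) can be illustrated with a small single-process simulation. This is only a sketch of the general split, map and merge pattern. It does not use the HDFS, YARN or Spark APIs, and the block size, node names and helper functions are hypothetical. Real HDFS splits files by bytes and replicates blocks, which is not modelled here.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Split a list of records into consecutive blocks of at most block_size."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def assign_to_nodes(blocks, node_names):
    """Place blocks on nodes round-robin (an assumed policy, no replication)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        placement[node_names[index % len(node_names)]].append(block)
    return placement


def count_words(block):
    """Map step: word frequencies for a single block of tweets."""
    counts = Counter()
    for tweet in block:
        counts.update(tweet.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial frequency tables."""
    return left + right


# Made-up sample data, not taken from the paper's datasets.
tweets = [
    "flood warning",
    "flood in town",
    "river flood rising",
    "rain again",
    "flood receding",
]

blocks = split_into_blocks(tweets, block_size=2)
placement = assign_to_nodes(blocks, ["node-a", "node-b"])

# Each node counts its own blocks independently; the partial results are merged.
partials = [
    count_words(block)
    for node_blocks in placement.values()
    for block in node_blocks
]
total = reduce(merge_counts, partials, Counter())
print(total.most_common(3))
```

Because the per-block counts are merged by simple addition, the final table is the same as counting the whole dataset at once, which is what lets the work be spread across nodes.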

Information & Contributors

Information

Published In

CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things

September 2019

134 pages

ISBN:9781450372411

DOI:10.1145/3361821

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Waseda University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CCIOT 2019

CCIOT 2019: 2019 4th International Conference on Cloud Computing and Internet of Things

September 20 - 22, 2019

Tokyo, Japan

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 3
Total Downloads: 303
Downloads (Last 12 months): 33
Downloads (Last 6 weeks): 1

Reflects downloads up to 20 Jan 2025

Cited By

Mysiuk, I., Mysiuk, R., Shuvar, R., Yuzevych, V., Pavlenchyk, A., and Dalyk, V. 2024. Designing a Data Pipeline Architecture for Intelligent Analysis of Streaming Data. Science, Engineering Management and Information Technology, 361-372. Online publication date: 12-Sep-2024.
https://doi.org/10.1007/978-3-031-72284-4_22

Gutierrez, C., Whittaker, A., Patenio, K., Gehman, J., Lefsrud, L., Barbosa, D., Stroulia, E., Onuţ, I., and Zulkernine, F. 2021. Analyzing and visualizing Twitter conversations. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, 4-13. Online publication date: 22-Nov-2021.
https://dl.acm.org/doi/10.5555/3507788.3507791

Khan, M., and Yu, W. 2021. ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics. Proceedings of the 50th International Conference on Parallel Processing, 1-10. Online publication date: 9-Aug-2021.
https://dl.acm.org/doi/10.1145/3472456.3472518

Al-Obeidat, F., Bani-Hani, A., Adedugbe, O., Majdalawieh, M., and Benkhelifa, E. 2021. A microservices persistence technique for cloud-based online social data analysis. Cluster Computing 24:3, 2341-2353. Online publication date: 1-Sep-2021.
https://dl.acm.org/doi/10.1007/s10586-021-03244-0
