DOI: 10.1145/3361821.3361825

Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis

Published: 20 September 2019

Abstract

Social media provide continuous data streams that contain information with different levels of sensitivity, validity and accuracy. Therefore, this type of information has to be properly filtered, extracted and processed to avoid noisy and inaccurate results. The main goal of this work is to propose an architecture and workflow able to process Twitter social network data in near real-time. The primary design of the introduced architecture covers all processing aspects, from data ingestion and storage to data processing and analysis. This paper presents an Apache Spark and Hadoop implementation. The secondary objective is to analyse tweets on a defined topic: floods. The word frequency method (Word Clouds) is shown as a major tool to analyse the content of the input dataset. The experimental architecture confirmed the usefulness of many well-known functions of Spark and Hadoop in the social data domain. The platforms used provided effective tools for data ingestion and storage as well as processing and analysis. Based on the analytical part, it was observed that the word frequency method (n-grams) can effectively reveal the content of the tweets. According to the results of this study, the tweets proved their high informative potential regarding data quality and quantity.

    Reviews

    Dominik Strzalka

    What would we do without social media? What would the world look like if there weren't continuous data streams? If we refer back to our history, the first big breakthrough was around 1440, when Johannes Gutenberg introduced his printing technology. This was the beginning of information spreading around the world. At first, this happened very slowly, but then in the early 17th century (from 1605), a new and interesting idea appeared: the newspaper. Information could now spread more quickly. And when Gauss, Weber, Wheatstone, and Morse introduced their various telegraphs, it was clear that information spreads even faster with a pair of wires. After the discoveries of Tesla and Marconi, with their new invention called the radio, the speed of information spreading reached almost the speed of light. However, this communication was rather one-directional, and its flow was very limited: the reader/listener had little or no opportunity to create or respond to news and spread it quickly. As an example, citizens band (CB) radio was not so popular. Still, all of these inventions were only a prelude to what we have today. An increase in information flow intensity has been observed since the 1970s, when it became clear that the idea of a global market was not a dream but a reality. Starting from Sydney, through Tokyo, Bombay, Frankfurt, Paris, and London, and reaching the New York Stock Exchange, the markets worked almost the whole day with a constant flow of data about changes in share prices. It was only a matter of time before this situation became the norm, though in different dimensions. This is possible thanks to the Internet: one of its applications, social media, has taken over the world, generating flows of information. Different social media services generate data streams of information with different levels of sensitivity, validity, and accuracy. The main contribution of this paper is an architecture that is able to process Twitter's data streams.
The authors propose a five-component system: (1) data ingestion based on Apache Flume; (2) data storage on the Hadoop Distributed File System (HDFS), where tweets are broken into separate blocks and distributed to nodes; (3) a data warehouse, Apache Hive with HiveQL, to store data in the form of a table for further analysis; (4) a resource manager for job scheduling, Yet Another Resource Negotiator (YARN); and (5) the Apache Spark processing engine. Data from Twitter is easily available through application programming interface (API) access. As an experiment, the authors apply the word frequency method (n-grams) to two datasets: 1,000 tweets with the keyword flood (completed on April 10, 2018) and 10,000 tweets with the keyword flood (completed on April 25, 2018). The proposed architecture works very well to uncover the content ... in the tweets. It should be noted that processing social media data is not trivial; the paper is a novel attempt to show how the Twitter data stream can be processed by the Apache Spark big data platform.
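
The word frequency method (n-grams) mentioned above can be illustrated with a small sketch. This is plain Python using `collections.Counter`, not the authors' Spark implementation; the tokenizer, the function names (`tokenize`, `ngrams`, `ngram_frequencies`), and the sample tweets are all assumptions made up for this example.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase letters, digits, and apostrophes only.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(tweets, n):
    """Count how often each n-gram occurs across a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts


# Hypothetical sample tweets, invented for illustration only.
sample_tweets = [
    "Flood warning issued for the river valley",
    "River valley roads closed after flood warning",
    "Flood warning extended overnight",
]

bigram_counts = ngram_frequencies(sample_tweets, 2)
top_bigram, top_count = bigram_counts.most_common(1)[0]
print(top_bigram, top_count)
```

In a distributed setting the same counting step would typically be expressed as a map over tweets followed by a reduce by key; the single-process version here is only meant to show the logic of the method.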

    Published In

    CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things
    September 2019
    134 pages
    ISBN:9781450372411
    DOI:10.1145/3361821

    In-Cooperation

    • Waseda University

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Apache Spark
    2. Twitter
    3. data processing architecture
    4. social network data

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    CCIOT 2019

    Cited By

    • (2024) Designing a Data Pipeline Architecture for Intelligent Analysis of Streaming Data. Science, Engineering Management and Information Technology, 361-372. DOI: 10.1007/978-3-031-72284-4_22. Online publication date: 12-Sep-2024
    • (2021) Analyzing and visualizing Twitter conversations. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, 4-13. DOI: 10.5555/3507788.3507791. Online publication date: 22-Nov-2021
    • (2021) ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472518. Online publication date: 9-Aug-2021
    • (2021) A microservices persistence technique for cloud-based online social data analysis. Cluster Computing 24:3, 2341-2353. DOI: 10.1007/s10586-021-03244-0. Online publication date: 1-Sep-2021
