Research article. DOI: 10.1145/2998476.2998498

A Novel Approach to Big Data Veracity using Crowdsourcing Techniques and Bayesian Predictors

Published: 21 October 2016

Abstract

Data is now generated at a tremendous pace, and adequate measures must be in place to verify its quality. Analysis performed on 'dirty' data can yield erroneous insights and, in turn, poor decisions. The aspect of big data that deals with its correctness is known as big data veracity. Being able to trust acquired data is essential before acting on the output of an automated decision-making system, and veracity checks help validate that data. In this paper, we present our solution to the big data veracity problem using crowdsourcing techniques. Our solution uses sentiment analysis, the task of identifying the sentiment expressed in a piece of text. As a proof of concept, we developed an app that asks users to tag tweets according to the sentiment each tweet evokes in them. Each tweet is thus ratified by hundreds of participants and tagged with its associated sentiment. The crowd-tagged sentiment was then evaluated against the sentiment recorded in a verified data set. The results were plotted on a ROC curve and also evaluated against the verified data using a Bayesian predictor trained with a trinomial function. The ROC analysis shows an accuracy of 81%, and the Bayesian predictor achieves 89%. In addition, a maximum a posteriori (MAP) analysis of the Bayesian predictor yields neutral sentiment as the most probable hypothesis. These results demonstrate that crowdsourced sentiment analysis is a viable solution to the problem of big data veracity and therefore an aid to better decision-making.
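The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract gives no details of it. The sketch assumes three sentiment labels, aggregates crowd tags by simple majority vote, scores them against a verified label set, and picks a MAP hypothesis by treating the vote counts for a tweet as a trinomial draw. The priors, the per-class vote probabilities (`PRIORS`, `VOTE_PROBS`), and the toy tweets are all hypothetical values chosen for illustration.

```python
from collections import Counter
from math import lgamma, log

# Assumed label set; the abstract only implies three classes ("trinomial").
LABELS = ("positive", "negative", "neutral")

# Hypothetical parameters: uniform priors, and a crowd that votes for the
# true sentiment with probability 0.7 and for each other label with 0.15.
PRIORS = {label: 1 / 3 for label in LABELS}
VOTE_PROBS = {
    true: [0.7 if voted == true else 0.15 for voted in LABELS]
    for true in LABELS
}


def majority_vote(tags):
    """Return the most frequent crowd tag; ties go to the earliest label in LABELS."""
    counts = Counter(tags)
    return max(LABELS, key=lambda label: counts[label])


def accuracy(predicted, verified):
    """Fraction of predicted labels that match the verified labels."""
    if not verified:
        raise ValueError("verified label list is empty")
    matches = sum(p == v for p, v in zip(predicted, verified))
    return matches / len(verified)


def trinomial_log_pmf(counts, probs):
    """Log-probability of the vote counts under a trinomial with the given probs."""
    n = sum(counts)
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * log(p) for c, p in zip(counts, probs))


def map_label(tags, priors, vote_probs):
    """Return the label h maximising log P(h) + log P(vote counts | h)."""
    counts = Counter(tags)
    vector = [counts[label] for label in LABELS]
    scores = {
        h: log(priors[h]) + trinomial_log_pmf(vector, vote_probs[h])
        for h in LABELS
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy crowd tags and verified labels (made up for this sketch).
    crowd = {
        "t1": ["positive", "positive", "neutral"],
        "t2": ["negative", "neutral", "negative", "negative"],
        "t3": ["neutral", "neutral", "positive"],
        "t4": ["positive", "negative", "neutral", "neutral"],
    }
    verified = {"t1": "positive", "t2": "negative", "t3": "neutral", "t4": "negative"}

    ids = sorted(crowd)
    voted = [majority_vote(crowd[i]) for i in ids]
    print("majority-vote accuracy:", accuracy(voted, [verified[i] for i in ids]))
    for i in ids:
        print(i, "MAP label:", map_label(crowd[i], PRIORS, VOTE_PROBS))
```

With uniform priors and symmetric vote probabilities the MAP label coincides with the majority vote; the two diverge only when the priors or the per-class vote probabilities are skewed, which is where a trained predictor would add value.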

Published In

COMPUTE '16: Proceedings of the 9th Annual ACM India Conference
October 2016
178 pages
ISBN: 9781450348089
DOI: 10.1145/2998476
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bayesian Predictor
  2. Big Data
  3. Crowdsourcing
  4. Machine Learning
  5. Sentiment Analysis
  6. Tweet Mining

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACM COMPUTE '16: Ninth Annual ACM India Conference
October 21 - 23, 2016
Gandhinagar, India

Acceptance Rates

COMPUTE '16 paper acceptance rate: 22 of 117 submissions (19%)
Overall acceptance rate: 114 of 622 submissions (18%)

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 0

Reflects downloads up to 06 Jan 2025

Cited By

  • (2024) The State of Pilot Study Reporting in Crowdsourcing: A Reflection on Best Practices and Guidelines. Proceedings of the ACM on Human-Computer Interaction, 8:CSCW1 (1-45). DOI: 10.1145/3641023. Online publication date: 26-Apr-2024
  • (2020) The Utilization of Blockchain for Enhancing Big Data Security and Veracity. Combating Security Challenges in the Age of Big Data (157-187). DOI: 10.1007/978-3-030-35642-2_8. Online publication date: 27-May-2020
  • (2018) DTRM. Future Generation Computer Systems, 83:C (293-302). DOI: 10.1016/j.future.2018.01.026. Online publication date: 1-Jun-2018
  • (2017) Big Data Acquisition, Preparation, and Analysis Using Apache Software Foundation Tools. Big Data Analytics (195-228). DOI: 10.1201/b21822-9. Online publication date: 30-Oct-2017
