DOI:10.1145/2593728.2593733
Article

Utilization of synergetic human-machine clouds: a big data cleaning case

Published: 02 June 2014

Abstract

Cloud computing and crowdsourcing are growing trends in IT. Combining the strengths of machine and human clouds within a hybrid design makes it possible to overcome certain problems and gain efficiency. In this paper we present a case in which we developed a hybrid, throw-away prototype software system to solve a big data cleaning problem: correcting and normalizing a data set of 53,822 academic publication records. In the first step, we used external DOI query web services to label the records with matching DOIs. We then used customized string similarity algorithms based on Levenshtein distance and the Jaccard index to grade the similarity between records. Finally, we used crowdsourcing to identify duplicates among the residual record set, which consisted of similar yet not identical records. We consider this proof of concept successful and report results that we could not have achieved by using either human or machine clouds alone.
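
The abstract does not describe how the similarity algorithms were customized, so the following is only a minimal sketch of the two textbook measures it names, plus one possible way to route record pairs to the crowd. The function names, the equal weighting in `grade`, and the `low`/`high` thresholds in `route` are illustrative assumptions, not values taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_index(a: str, b: str) -> float:
    """Size of the intersection over size of the union of the
    lowercased, whitespace-separated word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return 1.0 if not union else len(tokens_a & tokens_b) / len(union)


def grade(a: str, b: str, weight: float = 0.5) -> float:
    """Hypothetical combined grade: weighted mean of both measures.
    The 0.5 weight is an assumption made for this sketch."""
    char_score = levenshtein_similarity(a.lower(), b.lower())
    return weight * char_score + (1.0 - weight) * jaccard_index(a, b)


def route(score: float, low: float = 0.5, high: float = 0.95) -> str:
    """Hypothetical triage: settle clear cases automatically and send
    the ambiguous middle band to crowd workers. Thresholds are assumed."""
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "crowd"


if __name__ == "__main__":
    title_a = "Labeling images with a computer game"
    title_b = "Labelling images with a computer game."
    score = grade(title_a, title_b)
    print(f"{score:.3f} -> {route(score)}")
```

The character-level measure tolerates small spelling differences, while the word-set measure tolerates reordered words; pairs that neither measure settles clearly are the "similar yet not identical" records that the abstract hands to human workers.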

Published In

CSI-SE 2014: Proceedings of the 1st International Workshop on CrowdSourcing in Software Engineering
June 2014
18 pages
ISBN:9781450328579
DOI:10.1145/2593728
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

In-Cooperation

  • TCSE: IEEE Computer Society's Tech. Council on Software Engin.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud Computing
  2. Crowdservice
  3. Crowdsourcing

Qualifiers

  • Article

Conference

ICSE '14

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 17 Dec 2024

Cited By

  • (2021) Correcting Corrupted Labels Using Mode Dropping of ACGAN. 2021 15th International Symposium on Medical Information and Communication Technology (ISMICT), 98-103. DOI:10.1109/ISMICT51748.2021.9434911. Online publication date: 14-Apr-2021
  • (2021) Toward Pinpointing Data Leakage from Advanced Persistent Threats. 2021 7th IEEE Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), 157-162. DOI:10.1109/BigDataSecurityHPSCIDS52275.2021.00038. Online publication date: May-2021
  • (2017) Leveraging business process improvement with natural language processing and organizational semantic knowledge. Proceedings of the 2017 International Conference on Software and System Process, 100-108. DOI:10.1145/3084100.3084112. Online publication date: 5-Jul-2017
  • (2016) Decision support in tourism based on human-computer cloud. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, 125-132. DOI:10.1145/3011141.3011174. Online publication date: 28-Nov-2016
