DOI: 10.1145/3318464.3384698
short-paper

CoClean: Collaborative Data Cleaning

Published: 31 May 2020

Abstract

High-quality data is crucial for many applications, but real-life data is often dirty. Unfortunately, automated solutions are often not trustworthy and are thus seldom employed in practice. In real-world scenarios, it is often necessary to resort to manual cleaning to obtain pristine data. Existing human-in-the-loop solutions, such as Trifacta and OpenRefine, typically involve a single user. This is often error-prone, limited to a single person's expertise, and cannot scale with the ever-growing volume, variety, and veracity of data.
We propose a crowd-in-the-loop cleaning system, called CoClean, built on top of the Python Pandas dataframe, a library widely used by data scientists. The core of CoClean is a new Python library called Collaborative dataframe (CDF) that allows one to share data represented as a dataframe with other users. CDF is responsible for synchronizing and aggregating annotations obtained from different users. The attendees will have the opportunity to experience the following features:
  1. Data Assignment: Given a dataframe, the owner can assign it (or a subset of it) to different users.
  2. Supporting both lay and power users: Lay users can use a GUI for direct manual cleaning of the data, while power users can work on the assigned data through a Jupyter Notebook, where they can write scripts to do batch cleaning.
  3. Combining machines and humans: Possible errors and repairs generated by machine algorithms can be highlighted as annotations, which makes manual cleaning easier for users.
  4. Collaboration Modes: CoClean supports two modes: blind-on (no user can see the annotations from others) and blind-off.
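
The abstract does not describe the CDF API, so the following is only a rough sketch of the ideas it names: assigning rows to several users, collecting their annotations, hiding other users' annotations in blind-on mode, and aggregating the results. Every name here (`assign_rows`, `AnnotationStore`, `visible_to`, `aggregate`) is hypothetical, plain Python dictionaries stand in for the Pandas dataframe to keep the sketch dependency-free, and majority voting is an assumed aggregation rule, not necessarily the one CoClean uses.

```python
from collections import Counter, defaultdict


def assign_rows(row_ids, users, overlap=2):
    """Hypothetical round-robin assignment: each row goes to `overlap` distinct users."""
    if overlap > len(users):
        raise ValueError("overlap cannot exceed the number of users")
    assignment = defaultdict(list)
    for i, row_id in enumerate(row_ids):
        for k in range(overlap):
            assignment[users[(i + k) % len(users)]].append(row_id)
    return dict(assignment)


class AnnotationStore:
    """Hypothetical store of per-cell repair suggestions from several users."""

    def __init__(self, blind=True):
        self.blind = blind
        # (row_id, column) -> {user: suggested value}
        self._annotations = defaultdict(dict)

    def annotate(self, user, row_id, column, value):
        """Record (or overwrite) one user's suggestion for one cell."""
        self._annotations[(row_id, column)][user] = value

    def visible_to(self, user, row_id, column):
        """In blind-on mode a user sees only their own annotation for the cell."""
        cell = self._annotations.get((row_id, column), {})
        if self.blind:
            return {u: v for u, v in cell.items() if u == user}
        return dict(cell)

    def aggregate(self):
        """Assumed rule: majority vote per cell; a tie is left unresolved as None."""
        result = {}
        for cell, votes in self._annotations.items():
            ranked = Counter(votes.values()).most_common(2)
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                result[cell] = None
            else:
                result[cell] = ranked[0][0]
        return result


# Usage: three rows shared among three users, two reviewers per row.
assignment = assign_rows([0, 1, 2], ["ann", "bob", "cy"], overlap=2)
store = AnnotationStore(blind=True)
store.annotate("ann", 0, "city", "Paris")
store.annotate("bob", 0, "city", "Paris")
store.annotate("cy", 2, "city", "Doha")
store.annotate("ann", 2, "city", "Unknown")
print(assignment)
print(store.visible_to("ann", 0, "city"))
print(store.aggregate())
```

Assigning each row to more than one user is what makes aggregation meaningful: agreement between reviewers can be accepted automatically, while a tie stays unresolved for the dataframe owner to inspect.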

Information & Contributors

Information

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data cleaning
  2. data collaboration
  3. data consolidation
  4. data preparation

Qualifiers

  • Short-paper

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 70
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 13 Dec 2024.

Citations

Cited By

  • (2024) Cleenex: Support for User Involvement During an Iterative Data Cleaning Process. Journal of Data and Information Quality. DOI: 10.1145/3648476. Online publication date: 15-Feb-2024
  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data, 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) A power fusion data cleaning method based on exponential moving average and cosine similarity algorithms. 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom) (25-30). DOI: 10.1109/EdgeCom62867.2024.00012. Online publication date: 28-Jun-2024
  • (2024) A multi-source heterogeneous medical data enhancement framework based on lakehouse. Health Information Science and Systems, 12:1. DOI: 10.1007/s13755-024-00295-6. Online publication date: 5-Jul-2024
  • (2023) Deep-Reinforcement-Learning-Based IoT Sensor Data Cleaning Framework for Enhanced Data Analytics. Sensors, 23:4 (1791). DOI: 10.3390/s23041791. Online publication date: 5-Feb-2023
  • (2023) Students learning performance prediction based on feature extraction algorithm and attention-based bidirectional gated recurrent unit network. PLOS ONE, 18:10 (e0286156). DOI: 10.1371/journal.pone.0286156. Online publication date: 25-Oct-2023
  • (2023) Private Collaborative Data Cleaning via Non-Equi PSI. 2023 IEEE Symposium on Security and Privacy (SP) (1419-1434). DOI: 10.1109/SP46215.2023.10179396. Online publication date: May-2023
  • (2023) Private Collaborative Data Cleaning via Non-Equi PSI. 2023 IEEE Symposium on Security and Privacy (SP) (1419-1434). DOI: 10.1109/SP46215.2023.10179337. Online publication date: May-2023
  • (2023) An Efficient Generative Data Imputation Toolbox with Adversarial Learning. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3651-3654). DOI: 10.1109/ICDE55515.2023.00290. Online publication date: Apr-2023
  • (2022) Diversifying repairs of Denial constraint violations. Information Systems, 108 (102041). DOI: 10.1016/j.is.2022.102041. Online publication date: Sep-2022