DOI: 10.1145/3312714.3312717
Research article

Application of Attribute Correlation in Unsupervised Data Cleaning

Published: 10 January 2019

Abstract

By analogy with supervised and unsupervised learning in machine learning, we divide data cleaning processes into supervised and unsupervised forms. Based on the characteristics of unsupervised cleaning, we then reclassify data quality problems into four categories: canonicalization errors, redundancy errors, strong logic errors, and weak logic errors. For weak logic errors, we propose a repair framework, AC-Framework, and an algorithm, AC-Repair, both based on attribute correlation. During repair, we first build a priority queue (PQ) of the elements to be repaired, ordered by the minimum-cost principle, and use the corresponding conflict-free data set (Icf) as a training set to learn the correlation among attributes. We then select the first element in PQ as the candidate to repair and recompute PQ after each repair round to improve efficiency. Finally, to prevent the algorithm from looping endlessly, we mark each repaired element with a flag, so that every erroneous element is repaired at most once. In the experiments, we compare AC-Repair with an interpolation-based repair algorithm to verify its validity.
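The repair loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' AC-Repair implementation: the abstract does not define the cost function or the correlation measure, so the sketch assumes a caller-supplied `cost` function and uses simple co-occurrence counts over the conflict-free set as a stand-in for attribute correlation. The names `learn_correlation`, `repair`, `given`, and `suspects` are hypothetical.

```python
import heapq
from collections import Counter, defaultdict


def learn_correlation(conflict_free, target, given):
    """Count how often each `target` value co-occurs with each `given` value.

    Assumption: co-occurrence counts stand in for the attribute correlation
    that the paper learns from the conflict-free set (Icf).
    """
    table = defaultdict(Counter)
    for row in conflict_free:
        table[row[given]][row[target]] += 1
    return table


def repair(rows, suspects, conflict_free, given, cost):
    """Repair each suspect cell (row_index, attribute) at most once.

    `cost(data, cell)` is a hypothetical repair-cost function; lower cost
    means the cell is repaired earlier.
    """
    fixed = [dict(r) for r in rows]   # work on a copy, leave the input intact
    repaired = set()                  # flags: cells that were already repaired
    pending = set(suspects)
    while pending:
        # Recompute the priority queue after every repair round.
        pq = [(cost(fixed, cell), cell) for cell in pending]
        heapq.heapify(pq)
        _, cell = heapq.heappop(pq)   # first element = minimum-cost candidate
        pending.discard(cell)
        if cell in repaired:          # the flag guards against endless loops
            continue
        i, attr = cell
        votes = learn_correlation(conflict_free, attr, given).get(fixed[i][given])
        if votes:
            # Replace the value with the one most often seen alongside `given`.
            fixed[i][attr] = votes.most_common(1)[0][0]
        repaired.add(cell)
    return fixed
```

With a conflict-free set in which `city` determines `country`, calling `repair(rows, [(0, "country")], conflict_free, "city", cost)` would overwrite a suspect `country` value in row 0 with the value most frequently paired with that row's `city`. A real implementation would presumably learn the correlation once per attribute rather than on every round; the sketch keeps it inline for brevity.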

    Published In

    ICSLT '19: Proceedings of the 5th International Conference on e-Society, e-Learning and e-Technologies
    January 2019
    132 pages
    ISBN:9781450362351
    DOI:10.1145/3312714

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Unsupervised data cleaning
    2. attribute correlation
    3. machine learning
    4. minimum repair cost
    5. weak logic errors

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICSLT 2019
