Research article, DOI: 10.1145/3422713.3422719

A Column-Level Data Lineage Processing System Based on Hive

Published: 23 October 2020

Abstract

In big data settings, the data warehouse stores the business data of the entire enterprise. Data collected in the warehouse gives rise to new data sets through operations such as union, splitting, and transformation. The conversion relationships that arise during this data production process are called data lineage. Tracking data lineage is therefore an important part of data warehouse construction. However, existing open-source solutions handle this critical step with shortcomings such as high coupling, poor accuracy, and intrusiveness. This paper therefore designs a column-level data lineage processing system for the Hive data warehouse. The system analyzes the data lineage of Hive SQL locally, and it produces fine-grained, highly accurate lineage results while keeping the lineage function loosely coupled to the Hive data warehouse.
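To make the notion of column-level lineage concrete, the following minimal Python sketch records which source columns each derived column depends on and then traces a column back to its origins. It is an illustration only and not the system described in the paper: the table names, column names, and the `record` and `upstream` helpers are all hypothetical, and the lineage edges are entered by hand rather than extracted from Hive SQL.

```python
from collections import defaultdict

# Hypothetical column-level lineage edges: target column -> set of source columns.
# All table and column names below are invented for illustration.
lineage = defaultdict(set)


def record(target, sources):
    """Register that `target` is derived from each column in `sources`."""
    lineage[target].update(sources)


# e.g. INSERT INTO dw.order_summary
#      SELECT o.user_id, o.price * o.qty AS amount FROM ods.orders o
record("dw.order_summary.user_id", ["ods.orders.user_id"])
record("dw.order_summary.amount", ["ods.orders.price", "ods.orders.qty"])

# e.g. INSERT INTO rpt.user_spend
#      SELECT SUM(amount) AS total FROM dw.order_summary
record("rpt.user_spend.total", ["dw.order_summary.amount"])


def upstream(column):
    """Return every column that `column` transitively depends on."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        # .get avoids creating empty entries for columns with no recorded sources
        for src in lineage.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


print(sorted(upstream("rpt.user_spend.total")))
```

Tracing at the column level, rather than only at the table level, shows that `rpt.user_spend.total` depends on `price` and `qty` in the source table but not on `user_id`, which is the kind of fine-grained result the abstract refers to.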


    Published In

    ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies
    September 2020
    250 pages
    ISBN:9781450387859
    DOI:10.1145/3422713

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Big data
    2. Data lineage
    3. Hive

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICBDT 2020
