DOI: 10.1145/3078597.3078603

Diagnosing Machine Learning Pipelines with Fine-grained Lineage

Published: 26 June 2017

Abstract

We present Hippo, a system that enables the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets, and the cell-level mapping between them. It also collects the information needed to reproduce the computation. Hippo efficiently enables common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting metadata separation and high-order function encoding strategies, we observe an O(10^3)x total improvement in lineage storage efficiency over the baseline of cell-wise mapping recording, while maintaining lineage integrity. Hippo can answer lineage queries from real use cases within a few seconds, which is low enough to enable interactive diagnosis of ML pipelines.
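To make the storage contrast in the abstract concrete, the following is a minimal Python sketch of the general idea, not Hippo's actual API. It compares a cell-wise lineage table with a single function that regenerates the same mapping on demand, using a row-sum transformation as the example. All names here (`cellwise_lineage_row_sum`, `functional_lineage_row_sum`, `stored_pairs`) and the choice of row-sum are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) coordinate of one cell


def cellwise_lineage_row_sum(n_rows: int, n_cols: int) -> Dict[Cell, List[Cell]]:
    """Baseline (hypothetical): store every output-cell -> input-cells pair explicitly."""
    return {(r, 0): [(r, c) for c in range(n_cols)] for r in range(n_rows)}


def functional_lineage_row_sum(n_cols: int) -> Callable[[Cell], List[Cell]]:
    """Compact form (hypothetical): store one function that recomputes the mapping."""
    def backward(out_cell: Cell) -> List[Cell]:
        row, _ = out_cell
        # A row-sum output cell depends on every input cell in the same row.
        return [(row, c) for c in range(n_cols)]
    return backward


def stored_pairs(mapping: Dict[Cell, List[Cell]]) -> int:
    """Count the (output cell, input cell) pairs held by the explicit table."""
    return sum(len(inputs) for inputs in mapping.values())


# A 1000 x 100 input matrix reduced to a 1000 x 1 column of row sums.
explicit = cellwise_lineage_row_sum(1000, 100)
backward = functional_lineage_row_sum(100)

# Both representations answer the same backward lineage query.
assert explicit[(7, 0)] == backward((7, 0))

# The explicit table holds 100000 pairs; the function only needs n_cols.
print(stored_pairs(explicit))
```

The explicit table grows with the number of cells, while the functional encoding stays constant in size for regular transformations like this one. Irregular, data-dependent mappings would still need some explicit record, which this sketch does not cover.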

    Published In

    HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
    June 2017
    254 pages
    ISBN:9781450346993
    DOI:10.1145/3078597
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. diagnostics
    2. fine-grained lineage
    3. machine learning

    Qualifiers

    • Research-article

    Conference

    HPDC '17
    Acceptance Rates

    HPDC '17 Paper Acceptance Rate 19 of 100 submissions, 19%;
    Overall Acceptance Rate 166 of 966 submissions, 17%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 125
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 26 Dec 2024

    Citations

    Cited By

    • (2024) Compression and In-Situ Query Processing for Fine-Grained Array Lineage. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3654-3667. DOI: 10.1109/ICDE60146.2024.00281. Online publication date: 13-May-2024
    • (2023) Metadata Representations for Queryable Repositories of Machine Learning Models. IEEE Access, 11, pages 125616-125630. DOI: 10.1109/ACCESS.2023.3330647. Online publication date: 2023
    • (2023) A Review on Machine Unlearning. SN Computer Science, 4:4. DOI: 10.1007/s42979-023-01767-4. Online publication date: 19-Apr-2023
    • (2022) Database Meets Artificial Intelligence: A Survey. IEEE Transactions on Knowledge and Data Engineering, 34:3, pages 1096-1116. DOI: 10.1109/TKDE.2020.2994641. Online publication date: 1-Mar-2022
    • (2022) Data distribution debugging in machine learning pipelines. The VLDB Journal, 31:5, pages 1103-1126. DOI: 10.1007/s00778-021-00726-w. Online publication date: 31-Jan-2022
    • (2021) AI Meets Database: AI4DB and DB4AI. Proceedings of the 2021 International Conference on Management of Data, pages 2859-2866. DOI: 10.1145/3448016.3457542. Online publication date: 9-Jun-2021
    • (2021) MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines. 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1655-1666. DOI: 10.1109/ICDE51399.2021.00146. Online publication date: Apr-2021
    • (2021) Provenance Supporting Hyperparameter Analysis in Deep Neural Networks. Provenance and Annotation of Data and Processes, pages 20-38. DOI: 10.1007/978-3-030-80960-7_2. Online publication date: 9-Jul-2021
    • (2020) Bayesian Stress Testing of Models in a Classification Hierarchy. 2020 International Joint Conference on Neural Networks (IJCNN), pages 1-8. DOI: 10.1109/IJCNN48605.2020.9207356. Online publication date: Jul-2020
    • (2019) Smoke. Proceedings of the VLDB Endowment, 11:6, pages 719-732. DOI: 10.14778/3199517.3199522. Online publication date: 17-Jan-2019
