research-article

Juneau: data lake management for Jupyter

Published: 01 August 2019

Abstract

In collaborative settings such as multi-investigator laboratories, data scientists need improved tools to manage not their data records but rather their data sets and data products, to facilitate both provenance tracking and data (and code) reuse within their data lakes and file systems. We demonstrate the Juneau System, which extends computational notebook software (Jupyter Notebook) as an instrumentation and data management point for overseeing and facilitating improved dataset usage, through capabilities for indexing, searching, and recommending "complementary" data sources, previously extracted machine learning features, and additional training data. This demonstration focuses on how we help the user find related datasets via search.
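As a rough illustration of what searching a data lake for related datasets can involve, the sketch below ranks candidate tables by how much their column values overlap with a query table. It is not Juneau's method: the abstract does not describe the search algorithm, so the Jaccard-overlap scoring, the dict-of-lists table representation, and the names `jaccard`, `relatedness`, and `search_related` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[object]]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def relatedness(query: Table, candidate: Table) -> float:
    """Average, over query columns, of the best value overlap
    with any column of the candidate table (an assumed scoring rule)."""
    if not query or not candidate:
        return 0.0
    total = 0.0
    for q_values in query.values():
        q_set = set(q_values)
        total += max(jaccard(q_set, set(c_values)) for c_values in candidate.values())
    return total / len(query)


def search_related(query: Table, lake: Dict[str, Table], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k tables in the lake with the highest relatedness score."""
    scored = [(name, relatedness(query, table)) for name, table in lake.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    query = {"city": ["Boston", "Austin"], "temp": [50, 80]}
    lake = {
        "weather": {"town": ["Boston", "Austin", "Denver"], "humidity": [0.4, 0.5, 0.2]},
        "sales": {"sku": ["a", "b"], "units": [1, 2]},
    }
    for name, score in search_related(query, lake):
        print(f"{name}: {score:.3f}")
```

Matching on values rather than column names lets the sketch relate `city` to `town` even though the headers differ. A real system would likely also use signals such as provenance and schema, which this example ignores.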




Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment

