DOI: 10.1145/3219104.3219159

Clowder: Open Source Data Management for Long Tail Data

Published: 22 July 2018

Abstract

Clowder is an open source data management system that supports the curation of long tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or uploaded manually by users, and then shared with remote collaborators through a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas, including digital preservation, geoscience, material science, medicine, social science, cultural heritage, and the arts. These challenges include support for large amounts of data; horizontal scaling of domain-specific preprocessing algorithms; the ability to provide new data visualizations in the web browser; a comprehensive web service API for automatic data ingestion and curation; a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms; and a web-based front end to interact with code running on heterogeneous clusters, including HPC resources.




Published In

PEARC '18: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity
July 2018
652 pages
ISBN: 9781450364461
DOI: 10.1145/3219104
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data curation
  2. data management
  3. linked data
  4. metadata management
  5. scientific gateways

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '18

Acceptance Rates

PEARC '18 Paper Acceptance Rate 79 of 123 submissions, 64%;
Overall Acceptance Rate 133 of 202 submissions, 66%
