Abstract
The digitization of biomedical processes has accelerated in recent years, in no small part because of machine learning techniques, which require large amounts of labeled data. This chapter focuses on the steps that must precede the training of any algorithm: data collection and labeling. In particular, we describe how to set up data collection that is scalable and secure, so as to avoid costly bottlenecks and delays. Unprecedented amounts of data are now available to companies and academics, but digital tools in the biomedical field encounter a problem of scale, since high-throughput workflows such as high-content imaging and sequencing can generate several terabytes per day. Consequently, data transport, aggregation, and processing are challenging.
A second challenge is maintaining data security. Biomedical data can be personally identifiable, may constitute important trade secrets, and can be expensive to produce. Furthermore, human biomedical data is often immutable, as is the case with genetic information, which cannot be changed once it has been exposed. These factors make securing this type of data imperative and urgent. Here we address best practices for achieving security, with a focus on practicality and scalability. We also address the problem of obtaining usable, rich metadata from the collected data, a major difficulty in the biomedical field because of fragmented and proprietary formats. We detail tools and strategies for extracting metadata from biomedical scientific file formats, and we show how this underutilized metadata plays a key role in creating labeled data for training neural networks.
Copyright information
© 2021 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Fracchia, C. (2021). Secure and Scalable Collection of Biomedical Data for Machine Learning Applications. In: Cartwright, H. (eds) Artificial Neural Networks. Methods in Molecular Biology, vol 2190. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0826-5_16
DOI: https://doi.org/10.1007/978-1-0716-0826-5_16
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0825-8
Online ISBN: 978-1-0716-0826-5
eBook Packages: Springer Protocols