
Secure and Scalable Collection of Biomedical Data for Machine Learning Applications

  • Protocol
Artificial Neural Networks

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2190))


Abstract

Recently, digitization of biomedical processes has accelerated, in no small part due to the use of machine learning techniques, which require large amounts of labeled data. This chapter focuses on the prerequisite steps to the training of any algorithm: data collection and labeling. In particular, we tackle how data collection can be set up with scalability and security to avoid costly bottlenecks and delays. Unprecedented amounts of data are now available to companies and academics, but digital tools in the biomedical field encounter a problem of scale, since high-throughput workflows such as high-content imaging and sequencing can create several terabytes per day. Consequently, data transport, aggregation, and processing are challenging.

A second challenge is maintaining data security. Biomedical data can be personally identifiable, may constitute important trade secrets, and can be expensive to produce. Furthermore, human biomedical data is often immutable, as is the case with genetic information. These factors make securing this type of data imperative and urgent. Here we address best practices to achieve security, with a focus on practicality and scalability. We also address how to obtain usable, rich metadata from the collected data, a major challenge in the biomedical field because of the use of fragmented and proprietary formats. We detail tools and strategies for extracting metadata from biomedical scientific file formats and show how this underutilized metadata plays a key role in creating labeled data for use in the training of neural networks.




Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol


Cite this protocol

Fracchia, C. (2021). Secure and Scalable Collection of Biomedical Data for Machine Learning Applications. In: Cartwright, H. (eds) Artificial Neural Networks. Methods in Molecular Biology, vol 2190. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0826-5_16


  • DOI: https://doi.org/10.1007/978-1-0716-0826-5_16


  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-0825-8

  • Online ISBN: 978-1-0716-0826-5

  • eBook Packages: Springer Protocols
