Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

MMTF PySpark

mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as those in the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. It is built on the following technology stack:

  • Apache Spark: a fast and general engine for large-scale distributed data processing
  • MMTF: the Macromolecular Transmission Format for compact data storage, transmission and high-performance parsing
  • Hadoop Sequence File: a Big Data file format for parallel I/O
  • Apache Parquet: a columnar data format to store dataframes

This project is still under development.

Installation

Python

mmtfPyspark requires Python 3.6 or later. We strongly recommend installing Python through Anaconda. To check your Python version:

python --version

If Anaconda is installed and provides Python 3.6, the command above should print something like:

Python 3.6.4 :: Anaconda, Inc.
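
The same check can be made from inside Python, which is handy at the top of a script or notebook. This is a minimal sketch, not part of mmtfPyspark: the helper name `python_is_supported` is hypothetical, and the only fact taken from this README is the 3.6 minimum.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if (major, minor) of version_info is at least `minimum`.

    Hypothetical helper for illustration; not part of mmtfPyspark.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is supported:", sys.version.split()[0])
    else:
        print("Python 3.6 or later is required, found:", sys.version.split()[0])
```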

mmtfPyspark and dependencies

Because mmtfPyspark relies on parallel computing for performance, it requires additional dependencies such as Apache Spark. Please read and follow the installation instructions for your OS carefully:

macOS and Linux

Windows

Hadoop Sequence Files

The MMTF Hadoop sequence files of all PDB structures can be downloaded and unpacked with the commands below. The first pair fetches the full set, the second the reduced set:

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
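
On systems without curl or tar (for example a plain Windows shell), the same download can be sketched with the Python standard library. The helper name `download_and_extract` and the `dest_dir` parameter are hypothetical; the URLs are the ones from the commands above. The archives cover all PDB structures, so expect a long download.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the curl commands above.
BASE_URL = "http://mmtf.rcsb.org/v1.0/hadoopfiles"


def download_and_extract(url, dest_dir="."):
    """Download a tar archive to dest_dir and unpack it there.

    Hypothetical helper for illustration; returns the path of the saved archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(str(archive)) as tar:
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(str(dest), filter="data")
        else:
            tar.extractall(str(dest))
    return archive


if __name__ == "__main__":
    for name in ("full.tar", "reduced.tar"):
        saved = download_and_extract(f"{BASE_URL}/{name}", dest_dir=".")
        print("Downloaded and unpacked", saved)
```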

On macOS and Linux, the following commands download the Hadoop sequence files and store their locations in environment variables. The script is sourced (note the leading dot) so that the variables are set in the current shell:

curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh
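
Once the script has run, Python code can look up the file locations from the environment. The sketch below assumes the variables are named `MMTF_FULL` and `MMTF_REDUCED`; this README does not state the names, so check the downloaded script to confirm them. The helper `mmtf_path` is hypothetical.

```python
import os

# Assumed variable names; verify them against download_mmtf_files.sh.
ENV_VARS = {"full": "MMTF_FULL", "reduced": "MMTF_REDUCED"}


def mmtf_path(kind="full", environ=None):
    """Return the directory stored in the environment variable for `kind`.

    Hypothetical helper for illustration. Raises KeyError if the variable
    is unset or empty, or if `kind` is not "full" or "reduced".
    """
    if environ is None:
        environ = os.environ
    name = ENV_VARS[kind]
    path = environ.get(name)
    if not path:
        raise KeyError(f"{name} is not set; source download_mmtf_files.sh first")
    return path


if __name__ == "__main__":
    for kind in ENV_VARS:
        try:
            print(kind, "->", mmtf_path(kind))
        except KeyError as err:
            print(kind, "-> not configured:", err)
```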
