MMTF PySpark

mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. mmtfPyspark uses the following technology stack:

Apache Spark: a fast and general engine for large-scale distributed data processing
MMTF: the Macromolecular Transmission Format for compact data storage, transmission, and high-performance parsing
Hadoop Sequence File: a Big Data file format for parallel I/O
Apache Parquet: a columnar data format for storing dataframes

This project is still under development.

Installation

Python

We strongly recommend installing Anaconda. Python 3.6 or later is required. To check your Python version:

python --version

If Anaconda is installed with Python 3.6, the command above should print something like:

Python 3.6.4 :: Anaconda, Inc.
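The same check can be made from inside Python, which is useful when several interpreters are installed and it is unclear which one a script will run under. The following is a minimal sketch; the function name meets_minimum is hypothetical and not part of mmtfPyspark.

```python
import sys

# Minimum version stated in this README.
MINIMUM_VERSION = (3, 6)


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    `version_info` defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print("Python {} meets the 3.6 minimum".format(current))
    else:
        print("Python {} is too old; 3.6 or later is required".format(current))
```

Comparing only the major and minor components as a tuple avoids the pitfalls of string comparison, where "3.10" would sort before "3.6".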

mmtfPyspark and dependencies

Because mmtfPyspark relies on parallel computing to achieve high performance, it requires additional dependencies such as Apache Spark. Please follow the installation instructions for your operating system carefully:

macOS and Linux

Windows

Hadoop Sequence Files

The MMTF Hadoop sequence files of all PDB structures are provided as two archives, full and reduced, which can be downloaded and unpacked with the following commands:

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
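On systems without curl or tar, the same two steps can be done with the Python standard library. The following is a rough sketch rather than a supported installer: the function names are hypothetical, and only the base URL is taken from the commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the curl commands in this README.
BASE_URL = "http://mmtf.rcsb.org/v1.0/hadoopfiles"


def archive_url(name):
    """Build the download URL for an archive name such as "full" or "reduced"."""
    return "{}/{}.tar".format(BASE_URL, name)


def extract_archive(tar_path, dest_dir):
    """Unpack a tar archive into dest_dir and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(tar_path)) as archive:
        names = archive.getnames()
        archive.extractall(str(dest))
    return sorted({name.split("/")[0] for name in names})


def download_and_extract(name, dest_dir="."):
    """Download <name>.tar into dest_dir, then unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    tar_path = dest / (name + ".tar")
    urllib.request.urlretrieve(archive_url(name), str(tar_path))
    return extract_archive(tar_path, dest)
```

Calling download_and_extract("full") or download_and_extract("reduced") mirrors the corresponding curl and tar pair. The archives cover the whole PDB, so expect a large download.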

On macOS and Linux, the following commands download the Hadoop sequence files and store their locations in environment variables. The script is sourced (note the leading dot) so that the variables are set in the current shell:

curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh
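Once the locations are available in the environment, a Python program can look them up instead of hard-coding paths. The sketch below assumes the variables are named MMTF_FULL and MMTF_REDUCED. These names are an assumption and are not stated in this README, so check download_mmtf_files.sh for the names it actually exports. The helper sequence_file_path is likewise hypothetical.

```python
import os

# Assumed variable names; verify them against download_mmtf_files.sh.
FULL_VAR = "MMTF_FULL"
REDUCED_VAR = "MMTF_REDUCED"


def sequence_file_path(var_name, environ=None):
    """Return the path stored in the environment variable `var_name`.

    Raises KeyError with a descriptive message if the variable is
    missing or empty. `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    path = env.get(var_name, "").strip()
    if not path:
        raise KeyError(
            "{} is not set; run the download script in this shell first".format(var_name)
        )
    return path
```

Failing early with a clear message is more helpful than letting a later file read fail on an empty path.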

