XHap: Haplotype assembly using long-distance read correlations learned by transformers

About

XHap is a transformer-based framework (transformers originate in natural language processing) for diploid and polyploid haplotype assembly. It can reconstruct haplotypes from short reads (e.g., Illumina, Roche), long reads (e.g., PacBio, ONT), or a combination of the two.

The current implementation of XHap uses Python 3 and PyTorch, with PyTorch ROCm for AMD GPUs. The CPU implementation is in xhap.py and the GPU implementation is in xhap_parallel.py.

Dependencies

PyTorch >= 1.10
PyTorch ROCm >= 2.0.1 (to use AMD GPUs)
Numpy
Scipy
C++
Samtools
MAFFT

Where possible, additional dependencies have been included in the GitHub repository.

Assumed directory structure

All the scripts included in this repository assume that the XHap source code is stored in the current working directory and the data files are stored in a subdirectory [data] of the directory generate_data. Consequently, the resulting data files can be found in generate_data/[data].

Note: To use a different layout, replace generate_data with the desired folder in the respective scripts.
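
The layout described above can be expressed as a small path helper. This is only a sketch: the function names dataset_dir and result_paths are hypothetical and do not exist in the repository, and the dataset name is whatever subdirectory you created under generate_data.

```python
from pathlib import Path

# Folder name taken from this README; change it if you relocated the data.
DATA_ROOT = Path("generate_data")


def dataset_dir(name: str, root: Path = DATA_ROOT) -> Path:
    """Return generate_data/<name>, relative to the current working directory."""
    return root / name


def result_paths(name: str, root: Path = DATA_ROOT) -> dict:
    """Return the expected locations of the two result files for a dataset."""
    d = dataset_dir(name, root)
    return {
        "model": d / "xhap_model",
        "results": d / "haptest_xformer_res.npz",
    }
```

Run this from the directory that holds the XHap source code so that the relative paths resolve as the scripts expect.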

Input

The provided pipeline for XHap takes a tab-separated file containing a read-SNP matrix as input. Details on how to obtain this matrix can be found in the generate_data folder.
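
To sanity-check an input file before training, it can be loaded with NumPy. The sketch below rests on assumptions that this README does not confirm: one row per read, one column per SNP, integer allele codes, and 0 marking a position the read does not cover. Check the generate_data folder for the actual encoding. Both function names are hypothetical.

```python
import numpy as np


def load_read_snp_matrix(path):
    """Load a tab-separated read-SNP matrix as a 2-D integer array."""
    # ndmin=2 keeps a single-read file two-dimensional.
    return np.loadtxt(path, delimiter="\t", dtype=int, ndmin=2)


def coverage_per_snp(matrix):
    """Count the reads covering each SNP (assumes 0 means 'not covered')."""
    return (matrix != 0).sum(axis=0)
```

A SNP column with zero coverage cannot be assigned to any haplotype, so it is worth checking for such columns first.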

Output

Each round of training XHap yields results stored in the following files saved in the corresponding data directory:

xhap_model: Stores the state_dict for both the convolutional (embedAE) and transformer encoder (corr_xformer) layers in XHap.
haptest_xformer_res.npz: NPZ file storing the reconstructed haplotypes (rec_hap), the read attributions (rec_hap_origin) and, if applicable, the ground truth haplotypes (true_hap).
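
The results file can be read back with numpy.load. The following is a minimal sketch; the helper name load_xhap_results is hypothetical, and the array keys are the ones listed above. It assumes the arrays are plain numeric arrays. If they were saved as object arrays, numpy.load would need allow_pickle=True.

```python
import numpy as np


def load_xhap_results(path):
    """Read the XHap result arrays from an NPZ file into a dict."""
    with np.load(path) as res:
        out = {
            "rec_hap": res["rec_hap"],
            "rec_hap_origin": res["rec_hap_origin"],
        }
        # true_hap is stored only when ground truth was available.
        out["true_hap"] = res["true_hap"] if "true_hap" in res.files else None
    return out
```

For example, load_xhap_results("generate_data/[data]/haptest_xformer_res.npz")["rec_hap"] gives the reconstructed haplotypes, with [data] replaced by your data subdirectory.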

Usage

The function train_xhap in xhap (or xhap_parallel) can be invoked to run XHap on the data in the folder specified by outhead. This function also takes in the following parameters:

d_model: Embedding size for each read
num_hap: Number of haplotypes (ploidy of organism)
num_epoch: Number of training epochs for XHap
check_cpr: Set to True if the ground truth haplotypes are available
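
A call might look like the sketch below. It assumes that train_xhap accepts the parameters listed above as keyword arguments, which this README does not state explicitly, so check the signature in xhap.py first. The wrapper names make_train_args and run_xhap are hypothetical, and the default values are placeholders, not recommended settings.

```python
def make_train_args(outhead, d_model=128, num_hap=2, num_epoch=100, check_cpr=False):
    """Collect the documented train_xhap parameters into a keyword dict.

    The defaults here are illustrative placeholders only.
    """
    if num_hap < 2:
        raise ValueError("num_hap is the ploidy and must be at least 2")
    return {
        "outhead": outhead,
        "d_model": d_model,
        "num_hap": num_hap,
        "num_epoch": num_epoch,
        "check_cpr": check_cpr,
    }


def run_xhap(use_gpu=False, **kwargs):
    """Call train_xhap from xhap (CPU) or xhap_parallel (GPU)."""
    # Imported lazily: both modules live in the XHap repository, which must
    # be the current working directory.
    if use_gpu:
        from xhap_parallel import train_xhap
    else:
        from xhap import train_xhap
    # Assumption: train_xhap accepts these names as keyword arguments.
    return train_xhap(**make_train_args(**kwargs))
```

For a diploid dataset with known ground truth, this would be invoked as run_xhap(outhead="my_dataset", num_hap=2, check_cpr=True), where "my_dataset" is a hypothetical folder name.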

Included scripts

There are several Python scripts included to run XHap end-to-end -- from data generation through haplotype assembly. These scripts can also be used to replicate the experiments described in the associated manuscript.

run_expt.py: Running experiments on semi-experimental data (includes data generation)
run_expt_real.py: Running experiments on experimental data (assumes data processing has been done)
run_indel_expt.py: Running experiments to validate indel detection included in the XHap pipeline

Citation

If you use this software, please cite:

Consul, S., Ke, Z., & Vikalo, H. (2023). XHap: haplotype assembly using long-distance read correlations learned by transformers. Bioinformatics Advances, 3(1), vbad169.

The corresponding BibTeX entry is:

@article{consul2023xhap,
  title={XHap: haplotype assembly using long-distance read correlations learned by transformers},
  author={Consul, Shorya and Ke, Ziqi and Vikalo, Haris},
  journal={Bioinformatics Advances},
  volume={3},
  number={1},
  pages={vbad169},
  year={2023},
  publisher={Oxford University Press}
}
