This repository contains programs implementing the trio-binning assembly method published by Koren et al.
Requirements:
- kmc (for the find-unique-kmers script only)
- Python
A binary package compiled for 64-bit Linux can be downloaded from the release page. Just download the three executables from the latest release and you're good to go!
To compile from source, you'll need the following dependencies:
- rust (curl https://sh.rustup.rs -sSf | sh)
- clang (on Ubuntu, you can install this with sudo apt-get install clang)
- liblzma (sudo apt-get install liblzma-dev)
Then, run the following commands to download and compile:
git clone https://github.com/esrice/trio_binning.git
cd trio_binning
cargo build --release
There will now be binaries in target/release.
find-unique-kmers is a script that uses the program kmc to find k-mers that
are unique to one of the two haplotypes. Here is an example of its usage to find
unique 21-mers in short read files from a hybrid's mother and father using 8
threads:
find-unique-kmers -k 21 -p 8 mother_R1.fastq.gz,mother_R2.fastq.gz \
father_R1.fastq.gz,father_R2.fastq.gz
After running this command, there will be a plain-text list of 21-mers unique to
the maternal genome in hapA_only_kmers.txt and the same for the paternal
genome in hapB_only_kmers.txt. The working directory will also contain kmc
databases for both the total set of 21-mers for each parent and the unique sets.
The number of unique k-mers for each haplotype is written to STDERR.
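To make the idea of "unique" k-mers concrete, here is a minimal in-memory sketch of the set difference involved. It is an illustration only, not how find-unique-kmers or kmc is implemented: the function names are hypothetical, the use of canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption, and no count thresholds are applied to filter out sequencing errors.

```python
def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = kmer.translate(complement)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Return the set of canonical k-mers in seq, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            found.add(canonical(window))
    return found


def unique_kmers(mother_reads, father_reads, k):
    """Return (k-mers seen only in the mother, k-mers seen only in the father)."""
    mother = set().union(*(kmers(read, k) for read in mother_reads))
    father = set().union(*(kmers(read, k) for read in father_reads))
    return mother - father, father - mother


# Toy example with k=3; real data would be far too large for Python sets,
# which is why a dedicated k-mer counter is used in practice.
hap_a_only, hap_b_only = unique_kmers(["AAAC"], ["AAAG"], 3)
print(sorted(hap_a_only), sorted(hap_b_only))
```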
Once you've got lists of k-mers unique to the maternal and paternal genomes, you can use these to classify reads from the offspring into maternal and paternal haplotypes using the program classify_by_kmers, like so:
classify_by_kmers -a hapA_only_kmers.txt -A classified/maternal \
-b hapB_only_kmers.txt -B classified/paternal \
-u offspring.fastq.gz -U classified/unclassified -i input_reads.bam
This will leave you with three files in the classified directory: paternal.fq, maternal.fq, and unclassified.fq. The input read format is flexible: you can give this program reads in fasta or fastq format, gzipped or not gzipped, or even a bam file.
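As a rough picture of what k-mer-based classification means, the sketch below counts how many of a read's k-mers fall in each haplotype-specific set and assigns the read to whichever set gets more hits. This is a sketch under stated assumptions, not the logic of classify_by_kmers itself: the function name is hypothetical, the simple-majority rule and the handling of ties are assumptions, and the real program may weight or normalize the counts differently.

```python
def classify_read(seq, hap_a_kmers, hap_b_kmers, k):
    """Assign a read to 'maternal', 'paternal', or 'unclassified' by k-mer hits.

    hap_a_kmers and hap_b_kmers are sets of canonical k-mers (assumption:
    the lexicographically smaller of each k-mer and its reverse complement).
    """
    complement = str.maketrans("ACGT", "TGCA")
    seq = seq.upper()
    a_hits = 0
    b_hits = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        key = min(window, window.translate(complement)[::-1])
        a_hits += key in hap_a_kmers
        b_hits += key in hap_b_kmers
    # Assumed rule: simple majority; ties and reads with no hits stay unclassified.
    if a_hits > b_hits:
        return "maternal"
    if b_hits > a_hits:
        return "paternal"
    return "unclassified"


hap_a = {"AAC"}  # toy stand-in for the contents of hapA_only_kmers.txt
hap_b = {"AAG"}  # toy stand-in for the contents of hapB_only_kmers.txt
for read in ["AAACAAAC", "AAAG", "AAAA"]:
    print(read, classify_read(read, hap_a, hap_b, 3))
```

Because the lookup uses canonical k-mers, a read sequenced from the opposite strand is classified the same way as its reverse complement.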
After you have performed the initial contig assembly, you may have short reads that you want to bin by haplotype as well, such as Hi-C reads for scaffolding or RNA-seq reads for annotation. Now that you have a reference genome, you no longer need to use k-mers to determine which reads came from which haplotype; you can instead align all reads to both haplotypes and then classify them based on which haplotype they align to best. This package contains a program for that as well. First, align the reads to both haplotypes using an aligner such as bwa and sort the alignments by read name:
bwa mem maternal_ref.fa short_reads_R1.fq.gz short_reads_R2.fq.gz \
| samtools view -bh - | samtools sort -n - > short_reads.maternal_ref.bam
bwa mem paternal_ref.fa short_reads_R1.fq.gz short_reads_R2.fq.gz \
| samtools view -bh - | samtools sort -n - > short_reads.paternal_ref.bam
N.B. If you do any PCR duplicate removal or anything else that could remove some alignments from one bam file but not the other, do that after the classification step.
If you used high-error long reads (e.g., PacBio CLR and not CCS), you should polish the contigs with long reads using a program such as Arrow before aligning short reads to them, as improving base accuracy of contigs will greatly improve the quality of the alignments.
Then, use the classify_by_alignment program to classify each read into a
haplotype based on which assembly it aligns to best:
classify_by_alignment --hapA-in short_reads.maternal_ref.bam \
--hapA-out maternal_classified.bam \
--hapB-in short_reads.paternal_ref.bam \
--hapB-out paternal_classified.bam
This command will result in a file maternal_classified.bam containing the
reads classified to the maternal haplotype, aligned to the maternal reference,
and a file paternal_classified.bam containing reads classified to the
paternal haplotype, aligned to the paternal reference. You can then post-process
these bams and use them as input to your scaffolding or annotation pipelines.
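The sketch below illustrates why both bam files need to be sorted by read name and contain the same set of reads: the two inputs are walked in lockstep and each read goes to the haplotype where it scored higher. It is a simplified illustration, not the implementation of classify_by_alignment: the function name, the use of a single numeric score per read (for example, an aligner-reported alignment score), and the handling of ties are all assumptions, and real bam files carry multiple records per read pair.

```python
from itertools import zip_longest


def classify_by_score(hap_a_records, hap_b_records):
    """Compare per-read scores from two identically ordered alignment streams.

    Each input is an iterable of (read_name, score) tuples sorted by read
    name. Returns a dict mapping read_name to 'hapA', 'hapB', or 'tie'.
    """
    assignments = {}
    for rec_a, rec_b in zip_longest(hap_a_records, hap_b_records):
        # If a read is missing from one input (e.g. duplicates were removed
        # from only one file), the lockstep comparison breaks down.
        if rec_a is None or rec_b is None or rec_a[0] != rec_b[0]:
            raise ValueError("inputs must contain the same reads in the same order")
        name, score_a = rec_a
        _, score_b = rec_b
        if score_a > score_b:
            assignments[name] = "hapA"
        elif score_b > score_a:
            assignments[name] = "hapB"
        else:
            assignments[name] = "tie"  # assumed handling; left unassigned here
    return assignments


maternal = [("read1", 100), ("read2", 40), ("read3", 70)]
paternal = [("read1", 90), ("read2", 60), ("read3", 70)]
print(classify_by_score(maternal, paternal))
```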
- Koren et al. (2018). "Complete assembly of parental haplotypes with trio binning." Nature Biotechnology, published online 22 October 2018.
- Rice et al. (2020). "Continuous chromosome-scale haplotypes assembled from a single interspecies F1 hybrid of yak and cattle." GigaScience 9(4):giaa029