Background & Summary

Bacillus is a genus of Gram-positive, rod-shaped bacteria that are widely distributed in soil, water, and other diverse environments. Bacillus species have been extensively studied for their potential to produce secondary metabolites (SMs), which have a wide range of functions and activities, and are being harnessed in various fields, such as agriculture, biotechnology, and medicine1,2. Several studies have reported that Bacillus and related genera produce secondary metabolites, an ability conferred by the presence of biosynthetic gene clusters3,4,5.

Secondary metabolite biosynthetic gene clusters (smBGCs) are genomic regions containing two or more genes involved in the biosynthetic pathway of secondary metabolites. These genes encode enzymes, transport proteins, regulatory factors, and other accessory proteins that contribute to the secondary metabolite biosynthetic process6. The composition and structures of smBGCs can vary widely across and even within the same species. The importance and feasibility of exploring species-specific BGCs have been recently highlighted7,8. Many bioinformatics tools have been developed to predict, identify, and characterize smBGCs9, which require high quality genome sequences10. The development of sequencing technologies has made whole genome sequencing simpler and faster. In particular, the integration of high throughput sequencing (short-read) and long-read sequencing data, can lead to high quality assemblies of genomes, including complete genomes11.

In this study, we performed whole genome sequencing for strains collected from different countries and regions spanning four different continents (Fig. 1), based on an integrated approach, including Oxford Nanopore long-read sequencing and MGI short-read sequencing. Here, we sequenced and assembled 121 genomes using this approach. An outline of the study’s experimental and analysis design is presented in Fig. 2, and detailed descriptions of the workflow are provided in the methodology sections. According to the completeness criteria of the National Center for Biotechnology Information (NCBI), we produced, in total, 62 assemblies at a complete genome level, 2 at chromosome level, and a remaining 57 at contig level (Supplementary Table 1 for details). Overall, the genome sizes range from 3.50 Mb to 7.15 Mb (5.09 Mb on average), with a GC content ranging from 34.50% to 54.00% (40.19% on average). The base accuracy of each assembly was assessed using yak12, and the quality value (QV) ranges from 41.13 to 69.52 (56.41 on average). Based on NCBI PGAP13, an average of 5,119 genes, including 4,851 protein-coding genes were annotated in the genomes (Table 1). Taxonomic analysis showed that these 121 genomes could be classified into 16 genera within the Bacillales order, most of which were species from the Bacillus genus (Fig. 3). (Supplementary Table 2).

Fig. 1
figure 1

Distribution of sample collection site coordinates depicted using OpenStreetMap.

Fig. 2
figure 2

Illustration of Genome assembly and BGC analysis. (a) Strategy for sequencing and genome assembly, (b) the BGC analysis pipeline.

Table 1 Summary of general genome information for each genus.
Fig. 3
figure 3

A phylogenetic tree of the 121 genomes with plasmid content and BGC class and count indicated.

To assess the potential for secondary metabolite production in these isolates, the genome mining tool BGCFlow14 was applied for BGC identification and annotation, resulting in a total of 1,176 BGCs predicted. The BGCs were categorized into seven classes through BiG-SCAPE15, part of the BGCFlow executable, which showed that RiPPs have the greatest count of 381 and comprise the highest percentage at 32.4% (Supplementary Table 3). The distribution of BGC counts per genus highlights the uneven abundance of BGCs between the distinct genera (Fig. 4). Notably, the genera Bacillus and Paenibacillus harbor the highest number of BGCs among the genomes presented here.

Fig. 4
figure 4

The number of BGCs in each genus of the 121 genomes.

To assess the novelty of the BGCs found in the dataset, the sequence similarity network of the BGCs constructed using BiG-SCAPE was further enriched with the top KnownClusterBlast hit to the known entries from the MIBiG database16, resulting in 283 connected components of gene cluster families (GCFs). Using this analysis, we can group 310 BGCs (26%) into 27 GCFs with high similarity to 37 known compounds produced by Bacillus or related genera, such as bacilysin, surfactin, and subtilosin (Figure S1A, Supplementary Table 4a,b). Meanwhile, 435 BGCs (37%) can be grouped into 55 GCFs with medium to low similarity to 59 known compounds (Figure S1B, Supplementary Table 4c,b). Almost half of the known compound hits in this category (29 hits) are also produced by Bacillus and related genera, hinting that these GCFs might produce analogs or compounds of similar types. The remaining 30 compounds are known to be produced by other distantly related genera. This is likely due to the limitation of the database and suggest further detailed BGC comparison to identify possible products. Finally, the remaining 431 BGCs (36%) can be grouped into 201 GCFs without any hits to known BGCs in the database (Figure S1C, Supplementary Table 4e,f). While some of the GCFs in this category are quite conserved in some clades, a third of the GCFs are singletons. These unknown categories hold the potential for further experimentation to find novel compounds from the Bacillales group. A detailed list of the GCF assignments can be found in Supplementary Table 4.

The datasets and genomic analysis results described here greatly expand the reported genomic information of spore-forming Bacillales and will also strengthen studies advancing our understanding of the secondary metabolite potential of the Bacillales order.

Method

Sample collection and isolation

Sample collection was dependent on the isolating laboratory. Using soil samples collected at diverse locations in Germany, Denmark, China, and Mexico, spore-forming soil bacteria were isolated after heat treatment at 80 °C for 10 minutes and spreading the soil suspension on lysogeny broth (LB) or tryptic soy broth (TSB) plates with 1.5% agar that were incubated at 37 °C for 2 days.

Bacillus altitudinis J6-1 and J6-2 were isolated from a biofilm sample obtained from the pier at Jyllinge Harbour (55.744923; 12.094888). Biofilm samples were incubated at 80 °C for 15 mins and subsequently plated on LB agar and incubated at 25 °C.

Other marine samples were collected from the Cochin estuary and adjacent coastal waters (South-west coast of India), during pre-monsoon (March), monsoon (August) and post-monsoon (December) periods of the year 2012 and 2013. Water samples were serially diluted and spread on Norris Glucose Nitrogen free medium (NGNF medium, HIMEDIA-M712) with 1.5% agar (Himedia GRM 666) and incubated at 28 ± 1 °C for 7–14 days. Separated colonies with different morphologies were picked using a sterile inoculation loop, re-streaked and maintained on the slants of fresh nitrogen free culture medium at 4 °C. Cell morphology and presence of endospore was analyzed by light microscopy (Olympus CX21i). Rod shaped endospore forming isolates were selected for this study.

Isolate Mi106 D2 head1 chi was obtained from the head of a worker termite from a colony of Microtermes sp. and Mn106-1 head2 chi was obtained from the head of a worker from a colony of Macrotermes natalensis in Mookgophong, South Africa (S24 40 30.5 E28 47 50.4) in 2010. In both cases, the surface of a worker termite was rinsed using phosphate buffer saline (PBS). Subsequently, the head of the termite was crushed in 200 µl PBS, which was subsequently spread onto chitin medium (4 g chitin, 0.7 g K2HPO4, 0.3 g KH2PO4, 0.5 g MgSO4 × 5H2O, 0.01 g FeSO4 × 7H20, 0.001 g ZnSO4, 0.001 g MnCl2, and 20 g of agar per liter). Growing colonies on plates were streaked onto Yeast Malt Extract Agar medium (4 g yeast extract, 10 g malt extract, 4 g D-glucose and 20 g bacteriological agar per liter), and once in pure culture, stored in 10% glycerol at −20 °C. Isolate 11B was obtained using the same approach on a fragment from a fungus garden of a Macrotermes natalensis colony collected in Rietondale, South Africa (S25 43 45.6 E28 14 09.9) in 2010.

Strains GT4_IS1 and MW2_IS1 were previously isolated from the uropygial glands of Great tits (Parus major) from Denmark and Czechia respectively17.

In each case, observed colonies were re-streaked to obtain single colonies, and subsequently stored at −80 °C with 28% glycerol added. To obtain primary information about these strains, colony PCR was employed to amplify the 16S rRNA gene. Strains that exhibited low similarity and distant branches in the 16S rRNA phylogenetic tree were selected for further study.

Genomic DNA (gDNA) extraction

For genomic DNA (gDNA) extraction, a pure single colony of each isolate was inoculated in 5 ml of LB and incubated at 37 °C for more than 12 hours. Then gDNA was extracted using E.Z.N.A. DNA extraction kits (OMEGA Bio-Tek Inc., Norcross, GA, USA) following the manufacturer’s instructions. The quality and quantity of gDNA were assessed using agarose gel electrophoresis and Nanodrop (Thermo Fisher Scientific, MA, USA), to guarantee that the integrity, concentration, and purity met the requirements for library construction and sequencing.

Short-read sequencing on MGI platform

For each strain, 300 ng gDNA was used for short-read sequencing library construction according to MGI paired-end libraries construction protocol18. Briefly, gDNA was fragmented to 200–300 bp using segmentase followed by fragment selection with VAHTS™ DNA Clean Beads (Vazyme, Nanjing, Jiangsu, China). Subsequently, end repair, A-tailing reactions and adapter ligation were implemented. After PCR and purification, the concentration of each library was determined using Qubit® dsDNA HS Assay Kit (Thermo Fisher Scientific) as quality control. The qualified libraries were sequenced on the DNBSEQ-G400 (MGI Tech Co., Ltd.) platform according to the manufacturer’s instructions to generate paired end reads (150 bps at each end).

Long-read sequencing on Oxford nanopore platform

For Oxford Nanopore sequencing, the libraries were prepared using the SQK-RBK110.96 barcoding kit (Oxford Nanopore Technologies, Oxford, UK) starting from 50 ng DNA for each strain. In brief, each sample was fragmented and ligated by a unique rapid barcode with incubation at 30 °C for 2 minutes and then at 80 °C for 2 minutes, then all barcoded samples were pooled together in a 1:1 ratio and purified by SPRI beads. After ligation of 1 µl of Rapid Adapter F (RAP F) to 11 µl of pooled DNA, the final library was quantified using Nanodrop. The ONT library was loaded into the MinION spot-on Flow Cell (R9 Version) and sequenced on a MinION Mk1B device according to standard protocol. The resulting reads were basecalled and demultiplexed with MinKNOW UI v.4.1.22.

Genome assembly

For de novo assembly, the MGISEQ paired end short reads were adapter and quality trimmed using fastp v.0.22.0 and the Nanopore long reads were adapter trimmed using porechop v.0.2.1, using standard settings19,20. The trimmed long reads from Nanopore were assembled with flye v.2.9.1-b1780, and subsequently the trimmed reads from both platforms and the long-read assembly were hybrid assembled with Unicycler v.0.5.0 using the –existing_long_read_assembly option21,22. The completeness and contamination level of each genome were assessed using CheckM v.1.2.223 with the command ‘checkm lineage_wf <genomes folder>  <output folder> ,’ which places each genome phylogenetically before choosing the set of single-copy conserved genes to evaluate by the completeness defined as the proportion of markers present, and the contamination defined as the proportion of markers present in multiple copies (see details in Supplementary Table 5). Yak12 was used to assess quality values of each genome following the protocol24. To account for the Bacillales genome size of ~6 Mbp, the K value was changed to 6 M.

Genome annotation, taxonomic analysis and BGC prediction

The genomes of the 121 isolates were taxonomically classified and gene-annotated in a two-step process. Initially, we employed GTDB-Tk v2.1125, using the ‘classify_wf’ command, to preliminarily assign taxonomic classifications to the FASTA format genomes. Subsequently, these genomes were uploaded to the NCBI GenBank database, where they were annotated using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Following this, we conducted a comprehensive analysis of the annotated genomes using BGCFlow v0.7.1. This tool integrates multiple genome mining and phylogenetic tools into one pipeline14. To set up the analysis, we created a folder containing the project configuration structure as defined by BGCFlow Portable Encapsulated Project (PEP) specification26. The designated project folder contains a comma separated sample file which contains the NCBI-assigned GenBank accession numbers of the 121 de novo assembled genomes and the PEP configuration file for the BGCFlow run. The YAML configuration file for the project was configured to enable GTDB-Tk and autoMLST wrapper for phylogenetic tree construction, antiSMASH27 for BGC annotation, and BiG-SCAPE15 for BGC dereplication into gene cluster families (GCFs) and generating summary tables. The resulting GCFs were then visualized using Cytoscape version 3.10.228. BGCFlow was executed using standard settings, which include KnownClusterBlast search against the MIBiG database16 for known BGCs.

We conducted a non-exhaustive search for plasmids within our de novo assembled genomes by identifying contiguous sequences (contigs) as plasmids if they were circular and if RFPlasmid29 (v.0.0.18), an open-source software that classifies contigs as plasmid or chromosomal based on the presence of marker genes and k-mers, classified them as plasmids. Due to the incomplete assembly of several genomes, which resulted in the presence of linear fragments, the absence of any plasmid identified by this method does not necessarily indicate their true absence.

Data Records

The sample information and assembled genomes were deposited in NCBI BioProject under PRJNA96071130, and raw reads of long-read sequencing on Nanopore and short-read sequencing on MGISEQ have been deposited in NCBI Sequence Read Archive (SRA) under SRP48516731 (Supplementary Tables 1, 6 for accession and other details).

Technical Validation

In this study, the main steps of experimental procedures and data analysis have been validated. For short-read sequencing on MGI, the libraries were quantified with a minimum of 10 ng/μl. For de novo assembly, default parameters were used for quality trimming. In brief, after filtering, an average of 2.69 G MGI reads (0.66 G-6.52 G, PE150) and 76,507 Nanopore reads with mean N50 of 6,709 bp (1,777bp-13,698 bp) for each sample were generated (Supplementary Table 7). CheckM was used for validation of the genome completeness and contamination.