Addition of bbsplit for filtering of genomic contaminants #1850

apsteinberg · 2025-03-26T15:43:57Z

Addition of bbsplit for filtering of genomic contaminants. Current issues:

Need help with nextflow_schema.json.
FASTQ files get split up, which results in empty files after FASTP. This creates an issue for bbsplit.
Needs linting and documentation.
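
One possible guard for the empty-file issue is to check whether a gzipped FASTQ chunk actually contains data before it reaches bbsplit. A gzipped empty file still has a non-zero size on disk (header and trailer), so a plain size check is not enough. The sketch below is plain Groovy; `hasReads` is a hypothetical helper name and is not part of this PR.

```groovy
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Hypothetical helper (not in the PR): true if the gzipped file
// decompresses to at least one byte, false for an empty gzip stream.
boolean hasReads(File fastqGz) {
    new GZIPInputStream(new FileInputStream(fastqGz)).withStream { gz ->
        gz.read() != -1
    }
}

// Usage: build one empty and one non-empty gzipped file and check both
def emptyFile = File.createTempFile('empty', '.fastq.gz')
new GZIPOutputStream(new FileOutputStream(emptyFile)).close()

def readFile = File.createTempFile('reads', '.fastq.gz')
new GZIPOutputStream(new FileOutputStream(readFile)).withStream { out ->
    out.write("@r1\nACGT\n+\nIIII\n".bytes)
}

println "empty chunk has reads: ${hasReads(emptyFile)}"
println "real chunk has reads:  ${hasReads(readFile)}"
```

In a channel, a check like this could back a filter step placed between FASTP and bbsplit, assuming the read files are available as paths at that point.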


nf-core-bot · 2025-03-26T16:07:27Z

Warning

A newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

tobsecret · 2025-04-02T18:31:46Z

Tests are passing now!

FriederikeHanssen · 2025-04-02T20:37:59Z

main.nf

@@ -61,6 +61,9 @@ params.vep_genome              = getGenomeAttribute('vep_genome')
 params.vep_species             = getGenomeAttribute('vep_species')

 aligner = params.aligner
+skip_bbsplit = params.skip_bbsplit


To stay aligned with the rest of the pipeline, can this be part of skip_tools, please? This sounds like bbsplit is always run. Given how unstable it is, I would prefer it to be added to the tools that can be selected instead.

This was introduced so that bbsplit is not run by default, i.e. users have to specify --skip_bbsplit false. I guess we should have named this flag --run_bbsplit instead?
Of course, we could produce the same logic with skip_tools and specify that if you provide bbsplit to --skip_tools it's actually not skipped, but that seems counterintuitive.
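
For comparison, the opt-in alternative suggested above would gate bbsplit on membership in the comma-separated --tools value. A minimal sketch in plain Groovy, assuming bbsplit were registered as a tools entry; the helper name `toolSelected` is hypothetical and only illustrates the membership check:

```groovy
// Hypothetical helper: true when `name` appears in a comma-separated tool list.
// A null or empty list means no optional tool was requested.
boolean toolSelected(String tools, String name) {
    if (!tools) return false
    return tools.split(',').collect { it.trim() }.contains(name)
}

// Opt-in semantics: bbsplit runs only when explicitly requested via --tools
println toolSelected('strelka,bbsplit', 'bbsplit')
println toolSelected(null, 'bbsplit')
```

With this shape, no inverted flag such as --skip_bbsplit false is needed, because the default (nothing requested) already means bbsplit does not run.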

FriederikeHanssen · 2025-04-02T20:38:41Z

main.nf

+bbsplit_fasta_list = params.bbsplit_fasta_list
+bbsplit_index = params.bbsplit_index


Please align these. Will the user need to provide the same FASTA twice, once in --fasta and once here?

FriederikeHanssen · 2025-04-02T20:40:45Z

main.nf

@@ -159,6 +165,8 @@ workflow NFCORE_SAREK {
                                    : PREPARE_GENOME.out.bwamem2
    dragmap     = params.dragmap    ? Channel.fromPath(params.dragmap).map{ it -> [ [id:'dragmap'], it ] }.collect()
                                    : PREPARE_GENOME.out.hashtable
+    // get index from bbsplit
+    bbsplit_index           = PREPARE_GENOME.out.bbsplit_index


If the indices are provided and not newly computed, should they not be assigned here?
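
The pattern being asked for is the one the dragmap lines in the hunk already follow: use the user-supplied path when the parameter is set, otherwise fall back to the freshly built index. A minimal standalone sketch, assuming `params.bbsplit_index` points to a pre-built index directory; the `ch_built_index` channel is a stand-in for `PREPARE_GENOME.out.bbsplit_index` so the script can run on its own:

```groovy
// Sketch only: not the PR's code.
params.bbsplit_index = null

workflow {
    // Stand-in for PREPARE_GENOME.out.bbsplit_index (assumption for illustration)
    ch_built_index = Channel.value('freshly_built_index')

    // Prefer a provided index, fall back to the computed one
    bbsplit_index = params.bbsplit_index
        ? Channel.fromPath(params.bbsplit_index).collect()
        : ch_built_index

    bbsplit_index.view()
}
```

Whether the real channel should carry a meta map, as the dragmap one does, depends on what the bbsplit module expects and is left open here.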

FriederikeHanssen · 2025-04-02T20:45:46Z

subworkflows/local/prepare_genome/main.nf

@@ -127,6 +164,7 @@ workflow PREPARE_GENOME {
    known_indels_tbi         = TABIX_KNOWN_INDELS.out.tbi.map{ meta, tbi -> [tbi] }.collect()           // path: {known_indels*}.vcf.gz.tbi
    msisensorpro_scan        = MSISENSORPRO_SCAN.out.list.map{ meta, list -> [list] }                   // path: genome_msi.list
    pon_tbi                  = TABIX_PON.out.tbi.map{ meta, tbi -> [tbi] }.collect()                    // path: pon.vcf.gz.tbi
+    bbsplit_index            = params.skip_bbsplit ? Channel.empty() : ch_bbsplit_index  // Conditional emission                                                         // channel: path(bbsplit/index/)


This should really be handled in the main workflow by only running prepare genomes if necessary

tobsecret · 2025-04-02T20:45:50Z

Thanks @FriederikeHanssen I'll try and get to this this week!

FriederikeHanssen · 2025-04-02T20:47:50Z

subworkflows/local/prepare_genome/main.nf

+        } else {
+            Channel
+                .from(file(bbsplit_fasta_list))
+                .splitCsv() // Read in 2 column csv file: short_name,path_to_fasta


Can you use the samplesheet validation from nf-schema here? It comes with a bunch of neat utility functions, and we have a schema to validate the file before the workflow starts.
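
A rough sketch of what that could look like, using the `samplesheetToList` function from the nf-schema plugin in place of the manual `splitCsv()` call. The schema file name is hypothetical, and the sketch assumes the CSV has a header row naming its two columns (short name and FASTA path), which the schema would then describe:

```groovy
include { samplesheetToList } from 'plugin/nf-schema'

workflow {
    // Hypothetical schema path: a JSON schema describing the two columns.
    // The file is validated against it when the list is built.
    ch_bbsplit_fasta = Channel.fromList(
        samplesheetToList(
            params.bbsplit_fasta_list,
            "${projectDir}/assets/schema_bbsplit_fasta_list.json"
        )
    )

    ch_bbsplit_fasta.view()
}
```

This assumes the nf-schema plugin is already enabled in the pipeline configuration.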

FriederikeHanssen · 2025-04-02T20:50:38Z

workflows/sarek/main.nf

+    */
+
+    // Check if file with list of fastas is provided when running BBSplit
+    if (!params.skip_bbsplit && !params.bbsplit_index && params.bbsplit_fasta_list) {


If you use nf-schema, we can outsource all of this to the plugin as well.

FriederikeHanssen

Thank you for adding this!!

I added some comments in the code. In addition, before we can approve, can you please add:

The tool to the readme overview
Usage docs, if anything particular is to be said about that
Metro map
Overview map
All new outputs should be described in the output.md
workflow level nf-test with snapshots

tobsecret · 2025-04-02T21:13:41Z

Great, thanks for the helpful review! I'll try and get to most of those points this or next week.

workflow level nf-test with snapshots

Regarding the tests, how do we approach that? Is that just rerunning nf-test test tests/main.nf.test --update-snapshot, or do we have to actually write new test cases?

FriederikeHanssen · 2025-04-02T21:46:09Z

is that just rerunning nf-test test tests/main.nf.test --update-snapshot or do we have to actually write new test cases?

Please add a new test case. We can then trigger it only when relevant files for bbsplit have changed, and bbsplit is not part of the default execution.
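
A new pipeline-level test case could look roughly like the sketch below. It uses the `skip_bbsplit` and `bbsplit_fasta_list` parameters from this PR as they currently stand; the test name, the tag, and the CSV path are hypothetical and would need to match whatever the final implementation uses:

```groovy
nextflow_pipeline {

    name "Test pipeline with BBSplit"
    script "../main.nf"
    tag "bbsplit"

    test("Run with BBSplit enabled") {

        when {
            params {
                // Hypothetical values: adjust to however BBSplit ends up being enabled
                skip_bbsplit       = false
                bbsplit_fasta_list = "${projectDir}/tests/csv/bbsplit_fasta_list.csv"
                outdir             = "$outputDir"
            }
        }

        then {
            assertAll(
                { assert workflow.success },
                // Snapshot the number of successful tasks; stable outputs can be added later
                { assert snapshot(workflow.trace.succeeded().size()).match() }
            )
        }
    }
}
```

Keeping this in its own file with its own tag would make it straightforward to run only when bbsplit-related files change.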

Co-authored-by: Friederike Hanssen <friederike.hanssen@seqera.io>

maxulysse reviewed Mar 27, 2025

conf/iris.config (outdated)

maxulysse reviewed Mar 27, 2025

subworkflows/local/prepare_bbsplit/main.nf (outdated)

maxulysse reviewed Mar 27, 2025

workflows/sarek/main.nf (outdated)

maxulysse reviewed Mar 27, 2025

workflows/sarek/main.nf (outdated)

maxulysse reviewed Mar 27, 2025

workflows/sarek/main.nf (outdated)

apsteinberg added 24 commits March 27, 2025 11:37

8c6dc5c home commits
26a34c5 home commits
0ea1b08 home commits
9c8a0a6 home commits
e111111 rebasing to match dev branch
1d6de1c rebasing
5f8348f home commits
045a2eb home commits
7b1e121 home commits
8d6737e home commits
73ee506 home commits
70dd006 home commits
2653860 home commits
bf7e122 home commits
fc1baeb home commits
49809f0 home commits
9b5feff home commits
340013b home commits
284d342 home commits
a246976 home commits
8fb577c home commits
5fa3a95 initialization of bbsplit params
3c73ba1 feeding ref genome to bbsplit
f66a8cc adding gunzip module

tobsecret reviewed Apr 2, 2025

resources/bbsplit_fasta_.csv (outdated)

tobsecret reviewed Apr 2, 2025

resources/wgs_test_samplesheet.csv (outdated)

tobsecret requested changes Apr 2, 2025

.gitignore (outdated)

conf/iris.config (outdated)

bbf776e removing local testing files

tobsecret approved these changes Apr 2, 2025