CAR-NGS is a toolkit designed to simplify high-throughput genomic data analysis by providing user-friendly R-based frontend functions. These functions serve as an interface to automated data analysis pipelines, enabling seamless execution of genomic workflows without the need for extensive command-line expertise.
You can install the tool directly from GitHub using the following commands in R:
install.packages("devtools")
library(devtools)
install_github("https://github.com/Reproducible-Bioinformatics/CAR-NGS", ref="main")
Once installed, you can access and use the provided functions to interact with the CAR-NGS pipelines efficiently.
The sixteenS
function is an R-based frontend for executing the 16S rRNA gene sequencing analysis pipeline for microbial community profiling. This function simplifies execution by managing Docker commands directly from R, ensuring reproducibility and ease of use.
The 16S rRNA analysis pipeline consists of the following steps:
- Data Import: Converts raw paired-end FASTQ files into a QIIME2 artifact format (.qza).
- Quality Control: Summarizes sequencing read quality to assess data integrity.
- Denoising (DADA2): Filters and corrects sequencing errors, inferring true biological sequences (ASVs).
- Taxonomic Classification: Assigns taxonomy to inferred sequences using the Silva 138 pre-trained classifier.
- Visualization: Generates interactive taxonomic bar plots to summarize the microbial community composition.
- Exporting Results: Converts visualization files into HTML reports for easy access.
sixteenS(input_dir_path = "/path/to/fastq_files")
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to the directory containing paired-end FASTQ files. The file names must follow the format: sample_R1.fastq.gz and sample_R2.fastq.gz . |
Converts paired-end FASTQ files into QIIME2 format using a manifest file. The manifest file contains absolute paths to both forward (R1) and reverse (R2) reads. The imported data is stored as a .qza
file.
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.tsv \
--output-path aligned_results/sample-paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
Generates a summary report of sequence quality to identify trimming parameters. The output is a QIIME2 visualization file (.qzv
), which can be viewed interactively.
qiime demux summarize \
--i-data aligned_results/sample-paired-end-demux.qza \
--o-visualization aligned_results/sample-paired-end-demux.qzv
Removes sequencing errors and infers amplicon sequence variants (ASVs). Trims reads to 250 bp forward and reverse.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs aligned_results/sample-paired-end-demux.qza \
--p-trim-left-f 0 --p-trim-left-r 0 \
--p-trunc-len-f 250 --p-trunc-len-r 250 \
--o-table aligned_results/sample-table.qza \
--o-representative-sequences aligned_results/sample-rep-seqs.qza \
--o-denoising-stats aligned_results/sample-denoising-stats.qza
Converts the denoising statistics into a human-readable format for evaluation.
qiime metadata tabulate \
--m-input-file aligned_results/sample-denoising-stats.qza \
--o-visualization aligned_results/sample-denoising-stats.qzv
Uses the Silva 138 classifier to assign taxonomy to ASVs.
qiime feature-classifier classify-sklearn \
--i-classifier /home/silva-138-99-nb-classifier.qza \
--i-reads aligned_results/sample-rep-seqs.qza \
--o-classification aligned_results/sample-taxonomy.qza
Generates interactive taxonomic bar plots, showing microbial composition at different taxonomic levels.
qiime taxa barplot \
--i-table aligned_results/sample-table.qza \
--i-taxonomy aligned_results/sample-taxonomy.qza \
--o-visualization aligned_results/sample-taxa-bar-plots.qzv
Converts the taxonomic bar plots into an HTML report for easy visualization.
qiime tools export \
--input-path aligned_results/sample-taxa-bar-plots.qzv \
--output-path aligned_results/html/sample-taxa-bar-plots
File | Description |
---|---|
aligned_results/sample-paired-end-demux.qza |
Imported QIIME2 artifact containing demultiplexed sequences. |
aligned_results/sample-paired-end-demux.qzv |
Visualization of demultiplexed sequences (interactive). |
aligned_results/sample-table.qza |
Feature table (ASV abundance per sample). |
aligned_results/sample-rep-seqs.qza |
Representative ASV sequences. |
aligned_results/sample-denoising-stats.qza |
Denoising statistics from DADA2. |
aligned_results/sample-taxonomy.qza |
Taxonomic classifications from the Silva 138 classifier. |
aligned_results/sample-taxa-bar-plots.qzv |
Taxonomic bar plots (interactive visualization). |
aligned_results/html/sample-taxa-bar-plots/ |
Exported HTML report of taxonomic composition. |
- Docker must be installed and running on your system.
- The input directory must contain paired-end FASTQ files named as:
sample_R1.fastq.gz
(Forward reads)sample_R2.fastq.gz
(Reverse reads)
- The Silva 138 classifier file (
silva-138-99-nb-classifier.qza
) must be available in the Docker container.
sixteenS(input_dir_path = "/the/input/path")
This command runs the 16S rRNA analysis pipeline inside the repbioinfo/qiime2023:latest
Docker container, mapping the input directory to its appropriate location inside the container.
- QIIME2 (version 2023) was used for all analyses.
- DADA2 denoising was performed with a truncation length of 250 bp.
- The Silva 138 classifier was used for taxonomy assignment.
- The taxonomic composition was visualized using QIIME2 bar plots.
- Results were exported as HTML reports for easy visualization.
The atacSeq
function is an R-based frontend for executing the ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) analysis pipeline. This function streamlines the execution of the backend pipeline by managing Docker execution directly from R, ensuring reproducibility and ease of use.
The ATAC-seq analysis pipeline performs the following steps:
- Quality Control: Runs FastQC on raw FASTQ files to assess sequencing quality.
- Genome Indexing: If an index is not found, it generates a Bowtie2 index from the reference genome.
- Read Alignment: Aligns sequencing reads to the reference genome using Bowtie2, allowing soft clipping.
- BAM Processing:
- Sorts and indexes BAM files using Samtools.
- Removes mitochondrial (Mt) and plastid (Pt) reads to filter out non-nuclear DNA.
- Peak Calling: Identifies regions of open chromatin using MACS2 (either paired-end or single-end).
- BigWig Conversion (Optional): Converts BedGraph peak intensity files to BigWig format for visualization in genome browsers.
atacSeq(input_dir_path = "/path/to/fastq_files",
genome_dir_path = "/path/to/fasta_files",
nThreads = 8)
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to the directory containing the FASTQ files to be analyzed. |
genome_dir_path |
character | Path to the directory containing the FASTA reference genome. The directory must contain a .fa or .fasta file. |
nThreads |
integer (default = 8) | Number of CPU cores for parallelization. |
For each input FASTQ file, FastQC generates quality control reports, which are saved in the results/Quality_ATAC/
directory.
If an indexed genome is not found in /genomes/index/
, the pipeline creates one using Bowtie2 (bowtie2-build
).
Indexing is skipped if an existing Bowtie2 index (genome_index.1.bt2
) is found.
bowtie2 --local --threads 8 -x /genomes/index/genome_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz | samtools view -bS - | samtools sort -o sample.sorted.bam
bowtie2 --local --threads 8 -x /genomes/index/genome_index -U sample.fastq.gz | samtools view -bS - | samtools sort -o sample.sorted.bam
The resulting BAM file (sample.sorted.bam
) is indexed:
samtools index sample.sorted.bam
Reads mapped to mitochondrial (Mt) or plastid (Pt) genomes are removed:
samtools idxstats sample.sorted.bam | cut -f1 | grep -v Mt | grep -v Pt | xargs samtools view -b sample.sorted.bam > sample.sorted.noorg.bam
The filtered nuclear BAM file (sample.sorted.noorg.bam
) is indexed:
samtools index sample.sorted.noorg.bam
macs2 callpeak -t sample.sorted.noorg.bam -q 0.05 --broad -f BAMPE -n sample -B --trackline --outdir results/peaks/
macs2 callpeak -t sample.sorted.noorg.bam -q 0.05 --broad -f BAM -n sample -B --trackline --outdir results/peaks/
Output includes:
- NarrowPeak/BroadPeak files (
sample_peaks.broadPeak
) - BedGraph signal intensity files (
sample_treat_pileup.bdg
)
Converts BedGraph to BigWig format for visualization in genome browsers (UCSC, IGV).
bedGraphToBigWig sample_treat_pileup.bdg genome_sizes.txt sample_treat_pileup.bw
The BigWig file (sample_treat_pileup.bw
) is saved in results/peaks/
.
File | Description |
---|---|
results/Quality_ATAC/*.html |
FastQC reports for quality control. |
results/*.sorted.bam |
Sorted BAM file (aligned reads). |
results/*.sorted.bam.bai |
Index file for BAM files. |
results/*.sorted.noorg.bam |
Filtered BAM file (removes Mt & Pt reads). |
results/*.sorted.noorg.bam.bai |
Index file for filtered BAM. |
results/peaks/*.broadPeak |
Peak calling results from MACS2. |
results/peaks/*.treat_pileup.bdg |
BedGraph file for signal intensity. |
results/peaks/*.treat_pileup.bw |
BigWig file for genome browser visualization. |
- Docker must be installed and running on your system.
- The input directory must contain FASTQ files for ATAC-seq.
- The genome directory must contain a FASTA file and, if possible, a pre-built Bowtie2 index.
atacSeq(input_dir_path = "/the/input/path",
genome_dir_path = "/the/genome/path",
nThreads = 12)
This command runs the ATAC-seq analysis pipeline inside the repbioinfo/atacseq:latest
Docker container, mapping the input and genome directories to their respective locations inside the container.
- The Bowtie2 aligner was used in local mode to allow soft clipping.
- Paired-end and single-end read alignment are both supported.
- MACS2 peak calling was performed using a q-value cutoff of 0.05 and the
--broad
flag. - BigWig conversion was performed for visualization using
bedGraphToBigWig
.
The bulk RNA-Seq analysis pipeline provides a fully automated workflow for processing RNA-Seq data, performing alignment, quantification, statistical analysis, and visualization. These R-based frontend functions manage execution inside a Docker container, ensuring reproducibility.
The bulk RNA-Seq analysis pipeline consists of the following major steps:
-
Genome Indexing & Read Alignment
- Uses STAR to align raw RNA-Seq reads to the reference genome.
- Generates sorted BAM files and a gene count matrix.
-
Principal Component Analysis (PCA)
- Conducts unsupervised clustering to visualize sample distribution.
-
Differential Expression Analysis (DESeq2)
- Identifies significantly differentially expressed genes (DEGs).
- Filters genes based on adjusted p-value and log2 fold-change.
- Generates DEG tables and Venn diagrams.
-
Heatmap Generation
- Visualizes the expression patterns of significant genes.
-
Complete Downstream Analysis (CDSA)
- Runs all post-alignment steps in one command.
The index_align
function performs alignment using STAR and generates a gene count matrix.
index_align(input_dir_path = "/path/to/fastq_files",
genome_dir_path = "/path/to/genome_files")
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to the directory containing FASTQ files. |
genome_dir_path |
character | Path to the directory containing the FASTA and GTF files. |
- Indexing: If no STAR index is found, it creates one (stored in
/genome/star_index/
). - Read Trimming: Adapters are removed using Cutadapt.
- Read Alignment: Aligns reads to the genome using STAR and produces sorted BAM files.
- Gene Quantification: Generates a gene count matrix (
gene_count_matrix.csv
). - Metadata Creation: Stores sample group information in
Covariatesstat.csv
.
File | Description |
---|---|
gene_count_matrix.csv |
Count matrix of gene expression levels. |
Covariatesstat.csv |
Metadata with sample group labels. |
sorted.bam |
Aligned RNA-Seq reads. |
The pca
function performs PCA to assess sample clustering.
pca(input_dir_path = "/path/to/results",
countmatrix_name = "gene_count_matrix.csv",
metadata_name = "Covariatesstat.csv")
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to directory containing RNA-Seq results. |
countmatrix_name |
character | Name of the count matrix file. |
metadata_name |
character | Name of the metadata file. |
- PCA Calculation: Transforms gene expression data into principal components.
- Visualization: Generates PCA plots to identify sample clustering.
File | Description |
---|---|
gene_count_matrix_pca_plot.png |
PCA plot of RNA-Seq samples. |
The deseq2
function identifies differentially expressed genes (DEGs).
deseq2(input_dir_path = "/path/to/results",
countmatrix_name = "gene_count_matrix.csv",
metadata_name = "Covariatesstat.csv",
reference_group = "wt",
organism = "Drosophilamelanogaster")
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to directory containing RNA-Seq results. |
countmatrix_name |
character | Name of the count matrix file. |
metadata_name |
character | Name of the metadata file. |
reference_group |
character | Name of the control group (e.g., "wt"). |
organism |
character | Organism: "Homo sapiens", "Mus musculus", or "Drosophila melanogaster". |
- Differential Expression Analysis: Uses DESeq2 to compute DEGs.
- Filtering: Applies thresholds of p-adjusted < 0.01 and log2 fold-change > 2.
- Venn Diagram Creation: Compares DEGs across conditions.
File | Description |
---|---|
DEG_<group>_vs_<reference_group>.csv |
DEG table for each comparison. |
filtered_count_matrix.csv |
Filtered gene expression matrix. |
venn_diagram.png |
Venn diagram of significant genes. |
The heatmap
function visualizes gene expression patterns.
heatmap(input_dir_path = "/path/to/results",
countmatrix_name = "filtered_count_matrix.csv",
metadata_name = "Covariatesstat.csv")
Parameter | Type | Description |
---|---|---|
input_dir_path |
character | Path to directory containing RNA-Seq results. |
countmatrix_name |
character | Name of the filtered count matrix file. |
metadata_name |
character | Name of the metadata file. |
- Normalization: Normalizes expression levels across samples.
- Heatmap Generation: Uses pheatmap to cluster and visualize significant genes.
File | Description |
---|---|
heatmap_filtered.png |
Heatmap of differentially expressed genes. |
The cdsa
function automates all post-alignment analyses.
cdsa(input_dir_path = "/path/to/results",
countmatrix_name = "gene_count_matrix.csv",
metadata_name = "Covariatesstat.csv",
reference_group = "wt",
organism = "Drosophilamelanogaster")
- Runs PCA to visualize sample clustering.
- Performs DESeq2 Analysis to detect differentially expressed genes.
- Generates Heatmap for significant genes.
All PCA, DEG, and heatmap results from previous steps.
- STAR (v2.7.10a) was used for genome indexing and read alignment.
- DESeq2 (v1.34.0) performed differential expression analysis.
- PCA was based on log-transformed gene expression values.
- DEGs were filtered at adjusted p-value < 0.01 and log2 fold-change > 2.
- Heatmaps were generated using pheatmap with row normalization.
The detectSeq
function is an R-based frontend for executing Detect-seq, a bioinformatics pipeline designed for the unbiased, genome-wide assessment of off-target effects associated with cytosine base editors (CBEs). The function simplifies execution by managing Docker-based execution directly from R, ensuring reproducibility and ease of use.
The Detect-seq pipeline consists of the following key steps:
-
Genome Indexing (if not already available)
- Builds BWA index for realignment.
- Builds HISAT3N index with C-to-T conversion handling.
-
Read Processing
- Trims sequencing adapters (Cutadapt).
- Aligns reads to the reference genome (HISAT3N).
- Extracts low-quality mapped reads for re-alignment with BWA MEM.
-
Alignment Merging & Deduplication
- Merges HISAT3N and BWA alignments.
- Sorts and removes duplicate reads (Picard).
-
Mutation Detection
- Converts BAM files to PMAT format, extracting C-to-T conversions.
-
Filtering & Visualization
- Filters detected mutations based on user-defined threshold.
- Generates BED and WIG files for downstream analysis.
detectSeq(genome_dir_path = "/path/to/genome_files",
output_dir_path = "/path/to/output_dir",
fastq_dir_path = "/path/to/fastq_files",
threshold = 3)
Parameter | Type | Description |
---|---|---|
genome_dir_path |
character | Path to directory containing FASTA genome file and index (if available). |
output_dir_path |
character | Path to directory where output files will be saved. |
fastq_dir_path |
character | Path to directory containing FASTQ files. |
threshold |
integer | Minimum read count required to retain detected mutations. |
If BWA and HISAT3N indexes do not exist, the script automatically creates them.
samtools faidx genome.fa
bwa index genome.fa
hisat-3n-build --base-change C,T genome.fa genome_index
Trims adapter sequences and low-quality reads.
cutadapt -j 0 --times 1 -e 0.1 -O 3 --quality-cutoff 25 -m 55 \
-a AGATCGGAAGAGCACACGT -A AGATCGGAAGAGCGTCGTG \
-o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
raw_R1.fastq.gz raw_R2.fastq.gz
Maps reads to CBE-modified genome using HISAT3N.
hisat-3n -x genome_index -1 trimmed_R1.fastq.gz -2 trimmed_R2.fastq.gz \
-p 20 --sensitive --base-change C,T --unique-only --repeat-limit 1000 \
--no-spliced-alignment -X 700 | samtools view -hb > hisat3n.bam
Selects reads with mapping quality ≤ 20 for re-alignment.
samtools view -h hisat3n.bam | awk '$1~"@" || $5 <= 20' | samtools view -hb > low_quality.bam
Re-aligns low-quality reads using BWA MEM.
bwa mem genome.fa unmapped_R1.fastq.gz unmapped_R2.fastq.gz \
-t 20 -M | samtools view -h -b -q 20 -f 3 -F 256 > bwa_realign.bam
Merges HISAT3N and BWA alignments, then removes duplicate reads.
samtools cat -o merged.bam hisat3n.bam bwa_realign.bam
samtools sort -O BAM -o sorted.bam merged.bam
java -jar picard.jar MarkDuplicates I=sorted.bam O=dedup.bam REMOVE_DUPLICATES=true
samtools index dedup.bam
Converts BAM files to PMAT format, extracting C-to-T conversions.
python bam2pmat.py -i dedup.bam -r genome.fa -o output.pmat -p 20 --mut_type ALL
Filters detected mutations based on user-defined threshold and generates BED and WIG files.
awk -v threshold=3 '$9=="CT" && $13>=threshold' output.pmat > filtered.pmat
awk '{print "variableStep chrom="$1" span=1\n"$2" "$9}' filtered.pmat > filtered.wig
File | Description |
---|---|
dedup.bam |
Final processed BAM file (duplicates removed). |
dedup.bam.bai |
Index file for BAM file. |
output.pmat |
Raw mutation matrix with detected C-to-T conversions. |
filtered.pmat |
Filtered mutation matrix, retaining mutations above threshold. |
filtered.bed |
BED file for detected mutations. |
filtered.wig |
WIG file for genome browser visualization. |
- Docker must be installed and running on your system.
- The input directory must contain:
- FASTQ files (
*_R1.fastq.gz
,*_R2.fastq.gz
). - Reference genome (
.fa
or.fasta
).
- FASTQ files (
detectSeq(genome_dir_path = "/the/genome/dir",
output_dir_path = "/the/output/dir",
fastq_dir_path = "/the/fastq/dir",
threshold = 3)
This command runs Detect-seq inside the repbioinfo/detectseq:latest Docker container, mapping the genome, output, and FASTQ directories to their appropriate locations inside the container.
- HISAT3N (v3.0) was used for primary alignment with C-to-T correction.
- BWA MEM (v0.7.17) was used for low-quality read re-alignment.
- Picard (v2.27.4) removed duplicate reads.
- PMAT files were generated using Detect-seq's
bam2pmat.py
script. - Mutations were filtered at a threshold of ≥ 3 reads.
The HTGTS
function is an R-based frontend for executing the HTGTS bioinformatics pipeline, designed to identify and analyze genome-wide translocation events. The function automates the pre-processing, alignment, and translocation detection steps using Docker-based execution, ensuring reproducibility and ease of use.
The HTGTS pipeline consists of the following key steps:
-
Library Information Generation
- Extracts sequencing library details from an XML file.
- Generates library configuration files required for downstream analysis.
-
Read Processing & Demultiplexing
- Identifies specific sequences in FASTQ files.
- Separates reads based on predefined primer sequences.
-
Alignment & Translocation Mapping
- Aligns reads to the reference genome using TLPpipeline.
- Identifies potential genome translocations.
-
BED File Generation
- Generates BED files for downstream analysis and visualization.
HTGTS(xml_file_path = "/path/to/HTGTS.xml",
configType = "HTGTS_mouse",
data_folder = "/path/to/data",
fastq1 = "sample_R1.fastq.gz",
fastq2 = "sample_R2.fastq.gz",
expInfo =
10000
span> "libseqInfo.txt",
expInfo2 = "libseqInfo2.txt",
outDir = "/path/to/output",
assembly = "mm9")
Parameter | Type | Description |
---|---|---|
xml_file_path |
character | Path to the XML file containing sequencing library information. |
configType |
character | Configuration type (e.g., "HTGTS_mouse", "HTGTS_human"). |
data_folder |
character | Path to the directory containing input FASTQ files. |
fastq1 |
character | Name of the first FASTQ file (R1). |
fastq2 |
character (optional) | Name of the second FASTQ file (R2, if paired-end). |
expInfo |
character | Name of the library information file (libseqInfo.txt ). |
expInfo2 |
character | Name of the additional library information file (libseqInfo2.txt ). |
outDir |
character | Directory where output files will be stored. |
assembly |
character | Reference genome assembly ("mm9", "mm10", "hg38", "custom"). |
Extracts sequencing library metadata from an XML (Excel) file and generates libseqInfo.txt
and libseqInfo2.txt
, which store:
- Restriction enzyme site
- Primer sequences
- Breaksite genomic coordinates
docker run -v "/data:/Data" repbioinfo/htgts_pipeline_lts_v16 \
python3 /Algorithm/sample_sheetTolibInfo.py \
/Data/HTGTS.xlsx /Data/libseqInfo.txt /Data/libseqInfo2.txt HTGTS_mouse
Identifies specific sequences in FASTQ files and separates reads into correctly formatted sequencing groups.
GEAT0.2.jar demultiplex -i sample_R1.fastq.gz -o demux_R1.fastq.gz
Aligns reads using TLPpipeline and identifies potential translocation events in the genome.
TLPpipeline.pl -i demux_R1.fastq.gz -o alignment_output/ -genome mm9
Creates BED files for visualization of detected translocations.
bedtools makewindows -b alignment_output/*.bam -w 1000 > translocations.bed
File | Description |
---|---|
libseqInfo.txt |
Library metadata (restriction sites, primers, genomic coordinates). |
libseqInfo2.txt |
Additional library metadata. |
alignment_output/*.bam |
Aligned sequencing reads. |
translocations.bed |
BED file containing translocation sites. |
- Docker must be installed and running.
- The input directory must contain:
- FASTQ files (
*_R1.fastq.gz
,*_R2.fastq.gz
). - An XML metadata file (
HTGTS.xlsx
).
- FASTQ files (
HTGTS(xml_file_path = "/the/path/HTGTS.xlsx",
configType = "HTGTS_mouse",
data_folder = "/the/input/data",
fastq1 = "sample_R1.fastq.gz",
fastq2 = "sample_R2.fastq.gz",
expInfo = "libseqInfo.txt",
expInfo2 = "libseqInfo2.txt",
outDir = "/the/output/dir",
assembly = "mm9")
This command runs HTGTS inside the repbioinfo/htgts_pipeline_lts_v16 Docker container, mapping the input/output directories and executing the full pipeline.
- GEAT0.2 was used for FASTQ demultiplexing.
- TLPpipeline (v1.6) performed read alignment and translocation detection.
- BEDTools (v2.30.0) was used for translocation visualization.
The sci_fromfastq
function is an R-based frontend for executing the SCI-RNA-seq bioinformatics pipeline, which converts raw FASTQ files into a gene expression matrix, including UMI counting, quality control, and visualization. The function automates data processing using Docker-based execution, ensuring reproducibility and ease of use.
The SCI-RNA-seq pipeline consists of the following key steps:
-
FASTQ Generation & Preprocessing
- Extracts FASTQ files from raw sequencing data.
- Adjusts read names to incorporate barcode and UMI information.
-
Poly-A Trimming
- Removes poly-A tails from reads to improve alignment quality.
-
Read Alignment
- Aligns reads to a reference genome using STAR.
- Outputs BAM files and generates alignment statistics.
-
Filtering & Sorting BAM Files
- Removes rRNA reads and ambiguously mapped reads.
- Assigns reads to genes using BED files.
-
UMI Counting & Deduplication
- Computes UMI counts per cell.
- Filters cells below the UMI threshold.
-
Quality Control & Visualization
- Generates knee plots for UMI distributions.
- Produces a final UMI count matrix for downstream analysis.
sci_fromfastq(group = "docker",
folder = "/path/to/working_directory",
sample.name = "Experiment_01",
UMI.cutoff = 500)
Parameter | Type | Description |
---|---|---|
group |
character | "docker" or "sudo" , depending on user permissions. |
folder |
character | Path to working directory containing input files. |
sample.name |
character | Name of the experiment/sample being processed. |
UMI.cutoff |
integer | Minimum UMI count per cell for inclusion. |
Extracts FASTQ files from raw BCL sequencing data (if applicable) and embeds Read 1 (barcode and UMI) information into Read 2 names for proper tracking.
bcl2fastq --runfolder-dir /path/to/sequencing_data --output-dir fastq_output
Removes poly-A sequences using TrimGalore to improve alignment efficiency.
trim_galore --paired --output_dir trimmed-fastq sample_R1.fastq.gz sample_R2.fastq.gz
Aligns reads to the reference genome using STAR, allowing for spliced-mapping of reads.
STAR --genomeDir /path/to/STAR_index \
--readFilesIn trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
--runThreadN 8 --outSAMtype BAM SortedByCoordinate
Removes rRNA reads and ambiguously mapped reads, then sorts BAM files for downstream gene assignment.
samtools view -hb -q 30 aligned.bam | samtools sort -o filtered_sorted.bam
Assigns reads to genes using a gene annotation BED file and computes UMI counts per sample, removing duplicate UMIs.
bedtools intersect -a filtered_sorted.bam -b gene_annotations.bed > gene_assigned.bam
umi_tools count --per-gene --gene-tag=XT --stdin=gene_assigned.bam --stdout=UMI_counts.txt
Generates knee plots to visualize the UMI distribution per cell.
Rscript knee-plot.R UMI_counts.txt knee_plot_output.png
Outputs a final UMI count matrix for downstream analysis.
gzip -c UMI_counts.txt > final-output/UMI.count.matrix.gz
File | Description |
---|---|
alignment_report_complete.txt |
Alignment statistics report. |
rRNA_report_complete.txt |
rRNA contamination statistics. |
UMI_report_complete.txt |
UMI counts per sample. |
final-output/rRNA.and.dup.rate.stats |
Final rRNA and duplication rate statistics. |
final-output/knee-plots/*.png |
Knee plots visualizing UMI distributions. |
prelim.UMI.count.rollup.gz |
Preliminary UMI count matrix. |
final-output/cell.annotations |
Filtered cell annotations. |
- Docker must be installed and running.
- The input directory must contain:
- FASTQ files (
*_R1.fastq.gz
,*_R2.fastq.gz
). - A reference genome (
.fa
,.gtf
).
- FASTQ files (
sci_fromfastq(group = "docker",
folder = "/the/input/data",
sample.name = "Experiment_01",
UMI.cutoff = 500)
This command runs SCI-RNA-seq inside the repbioinfo/sci_tomatrix_genome Docker container, processing raw sequencing data into a gene expression matrix.
- TrimGalore (v0.6.10) was used for poly-A trimming.
- STAR (v2.7.10a) performed read alignment to the reference genome.
- SAMtools (v1.7) was used for BAM file processing.
- UMI-tools (v1.0.0) was used for UMI counting and deduplication.
- BEDTools (v2.27.1) assigned reads to genes based on annotation BED files.
- Knee plots were generated using R ggplot2.
The singlecell_*
functions provide an R-based frontend for executing the Single Cell RNA-seq pipeline, which processes raw single-cell sequencing data into a gene expression matrix and performs downstream analysis. The functions automate data processing using Docker-based execution, ensuring reproducibility and ease of use.
-
Alignment & Indexing
- Uses Cell Ranger to align single-cell reads to a reference genome.
- Generates a gene-cell count matrix.
-
Filtering Mitochondrial & Ribosomal Reads
- Identifies and removes low-quality cells based on rRNA and mitochondrial gene expression.
-
Clustering & Stability Analysis
- Uses Seurat to cluster cells and assess cluster stability via bootstrapping.
-
Feature Selection (Differential Expression Analysis)
- Identifies differentially expressed genes (DEGs) across clusters using ANOVA, MAST, and edgeR.
-
Enrichment Analysis
- Performs KEGG/GO pathway enrichment on differentially expressed genes.
Aligns reads and indexes the genome for single-cell RNA-seq experiments.
singlecell_alignIndex(input_dir_path = "/path/to/fastq_files",
genome_dir_path = "/path/to/genome_files",
bamsave = TRUE)
File | Description |
---|---|
filtered_feature_bc_matrix/*.mtx |
Filtered gene expression matrix. |
filtered_feature_bc_matrix/*.barcodes.tsv.gz |
Modified barcodes including sample names. |
filtered_feature_bc_matrix/*.features.tsv.gz |
Gene feature information. |
Filters low-quality cells based on rRNA and mitochondrial gene expression.
singlecell_mitoRiboFilter(input_file_path = "/path/to/matrix.mtx",
mito_min = 0, mito_max = 10,
ribo_min = 0, ribo_max = 10,
genes_file = "genes.tsv",
barcodes_file = "barcodes.tsv")
File | Description |
---|---|
filtered_matrix.mtx |
Filtered gene expression matrix. |
filtered_features.tsv |
Filtered gene list. |
filtered_barcodes.tsv |
Filtered barcode list. |
ribosomal_vs_mitochondrial_plot.png |
Quality control plot. |
Clusters cells and assesses stability using Seurat and bootstrapping.
singlecell_clustering(input_file_path = "/path/to/matrix.mtx",
bootstrap_percentage = 0.1,
stability_threshold = 0.8,
permutations = 10,
genes_file = "genes.tsv",
barcodes_file = "barcodes.tsv",
resolution = 0.8)
File | Description |
---|---|
clustering_stability.output.csv |
Cluster assignments and stability scores. |
umap_clusters.png |
UMAP plot of clusters. |
umap_stability.png |
UMAP plot of cluster stability. |
Identifies differentially expressed genes (DEGs) across clusters.
singlecell_featureSelection(input_file_path = "/path/to/matrix.mtx",
clustering_file = "clustering_stability.output.csv",
threshold = 0.8, log2fc = 1, pvalue = 0.05,
genes_file = "genes.tsv",
barcodes_file = "barcodes.tsv",
heatmap = TRUE)
File | Description |
---|---|
anova_DE_results.csv |
DEG results for all clusters. |
volcano_plot.png |
Volcano plot of DEGs. |
heatmap.png |
Heatmap of DEGs. |
Identifies overrepresented pathways among DEGs.
enrichmentAnalysis(input_file_path = "/path/to/anova_DE_results.csv",
species = "dmelanogaster",
source = "KEGG",
separator = ",",
max_terms = 20)
File | Description |
---|---|
enrichment_plot.pdf |
Bar plot of enriched pathways. |
- Docker must be installed and running.
- The input directory must contain:
- FASTQ files (
*_R1.fastq.gz
,*_R2.fastq.gz
). - Reference genome (
.fa
,.gtf
). - Gene expression matrix (
.mtx
,.csv
).
- FASTQ files (
singlecell_alignIndex("/data/fastq", "/data/genome", TRUE)
singlecell_mitoRiboFilter("/data/matrix.mtx", mito_min = 0, mito_max = 10, ribo_min = 0, ribo_max = 10)
singlecell_clustering("/data/filtered_matrix.mtx", 0.1, 0.8, 10, genes_file = "genes.tsv", barcodes_file = "barcodes.tsv", resolution = 0.8)
singlecell_featureSelection("/data/matrix.mtx", "clustering_stability.output.csv", 0.8, 1, 0.05)
enrichmentAnalysis("/data/anova_DE_results.csv", "dmelanogaster", "KEGG", ",", 20)