EAP is an integrated R toolkit that aims to facilitate the epitranscriptome data analysis in plants. This toolkit contains a comprehensive set of functions for read mapping, CMR (chemical modifications of RNA) calling from epitranscriptome sequencing data, CMR prediction at the transcriptome scale, and CMR annotation (location distribution analysis, motif scanning and discovery, and gene functional enrichment analysis)
- Version 1.0 -First version released on september, 21th, 2017
- R (>= 3.3.1)
- randomForst (>= 0.6)
- seqinr (>= 3.4-5)
- stringr (>= 1.2.0)
- snowfall (>= 1.84-6.1)
- ggplot2 (>= 2.2.1)
- bigmemory (>= 4.5.19)
- Rsamtools (>= 1.28.0)
- motifRG (>= 1.21.0)
- devtools (>= 1.13.3)
- Tophat/Tophat2:Reads mapping
- Bowtie/Bowtie2:Reads mapping
- Hisat/Hisat2:Reads mapping
- macs2:Peak calling
## Install R Dependency
dependency.packages <- c("randomForst", "seqinr", "stringr", "snowfall",
"ggplot2", "bigmemory", "Rsamtools", "motifRG",
"devtools")
install.pacakages(dependency.packages)
## Install Tophat/Tophat2
sudo apt-get update
sudo apt-get install tophat or sudo apt-get install tophat2
## Install Bowtie/Bowtie2
sudo apt-get update
sudo apt-get install bowtie or sudo apt-get install bowtie2
## Install Hisat/Hisat2
sudo apt-get update
sudo apt-get install hisat or sudo apt-get install hisat2
pip install macs2
install.package("Download path/EAP_1.0.tar.gz",repos = NULL, type = "source")
- Arabidopsis thaliana m6A sequencing datasets
- Read mapping
- CMR calling from read-alignment files
- Arabidopsis m6A benchmark dataset construction
- Sample vectorization with three feature encoding schemes
- m6A predictor construction using ML-based PSOL algorithm
- Performance evaluation using the training dataset
- Comparison with other m6A predictors using the independent testing dataset
- CMR location distribution
- Motif scanning and discovery
- Functional enrichment analysis of CMR corresponded genes
- User manual
More details please see user manual
- 1.1 Arabidopsis thaliana m6A sequencing datasets
## download and convert example data
# Here, supposing that ‘sratoolkit’ has been downloaded in your device in the directory: /home/malab14/, then the following command will convert the sra format to fastq format
/home/malab14/sratoolkit/bin/fastq-dump --split-3 SRR1508369.sra
#Otherwise, you can also download the fastqc toolkit to perform quality control for fastq formatted files.
#Downloading the fastqc
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
#decompressing
unzip fastqc_v0.11.5.zip
#Running fastq
./FastQC/fastqc SRR1508369.fastq
- 1.2 Read mapping
input.fq <- "/home/malab14/input.fastq"
RIP.fq <- "/home/malab14/RIP.fastq"
referenceGenome <- "/home/malab14/tair10.fa"
GTF <- "/home/malab14/Arabidopsis.gtf"
input.bam <- readsMapping(alignment = "tophat", fq = input.fq,
refGenome = referenceGenome, paired = F,
bowtie1 = NULL, p = 5, G = GTF)
RIP.bam <- readsMapping(alignment = "tophat", fq = RIP.fq,
refGenome = referenceGenome, paired = F,
bowtie1 = NULL, p = 5, G = GTF)
- 1.3 CMR calling from read-alignment files
################m6A peak calling through "SlidingWindow"##################
cmrMat <- CMRCalling(CMR = "m6A", method = "SlidingWindow",
IPBAM = RIP.bam, inputBAM = input.bam)
- 2.1 Arabidopsis m6A benchmark dataset construction
cDNA <- "/home/malab14/tair10_cDNA.fa"
GTF <- "/home/malab14/Arabidopsis.gtf"
###Convert genomic position to cDNA position
peaks <- G2T(bedPos = cmrMat, GTF = GTF)
###Search consensus motif in cDNA sequence
motifPos <- searchMotifPos(RNAseq = cDNA)
posSamples <- findConfidentPosSamples(peaks = peaks,
motifPos = motifPos)
unlabelSamples <- findUnlabelSamples(cDNAID = posSamples$cDNAID,
motifPos = motifPos,
posSamples = posSamples$positives)
- 2.2 Sample vectorization with three feature encoding schemes
#########################Extract sequence#################################
posSeq <- extractSeq(RNAseq = cDNA, sampleMat = posSamples, seqLen = 101)
unlabelSeq <- extractSeq(RNAseq = cDNA, sampleMat = unlabelSamples,
seqLen = 101)
#########################Feature encoding#################################
posFeatureMat <- featureEncoding(posSeq)
unlabelFeatureMat <- featureEncoding(unlabelSeq<
6BDE
/span>)
featureMat <- rbind(posFeatureMat, unlabelFeatureMat)
- 3.3 m6A predictor construction using ML-based PSOL algorithm
###Setting the psol directory and running the PSOL-based ML classification###
PSOLResDic <- "/home/malab14/psol/"
psolResults <- PSOL(featureMatrix = featureMat, positives = positives,
unlabels = unlabels, PSOLResDic = PSOLResDic, cpus = 5)
####Ten-fold cross-validation and ROC curve analysis.
cvRes <- cross_validation(featureMat = featureMat,
positives = rownames(posFeatureMat),
negatives = rownames(unlabelFeatureMat),
cross = 10, cpus = 1)
- 3.1 CMR location distribution
GTF <- "/home/malab14/Arabidopsis_tair10.gtf"
#####Extract the UTR position information from GTF file and perform CMR location distribution analysis.
UTRMat <- getUTR(GTF = GTF)
results <- CMRAnnotation(cmrMat = cmrMat, SNR = T, UTRMat = UTRMat)
- 3.2 Motif scanning and discovery
RNAseq <- "/home/malab14/tair10.fa"
testSeqs <- extractSeqs(cmrMat = cmrMat, RNAseq = RNAseq)
results <- motifScan(sequence = testSeqs, motif = "[AG][AG]AC[ACT]")
motifs <- motifDetect (sequence = testSeqs)
- 3.3 Functional enrichment analysis of CMR corresponded genes
enrichements <- runTopGO(geneID = geneID,
dataset = "athaliana_eg_gene",
topNodes = 20)
Please use EAP/issues for how to use DeepGS and reporting bugs.