GitHub - qchengray/EAP: An integrated R toolkit for epitranscriptome data analysis in plants

EAP: an integrated R toolkit that aims to facilitate the epitranscriptome data analysis in plants


EAP is an integrated R toolkit that aims to facilitate epitranscriptome data analysis in plants. The toolkit provides a comprehensive set of functions for read mapping, calling CMRs (chemical modifications of RNA) from epitranscriptome sequencing data, CMR prediction at the transcriptome scale, and CMR annotation (location distribution analysis, motif scanning and discovery, and gene functional enrichment analysis).

Version and download

  • Version 1.0 - First version released on September 21, 2017

Dependency

R environment

Global software environment

Python environment

Dependency installation

## Install R Dependency
dependency.packages <- c("randomForest", "seqinr", "stringr", "snowfall",
                         "ggplot2", "bigmemory", "Rsamtools", "motifRG",
                         "devtools")
install.packages(dependency.packages)
## Note: Rsamtools and motifRG are Bioconductor packages and may need to be installed through Bioconductor instead
## Install Tophat/Tophat2
sudo apt-get update
sudo apt-get install tophat   # or: sudo apt-get install tophat2
## Install Bowtie/Bowtie2
sudo apt-get update
sudo apt-get install bowtie   # or: sudo apt-get install bowtie2
## Install Hisat/Hisat2
sudo apt-get update
sudo apt-get install hisat    # or: sudo apt-get install hisat2
## Install MACS2
pip install macs2

Installation

install.packages("Download path/EAP_1.0.tar.gz", repos = NULL, type = "source")
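
If installation fails, a missing dependency is a common cause. The following is a minimal sketch, using only base R, that reports which of the R packages listed under "Dependency installation" are not yet available. The helper name `missingPackages` is hypothetical and not part of EAP.

```r
# Hypothetical helper (not part of EAP): report which packages are not installed.
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

dependency.packages <- c("randomForest", "seqinr", "stringr", "snowfall",
                         "ggplot2", "bigmemory", "Rsamtools", "motifRG",
                         "devtools")
# Prints the names of any packages that still need to be installed.
print(missingPackages(dependency.packages))
```

An empty result suggests that all R dependencies can be loaded.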

Contents

CMR calling

  • Arabidopsis thaliana m6A sequencing datasets
  • Read mapping
  • CMR calling from read-alignment files

Transcriptome-level CMR prediction

  • Arabidopsis m6A benchmark dataset construction
  • Sample vectorization with three feature encoding schemes
  • m6A predictor construction using ML-based PSOL algorithm
  • Performance evaluation using the training dataset
  • Comparison with other m6A predictors using the independent testing dataset

CMR annotation

  • CMR location distribution
  • Motif scanning and discovery
  • Functional enrichment analysis of CMR corresponded genes
  • User manual

Quick start

For more details, please see the user manual.

1.CMR calling

  • 1.1 Arabidopsis thaliana m6A sequencing datasets
## Download and convert example data
# Assuming sratoolkit has been downloaded to /home/malab14/, the following command converts the SRA file to FASTQ format
/home/malab14/sratoolkit/bin/fastq-dump --split-3 SRR1508369.sra
# Optionally, download the FastQC toolkit to perform quality control on the FASTQ files
# Download FastQC
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
# Decompress
unzip fastqc_v0.11.5.zip
# Run FastQC
./FastQC/fastqc SRR1508369.fastq
  • 1.2 Read mapping
input.fq <- "/home/malab14/input.fastq"  
RIP.fq <- "/home/malab14/RIP.fastq"  
referenceGenome <- "/home/malab14/tair10.fa"  
GTF <- "/home/malab14/Arabidopsis.gtf"  
input.bam <- readsMapping(alignment = "tophat", fq = input.fq,
                          refGenome = referenceGenome, paired = FALSE,
                          bowtie1 = NULL, p = 5, G = GTF)
RIP.bam <- readsMapping(alignment = "tophat", fq = RIP.fq,
                        refGenome = referenceGenome, paired = FALSE,
                        bowtie1 = NULL, p = 5, G = GTF)
  • 1.3 CMR calling from read-alignment files
################m6A peak calling through "SlidingWindow"##################
cmrMat <- CMRCalling(CMR = "m6A", method = "SlidingWindow",
                     IPBAM = RIP.bam, inputBAM = input.bam)
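
To give a feel for what a sliding-window peak caller does, the sketch below compares IP and input read counts window by window with Fisher's exact test and adjusts the p-values for multiple testing. This is a generic illustration on made-up counts, not the `CMRCalling` implementation. The function name `windowEnrichment`, its arguments, and the toy numbers are all assumptions.

```r
# Toy illustration only: per-window enrichment of IP over input reads.
# Not EAP's algorithm; names and numbers are hypothetical.
windowEnrichment <- function(ipCounts, inputCounts) {
  ipTotal <- sum(ipCounts)
  inputTotal <- sum(inputCounts)
  pvals <- mapply(function(ip, inp) {
    # Rows: inside / outside the window; columns: IP / input.
    tab <- matrix(c(ip, ipTotal - ip, inp, inputTotal - inp), nrow = 2)
    fisher.test(tab, alternative = "greater")$p.value
  }, ipCounts, inputCounts)
  data.frame(window = seq_along(ipCounts),
             pvalue = pvals,
             fdr = p.adjust(pvals, method = "BH"))
}

ip    <- c(10, 12, 95, 11, 9)   # reads per window in the IP sample
input <- c(11, 10, 12, 13, 10)  # reads per window in the input sample
res <- windowEnrichment(ip, input)
print(res[res$fdr < 0.05, ])    # windows called as enriched
```

In this toy data only the third window carries a clear excess of IP reads, so it is the only one expected to pass the FDR cutoff.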

2.Transcriptome-level CMR prediction

  • 2.1 Arabidopsis m6A benchmark dataset construction
cDNA <- "/home/malab14/tair10_cDNA.fa"  
GTF <- "/home/malab14/Arabidopsis.gtf"  
###Convert genomic position to cDNA position  
peaks <- G2T(bedPos = cmrMat, GTF = GTF)
###Search consensus motif in cDNA sequence  
motifPos <- searchMotifPos(RNAseq = cDNA)  
posSamples <- findConfidentPosSamples(peaks = peaks,  
                                      motifPos = motifPos)  
unlabelSamples <- findUnlabelSamples(cDNAID = posSamples$cDNAID,   
                                     motifPos = motifPos,   
                                     posSamples = posSamples$positives)  
  • 2.2 Sample vectorization with three feature encoding schemes
#########################Extract sequence#################################  
posSeq <- extractSeq(RNAseq = cDNA, sampleMat = posSamples, seqLen = 101)
unlabelSeq <- extractSeq(RNAseq = cDNA, sampleMat = unlabelSamples,
                         seqLen = 101)
#########################Feature encoding#################################
posFeatureMat <- featureEncoding(posSeq)
unlabelFeatureMat <- featureEncoding(unlabelSeq)
featureMat <- rbind(posFeatureMat, unlabelFeatureMat)
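
The three encoding schemes are not named here, so as a hedged illustration the sketch below shows one common choice: binary (one-hot) encoding of a fixed-length window centred on a candidate site. The helper names `extractWindow` and `oneHotEncode` are hypothetical and not EAP functions, and a short window is used instead of `seqLen = 101` to keep the output readable.

```r
# Hypothetical helpers (not EAP functions): window extraction and one-hot encoding.
extractWindow <- function(sequence, center, seqLen) {
  # seqLen is assumed to be odd so the site sits exactly in the middle.
  flank <- (seqLen - 1) / 2
  start <- center - flank
  end <- center + flank
  if (start < 1 || end > nchar(sequence)) return(NA_character_)
  substr(sequence, start, end)
}

oneHotEncode <- function(window) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(window, "")[[1]]
  # One block of four 0/1 values per position, in A/C/G/T order.
  as.vector(vapply(chars, function(b) as.integer(bases == b), integer(4)))
}

cdna <- "TTGGACTAAGC"
win <- extractWindow(cdna, center = 5, seqLen = 5)  # window around the A at position 5
print(win)
print(oneHotEncode(win))
```

A window of length 101 would yield 404 binary features per sample under this scheme. Sites too close to a transcript end return `NA` here; how EAP treats such sites is not stated in this README.
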
  • 2.3 m6A predictor construction using ML-based PSOL algorithm
###Setting the psol directory and running the PSOL-based ML classification###
PSOLResDic <- "/home/malab14/psol/"
psolResults <- PSOL(featureMatrix = featureMat, positives = positives,
                    unlabels = unlabels, PSOLResDic = PSOLResDic, cpus = 5)
####Ten-fold cross-validation and ROC curve analysis.
cvRes <- cross_validation(featureMat = featureMat,   
                          positives = rownames(posFeatureMat),  
                          negatives = rownames(unlabelFeatureMat),  
                          cross = 10, cpus = 1)
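
As a hedged illustration of what ten-fold cross-validation and ROC analysis involve, the sketch below assigns samples to folds and computes the area under the ROC curve (AUC) from prediction scores using the rank-sum identity. It is generic base R, not the internals of `cross_validation`. The helpers `assignFolds` and `aucFromScores` and the toy scores are assumptions.

```r
# Generic sketch, not EAP internals; helper names and data are hypothetical.
assignFolds <- function(n, k = 10) {
  # Shuffle fold labels so each fold has (nearly) the same number of samples.
  sample(rep(seq_len(k), length.out = n))
}

aucFromScores <- function(scores, labels) {
  # labels: 1 = positive, 0 = negative. AUC via the Mann-Whitney rank-sum identity.
  r <- rank(scores)
  nPos <- sum(labels == 1)
  nNeg <- sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

set.seed(1)
folds <- assignFolds(n = 25, k = 10)
print(table(folds))              # number of samples held out in each fold

scores <- c(0.9, 0.8, 0.35, 0.7, 0.2, 0.4, 0.1)  # toy prediction scores
labels <- c(1,   1,   1,    0,   0,   0,   0)    # toy true classes
print(aucFromScores(scores, labels))
```

In a full run, each fold would be held out in turn, a model trained on the remaining folds, and the held-out scores pooled or averaged before computing the AUC.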

3.CMR annotation

  • 3.1 CMR location distribution
GTF <- "/home/malab14/Arabidopsis_tair10.gtf"  
#####Extract the UTR position information from GTF file and perform CMR location distribution analysis.  
UTRMat <- getUTR(GTF = GTF)  
results <- CMRAnnotation(cmrMat = cmrMat, SNR = T, UTRMat = UTRMat)
  • 3.2 Motif scanning and discovery
RNAseq <- "/home/malab14/tair10.fa"  
testSeqs <- extractSeqs(cmrMat = cmrMat, RNAseq = RNAseq)
results <- motifScan(sequence = testSeqs, motif = "[AG][AG]AC[ACT]")
motifs <- motifDetect(sequence = testSeqs)
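
The pattern `[AG][AG]AC[ACT]` is an ordinary regular expression, so its effect can be illustrated with base R. The sketch below reports every start position of the motif in a sequence, including overlapping occurrences. `findMotifStarts` is a hypothetical helper and the sequence is made up; this is not the `motifScan` implementation.

```r
# Hypothetical helper (not an EAP function): start positions of a regex motif.
findMotifStarts <- function(sequence, motif = "[AG][AG]AC[ACT]") {
  # A zero-width lookahead lets overlapping occurrences be reported.
  hits <- gregexpr(paste0("(?=", motif, ")"), sequence, perl = TRUE)[[1]]
  starts <- as.integer(hits)
  if (length(starts) == 1 && starts[1] == -1L) integer(0) else starts
}

# GGACT starts at position 3 and AAACA at position 8 in this toy sequence.
print(findMotifStarts("TTGGACTAAACAGG"))
```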
  • 3.3 Functional enrichment analysis of CMR corresponded genes
enrichements <- runTopGO(geneID = geneID,
                         dataset = "athaliana_eg_gene",
                         topNodes = 20)
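
GO enrichment tools typically ask whether a term is over-represented among the genes of interest relative to a background. The sketch below shows that test for a single term using the hypergeometric distribution in base R. It is a generic illustration, not necessarily the statistic `runTopGO` applies. The function name `enrichmentPvalue` and all counts are hypothetical.

```r
# Hypothetical helper (not an EAP function): one-sided over-representation test.
enrichmentPvalue <- function(hitsInSet, setSize, termSize, backgroundSize) {
  # P(X >= hitsInSet) when drawing setSize genes from a background that
  # contains termSize genes annotated with the term.
  phyper(hitsInSet - 1, termSize, backgroundSize - termSize, setSize,
         lower.tail = FALSE)
}

# Toy numbers: 15 of 200 CMR genes carry a term annotated to 300 of 20000 genes.
p <- enrichmentPvalue(hitsInSet = 15, setSize = 200,
                      termSize = 300, backgroundSize = 20000)
print(p)
```

With these toy numbers about 3 annotated genes would be expected by chance, so observing 15 gives a very small p-value. In practice the p-values for all tested terms would also be corrected for multiple testing.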

Ask questions

Please use EAP/issues to ask how to use EAP and to report bugs.
