Genome Annotation by in-depth Epigenetic and Transcriptomic profiling (GAET)

This repository contains the scripts, tools, and workflows for the comprehensive gene annotation of Tetrahymena thermophila using various types of omics data, including transcriptomic, epigenetic, and sequencing data. The pipeline integrates gene model optimization, UTR annotation, alternative splicing (AS) isoform annotation, and manual curation to produce high-quality gene models.

Overview

This pipeline is designed to optimize and annotate gene models in Tetrahymena thermophila. The process involves several steps, including RNA-seq and epigenetic data integration, UTR and regulatory element analysis, AS isoform annotation, and manual genome curation. The final outputs are refined gene models and protein function annotations.

Pipeline Workflow

A. Gene Optimization with Transcriptomic Data

Input Data:
- RNA-seq
- Nanopore DRS
- Strand-specific RNA-seq

     Data Analysis
    ~/biosoft/TrimGalore-0.6.6/trim_galore -q 20 -length 20 --fastqc -j 4  --paired $j $R2 -o 02Clean_data/ >> 02Clean_data/Trim_Galore.log 2>&1

    if [ ! -d "03Hisat2_align" ];then   mkdir -p 03Hisat2_align ; fi
    hisat2 -p 6 --dta --score-min L,5,-0.4 --sp 1,0 --mp 3,1  --summary-file 03Hisat2_align/${name}_summary.txt -x Transcriptome_Assemblly-221208/BamFile/00Hisat2_index/1-Genome_assembly_new.fasta -S 03Hisat2_align/${name}.sam  -1 02Clean_data/${name}_1_val_1.fq.gz -2 02Clean_data/${name}_2_val_2.fq.gz  2>&1 >>03Hisat2_align/error.ht2.log
    java -jar /home/software/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar  SortSam -I 03Hisat2_align/${name}.sam -O 03Hisat2_align/${name}_sorted.bam -SORT_ORDER coordinate > 03Hisat2_align/${name}_sort.log 2>&1 
    java -jar /home/software/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar  MarkDuplicates -I 03Hisat2_align/${name}_sorted.bam -O 03Hisat2_align/${name}_rmdup.bam -REMOVE_DUPLICATES true -M 03Hisat2_align/${name}_marked_dup_metrics.txt > 03Hisat2_align/${name}_rmdup.log 2>&1
     rm 03Hisat2_align/${name}_sorted.bam
     rm 03Hisat2_align/${name}.sam
     perl 01Check_gff-file_format.pl
     perl 02Check_ATG2TGA.pl
     perl 2Merge_orf1_allORF_to_UTR-gff.pl
     perl 2Merge_orf1_to_UTR-gff.pl
     perl 03Remove_error_gene.pl
     perl 04_2Check_CDS_overlap.pl
     perl 04_3Remove_Error_CDS-overlap-gene.pl
     perl 04Check_gene_overlap.pl
     perl 4Filter_main_Nanogff.pl
     perl 05_1Remove_intron2long.pl
     perl 05Find_new_Gap-gene_RNAseq.pl
     perl 5Sort_gff.pl
     perl 6Remove_AS_transcript.pl
     perl 6Sort_gff.pl
     perl 7Merge_4mainGFF.pl
     perl 7Sort_gff.pl
     perl 8Find_new_Gap-gene_RNAseq.pl
     perl 9Merge_4GapNewGene2GFF.pl
     perl 9SeleceGFF_fromGTF.pl
     perl 9Sort_gff.pl
     perl 10Merge_7-9_sortdGFF.pl
     perl Change_OriError.pl
     perl Check_gene_overlap.pl
     perl Remove_LongGapExon_gene.pl

Process:
- Assemble transcriptome from RNA-seq data.
- Compare intron-exon junctions across data sources.
- Perform iterative gene optimization leading to Draft Gene Model v1 and v2.

B. Gene Optimization with Epigenetic Data

Input Data:
- ATAC-seq
- Epigenetic markers (H3K4me3, H2A.Z, 6mA)
Process:
- Predict transcription start sites (TSS) using Random Forest models.

#Cluster mark 

library("fpc")  #eps: nucleosome + linker = 200，或 half of linker + half of nucleosome = 100

#H2A.Z
data<-read.delim("H2A.Z_group.gff3",sep="\t", header = F)
#model2 <- dbscan(data, eps=147, MinPts=5)
for (i in unique(data$V1)){  a<-subset(data,V1==i); a1<-cbind(a$V4,V2=1); colnames(a1)<-c("V1","V2"); model3 <- dbscan(a1, eps=200, MinPts=3); a1<-as.matrix(a1); a2<-cbind(model3$cluster,i,a1[,1]); write.csv(a2,file=paste("Z:/H2A.Z/",i,".csv"，collapse = NULL, sep=""), sep="\t") }  
data<-read.delim("Z:/H2A.Z_group.gff3",sep="\t", header = F)

#nucleosome
data<-read.delim("Z:/Nucleosome.srt.rmdup.bed",sep="\t", header = F) 
for (i in unique(data$V1)){  a<-subset(data,V1==i); a1<-cbind(a$V2,V2=1); colnames(a1)<-c("V1","V2"); model3 <- dbscan(a1, eps=200, MinPts=3); a1<-as.matrix(a1); a2<-cbind(model3$cluster,i,a1[,1]); write.csv(a2,file=paste("Z:/nucleosome/",i,".csv"，collapse = NULL, sep=""), sep="\t") }

#H3K4me3
H3K4me3<-read.delim("Z:/H3K4me3_group.gff3",sep="\t", header = F)
for (i in unique(data$V1)){  a<-subset(data,V1==i); a1<-cbind(a$V4,V2=1); colnames(a1)<-c("V1","V2"); model3 <- dbscan(a1, eps=200, MinPts=3); a1<-as.matrix(a1); a2<-cbind(model3$cluster,a1[,1]); write.csv(a2,file=paste("Z:/H3K4me3/",i,".csv"，collapse = NULL, sep=""), sep="\t") }

#6mA
data<-read.delim("Z:/veg_6mA.sorted.gff",sep="\t", header = F) 
for (i in unique(data$V1)){  a<-subset(data,V1==i); a1<-cbind(a$V4,V2=1); colnames(a1)<-c("V1","V2"); model3 <- dbscan(a1, eps=200, MinPts=3); a1<-as.matrix(a1); a2<-cbind(model3$cluster,i,a1[,1]); write.csv(a2,file=paste("Z:/6mA/",i,".csv"，collapse = NULL, sep=""), sep="\t") }

#Tranining and prediction using Random Forest

library(randomForest)
library(caret)
library(pROC)
set.seed(1234)
colnames(TSS)<-c("6mA","H3K4me3","H2A.Z","nucleosome","type")

TSS<-read.csv("TSS-input.txt",sep = "\t",header = 1, row.names = 1)
TES<-read.csv("TES-input.txt",sep = "\t",header = 1, row.names = 1)
TSS_TES<-rbind(TSS,TES)
trains <- createDataPartition(y = TSS_TES$type,p = 0.75,list = F)
traindata <- TSS_TES[trains,]
testdata <- TSS_TES[-trains,]
rf.train <- randomForest(as.factor(type)~.,data = traindata,na.action = na.roughfix,importance = TRUE, ntree=500, mtry=1)

plot(rf.train,main="ERROR&TREES")

importance(rf.train)

trainpredprob<-predict(rf.train, newdata = traindata, type="prob")

trainroc<-roc(response=traindata$type, predictor=trainpredprob[,2])

plot(trainroc, print.auc=TRUE, auc.polygon=TRUE, grid=T, max.auc.polygon=T, auc.polygon.col="skyblue", print.thres=T, legacy.axes=T, bty="l")

bestp <- trainroc$threshold[which.max(trainroc$sensitivities + trainroc$specificities -1)]

trainpredlab <- as.factor(ifelse(trainpredprob[,2] > bestp ,1,0))

confusionMatrix(data = as.factor(trainpredlab),  reference = as.factor(traindata$type), positive = '1', mode = 'everything')

testpredprob <- predict(rf.train, newdata = testdata,type = 'prob')

testpredlab <- as.factor(ifelse(testpredprob[,2] > bestp ,1,0))

confusionMatrix(data = as.factor(testpredlab),  reference = as.factor(testdata$type),  positive = '1', mode = 'everything') 

testroc <- roc(response = testdata$type,predictor = testpredprob[,2]) 

plot(trainroc,print.auc = T,grid = c(0.1,0.2),auc.polygon = F,max.auc.polygon = T,main = "random Forest ROC", grid.col = c("green","red"))
plot(testroc,print.auc = T, print.auc.y = 0.4,add = T, col = 'red')
legend("bottomright",legend = c("traindata","testdata"), col = c(par('fg'),'red'), lwd = 2, cex = 0.9)


rf.test <- predict(rf.train,traindata,type="response")
trn_pred<-ifelse(predict(rf.train, type = "response")> 0.3, 1, 0)
trn_tab <- table(predicted = trn_pred, actual = traindata$type)
tst_pred <- ifelse(predict(rf.train, newdata = testdata, type = "response") > 0.3, 1, 0)
tst_tab <- table(predicted = tst_pred, actual = testdata$type)
calc_class_err(actual = teat$type, predicted = tst_pred)
confusionMatrix(trn_tab, positive = "1")
test_roc <- roc(testdata$type , test_prob, plot = TRUE, print.auc = TRUE)

Validate and optimize gene models.
Generate Draft Gene Model v3.

C. UTR Annotation and Regulatory Element Analysis

Input Data:
- ATAC-seq
- Nanopore RNA data
Process:
- Annotate 5' and 3' UTRs.
- Identify regulatory elements like promoters, TATA boxes, and poly-A signals.
- Update to Draft Gene Model v4.

D. Manual Curation and Genome Polish

Input Data:
- Whole-genome sequencing data
- Public protein databases
Process:
- Manual curation of genes.
- Align sequencing data to polish the genome.
- Finalize gene models in Draft Gene Model v5.

E. Alternatively Spliced Isoform Annotation

Input Data:
- RNA-seq (growth, starvation, conjugation)
- Nanopore DRS
Process:
- Identify and annotate alternative splicing events.
- Generate updated gene annotations in TGD2024.

Dependencies

Python 3.x
R
GATK
Samtools
Nanopore basecalling software
[List any other specific tools or libraries used]

Installation

Clone the repository:

git clone https://github.com/username/tetrahymena-gene-annotation.git
cd tetrahymena-gene-annotation

Install dependencies:

pip install -r requirements.txt

Usage

Prepare the input data: Ensure that all required data files (RNA-seq, ATAC-seq, etc.) are available in the specified directories.
Run the pipeline: Execute the main script that coordinates the workflow.

python main_pipeline.py --input data_directory --output results_directory

Monitor progress: Logs and intermediate files will be saved in the output directory.

Output

Gene Models: Optimized gene models in GFF3 format.
UTR Annotations: Annotated UTR regions.
Alternative Splicing Events: Detailed AS isoform annotations.
Polished Genome: Updated and polished genome sequence.
Protein Function Annotations: Functionally annotated proteins linked to gene models.

Tetrahymena_Genome_annotation_V2024

The appendix file of paper --"Comprehensive genome annotation of the model ciliate Tetrahymena thermophila by in-depth epigenetic and transcriptomic profiling." Doi: https://www.biorxiv.org/content/10.1101/2024.01.31.578305v1

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss your changes or suggestions.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Genome Annotation by in-depth Epigenetic and Transcriptomic profiling (GAET)

Table of Contents

Overview

Pipeline Workflow

A. Gene Optimization with Transcriptomic Data

B. Gene Optimization with Epigenetic Data

C. UTR Annotation and Regulatory Element Analysis

D. Manual Curation and Genome Polish

E. Alternatively Spliced Isoform Annotation

Dependencies

Installation

Usage

Output

Tetrahymena_Genome_annotation_V2024

Contributing

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
A. Gene Optimization with Transcriptomic Data		A. Gene Optimization with Transcriptomic Data
B. Gene Optimization with Epigenetic Data		B. Gene Optimization with Epigenetic Data
C. UTR Annotation and Regulatory Element Analysis		C. UTR Annotation and Regulatory Element Analysis
D. Manual Curation and Genome Polish		D. Manual Curation and Genome Polish
E. Alternatively Spliced Isoform Annotation		E. Alternatively Spliced Isoform Annotation
F. NATs_analysis		F. NATs_analysis
README.md		README.md

yefei521/GAET

Folders and files

Latest commit

History

Repository files navigation

Genome Annotation by in-depth Epigenetic and Transcriptomic profiling (GAET)

Table of Contents

Overview

Pipeline Workflow

A. Gene Optimization with Transcriptomic Data

B. Gene Optimization with Epigenetic Data

C. UTR Annotation and Regulatory Element Analysis

D. Manual Curation and Genome Polish

E. Alternatively Spliced Isoform Annotation

Dependencies

Installation

Usage

Output

Tetrahymena_Genome_annotation_V2024

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages