Alignment of transcription factor binding sites motifs of arbitrary models to detect the similarity of motifs and their co-occurrence
De novo motif search is the standard approach to detect nucleotide context specificity of transcription factor (TF) binding sites (BSs) in genomic DNA based on results of massive sequencing techniques such as ChIP-seq experiments. Although the standard motif model position weight matrix (PWM) is widely aplied, it cannot represent all potential sites of direct TF binding. This follows from the hypothesis of the PWM model on the independence from the impacts of different positions of a motif. De novo motif search is not restricted by only the PWM model of TF BS, it can apply the alternative motif models such as BaMM or SiteGA extending potential genomic loci of direct binding for tested target TFs (Tsukanov et al., 2022). To proceed with the result of de novo motif search, the list of enriched motifs, one should apply a tool for motif comparison. Unfortunately, available motif comparison tools have been developped only for the PWM model. Hence, de novo motif search application is very limited for any motif model besides the standard model PWM. The standard approach here is to apply the PWM model and to compare any possible enriched motifs with the motifs for binding sites of known transcription factors, e.g. from JASPAR, Hocomoco or CisBP databases of PWM models for DNA binding motifs. The default option to detect the similarity of two motifs of the PWM model is to compare two respective position frequency matrices (PFMs) directly evaluating pairwise similarity of columns of these PFMs, e.g. (Pietrokovski, 1996; Sandelin and Wasserman, 2004). Thus, TomTom (Gupta et al., 2007) is a popular tool for assessing the similarity of motifs of the PWM model. Although a PWM model can be constructed according to the alignment of binding sites predicted by an arbitrary alternative model, and, consequently, the similarity between a PWM and an alternative model can be estimated, this approach is not quite correct, since it certainly destroys partially the sequence specificity of an alternative model. The similarity of two motifs of two arbitrary motif models can be assessed indirectly. An arbitrary motif model is a set of weights for certain position-specific nucleotide contexts like frequencies of k-mers. The indirect motif comparison tool is not depend on the exact motif model itself, it depends from profiles of predicted sites of both commpared motif models for a given test set of DNA sequences. This principle of indirect motif comparison based on the overlap of their predictions for a given 'full' set of possible DNA sequences were earlier proposed and implemented in the tools of Vorontsov et al. (2013) and Lambert et al. (2017), but these tools were proposed for the PWM motif model only.
Let we have two motifs, each one represents either the standard PWM or alternative SiteGA model of motif. We suggest the tool to estimate the similarity of these two motifs and to predict their co-occurrence through estimation of area under Precision-Recall (PR) curve estimating True Positive vs. False Positive 'predictions' as notable their overlapping (more than a half of the length of a sorter motif of two) vs. relatively distant locations of these motifs. The motif comparison is derived from prediction profiles of these motifs for given FASTA file of DNA sequences. All possible 'close' locations of two motifs with respect to their mutual orientation and the shifts between the centers of motifs are compared with more distant locations of these two motifs according to the statistics of the Area under Precision-Recall curve (AUPRC). Although PR plots are less frequently used than more popular Receiver Operating Characteristics (ROC) plots, it was proposed that the PR curve is more accurate than ROC curve for imbalanced datasets (Saito, Rehmsmeier, 2015).
The tool takes a tested FASTA file, and two motifs. Among these two motifs motifA (MA)/motifB (MB) have the longer/shorter lengths. If Length(MA) >= Length(MB) then the longer/shorter motifs are MA / MB, othewise vice versa. The input parameter of ERR (Expected Recognition rate) is used transforms the recogntion thresholds of both motifs to the uniform scale. Following previously developed approaches (MCOT, Levitsky et al., 2019) and (MetArea, Levitsky et al., 2024), for each motif ERR is defined as the fraction of positions that contain predicted sites among all tested position in the whole-genome set of promoters of protein-coding genes. The tables 'Threshold vs. ERR' are preliminary computed for each of input motifs. Each table includes two columns: the first one is the recognition threshold given by a motif recognition model and the second one is the common logarithm of the probability value, -Log10(ERR). Table rows are sorted from the highest threshold to the lowest.
Furhther, the pairs of co-occured motifs with respect to DNA strands are predicted with the parameter of the maximal shift L between the centers of two motifs. There are two types of motif orientation. If two motifs lie in the same DNA strand, than they have the 'Direct' orientation: pairs in the forward and reverse strands are '-> ->' and '<- <-'. Two motifs from the opposite DNA strands are '-> <-' and '<- ->', they refer to the 'Reverse' orientation presuming the Inverted or Everted cases, respectively. For each motif, its center is defined as the average value of its start and end positions. For example, the motifs of lengths 4 and 5 located at positions 1..4 and 1..5, the centres are located at positions 2.5 and 3. For given parameter of the maximal shift of motif centers, we allow the positive and negative shifts: the pairs of motifs 'MA .. MB' / 'MB .. MA' respect positive/negative shifts of centers of motifs.
Counts of predicted pairs of motifs with certain orientation (two motifs in the same or different DNA strands) and shift mean the metrics of true positives (TP) predictions. Corresponding pairs mapped in any orientation with the larger shifts from L to 2L are used as the control metrics of false positives (FP). These TP and FP metrics are computed as functions of the ERR values. FP metrics normalized to the TP metrics since TP are defined for each shift and orientation, i.e. FP values transformed FP / 4L values (each motif is predicted with left/right section in any of two strands at maximal distance of L bp). Finally, for each shift and orientation the area under Precision-Recall (PR) curve estimates the alignment of two motifs. Recall is defined as a ratio of the number of predicted pairs (TP) at certain recognition threshold (ERRMAX) of the recognition rate respecting the number of predicted pairs (TP) at the lowest allowable threshold, default ERRMAX = 0.002. Precision is defined as the portion of true positives among the total number of true positives and false positives, Precision = TP/(TP + FP). The value AUPRCMAX(motifA,motifB) is defined for given pair of motifs as the maximal AUPRC value among all shifts and both orientations. All shifts are classified into (a) the strigent overlap (motif center of the shorter motif lies inside the longer motif), and (b) all the remaining cases when the centers of two motifs lie within the threshold of shift L. The highest possible value of the area under PR curve measure (AUPRC) of 1 implies perfectcly matching motifs (i.e. two motifs are coincides). The AUPRC values of 0.5 refers to two random motifs.
For the descibed above metrics AUPRCMAX(MA,MB) respecting 'heterotypic pair' of motifs 'MA & MB', two analogous normalizing values AUPRCMAX(MA,MA) and AUPRCMAX(MB,MB) estimate for two homotypic pairs, 'MA & MA', and 'MB & MB', the overlap of each motif with itself. Finally, the similarity score of two input motifs is computed as follows:
Similarity score = AUPRCMAX(MA,MB) / Max{AUPRCMAX(MA,MA), AUPRCMAX(MB,MB)}
Typically, the similarity score above 0.999 / 0.98 / 0.95 / 0.9 and 0.7 respect the excellent / very high / high / moderate and low levels of similarity between two motifs. As follows from the definition of AUPRC value, the similarity score of 0.5 means that the stringent overlap of two motifs and its absence have equal probabilities. The zero similarity score implies that the overlap of motifs is forbiden with given input threshold of motif recognition. For example, two quite similar yet distinct E-box motifs for the BHLHA15 TF from Hocomoco, and BHA15.H12CORE.0.P.B and BHA15.H12CORE.1.SM.B have consensuses CAgcTG and CAtaTG, respectively. Their similarity score is 0.998. The TomTom tool reports the p-value of 1.46e-03 for the significance of similarity of these motifs.
Following input data are required:
- Two motifs, only PWM and SiteGA motif models are supported. The PWM motif is represented by the position frequency matrix in standard format, and the SiteGA model is the list of locally positioned dinucleotides with their weights, (Tsukanov et al., 2022).
- FASTA file to recognize these two motifs, e.g. ChIP-seq peaks, see example format here top 1000 peaks for mouse BHLHA15 TF
- Two tables 'Threshold vs. ERR (Expected Recognition Rate)', see example format here ERR table for the motif of mouse BHLHA15 TF. These tables computed preliminary with the FASTA file of promoters of protein-coding genes of whole genome, (Tsukanov et al., 2022), example FASTA file for mouse
The tool needs two parameters:
- The threshold of the Expected Recognition Rate (ERR) for both motifs, (Tsukanov et al., 2022)
- The maximal shift between centers of two motifs, L
- The similarity scores and the maximal AUPRC values for (a) the stringent criterion on the overlap of the centers of two motifs, and (b) any shifts between the centers of two motifs, including their location with spacers and small overlaps.
- Histograms of AUPRC values as a function of the mutual orientation of two motifs (they either in the same or opposite strands) and the shift between their centers. Three histograms show distributions for the pair of input motif (heterotypic case, motifA/motifB), and two separate distributions for the first and second motifs (two homotypic cases, motifA/motifA and motifB/motifB). Due to possible even/odd lengths of motifs in a pair, positions in histograms are marked either by integer or non-integer values, ..., -2, -1, 0, +1, +2, ... or ..., -1.5, -0.5, +0.5, +1.5, ..., respectively.
- PR curves for two possible mutual orientations of motifs (motifs in the same and opposite strands) and shifts, -L <= x <= L, x denotes the positive shift means that the longer/shorter motifs are located closer to 5'/3' ends of DNA sequence alignment. Three blocks of data are PR curves for the pair of input motif (heterotypic case), and two separate blocks of curves for the first and second motifs (two homotypic cases).
- input FASTA file, set of DNA sequences, ChIP-seq peaks, example - top 1000 peaks (PEAKS039234) for mouse BHLHA15 TF from GTRD
- type of the first motif model, values 'pwm'/'PWM' and 'sga'/'SGA' imply PWM and SiteGA motif models
- type of the second motif model, values 'pwm'/'PWM' and 'sga'/'SGA' imply PWM and SiteGA motif models
- input file for the first motif, either of PWM or SiteGA model
- input file for the second motif, either of PWM or SiteGA model
- input file 'Threshold vs. ERR' for the first motif, this file has the same format for PWM and SiteGA models, ERR table. These files for PWM and SiteGA models are generated by PWM thresholds selection program from MCOT and SiteGA thresholds selection program from SiteGA repositories.
- input file 'Threshold vs. ERR' file for the second motif, this file has the same format for PWM and SiteGA models, ERR table. These files for PWM and SiteGA models are generated by PWM thresholds selection program from MCOT and SiteGA thresholds selection program from SiteGA repositories.
- integer value, L, maximal shift between the centers of the first and second motifs, the default value is 50 bp
- double ERR threshold, ERRMAX, the default value is 0.002, this value means the maximal ERR value used to predict sites and compute the area under PR curve. This value defines the range of recognition thresholds from the table 'Threshold vs. ERR'.
- output file, histograms of AUPRC values for both orientations of motifs all shifts of their centers, Histogram
- integer value, 0/1 mean printing and non-printing of histograms of AUPRC values to the output file (see argument #10 above)
- output file, PR curves for both orientations of motifs all shifts of their centers, PR curves
- integer value, 0/1 mean printing and non-printing of PR curves to the output file (see argument #12 above)
- output file, short text file of the similarity score for the shifts between the centers of two motifs implying their stringent overlap, file contains only the tabulator symbol '\t' and the similarity score
- output file, short text file of the similarity score for any allowable of the centers of two motifs, file contains only the tabulator symbol '\t' and the similarity score
- output file, text file of the similarity scores and AUPRC values for the stringent overlap and any shift between the centers of two motifs, and the details of the alignment of two motifs (shifts of the centers of two motifs and orientations of two motifs respecting the maximal AUPRC values) Similarity scores and AUPRC values
- One PWM motif vs. another PWM motif
com_line_pwm_pwm command line for two motifs of the PWM model BHA15.H12CORE.0.P.B.pfm and BHA15.H12CORE.1.SM.B.pfm respecting two structurally distinct motif types for murine BHLHA15 TF from Hocomoco and ChIP-seq peaks dataset PEAKS039234 for this TF from GTRD. Here and below parameters 0.002 and 50 are used for the threshold of ERR and the maximal shift. The output data are Histogram of AUPRC values, PR curves and Similarity scores and maximal AUPRC values.
- PWM motif vs. SiteGA motif
com_line_pwm_sga command line for two motifs of PWM and SiteGA models. The PWM motif BHA15.H12CORE.0.P.B.pfm respects the major structural motif type for murine BHLHA15 TF from Hocomoco and ChIP-seq peaks dataset PEAKS039234 for this TF from GTRD. The SiteGA motif PEAKS039234_BHLHA15_Q9QYC3_MACS2_1_40_cmat1 is the first ranked enriched motif derived by SiteGA motif model in the 1st iteration of the cross-validation procedure. The output data are Histogram of AUPRC values, PR curves and Similarity scores and maximal AUPRC values.
- One SiteGA motif vs. another SiteGA motif
com_line_sga_sga command line for two motifs of the SiteGA model. These models respect the murine BHLHA15 TF and dataset of top 1000 ChIP-seq peaks (PEAKS039234) for this TF from GTRD. Two motifs PEAKS039234_BHLHA15_Q9QYC3_MACS2_1_40_cmat1 and PEAKS039234_BHLHA15_Q9QYC3_MACS2_2_40_cmat1 are the first ranked enriched motifs derived by SiteGA motif model in the 1st and 2nd iterations of the cross-validation procedure. The output data are Histogram of AUPRC values, PR curves and Similarity scores and maximal AUPRC values.