Abstract
Accurately identifying potential piRNA-disease associations is of great importance in uncovering the pathogenesis of diseases. Recently, several machine-learning-based methods have been proposed for piRNA-disease association detection. However, they suffer from the high sparsity of the piRNA-disease association network and from the Boolean representation of piRNA-disease associations, which ignores confidence coefficients. In this study, we propose a supplementarily weighted strategy to address these shortcomings. Combined with Graph Convolutional Networks (GCNs), a novel predictor called iPiDA-SWGCN is proposed for piRNA-disease association prediction. There are three main contributions of iPiDA-SWGCN: (i) Potential piRNA-disease associations are preliminarily supplemented in the sparse piRNA-disease network by integrating various basic predictors to enrich the network structure information. (ii) The original Boolean piRNA-disease associations are assigned different relevance confidences so that node representations are learned from neighbour nodes in varying degrees. (iii) The experimental results show that iPiDA-SWGCN achieves the best performance among the compared state-of-the-art methods, and can predict new piRNA-disease associations.
Author summary
PIWI-interacting RNAs (piRNAs) are small non-coding RNAs approximately 23–36 nucleotides in length that have a critical impact on various biological processes. It is crucial to develop computational methods for identifying piRNA-disease associations. Although several computational methods have contributed to piRNA-disease association detection, two main limitations still hinder further improvement: (i) The high sparsity of piRNA-disease associations prevents predictors from accurately inferring potential associations. (ii) Most existing methods focus on the Boolean connectivity of piRNA-disease associations, ignoring the fact that associations carry different levels of confidence. To address these limitations, we propose a supplementarily weighted strategy in this study. Combined with Graph Convolutional Networks (GCNs), a novel predictor named iPiDA-SWGCN is proposed for piRNA-disease association prediction. The supplementarily weighted strategy for completing piRNA-disease associations can not only enrich the network structure information, but also provide association confidence information for the GCN to learn node representations. The evaluation results show that iPiDA-SWGCN can effectively detect missing piRNA-disease associations, and outperforms the other state-of-the-art methods.
Citation: Hou J, Wei H, Liu B (2023) iPiDA-SWGCN: Identification of piRNA-disease associations based on Supplementarily Weighted Graph Convolutional Network. PLoS Comput Biol 19(6): e1011242. https://doi.org/10.1371/journal.pcbi.1011242
Editor: Ilya Ioshikhes, CANADA
Received: November 23, 2022; Accepted: June 5, 2023; Published: June 20, 2023
Copyright: © 2023 Hou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data and source code are available at http://bliulab.net/iPiDA-SWGCN/.
Funding: This work was supported by the National Natural Science Foundation of China (No. U22A2039, 62271049, 62250028 to BL) and Natural Science Basic Research Program of Shaanxi (No. 2023-JC-QN-0636 to HW). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
PIWI-interacting RNAs (piRNAs) are a category of small non-coding RNAs composed of approximately 23–36 nucleotides [1,2]. They have a critical impact on various biological processes by regulating gene expression at the epigenetic and post-transcriptional levels [3,4]. For example, piRNAs bind to PIWI proteins to regulate transposon silencing, genome rearrangement, spermiogenesis, germ stem-cell maintenance, etc. [5,6].
Emerging evidence indicates that piRNAs participate in the genesis and prognosis of many diseases [7–9]. For example, piR-36712 is downregulated in breast cancer, and it suppresses cell proliferation, invasion and migration through its interaction with SEPW1P RNA [10]. piR-651 is upregulated in gastric cancer, and its expression tends to be associated with TNM stage [11]. Several studies highlight that piRNAs can be viewed as potential biomarkers for disease diagnosis and prognosis, supporting the design of effective therapies [12,13]. Therefore, developing computational methods for identifying piRNA-disease associations is of great significance.
Several computational methods have been proposed to predict the associations between non-coding RNAs and diseases. These methods usually rely on network link prediction [14], recommendation systems [15], matrix completion [16], classical machine learning [17] and deep learning [18]. However, research on the detection of piRNA-disease associations is still in its preliminary stages. To unravel the complex interactions between piRNAs and diseases, several computational methods for predicting piRNA-disease associations have been proposed [19–23]. For example, iPiDi-PUL [19] and iPiDA-sHN [20] predict piRNA-disease associations based on positive-unlabeled learning. APDA [21] and iPiDA-GBNN [22] employ a stacked auto-encoder to extract piRNA-disease pair features, which are then used to train random forests and a Gradient Boosting Neural Network, respectively, to predict new piRNA-disease associations. DFL-PiDA [23] combines a convolutional de-noising auto-encoder and an extreme learning machine to identify potential associations. iPiDA-LTR [24] employs a ranking framework to integrate several component methods for piRNA-disease association detection. To further improve the representation ability for associations, iPiDA-GCN [25] uses Asso-GCN and Sim-GCN models to iteratively extract features for piRNAs and diseases.
Although the aforementioned methods have contributed to piRNA-disease association detection, two main limitations hinder further improvement of piRNA-disease association prediction: (i) The high sparsity of piRNA-disease associations. For example, only about 5% of piRNA-disease pairs have experimentally validated associations in piRDisease v1.0 [26] and MNDR v3.0 [27]. The lack of association information prevents predictors from accurately inferring piRNA-disease associations. (ii) The Boolean associations between piRNAs and diseases. Most existing methods for piRNA-disease association detection use Boolean values during training to denote whether a piRNA is related to a disease or not. However, piRNAs are related to diseases with different probabilities. Boolean associations capture only connectivity and ignore this confidence information.
To address the above limitations, we propose a supplementarily weighted strategy that enriches the topological structure of the piRNA-disease network so as to provide more information for piRNA-disease association detection. As shown in Fig 1, our goal is to infer whether the target piRNA is associated with the target disease. In the original network, owing to the lack of link information, it is difficult to detect whether the target association exists (Fig 1a). Three weighted piRNA-disease associations are then preliminarily supplemented (Fig 1b), so that three feasible paths can be generated to infer the association probability (Fig 1c). Finally, the prediction result is obtained by integrating the inference results from the feasible paths (Fig 1d). In this way, supplementarily weighted associations enrich the connectivity information and yield more possible paths for comprehensively predicting the relationship between piRNAs and diseases.
(a) The original piRNA-disease network is constructed based on similarity information and piRNA-disease associations. (b) PiRNA-disease network with supplementarily weighted associations. (c) Three feasible paths for piRNA-disease association detection, highlighted in purple, pink and yellow shadows. (d) The final prediction results for the target association. Integrating the inferences from three feasible paths, the final prediction result is obtained and denoted as the red line.
With the development of representation learning, the Graph Convolutional Network (GCN) [28] was proposed to extend the CNN to graph-structured data, and it achieves powerful representation learning by capturing rich structural information [29]. In this paper, we combine the supplementarily weighted strategy with GCN, and propose a novel predictor named iPiDA-SWGCN for piRNA-disease association detection with three major contributions: (i) The high-quality supplementarily weighted matrix computed by machine learning methods provides more information for the original sparse piRNA-disease association network, so that the GCN can aggregate more neighbour node information for expressive feature learning. (ii) The Boolean associations are replaced by weighted associations with the computed relevance confidence. With piRNA-disease associations assigned different weights, the GCN can learn node representations from neighbour nodes in varying degrees. (iii) The evaluation results indicate that iPiDA-SWGCN can effectively detect missing piRNA-disease associations, and performs better than the other state-of-the-art methods.
Materials and methods
Datasets
A recently constructed database MNDR v3.0 (http://www.rnadisease.org/) [27] records different categories of ncRNA-disease associations. After removing duplicated associations and extracting human-related piRNA-disease associations, 11981 experimentally validated piRNA–disease associations are selected to construct the dataset with 10149 piRNAs and 19 diseases.
In this study, the machine learning predictors are trained in two phases: (i) training several basic classifiers to compute the weights of unknown piRNA-disease associations, and (ii) training the GCN to capture piRNA and disease features. In the first phase, the dataset $D_{all}$ is split into a benchmark dataset $D^{1}_{ben}$ and an independent dataset $D^{1}_{ind}$. We train several basic predictors on the benchmark dataset, and then use the trained predictors to score the associations in the independent dataset. The datasets can be formulated as:
$$\begin{cases} D_{all} = D^{+}_{all} \cup D^{-}_{all} \\ D^{1}_{ben} = D^{1+}_{ben} \cup D^{1-}_{ben} \\ D^{1}_{ind} = D^{1+}_{ind} \cup D^{1-}_{ind} \end{cases} \tag{1}$$
where $D^{+}_{all}$ is the positive set with 11981 known piRNA-disease associations, and $D^{-}_{all}$ contains 180850 unknown piRNA-disease associations. $D^{1+}_{ben}$ and $D^{1-}_{ben}$ are randomly selected from $D_{all}$ with an equal number of piRNA-disease associations, and represent the positive and negative subsets of $D^{1}_{ben}$, respectively. In order to assign weights to all unknown piRNA-disease associations, the negative independent set $D^{1-}_{ind}$ contains all unknown associations. The positive independent set $D^{1+}_{ind}$ is constructed by randomly selecting 20% of the positive associations in $D^{+}_{all}$.
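In the second training phase, we divide the dataset following previous studies [30,31]:
$$\begin{cases} D_{all} = D^{2}_{ben} \cup D^{2}_{ind} \\ D^{2}_{ben} = D^{2+}_{ben} \cup D^{2-}_{ben} \\ D^{2}_{ind} = D^{2+}_{ind} \cup D^{2-}_{ind} \end{cases} \tag{2}$$
where the positive benchmark dataset $D^{2+}_{ben}$ contains 80% of the positive piRNA-disease associations, randomly selected from the set of all known associations $D^{+}_{all}$, and the remaining positive associations constitute the positive independent dataset $D^{2+}_{ind}$. The negative independent dataset $D^{2-}_{ind}$ is randomly selected from the set of all unknown associations $D^{-}_{all}$ with the same size as $D^{2+}_{ind}$, and the rest of the negative samples constitute the sub-dataset $D^{2-}_{ben}$.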
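To prevent overestimating the performance of the proposed method, the piRNA-disease associations in the independent dataset used for model evaluation, $D^{2}_{ind}$, are removed from the training data of both phases:
$$D^{1}_{ben} \cap D^{2}_{ind} = \varnothing, \qquad D^{2}_{ben} \cap D^{2}_{ind} = \varnothing \tag{3}$$
where $D^{1}_{ben}$ and $D^{2}_{ben}$ denote the benchmark (training) datasets of the first and second phases, respectively.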
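The second-phase split and the leakage check can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the authors' code: the helper name `split_phase_two`, the random seed and the toy pairs are all assumptions made for the example.

```python
import random

def split_phase_two(positives, unknowns, test_fraction=0.2, seed=0):
    """Hypothetical helper: hold out a fraction of the known associations and an
    equally sized random sample of unknown pairs; everything else is benchmark."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    n_test = round(len(pos) * test_fraction)
    ind_pos, ben_pos = pos[:n_test], pos[n_test:]

    neg = list(unknowns)
    rng.shuffle(neg)
    ind_neg, ben_neg = neg[:n_test], neg[n_test:]
    return (ben_pos, ben_neg), (ind_pos, ind_neg)

# Toy data: 6 piRNAs x 3 diseases with 5 known associations (illustrative only).
all_pairs = [(p, d) for p in range(6) for d in range(3)]
known = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 1)]
unknown = [pair for pair in all_pairs if pair not in known]

(ben_pos, ben_neg), (ind_pos, ind_neg) = split_phase_two(known, unknown)

# Leakage check: no held-out pair may appear in the training data.
train = set(ben_pos) | set(ben_neg)
held_out = set(ind_pos) | set(ind_neg)
print(len(ben_pos), len(ben_neg), len(ind_pos), len(ind_neg), train.isdisjoint(held_out))
```

In a two-phase pipeline the same held-out pairs would also have to be excluded from the training sets of the basic predictors, which is the point of the removal step described above.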
Method overview
In this study, we propose a novel method named iPiDA-SWGCN for piRNA-disease association detection. The overall process of iPiDA-SWGCN is shown in Fig 2 with four parts: (i) Network construction (Fig 2a). A heterogeneous piRNA-disease network is constructed by integrating piRNA and disease information; (ii) Supplementarily weighted piRNA-disease network generation (Fig 2b). Different supplementary weights are assigned to unknown piRNA-disease pairs based on the scores computed by several predictors; (iii) GCN-based feature extraction (Fig 2c). In this section, GCN is performed on the supplementarily weighted piRNA-disease network to capture the structural information, and extract feature representations of piRNAs and diseases. (iv) Association prediction (Fig 2d). Finally, we use the fully connected layers for dimension reduction and inner product operation so as to calculate the association scores between piRNAs and diseases.
(a) The piRNA-disease network is constructed by integrating piRNA sequence similarity, disease semantic similarity and piRNA-disease associations. (b) The supplementarily weighted piRNA-disease network is generated based on the proposed supplementarily weighted strategy, with the weights computed by several basic predictors. (c) piRNA and disease node features are extracted by performing GCN on the supplementarily weighted piRNA-disease network. (d) Fully connected layers and an inner product are employed to predict the final association scores.
Network construction
In this section, we construct a heterogeneous piRNA-disease network denoted as G = {EpiR−disease, EpiR−piR, Edisease−disease, VpiR, Vdisease}. The network includes three types of edges and two types of nodes. The three kinds of edges are piRNA-disease associations, piRNA-piRNA edges and disease-disease edges, represented as EpiR−disease, EpiR−piR and Edisease−disease, respectively. EpiR−disease are the original piRNA-disease associations derived from the MNDR v3.0 database. The other two kinds of edges are obtained from the similarity between homogeneous biological entities. VpiR and Vdisease represent the piRNA and disease nodes. The calculation of these edge and node representations is introduced in the following sections.
PiRNA-disease associations.
The adjacency matrix $A_{PD}$ represents the association for each piRNA-disease pair, denoted as:
$$A_{PD} = (a_{i,j}) \in \{0,1\}^{m \times n} \tag{4}$$
where m and n are the numbers of piRNAs and diseases, respectively. The element $a_{i,j}$ is 1 if the association between the i-th piRNA and the j-th disease has been experimentally verified; otherwise $a_{i,j} = 0$.
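As a small illustration of this Boolean representation, the following sketch builds the adjacency matrix from a list of known pairs and reports how sparse it is. The function name `build_adjacency` and the toy indices are assumptions for the example, not identifiers from the original work.

```python
def build_adjacency(n_pirnas, n_diseases, known_pairs):
    """Boolean piRNA-disease matrix: 1 for a verified association, else 0."""
    matrix = [[0] * n_diseases for _ in range(n_pirnas)]
    for i, j in known_pairs:
        matrix[i][j] = 1
    return matrix

# Toy network: 4 piRNAs, 3 diseases, 3 verified associations.
A_pd = build_adjacency(4, 3, [(0, 0), (1, 2), (3, 1)])

# Fraction of piRNA-disease pairs with no known association.
sparsity = 1 - sum(map(sum, A_pd)) / (4 * 3)
print(A_pd, sparsity)
```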
PiRNA-piRNA sequence similarity.
The sequence similarity is calculated for each pair of piRNAs, with the sequence information downloaded from piRBase v3.0 (http://bigdata.ibp.ac.cn/piRBase/) [32]. PiRNA-piRNA sequence similarity is calculated via the Smith-Waterman local alignment algorithm [33], which is highly sensitive and can detect subtle similarities between sequences with low identity in a robust and accurate manner [34,35]. We compute the sequence similarity score for a given pair of piRNAs as [33]:
$$P_{seq}(p_i, p_j) = \frac{SW(p_i, p_j)}{\sqrt{SW(p_i, p_i) \cdot SW(p_j, p_j)}} \tag{5}$$
where $SW(p_i, p_j)$ denotes the sequence alignment score between $p_i$ and $p_j$ calculated by the Smith-Waterman alignment algorithm.
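A compact way to see how a normalized local-alignment similarity behaves is the sketch below. It assumes the common normalization SW(a, b) / sqrt(SW(a, a) · SW(b, b)), a linear gap penalty, and arbitrary scoring values (match 2, mismatch −1, gap −1); the scoring scheme actually used in the study is not stated in this section, and the sequences are made-up strings rather than real piRNAs.

```python
import math

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def seq_similarity(a, b):
    """Alignment score normalized by the two self-alignment scores."""
    return smith_waterman(a, b) / math.sqrt(smith_waterman(a, a) * smith_waterman(b, b))

print(seq_similarity("ACGT", "ACGT"), seq_similarity("ACGT", "TTTT"))
```

The normalization keeps the similarity of a sequence with itself at 1 and makes scores comparable across piRNAs of different lengths.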
Disease-disease semantic similarity.
Disease semantic similarity has been extensively used in ncRNA-disease association detection [36–39]. A Directed Acyclic Graph (DAG) describes the relationships among different disease terms, and disease semantic similarity can be calculated based on the Disease Ontology (DO) descriptors in the DAG [40–42]. The DAG-based algorithm uses a consistent standard, which makes the calculated disease similarities uniform and comparable, and it captures the complex relationships between diseases, giving a better representation of the disease space and of the interconnections among diseases. In this study, we adopt one of the most effective semantic similarity measurements [41,43,44] to compute the disease semantic similarity. For disease $d_i$ and disease $d_j$, the disease semantic similarity $D_{sem}(d_i, d_j)$ is calculated by [44]:
$$D_{sem}(d_i, d_j) = \frac{\sum_{t \in T_i \cap T_j} \left( S_{d_i}(t) + S_{d_j}(t) \right)}{\sum_{t \in T_i} S_{d_i}(t) + \sum_{t \in T_j} S_{d_j}(t)} \tag{6}$$
$$S_{d_i}(t) = \begin{cases} 1, & t = d_i \\ \max \left\{ \theta \cdot S_{d_i}(t') \mid t' \in \text{children of } t \right\}, & t \neq d_i \end{cases} \tag{7}$$
where $T_i$ is composed of all subterms in the DAG of disease $d_i$, and $S_{d_i}(t)$ indicates the semantic contribution of disease $t \in T_i$ to the i-th disease according to [44]. θ is the semantic contribution factor, which is set to 0.5 following [44].
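The DAG-based measure can be illustrated on a toy ontology. The sketch below assumes the widely used formulation in which a term contributes 1 to itself and each ancestor receives the largest contribution of its children scaled by θ = 0.5; the term names and the `parents` dictionary are invented for the example and are not taken from the Disease Ontology.

```python
def semantic_contributions(term, parents, theta=0.5):
    """Return {t: contribution of t to `term`} for the term and all its ancestors."""
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            value = theta * contrib[child]
            # Keep the largest contribution when several paths reach a parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, theta=0.5):
    s1 = semantic_contributions(d1, parents, theta)
    s2 = semantic_contributions(d2, parents, theta)
    shared = s1.keys() & s2.keys()
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))

# Hypothetical three-level ontology: child term -> list of parent terms.
parents = {
    "kidney cancer": ["cancer"],
    "breast cancer": ["cancer"],
    "cancer": ["disease"],
}
sim = semantic_similarity("kidney cancer", "breast cancer", parents)
print(sim)
```

Here each disease has contributions 1, 0.5 and 0.25 for itself, "cancer" and "disease", so the two sibling terms share 1.5 out of a total of 3.5.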
Node feature extraction.
The row vector of the similarity matrix $P_{seq}$ or $D_{sem}$ can serve as the feature vector for a piRNA or disease. However, such vectors ignore the connectivity information, especially for non-neighboring and higher-order connected nodes. Therefore, in order to introduce the global topology information of the similarity networks, we further apply random walk with restart (RWR) [45] on $P_{seq}$ and $D_{sem}$ to extract piRNA and disease features. The feature generation for node i based on RWR can be formulated as [45]:
$$p_i^{k+1} = (1 - \alpha)\, p_i^{k} S + \alpha\, y_i \tag{8}$$
where $p_i^{k}$ is the row vector of node i, whose elements indicate the probabilities of walking from node i to all the other homogeneous nodes after k steps. S is the probability transition matrix obtained from the similarity matrix ($P_{seq}$ or $D_{sem}$) with row-wise normalization, $y_i$ is a one-hot vector representing the initial probability vector of node i, and α is the restart probability. Finally, we obtain the steady-state vector $p_i^{\infty}$ as the feature descriptor of node i.
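The random walk with restart can be sketched as a simple fixed-point iteration. The update rule assumed here is p ← (1 − α) · p · S + α · y, with S the row-normalized similarity matrix and y the one-hot start vector; the restart probability of 0.5, the tolerance and the toy similarity matrix are illustrative choices rather than values reported in the text.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1, giving a probability transition matrix."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized

def rwr_features(similarity, alpha=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities for a walk restarted at each node."""
    S = row_normalize(similarity)
    n = len(S)
    features = []
    for i in range(n):
        restart = [1.0 if j == i else 0.0 for j in range(n)]
        p = restart[:]
        for _ in range(max_iter):
            walked = [sum(p[k] * S[k][j] for k in range(n)) for j in range(n)]
            updated = [(1 - alpha) * w + alpha * r for w, r in zip(walked, restart)]
            shift = max(abs(u - v) for u, v in zip(updated, p))
            p = updated
            if shift < tol:
                break
        features.append(p)
    return features

# Toy 3-node similarity network (illustrative values only).
toy_similarity = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
features = rwr_features(toy_similarity)
print(features[0])
```

Each resulting row is a probability distribution over nodes, so nodes that are only indirectly similar to the start node still receive non-zero feature values.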
Supplementarily Weighted Graph Convolutional Network (SWGCN)
The large number of unknown pairs makes the piRNA-disease association network highly sparse. Applying GCN to such a network may degrade performance because only limited neighbor information can be aggregated. To overcome this problem, we propose the Supplementarily Weighted Graph Convolutional Network (SWGCN). First, the supplementarily weighted adjacency matrix is computed to obtain an informative, high-quality piRNA-disease association network; GCN is then adopted to capture hidden structural information for feature learning.
Weights computation for piRNA-disease association.
In this section, we define the supplementarily weighted adjacency matrix in SWGCN for piRNA-disease association completion as follows:
Definition 1. Supplementarily weighted adjacency matrix. Given a network G = {V, E}, we train several basic machine-learning-based predictors to compute the average relevance scores for unknown associations among nodes. Specifically, $w_{i,j}$ represents the average score of edge $e_{i,j}$ between node $n_i$ and node $n_j$, and the adjacency matrix of the completed network can be denoted as $\hat{A}_{PD} = (\hat{a}_{i,j}) \in \mathbb{R}^{m \times n}$, where $\hat{a}_{i,j} = w_{i,j}$ if $A_{PD}(i, j) = 0$, and $\hat{a}_{i,j} = A_{PD}(i, j) = 1$ otherwise.
To improve the quality of SWGCN, 15 different basic predictors based on Random Forest (RF) [46], Support Vector Machine (SVM) [47] and Gradient Boosting Decision Tree (GBDT) [48] are trained for weight computation. The predictors are trained on different training sets: the positive training set stays the same, while the negative training set is randomly re-selected from the unknown associations five times. For a pair of piRNA $p_i$ and disease $d_j$, the weight $w_{i,j}$ is formulated as:
$$w_{i,j} = \frac{1}{15} \sum_{k=1}^{5} \left( s^{RF}_{k}(p_i, d_j) + s^{SVM}_{k}(p_i, d_j) + s^{GBDT}_{k}(p_i, d_j) \right) \tag{9}$$
where $s^{SVM}_{k}(p_i, d_j)$ denotes the prediction score computed by the k-th SVM classifier, and $s^{RF}_{k}$ and $s^{GBDT}_{k}$ are defined analogously. The 15 predictors score the unknown associations, and the average scores are set as the weights of the unknown associations. Finally, the weighted edges are added to the original heterogeneous network to generate the supplementarily weighted piRNA-disease network.
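The averaging step can be illustrated independently of how the basic predictors are trained. In the sketch below, three small score matrices stand in for the outputs of the trained classifiers (the study uses 15), known associations keep the value 1, and every unknown pair receives the mean predictor score. The function name `supplement` and all numbers are assumptions for illustration.

```python
from statistics import fmean

def supplement(adjacency, score_matrices):
    """Keep known associations at 1 and fill unknown pairs with the mean score."""
    weighted = []
    for i, row in enumerate(adjacency):
        weighted.append([
            1.0 if known else fmean(scores[i][j] for scores in score_matrices)
            for j, known in enumerate(row)
        ])
    return weighted

# Toy Boolean matrix for 2 piRNAs x 2 diseases with one known association.
A_pd = [[1, 0], [0, 0]]

# Hypothetical scores from three basic predictors (stand-ins for RF, SVM, GBDT).
rf_scores = [[0.9, 0.2], [0.4, 0.1]]
svm_scores = [[0.8, 0.3], [0.6, 0.1]]
gbdt_scores = [[0.7, 0.1], [0.5, 0.1]]

A_weighted = supplement(A_pd, [rf_scores, svm_scores, gbdt_scores])
print(A_weighted)
```

Averaging over predictors trained on different negative samples reduces the influence of any single noisy classifier on the supplemented edge weights.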
Feature learning based on Graph Convolutional Network.
GCN has been widely used for capturing graph structural information and aggregating neighbour information so as to extract node features [49,50]. After assigning weights to piRNA-disease associations, GCN is performed on the supplementarily weighted piRNA-disease network to learn the feature representations of different nodes by aggregating information from their neighbours. The node embedding learned by GCN is updated layer by layer as [28]:
$$H^{l+1} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A}_{all} \tilde{D}^{-\frac{1}{2}} H^{l} W^{l} \right) \tag{10}$$
where
$$A_{all} = \begin{bmatrix} A_{PP} & \hat{A}_{PD} \\ \hat{A}_{PD}^{T} & A_{DD} \end{bmatrix} \tag{11}$$
$$\tilde{A}_{all} = A_{all} + I \tag{12}$$
$$\tilde{D}_{i,i} = \sum_{j} \tilde{A}_{all}(i, j) \tag{13}$$
Here the adjacency matrix $A_{all} \in \mathbb{R}^{(m+n) \times (m+n)}$ is the association adjacency matrix of the whole network, in which $\hat{A}_{PD}$ is the supplementarily weighted piRNA-disease matrix and $A_{PP}$ and $A_{DD}$ denote the piRNA-piRNA and disease-disease blocks; $\tilde{A}_{all}$ is $A_{all}$ with self-loops added; $\tilde{D}$ denotes the degree matrix of $\tilde{A}_{all}$; $W^{l}$ represents the trainable parameters of the l-th GCN layer; and σ(·) denotes the nonlinear activation function. $H^{0} \in \mathbb{R}^{(m+n) \times c}$ is initialized by concatenating the piRNA and disease representations, where c denotes the common dimension of the piRNA and disease features obtained by the dense layer. It should be noted that a batch norm layer is added before each dense layer and convolution layer to alleviate vanishing or exploding gradients and to speed up convergence. The batch norm layer standardizes the mean and variance of the input to each layer using statistics computed over a batch of training examples, which keeps the distribution of each layer's input relatively stable and further reduces the risk of overfitting.
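A single propagation step of the standard GCN rule, σ(D̃^{-1/2}(A + I)D̃^{-1/2} H W), can be sketched in plain Python to show the effect of a supplementary weighted edge. Several simplifications are assumptions of this example: ReLU is used as the activation, the feature and weight matrices are identities, and the toy network (two piRNAs, one disease) has no piRNA-piRNA or disease-disease edges.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a weighted adjacency matrix A."""
    n = len(A)
    A_loop = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in A_loop]
    return [[inv_sqrt_deg[i] * A_loop[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def gcn_layer(A, H, W):
    """One propagation step with ReLU as the (assumed) activation."""
    Z = matmul(matmul(normalize_adjacency(A), H), W)
    return [[max(0.0, z) for z in row] for row in Z]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Nodes 0 and 1 are piRNAs, node 2 is a disease.
A_original = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Same network with a supplementary edge of weight 0.4 between piRNA 1 and the disease.
A_weighted = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.4], [1.0, 0.4, 0.0]]

out_original = gcn_layer(A_original, identity, identity)
out_weighted = gcn_layer(A_weighted, identity, identity)
print(out_original[2], out_weighted[2])
```

With identity features, row 2 of each output shows how much the disease node draws from each piRNA: nothing from piRNA 1 in the original network, and a smaller but non-zero share than from piRNA 0 once the weighted edge is added.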
Compared with the original piRNA-disease network, the supplementarily weighted edges allow GCN to convolve more useful neighbor information. Fig 3 compares GCN performed on the two networks. Take the disease node a as an example. On the original piRNA-disease network, the first GCN layer updates the feature representation of node a by aggregating the features of its first-order neighbor nodes b1 and b2 (Fig 3a and 3b), and fails to capture information from the indirectly connected nodes c1, c2, c3, d1, d2 and d3. In contrast, on the supplementarily weighted piRNA-disease association network, the representation of node a can be learned by aggregating information from all piRNA nodes with different weights (Fig 3c and 3d). As a result, SWGCN is able to capture deeper and wider neighbor information for informative feature learning.
Fig 3a and 3b show the receptive field of 1-layer of GCN on the original piRNA-disease association network, while Fig 3c and 3d show the receptive field of 1-layer GCN on the supplementarily weighted association network.
Association prediction
After the feature representations are learned by GCN, fully connected networks with three layers are constructed to separately reduce the dimensions of the piRNA and disease representations. The probability that piRNA $p_i$ is associated with disease $d_j$ is calculated by an inner product as:
$$U(i, j) = h_{p_i} \cdot h_{d_j}^{T} \tag{14}$$
where $h_{p_i}$ and $h_{d_j}$ denote the feature representations of piRNA $p_i$ and disease $d_j$ after dimensionality reduction, and U is the predicted final score matrix for the associations between piRNAs and diseases. It should be noted that we apply Batch Normalization (BN) [51] after each convolution layer to mitigate internal covariate shift and increase stability.
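The scoring step amounts to taking the inner product between every reduced piRNA embedding and every reduced disease embedding. The sketch below uses made-up two-dimensional embeddings; the function name `association_scores` is an assumption, and any output transformation applied to the scores in the actual model is not shown.

```python
def association_scores(pirna_emb, disease_emb):
    """Score matrix whose (i, j) entry is the inner product of the two embeddings."""
    return [[sum(p * d for p, d in zip(p_vec, d_vec)) for d_vec in disease_emb]
            for p_vec in pirna_emb]

# Hypothetical embeddings after dimensionality reduction (2 piRNAs, 3 diseases).
pirna_emb = [[1.0, 0.0], [0.5, 0.5]]
disease_emb = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

U = association_scores(pirna_emb, disease_emb)
print(U)
```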
We use the mean square error as the loss function, which minimizes the Frobenius norm of the difference between the prediction score matrix U and the label matrix $A_{PD}$. Nevertheless, the high sparsity of the piRNA-disease association matrix may bias the predictions for unknown associations. To alleviate this problem, we adopt the β-enhanced loss function [52], which enlarges the margin between the real label matrix $A_{PD}$ and the predicted score matrix U with the hyper-parameter α. The loss function is formulated as:
$$\mathcal{L} = \left\| A_{enh} - U \right\|_{F}^{2} + \mu \left\| W \right\|_{F}^{2} \tag{15}$$
where $A_{enh}$, given by Eq (16), indicates the enhanced association matrix, W denotes the trainable parameters, and μ is a decay factor controlling the regularization term of W to prevent overfitting.
Performance evaluation
Two metrics are employed to comprehensively evaluate the predictor performance, including AUPR (area under the precision recall curve) [53] and AUC (area under the receiver operating characteristics curve) [54]. AUC measures the sensitivity and specificity of the model [55], and AUPR can avoid the impact of imbalanced data sets, and comprehensively reflect the quality of predictions [56].
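Both metrics can be computed from ranked prediction scores, as the following sketch shows. AUC is obtained as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision, which is one common estimator of the area under the precision-recall curve; whether the study uses this estimator or trapezoidal integration is not stated here. Labels and scores are toy values, and tied scores are not specially handled in the average-precision function.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy predictions: 2 known associations among 5 candidate pairs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
print(auc, aupr)
```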
Results and discussion
Combination of basic methods can improve the quality of SWGCN
In this study, three basic machine learning methods (RF, SVM and GBDT) are used to compute weights for the unknown piRNA-disease pairs. To analyze their contribution to weighting associations in the proposed model, we compare the predictors based on different basic methods and their combinations. Table 1 lists the results, and their performance differences are shown in Fig 4, from which we can draw the following conclusions: (i) Assigning weights to associations indeed contributes to performance improvement. (ii) It is not surprising that integrating several complementary methods for weights computation can effectively improve the performance compared with using a single basic method. (iii) iPiDA-SWGCN outperforms all the other methods because it integrates all the basic methods to compute weights so as to assign weights to piRNA-disease associations with better performance.
(a) and (b) show the difference values in terms of AUC and AUPR among different predictors, respectively. The difference values are computed by the scores obtained from the method labelled at y-axis minus the scores obtained from the method labelled at x-axis.
Impact of different association adjacency matrices on the performance of iPiDA-SWGCN
Previous studies represent piRNA-disease associations as Boolean values: an element of the adjacency matrix is 1 if the association is known, and 0 otherwise. In this study, we compute weights for the unknown associations in the network instead of setting them to 0. To demonstrate the effectiveness and necessity of assigning weights rather than Boolean values to unknown edges, we perform experiments on five networks with different edge types. The experimental results are listed in Table 2, from which we can see that the model with the weighted network is superior to the models in which Boolean piRNA-disease edges are added. The reason is that assigning weights to associations can not only enrich the network structural information, but also provide association confidence information for the GCN so that it learns expressive node representations. In contrast, Boolean edges only indicate whether the associations exist or not.
Performance comparison among different methods
To demonstrate the effectiveness of iPiDA-SWGCN for identifying piRNA-disease associations, we compared the performance of our method with five state-of-the-art approaches: iPiDi-PUL [19], iPiDA-sHN [20], iPiDA-LTR [24], iPiDA-GCN [25] and piRDA [57]. The web servers or source code of all these methods are accessible, enabling an unbiased evaluation of their performance. The experimental results are displayed in Table 3, from which we can conclude that iPiDA-SWGCN outperforms all the other methods. iPiDA-SWGCN is superior to iPiDA-GCN by 10.29% and 11.15% in terms of AUC and AUPR, respectively. The performance improvement can be attributed to the fact that iPiDA-SWGCN captures more expressive node representations by aggregating more neighbor information from the supplementarily weighted network.
Visualization of predicted associations by iPiDA-SWGCN
In order to visually illustrate the effectiveness of the supplementarily weighted network used in iPiDA-SWGCN, the prediction results of GCN performed on different networks are compared. We take two piRNA-disease associations for further analysis: <piR-hsa-22710, Parkinson’s disease> and <piR-hsa-28405, renal cell carcinoma>. Parkinson’s disease (PD) is a widely prevalent neurodegenerative disorder. The gross pathological findings reveal significant damage to dopaminergic neurons in the midbrain’s substantia nigra (SN), leading to dopamine deficiency in the nerve terminals located in the basal ganglia [58]. piR-hsa-22710 is 30 nt long and is downregulated in PD-patient tissue samples [59]. Renal cell carcinoma (RCC) is a prevalent cancer, ranking sixth in incidence in men and tenth in women globally [59]. Recent research suggests that the expression levels of piRNAs are linked to the histological grade and pathological features of RCC, and to patient survival. Multiple studies have demonstrated that piRNAs are differentially expressed in benign versus malignant renal tumor tissues [59,60]. piR-hsa-28405 is a Homo sapiens piRNA 32 nt in length, showing about 4-fold downregulation in renal tumor tissue [60].
The results are shown in Fig 5, from which we can draw the following conclusions: (i) Due to the shortage of verified piRNA-disease associations, the ability of GCN to capture network structural information is limited; (ii) GCN performed on the supplementarily weighted network can correctly predict the test piRNA-disease associations, because the weighted network can not only provide more proximity structural information, but also contains the association confidence.
The figures are plotted using Gephi [61]. (a) PiRNA-disease associations for training and test. The black lines denote the associations in the training set, and the green lines denote the associations in the test set for prediction. (b) Predicted results of the GCN performed on the original piRNA-disease association network. The red lines show the prediction results on the original network, and Parkinson’s disease is incorrectly predicted to be associated with piR-hsa-28405. (c) Supplementarily weighted piRNA-disease association network. The blue lines denote the supplementarily weighted associations computed by several basic predictors. (d) Predicted results of the GCN performed on the supplementarily weighted piRNA-disease association network. The associations in the test set are correctly predicted by GCN when the supplementarily weighted associations are added.
Case study
To demonstrate the practicability of iPiDA-SWGCN, we applied it to detect potential piRNAs associated with three important diseases (‘Renal cell carcinoma’, ‘Parkinson’s disease’ and ‘Cardiovascular disease’). The top five detected piRNAs and their corresponding literature evidence are listed in Table 4. From Table 4 we can see that all of the top five detected piRNAs associated with ‘Renal cell carcinoma’ and ‘Parkinson’s disease’ have been validated by the literature, and four of the five detected piRNAs associated with ‘Cardiovascular disease’ are confirmed by the literature. For example, piR-hsa-23184 shows 2.26-fold higher expression in metastatic compared to non-metastatic tumors [60]. piR-hsa-5389 is upregulated in cells and post-mortem tissue samples of Parkinson’s disease patients compared with controls [59]. The expression of piR-hsa-25177 in cardiosphere cells is 3.38-fold higher than that in cardiosphere-derived cells [62]. Therefore, iPiDA-SWGCN can effectively detect potential piRNA-disease associations and is suitable for real-world applications. More detailed case study results are shown in Table B in S1 Text.
Conclusion
In this work, we propose a novel predictor named iPiDA-SWGCN for piRNA-disease association prediction by combining the supplementarily weighted strategy and GCN. iPiDA-SWGCN has the following main advantages: (i) Potential piRNA-disease associations are supplemented in the piRNA-disease network by integrating various basic predictors to provide an informative network, based on which GCN can capture deep proximity structure and fully utilize network information for feature learning. (ii) Different confidences are assigned to the piRNA-disease associations instead of Boolean values. Therefore, GCN aggregates node information and accurately learns node representations from neighbor nodes in varying degrees. (iii) Although iPiDA-SWGCN is proposed for predicting piRNA-disease associations, it can be extended to other link prediction tasks.
It is anticipated that the supplementarily weighted adjacency matrix strategy will be applied to other related fields that suffer from limited positive samples, such as lncRNA–disease association detection and drug repositioning. In particular, link prediction tasks often involve a large number of unknown associations, which results in highly sparse networks. The supplementarily weighted strategy can be used to preliminarily enrich the association network so as to provide more useful information and improve the prediction performance.
Besides, the value of piRNA-disease associations without independent experimental validation is worth mentioning. PiRNAs are potential biomarkers that may provide new avenues of investigation into the pathogenesis of diseases. However, it should be noted that these associations predicted by iPiDA-SWGCN are only correlations and require experimental validation to confirm their significance. This is especially relevant for piRNA studies due to their high abundance and poorly understood roles in diseases.
Future studies should aim to validate these associations through a combination of biological experiments and bioinformatics analysis to ensure their reliability and accuracy. Doing so would promote a better understanding of the complex mechanisms underlying diseases and support the development of more effective strategies for prevention and treatment.
Supporting information
S1 Text.
Fig A. Parameter analysis of iPiDA-SWGCN. Table A. The performance comparison of basic predictors on . Table B. The top 10 piRNAs associated with different diseases detected by iPiDA-SWGCN.
https://doi.org/10.1371/journal.pcbi.1011242.s001
(DOCX)
References
- 1. Liu P, Dong Y, Gu J, Puthiyakunnon S, Wu Y, Chen X-G. Developmental piRNA profiles of the invasive vector mosquito Aedes albopictus. Parasites & vectors. 2016;9(1):1–15. pmid:27686069
- 2. Seto AG, Kingston RE, Lau NC. The coming of age for Piwi proteins. Molecular cell. 2007;26(5):603–9. pmid:17560367
- 3. Singh G, Roy J, Rout P, Mallick B. Genome-wide profiling of the PIWI-interacting RNA-mRNA regulatory networks in epithelial ovarian cancers. PloS one. 2018;13(1):e0190485. pmid:29320577
- 4. Czech B, Hannon GJ. One loop to rule them all: the ping-pong cycle and piRNA-guided silencing. Trends in biochemical sciences. 2016;41(4):324–37. pmid:26810602
- 5. Han Y-N, Li Y, Xia S-Q, Zhang Y-Y, Zheng J-H, Li W. PIWI proteins and PIWI-interacting RNA: emerging roles in cancer. Cellular Physiology and Biochemistry. 2017;44(1):1–20. pmid:29130960
- 6. Vagin VV, Sigova A, Li C, Seitz H, Gvozdev V, Zamore PD. A distinct small RNA pathway silences selfish genetic elements in the germline. Science. 2006;313(5785):320–4. pmid:16809489
- 7. Chen C, Liu J, Xu G. Overexpression of PIWI proteins in human stage III epithelial ovarian cancer with lymph node metastasis. Cancer Biomarkers. 2013;13(5):315–21. pmid:24440970
- 8. Wang K, Wang T, Gao XQ, Chen XZ, Wang F, Zhou LY. Emerging functions of piwi-interacting RNAs in diseases. Journal of Cellular and Molecular Medicine. 2021;25(11):4893–901. pmid:33942984
- 9. Halajzadeh J, Dana PM, Asemi Z, Mansournia MA, Yousefi B. An insight into the roles of piRNAs and PIWI proteins in the diagnosis and pathogenesis of oral, esophageal, and gastric cancer. Pathology-Research and Practice. 2020;216(10):153112. pmid:32853949
- 10. Tan L, Mai D, Zhang B, Jiang X, Zhang J, Bai R, et al. PIWI-interacting RNA-36712 restrains breast cancer progression and chemoresistance by interaction with SEPW1 pseudogene SEPW1P RNA. Molecular cancer. 2019;18(1):1–15.
- 11. Cheng J, Guo J-M, Xiao B-X, Miao Y, Jiang Z, Zhou H, et al. piRNA, the new non-coding RNA, is aberrantly expressed in human cancer cells. Clinica chimica acta. 2011;412(17–18):1621–5. pmid:21616063
- 12. Liu Y, Dou M, Song X, Dong Y, Liu S, Liu H, et al. The emerging role of the piRNA/piwi complex in cancer. Molecular cancer. 2019;18(1):1–15.
- 13. Ray SK, Mukherjee S. Piwi-interacting RNAs (piRNAs) and Colorectal Carcinoma: Emerging Non-invasive diagnostic Biomarkers with Potential Therapeutic Target Based Clinical Implications. Current Molecular Medicine. 2022.
- 14. Zhang W, Li Z, Guo W, Yang W, Huang F. A fast linear neighborhood similarity-based network link inference method to predict microRNA-disease associations. IEEE/ACM transactions on computational biology and bioinformatics. 2019;18(2):405–15.
- 15. Chen Q, Zhe Z, Lan W, Zhang R, Wang Z, Luo C, et al. Identifying miRNA-disease association based on integrating miRNA topological similarity and functional similarity. Quantitative Biology. 2019;7:202–9.
- 16. Ha J, Park C, Park C, Park S. Improved prediction of miRNA-disease associations based on matrix completion with network regularization. Cells. 2020;9(4):881. pmid:32260218
- 17. Yao D, Zhan X, Zhan X, Kwoh CK, Li P, Wang J. A random forest based computational model for predicting novel lncRNA-disease associations. BMC bioinformatics. 2020;21:1–18.
- 18. Ji C, Gao Z, Ma X, Wu Q, Ni J, Zheng C. AEMDA: inferring miRNA–disease associations based on deep autoencoder. Bioinformatics. 2021;37(1):66–72. pmid:32726399
- 19. Wei H, Xu Y, Liu B. iPiDi-PUL: identifying Piwi-interacting RNA-disease associations based on positive unlabeled learning. Briefings in Bioinformatics. 2021;22(3):bbaa058. pmid:32393982
- 20. Wei H, Ding Y, Liu B. iPiDA-sHN: Identification of Piwi-interacting RNA-disease associations by selecting high quality negative samples. Computational Biology and Chemistry. 2020;88:107361. pmid:32916452
- 21. Zheng K, You Z-H, Wang L, Li H-Y, Ji B-Y, editors. Predicting human disease-associated piRNAs based on multi-source information and random forest. International Conference on Intelligent Computing; 2020: Springer.
- 22. Qian Y, He Q, Deng L, editors. iPiDA-GBNN: Identification of Piwi-interacting RNA-disease associations based on gradient boosting neural network. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2021: IEEE.
- 23. Ji B, Luo J, Pan L, Xie X, Peng S, editors. DFL-PiDA: Prediction of Piwi-interacting RNA-Disease Associations based on Deep Feature Learning. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2021: IEEE.
- 24. Zhang W, Hou J, Liu B. iPiDA-LTR: Identifying piwi-interacting RNA-disease associations based on Learning to Rank. PLOS Computational Biology. 2022;18(8):e1010404. pmid:35969645
- 25. Hou J, Wei H, Liu B. iPiDA-GCN: Identification of piRNA-disease associations based on Graph Convolutional Network. PLOS Computational Biology. 2022;18(10):e1010671. pmid:36301998
- 26. Muhammad A, Waheed R, Khan NA, Jiang H, Song X. piRDisease v1.0: a manually curated database for piRNA associated diseases. Database. 2019;2019.
- 27. Ning L, Cui T, Zheng B, Wang N, Luo J, Yang B, et al. MNDR v3.0: mammal ncRNA–disease repository with increased coverage and annotation. Nucleic Acids Research. 2021;49(D1):D160–D4. pmid:32833025
- 28. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
- 29. Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems. 2016;29.
- 30. Wang L, You Z-H, Huang Y-A, Huang D-S, Chan KC. An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics. 2020;36(13):4038–46. pmid:31793982
- 31. Wei H, Xu Y, Liu B. iCircDA-LTR: identification of circRNA–disease associations based on Learning to Rank. Bioinformatics. 2021;37(19):3302–10. pmid:33963827
- 32. Wang J, Shi Y, Zhou H, Zhang P, Song T, Ying Z, et al. piRBase: integrating piRNA annotation in all aspects. Nucleic acids research. 2022;50(D1):D265–D72. pmid:34871445
- 33. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of molecular biology. 1981;147(1):195–7. pmid:7265238
- 34. Pearson WR. An introduction to sequence similarity (“homology”) searching. Current protocols in bioinformatics. 2013;42(1):3.1.1–3.1.8. pmid:23749753
- 35. Petti S, Bhattacharya N, Rao R, Dauparas J, Thomas N, Zhou J, et al. End-to-end learning of multiple sequence alignments with differentiable Smith–Waterman. Bioinformatics. 2023;39(1):btac724. pmid:36355460
- 36. You Z-H, Huang Z-A, Zhu Z, Yan G-Y, Li Z-W, Wen Z, et al. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS computational biology. 2017;13(3):e1005455. pmid:28339468
- 37. Yang X, Gao L, Guo X, Shi X, Wu H, Song F, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PloS one. 2014;9(1):e87797. pmid:24498199
- 38. Wang L, You Z-H, Li Y-M, Zheng K, Huang Y-A. GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLOS Computational Biology. 2020;16(5):e1007568. pmid:32433655
- 39. Chen X, Wang L, Qu J, Guan N-N, Li J-Q. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65. pmid:29939227
- 40. Schriml LM, Arze C, Nadendla S, Chang Y-WW, Mazaitis M, Felix V, et al. Disease Ontology: a backbone for disease semantic integration. Nucleic acids research. 2012;40(D1):D940–D6. pmid:22080554
- 41. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81. pmid:17344234
- 42. Kibbe WA, Arze C, Felix V, Mitraka E, Bolton E, Fu G, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic acids research. 2015;43(D1):D1071–D8. pmid:25348409
- 43. Yu G, Wang L-G, Yan G-R, He Q-Y. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics. 2015;31(4):608–9. pmid:25677125
- 44. Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50. pmid:20439255
- 45. Tong H, Faloutsos C, Pan J-Y, editors. Fast random walk with restart and its applications. Sixth international conference on data mining (ICDM’06); 2006: IEEE.
- 46. Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
- 47. Noble WS. Support vector machine applications in computational biology. Kernel methods in computational biology. 2004;14:71–92.
- 48. Friedman JH. Stochastic gradient boosting. Computational statistics & data analysis. 2002;38(4):367–78.
- 49. Li J, Zhang S, Liu T, Ning C, Zhang Z, Zhou W. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36(8):2538–46. pmid:31904845
- 50. Rong Y, Huang W, Xu T, Huang J. Dropedge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903. 2019.
- 51. Ioffe S, Szegedy C, editors. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning; 2015: PMLR.
- 52. Han P, Yang P, Zhao P, Shang S, Liu Y, Zhou J, et al. GCN-MF: Disease-Gene Association Identification By Graph Convolutional Networks and Matrix Factorization. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Anchorage, AK, USA: Association for Computing Machinery; 2019. p. 705–13.
- 53. Flach P, Kull M. Precision-recall-gain curves: PR analysis done right. Advances in neural information processing systems. 2015;28.
- 54. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. pmid:7063747
- 55. Hasanin T, Khoshgoftaar TM, Leevy JL, editors. A comparison of performance metrics with severely imbalanced network security big data. 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI); 2019: IEEE.
- 56. Yi H-C, You Z-H, Huang D-S, Guo Z-H, Chan KC, Li Y. Learning representations to predict intermolecular interactions on large-scale heterogeneous molecular association network. Iscience. 2020;23(7):101261. pmid:32580123
- 57. Ali SD, Tayara H, Chong KT. Identification of piRNA disease associations using deep learning. Computational and Structural Biotechnology Journal. 2022;20:1208–17. pmid:35317234
- 58. Kakkar AK, Dahiya N. Management of Parkinson’s disease: Current and future pharmacotherapy. European journal of pharmacology. 2015;750:74–81.
- 59. Schulze M, Sommer A, Plötz S, Farrell M, Winner B, Grosch J, et al. Sporadic Parkinson’s disease derived neuronal cells show disease-specific mRNA and small RNA signatures with abundant deregulation of piRNAs. Acta neuropathologica communications. 2018;6(1):1–18.
- 60. Busch J, Ralla B, Jung M, Wotschofsky Z, Trujillo-Arribas E, Schwabe P, et al. Piwi-interacting RNAs as novel prognostic markers in clear cell renal cell carcinomas. Journal of experimental & clinical cancer research. 2015;34(1):1–11. pmid:26071182
- 61. Bastian M, Heymann S, Jacomy M, editors. Gephi: an open source software for exploring and manipulating networks. Proceedings of the international AAAI conference on web and social media; 2009.
- 62. Vella S, Gallo A, Nigro AL, Galvagno D, Raffa GM, Pilato M, et al. PIWI-interacting RNA (piRNA) signatures in human cardiac progenitor cells. The international journal of biochemistry & cell biology. 2016;76:1–11. pmid:27131603