Abstract
Outlier detection is a crucial task for network data analysis, which identifies abnormal entities that deviate from the rest of the dataset. Ranking in outlierness is often used for identifying abnormal nodes in directed citation networks containing citation relationship among nodes. A challenging issue in outlier ranking is how to leverage the rich graph data of complex citation networks. In this paper, we propose a cluster-based outlier score function to identify outliers in citation networks based on nonnegative matrix factorization (NMF). We first represent the citation data as a directed graph, and cluster the directed graph into logical groupings of nodes using NMF. Based on the clustering results, we obtain the outlier score and ranking for each node using the proposed outlier scoring function. The proposed method leverages the direct and indirect citation links between nodes to measure the graph-based outlierness. We validate the proposed outlier ranking method using small artificial dataset and the real-world U.S. patent data.
Similar content being viewed by others
References
Agreste, S., De Meo, P., Ferrara, E., Piccolo, S., & Provetti, A. (2015). Analysis of a heterogeneous social network of humans and cultural objects. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(4), 559–570.
Akoglu, L., McGlohon, M., & Faloutsos, C. (2010). Oddball: Spotting anomalies in weighted graphs. In Pacific-Asia conference on knowledge discovery and data Mining (pp. 410–421). Berlin: Springer
Banker, R. D., Chang, H., & Zheng, Z. (2017). On the use of super-efficiency procedures for ranking efficient units and identifying outliers. Annals of Operations Research, 250(1), 21–35.
Boutsidis, C., & Gallopoulos, E. (2008). SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4), 1350–1362.
Cao, X., Wang, X., Jin, D., Cao, Y., & He, D. (2013). Identifying overlapping communities as well as hubs and outliers via nonnegative matrix factorization. Scientific Reports, 3, 2993.
Codetta-Raiteri, D., & Portinale, L. (2015). Dynamic bayesian networks for fault detection, identification, and recovery in autonomous spacecraft. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1), 13–24.
Ding, C. H., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. SDM, SIAM, 5, 606–610.
Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1), 151–168.
Džamić, D., Aloise, D., & Mladenović, N. (2017). Ascent-descent variable neighborhood decomposition search for community detection by modularity maximization. Annals of Operations Research, 272, 273–287.
Holder, L. B., & Cook, D. J. (2009). Graph-based data mining. Encyclopedia of data warehousing and mining, 2, 943–949.
Kaffash, S., & Marra, M. (2017). Data envelopment analysis in financial services: A citations network analysis of banks, insurance companies and money market funds. Annals of Operations Research, 253(1), 307–344.
Kang, U., Akoglu, L., & Chau, D. H. P. (2013). Big graph mining: Algorithms, anomaly detection, and applications. Proceedings of the ACM ASONAM, 13, 25–28.
Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (pp. 556–562).
Lu, N., Li, T., Pan, J., Ren, X., Feng, Z., & Miao, H. (2015). Structure constrained semi-nonnegative matrix factorization for EEG-based motor imagery classification. Computers in Biology and Medicine, 60, 32–39.
Ma, Y., Hu, X., He, T., & Jiang, X. (2016). Hessian regularization based symmetric nonnegative matrix factorization for clustering gene expression and microbiome data. Methods, 111, 80–84.
Michel, J., & Bettels, B. (2001). Patent citation analysis. A closer look at the basic input data from patent search reports. Scientometrics, 51(1), 185–201.
Moonesignhe, H., & Tan, P. N. (2006). Outlier detection using random walks. In 2006 18th IEEE international conference on tools with artificial intelligence (ICTAI’06), IEEE (pp. 532–539).
Newman, M. (2010). Networks: An introduction. Oxford: Oxford University Press.
Sun, H,. Huang, J., Han, J., Deng, H., Zhao, P., & Feng, B. (2010). gskeletonclu: Density-based network clustering via structure-connected tree division or agglomeration. In 2010 IEEE International Conference on Data Mining, IEEE (pp. 481–490).
Tong, H., & Lin, C. Y. (2011). Non-negative residual matrix factorization with application to graph anomaly detection. In SDM, SIAM (pp. 143–153).
Tosyali, A., Kim, J., Choi, J., & Jeong, M. K. (2019). Regularized asymmetric nonnegative matrix factorization for clustering in directed networks. Pattern Recognition Letters, 125, 750–757.
Wang, F., Li, T., Wang, X., Zhu, S., & Ding, C. (2011). Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3), 493–521.
Xu, X., Yuruk, N., Feng, Z., & Schweiger, T. A. (2007). Scan: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (pp. 824–833).
Yoon, J., & Kim, K. (2011). Detecting signals of new technological opportunities using semantic patent analysis and outlier detection. Scientometrics, 90(2), 445–461.
Yuan, X., Guo, J., Hao, X., & Chen, H. (2015). Traffic sign detection via graph-based ranking and segmentation algorithms. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(12), 1509–1521.
Zhi, R., Flierl, M., Ruan, Q., & Kleijn, W. B. (2011). Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(1), 38–52.
Zou, Z., Li, J., Gao, H., & Zhang, S. (2010). Mining frequent subgraph patterns from uncertain graph data. IEEE Transactions on Knowledge and Data Engineering, 22(9), 1203–1218.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Proof of proposition 1
Appendix: Proof of proposition 1
Suppose that a graph G is an acyclic directed graph with n nodes, then the length of the longest path in the graph is at most \(n-1\). Let \(\tau \) be the longest path in the graph G. For \(l>\tau \) and for all \(1\le i\), \(j\le n\), \([\mathbf A ]_{ij}^{(l)}=0\) since there is no path with length l from node i to node j. Therefore \(\mathbf A ^l=\mathbf 0 \), for all \(l>\tau \). Let \(\mathbf C =\sum _{l=1}^\infty \beta ^l \mathbf A ^l\), then \(\mathbf C =\sum _{l=1}^\tau \beta ^l \mathbf A ^l\). With \(0<\beta <1\), C can be rewritten as
Rights and permissions
About this article
Cite this article
Tosyali, A., Kim, J., Choi, J. et al. New node anomaly detection algorithm based on nonnegative matrix factorization for directed citation networks. Ann Oper Res 288, 457–474 (2020). https://doi.org/10.1007/s10479-019-03508-4
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10479-019-03508-4