Abstract
As the number of documents has exploded in the Internet era, many researchers have tried to understand the relationships between documents and to predict links between similar but unconnected documents. However, existing link prediction techniques that rely on the predefined links of documents may produce incorrect results because of the generic problems of citation analysis. Moreover, they may fail to reflect the important content of documents in the link prediction process. We therefore propose a new link prediction approach that employs Doc2vec, a document-embedding algorithm, to predict potential links between documents by reflecting the functional context of technological words. First, we collected both the citation information and the documents of the patents of interest, and generated a patent network from the citation relationships between patents. Second, we identified unconnected pairs of nodes and transformed the patent documents into document vectors with the Doc2vec algorithm. In particular, since patent documents describe functions that are useful for solving technological problems, the proposed approach extracts subject-action-object (SAO) structures and uses them to generate the document vectors. We then calculated the similarity between the patents in each unconnected pair of the patent network and used this similarity to predict potential links. Third, we validated the results of the proposed approach by comparing them with those of the Adamic–Adar technique, one of the traditional link prediction techniques, and of word vector-based link prediction. We applied the Doc2vec-based link prediction approach to a real case, the unmanned aerial vehicle (UAV) technology field, and found that it achieves better prediction performance than the Adamic–Adar technique and the word vector approach. Our results can help analysts accurately forecast future relationships between nodes in a network, and give R&D managers insightful information on the future direction of technological development by using a patent network.
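The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the patent IDs, the citation edges, and the three-dimensional vectors are invented for illustration, and the vectors merely stand in for Doc2vec embeddings of SAO structures. The sketch enumerates the unconnected node pairs of a small citation network, ranks them by cosine similarity of their document vectors, and computes the Adamic–Adar index for the same pairs as a topology-only baseline.

```python
import math
from itertools import combinations

# Hypothetical toy citation network; edges are treated as undirected.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]

# Invented stand-in vectors. In the approach described in the abstract these
# would be Doc2vec embeddings built from SAO structures of each patent.
vectors = {
    "P1": [0.9, 0.1, 0.0],
    "P2": [0.8, 0.2, 0.1],
    "P3": [0.5, 0.5, 0.2],
    "P4": [0.1, 0.9, 0.3],
    "P5": [0.2, 0.8, 0.4],
}


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def unconnected_pairs(adjacency):
    """Return every node pair that has no direct edge."""
    nodes = sorted(adjacency)
    return [(u, v) for u, v in combinations(nodes, 2) if v not in adjacency[u]]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adamic_adar(adjacency, u, v):
    """Sum of 1 / log(degree) over the common neighbours of u and v."""
    common = adjacency[u] & adjacency[v]
    # Degree-1 neighbours are skipped because log(1) = 0.
    return sum(1.0 / math.log(len(adjacency[z]))
               for z in common if len(adjacency[z]) > 1)


adjacency = build_adjacency(edges)
candidates = unconnected_pairs(adjacency)

# Content-based ranking: most similar unconnected pair first.
similarity_ranking = sorted(
    candidates,
    key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]),
    reverse=True,
)

# Topology-based baseline scores for the same candidate pairs.
adamic_adar_scores = {pair: adamic_adar(adjacency, *pair) for pair in candidates}

for pair in similarity_ranking:
    sim = cosine_similarity(vectors[pair[0]], vectors[pair[1]])
    print(pair, round(sim, 3), round(adamic_adar_scores[pair], 3))
```

In a real setting the candidate pairs with the highest similarity would be proposed as potential links, and the two rankings could be compared against links that actually appear later. Pairs without common neighbours get an Adamic–Adar score of zero, which is one way a content-based score can add information that a purely topological index lacks.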
References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230.
Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM06: workshop on link analysis, counter-terrorism and security.
Behrouzi, S., Sarmoor, Z. S., Hajsadeghi, K., & Kavousi, K. (2020). Predicting scientific research trends based on link prediction in keyword networks. Journal of Informetrics, 14(4), 101079.
Chen, D., & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 740–750).
Chen, H., Li, X., & Huang, Z. (2005). Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'05) (pp. 141–142).
Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint http://arxiv.org/abs/arXiv:1507.07998.
Getoor, L. (2003). Link mining: A new data mining challenge. ACM SIGKDD Explorations Newsletter, 5(1), 84–89.
Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3–12.
Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint http://arxiv.org/abs/arXiv:1402.3722.
Guo, J., Wang, X., Li, Q., & Zhu, D. (2016). Subject–action–object-based morphology analysis for determining the direction of technological change. Technological Forecasting and Social Change, 105, 27–40.
Hopcroft, J., Lou, T., & Tang, J. (2011). Who will follow you back?: Reciprocal relationship prediction. In Proceedings of the 20th ACM international conference on information and knowledge management (pp. 1137–1146). ACM.
Huang, Z., Chen, H., & Zeng, D. (2004). Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1), 116–142.
Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, 1, 873–882.
Jeong, B., Ko, N., Son, C., & Yoon, J. (2021). Trademark-based framework to uncover business diversification opportunities: Application of deep link prediction and competitive intelligence analysis. Computers in Industry, 124, 103356.
Kroeger, P. R. (2005). Analyzing grammar: An introduction. Cambridge University Press.
Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint http://arxiv.org/abs/arXiv:1607.05368.
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188–1196).
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems (pp. 2177–2185).
Li, S., Chua, T. S., Zhu, J., & Miao, C. (2016). Generative topic embedding: A continuous representation of documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 666–675).
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031.
Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Liu, W., & Lü, L. (2010). Link prediction based on local random walk. EPL (Europhysics Letters), 89(5), 58007.
Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and Its Applications, 390(6), 1150–1170.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations (pp. 55–60).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint http://arxiv.org/abs/arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Moehrle, M. G., Walter, L., Geritz, A., & Muller, S. (2005). Patent-based inventor profiles as a basis for human resource decisions in research and development. R&D Management, 35(5), 513–524.
Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42–55.
Popescul, A., & Ungar, L. H. (2003, August). Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data (Vol. 2003).
Rajbabu, K., Srinivas, H., & Sudha, S. (2018). Industrial information extraction through multi-phase classification using ontology for unstructured documents. Computers in Industry, 100, 137–147.
Rong, X. (2014). word2vec parameter learning explained. arXiv preprint http://arxiv.org/abs/arXiv:1411.2738.
Sun, H. L., Ch’ng, E., Yong, X., Garibaldi, J. M., See, S., & Chen, D.-B. (2017). An improved game-theoretic approach to uncover overlapping communities. International Journal of Modern Physics C, 28(9), 1750112.
Tang, J., Wu, S., Sun, J., & Su, H. (2012). Cross-domain collaboration recommendation. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1285–129).
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., & Qin, B. (2014). Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1555–1565).
Tang, J., Qu, M., & Mei, Q. (2015, August). PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1165–1174). ACM.
Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (2004). Link prediction in relational data. In Advances in neural information processing systems (pp. 659–666).
Toutanova, K., & Manning, C. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference EMNLP/VLC (pp. 63–71).
Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384–394). Association for Computational Linguistics.
Wu, J., Zhang, G., & Ren, Y. (2017). A balanced modularity maximization link prediction model in social networks. Information Processing & Management, 53(1), 295–307.
Xie, Q., Zhang, X., Ding, Y., & Song, M. (2020). Monolingual and multilingual topic analysis using LDA and BERT embeddings. Journal of Informetrics, 14(3), 101055.
Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H., & Zhang, G. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099–1117.
Acknowledgements
This work was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT under Grant NRF-2017R1D1A1B03036213.
Appendices
Appendix 1: Code for Doc2vec
Appendix 2: Searching query for UAV technology
Cite this article
Yoon, B., Kim, S., Kim, S. et al. Doc2vec-based link prediction approach using SAO structures: application to patent network. Scientometrics 127, 5385–5414 (2022). https://doi.org/10.1007/s11192-021-04187-4