Abstract
This paper presents evidence on the convergence of controlled snowball sampling iterations applied to collecting seminal papers in a selected research domain. The iterations start with seed paper selection, plain snowball sampling and probabilistic topic modelling; then greedy controlled snowball sampling and analysis of the collected citation network are performed in rotation until the list of seminal papers becomes stable. The topic model is built on word-word co-occurrence probabilities using a combination of sparse symmetric nonnegative matrix factorization and principal component approximation. Experiments show that the number of topics in the model is determined in a natural way and that the Kullback-Leibler (KL) divergence provides an upper bound on the cosine similarity calculated from keywords assigned by publication authors. Several citation networks are collected and analysed. The analysis shows that all the networks are “small worlds”, and therefore the observed saturation of the controlled snowball sampling can provide the complete set of publications in the domains of interest. Experiments with KL-divergence, symmetric KL-divergence and Jensen-Shannon divergence show that KL-divergence produces a less connected citation network but provides better convergence of the snowball iterations. Multiple runs of the sampling confirm the hypothesis that the set of seminal publications is stable with respect to variations of the seed papers. The modified main path analysis makes it possible to identify the seminal papers, including new publications that follow the main stream of research. Different ranking criteria are compared: Search Path Count provides better lists of seminal papers than citation index, PageRank and indegree.
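For readers unfamiliar with the three divergences compared in the abstract, the following is a minimal sketch of their standard textbook definitions for discrete probability distributions. It is not the authors' implementation: the function names are hypothetical, natural logarithms are assumed, and the symmetric KL-divergence is taken here as the sum of the two directed divergences (some texts use the average instead).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def symmetric_kl(p, q):
    """Symmetrised KL, assumed here to be KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions of two abstracts over three topics.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))   # finite: q covers every topic that p uses
print(kl_divergence(q, p))   # infinite: p assigns zero mass to the third topic
print(jensen_shannon(p, q))  # always finite and symmetric
```

The example illustrates why the choice matters for a sampling filter: KL-divergence is asymmetric and can be infinite, whereas the Jensen-Shannon divergence is symmetric and bounded above by ln 2.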
Notes
1. The presented work is an extended version of [12], which is publicly available online at http://ceur-ws.org/Vol-2105/10000179.pdf.
2. Google Scholar, https://scholar.google.com.ua/.
3. Microsoft Academic, https://academic.microsoft.com/.
4. Semantic Scholar, https://www.semanticscholar.org/.
5. NetworkX, https://networkx.github.io.
References
Ahad, A., Fayaz, M., Shah, A.S.: Navigation through citation network based on content similarity using cosine similarity algorithm. Int. J. Database Theory Appl. 9(5), 9–20 (2016)
Akavipat, R., Wu, L.S., Menczer, F., Maguitman, A.G.: Emerging semantic communities in peer web search. In: Proceedings of the International Workshop on Information Retrieval in Peer-to-Peer Networks, pp. 1–8. ACM (2006)
Baez, M., Mirylenka, D., Parra, C.: Understanding and supporting search for scholarly knowledge. In: Proceedings of the 7th European Computer Science Summit, pp. 1–8 (2011)
Barabási, A.L.: Scale-free networks: a decade and beyond. Science 325(5939), 412–413 (2009)
Barbosa, M.W., Costa, M.M., Almeida, J.M., Almeida, V.A.: Using locality of reference to improve performance of peer-to-peer applications. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 216–227. ACM (2004)
Batagelj, V.: Efficient algorithms for citation network analysis. arXiv preprint cs/0309023 (2003)
Batagelj, V., Mrvar, A.: Pajek-program for large network analysis. Connections 21(2), 47–57 (1998)
Beel, J., Gipp, B., Langer, S., Breitinger, C.: Paper recommender systems: a literature survey. Int. J. Digit. Librar. 17(4), 305–338 (2016)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. In: Proceedings 22nd International Conference on Distributed Computing Systems, pp. 23–32. IEEE (2002)
De Bruijn, N.G.: Asymptotic Methods in Analysis, vol. 4. Courier Corporation, Chelmsford (1981)
Dobrovolskyi, H., Keberle, N.: Collecting the seminal scientific abstracts with topic modelling, snowball sampling and citation analysis. In: Proceedings of the 14th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer. Volume I: Main Conference, vol. 2105, pp. 179–192. CEUR-WS (2018)
Dobrovolskyi, H., Keberle, N., Todoriko, O.: Probabilistic topic modelling for controlled snowball sampling in citation network collection. In: Różewski, P., Lange, C. (eds.) KESW 2017. CCIS, vol. 786, pp. 85–100. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69548-8_7
Dong, R., Tokarchuk, L., Ma, A.: Digging friendship: paper recommendation in social network. In: Proceedings of Networking and Electronic Commerce Research Conference, NAEC 2009, pp. 21–28 (2009)
Doulamis, N.D., Karamolegkos, P.N., Doulamis, A., Nikolakopoulos, I.: Exploiting semantic proximities for content search over P2P networks. Comput. Commun. 32(5), 814–827 (2009)
Endres, D.M., Schindelin, J.E.: A new metric for probability distributions. IEEE Trans. Inf. Theory (2003)
Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontologies of time: review and trends. Int. J. Comput. Sci. Appl. 11(3) (2014)
Even, S.: Graph Algorithms. Cambridge University Press, Cambridge (2011)
Golumbic, M.C.: Algorithmic Graph Theory and Perfect Graphs, vol. 57. Elsevier, Amsterdam (2004)
Gori, M., Pucci, A.: Research paper recommender systems: a random-walk based approach. In: IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 778–781. IEEE (2006)
Hamilton, D.P., et al.: Publishing by–and for?–the numbers. Science 250(4986), 1331–1332 (1990)
Huang, Z., Chung, W., Ong, T.H., Chen, H.: A graph-based recommender system for digital library. In: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 65–73. ACM (2002)
Küçüktunç, O., Saule, E., Kaya, K., Çatalyürek, Ü.V.: Recommendation on academic networks using direction aware citation analysis. arXiv preprint arXiv:1205.1143 (2012)
Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)
Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snowball sampling and citation network analysis (2012)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)
Liang, Y., Li, Q., Qian, T.: Finding relevant papers based on citation relations. In: Wang, H., Li, S., Oyama, S., Hu, X., Qian, T. (eds.) WAIM 2011. LNCS, vol. 6897, pp. 403–414. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23535-1_35
Lops, P., de Gemmis, M., Semeraro, G.: Content-based recommender systems: state of the art and trends. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 73–105. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-85820-3_3
Lucio-Arias, D., Leydesdorff, L.: Main-path analysis and path-dependent transitions in HistCite™-based historiograms. J. Assoc. Inf. Sci. Technol. 59(12), 1948–1962 (2008)
MacKay, D.J.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Mendenhall, W.M., Sincich, T.L., Boudreau, N.S.: Statistics for Engineering and the Sciences, Student Solutions Manual. Chapman and Hall/CRC, Boca Raton (2016)
Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Struct. Algorithms 6(2–3), 161–180 (1995)
Moya-Anegón, F., Vargas-Quesada, B., Herrero-Solana, V., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., Munoz-Fernández, F.: A new technique for building maps of large scientific domains based on the cocitation of classes and categories. Scientometrics 61(1), 129–145 (2004)
Newman, M.E.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98(2), 404–409 (2001)
Newman, M.E.: Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. 101(Suppl. 1), 5200–5205 (2004)
Nicolini, A.L., Lorenzetti, C.M., Maguitman, A.G., Chesñevar, C.I.: Intelligent algorithms for improving communication patterns in thematic P2P search. Inf. Proces. Manag. 53(2), 388–404 (2017)
Nikulin, M.S.: Hellinger distance. In: Encyclopedia of Mathematics, vol. 78 (2001)
Osborne, F., Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 408–424. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25007-6_24
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, vol. 14, pp. 1532–1543 (2014)
Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. Health Psychol. Pract. 150–179 (2004)
Pohl, S., Radlinski, F., Joachims, T.: Recommending related papers based on digital library access records. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 417–418. ACM (2007)
Ráez, A.M., López, L.A.U., Steinberger, R.: Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds.) EsTAL 2004. LNCS (LNAI), vol. 3230, pp. 1–12. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30228-5_1
Ricci, F., Rokach, L., Shapira, B.: Recommender systems: introduction and challenges. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 1–34. Springer, Boston, MA (2015). https://doi.org/10.1007/978-1-4899-7637-6_1
Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociol. Methodol. 34(1), 193–240 (2004)
Small, H.: Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24(4), 265–269 (1973)
de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515 (1965)
Tan, P.N., et al.: Introduction to Data Mining. Pearson Education India, Delhi (2007)
Trudeau, R.J.: Introduction to Graph Theory. Courier Corporation, Chelmsford (2013)
Valenzuela, M., Ha, V., Etzioni, O.: Identifying meaningful citations. In: AAAI Workshop: Scholarly Big Data (2015)
Varela, A.R., et al.: Mapping the historical development of physical activity and health research: a structured literature review and citation network analysis. Prev. Med. 111, 466–472 (2018)
Vellino, A.: Usage-based vs. citation-based methods for recommending scholarly research articles. arXiv preprint arXiv:1303.7149 (2013)
Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: additive regularization for stochastic matrix factorization. In: Ignatov, D.I., Khachay, M.Y., Panchenko, A., Konstantinova, N., Yavorskiy, R.E. (eds.) AIST 2014. CCIS, vol. 436, pp. 29–46. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12580-0_3
Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440 (1998)
Woodruff, A., Gossweiler, R., Pitkow, J., Chi, E.H., Card, S.K.: Enhancing a digital book with a reading recommender. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 153–160. ACM (2000)
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)
Zeinalipour-Yazti, D., Kalogeraki, V., Gunopulos, D.: Information retrieval techniques for peer-to-peer networks. Comput. Sci. Eng. 6(4), 20–26 (2004)
Zeinalipour-Yazti, D., Kalogeraki, V., Gunopulos, D.: Exploiting locality for scalable information retrieval in peer-to-peer networks. Inf. Syst. 30(4), 277–298 (2005)
Zhou, D., et al.: Learning multiple graphs for document recommendations. In: Proceedings of the 17th International Conference on World Wide Web, pp. 141–150. ACM (2008)
Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)
Acknowledgements
The authors would like to express their gratitude to the anonymous reviewers, whose comments and suggestions helped improve the paper.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Dobrovolskyi, H., Keberle, N. (2019). On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection. In: Ermolayev, V., Suárez-Figueroa, M., Yakovyna, V., Mayr, H., Nikitchenko, M., Spivakovsky, A. (eds) Information and Communication Technologies in Education, Research, and Industrial Applications. ICTERI 2018. Communications in Computer and Information Science, vol 1007. Springer, Cham. https://doi.org/10.1007/978-3-030-13929-2_2
DOI: https://doi.org/10.1007/978-3-030-13929-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13928-5
Online ISBN: 978-3-030-13929-2
eBook Packages: Computer Science, Computer Science (R0)