Abstract
An approach to identifying users from deviations in the topic trends of their work with text information is presented. The approach relies on topic analysis of a user's past behavior when working with text content of various categories (including confidential ones) and on a forecast of their future behavior. Topic analysis here means determining the principal topics of the user's text content and calculating the weight of each topic at given instants of time. Deviations of the user's observed behavior from the forecast are then used to identify the user. Within this approach, an original time series forecasting method based on orthogonal non-negative matrix factorization (ONMF) is proposed. ONMF has not previously been used to solve time series forecasting problems. Experiments on real-world corporate email drawn from the Enron data set showed that the proposed user identification approach is applicable.
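The pipeline outlined in the abstract (topic weights per time instant, a forecast, and a deviation score) can be illustrated with a minimal sketch. It is not the authors' method: it assumes plain NMF with Lee and Seung multiplicative updates in place of ONMF, and a naive mean-of-history forecast in place of the proposed ONMF-based forecasting. The function names `nmf`, `topic_weights`, and `deviation`, and the toy term vectors, are hypothetical and chosen only for illustration.

```python
import numpy as np

def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Factor V (terms x documents) as W @ H with non-negative W and H.
    Assumption: plain NMF (Lee-Seung multiplicative updates), not ONMF."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def topic_weights(W, v, iters=300, eps=1e-9):
    """Weights of the fixed topics W in a new term vector v, normalized to sum to 1."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    for _ in range(iters):
        h *= (W.T @ v) / (W.T @ W @ h + eps)
    return h / (h.sum() + eps)

def deviation(history, observed):
    """Distance between observed weights and a naive forecast.
    Assumption: the forecast is the mean of past weights, a stand-in only."""
    forecast = history.mean(axis=0)
    return float(np.linalg.norm(observed - forecast))

# Toy corpus: two term patterns over six terms (hypothetical data).
A = np.array([5, 4, 3, 0, 0, 0], dtype=float)
B = np.array([0, 0, 0, 3, 4, 5], dtype=float)
V = np.column_stack([A, 2 * A, A + 0.1, 3 * A, B, 2 * B, B + 0.1, 3 * B])

W, H = nmf(V, k=2)

# Past time windows of one user, mostly pattern A.
windows = [A, 1.5 * A, A + 0.2 * B, 2 * A]
history = np.array([topic_weights(W, v) for v in windows])

# A new window that matches the user's habits, and one that does not.
same = deviation(history, topic_weights(W, 1.2 * A))
other = deviation(history, topic_weights(W, 2 * B))
print(f"deviation (usual topics): {same:.3f}")
print(f"deviation (unusual topics): {other:.3f}")
```

In this sketch a large deviation suggests that the observed content does not match the user's forecast topic profile. How a threshold on that score would be chosen is not stated in the abstract and is left open here.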
Translated by M. Talacheva
Cite this article
Korolev, V.Y., Korchagin, A.Y., Mashechkin, I.V. et al. Applying Time Series for Background User Identification Based on Their Text Data Analysis. Program Comput Soft 44, 353–362 (2018). https://doi.org/10.1134/S0361768818050055
DOI: https://doi.org/10.1134/S0361768818050055