
Applying Time Series for Background User Identification Based on Their Text Data Analysis

Programming and Computer Software

Abstract

We present an approach to user identification based on deviations in the topic trends of a user's work with text information. The approach performs topic analysis of the user's past behavior when working with text content of various categories (including confidential ones) and forecasts their future behavior. Topic analysis here means determining the principal topics of the user's text content and calculating the weight of each topic at given instants of time. Deviations of the user's observed behavior from the forecast are then used to identify the user. Within this approach, we propose an original time series forecasting method based on orthogonal non-negative matrix factorization (ONMF). To our knowledge, ONMF has not previously been applied to time series forecasting. Experiments on real-world corporate email drawn from the Enron data set showed that the proposed user identification approach is applicable.
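The deviation-based identification step can be illustrated with a minimal sketch. It assumes the topic weights for each time window have already been computed. The moving-average forecaster is only a stand-in, since the abstract does not detail the ONMF-based forecasting method. The function names, the Euclidean deviation measure, the threshold value, and the sample weights are all hypothetical.

```python
from math import sqrt


def forecast_next(history, window=3):
    """Forecast the next topic-weight vector as the mean of the last `window` vectors.

    Assumption: a simple moving average used as a placeholder forecaster,
    not the ONMF-based method proposed in the paper.
    """
    recent = history[-window:]
    n_topics = len(recent[0])
    return [sum(vec[k] for vec in recent) / len(recent) for k in range(n_topics)]


def deviation(forecast, observed):
    """Euclidean distance between forecast and observed topic weights (assumed measure)."""
    return sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)))


def is_expected_user(history, observed, threshold=0.3, window=3):
    """Return True if the observed behavior stays within `threshold` of the forecast."""
    return deviation(forecast_next(history, window), observed) <= threshold


# Hypothetical weights of three topics over four consecutive time windows.
history = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.65, 0.25, 0.10],
    [0.70, 0.20, 0.10],
]
same_user = [0.60, 0.30, 0.10]   # close to the user's usual topic profile
other_user = [0.10, 0.20, 0.70]  # a very different topic profile

print(is_expected_user(history, same_user))
print(is_expected_user(history, other_user))
```

In this sketch a new observation is attributed to the user only when its topic weights stay close to the forecast. In practice the threshold would have to be tuned on held-out data for each user.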



Author information


Correspondence to V. Yu. Korolev, A. Yu. Korchagin, I. V. Mashechkin, M. I. Petrovskii or D. V. Tsarev.

Additional information

Translated by M. Talacheva


Cite this article

Korolev, V.Y., Korchagin, A.Y., Mashechkin, I.V. et al. Applying Time Series for Background User Identification Based on Their Text Data Analysis. Program Comput Soft 44, 353–362 (2018). https://doi.org/10.1134/S0361768818050055


  • DOI: https://doi.org/10.1134/S0361768818050055
