Abstract
Process monitoring involves tracking a system’s behaviors, evaluating the current state of the system, and discovering interesting events that require immediate actions. In this paper, we consider monitoring temporal system state sequences to help detect the changes of dynamic systems, check the divergence of the system development, and evaluate the significance of the deviation. We begin with discussions of data reduction, symbolic data representation, and anomaly detection in temporal discrete sequences. Time-series representation methods are also discussed and used in this paper to discretize raw data into sequences of system states. Markov chains and stationary-state distributions are continuously generated from temporal sequences to represent snapshots of the system dynamics in different time frames. We use generalized Jensen–Shannon divergence as the measure to monitor changes of the stationary symbol probability distributions and evaluate the significance of system deviations. We prove that the proposed approach is able to detect deviations of the systems we monitor and assess the deviation significance in probabilistic manner.
Similar content being viewed by others
References
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58
Willsky A, Jones H (1976) A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans Automat Control 21(1):108–112
Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application, vol 104, Prentice Hall, Englewood Cliffs
Gustafsson F (1996) The marginalized likelihood ratio test for detecting abrupt changes. IEEE Trans Automat Control 41(1):66–78
Yamanishi K, Takeuchi J (2002) A unifying framework for detecting outliers and change points from non-stationary time series data. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 676–681
Kawahara Y, Sugiyama M (2009) Change-point detection in time-series data by direct density-ratio estimation. In: Proceedings of 2009 SIAM International Conference on Data Mining (SDM2009), pp 389–400
Montgomery DC (2008) Introduction to statistical quality control, 6th edn. Wiley, New York
Fodor IK (2002) A survey of dimension reduction techniques. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Los Alamos
Daw CS, Finney CEA, Tracy ER (2003) A review of symbolic analysis of experimental data. Review of Scientific Instruments 74(2):915–930
Bergroth L, Hakonen H, Raita T (2000) A survey of longest common subsequence algorithms. In: String processing and information retrieval, 2000. SPIRE 2000 Proceedings. Seventh international symposium on, pp 39 –48
Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Bhattacharyya A (1946) On a measure of divergence between two multinomial populations Sankhyā. Indian J Stat (1933–1960) 7(4):401–406
Lin J (1991) Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory 37:145–151
“Markov chain,” Wikipedia, the free encyclopedia, 09 Nov 2014. http://en.wikipedia.org/w/index.php?title=Markov_chain&oldid=632678042
Langville AN, Meyer CD (2006) Google PageRank and Beyond. Princeton University Press, Princeton
Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, Stanley HE (2002) Analysis of symbolic sequences using the Jensen-Shannon divergence. Phys Rev E 65(4):041905
Keogh E, Kasetty S (2003) On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Min Knowl Discov 7(4):349–371
Jolliffe IT (2002) Principal component analysis. Springer, New York
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Rec 23(2):419–429
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3(3):263–286
Keogh EJ, Pazzani MJ (2000) Scaling up dynamic time warping for datamining applications. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, Boston, pp 285–289
Agrawal R, Psaila G, Wimmers EL, Zaït M (1995) Querying shapes of histories. In: Proceedings of the 21th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc, pp 502–514
Keogh E, Lin J, Fu A (2005) Hot sax: efficiently finding the most unusual time series subsequence. In: Data mining, IEEE international conference on, Los Alamitos, pp 226–233
Robin S, Rodolphe F, Schbath S (2005) DNA, words and models. Cambridge University Press, Cambridge
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery, p 11
Shieh J, Keogh E (2008) iSAX: indexing and mining terabyte sized time series. In: Proceeding of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, pp 623–631
Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144
Van Der Aalst WMP (2011) Process mining: discovery, conformance and enhancement of business processes. Springer, New York
Koller D, Friedman N (2009) Probabilistic graphical models. MIT press, Cambridge
Cassandras CG, Lafortune S (2008) Introduction to discrete event systems, 2nd edn. Springer, New York
Arvey AJ, Azad RK, Raval A, Lawrence JG (2009) Detection of genomic islands via segmental genome heterogeneity. Nucleic Acids Res 37(16):5255–5266
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117
Majtey AP, Lamberti PW, Prato DP (2005) Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states. Phys Rev A 72(5):052310
Lamberti PW, Majtey AP, Borras A, Casas M, Plastino A (2008) Metric character of the quantum Jensen-Shannon divergence. Phys Rev A 77(5):052311
Patil GP (2002) Weighted distributions. John Wiley & Sons, Ltd
Herzel H, Große I (1997) Correlations in DNA sequences: the role of protein coding segments. Phys Rev E 55(1):800–810
R Core Team (2014) R: a Language and environment for statistical computing, R Foundation for Statistical Computing, Vienna. http://www.R-project.org/
Y. Kang, Supplemental contents for process monitoring using maximum sequence divergence. http://ykang.info/ProcessMon/index.html
Lake JA (1994) Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci US Am 91(4):1455–1459
Budalakoti S, Budalakoti S, Srivastava AN, Otey ME, Otey ME (2009) Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. IEEE Trans Syst Man Cybern Part C 39(1):101–113
Jiang T, Li M (1995) On the approximation of shortest common supersequences and longest common subsequences. SIAM J Compu 24(5):1122–1139
Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications, vol 10. Chapman & Hall/CRC, Boca Raton
Durbin R (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
“UCI Machine Learning Repository,” Ozone level detection data set. http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection. Accessed 20 Dec 2013]
Zhang K, Fan W (2008) Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowl Inf Syst 14(3):299–326
Chandola V, Banerjee A, Kumar V (2012) Anomaly detection for discrete sequences: a survey. IEEE Trans Knowl Data Eng 24(5):823–839
Acknowledgments
We would like to thank editor and anonymous reviewers for their time and feedback that help improve this paper. We would also like to thank Dr. Tatjana Welzer in the Institute of Informatics, University of Maribor, for her advice on data integration and analysis.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Kang, Y., Zadorozhny, V. Process monitoring using maximum sequence divergence. Knowl Inf Syst 48, 81–109 (2016). https://doi.org/10.1007/s10115-015-0858-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-015-0858-z