Abstract
This paper introduces a novel probabilistic activity modeling approach that mines recurrent sequential patterns called motifs from documents given as word \(\times \) time count matrices (e.g., videos). In this model, documents are represented as a mixture of sequential activity patterns (our motifs) where the mixing weights are defined by the motif starting time occurrences. The novelties are multi fold. First, unlike previous approaches where topics modeled only the co-occurrence of words at a given time instant, our motifs model the co-occurrence and temporal order in which the words occur within a temporal window. Second, unlike traditional Dynamic Bayesian networks (DBN), our model accounts for the important case where activities occur concurrently in the video (but not necessarily in synchrony), i.e., the advent of activity motifs can overlap. The learning of the motifs in these difficult situations is made possible thanks to the introduction of latent variables representing the activity starting times, enabling us to implicitly align the occurrences of the same pattern during the joint inference of the motifs and their starting times. As a third novelty, we propose a general method that favors the recovery of sparse distributions, a highly desirable property in many topic model applications, by adding simple regularization constraints on the searched distributions to the data likelihood optimization criteria. We substantiate our claims with experiments on synthetic data to demonstrate the algorithm behavior, and on four video datasets with significant variations in their activity content obtained from static cameras. We observe that using low-level motion features from videos, our algorithm is able to capture sequential patterns that implicitly represent typical trajectories of scene objects.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
The starting time \({t_s}\) can range over different intervals, depending on hypotheses. In the experiments, we assumed that all words generated by a motif starting at time \({t_s}\) occur within a temporal document; hence \({t_s}\) takes values between 1 and \({T_{ds}}\), where \({T_{ds}}= {T_d}-{T_z}+1\). However, we can also assume that motifs are partially observed (beginning or end are missing). In this case \({t_s}\) ranges between \(2-{T_z}\) and \({T_d}\).
Note that \(P({ d}) \propto {N_d}\) and is thus not unknown.
The correspondence between the ground truth motifs and the estimated ones is made by optimizing the normalized cross-correlation measure between the learned motifs \(\hat{P}({{t_r}},{ w}|{ z})\) and the true motifs \(P({{t_r}},{ w}|{ z})\).
Note that the topic distributions contain more information than the location probability: for each location, we know what types of motion are present as well. This explains the location overlap between several topics, e.g., between those of Fig. 13b, e, which have different dominant motion directions.
Note that we did not set any prior on the motif occurrences within the document, i.e., we set \({{\alpha _{z,d}}}=0\).
In the the additional material an exhaustive set of results are provided with motifs rendered in animated-GIF.
Waiting activities are characterized by the same word(s) repeated over time in the motif. Thus the successive time color-coded images overwrite the previous ones in the collapsed representation as explained in Sect.. 5.3, leaving visible only the last (orange, red) time instant.
Note however that longer motifs increase the chance of observing some random co-occurrences, as the amount of overlap with other activities, potentially unexplained by current motifs, increases as well. This is particularly true when the amount of data is not very large like in the Traffic junction case.
To perform the temporal association we allowed a constant offset (either \(-\)1, 0 or 1) between the annotated instants of the event in the ground truth and the starting time of a learned motif.
Note that rather than simply using \(P({ z})\) as the prior for a topic to start at time \(t\), we could have further exploited the past informations available in the past motif occurrences \(\hat{P}({t_s},{ z}|{ d})\) (e.g., the motif of Fig. 15e is often followed by that of Fig. 15d several seconds later). However, as this is not part of our model, we preferred to go for the simpler case.
By global behavior state, we mean a state that control the set of activities that can occur in the whole scene i.e., at a global scene level. For example, in the MIT video, the traffic lights turning red or green determines which activities can occur in the scene as a whole (vehicles will stop in one side of the road, while stopped vehicles start to move on the other side).
Also taking into account that probabilities need to be positive. As a technical detail, a minimum value \(\varepsilon \) is required for \(p({t_s}|{ z},{ d})\) to avoid undefined log-likelihoods in the objective function.
Abbreviations
- \(\mathcal{D }\) :
-
Dataset, count matrices of the form \({ n}({ w},{{t_a}},{ d})\)
- \({ z}\) :
-
Motif index
- \({ w}\) :
-
Word index (SLA patterns for real data. See Sect. 5.)
- \({ d}\) :
-
Temporal document index
- \({{t_a}}\) :
-
Absolute time index in temporal documents
- \({t_s}\) :
-
Start time of a motif
- \({{t_r}}\) :
-
Relative time from the start of the motif
- \(\varTheta \) :
-
Model parameters {\(P({ z}|{ d}), P({t_s}|{ z},{ d}), P({ w},{{t_r}}|{ z})\)}
- \({N_z}\) :
-
Number of motifs
- \({N_w}\) :
-
Vocabulary size (number of different words)
- \({ D}\) :
-
Number of temporal documents
- \({T_z}\) :
-
Maximum duration of a motif
- \({T_d}\) :
-
Duration of the temporal document
- \({T_{ds}}\) :
-
Number of motif start time indices in a temporal document
- \(\lambda _{z,d}\) :
-
Sparsity constraint weight
- \(\lambda _{bic}\) :
-
Penalty term weight in BIC equation
References
Besnerais, G., Bercher, J., & Demoment, G. (1999). A new look at entropy for solving linear inverse problems. IEEE Transactions on Information Theory, 45(5), 1565–1578.
Blei, D., & Lafferty, J. (2006). A correlated topic model of science. Annals of Applied Statistics, 1(1), 17–35.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In International Conference on Machine Learning (pp. 113–120). New York.
Blei, D. M., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Boiman, O., & Irani, M. (2007). Detecting irregularities in images and in video. International Journal of Computer Vision, 74(1), 17–31.
Bradley, D., & Bagnell, J. A. D. (2008). Differentiable sparse coding. In Proceedings of Neural Information Processing Systems (p. 22).
Chen, S. S., & Gopalakrishnan, P. S. (1998). Speaker, environment and channel change detection and clustering via the bayesian information criterion. In DARPA Broadcast News Transcription and Understanding, Workshop (pp. 127–132).
Chien, J. T., & Wu, M. S. (2008). Adaptive bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 198–207.
Emonet, R., Varadarajan, J., & Odobez, J. M. (2011). Extracting and locating temporal motifs in video scenes using a hierarchical non parametric bayesian model. In IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs.
Fablet, R., & Bouthemy, P. (2000). Statistical motion-based object indexing using optic flow field. (Vol. 4). In IEEE International Conference on Pattern Recognition, ICPR.
Faruquie, T. A., Kalra, P. K., & Banerjee, S. (2009). Time based activity inference using latent Dirichlet allocation. In British Machine Vision Conference. London.
Girolami, M., & Kabán, A. (2003). On an equivalence between PLSI and LDA. In ACM SIGIR Conference on Research and Development in, Information Retrieval (pp. 433–434).
Gohr, A., Hinneburg, A., Schult, R., & Spiliopoulou, M. (2009). Topic evolution in a stream of documents. In SIAM International Conference on Data Mining (pp. 859–870).
Gruber, A., Rosen-Zvi, M., & Weiss, Y. (2007). Hidden topic Markov model. In International Conference on Artificial Intelligence and Statistics. San Juan, Puerto Rico.
Hervieu, A., Bouthemy, P., & Cadre, J. P. L. (2008). A statistical video content recognition method using invariant features on object trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 18(11), 1533–1543.
Hofmann, T. (2001). Unsupervised learning by probability latent semantic analysis. Machine Learning, 42, 177–196.
Hospedales, T., Gong, S., & Xiang, T. (2009). A Markov clustering topic model for mining behavior in video. In IEEE International Conference on Computer Vision. Kyoto.
Hospedales, T., Gong, S., & Xiang, T. (2012). Video behaviour mining using a dynamic topic model. International Journal of Computer Vision, 98(3), 303–323.
Hoyer, P. O. (2005). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(2), 1457–1470.
Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S. (2000). Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 263–286.
Kuettel, D., Breitenstein, M. D., Gool, L. V., & Ferrari, V. (2010). What’s going on? discovering spatio-temporal dependencies in dynamic scenes. In IEEE Conference on Computer Vision and, Pattern Recognition (pp. 1951–1958).
Li, J., Gong, S., & Xiang, T. (2008). Global behaviour inference using probabilistic latent semantic analysis. In British Machine Vision Conference.
Li, J., Gong, S., & Xiang, T. (2009). Discovering multi-camera behaviour correlations for on-the-fly global activity prediction and anomaly detection. In IEEE International Workshop on Visual Surveillance. Kyoto.
Luvison, B., Chateau, T., Sayed, P., & Pham, Q. C., & Laprest, J. T. (2009). An unsupervised learning based approach for unexpected event detection (pp. 506–513). International Conference on Computer Vision Theory and Applications (VISAPP), Lisboa : In .
Makris, D., & Ellis, T. (2003). Automatic learning of an activity-based semantic scene model. IEEE International Conference on Advanced Video and Signal Based Surveillance, 2(1), 183–188.
Mueen, A., Keogh, E., Zhu, Q., Cash, S., & Westover, B. (2009). Exact discovery of time series motifs. In SIAM International Conference on Data Mining (pp. 473–484).
Niebles, J. C., Wang, H., & Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3), 299–318.
Quelhas, P., Monay, F., Odobez, J. M., Gatica-perez, D., & Tuytelaars, T. (2007). A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9), 1575–89.
Saleemi, I., Hartung, L., & Shah, M. (2010). Scene understanding by statistical modeling of motion patterns. In IEEE Conference on Computer Vision and Pattern Recognition.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., & Freeman, W. T. (2005). Discovering object categories in image collections. In IEEE International Conference on Computer Vision.
Stauffer, C., & Grimson, E. (2000). Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 747–757.
Tanaka, Y., Iwamoto, K., & Uehara, K. (2005). Discovery of time-series motif from multi-dimensional data based on MDL principle. Machine Learning, 58, 269–300.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Tommasi, C., & Kanade, T. (1991). Detection and tracking of point features. Tech. rep., CMU-CS-91-132.
Tritschler, A., & Gopinath, R. (1999). Improved speaker segmentation and segments clustering using the bayesian information criterion. In Sixth European Conference on Speech Communication and Technology.
Vasconcelos, N., & Lippman, A. (2000). Statistical models of video structure for content analysis and characterization. IEEE Transactions on Image Processing, 1, 3–19.
Varadarajan, J., & Odobez, J. (2009). Topic models for scene analysis and abnormality detection. In IEEE International Workshop on Visual Surveillance. Kyoto.
Varadarajan, J., Emonet, R., & Odobez, J. (2010). Probabilistic latent sequential motifs: Discovering temporal activity patterns in video scenes. In British Machine Vision Conference (pp. 117.1–117.11). Aberystwyth.
Varadarajan, J., Emonet, R., & Odobez, J. M. (2010). A sparsity constraint for topic models—application to temporal activity mining. In N. I. P. S. Workshop (Ed.), Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
Varadarajan, J., Emonet, R., & Odobez, J. (2012). Bridging the past, present and future; modeling scene activities from event relationships and global rules. In IEEE Conference on Computer Vision and Pattern Recognition.
Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. International Conference on Machine Learning (pp. 977–984). Pittsburgh, PA.
Wang, C., & Blei, D. (2009). Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In Neural Information Processing Systems (pp. 1982–1989).
Wang, C., Blei, D. M., & Heckerman, D. (2008). Continuous time dynamic topic models. In Conference on Uncertainty in Artificial Intelligence.
Wang, X., Ma, X., & Grimson, E. L. (2009). Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), 539–555.
Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. Philadelphia: ACM Conference Knowledge Discovery and Data Mining.
Wang, X., Tieu, K., Grimson, E. L. (2004). Learning semantic scene models by trajectory analysis. In European Conference on Computer Vision, Vol. 14 (pp. 234–778).
Wang, Y., & Mori, G. (2009). Human action recognition by semi-latent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence Special Issue on Probabilistic Graphical Models in Computer Vision, 31(10), 1762–1774.
Williamson, S., Wang, C., Heller, K., & Blei, D. (2009). Focused topic models. In NIPS workshop on Applications for Topic Models: Text and Beyond. Whistler, Canada.
Xiang, T., & Gong, S. (2008). Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 893–908.
Yang, Y., Liu, J., & Shah, M. (2009). Video scene understanding using multi-scale analysis. In IEEE International Conference on Computer Vision. Kyoto.
Yao, J., & Odobez, J. M. (2007). Multi-layer background subtraction based on color and texture. In IEEE CVPR International Workshop on Visual Surveillance (pp. 1–8).
Yi, Z., Jeff, S., & Artur, D. (2010). Learning compressible models. In Proceedings of SIAM Data Mining (SDM) Conference.
Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I., & Lathoud, G. (2004). Multimodal group action clustering in meetings. In ACM International Conference on Multimedia, Workshop on Video Surveillance and Sensor, Networks.
Zhong, H., Jianbo, S., & Mirko, V. (2004). Detecting unusual activity in video, Vol. 2. IEEE Conference on Computer Vision and Pattern Recognition (pp. 819–826). Washington, DC.
Acknowledgments
Authors gratefully acknowledge the support of the Swiss National Science Foundation (Project: FNS-198, HAI) and of the European Union through its 7th framework program (Integrated project VANAHEIM (248907, and Network of Excellence PASCAL2).
Author information
Authors and Affiliations
Corresponding author
Appendix: Detailed EM Equation Derivation
Appendix: Detailed EM Equation Derivation
In this section we detail the derivation of equations involved in PLSM inference. Our data \(\mathcal{D } = {{ n}({ w},{{t_a}},{ d})}\) is a matrix of counts, where each triplet \(({ w},{{t_a}},{ d})\) denotes the number of times a word \({ w}\), appears at time \({{t_a}}\) in document \({ d}\). The probability of observing this data is given by:
Taking the log on both sides we obtain our objective log-likelihood function as
Since \(P({ w},{{t_a}},{ d}) = \sum _{{t_s}=1}^{{T_{ds}}} \sum _{{ z}=1}^{{N_z}} P({ w},{{t_a}},{ d},{ z},{t_s})\), the log-likelihood equation conditioned on the model parameters \(\varTheta \) i.e., the probability distributions \(P({ z}|{ d}), P({t_s}|{ z},{ d})\), and \(P({ w},{{t_r}}|{ z})\) is written as
We want to infer the model parameters with a constraint that the distribution on motif starting times \(P({t_s}|{ z},{ d})\) is sparse and peaky. To achieve this goal as said in Sect. 3.4, we want to maximize \(D_{KL}(U||P({t_s}|{ z},{ d}))\), the KL divergence between Uniform distribution and \(P({t_s}|{ z},{ d})\). So the constrained log-likelihood equation is given by:
By removing the constant factor we obtain Eq. 5 given by
But the above equation cannot be solved directly due to summation terms inside the log making it intractable. Instead, the EM approach works by optimizing the expected log-likelihood of the complete data with respect to the hidden variables keeping the constraint unchanged, which gives
Notice that now, the log operates over the joint probability over all the variables and not just the observed variables. An optimized way to compute this is by using Eq. 1 and indices \({{t_r}}\) and \({t_s}\) instead of \({{t_a}}\). The joint distribution can be split into its constituent distributions using Eq. 2 So, the expected log-likelihood equation is re-written as:
The goal is thus to optimize this expression with constraints so that the distributions sum to one. Therefore such constraints are enforced using lagrangian multipliers. Finally, the constrained objective function that is optimized is given by:
where \(\gamma _{{ z}}, \delta _{{ z},{ d}}\) and \(\tau _{{ d}}\) are the lagrangian multipliers. The EM algorithm works by iterating through the following E-step and M-step:
E-step: In the Expectation step, the posterior distribution of hidden variables \(({ z},{t_s})\) is computed where the parameters come from the previous iteration’s M-step,
M-step: In the Maximization step, maximizing \(\mathcal{H }(\varTheta )\) w.r.t to the parameters which are the probability mass functions results in the following set of equations.
by eliminating the lagrangian multipliers Footnote 12, we obtain the following expressions that were presented in (Eqs. 8–10)
Rights and permissions
About this article
Cite this article
Varadarajan, J., Emonet, R. & Odobez, JM. A Sequential Topic Model for Mining Recurrent Activities from Long Term Video Logs. Int J Comput Vis 103, 100–126 (2013). https://doi.org/10.1007/s11263-012-0596-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-012-0596-6