A Sequential Topic Model for Mining Recurrent Activities from Long Term Video Logs

Published in: International Journal of Computer Vision

Abstract

This paper introduces a novel probabilistic activity modeling approach that mines recurrent sequential patterns, called motifs, from documents given as word \(\times \) time count matrices (e.g., videos). In this model, documents are represented as a mixture of sequential activity patterns (our motifs), where the mixing weights are defined by the motif starting-time occurrences. The novelties are threefold. First, unlike previous approaches where topics modeled only the co-occurrence of words at a given time instant, our motifs model both the co-occurrence and the temporal order in which the words occur within a temporal window. Second, unlike traditional dynamic Bayesian networks (DBNs), our model accounts for the important case where activities occur concurrently in the video (but not necessarily in synchrony), i.e., the occurrences of activity motifs can overlap. The learning of the motifs in these difficult situations is made possible by the introduction of latent variables representing the activity starting times, enabling us to implicitly align the occurrences of the same pattern during the joint inference of the motifs and their starting times. As a third novelty, we propose a general method that favors the recovery of sparse distributions, a highly desirable property in many topic model applications, by adding simple regularization constraints on the searched distributions to the data likelihood optimization criterion. We substantiate our claims with experiments on synthetic data to demonstrate the algorithm's behavior, and on four video datasets, obtained from static cameras, with significant variations in their activity content. We observe that using low-level motion features from videos, our algorithm is able to capture sequential patterns that implicitly represent typical trajectories of scene objects.




Notes

  1. The starting time \({t_s}\) can range over different intervals, depending on hypotheses. In the experiments, we assumed that all words generated by a motif starting at time \({t_s}\) occur within a temporal document; hence \({t_s}\) takes values between 1 and \({T_{ds}}\), where \({T_{ds}}= {T_d}-{T_z}+1\). However, we can also assume that motifs are partially observed (beginning or end are missing). In this case \({t_s}\) ranges between \(2-{T_z}\) and \({T_d}\).
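The two start-time ranges can be illustrated with arbitrary example values (a sketch; the values of \({T_d}\) and \({T_z}\) are made up for illustration):

```python
# Illustration of footnote 1 with example values: the range of the motif
# starting time ts under the two hypotheses.
T_d, T_z = 10, 4                       # document length, max motif duration

# Fully observed motifs: ts in [1, T_ds] with T_ds = T_d - T_z + 1.
T_ds = T_d - T_z + 1
ts_full = list(range(1, T_ds + 1))

# Partially observed motifs (beginning or end missing): ts in [2 - T_z, T_d].
ts_partial = list(range(2 - T_z, T_d + 1))
```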

  2. Note that \(P({ d}) \propto {N_d}\) and is thus known.

  3. The correspondence between the ground truth motifs and the estimated ones is made by optimizing the normalized cross-correlation measure between the learned motifs \(\hat{P}({{t_r}},{ w}|{ z})\) and the true motifs \(P({{t_r}},{ w}|{ z})\).

  4. Note that the topic distributions contain more information than the location probability: for each location, we know what types of motion are present as well. This explains the location overlap between several topics, e.g., between those of Fig. 13b, e, which have different dominant motion directions.

  5. Note that we did not set any prior on the motif occurrences within the document, i.e., we set \({{\alpha _{z,d}}}=0\).

  6. In the additional material, an exhaustive set of results is provided, with motifs rendered as animated GIFs.

  7. Waiting activities are characterized by the same word(s) repeated over time in the motif. Thus, in the collapsed representation explained in Sect. 5.3, the successive time color-coded images overwrite the previous ones, leaving visible only the last (orange, red) time instants.

  8. Note however that longer motifs increase the chance of observing some random co-occurrences, as the amount of overlap with other activities, potentially unexplained by current motifs, increases as well. This is particularly true when the amount of data is not very large, as in the Traffic junction case.

  9. To perform the temporal association we allowed a constant offset (either \(-1\), 0 or 1) between the annotated instants of the event in the ground truth and the starting time of a learned motif.

  10. Note that rather than simply using \(P({ z})\) as the prior for a topic to start at time \(t\), we could have further exploited the information available in past motif occurrences \(\hat{P}({t_s},{ z}|{ d})\) (e.g., the motif of Fig. 15e is often followed by that of Fig. 15d several seconds later). However, as this is not part of our model, we opted for the simpler case.

  11. By global behavior state, we mean a state that controls the set of activities that can occur in the whole scene, i.e., at a global scene level. For example, in the MIT video, the traffic lights turning red or green determine which activities can occur in the scene as a whole (vehicles stop on one side of the road, while stopped vehicles start to move on the other side).

  12. This also takes into account that probabilities need to be positive. As a technical detail, a minimum value \(\varepsilon \) is required for \(p({t_s}|{ z},{ d})\) to avoid undefined log-likelihoods in the objective function.

Abbreviations

\(\mathcal{D }\): Dataset, count matrices of the form \({ n}({ w},{{t_a}},{ d})\)

\({ z}\): Motif index

\({ w}\): Word index (SLA patterns for real data; see Sect. 5)

\({ d}\): Temporal document index

\({{t_a}}\): Absolute time index in temporal documents

\({t_s}\): Start time of a motif

\({{t_r}}\): Relative time from the start of the motif

\(\varTheta \): Model parameters \(\{P({ z}|{ d}), P({t_s}|{ z},{ d}), P({ w},{{t_r}}|{ z})\}\)

\({N_z}\): Number of motifs

\({N_w}\): Vocabulary size (number of different words)

\({ D}\): Number of temporal documents

\({T_z}\): Maximum duration of a motif

\({T_d}\): Duration of a temporal document

\({T_{ds}}\): Number of motif start time indices in a temporal document

\(\lambda _{z,d}\): Sparsity constraint weight

\(\lambda _{bic}\): Penalty term weight in the BIC equation


Acknowledgments

The authors gratefully acknowledge the support of the Swiss National Science Foundation (Project FNS-198, HAI) and of the European Union through its 7th Framework Programme (Integrated Project VANAHEIM (248907) and Network of Excellence PASCAL2).

Corresponding author

Correspondence to Jagannadan Varadarajan.

Appendix: Detailed EM Equation Derivation

In this section we detail the derivation of the equations involved in PLSM inference. Our data \(\mathcal{D } = \{{ n}({ w},{{t_a}},{ d})\}\) is a matrix of counts, where each entry \({ n}({ w},{{t_a}},{ d})\) denotes the number of times a word \({ w}\) appears at time \({{t_a}}\) in document \({ d}\). The probability of observing this data is given by:

$$\begin{aligned} P(\mathcal{D }) = \prod _{{ d}=1}^{{ D}} \prod _{{{t_a}}=1}^{{T_d}} \prod _{{ w}=1}^{{N_w}} P({ w},{{t_a}},{ d})^{{ n}({ w},{{t_a}},{ d})} \end{aligned}$$
(20)

Taking the log on both sides we obtain our objective log-likelihood function as

$$\begin{aligned} \mathcal{L }(\mathcal{D }) = \log P(\mathcal{D }) = \sum _{{ d}=1}^{{ D}} \sum _{{{t_a}}=1}^{{T_d}} \sum _{{ w}=1}^{{N_w}} { n}({ w},{{t_a}},{ d})\log P({ w},{{t_a}},{ d}) \end{aligned}$$
(21)
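The count-matrix likelihood of Eq. 21 can be sketched on toy data as follows (a minimal sketch; the array names and sizes are illustrative and not from the paper):

```python
import numpy as np

# Sketch of Eq. 21: n[d, ta, w] holds the count n(w, ta, d), and p[d, ta, w]
# the joint probability P(w, ta, d), which sums to one over all entries.
rng = np.random.default_rng(0)
D, T_d, N_w = 2, 5, 4                     # documents, time steps, vocabulary

n = rng.integers(0, 3, size=(D, T_d, N_w)).astype(float)
p = rng.random(size=(D, T_d, N_w))
p /= p.sum()                              # normalize the joint distribution

log_lik = float(np.sum(n * np.log(p)))    # log-likelihood of Eq. 21
```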

Since \(P({ w},{{t_a}},{ d}) = \sum _{{t_s}=1}^{{T_{ds}}} \sum _{{ z}=1}^{{N_z}} P({ w},{{t_a}},{ d},{ z},{t_s})\), the log-likelihood, conditioned on the model parameters \(\varTheta \), i.e., the probability distributions \(P({ z}|{ d}), P({t_s}|{ z},{ d})\), and \(P({ w},{{t_r}}|{ z})\), is written as

$$\begin{aligned} \mathcal{L }(\mathcal{D }|\varTheta ) = \sum _{{ d}=1}^{ D}\sum _{{ w}=1}^{N_w} \sum _{{{t_a}}=1}^{{T_d}} n({ w}, {{t_a}}, { d}) \log \sum _{{ z}=1}^{{N_z}} \sum _{{t_s}=1}^{{T_{ds}}}P({ w},{{t_a}},{ d},{ z},{t_s}) \end{aligned}$$
(22)

We want to infer the model parameters under the constraint that the distribution of motif starting times \(P({t_s}|{ z},{ d})\) is sparse and peaky. To achieve this, as stated in Sect. 3.4, we maximize \(D_{KL}(U||P({t_s}|{ z},{ d}))\), the KL divergence between the uniform distribution \(U\) and \(P({t_s}|{ z},{ d})\). The constrained log-likelihood is thus given by:

$$\begin{aligned} \mathcal{L }_c(\mathcal{D }|\varTheta )&= \mathcal{L }(\mathcal{D }|\varTheta ) + \sum _{{ z},{ d}} \lambda _{{ z},{ d}} D_{KL}(U||P({t_s}|{ z},{ d})) \end{aligned}$$
(23)
$$\begin{aligned}&= \mathcal{L }(\mathcal{D }|\varTheta ) + \sum _{{ z},{ d}} \lambda _{{ z},{ d}} \sum _{{t_s}=1}^{{T_{ds}}} \frac{1}{{T_{ds}}}\log \frac{1/{T_{ds}}}{P({t_s}|{ z},{ d})} \end{aligned}$$
(24)
$$\begin{aligned} \begin{aligned}{}&= \mathcal{L }(\mathcal{D }|\varTheta ) + \sum _{{ z},{ d}} \lambda _{{ z},{ d}} \left( \sum _{{t_s}}^{{T_{ds}}} \frac{1}{{T_{ds}}}\log \frac{1}{{T_{ds}}}\right.\nonumber \\&\quad \left. - \sum _{{t_s}=1}^{{T_{ds}}} \frac{1}{{T_{ds}}}\log P({t_s}|{ z},{ d}) \right) \end{aligned} \end{aligned}$$
(25)

By removing the constant factor we obtain Eq. 5 given by

$$\begin{aligned} {\mathcal{L }_c}(\mathcal{D }|\varTheta ) = \mathcal{L }(\mathcal{D }|\varTheta ) - \sum _{{t_s},{ z},{ d}} \frac{\lambda _{{ z},{ d}}}{{T_{ds}}}\log P({t_s}|{ z},{ d}) \end{aligned}$$
(26)
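A small numerical sketch (with an arbitrary \({T_{ds}}\) and made-up example distributions) of why maximizing \(D_{KL}(U||P({t_s}|{ z},{ d}))\) in Eq. 23 favors peaky start-time distributions:

```python
import numpy as np

# The regularizer of Eq. 23 rewards start-time distributions that are far
# from uniform: a peaky P(ts|z,d) has a larger KL(U || P) than a flat one.
T_ds = 6
u = np.full(T_ds, 1.0 / T_ds)              # uniform distribution U

def kl_from_uniform(p):
    """D_KL(U || p), as expanded in Eq. 24."""
    return float(np.sum(u * np.log(u / p)))

p_peaky = np.array([0.90, 0.02, 0.02, 0.02, 0.02, 0.02])
p_flat = u.copy()
```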

The above equation cannot be maximized directly: the summation inside the logarithm makes it intractable. Instead, the EM approach optimizes the expected log-likelihood of the complete data with respect to the hidden variables, keeping the constraint unchanged, which gives

$$\begin{aligned} \begin{aligned}{} E[{\mathcal{L }_c}]&= \sum _{{ d}=1}^{ D}\sum _{{ w}=1}^{{N_w}} \sum _{{{t_a}}=1}^{{T_d}} \sum _{{ z}=1}^{{N_z}} \sum _{{t_s}=1}^{{T_{ds}}} { n}({ w}, {{t_a}}, { d}) P({ z},{t_s}| { w},{{t_a}},{ d}) \nonumber \\&\quad \log P({ w},{{t_a}}, { d},{ z},{t_s})-\sum _{{t_s}, { z},{ d}} \frac{\lambda _{{ z},{ d}}}{{T_{ds}}} \cdot \log (P({t_s}|{ z},{ d})) \end{aligned} \end{aligned}$$
(27)

Notice that the logarithm now operates on the joint probability over all the variables, not just the observed ones. This can be computed efficiently by using Eq. 1 and the indices \({{t_r}}\) and \({t_s}\) instead of \({{t_a}}\). The joint distribution can be split into its constituent distributions using Eq. 2. The expected log-likelihood is thus rewritten as:

$$\begin{aligned} E[{\mathcal{L }_c}]&= \sum _{{ d}=1}^{ D}\sum _{{ w}=1}^{{N_w}} \sum _{{{t_r}}=0}^{{T_z}-1} \sum _{{ z}=1}^{{N_z}} \sum _{{t_s}=1}^{{T_{ds}}} { n}({ w}, {t_s}+{{t_r}}, { d}) P({ z},{t_s}|{ w},{t_s}+{{t_r}},{ d})\nonumber \\&\log [ P({ z}|{ d})P({t_s}|{ z},{ d}) P({{t_r}}|{ z})P({ w}|{{t_r}}, { z}) ]\\&-\sum _{{t_s}, { z},{ d}} \frac{\lambda _{{ z},{ d}}}{{T_{ds}}} \cdot \log P({t_s}|{ z},{ d}) \\&= \sum _{{ d}=1}^{ D}\sum _{{ w}=1}^{{N_w}} \sum _{{{t_r}}=0}^{{T_z}-1} \sum _{{ z}=1}^{{N_z}} \sum _{{t_s}=1}^{{T_{ds}}} { n}({ w}, {t_s}+{{t_r}}, { d}) P({ z},{t_s}|{ w},{t_s}+{{t_r}},{ d})\nonumber \\&[ \log P({ z}|{ d}) + \log P({t_s}|{ z},{ d}) + \log P({{t_r}}|{ z})\\&+ \log P({ w}|{{t_r}}, { z}) ] -\sum _{{t_s}, { z},{ d}} \frac{\lambda _{{ z},{ d}}}{{T_{ds}}} \cdot \log P({t_s}|{ z},{ d}) \end{aligned}$$

The goal is thus to optimize this expression under the constraints that the distributions sum to one. These constraints are enforced using Lagrange multipliers. Finally, the constrained objective function that is optimized is given by:

$$\begin{aligned} \mathcal{H }(\varTheta )&= E[{\mathcal{L }_c}] + \sum _{{ z}} \gamma _{{ z}}\left(1-\sum _{{ w},{{t_r}}} P({ w},{{t_r}}|{ z}) \right)\nonumber \\&\quad + \sum _{{ z},{ d}} \delta _{{ z},{ d}} \left(1 - \sum _{{t_s}} P({t_s}|{ z},{ d}) \right)\nonumber \\&\quad + \sum _{{ d}} \tau _{{ d}} \left(1 - \sum _{{ z}} P({ z}|{ d}) \right) \end{aligned}$$
(28)

where \(\gamma _{{ z}}, \delta _{{ z},{ d}}\) and \(\tau _{{ d}}\) are the Lagrange multipliers. The EM algorithm iterates between the following E-step and M-step:

E-step: In the Expectation step, the posterior distribution of the hidden variables \(({ z},{t_s})\) is computed, using the parameters obtained from the previous iteration’s M-step:

$$\begin{aligned}&P({ z},{t_s}|{ w}, {{t_a}}, { d})= \frac{P({ d},{ z},{t_s},{{t_a}},{ w})}{P({ w},{{t_a}},{ d})}\nonumber \\&\quad = \frac{P({ z}|{ d})P({t_s}|{ z}, { d})P({{t_r}}|{ z})P({ w}|{{t_r}}, { z})}{\sum _{{ z}^{\prime },{t_s}^{\prime }}P({ z}^{\prime }|{ d})P ({t_s}^{\prime }|{ z}^{\prime },{ d})P({{t_a}}-{t_s}^{\prime }|{ z}^{\prime })P({ w}| {{t_a}}-{t_s}^{\prime },{ z}^{\prime })} \end{aligned}$$
(29)
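As an illustrative sketch (not the authors' code), the E-step of Eq. 29 can be written as follows for a single document, with randomly initialized parameters and 0-based time indices; all array names and sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N_z, T_z, T_d, N_w = 2, 3, 6, 4
T_ds = T_d - T_z + 1                                # number of start times

# Randomly initialized parameters for one document d (illustrative only).
p_z   = rng.dirichlet(np.ones(N_z))                 # P(z|d)
p_ts  = rng.dirichlet(np.ones(T_ds), size=N_z)      # P(ts|z,d), shape (N_z, T_ds)
p_wtr = rng.dirichlet(np.ones(N_w * T_z),           # P(w,tr|z), shape (N_z, T_z, N_w)
                      size=N_z).reshape(N_z, T_z, N_w)

def e_step_posterior(w, ta):
    """Posterior P(z, ts | w, ta, d) of Eq. 29 for one observation (w, ta)."""
    post = np.zeros((N_z, T_ds))
    for z in range(N_z):
        for ts in range(T_ds):
            tr = ta - ts                            # relative time inside motif
            if 0 <= tr < T_z:                       # word must fall in the motif
                post[z, ts] = p_z[z] * p_ts[z, ts] * p_wtr[z, tr, w]
    return post / post.sum()                        # normalize over (z, ts)

post = e_step_posterior(w=1, ta=2)
```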

M-step: In the Maximization step, maximizing \(\mathcal{H }(\varTheta )\) with respect to the parameters, which are probability mass functions, results in the following set of equations:

$$\begin{aligned}&\sum _{{t_s}=1}^{{T_{ds}}} \sum _{{{t_r}}=0}^{{T_z}-1} \sum _{{ w}=1}^{{N_w}} { n}({ w}, {t_s}+{{t_r}}, { d})P({ z},{t_s}|{ w}, {t_s}+{{t_r}}, { d})\\&\quad - \tau _{{ d}} P({ z}|{ d}) = 0, \quad 1 \le { d}\le { D},\\&\sum _{{{t_r}}=0}^{{T_z}-1} \sum _{{ w}=1}^{{N_w}} { n}({ w}, {t_s}+{{t_r}},{ d})P({ z},{t_s}|{ w}, {t_s}+{{t_r}}, { d}) - \frac{\lambda _{{ z},{ d}}}{{T_{ds}}}\\&\quad - \delta _{{ z},{ d}}P({t_s}|{ z},{ d}) = 0, \quad 1 \le { d}\le { D}, 1 \le { z}\le {N_z}\\&\sum _{{ d}=1}^{{ D}} \sum _{{t_s}=1}^{{T_{ds}}} { n}({ w}, {t_s}+{{t_r}}, { d})P({ z},{t_s}| { w}, {t_s}+{{t_r}}, { d})\\&\quad - \gamma _{{ z}} P({ w},{{t_r}}|{ z}) = 0, \quad 1 \le { z}\le {N_z}\end{aligned}$$

By eliminating the Lagrange multipliers (Footnote 12), we obtain the following expressions, which were presented in Eqs. 8–10:

$$\begin{aligned} P({ z}|{ d}) \propto \sum _{{t_s}=1}^{{T_{ds}}} \sum _{{{t_r}}=0}^{{T_z}-1} \sum _{{ w}=1}^{{N_w}} { n}({ w}, {t_s}+{{t_r}}, { d})P({ z},{t_s}|{ w}, {t_s}+{{t_r}}, { d}) \end{aligned}$$
(30)
$$\begin{aligned} \begin{aligned}{}&P({t_s}|{ z},{ d}) \propto \max \left( \varepsilon , \left( \sum _{{ w}=1}^{{N_w}} \sum _{{{t_r}}=0}^{{T_z}-1} { n}({ w}, {t_s}+{{t_r}}, { d})\right.\right.\\&\quad \left.\left. P({ z},{t_s}|{ w}, {t_s}+{{t_r}}, { d}) \right) - \frac{\lambda _{{ z},{ d}}}{{T_{ds}}} \right) \end{aligned} \end{aligned}$$
(31)
$$\begin{aligned} P({ w},{{t_r}}|{ z}) \propto \sum _{{ d}=1}^{{ D}} \sum _{{t_s}=1}^{{T_{ds}}} { n}({ w}, {t_s}+{{t_r}}, { d})P({ z},{t_s}| { w}, {t_s}+{{t_r}}, { d}) \end{aligned}$$
(32)
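A minimal sketch of the sparse update of Eq. 31 for one \(({ z},{ d})\) pair, assuming the expected counts have already been accumulated in the E-step; the array `c` and the values of \(\lambda _{z,d}\) and \(\varepsilon \) are illustrative:

```python
import numpy as np

# Sketch of Eq. 31: c[ts] stands for the accumulated expected counts
# sum_w sum_tr n(w, ts+tr, d) P(z, ts | w, ts+tr, d); the eps floor avoids
# undefined log-likelihoods (footnote 12).
eps = 1e-12
T_ds = 5
lam_zd = 2.0                                # sparsity weight lambda_{z,d}

c = np.array([10.0, 0.3, 0.2, 0.1, 0.4])    # expected counts per start time

p_ts = np.maximum(eps, c - lam_zd / T_ds)   # subtract penalty, floor at eps
p_ts /= p_ts.sum()                          # renormalize to a distribution
```

The subtraction drives weak start times to the floor, so the renormalized distribution concentrates its mass on the dominant start time.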


Cite this article

Varadarajan, J., Emonet, R. & Odobez, JM. A Sequential Topic Model for Mining Recurrent Activities from Long Term Video Logs. Int J Comput Vis 103, 100–126 (2013). https://doi.org/10.1007/s11263-012-0596-6
