
1 Introduction

As the growth of video content accelerates, it becomes increasingly necessary to improve video understanding ability with less annotation effort. Since videos can contain a large number of frames, the cost of identifying the exact start and end frames of each action (frame-level annotation) is high compared to just labeling which actions a video contains (video-level annotation). Researchers are therefore motivated to explore approaches that do not require per-frame annotations. In this work, we focus on the weakly-supervised action localization paradigm, using only video-level action labels to learn activity recognition and localization. This problem can be framed as a special case of the Multiple Instance Learning (MIL) problem [4]: a bag contains multiple instances; the instances' labels collectively determine the bag's label, and only the bag's label is available during training. In our task, each video represents a bag, and the clips of the video represent the instances inside the bag. The key challenge is to handle key instance assignment during training: identifying which instances within the bag trigger the bag's label.

Fig. 1. Each curve represents a bag and points on the curve represent instances in the bag. We aim to find a concept point such that each positive bag contains some key instances close to it while all instances in the negative bags are far from it. In the E step we use the current concept to pick key instances for each positive bag. In the M step we use the key instances and the negative bags to update the concept.

Most previous works used attention-based approaches to model the key instance assignment process: attention weights combine instance-level classification scores to produce the bag's classification, and models of this form are trained via standard classification procedures. The learned attention weights indicate the contribution of each instance to the bag's label, and thus can be used to localize the positive instances (action clips) [17, 26]. While promising results have been observed, models of this variety tend to produce incomplete action proposals [13, 31], in which only part of the action is detected. This is also a common problem in attention-based weakly-supervised object detection [11, 25]. We argue that this problem stems from a misspecification of the MIL objective. Attention weights, which encode key instance assignment, should be our optimization target. But in an attention-MIL framework, attention is learned as a by-product of bag classification. As a result, the attention module tends to pick only the most discriminative parts of the action or object needed to correctly classify a bag, because the loss and training signal come entirely from the bag's classification.

Inspired by traditional MIL literature, we adopt a different method to tackle weakly-supervised action localization: the Expectation–Maximization framework. Historically, Expectation–Maximization (EM) or similar iterative estimation processes were used to solve MIL problems [4, 5, 35] before the deep learning era. Motivated by these works, we explicitly model key instance assignment as a hidden variable and optimize it as our target. As shown in Fig. 1, we adopt the EM algorithm to solve the interlocking steps of key instance assignment and action concept classification. To formulate our learning objective, we derive two pseudo-label generating schemes to model the E and M processes respectively. We show that our alternating update process optimizes a lower bound of the MIL objective. We also find that previous attention-MIL models implicitly violate the MIL assumptions: they apply attention to negative bags, while the MIL assumption states that instances in negative bags are uniformly negative. We show that our method better models the data generating procedure of both positive and negative bags. It achieves state-of-the-art performance with a simple architecture, suggesting its potential to be extended to many practical settings. The main contributions of this paper are:

  • We propose to adapt the Expectation–Maximization MIL framework to the weakly-supervised action localization task. We derive two novel pseudo-label generating schemes to model the E and M processes respectively.

  • We show that previous attention-MIL models implicitly violate the MIL assumptions, and that our method better models the background information.

  • Our model is evaluated on two standard benchmarks, THUMOS14 and ActivityNet1.2, and achieves state-of-the-art results.

Fig. 2. Our EM-MIL model architecture builds on fixed two-stream I3D features, and alternates between updating the key-instance assignment branch \(\textit{\textbf{q}}_{\varvec{\phi }}\) (E step) and the classification branch \(\textit{\textbf{p}}_{\varvec{\theta }}\) (M step). We use the classification score and the key instance assignment result to generate pseudo-labels for each other (detailed in Sects. 3.1 and 3.2), and alternate freezing one branch to train the other.

2 Related Work

Weakly-Supervised Action Localization. Weakly supervised action localization learns to localize activities inside videos when only action class labels are available. UntrimmedNet [26] first used attention to model the contribution of each clip to a video-level action label: it performs classification separately at each clip and predicts the video's label through a weighted combination of clip scores. The later STPN model [17] instead uses attention to combine clip features into a video-level feature vector and conducts classification from there. [8] generalized a framework for these attention-based approaches, formalizing such combination as a permutation-invariant aggregation function. W-TALC [19] proposed a regularization enforcing that action periods of the same class share similar features. It has also been noticed that attention-MIL methods tend to produce incomplete localization results. To tackle that, a series of papers [22, 23, 33, 38] took the adversarial erasing idea, improving detection completeness by hiding the most discriminative parts. [31] conducted sub-sampling based on activation to suppress the dominant response of the discriminative action parts. To model complete actions, [13] proposed a multi-branch network with each branch handling distinctive action parts. To generate action proposals, these methods combine per-clip attention and classification scores to form the Temporal Class Activation Sequence (T-CAS [17]) and group the high-activation clips. Another line of models [14, 21] trains a boundary predictor based on pre-trained T-CAS scores to output the action start and end points without grouping (Fig. 2).

Some previous methods in weakly-supervised object or action localization involve iterative refinement, but their training processes and objectives differ from our Expectation–Maximization method. RefineLoc [1]'s training contains several passes: it uses the result of the \(i^{th}\) pass as supervision for the \((i+1)^{th}\) pass and trains a new model from scratch in each iteration. [24] uses a similar approach in image object detection but stacks all passes together. Our approach differs from these in the following ways. Their self-supervision and iterative refinement happen between different passes, and in each pass all modules are trained jointly until convergence. In comparison, we adopt an EM framework that explicitly models key instance assignment as hidden variables; our pseudo-labeling and alternating training happen between different modules of the same model, so our model requires only one pass. In addition, as discussed in Sect. 3.4, they handle attention in negative bags differently from us.

Traditional Multiple Instance Learning Methods. The Multiple Instance Learning problem was first defined by Dietterich et al. [4], who proposed the iterated discrimination algorithm. It starts from a point in the feature space and iteratively searches for the smallest box covering at least one point (instance) per positive bag while avoiding all points in negative bags. [15] set up the Diverse Density framework: they defined a point in the feature space to be the positive concept, such that every positive bag (“diverse”) contains at least one instance close to the concept while all instances in the negative bags are far from it (in terms of some distance metric). They modeled the likelihood of a concept using Gaussian mixture models along with a Noisy-OR probability estimation. [34] then applied AdaBoost to this Noisy-OR model and to [10]'s ISR model, deriving two MIL loss functions. [5] adapted the k-nearest neighbors method to the Diverse Density framework. Later, [35] proposed the EM-DD algorithm, combining the Expectation–Maximization process and the Diverse Density metric. These early works did not involve neural networks and were not applied to the high-dimensional task of action localization. Many of them model key instance assignment as a hidden variable and use iterative optimization. They also differ from the predominant attention-MIL paradigm in how they treat negative instances. We view these distinctions as motivation to explore our approach.

3 Method

Multiple Instance Learning (MIL) is a supervised learning problem where, instead of one instance X being matched to one label y, a bag (set) of multiple instances \([X_1, X_2, X_3, ...]\) is matched to a single label y. In the binary MIL setting, a bag's label is positive if at least one instance in the bag is positive; therefore a bag is negative only if all instances in the bag are negative.

In our task, following the practice of previous works [17, 19, 26], we divide a long video into multiple 15-frame clips. A video then corresponds to a bag (the bag-level video label is given), and the clips of the video represent the instances inside the bag (instance-level clip labels are missing). Each video (bag) contains T video clips (instances), denoted by \(\mathbf {X} = \{\mathbf {x}_t\}_{t = 1}^{T}\), where \(\mathbf {x}_t \in \mathbb {R}^d\) is the feature of clip t. We represent the video's action label in a one-hot way, where \(y_c = 1\) if the video contains clips of action c and \(y_c = 0\) otherwise, \(c \in \{1, 2, \cdots , C\}\) (each video can contain multiple action classes). In the MIL setting, the label of each video is determined by the labels of the clips it contains. Specifically, we assign a binary variable \(z_t \in \{0, 1\}\) to each clip t, denoting whether clip t is responsible for the generation of the video-level label. \(\textit{\textbf{z}} = \{z_t\}_{t = 1}^{T}\) models the key instance assignment. The video-level label is generated with probability:

$$\begin{aligned} p_\theta (y_c = 1 | \mathbf {X}, \textit{\textbf{z}}) = \sigma _{t \in \{1, \cdots , T\}} \{~ p_\theta (y_{c, t} = 1 | \mathbf {x}_t) \cdot [z_t = 1]~\} \end{aligned}$$
(1)

where \([z_t = 1]\) is the indicator function for assignment. \(p_\theta (y_{c, t} = 1 | \mathbf {x}_t)\) is the probability (parameterized by \(\theta \)) that clip t belongs to class c. The closer clip t is to the concept, the higher \(p_\theta (y_{c, t} = 1 | \mathbf {x}_t)\) is. \(\sigma \) is a permutation-invariant operator, e.g. maximum  [36] or mean operator  [8].
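As a concrete illustration, the sketch below computes the video-level probability of Eq. 1 for one class from clip-level scores and a key instance assignment. The function and variable names (`video_label_prob`, `clip_probs`) are ours, not from the paper; the max/mean choices correspond to the operators of [36] and [8].

```python
import torch

def video_label_prob(clip_probs, z, aggregate="max"):
    """Eq. 1 (sketch): p(y_c = 1 | X, z) from clip scores and assignment.

    clip_probs: (T,) tensor, clip_probs[t] = p_theta(y_{c,t} = 1 | x_t).
    z:          (T,) binary tensor, z[t] = 1 if clip t is a key instance.
    """
    masked = clip_probs * z.float()              # indicator [z_t = 1]
    if aggregate == "max":                       # max operator [36]
        return masked.max()
    return masked.sum() / z.float().sum().clamp(min=1.0)  # mean operator [8]
```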

In our temporal action localization problem, we propose to first estimate the probability of \(z_t = 1\) with an estimator \(q_\phi (z_t = 1 | \mathbf {x}_t)\) parameterized by \(\phi \), and then choose the clips with high estimated likelihood as our action segments. Since \(\{z_t\}\) are latent variables with no ground truth, we optimize \(q_\phi \) through maximization of the variational lower bound:

$$\begin{aligned} \begin{aligned} \log p_\theta (y_c | \mathbf {X})&= KL(q_\phi (\textit{\textbf{z}} | \mathbf {X}) \ || \ p_\theta (\textit{\textbf{z}} | \mathbf {X}, y_c)) + \int q_\phi (\textit{\textbf{z}} | \mathbf {X}) \log \frac{p_\theta (\textit{\textbf{z}}, y_c | \mathbf {X})}{q_\phi (\textit{\textbf{z}} | \mathbf {X})} d \textit{\textbf{z}} \\&\ge \int q_\phi (\textit{\textbf{z}} | \mathbf {X}) \log p_\theta (\textit{\textbf{z}}, y_c | \mathbf {X}) d \textit{\textbf{z}} + H(q_\phi (\textit{\textbf{z}} | \mathbf {X})), \end{aligned} \end{aligned}$$
(2)

where \(H(q_\phi (\textit{\textbf{z}} | \mathbf {X}))\) is the entropy of \(q_\phi \). By maximizing the lower bound, we optimize the likelihood of \(y_c\) given \(\mathbf {X}\). In this work, we adopt the Expectation–Maximization (EM) algorithm and optimize the lower bound by updating \(\theta \) and \(\phi \) alternately. Specifically, in the E step we update \(\phi \) by minimizing \(KL(q_\phi (\textit{\textbf{z}} | \mathbf {X}) \ || \ p_\theta (\textit{\textbf{z}} | \mathbf {X}, y_c))\), which tightens the lower bound; in the M step we update \(\theta \) by maximizing the lower bound. In the following subsections, we first detail the updates of \(\phi \) in the E step and \(\theta \) in the M step, and then summarize the whole algorithm.

3.1 E Step

In the E step, we update \(\phi \) by minimizing \(KL(q_\phi (\textit{\textbf{z}} | \mathbf {X}) \ || \ p_\theta (\textit{\textbf{z}} | \mathbf {X}, y_c))\), tightening the lower bound in Eq. 2. As in previous works [17, 18], we approximate \(q_\phi (\textit{\textbf{z}} | \mathbf {X})\) with \(\prod _t q_\phi (z_t | \mathbf {x}_t)\), assuming independence between clips, where \(q_\phi (z_t | \mathbf {x}_t)\) is estimated by a neural network with parameters \(\phi \) on each clip. Thus we only have to minimize \(KL(q_\phi (z_t | \mathbf {x}_t) \ || \ p_\theta (z_t | \mathbf {x}_t, y_c))\) for each clip t. Following the literature, we assume that the posterior \(p_\theta (z_t | \mathbf {x}_t, y_c)\) is proportional to the classification score \(p_\theta (y_c | \mathbf {x}_t)\), and we update \(q_\phi \) with pseudo labels generated from the classification scores. Specifically, dynamic thresholds are calculated from the instance classification scores to generate pseudo-labels for \(q_\phi \): if an instance has a classification score over the threshold for any ground-truth class of the video, the instance is treated as a positive example; otherwise, it is treated as a negative example. The pseudo label is formulated as follows:

$$\begin{aligned} \hat{z}_{t} =\left\{ \begin{array}{ll} 1, &{} \text {if } \sum \nolimits _{c=1}^{C} \mathbbm {1}(P_{t,c}> \overline{P}_{1:T,c} ~\wedge ~ y_{c} = 1 ) > 0 \\ 0, &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(3)

where \(P_{t,c} = p_\theta (y_c | \mathbf {x}_t)\) and \(\overline{P}_{1:T,c}\) is the mean of \(P_{t,c}\) over the temporal axis. We then update \(q_\phi \) with the binary cross entropy (BCE) loss in Eq. 4; the updating process is illustrated in Fig. 3.

$$\begin{aligned} \mathcal {L}(q_\phi ) = - \hat{z}_t \log q_\phi (z_t | \mathbf {x}_t) - (1 - \hat{z}_t) \log (1 - q_\phi (z_t | \mathbf {x}_t)). \end{aligned}$$
(4)
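A minimal PyTorch sketch of this E-step update, assuming per-video tensors `P` (clip-by-class probabilities from \(p_\theta\)) and `Q` (clip probabilities from \(q_\phi\)); the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def e_step_pseudo_labels(P, y):
    """Eq. 3 (sketch): pseudo labels z_hat from frozen classification scores.

    P: (T, C) tensor, P[t, c] = p_theta(y_c | x_t), values in [0, 1].
    y: (C,)   binary video-level label vector.
    """
    thresh = P.mean(dim=0, keepdim=True)            # dynamic per-class mean
    over = (P > thresh) & y.bool().unsqueeze(0)     # only ground-truth classes
    return over.any(dim=1).float()                  # (T,) clip pseudo labels

def e_step_loss(Q, P, y):
    """Eq. 4: BCE between q_phi's clip scores Q (T,) and the pseudo labels."""
    z_hat = e_step_pseudo_labels(P.detach(), y)     # p_theta is frozen here
    return F.binary_cross_entropy(Q, z_hat)
```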

3.2 M Step

In the M step, we update \(p_\theta \) by optimizing the lower bound in Eq. 2. Since \(H(q_\phi (\textit{\textbf{z}} | \mathbf {X}))\) is constant w.r.t. \(\theta \), we only optimize \(\int q_\phi (\textit{\textbf{z}} | \mathbf {X}) \log p_\theta (\textit{\textbf{z}}, y_c | \mathbf {X}) d \textit{\textbf{z}}\), which is equivalent to optimizing the classification performance given the key instance assignment \(q_\phi (\textit{\textbf{z}} | \mathbf {X})\). To this end, we use the class-agnostic key-instance assignment module \(q_\phi \) and the ground-truth video-level labels to generate a \(T\times {C}\) pseudo-label map which discriminates between foreground and background clips within the same video. As in the E step, our pseudo-label generation procedure calculates a dynamic threshold, here based on the distribution of instance-assignment scores over the video's clips. It assigns positive labels to all instances whose scores are above the threshold, and negative labels to all instances whose scores are below it, as well as to all instances in negative bags. The pseudo label is given by:

$$\begin{aligned} \hat{y}_{t,c} =\left\{ \begin{array}{ll} 1, &{} \text {if } y_{c} = 1 \text { and } Q_{t} > \overline{Q}_{1:T} + \gamma \cdot (\max (Q_t) - \min (Q_t))\\ 0, &{} \text {otherwise} \end{array}\right. , \end{aligned}$$
(5)

where \(Q_t = q_\phi (z_t | \mathbf {x}_t)\) and \(\overline{Q}_{1:T}\) is the mean of \(Q_t\) over the temporal axis. The threshold hyper-parameter \(\gamma \) encodes a prior on how similarly the same action exhibits across videos. We then update \(p_\theta \) with the BCE loss in Eq. 6; the updating process is illustrated in Fig. 4.

$$\begin{aligned} \mathcal {L}(p_\theta ) = -\hat{y}_{t,c} \log p_\theta (y_c | x_t) - (1 - \hat{y}_{t,c}) \log (1 - p_\theta (y_c | x_t)). \end{aligned}$$
(6)
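The corresponding M-step sketch, under the same assumptions (illustrative names, per-video tensors):

```python
import torch
import torch.nn.functional as F

def m_step_pseudo_labels(Q, y, gamma=0.15):
    """Eq. 5 (sketch): T x C pseudo-label map from key-instance scores.

    Q: (T,) tensor, Q[t] = q_phi(z_t = 1 | x_t), values in [0, 1].
    y: (C,) binary video-level label vector.
    gamma: threshold hyper-parameter (0.15 on THUMOS14, 0 on ActivityNet1.2).
    """
    thresh = Q.mean() + gamma * (Q.max() - Q.min())  # dynamic threshold
    fg = (Q > thresh).float().unsqueeze(1)           # (T, 1) foreground mask
    return fg * y.float().unsqueeze(0)               # zero for negative classes

def m_step_loss(P, Q, y, gamma=0.15):
    """Eq. 6: BCE between p_theta's scores P (T, C) and the pseudo labels."""
    y_hat = m_step_pseudo_labels(Q.detach(), y, gamma)  # q_phi is frozen here
    return F.binary_cross_entropy(P, y_hat)
```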

3.3 Overall Algorithm

We summarize our EM-style algorithm in Algorithm 1. We update the key-instance assignment module \(q_\phi \) and the classification module \(p_\theta \) alternately: in the E step we freeze \(p_\theta \) and update \(q_\phi \) using pseudo labels from \(p_\theta \); in the M step we freeze \(q_\phi \) and optimize the classification \(p_\theta \) based on it. The two steps alternate to maximize the likelihood \(\log p_\theta (y_c | \mathbf {X})\) and, in the process, optimize the localization results.

Algorithm 1. EM-MIL training with alternating E and M steps.
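A minimal sketch of this alternating procedure, reusing the `e_step_loss` and `m_step_loss` helpers sketched above; the two-branch `model` interface and the per-video `(X, y)` loader are assumptions for illustration:

```python
def train_em_mil(model, loader, opt_phi, opt_theta, cycles, epochs_per_cycle):
    """Alternate between E and M steps, freezing one branch at a time."""
    for cycle in range(cycles):
        e_phase = (cycle % 2 == 0)              # even cycles: E step
        for _ in range(epochs_per_cycle):
            for X, y in loader:                 # X: (T, d) clip features
                Q = model.q_phi(X)              # key-instance scores (T,)
                P = model.p_theta(X)            # class scores (T, C)
                if e_phase:                     # update q_phi only
                    loss, opt = e_step_loss(Q, P, y), opt_phi
                else:                           # update p_theta only
                    loss, opt = m_step_loss(P, Q, y), opt_theta
                opt.zero_grad()
                loss.backward()
                opt.step()
```

Because each loss helper detaches the other branch's scores, gradients reach only the branch being updated in the current phase.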

3.4 Comparison with Previous Methods

Examining Eqs. 3 and 5, we find that our pseudo-labeling process via \(Q_t\) and \(\hat{y}_{t,c}\) can also be interpreted as a special kind of attention. Denoting the loss function by \(\mathcal {L}\), the loss in Eq. 5 is calculated as

$$\begin{aligned} \mathcal {L}\left[ ~p_\theta (\mathbf {y} | \mathbf {x}),~ \mathcal {F}(\mathbf {Q}, \mathbf {y})~ \right] \end{aligned}$$
(7)

where \(\mathcal {F}\) is the pseudo-label generation function in Eq. 5, and \(\mathbf {Q},\mathbf {y},\mathbf {x}\) are the compact expressions of \(Q_t, y_c, x_t\). On the other hand, denoting attention and classification scores as \(\mathbf {a}, \mathbf {c}\), the loss for a typical attention-based model like [26] is:

$$\begin{aligned} \mathcal {L}\left[ ~\varvec{\sigma }(\mathbf {c} \odot \mathbf {a}), ~\mathbf {y} ~\right] \end{aligned}$$
(8)
Fig. 3. In our EM-MIL model only the foreground classification score \(P_{t,c}\) affects the key instance pseudo label \(\hat{z}_{t}\) (left), while in previous models all-class classification scores contribute to the attention weights (right).

Fig. 4. Our EM-MIL model (left) uses key instance assignment \(Q_{t}\) to generate pseudo classification labels \(\hat{y}_{t,c}\) only for the foreground classes, while in previous models such as UntrimmedNet (right) attention is applied to all classes.

Here \(\varvec{\sigma }\) is the aggregation operator [8], such as \(reduce\_sum\) or \(reduce\_max\). Comparing Eq. 7 to Eq. 8, it is easy to see that they can be matched: \(p_\theta (\mathbf {y} | \mathbf {x})\) is the classification score (\(\mathbf {c}\)), and \(\mathbf {Q}\) can be seen as a special attention (corresponding to \(\mathbf {a}\)) that, in the M step, attends to the key instances it estimates. But unlike previous attention-MIL methods, Eq. 3 shows that this “attention” applies only to positive bags. We believe this better aligns with the MIL assumption that all instances in negative bags are uniformly negative; previous methods that apply attention to negative bags implicitly assume that some instances are more negative than others, violating the MIL assumption. The differences between our attention and theirs are illustrated in Figs. 3 and 4. In addition, in Eq. 5 this “attention” is a threshold-based hard attention: clips below the threshold are classified as background with high confidence, while clips above the threshold are weighted equally and re-scored in the next iteration. The use of hard pseudo labels allows a distinct treatment of positive and negative instances that would be more complex to enforce with soft boundaries. We initialize our training procedure by labeling every clip in a positive bag as 1 and gradually narrow down the search scope. This training process maintains high recall for action clips in each E-M iteration; it prevents the attention from focusing on the discriminative parts too quickly, thus increasing proposal completeness.

Another way to compare our method with previous ones is through the lens of the MIL framework. As discussed in [2], the MIL problem has two settings, instance-level and bag-level: the instance-level setting prioritizes instance classification precision over the bag's, and vice versa. Our task aligns with the instance-level setting, as the primary goal is action localization (equivalent to clip classification). Previous attention-MIL models like [17, 19, 26] treat instance localization as the by-product of an accurate bag-level classification system, which aligns with the bag-level MIL setting. By modeling the problem through an instance-level MIL framework, our approach more accurately models the target objective. This change in objective function and optimization procedure yields a substantial improvement in performance.

3.5 Inference

At test time, we use a separate branch for video-level classification and use our model for localization, as in previous work [21]. For the classification branch, we use a plain UntrimmedNet [26] with soft attention for the THUMOS14 dataset and W-TALC [19] for the ActivityNet1.2 dataset. We run a forward pass with our model to get the localization score \(L_t\) by fusing the instance assignment score \(Q_t\) and the classification score \(P_{t,c}\):

$$\begin{aligned} L_t = \lambda Q_t + (1-\lambda ) P_{t,c}, \end{aligned}$$
(9)

where \(\lambda \) is set to 0.8 through grid search on the THUMOS14 dataset and 0.3 on the ActivityNet1.2 dataset. In Sect. 4.2 we analyze the impact of different values of \(\lambda \). We threshold the \(L_t\) score to get a prediction \(y_{t}'\) for each clip, using the same scheme as in Eq. 5, and then group the clips above the threshold to get the temporal start and end points of the action proposals.
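A sketch of this inference procedure for one predicted class, with grouping done by merging consecutive above-threshold clips; the names are illustrative:

```python
def localize(Q, P_c, lam=0.8, gamma=0.15):
    """Fuse scores (Eq. 9), threshold as in Eq. 5, and group consecutive
    above-threshold clips into (start, end) proposals in clip indices.

    Q:   (T,) key-instance scores; P_c: (T,) class scores for class c.
    """
    L = lam * Q + (1 - lam) * P_c                    # Eq. 9
    keep = L > L.mean() + gamma * (L.max() - L.min())
    proposals, start = [], None
    for t, k in enumerate(keep.tolist() + [False]):  # sentinel closes last run
        if k and start is None:
            start = t
        elif not k and start is not None:
            proposals.append((start, t - 1))
            start = None
    return proposals
```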

4 Experiments

In this section, we evaluate our EM-MIL model on two large-scale temporal activity detection datasets: THUMOS14 [9] and ActivityNet1.2 [7]. Section 4.1 introduces the experimental setup of these datasets, the evaluation metrics, and the implementation details. Section 4.2 compares weakly-supervised localization results between our proposed model and state-of-the-art models on both datasets, and visualizes some localization results. Section 4.3 presents ablation studies for each component of our model on the THUMOS14 dataset.

4.1 Experimental Setup

Datasets: The THUMOS14 [9] activity detection dataset contains over 24 h of videos from 20 different athletic activities. The train set contains 2765 trimmed videos, while the validation set and the test set contain 200 and 213 untrimmed videos respectively. We use the validation set as training data and report weakly-supervised temporal activity localization results on the test set. This dataset is particularly challenging as it consists of very long videos with multiple activity instances of very short duration. Most videos contain multiple instances of the same activity class, and some videos contain activity instances from different classes.

The ActivityNet [7] dataset has three versions. We use ActivityNet1.2, which contains around 10000 videos in total: 4819 train videos, 2383 validation videos, and 2480 test videos withheld for challenge purposes. We report weakly-supervised temporal activity localization results on the validation videos. In ActivityNet1.2, around 99% of videos contain activity instances of a single class, and many videos have activity instances covering more than half of their duration. Compared to THUMOS14, this is a large-scale dataset, both in the number of activity classes and the amount of videos.

Evaluation Metric: The weakly-supervised temporal activity localization results are evaluated in terms of mean Average Precision (mAP) at different temporal Intersection over Union (tIoU) thresholds, denoted as mAP@\(\alpha \) where \(\alpha \) is the threshold. The average mAP over 10 evenly distributed tIoU thresholds between 0.5 and 0.95 is also commonly used in the literature.
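For reference, tIoU between two temporal segments is the one-dimensional analogue of spatial IoU; a minimal sketch:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g., in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A proposal counts as a true positive at threshold alpha if it matches an
# unclaimed ground-truth segment of the same class with tiou >= alpha;
# mAP@alpha is the mean over classes of the resulting average precision.
assert tiou((0.0, 4.0), (2.0, 4.0)) == 0.5   # 2 s overlap over 4 s union
```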

Implementation Details: Video frames are sampled at 12 fps (THUMOS14) or 25 fps (ActivityNet1.2). For each frame, we re-scale the shorter side to 256, take a center crop of size \(224 \times 224\), and construct a video clip from every 15 frames. We extract clip features using the publicly released two-stream I3D model pretrained on the Kinetics dataset [3], taking the \(Mixed\_5c\) layer's feature map as the representation. For the optical flow stream, TV-L1 flow [27, 32] is used as the input.

Our model is implemented in PyTorch and trained with the Adam optimizer at an initial learning rate of 0.0001 for both datasets. For the THUMOS14 dataset, we train the model by alternating the E and M steps every 10 epochs for the first 30 epochs; we then increase the learning rate by a factor of 4 and shorten the alternating cycle to 1 epoch for another 35 epochs. For the ActivityNet1.2 dataset, we use a similar training approach, but the alternating cycle is 5 epochs and the learning rate is constant. We use our model to generate the instance assignment \(Q_t\) and classification score \(P_{t,c}\) separately for the RGB and flow branches, and then fuse the RGB/flow scores by weighted averaging. The threshold hyper-parameter \(\gamma \) in Eq. 5 is set to 0.15 for the THUMOS14 dataset and 0 for the ActivityNet1.2 dataset. Intuitively, the value of \(\gamma \) reflects how similarly the same action exhibits across videos, and should be negatively correlated with the variance of the action's feature distribution. We also explored \(\gamma \) in the range [0.05, 0.2]: mAP@tIoU=0.5 varies between 29.0% and 30.5% on THUMOS14, compared to the previous state of the art of 26.8% [18] using the same training data.
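To make the THUMOS14 schedule above concrete, a sketch of a per-epoch lookup; we assume the E step leads each cycle, and the helper name is ours:

```python
def thumos14_schedule(epoch, base_lr=1e-4):
    """Return (learning rate, phase) for a 0-indexed epoch, following the
    schedule above: 10-epoch E/M cycles for the first 30 epochs, then a
    4x learning rate with 1-epoch cycles for the remaining 35 epochs."""
    if epoch < 30:
        phase = "E" if (epoch // 10) % 2 == 0 else "M"
        return base_lr, phase
    phase = "E" if (epoch - 30) % 2 == 0 else "M"
    return 4 * base_lr, phase
```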

Table 1. Our EM-MIL detection results on THUMOS14 in percentage. mAP at different tIoU thresholds \(\alpha \) are reported. The top half shows fully-supervised methods while the bottom half shows weakly-supervised ones including ours. EM-MIL-UNT represents the result using UntrimmedNet’s [26] features.
Fig. 5. Qualitative visualization. (a) and (b) show results for two videos each on THUMOS14 and ActivityNet1.2, a good prediction example (top) and a bad one (bottom). Ground truth activity segments are marked in red. The localization score distribution \(L_t\) and predicted activity segments are in blue. (Color figure online)

4.2 Comparison with State-of-the-Art Approaches

Results on THUMOS14 Dataset: We compare our model's results on the THUMOS14 dataset with state-of-the-art results in Table 1. Our model outperforms all previously published models and achieves a new state-of-the-art result of 30.5% at mAP@0.5. This result is achieved by our simple EM training policy and pseudo-labeling scheme, without auxiliary losses to regularize the learning process. Compared to the best result among the six recent models [1, 16,17,18,19, 30] using the same two-stream I3D feature extraction backbone as ours, we obtain a significant 3% improvement at mAP@0.5. We also tried our model with UntrimmedNet's features (denoted EM-MIL-UNT in Table 1) and obtained 27.2% mAP@0.5, still a significant improvement over previous models using the same feature backbone (e.g., [14, 21, 26]). Our model also shows a more pronounced improvement at the high-threshold metrics tIoU=0.6 and tIoU=0.7, which suggests that our action proposals are more complete. On the other hand, our performance is slightly worse at the low-tIoU metrics.

Qualitative results for several examples are shown in Fig. 5(a). For each example, we show the video, the intermediate score map \(L_t\) from our model, the final activity detection result, and the ground-truth temporal segment annotation. In the first example, Clean and Jerk, we localize the activity correctly with almost 100% overlap. We also show one bad prediction in the second example, where our model overestimates the Cricket Bowling activity duration by 20%, an effect of the iterative shrinkage training process that initially labels every instance positive. Our model greatly alleviates the incompleteness problem for activity detection in videos containing multiple action segments, though in some cases it can also introduce additional false positives. In addition, our model is highly time efficient: on THUMOS14 it trains for 65 epochs in 64.7 s on two TITAN RTX GPUs. We ran the released code for AutoLoc [21] and W-TALC [19] on the same machine with their recommended training procedures; their training times are 44.5 s and 6051.2 s respectively. All experiments used pre-computed features, and [21]'s training required additional pretrained CAS scores.

Results on ActivityNet1.2 Dataset: We compare our model's results on the ActivityNet1.2 dataset with previous results in Table 2. Our model outperforms previously published models at mAP@0.5, reaching 37.4%. Despite this state-of-the-art result at mAP@0.5, our model performs worse at high tIoU metrics, the opposite of what we observed on THUMOS14. We further investigated the reason for these different trends. Videos in THUMOS14 contain multiple action segments, each of relatively short duration; this places high demands on localization, and our model outperforms previous ones at high tIoU. Unlike THUMOS14, most videos (>99%) in ActivityNet1.2 have only one action class, and most have only a few activity segments that compose a large portion of the whole video duration. Videos in ActivityNet1.2 can thus be regarded, to a certain extent, as trimmed actions. We speculate that action localization performance on ActivityNet1.2 depends more on the classification module, which might be the bottleneck for our model. This speculation also correlates with the different \(\lambda \) values in Eq. 9 when calculating the localization score on the two datasets. Under our model's assumptions, the key instance assignment score \(Q_t\) indicates the action clips, and a higher weight on it facilitates localization. On THUMOS14, the weight \(\lambda \) for \(Q_t\) is set to the high value 0.8; on ActivityNet1.2, the classification score \(P_{t,c}\) has the higher weight (0.7), implying that the model relies mostly on classification to succeed on this dataset. For further illustration, we also visualize some good and bad detection results from ActivityNet1.2 in Fig. 5(b).

Table 2. Detection results on ActivityNet1.2 in terms of mAP@{0.5, 0.7, 0.9} and average mAP at tIoU thresholds \(\alpha \in [0.5, 0.95]\) with step 0.05 (in percentage). It shows both fully-supervised and weakly-supervised methods.

4.3 Ablation Studies

We ablate our pseudo-label generation scheme and the Expectation–Maximization alternating training method on the THUMOS14 dataset in terms of mAP@0.5 in Table 3.

Ablation on the Pseudo Labeling: We first ablate the pseudo-labeling scheme for \(\hat{z}_{t}\) and \(\hat{y}_{t,c}\) and report the results in Table 3. We switch the supervision to an attention-MIL loss based on the softmax function, similar to [17, 26]: in the E step, classification scores of all classes contribute collectively to the attention weights, and in the M step, attention weights are applied equally to positive and negative videos without regard to the bag's label. Compared to this “Alternating” model, which performs alternating training but with plain attention, the “Full Model” improves mAP@0.5 from 24.5% to 30.5%. This indicates the usefulness of the proposed pseudo-labeling strategy, which models the key instance assignment explicitly and aligns better with the MIL assumption.

Ablation on the EM Alternating Training Technique: We also evaluate the effectiveness of Expectation–Maximization alternating training compared to joint optimization. The EM training method iteratively estimates the key instance assignment and then maximizes the video classification accuracy, achieving better activity detection performance: the “Full Model” improves mAP@0.5 from 26.8% to 30.5% compared to the “Pseudo labeling” model with joint optimization. The same training process could potentially be applied to other MIL-based models, e.g., for the weakly-supervised object detection task, to improve accuracy as well.

Table 3. Ablation results for the pseudo labeling and EM alternating training on THUMOS14 dataset in terms of mAP@0.5 (%).

5 Conclusion

We propose an EM-MIL framework with pseudo labeling and alternating training for weakly-supervised action detection in video. Our EM-MIL framework is motivated by the traditional MIL literature, which is under-explored in deep learning settings. By explicitly modeling latent variables, this framework improves our control over the learning objective of instance-level MIL, leading to state-of-the-art performance. While this work uses a relatively simple pseudo-labeling scheme to implement the EM method, more sophisticated EM variants can be designed, e.g., explicitly parameterizing the latent distribution over instances and directly optimizing the instance likelihood in the E and M steps. Incorporating the video's temporal structure is also a promising direction for further performance improvement.