|
1.IntroductionVideo matching techniques are motivated by various applications, including video retrieval, surveillance, and watermarking. In video retrieval,1, 2 frame matching is used to retrieve video clips, similar to a query clip, from a database. In video surveillance using multiple cameras,3 it is necessary to establish the correspondences in time and space to combine asynchronous multiview sequences. In video watermarking,4, 5, 6 temporal synchronization should be established between the original and watermarked videos before the extraction of frame-dependent watermarks. Video matching techniques have different accuracy requirements, depending on their target applications. In video retrieval,1, 2 video matching need not be accurate at the frame level. It is acceptable that a retrieved frame does not match a query frame exactly, as long as they contain similar contents. In contrast, in video watermarking, more accurate matching is required, and two video sequences should be synchronized at the frame level, since the matching is a preprocessing step before the extraction of frame-dependent watermarks.8 For instance, in Ref. 6, the pattern of skipped frames is employed as a watermark, and the accurate frame-level synchronization between an original video and its watermarked version is an essential step in watermark detection. In watermarking applications, Delannay, Roover, and Macq4 used an affine model to align video sequences. An affine model, however, cannot effectively describe irregular frame insertion and removal as well as frame shuffling, which are employed to attack watermarked video sequences. Cheng5 proposed a temporal registration algorithm to correct temporal misalignment, which occurs in frame rate conversion or video capturing. The two-frame integration model was employed to represent temporally overlapping frames in the acquisition procedures. In video watermarking, the original video can be modified by various temporal attacks, such as frame removal, insertion, and shuffling. To establish frame-level synchronization between original and modified sequences in video watermarking applications, we propose a dynamic programming algorithm that matches each frame in the modified sequence to the corresponding frame in the original sequence. Although Ref. 5 also employed dynamic programming to match two sequences, they used only frame dissimilarity in the matching. On the other hand, we introduce the notion of adaptive unmatched cost in addition to the matching cost to achieve more accurate matching. Moreover, the unmatched cost is used to deal with frame shuffling attacks, as well as frame removal and insertion attacks. 2.Matching Function and Matching CostSuppose that a video clip , where denotes the ’th frame, is modified from the original clip . A frame may be different from any frame in the original clip, since the clip may be a compressed version of . Moreover, some frames can be removed from or inserted into , and then the resulting frames can be shuffled. We assume that the number of removed frames is less than a threshold . Similarly, let and denote the maximum numbers of inserted frames and shuffled frames, respectively. We say that a frame in matches a frame in if originates from . On the other hand, is called an unmatched frame if does not match any frame in . Then, the temporal modification of a video can be represented by a function on the space of frame indices. Specifically, we define the matching function as where belongs to the frame index set of . For example, Table 1 shows a matching function, when and . In this case, is modified from by removing and , inserting new frames before , and then reversing the order of and .Table 1A matching function λ(j) that describes the temporal alignment between two sequences.
Let denote the matching cost between two frames and , which measures the dissimilarity between and . The mean squared error (MSE) is employed as the matching cost. 3.Proposed Matching AlgorithmGiven a video clip and its original , we attempt to find a matching function , such that in is identical or similar to in for each . The simplest matching can be achieved by the local minimization of matching costs (LMMC) via To recover from frame insertion attacks, can be declared to be an unmatched frame if the minimum cost is larger than a threshold. However, in this LMMC approach, matching functions may not be one to one, and estimation results are very sensitive to the threshold.We propose a globally optimal matching algorithm that minimizes the total matching cost subject to the one-to-one matching constraint. First, we assume that the original video clip is temporally modified by frame removal and insertion attacks only, and find the best matching function under this monotonicity assumption. Later, in the second phase, we detect frame shuffling attacks by further matching the unmatched frames of without imposing the monotonicity constraint. Assuming frame removal and insertion attacks only, the matching function can be estimated by minimizing the sum of the costs of matched frames and unmatched frames, given by where . Note that a function in is monotonically increasing on the reduced domain { and }, which excludes the unmatched frame indices. The unmatched cost of the ’th frame is defined asNotice that the unmatched cost of the ’th frame is adaptively determined by the matched costs of the neighboring frames inside a temporal window , where denotes the window size. In general, a large indicates that the neighboring frames are heavily compressed or they contain complex textures and motions. Thus, a large makes it easier to classify the ’th frame as a matched frame.In Eq. 3, when frames, , are inserted into , there are possible choices of unmatched frames and possible choices of matched frames. Therefore, the complexity of the exhaustive minimization in Eq. 3 is proportional to , which is too demanding in most applications. To reduce the complexity, we minimize the total cost function in Eq. 3 based on dynamic programming. The dynamic programming method in this work is similar to that for computing the Levenshtein distance, also called the edit distance, between two text strings.10, 11 The edit distance counts the minimum number of letter substitutions, insertions, or removals to convert a text string to another string. Similarly, the proposed algorithm matches a video sequence to another sequence by considering frame insertions and removals. However, whereas the distance between two letters is binary (identical or not), the distance between two frames should represent the similarity of those two frames. Thus, as compared with letter insertions or removals, it is more difficult to identify frame insertions or removals. To overcome this difficulty, the proposed algorithm defines the matching cost as MSE and the unmatched cost in Eq. 4 and minimizes the total cost in Eq. 3 to achieve reliable video matching. Suppose that the first frames of are obtained by removing frames from the first frames of and then inserting new frames. Note that , since frames in match frames in . Let denote the minimum sum of the matching costs for the first frames in when frames are inserted and frames are removed. We compute recursively. First, we initialize if , , or with the exception . Then, can be obtained by finding the minimum among three cases where . When matches , the matching cost is added to , which is the first term in Eq. 5. When is an inserted frame, the unmatched cost is added to . If is a removed frame, remains unchanged. In this way, we compute all inductively. Finally, the minimum sum of costs for the whole sequence is given by . While computing the partial costs in Eq. 5, we also record the minimum conditions, which we can trace back to get the matching function in Eq. 3.Next, let us consider frame shuffling attacks, in addition to frame removal and insertion attacks. In Eqs. 3, 5, we impose the monotonicity constraint that the matching function for matched frames is monotonically increasing, and shuffled frames hence can be classified as unmatched frames. Therefore, when is classified as unmatched, it can be further matched to a frame in , which is not matched by the function in Eq. 3. More specifically, is refined by where denotes the indices of frames in that are not matched by Eq. 3. Notice that we reduce the total matching cost by changing the category of from an unmatched frame to a shuffled frame when its matching cost is smaller than the unmatched cost.4.Experimental ResultsThe performance of the proposed algorithm is evaluated using five common intermediate format (CIF) sequences Foreman, Paris, Mobile, Tempete, and Carphone, each of which consists of 100 frames. We also use 33,220 video clips, selected from 20 Korean movies.12 Each clip consists of 100 frames of resolution , and has a frame rate of either 24 or 30 frames per second. Thus, 3,322,500 frames are used in total. To match a video clip with another clip, we minimize the total cost in Eq. 3. It can be shown that the dynamic programming method requires also recursion steps. We use a personal computer with an Intel Pentium D processor for simulations. To match a video clip of 100 frames to another clip, the proposed algorithm requires about for the dynamic programming method. Table 2 shows the matching error probabilities. A matching error probability is defined as the rate of an estimated matching function being different from the true matching function. The sequences are compressed by the H.264/AVC standard with QP 20 and 35, which yield about 42.0 and peak signal-to-noise ratios (PSNRs) on average, respectively. The sequences then go through frame removal attacks , frame insertion attacks , and frame shuffling attacks . The numbers of removed and inserted frames are randomly selected from 1 to 10 and from 1 to 3 ( and ), respectively. An inserted frame is constructed by averaging two adjacent frames in the original sequences. For shuffling attacks, the number of pairs of adjacent frames is randomly selected from 1 to 3 , and each pair of frames swaps their places. Table 2Comparison of matching error probabilities when a video is modified from its original through frame removal (R) , insertion (I) , shuffling (S) , and data compression attacks.
We compare the performance of the proposed algorithm with those of Cheng’s algorithm5 and the LMMC approach. In Ref. 5, frame shuffling attacks are not considered, and the matching fails under these attacks. Since the weights for the two-frame integration model are overparameterized for insertion attacks, matching errors occur even when QP is as low as 20. The LMMC approach is also vulnerable to insertion attacks, since it simply searches the best matching frame for each individual frame. On the other hand, the proposed algorithm provides much better performance by globally minimizing the total matching cost. For example, when QP is 35 and frame removal and insertion attacks are combined , the matching error probability of the proposed algorithm is about 41 times lower than that of LMMC and about 21 times lower than that of Cheng’s algorithm . As QP increases, the qualities of modified videos degrade and the matching error probabilities get higher. However, the proposed algorithm provides significantly better performance than the conventional algorithm, even in these severe conditions. 5.ConclusionWe propose a temporal alignment algorithm between two video sequences, when a video is modified from the other through frame removal, insertion, and shuffling attacks as well as data compression attacks. We define a cost function for global matching errors and develop a dynamic programming algorithm to minimize cost efficiently. Experimental results show that the proposed algorithm provides a significantly lower probability of matching errors than the conventional algorithm. AcknowledgmentThis work was supported partly by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2008-C1090-0801-0017) and partly by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (number R01-2008-000-20292-0). referencesZ. Li, L. Gao, and A. K. Katsaggelos,
“Locally embedded linear subspaces for efficient video indexing and retrieval,”
1765
–1768
(2006). Google Scholar
J. Yuan, W. Wang, J. Meng, Y. Wu, and D. Li,
“Mining repetitive clips through finding continuous paths,”
289
–292
(2007). Google Scholar
E. Grimson, P. Viola, O. Faugeras, T. Lonzano-Perez, T. Poggio, and S. Teller,
“A forest of sensors,”
Proc. DARPA Image Understanding Workshop, 1 45
–50
(1997) Google Scholar
D. Delannay, C. de Roover, and B. Macq,
“Temporal alignment of video sequences for water-marking systems,”
Proc. SPIE, 5020 481
–492
(2003). Google Scholar
H. Cheng,
“Temporal registration of video sequences,”
489
–492
(2003). Google Scholar
Y. Y. Lee, C. S. Kim, and S. U. Lee,
“Video fingerprinting based on frame skipping,”
2305
–2308
(2006). Google Scholar
E. T. Lin and E. J. Delp,
“Temporal synchronization in video watermarking,”
IEEE Trans. Signal Process., 52 3007
–3022
(2004). https://doi.org/10.1109/TSP.2004.833866 Google Scholar
V. I. Levenshtein,
“Binary codes capable of correcting deletions, insertions and reversals,”
Sov. Phys. Dokl., 10 707
–710
(1966). Google Scholar
G. Navarro,
“A guided tour to approximate string matching,”
ACM Comput. Surv., 33
(1), 31
–88
(2001). Google Scholar
Y. Y. Lee,
“Temporal feature modulation for video watermarking,”
Seoul National University,
(2008). Google Scholar
|