Abstract
Motion in videos is often governed by physical and biological laws such as gravity, collisions, and flocking. Accounting for such natural properties is an appealing way to improve realism in future-frame video prediction. Nevertheless, defining and computing intricate physical and biological properties of motion in videos is challenging. In this work, we introduce PhyLoNet, an extension of PhyDNet that learns long-term future frame prediction and manipulation. Like PhyDNet, our network consists of a two-branch deep architecture that explicitly disentangles physical dynamics from complementary information, using a recurrent physical cell (PhyCell) to perform physically-constrained prediction in latent space. In contrast to PhyDNet, PhyLoNet introduces a modified encoder-decoder architecture together with a novel relative flow loss. These enable longer-term future frame prediction from a small input sequence with higher accuracy and quality. Extensive experiments show that PhyLoNet outperforms PhyDNet on various challenging natural-motion datasets such as ball collisions, flocking, and pool games. Ablation studies highlight the importance of our new components. Finally, we show an application of PhyLoNet to video manipulation and editing through a novel class-label modification architecture.
This research was partially supported by the Lynn and William Frankel Center for Computer Science at BGU.
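To make the two-branch design concrete, below is a minimal PyTorch-style sketch written from the high-level description in the abstract. The module names, layer sizes, and the exact form of the relative flow loss are illustrative assumptions, not the authors' released implementation: it shows a PhyCell-like prediction-correction step, a ConvLSTM-style complementary branch whose latent state is summed with the physical one, and a simple temporal-difference stand-in for a relative flow loss.

# Minimal sketch (PyTorch) of a PhyDNet-style two-branch predictor and a
# hypothetical "relative flow" loss. All names and hyperparameters are
# illustrative assumptions, not the PhyLoNet reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhyCell(nn.Module):
    """Prediction-correction cell: a convolutional physical predictor
    followed by a Kalman-like gated correction from the encoded input."""

    def __init__(self, channels: int, hidden: int = 49):
        super().__init__()
        # Physical predictor: convolutions approximating spatial differential operators.
        self.predict = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.GroupNorm(7, hidden),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        # Correction gate, computed from the encoded frame and the prediction.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, enc_x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        h_pred = h + self.predict(h)                      # physical prediction
        k = torch.sigmoid(self.gate(torch.cat([enc_x, h_pred], dim=1)))
        return h_pred + k * (enc_x - h_pred)              # gated correction


class TwoBranchCell(nn.Module):
    """PhyCell branch for physical dynamics plus a ConvLSTM-style branch
    for complementary appearance; their latent states are summed before decoding."""

    def __init__(self, channels: int):
        super().__init__()
        self.phy = PhyCell(channels)
        self.res = nn.Conv2d(2 * channels, 4 * channels, kernel_size=3, padding=1)

    def forward(self, enc_x, h_phy, h_res, c_res):
        h_phy = self.phy(enc_x, h_phy)
        gates = self.res(torch.cat([enc_x, h_res], dim=1))
        i, f, o, g = gates.chunk(4, dim=1)
        c_res = torch.sigmoid(f) * c_res + torch.sigmoid(i) * torch.tanh(g)
        h_res = torch.sigmoid(o) * torch.tanh(c_res)
        return h_phy, h_res, c_res, h_phy + h_res          # summed latent state


def relative_flow_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical relative flow loss: penalize differences between the
    frame-to-frame changes of predicted and ground-truth sequences
    (tensors shaped batch x time x channels x H x W)."""
    pred_flow = pred[:, 1:] - pred[:, :-1]
    target_flow = target[:, 1:] - target[:, :-1]
    return F.l1_loss(pred_flow, target_flow)

In use, one would initialize the hidden states to zeros, unroll the cell over the encoded input frames, and then feed back its own decoded predictions to roll out longer horizons, adding the relative-flow term to a standard reconstruction loss.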
References
Aoyagi, Y., Murata, N., Sakaino, H.: Spatio-temporal predictive network for videos with physical properties. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2268–2278 (2021). https://doi.org/10.1109/CVPRW53098.2021.00256
Battaglia, P.W., Pascanu, R., Lai, M., Rezende, D., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics (2016)
Brabandere, B.D., Jia, X., Tuytelaars, T., Gool, L.V.: Dynamic filter networks (2016)
Byeon, W., Wang, Q., Srivastava, R.K., Koumoutsakos, P.: ContextVP: fully context-aware video prediction (2017). https://doi.org/10.48550/ARXIV.1710.08518. https://arxiv.org/abs/1710.08518
Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series (2017)
Denton, E., Birodkar, V.: Unsupervised learning of disentangled representations from video (2017)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale (2021)
Eslami, S.M.A., et al.: Attend, infer, repeat: fast scene understanding with generative models (2016)
Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction (2016). https://doi.org/10.48550/ARXIV.1605.07157. https://arxiv.org/abs/1605.07157
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning (2015)
Gao, H., Xu, H., Cai, Q.Z., Wang, R., Yu, F., Darrell, T.: Disentangling propagation and generation for video prediction (2019)
Guen, V.L., Thome, N.: Shape and time distortion loss for training deep time series forecasting models (2019)
Hsieh, J.T., Liu, B., Huang, D.A., Fei-Fei, L., Niebles, J.C.: Learning to decompose and disentangle representations for video prediction (2018)
Hui, T.W., Tang, X., Loy, C.C.: LiteFlowNet: a lightweight convolutional neural network for optical flow estimation (2018)
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks (2016)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational inference for interacting systems (2018)
Kosiorek, A.R., Kim, H., Posner, I., Teh, Y.W.: Sequential attend, infer, repeat: generative modelling of moving objects (2018)
Krishnan, R.G., Shalit, U., Sontag, D.: Deep Kalman filters (2015)
Kwon, Y.H., Park, M.G.: Predicting future frames using retrospective cycle GAN. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1811–1820 (2019). https://doi.org/10.1109/CVPR.2019.00191
Le Guen, V., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: Computer Vision and Pattern Recognition (CVPR) (2020)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.-H.: Flow-grounded spatial-temporal video prediction from still images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 609–625. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_37
Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion GAN for future-flow embedded video prediction (2017)
Liu, Z., Yeh, R.A., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow (2017)
Long, Z., Lu, Y., Dong, B.: PDE-Net 2.0: learning PDEs from data with a numeric-symbolic hybrid deep network. J. Comput. Phys. 399, 108925 (2019). https://doi.org/10.1016/j.jcp.2019.108925
Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-Net: learning PDEs from data (2018)
Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L.: Unsupervised learning of long-term motion dynamics for videos (2017)
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error (2015)
Mo, S., Cho, M., Shin, J.: InstaGAN: instance-aware image-to-image translation (2019)
Mrowca, D., et al.: Flexible neural representation for physics prediction (2018)
Oliu, M., Selva, J., Escalera, S.: Folded recurrent neural networks for future video prediction. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 745–761. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_44
Palm, R.B., Paquet, U., Winther, O.: Recurrent relational networks (2017)
Pan, T., Jiang, Z., Han, J., Wen, S., Men, A., Wang, H.: Taylor saves for later: disentanglement for video prediction using Taylor representation. Neurocomputing 472, 166–174 (2022)
Patraucean, V., Handa, A., Cipolla, R.: Spatio-temporal video autoencoder with differentiable memory (2015)
Raissi, M.: Deep hidden physics models: deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19(1), 932–955 (2018)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part II): data-driven discovery of nonlinear partial differential equations (2017)
Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network (2016)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Rudy, S.H., Brunton, S.L., Proctor, J.L., Kutz, J.N.: Data-driven discovery of partial differential equations. Sci. Adv. 3(4), e1602614 (2016)
Sanchez-Gonzalez, A., et al.: Graph networks as learnable physics engines for inference and control (2018)
Seo, S., Liu, Y.: Differentiable physics-informed graph networks (2019)
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, vol. 1, pp. 802–810. MIT Press, Cambridge (2015)
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs (2015). https://doi.org/10.48550/ARXIV.1502.04681. https://arxiv.org/abs/1502.04681
van Steenkiste, S., Chang, M., Greff, K., Schmidhuber, J.: Relational neural expectation maximization: unsupervised discovery of objects and their interactions (2018)
Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume (2018)
Teed, Z., Deng, J.: RAFT: recurrent all-pairs field transforms for optical flow. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 402–419. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_24
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation (2017)
Vaswani, A., et al.: Attention is all you need (2017)
Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction (2018)
Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics (2016)
Wang, Y., Gao, Z., Long, M., Wang, J., Yu, P.S.: PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning (2018). https://doi.org/10.48550/ARXIV.1804.06300. https://arxiv.org/abs/1804.06300
Wang, Y., Jiang, L., Yang, M.H., Li, L.J., Long, M., Fei-Fei, L.: Eidetic 3D LSTM: a model for video prediction and beyond. In: ICLR (2019)
Wang, Y., et al.: PredRNN: a recurrent neural network for spatiotemporal predictive learning (2021). https://doi.org/10.48550/ARXIV.2103.09504. https://arxiv.org/abs/2103.09504
Wang, Y., Zhang, J., Zhu, H., Long, M., Wang, J., Yu, P.S.: Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics (2018). https://doi.org/10.48550/ARXIV.1811.07490. https://arxiv.org/abs/1811.07490
Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P., Zoran, D.: Visual interaction networks (2017)
Wu, J., Lu, E., Kohli, P., Freeman, W.T., Tenenbaum, J.B.: Learning to see physics via visual de-animation. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, Red Hook, NY, USA, pp. 152–163. Curran Associates Inc. (2017)
Wu, Y., Gao, R., Park, J., Chen, Q.: Future video synthesis with object motion prediction (2020)
Xu, J., Ni, B., Li, Z., Cheng, S., Yang, X.: Structure preserving video prediction. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1460–1469 (2018). https://doi.org/10.1109/CVPR.2018.00158
Xue, T., Wu, J., Bouman, K.L., Freeman, W.T.: Visual dynamics: probabilistic future frame synthesis via cross convolutional networks (2016)
Yin, Y., et al.: Augmenting physical models with deep networks for complex dynamics forecasting. J. Stat. Mech. Theory Exp. 2021(12), 124012 (2021). https://doi.org/10.1088/1742-5468/ac3ae5
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks (2020)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zikri, N.B., Sharf, A. (2023). PhyLoNet: Physically-Constrained Long-Term Video Prediction. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_34
DOI: https://doi.org/10.1007/978-3-031-26293-7_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26292-0
Online ISBN: 978-3-031-26293-7