PhyLoNet: Physically-Constrained Long-Term Video Prediction

  • Conference paper
  • First Online:
Computer Vision – ACCV 2022 (ACCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13847)

Included in the following conference series: ACCV (Asian Conference on Computer Vision)

Abstract

Motions in videos are often governed by physical and biological laws such as gravity, collisions, flocking, etc. Accounting for such natural properties is an appealing way to improve realism in future frame video prediction. Nevertheless, the definition and computation of intricate physical and biological properties in motion videos are challenging. In this work, we introduce PhyLoNet, a PhyDNet extension that learns long-term future frame prediction and manipulation. Similar to PhyDNet, our network consists of a two-branch deep architecture that explicitly disentangles physical dynamics from complementary information. It uses a recurrent physical cell (PhyCell) to perform physically-constrained prediction in latent space. In contrast to PhyDNet, PhyLoNet introduces a modified encoder-decoder architecture together with a novel relative flow loss. This enables longer-term future frame prediction from a short input sequence, with higher accuracy and quality. We have carried out extensive experiments showing that PhyLoNet outperforms PhyDNet on various challenging natural motion datasets such as ball collisions, flocking, and pool games. Ablation studies highlight the importance of our new components. Finally, we show an application of PhyLoNet to video manipulation and editing via a novel class-label modification architecture.
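
To make the two-branch design concrete, below is a minimal PyTorch-style sketch of a PhyDNet-like predictor with a simplified physically-constrained cell, a complementary residual branch, and one plausible reading of a relative flow loss. All names (PhyCellLike, ResidualCell, relative_flow_loss) and every layer size are illustrative assumptions drawn only from the abstract, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PhyCellLike(nn.Module):
        # Toy physically-constrained cell in the spirit of PhyDNet's PhyCell:
        # a convolutional prediction step standing in for latent PDE dynamics,
        # followed by a Kalman-style gated correction from the observation.
        def __init__(self, channels):
            super().__init__()
            self.phy_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

        def forward(self, h, obs):
            h_pred = h + self.phy_conv(h)  # prediction step (latent dynamics)
            k = torch.sigmoid(self.gate(torch.cat([h_pred, obs], dim=1)))
            return h_pred + k * (obs - h_pred)  # correction step

    class ResidualCell(nn.Module):
        # Complementary branch: absorbs appearance and motion factors the
        # physical branch does not model (PhyDNet uses a ConvLSTM here).
        def __init__(self, channels):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            )

        def forward(self, h, obs):
            return h + self.net(torch.cat([h, obs], dim=1))

    def relative_flow_loss(pred, target, eps=1e-3):
        # Assumed form of the relative flow loss: compare frame-to-frame
        # differences (a cheap proxy for optical flow) of prediction and
        # ground truth, normalised by the true motion magnitude so that small
        # but meaningful motions are not dominated by large ones.
        # pred, target: (batch, time, channels, height, width)
        pred_flow = pred[:, 1:] - pred[:, :-1]
        true_flow = target[:, 1:] - target[:, :-1]
        return ((pred_flow - true_flow).abs() / (true_flow.abs() + eps)).mean()

    # Hypothetical usage on random encoder features (encoder/decoder omitted).
    b, t, c, hgt, wdt = 2, 6, 16, 32, 32
    phy, res = PhyCellLike(c), ResidualCell(c)
    h_p = torch.zeros(b, c, hgt, wdt)
    h_r = torch.zeros(b, c, hgt, wdt)
    feats = torch.randn(b, t, c, hgt, wdt)
    outs = []
    for step in range(t):
        obs = feats[:, step]
        h_p = phy(h_p, obs)
        h_r = res(h_r, obs)
        outs.append(h_p + h_r)  # a decoder would map this sum back to pixels
    pred = torch.stack(outs, dim=1)
    print(relative_flow_loss(pred, feats).item())

The division by the true motion magnitude is what would make the flow term "relative"; the paper's actual loss may well operate on a learned optical-flow estimate rather than raw frame differences.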

This research was partially supported by the Lynn and William Frankel Center for Computer Science at BGU.

Author information

Corresponding author: Andrei Sharf.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zikri, N.B., Sharf, A. (2023). PhyLoNet: Physically-Constrained Long-Term Video Prediction. In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chellappa, R. (eds.) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol. 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_34

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-26293-7_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26292-0

  • Online ISBN: 978-3-031-26293-7

  • eBook Packages: Computer Science, Computer Science (R0)
