DOI: 10.1145/3528233.3530704
research-article

Novel View Synthesis of Human Interactions from Sparse Multi-view Videos

Published: 24 July 2022

Abstract

This paper presents a novel system for generating free-viewpoint videos of multiple human performers from very sparse RGB cameras. The system reconstructs a layered neural representation of the dynamic multi-person scene from multi-view videos with each layer representing a moving instance or static background. Unlike previous work that requires instance segmentation as input, a novel approach is proposed to decompose the multi-person scene into layers and reconstruct neural representations for each layer in a weakly-supervised manner, yielding both high-quality novel view rendering and accurate instance masks. Camera synchronization error is also addressed in the proposed approach. The experiments demonstrate the better view synthesis quality of the proposed system compared to previous ones and the capability of producing an editable free-viewpoint video of a real soccer game using several asynchronous GoPro cameras. The dataset and code are available at https://github.com/zju3dv/EasyMocap.
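
The abstract does not spell out how the layers are combined at render time. As a rough illustration only, the sketch below assumes standard NeRF-style volume rendering in which the densities of all layers (moving instances plus static background) are summed at each ray sample, and each sample's rendering weight is split among the layers in proportion to their density. The function name `composite_layers`, the array shapes, and the toy values are hypothetical; none of them are taken from the paper or from the EasyMocap code.

```python
import numpy as np


def composite_layers(sigmas, colors, deltas):
    """Composite several density/color layers along one camera ray.

    Assumed inputs (illustrative, not the paper's interface):
      sigmas: (L, S) non-negative density of each layer at each ray sample
      colors: (L, S, 3) RGB color of each layer at each ray sample
      deltas: (S,) distance between consecutive samples
    Returns:
      rgb: (3,) composited pixel color
      layer_weights: (L,) share of the pixel's opacity owned by each layer
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Total density at each sample is the sum over layers.
    sigma_total = sigmas.sum(axis=0)
    # Opacity of each sample under the usual volume rendering model.
    alpha = 1.0 - np.exp(-sigma_total * deltas)
    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha

    # Split each sample's weight among layers by their share of the density.
    share = np.divide(
        sigmas, sigma_total, out=np.zeros_like(sigmas), where=sigma_total > 0
    )
    layer_sample_w = share * weights

    rgb = (layer_sample_w[..., None] * colors).sum(axis=(0, 1))
    layer_weights = layer_sample_w.sum(axis=1)
    return rgb, layer_weights


# Toy ray: layer 0 is a green background behind layer 1, an opaque red person.
sigmas = [[0.0, 0.0, 50.0], [0.0, 1000.0, 0.0]]
colors = [[[0, 1, 0]] * 3, [[1, 0, 0]] * 3]
rgb, layer_weights = composite_layers(sigmas, colors, deltas=[1.0, 1.0, 1.0])
instance_id = int(np.argmax(layer_weights))  # layer that dominates this pixel
```

Under this simplified model, taking the dominant layer per pixel gives an instance mask as a by-product of rendering, which is one plausible reading of how a layered representation can yield both novel views and masks.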

Published In

SIGGRAPH '22: ACM SIGGRAPH 2022 Conference Proceedings
July 2022
553 pages
ISBN: 9781450393379
DOI: 10.1145/3528233
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Novel view synthesis
  2. dynamic scene modeling
  3. neural rendering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSFC

Conference

SIGGRAPH '22
Acceptance Rates

Overall Acceptance Rate 1,822 of 8,601 submissions, 21%

Article Metrics

  • Downloads (last 12 months): 462
  • Downloads (last 6 weeks): 45

Reflects downloads up to 12 Dec 2024.

Cited By

  • (2024) HMR-Adapter: A Lightweight Adapter with Dual-Path Cross Augmentation for Expressive Human Mesh Recovery. Proceedings of the 32nd ACM International Conference on Multimedia, 6093–6102. DOI: 10.1145/3664647.3681641. Online publication date: 28-Oct-2024.
  • (2024) InNeRF: Learning Interpretable Radiance Fields for Generalizable 3D Scene Representation and Rendering. Proceedings of the 32nd ACM International Conference on Multimedia, 11004–11012. DOI: 10.1145/3664647.3681393. Online publication date: 28-Oct-2024.
  • (2024) SeamPose: Repurposing Seams as Capacitive Sensors in a Shirt for Upper-Body Pose Tracking. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1–13. DOI: 10.1145/3654777.3676341. Online publication date: 13-Oct-2024.
  • (2024) ST-4DGS: Spatial-Temporally Consistent 4D Gaussian Splatting for Efficient Dynamic Scene Rendering. ACM SIGGRAPH 2024 Conference Papers, 1–11. DOI: 10.1145/3641519.3657520. Online publication date: 13-Jul-2024.
  • (2024) HDhuman: High-Quality Human Novel-View Rendering From Sparse Views. IEEE Transactions on Visualization and Computer Graphics 30:8, 5328–5338. DOI: 10.1109/TVCG.2023.3290543. Online publication date: Aug-2024.
  • (2024) 3D Human Scan With A Moving Event Camera. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5586–5596. DOI: 10.1109/CVPRW63382.2024.00568. Online publication date: 17-Jun-2024.
  • (2024) Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1062–1071. DOI: 10.1109/CVPR52733.2024.00107. Online publication date: 16-Jun-2024.
  • (2024) MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 109–118. DOI: 10.1109/CVPR52733.2024.00019. Online publication date: 16-Jun-2024.
  • (2024) Virtual lighting environment and real human fusion based on multiview videos. Information Fusion 103:C. DOI: 10.1016/j.inffus.2023.102090. Online publication date: 4-Mar-2024.
  • (2024) Dyn-E: Local appearance editing of dynamic neural radiance fields. Computers & Graphics, Article 104140. DOI: 10.1016/j.cag.2024.104140. Online publication date: Dec-2024.
