Abstract
Given the prevalence of multimedia applications on commodity computers equipped with both CPU and GPU devices, there is strong interest in simultaneously exploiting all the parallelization capabilities of such hybrid platforms for high-performance video encoding. Accordingly, this manuscript proposes a method to concurrently implement the H.264/Advanced Video Coding (AVC) inter-loop on hybrid GPU + CPU platforms. The method comprises dynamic, dependency-aware task distribution and real-time computational load balancing over both the CPU and the GPU, driven by an efficient dynamic performance model. Based on the resulting balance, a set of optimized parallel algorithms conceived for video coding on both the CPU and the GPU is dynamically instantiated on any of the available processing devices to minimize the overall encoding time. The proposed model not only provides efficient task scheduling and load balancing for the H.264/AVC inter-loop, but also introduces no significant computational burden to the time-constrained video coding application. According to the presented experimental results, the proposed scheme achieves speedups as high as 2.5 when compared with highly optimized GPU-only encoding solutions and with other state-of-the-art algorithms. Moreover, using only the computational resources that equip most commodity computers, it achieves inter-loop encoding rates as high as 40 fps at HD 1920 × 1080 resolution.
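To make the idea of load balancing driven by a dynamic performance model more concrete, the following is a minimal sketch and not the scheme proposed in the article. It assumes that each frame is divided into macroblock rows, that each device's throughput (rows per second) is re-estimated after every frame with an exponential moving average, and that rows are split in proportion to those estimates. The function names, the smoothing factor `alpha`, and the simulated device speeds are all hypothetical, and inter-task dependencies are ignored.

```python
# Illustrative sketch only: proportional CPU/GPU work split driven by
# measured throughput. Names and parameters are assumptions, not the
# article's actual model.

MB_ROWS_1080P = 68  # 1080 lines padded to 1088, divided into 16-line macroblock rows


def split_rows(total_rows, cpu_rate, gpu_rate):
    """Split rows between CPU and GPU in proportion to their estimated rates."""
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_rows = round(total_rows * gpu_rate / (cpu_rate + gpu_rate))
    gpu_rows = min(max(gpu_rows, 0), total_rows)
    return total_rows - gpu_rows, gpu_rows


def update_rate(old_rate, rows_done, elapsed_s, alpha=0.5):
    """Blend the previous estimate with the newly measured throughput."""
    if rows_done == 0 or elapsed_s <= 0:
        return old_rate  # nothing measured on this device for this frame
    measured = rows_done / elapsed_s
    return (1 - alpha) * old_rate + alpha * measured


def simulate(frames, true_cpu_rate, true_gpu_rate):
    """Run the balancer against fixed (simulated) device speeds."""
    cpu_rate = gpu_rate = 1.0  # no prior knowledge: start with an even split
    split = None
    for _ in range(frames):
        cpu_rows, gpu_rows = split_rows(MB_ROWS_1080P, cpu_rate, gpu_rate)
        # In a real encoder these times would be measured, not computed.
        cpu_time = cpu_rows / true_cpu_rate
        gpu_time = gpu_rows / true_gpu_rate
        cpu_rate = update_rate(cpu_rate, cpu_rows, cpu_time)
        gpu_rate = update_rate(gpu_rate, gpu_rows, gpu_time)
        split = (cpu_rows, gpu_rows)
    return split


if __name__ == "__main__":
    # Simulated GPU is three times faster than the simulated CPU.
    print(simulate(20, 1000.0, 3000.0))
```

In this sketch the split settles where both devices finish their share of a frame at roughly the same time, which is the condition that minimizes per-frame time when the work is freely divisible.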
Acknowledgments
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project PEst-OE/EEI/LA0021/2013.
Cite this article
Momcilovic, S., Roma, N. & Sousa, L. Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms. J Real-Time Image Proc 11, 571–587 (2016). https://doi.org/10.1007/s11554-013-0357-y
DOI: https://doi.org/10.1007/s11554-013-0357-y