MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming

Authors:

Daniele Lorenzi,

Farzad Tashtarian,

Hermann Hellwagner,

Christian Timmerer

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 10

Article No.: 319, Pages 1 - 23

https://doi.org/10.1145/3656175

Published: 12 September 2024

Abstract

HTTP Adaptive Streaming (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming at adapting to fluctuations in network bandwidth. However, current ABR implementations have a limitation in that they are designed to function with one set of video representations, i.e., the bitrate ladder, which differ in bitrate and resolution, but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them prior to the streaming session and stick to it throughout the entire streaming session. Although newer codecs are generally preferred over older ones, their compression efficiencies differ depending on the content’s complexity, which varies over time. Therefore, it is necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this article, we first provide a practical example where we compare compression efficiencies of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (on a per-segment basis in the outmost case), refining the selection of the ABR algorithms by exploiting key metrics, such as the perceived segment quality and size. Subsequently, to address the scalability issues of this centralized model, we introduce a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications called MEDUSA to be deployed on top of existing ABR algorithms. MEDUSA enhances the user’s Quality of Experience (QoE) by utilizing a multi-objective function that considers the quality and size of video segments when selecting the next representation. 
Using quality information and segment size from the modified Media Presentation Description (MPD), MEDUSA utilizes buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA shows the ability to improve QoE for various test videos and scenarios. The results reveal an impressive improvement in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). Additionally, MEDUSA can reduce the transmitted data volume by up to more than 40% achieving a QoE similar to the techniques compared, reducing the burden on streaming service providers for delivery costs.

1 Introduction

Recently, there has been a significant global increase in the use of video streaming. According to the Ericsson Mobility Report [9], by the end of 2021, video streaming represented \(69\%\) of all mobile data traffic, and this percentage is expected to increase to \(79\%\) by 2027. Most videos available online are transmitted using HTTP Adaptive Streaming (HAS) [1], including Dynamic Adaptive Streaming over HTTP (MPEG-DASH) [7] and HTTP Live Streaming (HLS) [26]. The video content in HAS is encoded in various representations on the server side and then divided into segments of equal length. An Adaptive BitRate (ABR) algorithm selects the appropriate bitrate for each segment on the client side to deliver the best possible Quality of Experience (QoE) while adapting to changes in network traffic [17, 22, 25, 30].

QoE and video encoding quality are interrelated from a streaming service provider’s perspective. Currently, the most widely used video codec is Advanced Video Coding (AVC) [38], as all major streaming platforms and web browsers support it.¹ However, its low compression efficiency, especially for high-resolution videos, led to the development of a new video coding standard called High Efficiency Video Coding (HEVC) [33] in 2013, which achieved a 50% reduction in bitrate compared to AVC [23]. To address the need for an open-source and royalty-free video codec, the Alliance for Open Media (AOMedia) industry consortium developed AOMedia Video 1 (AV1) [4], which offers a bitrate reduction of around 30% compared to its predecessor, VP9 [20].

However, it is important to note that, depending on the content and its complexity [19], older codecs like AVC may actually outperform more recent video codecs [2]. Therefore, it is necessary to determine the most suitable video codec not only for specific types of content but also for different video segments, since the complexity of a video sequence varies over time. Although we currently lack a definitive answer to this question, we do understand the importance of video codecs within the video streaming chain.

The main challenges in multi-codec techniques relate to (i) content provisioning, (ii) content delivery, and (iii) content consumption.

In (i), the main step is the encoding process, which requires high computational resources to squeeze a video sequence into a compressed media file. Since this process is performed for each quality representation and video codec, increasing the number of adopted video codecs imposes a higher computational cost for producing and storing the video segments.

In Video on Demand (VoD), the content is pre-encoded and stored on the origin server. For the distribution over the Internet, Content Delivery Networks (CDNs) are required to fetch popular content and cache it to improve retrieval speed. However, this convenience comes with higher expenses and limited cache storage. Hence, in (ii), using multiple codecs can reduce data transmission and improve QoE, but increases CDN storage needs. It is worth noting that the encoding cost is not directly influenced by the number of clients requesting the content. Conversely, the delivery cost increases with the number of clients served by the CDNs, whose service fees scale up with the amount of data delivered.

Lastly, in (iii), the challenge is guaranteeing that the audience can access video streaming services from different platforms, such as consoles, TVs, set-top boxes, or web browsers. Table 1 shows the percentage of worldwide usage for the four most widely used web browsers.\(^{1}\) This consideration can be extended to any other streaming device or application. Current streaming platforms support multiple video codecs, but the specific codecs supported may vary depending on the adopted application or browser, and the device’s decoding capabilities (see Table 1).

Web Browser	Market Share (%)	Supported Codecs
Google Chrome	62.4	AVC, HEVC, VP9, AV1
Safari	20.2	AVC, HEVC, VP9
Microsoft Edge	5.2	AVC, HEVC, VP9
Mozilla Firefox	2.7	AVC, VP9, AV1

Table 1. Market Share of Web Browsers Worldwide with Supported Codecs\(^{1}\)

In this sense, selecting the appropriate video codecs becomes a complex task as it involves finding the right balance between reducing the encoding effort and minimizing transmitted data while ensuring that the audience can play the content.

Once this decision is made, the encoded representations on the server are typically listed in the Media Presentation Description (MPD), also called manifest, a document containing all the necessary information for the client to initiate the streaming session. When multiple codecs are available, clients determine the appropriate one based on their decoding capabilities and compression efficiency estimations. However, this simplistic approach fails to consider the complexity of the content itself, which, varying over time, impacts different codecs in unique ways.

Based on the aforementioned problems and considerations, we propose a novel distributed ABR technique for VoD applications termed multi-codec dynamic plug-in for HTTP adaptive bitrate streaming (MEDUSA) to work on top of existing ABR algorithms, which we term “underlying ABR algorithm” throughout the work, to enable efficient multi-codec delivery in HAS. Taking into account a multi-codec bitrate ladder, MEDUSA exploits key metrics related to video segments, such as the perceived quality expressed in Video Multi-Method Assessment Fusion (VMAF) [16] and the segment size (in Bytes), to request the appropriate codec for each video segment. Therefore, MEDUSA enables dynamic codec switching over time to maximize QoE. The contributions of this article are four-fold:

— Video codecs’ performance: We evaluate the performance of multiple video codecs on selected video sequences with different content complexities in terms of video quality and segment size.

— Dynamic codec selection: In contrast to existing techniques, which stick to the same codec for the entire streaming session, we formulate the optimization problem of selecting the appropriate codec for each user and video segment, refining the selection of the ABR algorithms. For this purpose, the centralized model runs on an edge server, i.e., a high-speed server positioned at the edge of the network to cache and efficiently deliver digital content to users with minimized latency. This approach requires playback information from the users, such as chosen representation, segment index, and buffer occupancy, and exploits key metrics, such as the perceived segment quality and size.

— MEDUSA: To reduce the complexity of the optimization model and address scalability issues, we move the computation to the client by proposing a novel distributed plug-in ABR algorithm termed MEDUSA that, acting on top of an underlying ABR algorithm, changes codec dynamically to enhance the QoE of the users.

— Evaluation: Various experiments are conducted to validate our proposed method with different test sequences and a comparison with state-of-the-art approaches.

The article is organized as follows. Section 2 summarizes related work in the context of HAS, followed by the motivation of our work, the problem description and formulation, and the proposed method, MEDUSA, in Section 3. In Section 4, we compare the performance of MEDUSA with state-of-the-art ABR algorithms. Finally, Section 5 concludes the article.

Throughout this work, when discussing video codecs, the word performance is used to refer to the achieved content quality (often measured in terms of PSNR or VMAF) in relation to the content size. Furthermore, we use the terms client to refer to the machine and user to refer to the person accessing the streaming session.

2 Related Work

Recently, there has been an increase in attention toward multi-codec systems, focusing on performance evaluation, bitrate ladder optimization, and dataset creation [29, 34, 40].

Zabrovskiy et al. [40] compare the performance of video codecs such as AVC, HEVC, VP9, and AV1 based on videos with different spatial and temporal information. The authors found that VP9 and HEVC performed similarly, while AVC had the lowest compression efficiency among all items examined. Furthermore, AV1 achieved a higher BD-rate than the other codecs when employing weighted PSNR on the Y, U, and V components. Nevertheless, the scope of this study is restricted to three video sequences, encompassing a bitrate ladder up to 4 K resolution and employing PSNR as the quality metric. Reznik et al. [29] explore the development of an optimal multi-codec bitrate ladder that takes into account network conditions and video content complexity, and present an optimization approach to achieve this goal. This study demonstrates that the rates and points distribution between AVC and HEVC codecs exhibits an interleaved pattern. Consequently, a dual-codec client can take advantage of this pattern, alternating between these codecs to achieve finer-grained adaptation and ultimately enhance the overall quality. Taraghi et al. [34] provide a multi-codec dataset for DASH systems that includes AVC, HEVC, AV1, and VVC, allowing interoperability testing and streaming experiments for these codecs under varying circumstances. Although related to the multi-codec domain, these studies do not consider segment selection and delivery.

In our previous work [18], we proposed a Multi-Codec Optimization Model at the edge for Live streaming (MCOM-Live) to efficiently fetch and deliver segments encoded with multiple codecs. The proposed technique operates at the edge of the network, i.e., a server existing in the network as close as possible to a requesting client to reduce communication latency. Given a video content encoded with multiple codecs according to a fixed bitrate ladder, the Mixed-Binary Linear Programming (MBLP) model will choose among three available policies, i.e., fetch, transcode, or skip, the best option to handle the representations, based on bandwidth and computation costs. The experimental results show that MCOM-Live can reduce additional latency by up to 23% and streaming costs by up to 78%, in addition to improving the visual quality of the delivered segments by up to 0.5 dB, in terms of PSNR, compared to state-of-the-art approaches. Since MCOM-Live is designed for live streaming and relies on a Virtual Reverse Proxy (VRP) located on the edge server to deliver content from two video codecs (AVC and HEVC), the scalability of this system can be challenging. The processing delays introduced by this technique can increase depending on the number of clients. Furthermore, the available resources of the VRP must be shared and carefully allocated to each client.

Contrary to MCOM-Live, this article targets the VoD scenario and investigates and compares different techniques deployed on the client to fetch segments encoded with multiple video codecs. Therefore, we study the impact on the QoE of an ABR strategy in which the client is responsible for selecting the next segment from different video codecs (i.e., AVC, HEVC, VP9, and AV1). To do so, the client must obtain additional knowledge about the quality and size of the available segments. Apart from this requirement, the proposed approach is agnostic to the underlying ABR algorithm.

ABR algorithms can be divided into (i) throughput-based, (ii) buffer-based, or (iii) hybrid approaches, depending on the metrics used to adjust the bitrate selection [1]. In the family of throughput-based ABR algorithms, we consider AGG [21] (short for aggressive), which selects the highest possible bitrate quality level within the estimated throughput limit. Within the category of buffer-based approaches, Buffer-Based Adaptation version 0 (BBA-0) [10] and Buffer Occupancy based Lyapunov Algorithm (BOLA) [32] are two algorithms that use the instantaneous buffer occupancy to determine the appropriate quality level for the next segment. Focusing on BOLA’s slow-start issue, BOLA-E [31] has been developed to cope with user events such as startup and seeking. SARA [14] is a hybrid ABR algorithm that retrieves the next segment based on the current buffer state and throughput estimation.
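The throughput-based rule described for AGG can be illustrated with a minimal sketch. This is not the implementation from [21]; the function name `select_throughput_based` is hypothetical, the ladder values are the AVC bitrates of the multi-codec ladder used later in the article (assumed here to be in kbps), and the fallback to the lowest representation when even that exceeds the estimate is an assumption.

```python
from typing import Sequence

# AVC bitrates of the multi-codec ladder, assumed to be expressed in kbps.
AVC_LADDER_KBPS = [100, 200, 375, 550, 750, 1000, 1500, 3000,
                   5800, 7500, 12000, 17000]


def select_throughput_based(ladder_kbps: Sequence[int],
                            est_throughput_kbps: float) -> int:
    """Return the index of the highest bitrate within the throughput estimate.

    The ladder must be sorted in increasing order. If even the lowest
    bitrate exceeds the estimate, index 0 is returned (assumed fallback).
    """
    chosen = 0
    for idx, bitrate in enumerate(ladder_kbps):
        if bitrate <= est_throughput_kbps:
            chosen = idx
    return chosen
```

With an estimated throughput of 4,000 kbps, this sketch would pick the 3,000 kbps representation, since the next step (5,800 kbps) exceeds the estimate.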

3 Medusa

In this section, we first motivate our approach with practical examples and present insights concerning video complexity and codec performance for the selected video sequences (Section 3.1). We then introduce the streaming scenario in Section 3.2, followed by Section 3.3, which formulates the centralized optimization problem of selecting the appropriate codec for each user and video segment. Lastly, Section 3.4 presents MEDUSA as a distributed technique to enhance the QoE of the users by enabling multi-codec delivery while addressing the scalability issues associated with a centralized solution.

3.1 Motivation

On average, newer video codecs provide a quality improvement compared to older ones for the same target bitrate. However, depending on the content complexity, video codecs provide varying compression efficiency for different video sequences. The motivation for our approach comes from a practical evaluation of the performance of video codecs on different types of content.

Each considered video sequence is split into 4-second segments and Figure 1(a)–(d) depict three subplots (top, middle, and bottom) representing the performance of HEVC, VP9, and AV1 compared to the baseline AVC. The adopted video sequences have different complexity in terms of spatial information (SI) and temporal information (TI) [13], whose average values are reported in the figure captions. We believe they cover well the spectrum of video content complexity since ToS1 has low SI and low TI, Gameplay has high SI and low TI, Rally has low SI and high TI, and ToS2 has high SI and high TI. Further details related to encoding configuration and parameters are provided in Section 4.

Fig. 1.

Besides the codecs, the three plots consider three different conditions: (1) SS-HV: true if the compared codec has a smaller size and higher VMAF than AVC; (2) BS-HV: true if the compared codec has a bigger size and higher VMAF than AVC; and (3) SS-LV: true if the compared codec has a smaller size and a lower VMAF than AVC. The top subplots in Figure 1(a)–(d) represent the percentage of encoded segments that meet the conditions mentioned above. This value alone does not provide information on the magnitude of the VMAF and the size difference. For this reason, the middle and bottom subplots depict the statistical properties of the set of segments satisfying the discussed conditions. In the middle subplots, the boxplots represent the VMAF difference between the compared codec and AVC; a positive VMAF difference implies that the selected codec leads to a higher VMAF than AVC. In the bottom subplots, the size difference in megabytes (MB) between the compared codec and AVC is illustrated; a negative size difference means that the compared codec leads to a reduced segment size compared to AVC.
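The three conditions can be expressed as a small classifier. The sketch below is only illustrative: the function names are hypothetical, strict inequalities are assumed for "smaller", "bigger", "higher", and "lower", and any segment that matches none of the three conditions (for example, bigger size and lower VMAF, or a tie) is labeled "other", a category the text does not name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (VMAF, size in MB) of the compared codec, then (VMAF, size in MB) of AVC.
Segment = Tuple[float, float, float, float]


def classify(vmaf: float, size: float, vmaf_avc: float, size_avc: float) -> str:
    """Label one segment of a compared codec relative to the AVC baseline."""
    if size < size_avc and vmaf > vmaf_avc:
        return "SS-HV"  # smaller size, higher VMAF
    if size > size_avc and vmaf > vmaf_avc:
        return "BS-HV"  # bigger size, higher VMAF
    if size < size_avc and vmaf < vmaf_avc:
        return "SS-LV"  # smaller size, lower VMAF
    return "other"      # not covered by the three named conditions


def condition_shares(segments: Iterable[Segment]) -> Dict[str, float]:
    """Percentage of segments per condition (what the top subplots report)."""
    labels = [classify(*seg) for seg in segments]
    counts = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in counts.items()}
```

The VMAF and size differences shown in the middle and bottom subplots would then be computed only over the segments sharing a given label.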

By examining the percentage of segments that satisfy a specific condition, notable similarities can be observed among the video sequences. Regardless of the particular video sequence, all codecs presented fulfill the SS-HV condition for most segments, indicating their superior compression efficiency compared to AVC. Across all video sequences, BS-HV is the second most frequent condition for AV1 and HEVC, whereas for VP9 the second most frequent condition is SS-LV, indicating lower VMAF scores and smaller segment sizes than AVC.

Figure 1(a) presents a comparison of codecs specifically for the least complex video sequence, ToS1. It is worth noting the similarities between HEVC and AV1 on this sequence, as they exhibit similar trends in the percentage of conditions met and yield comparable results. VP9 instead reduces the size down to 3.7 MB with a minimal quality decrease of 1.8 VMAF points for SS-LV, while for BS-HV we observe a limited improvement of up to 8.4 VMAF points compared to HEVC and AV1, both achieving up to 14.8 VMAF points.

More complex video sequences, such as ToS2, exhibit on average more constrained outcomes, displaying skewed distributions that highlight the compression challenges when encoding highly complex content. Notably, in Figure 1(b), AV1 outperforms HEVC for SS-LV reducing the segment sizes by up to 2.7 MB compared to AVC, while HEVC achieves a reduction of 1.8 MB. The quality decrease is for both below 0.8 VMAF points and, hence, negligible. Furthermore, AV1 demonstrates substantial improvements in VMAF compared to VP9 for BS-HV of up to roughly 17 VMAF points, with minimal size increase compared to AVC. Remarkably, these reductions are accompanied by imperceptible quality decreases.

Although only a few video segments meet SS-LV, HEVC achieves the best results for Gameplay, with a size reduction of up to 4.6 MB and a decrease of 0.3 VMAF points, compared to 4.3 MB for VP9 with a decrease of 2.7 VMAF points, and 2.4 MB for AV1 with a decrease of 1.3 VMAF points. Interestingly, for SS-HV, HEVC achieves both a higher quality and a smaller size than AVC for more than 80% of the segments, with VP9 and AV1 stopping at 64% and 72%, respectively.

Based on the aforementioned graphs and considerations, it becomes evident that on a coarse view each content influences the compression efficiency of a specific codec. Additionally, if we analyze the data on a finer scale, we notice that each segment encoded using the adopted codecs yields a different quality-size combination that necessitates individual analysis and comparison with other segments. Consequently, utilizing multiple codecs can prove advantageous in optimizing a video streaming session. Instead of relying solely on a single fixed codec, employing multiple codecs allows for variations in the decisions made by each codec, thereby mitigating the impact of content on the quality and size of the encoded segments. By considering the diverse characteristics of different codecs, it becomes possible to tailor the encoding process to individual segments, enhancing the overall streaming experience. Thus, alternative approaches must be explored to optimize video streaming sessions effectively, considering the variability of quality and size across different segments.

3.2 Overview

For each given video sequence, we focus on the problem of selecting the optimal codec for the video segments to be requested. Let us consider the scenario depicted in Figure 2.

Fig. 2.

Clients A and B access a streaming service provided by a media server with two different modalities. While client A requests the “next quality” representation from only one codec (single-codec), client B supports multiple codecs (multi-codec) and can, therefore, change codec dynamically over time. The idea is to compare similar representations from different codecs having the same position (#) in the multi-codec bitrate ladder (see Table 3). For this purpose, let us assume w.l.o.g. that the media server stores a video sequence encoded with multiple codecs (AVC, HEVC, VP9, and AV1, in the example) according to a specific multi-codec bitrate ladder.

Notation	Definition
Input parameters
\(u\in U\)	User in the set \(U\) of all streaming users
\(C_u\)	Set of the video codecs supported by user \(u\)
\(L_c\)	The set of incrementally ordered available bitrates \(\lbrace l_{c,r}\rbrace\) for codec \(c\) and representation \(r\)
\(\tau\)	The segment duration in seconds
\(\hat{i}_u\), \(\hat{c}_u\), \(\hat{r}_u\)	The index of the next segment, codec and representation selected by the underlying ABR algorithm for user \(u\), respectively
\(V\), \(G\)	The sets of VMAF and segment size values
\(b_u^{curr}\), \(b_u^{max}\)	The current buffer occupancy and maximum buffer capacity for user \(u\)
\(\gamma\)	The JND threshold used by MEDUSA
\(\hat{v}_u\), \(\hat{s}_u\)	VMAF and size for user \(u\) and segment \(\hat{i}_u\) with codec \(\hat{c}_u\) and representation \(\hat{r}_u\)
\(v_u^{max}\), \(s_u^{max}\)	Maximum VMAF and size for user \(u\) and segment \(\hat{i}_u\) with codec in \(C_u\) and representation \(\hat{r}_u\)
\(\alpha _u\)	Weight used in the objective function of user \(u\)
Variables
\(\psi _u\)	Value of the objective function of user \(u\)
\(x_{c,u}\)	Binary variable set to 1 if \(c\) is the optimal codec for user \(u\), representation \(\hat{r}_u\) and segment \(\hat{i}_u\)
\(c_u^*\)	The codec selected by MEDUSA for user \(u\)

Table 2. Notations used in the Article

#	AVC Bitrate	AVC Resolution	HEVC Bitrate	HEVC Resolution	VP9 Bitrate	VP9 Resolution	AV1 Bitrate	AV1 Resolution
1	100k	256 \(\times\) 144	145k	320 \(\times\) 180	145k	320 \(\times\) 180	145k	320 \(\times\) 180
2	200k	320 \(\times\) 180	180k	384 \(\times\) 216	180k	384 \(\times\) 216	180k	384 \(\times\) 216
3	375k	384 \(\times\) 216	350k	512 \(\times\) 288	350k	512 \(\times\) 288	350k	512 \(\times\) 288
4	550k	512 \(\times\) 288	500k	640 \(\times\) 360	500k	640 \(\times\) 360	500k	640 \(\times\) 360
5	750k	640 \(\times\) 360	700k	768 \(\times\) 432	700k	768 \(\times\) 432	700k	768 \(\times\) 432
6	1,000k	768 \(\times\) 432	900k	1,024 \(\times\) 576	900k	1,024 \(\times\) 576	900k	1,024 \(\times\) 576
7	1,500k	1,024 \(\times\) 576	1,400k	1,280 \(\times\) 720	1,400k	1,280 \(\times\) 720	1,400k	1,280 \(\times\) 720
8	3,000k	1,280 \(\times\) 720	2,750k	1,920 \(\times\) 1,080	2,750k	1,920 \(\times\) 1,080	2,750k	1,920 \(\times\) 1,080
9	5,800k	1,920 \(\times\) 1,080	5,500k	1,920 \(\times\) 1,080	5,500k	1,920 \(\times\) 1,080	5,500k	1,920 \(\times\) 1,080
10	7,500k	2,560 \(\times\) 1,440	7,000k	2,560 \(\times\) 1,440	7,000k	2,560 \(\times\) 1,440	7,000k	2,560 \(\times\) 1,440
11	12,000k	3,840 \(\times\) 2,160	11,000k	3,840 \(\times\) 2,160	11,000k	3,840 \(\times\) 2,160	11,000k	3,840 \(\times\) 2,160
12	17,000k	3,840 \(\times\) 2,160	15,000k	3,840 \(\times\) 2,160	15,000k	3,840 \(\times\) 2,160	15,000k	3,840 \(\times\) 2,160

Table 3. Multi-codec Bitrate Ladder (Bitrate/Resolution Pairs)

The streaming sessions start with the two clients fetching the same MPD, which contains essential information about the available representations, codecs, and URLs to fetch the video segments. The MPD and other streaming-related parameters like the buffer occupancy are parsed by the player core module, which is responsible for different tasks, such as generating URLs for the selected segments and calling the ABR function. Assume clients A and B adopt the same single-codec ABR algorithm, which selects a certain representation from the given bitrate ladder, defined by the codec chosen at the beginning of the streaming session. Unlike client A, client B can modify the selection performed by the ABR algorithm either via the edge server (Figure 2(a)) or locally (Figure 2(b)). The blue geared boxes refer to the two proposed implementations for codec selection.

It is worth noting that these techniques neither influence the number of codecs adopted in the encoding process nor add any overhead to, or trigger, the encoding process. However, the higher the number of supported codecs available, the higher the chance for the client to find the optimal representation among them. Additionally, increasing the number of codecs increases the cache storage required at the CDNs linearly. Nevertheless, since CDNs use proprietary caching strategies, we believe that the storage can be optimized through custom policies and fetching techniques.

The main difference between the two techniques regards scalability and the required additional metadata, such as perceived quality and segment size, that drive the codec selection. If the codec selection is performed on the edge server, the computational complexity of the selection depends on the number of streaming users, which clearly leads to scalability issues as this number grows. This analysis is discussed in Section 3.3 and visually presented in Figure 3.

Fig. 3.

On the other hand, performing the codec selection directly on the client would clearly tackle this problem. However, it would require metadata, such as quality (VMAF) and size of the video segments, to select the optimal codec, which must be conveyed to each client. This imposes a higher overhead than the centralized approach, in which the CDN server requires only one copy of the metadata from the origin server. These metadata can be conveyed in the form of an MPD, whose size for VoD content depends on the number of listed periods, adaptation sets, and representations. Considering 4 codecs and 12 representations per codec, each comprising 75 video segments, the size of the MPD can grow from 10.0 KB to 274.3 KB. However, existing compression techniques, such as the Efficient XML Interchange (EXI) [39], could reduce its size to 60.2 KB. Additionally, other techniques could be employed to convey size and VMAF information, such as Common Media Client Data (CMCD) [6], custom fields in the GET response packets, or piggybacking. In this way, the metadata would be spread across the video segments, and the per-request overhead would be negligible compared to storing all of it in the MPD.
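As a rough back-of-the-envelope reading of these figures, assume that the entire growth of the MPD is attributable to per-segment quality and size metadata and that 1 KB equals 1,000 bytes; both are assumptions made only for this illustration:

\begin{equation*} 4 \cdot 12 \cdot 75 = 3{,}600 \text{ segment entries}, \qquad \frac{(274.3 - 10.0)\ \text{KB}}{3{,}600} \approx 73\ \text{bytes per entry}, \qquad \frac{274.3\ \text{KB}}{60.2\ \text{KB}} \approx 4.6. \end{equation*}

Under these assumptions, each segment entry would add on the order of 73 bytes to the uncompressed MPD, and the EXI-compressed MPD would be about 4.6 times smaller than the uncompressed one.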

Both methods require information from the other end of the architecture to retrieve the quality and size of the specified segments. For this purpose, the technique on the edge server requires the segment index and representation selected by the ABR algorithm (which can be gathered from the requested URL), while the method on the client parses these values from the MPD. Furthermore, the centralized model on the server requires the buffer occupancy of each user to weight the impact of quality and size differently.

Once the optimal codec has been chosen, either the edge server fetches and serves the representation defined by the selected codec to the client (Figure 2(a)) or the player core module generates the URL received from the blue box and sends the request to the server to retrieve the corresponding segment (Figure 2(b)). Iterating this procedure for all segments can lead to a high dynamicity of codec selection, represented in Figure 2 with segments colored differently. It is worth noting that these techniques are agnostic to the available codecs and content at the CDN server, and also to the underlying ABR algorithms.

In Section 3.3, we first formulate the problem of dynamically selecting the optimal codec for the streaming users to maximize their QoE. Afterward, Section 3.4 presents our technique, MEDUSA, to solve the problem heuristically and move the computation to the clients.

3.3 Optimization Problem

Every user \(u\) accessing a streaming session supports a specific set of codecs \(C_u \subseteq C\), where \(C\) refers to the set of all video codecs. Consider a scenario where video segments, each \(\tau\) seconds long, are encoded according to several bitrate ladders \(L_c=\lbrace l_{c,1}, \ldots ,l_{c,N}\rbrace\) for each codec \(c \in C\), where \(l_{c,r}\in L_c\) denotes the \(r^{th}\) representation with \(N\) being the highest representation. It is worth noting that the same position (#) in the multi-codec bitrate ladder (see Table 3) corresponds to a fixed target bitrate.

Assume that the ABR algorithm employed by user \(u\) selects representation \(\hat{r}_u\) from codec \(\hat{c}_u \in C_u\) for segment with index \(\hat{i}_u\). Since changing the bitrate selected by the underlying ABR algorithm could reduce the QoE by causing stalls or quality switches, we do not change the bitrate selected by the underlying ABR algorithms. However, considering the buffer occupancy of the clients guarantees that changing the codec does not lead to any QoE degradation. Given the segment size and quality variations for different codecs depicted in Figure 1, a high buffer occupancy allows for a higher overhead in segment size justified by an increase of quality. On the other hand, a low buffer occupancy would force the codec selection to prioritize lower segment sizes.

Therefore, we formulate a Mixed Binary Linear Programming (MBLP) model to select the optimal codec \(c^{*}_{u}\) for each user \(u \in U\) for segment \(\hat{i}_u\). Table 2 shows the defined sets and variables. We introduce the binary variable \(x_{c,u}\), where \(x_{c,u}=1\) indicates that the next segment \(\hat{i}_u\) for user \(u\) should be requested with representation \(\hat{r}_u\) (selected by user \(u\)) and codec \(c\):

\begin{equation} \sum _{c\in C_u} x_{c,u} = 1 , \forall u \in U. \end{equation}

(1)

To drive the selection of the optimal codec, the model needs information on the VMAF quality and size of the video segments, included in \(V\) and \(G\) sets, respectively. Based on the value of \(x_{c,u}\), user \(u\) will request a segment with VMAF \(\hat{v}_u\) (Equation (2)) and size \(\hat{s}_u\) (Equation (3)):

\begin{equation} \hat{v}_u=\sum _{c\in C_u} V_{\hat{i}_u,\hat{r}_u,c} \cdot x_{c,u} , \forall u \in U \end{equation}

(2)

\begin{equation} \hat{s}_u=\sum _{c\in C_u} G_{\hat{i}_u,\hat{r}_u,c} \cdot x_{c,u} , \forall u \in U \end{equation}

(3)

Fulfilling the requests of the clients implies that the edge server running the optimization model must fetch all the encoded segments with size \(\hat{s}_u\) from the media/CDN server. Assuming a bandwidth \(\Phi\) at the link between edge and media/CDN server and the maximum additional latency \(\Delta\) of the fetching operation, the model must fulfill the following constraint:

\begin{equation} \sum _{u \in U} \hat{s}_u \le \Phi \cdot \Delta. \end{equation}

(4)

Furthermore, we define \(\alpha _u \in [0,1]\), a parameter that controls the impact of the VMAF on the segment selection. It is computed as the ratio between the current buffer occupancy \(b_u^{curr}\) and the maximum buffer capacity \(b_u^{max}\). To determine the codec \(c^*_u\) to select for each user \(u \in U\), the model relies on the following objective function:

\begin{align} \psi _u = \alpha _u \cdot \frac{\hat{v}_u}{v_u^{max}} - (1-\alpha _u) \cdot \frac{\hat{s}_u}{s_u^{max}}, \forall u \in U, \end{align}

(5)

where \(\alpha _u = \frac{b_u^{curr}}{b_u^{max}}\), \(v_u^{max} = \max _{c \in C_u} V_{\hat{i}_u,\hat{r}_u,c}\), and \(s_u^{max} = \max _{c \in C_u} G_{\hat{i}_u,\hat{r}_u,c}\). As formulated above, \(v_u^{max}\) and \(s_u^{max}\) are the maximum VMAF and segment size values, respectively, of segments with index \(\hat{i}_u\) and representation \(\hat{r}_u\). Given the defined variables, we formulate the MBLP model as follows:

\begin{align} &\text{Maximize:} \quad \min _{u \in U} \psi _u \\ &\text{s.t.:} \quad \text{Constraints (1)--(5)}. \end{align}

(6)
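As a numerical illustration of Equation (5), the following minimal sketch evaluates \(\psi _u\) for two candidate codecs at a low and a high buffer level. The VMAF and size values, the codec labels, and the helper names `utility` and `best_codec` are hypothetical and chosen only to show how \(\alpha _u\) shifts the preference from the smaller segment to the higher-quality one.

```python
def utility(vmaf, size, v_max, s_max, b_curr, b_max):
    """Equation (5): buffer-weighted trade-off between normalized VMAF and size."""
    alpha = b_curr / b_max
    return alpha * vmaf / v_max - (1 - alpha) * size / s_max


# Hypothetical (VMAF, size in MB) per codec for one segment and ladder position.
candidates = {"av1": (92.0, 2.0), "hevc": (95.0, 2.6)}
v_max = max(v for v, _ in candidates.values())
s_max = max(s for _, s in candidates.values())


def best_codec(b_curr, b_max=20.0):
    """Return the codec with the highest utility for the given buffer level."""
    return max(
        candidates,
        key=lambda c: utility(*candidates[c], v_max, s_max, b_curr, b_max),
    )


low_buffer_choice = best_codec(4.0)    # alpha = 0.2, size dominates
high_buffer_choice = best_codec(18.0)  # alpha = 0.9, quality dominates
```

With these illustrative numbers, the smaller segment wins at 4 s of buffered video, whereas the higher-VMAF segment wins at 18 s.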

The main advantage of maximizing the minimum utility \(\psi _u\), rather than the average utility, is that the model selects the optimal codec configuration for each user \(u\). It is trivial to prove that this leads to the optimal codec selection for each user \(u\). However, the problem is NP-hard (non-deterministic polynomial-time hard), as can be proved by reduction from the binary Knapsack problem, and it presents scalability issues [8, 36]. Figure 3 presents the execution time (s) required by such a model, implemented in Python with the PuLP library, to find the optimal solution based on the number of users and available codecs. Since the execution time increases exponentially with the number of users, in Section 3.4, we propose MEDUSA, a distributed plug-in ABR algorithm, and move the computation directly to the client.
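To make the structure of the max-min model and its scaling behavior concrete, the sketch below solves a tiny instance by exhaustive enumeration. It is not the PuLP implementation referred to above. The function name `solve_max_min`, the dictionary layout, and all numerical values are assumptions made for illustration. Since every user contributes one factor to the search space, the loop visits \(\prod _{u \in U} |C_u|\) assignments, which mirrors the exponential growth in the number of users.

```python
from itertools import product


def solve_max_min(users, V, G, phi, delta):
    """Exhaustively solve the max-min codec selection for a tiny instance.

    users: user -> {"codecs": [...], "seg": i, "rep": r, "b_curr": .., "b_max": ..}
    V, G:  (segment, representation, codec) -> VMAF / segment size
    phi * delta: data budget on the edge-to-CDN link, constraint (4)
    """
    names = list(users)
    best_value, best_assignment = None, None
    # Each tuple picks exactly one codec per user, which enforces constraint (1).
    for choice in product(*(users[u]["codecs"] for u in names)):
        total_size, psi_values = 0.0, []
        for u, c in zip(names, choice):
            info = users[u]
            key = (info["seg"], info["rep"])
            v_hat, s_hat = V[key + (c,)], G[key + (c,)]  # Equations (2), (3)
            v_max = max(V[key + (k,)] for k in info["codecs"])
            s_max = max(G[key + (k,)] for k in info["codecs"])
            alpha = info["b_curr"] / info["b_max"]
            psi_values.append(alpha * v_hat / v_max - (1 - alpha) * s_hat / s_max)
            total_size += s_hat
        if total_size > phi * delta:  # constraint (4) violated
            continue
        worst = min(psi_values)
        if best_value is None or worst > best_value:
            best_value, best_assignment = worst, dict(zip(names, choice))
    return best_assignment, best_value


# Hypothetical instance: two users, same segment and ladder position.
V = {(0, 3, "av1"): 92.0, (0, 3, "hevc"): 95.0}
G = {(0, 3, "av1"): 2.0, (0, 3, "hevc"): 2.6}  # sizes in MB
users = {
    "a": {"codecs": ["av1", "hevc"], "seg": 0, "rep": 3, "b_curr": 4.0, "b_max": 20.0},
    "b": {"codecs": ["av1", "hevc"], "seg": 0, "rep": 3, "b_curr": 18.0, "b_max": 20.0},
}
assignment, value = solve_max_min(users, V, G, phi=5.0, delta=1.0)
```

Tightening the link budget \(\Phi \cdot \Delta\) in this toy instance forces both users onto the smaller segments, and an overly tight budget leaves no feasible assignment, in which case the sketch returns `(None, None)`.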

3.4 Heuristic Distributed Algorithm

To reduce the complexity of the problem defined in Section 3.3 while enabling multi-codec delivery and improving the QoE, we propose MEDUSA, a distributed plug-in ABR algorithm that runs on each client and enhances the segment selection of traditional ABR algorithms by changing the codec dynamically within the streaming session.

There are multiple advantages in adopting a plug-in-like approach. First, it can be less time-consuming and complex than creating or modifying the existing code of multiple ABR algorithms, which can become a burden depending on the number of implementations to customize. Furthermore, separating the ABR algorithm from the plug-in allows for easy replacement or updates, streamlining maintenance and enabling the introduction of new features. Having stated the importance of our design choices, let us consider the same scenario described in Section 3.2 and the parameters defined in Section 3.3.

Assume client A and client B rely on the same underlying ABR algorithm and set of supported codecs \(C_u\). If the next segment for both clients has index \(\hat{i}_ u\), the selected representation is defined as \(\hat{r}_u\in L_{c}\). Furthermore, suppose that client B exploits MEDUSA on top of the adopted ABR algorithm to enable multi-codec delivery. While client A locks the selected representation \(\hat{r}_u\), MEDUSA, acting on client B, modifies and enhances the selection, if needed. The steps involved in this process are shown in Algorithm 1.

The algorithm takes nine variables as input, obtained from the initial URL, the MPD, and the buffer: (i) the index \(\hat{i}_u\) of the next segment, (ii) the codec \(\hat{c}_u\) and (iii) the representation \(\hat{r}_u\) previously selected by the ABR algorithm, (iv) the set \(C_u\) of supported codecs, (v) the set \(V\) of VMAF values, (vi) the set \(G\) of segment sizes, (vii) the current buffer occupancy \(b_u^{curr}\), (viii) the maximum buffer capacity \(b_u^{max}\), and (ix) \(\gamma\), which refers to the Just-Noticeable Difference (JND), the minimum amount of VMAF change noticeable by the user. From the literature we know that the JND is of high importance in assessing the quality of video content [15]. If the quality difference between two consecutive video segments is lower than one JND, the difference is unlikely to be noticed by the user. Otherwise, this quality difference is likely to be perceived. Clearly, if this difference is positive, the user is going to appreciate the quality increase. If this difference is negative, the quality decrease is likely to negatively affect the QoE of the user.

The output comprises only the selected codec \(c_u^*\) corresponding to the maximum objective value, since representation \(\hat{r}_u\) and segment index \(\hat{i}_u\) remain unaltered. Returning only \(c_u^*\) reduces the computational search complexity from \(\mathcal {O}(|C_u||L_{c}|)\) to \(\mathcal {O}(|C_u|)\), minimizing the additional selection overhead at the cost of a possibly sub-optimal decision.

First, we retrieve the VMAF \(\hat{v}_u\) (line 1) and the segment size \(\hat{s}_u\) (line 2) of the initial selection. Then, we identify the maximum VMAF, \(v_u^{max}\) (line 3), and the maximum segment size, \(s_u^{max}\) (line 4), among the supported codecs, which are later used for normalization. Next, \(\alpha _u\) is computed from the current buffer occupancy \(b^{curr}_u\) and the maximum buffer capacity \(b^{max}_u\) (line 5). Since \(\alpha _u\) is directly proportional to the buffer occupancy, using it to weight the quality means that we can download a high-quality segment in case of high buffer occupancy while keeping the probability of experiencing a stall event low. Then we compute the value of the objective function for the given initial selection and assign it to the variable \(\psi _u\) (line 6). The objective function, as mentioned before, considers two factors: (i) the VMAF quality \(\hat{v}_u\) of the selected segment, weighted with \(\alpha _u\), and (ii) the size \(\hat{s}_u\) of the selected segment, weighted with \(1-\alpha _u\). The objective function is to be maximized. Therefore, the quality appears as a positive factor and the segment size as a negative factor. Before starting the loop, we initialize the output \(c_u^*\) with the input \(\hat{c}_u\). Then, for every supported codec \(x\) (line 8), we first acquire the corresponding VMAF value \(V_{\hat{i}_u,\hat{r}_u,x}\) (line 9) and segment size \(G_{\hat{i}_u,\hat{r}_u,x}\) (line 10). Afterward, we compute the temporary objective function \(\psi ^{^{\prime }}\) (line 11).

If this value is greater than that of the objective function \(\psi _u\), we consider codec \(x\) as a possible candidate for the selection (line 12). However, we want to exclude the possibility that the objective function drives the selection to a representation with much lower VMAF than that of the original selection. Therefore, we make sure that the difference between the VMAF of the initial selection and that of the current segment is lower than one JND (line 13). We note that adding this condition to the optimization model described in Section 3.3 requires defining a nonlinear constraint. Nevertheless, it is a beneficial constraint to ensure that every selection made by MEDUSA does not impact the perceived quality. It is worth noting that, since this value is positive, this condition is met even for current VMAF higher than that of the initial selection (\(v \gt \hat{v}_u\)). If both conditions (lines 12–13) are met, we update the objective function (line 14) and the codec selection (line 15).

Lastly, when the comparisons are terminated, we return the chosen codec to the player core module that is in charge of requesting the newly selected representation and switching, if needed, to a different codec than the one previously selected (line 16).
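The steps described above can be sketched in Python as follows. This is a hedged reading of the textual description, not the authors' listing of Algorithm 1. The function name `medusa_select`, the dictionary-based layout of \(V\) and \(G\) keyed by (segment, representation, codec), and the example values are assumptions made for illustration.

```python
def medusa_select(i, c_hat, r_hat, codecs, V, G, b_curr, b_max, gamma):
    """Refine the codec of the ABR selection; segment and ladder position stay fixed."""
    v_hat = V[(i, r_hat, c_hat)]                    # VMAF of the initial selection
    s_hat = G[(i, r_hat, c_hat)]                    # size of the initial selection
    v_max = max(V[(i, r_hat, c)] for c in codecs)   # normalization terms
    s_max = max(G[(i, r_hat, c)] for c in codecs)
    alpha = b_curr / b_max                          # buffer-driven weight

    def objective(v, s):
        return alpha * v / v_max - (1 - alpha) * s / s_max

    psi = objective(v_hat, s_hat)
    c_star = c_hat
    for x in codecs:
        v = V[(i, r_hat, x)]
        s = G[(i, r_hat, x)]
        psi_tmp = objective(v, s)
        # Accept a codec only if it improves the objective and its VMAF is
        # less than one JND below the VMAF of the initial selection.
        if psi_tmp > psi and (v_hat - v) < gamma:
            psi = psi_tmp
            c_star = x
    return c_star


# Hypothetical metadata for segment 7 at ladder position 3 (sizes in MB).
V = {(7, 3, "av1"): 92.0, (7, 3, "hevc"): 95.0, (7, 3, "avc"): 85.0}
G = {(7, 3, "av1"): 2.0, (7, 3, "hevc"): 2.6, (7, 3, "avc"): 1.5}
codecs = ["av1", "hevc", "avc"]

high_buffer = medusa_select(7, "av1", 3, codecs, V, G, b_curr=18.0, b_max=20.0, gamma=4.0)
low_buffer = medusa_select(7, "av1", 3, codecs, V, G, b_curr=4.0, b_max=20.0, gamma=4.0)
```

In this toy instance, the high-buffer call switches to the higher-quality codec. The low-buffer call keeps the initial codec because the only candidate with a better objective value loses 7 VMAF points, which exceeds the JND threshold of 4.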

4 Experiments and Discussion

4.1 Methodology and Evaluation Setup

Experimental setup. Our testbed is based on CAdViSE [35] and comprises two Amazon EC2 instances, i.e., an HTTP server and an HTTP client, which communicate with each other via an Internet connection. Furthermore, we exploit AStream,² a headless player written in Python, to handle the streaming session. To simulate different scenarios, we shape the network connection between the server and the client using Wondershaper,³ according to 4G-LTE [28] and Amazon FCC [5] network traces as shown in Figure 4(a) and (b) for all experiments. Additionally, for each run, we adopt the network traces starting from a random point in the timeline. It is worth noting that, although the entry points are randomized, to guarantee a fair comparison, each ABR technique has the same trace for the same run.
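The randomized yet fair trace assignment described above can be sketched as follows. The helper names, the toy throughput samples, the fixed seed, and the wrap-around at the end of the trace are assumptions for illustration; the text only states that each run starts at a random point and that all ABR techniques share the same trace within a run.

```python
import random


def trace_offsets(trace_length, num_runs, seed=0):
    """Draw one random entry point (sample index) per run, reproducibly."""
    rng = random.Random(seed)
    return [rng.randrange(trace_length) for _ in range(num_runs)]


def rotate_trace(trace, offset):
    """Start the trace at `offset` and wrap around at the end (assumption)."""
    return trace[offset:] + trace[:offset]


abr_names = ["AGG", "BOLA", "SARA", "BBA-0"]
trace = [1.0, 3.5, 7.0, 2.0, 1.0, 4.0]  # hypothetical throughput samples (Mbps)
offsets = trace_offsets(len(trace), num_runs=20)

# Every ABR technique replays the same rotated trace within a given run.
schedule = {
    run: {abr: rotate_trace(trace, offset) for abr in abr_names}
    for run, offset in enumerate(offsets)
}
```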

Fig. 4.

Every ABR algorithm in its original approach (without MEDUSA) requests segments encoded with AV1 since it is reported to outperform the other selected codecs [4, 15]. We employed four test videos, as reported in Section 3.1: (i) Tears of Steel (the first 5 minutes) – ToS1, (ii) Gameplay⁴ [37], (iii) Rally,⁵ and (iv) Tears of Steel (the last 5 minutes) – ToS2. The complexity of these sequences is reported in the captions of Figure 1(a)–(d) and visually presented in Figure 4(c).

Videos and encoding parameters. The videos are encoded into the bitrate ladder presented in Table 3 with double-pass encoding and slow preset using ffmpeg⁶ with libx264, libx265, libvpx-vp9, and libsvtav1. This bitrate ladder is derived from the DASH 8K multi-codec dataset, as outlined in [34], but it has been adjusted to focus on 4K resolution. However, since the dataset does not include VP9, we have assigned the same bitrate ladder as that of HEVC for VP9.

The segment duration \(\tau\) is 4 s as recommended in [3]. At the client, the buffer capacity \(b_{max}\) is set to either 20 s, to test the responsiveness of the adaptation algorithm, or 40 s, as proposed by Juluri et al. [14]. The JND quality threshold \(\gamma\) for the proposed method is set to 4 VMAF points, obtained as the mean of 2 [15] and 6 [24] VMAF points, as mentioned in the literature and industry. The reason why \(\gamma =4\) leads to the highest QoE is that with \(\gamma =2\) the requested segments will have higher size than the selection from the underlying ABR algorithm although the VMAF quality difference is negligible, increasing the chance of stall events. On the other hand, a quality difference of \(\gamma =6\) does not happen often among codecs and, hence, the codec switches over time are insufficient for MEDUSA to show a real impact on the QoE. We compare our proposed method, MEDUSA, with state-of-the-art approaches described in Section 2: AGG, BOLA, SARA, BBA-0. Each experiment is run 20 times, and the experimental results represent the mean values.

Evaluation metrics. In this article, the following metrics are used:

(i) mean bitrate as the average bitrate of all segments played out;

(ii) mean VMAF as the average VMAF value of all segments reproduced by the client;

(iii) video instability denoting the average difference in the VMAF values of two adjacent segments, calculated as Equation (7), where \(n \in \lbrace 1, \ldots , K-1\rbrace\) represents the segment number and \(v_n\) the VMAF value of the \(n\)th segment;

(iv) transmitted data expressed as the overall segment data transmitted from the server to the client within the streaming session;

(v) stall duration and number of stalls expressed as the overall time that the buffer is empty and the reproduction stops and the number of times these events occur, respectively;

(vi) codec switches as the number of codec switching events between two consecutive segments;

(vii) QoE score according to the extended version [27]⁷ of the original ITU-T P.1203 mode 0 [12].

It is worth noting that the bitrate values presented in Section 4.2 correspond to the information captured from the manifest, which can differ from the real content bitrates.

\begin{equation} I = \frac{1}{K}\sum _{n=1}^{K-1} |v_n - v_{n+1}|. \end{equation}

(7)
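Two of the listed metrics can be computed directly from per-segment logs, as the following minimal sketch shows. The function names and sample values are hypothetical; the instability function follows Equation (7) as printed, i.e., the sum of \(K-1\) absolute differences divided by \(K\).

```python
def vmaf_instability(vmaf):
    """Equation (7): sum of |v_n - v_{n+1}| over adjacent segments, divided by K."""
    k = len(vmaf)
    if k == 0:
        raise ValueError("at least one segment is required")
    return sum(abs(a - b) for a, b in zip(vmaf, vmaf[1:])) / k


def codec_switches(codecs):
    """Count consecutive segment pairs that were played with different codecs."""
    return sum(1 for a, b in zip(codecs, codecs[1:]) if a != b)


# Hypothetical session of four segments.
session_vmaf = [90.0, 94.0, 94.0, 88.0]
session_codecs = ["av1", "av1", "hevc", "av1"]

instability = vmaf_instability(session_vmaf)  # (4 + 0 + 6) / 4
switches = codec_switches(session_codecs)     # av1->hevc and hevc->av1
```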

Experimental scenarios. In Section 4.2, we present two different scenarios. In Scenario I, we compare MEDUSA with state-of-the-art approaches according to different buffer capacities (20 s and 40 s) and network traces, as mentioned in the experimental setup. To account for all video sequences, we merged the results into one distribution and present them as box plots. In Scenario II, we present the results for MEDUSA and state-of-the-art approaches for one specific combination of fixed parameters (buffer capacity and network trace) highlighting the impact of each video content on the results. The notations AGG-M, BBA-0-M, BOLA-M, and SARA-M, represent combinations of the ABRs AGG, BBA-0, BOLA, and SARA, respectively, with the proposed approach MEDUSA.

4.2 Results and Analysis

4.2.1 Scenario I.

We evaluate MEDUSA’s performance compared to state-of-the-art approaches according to the metrics introduced earlier. Figures 5–12 show the performance difference of ABR algorithms equipped with MEDUSA with respect to the baseline, i.e., without MEDUSA. Arrows indicate whether higher (\(\uparrow\)) or lower (\(\downarrow\)) values are better. The green triangle reported for each distribution represents its mean value, i.e., the arithmetic average of the values in the distribution.

Fig. 5.

Fig. 6.

Fig. 7.

Fig. 8.

Fig. 9.

Fig. 10.

Fig. 11.

Fig. 12.

Bitrate. Figure 5(a) and (b) illustrate the difference in bitrates requested by MEDUSA compared to the baseline (i.e., without MEDUSA) for 20 s and 40 s buffer capacity. A positive value means that MEDUSA obtains a bitrate increase with respect to the underlying ABR technique. Although “the higher the better” is reported in the graph, a higher bitrate value does not necessarily correspond to a higher perceived quality nor to a higher amount of transmitted data, as it refers to the bitrate reported in the manifest, hence averaged across all video segments. Looking at AGG-M and SARA-M, we notice a similar trend between network traces and two buffer scenarios. For both network traces, the bitrate distribution remains significantly below 0, decreasing down to roughly -30% for AGG-M and -33% for SARA-M. This implies that MEDUSA for AGG and SARA behaves more conservatively than the respective techniques and frequently switches representations to codecs with lower bitrates. It is interesting to note that for AGG-M and 4G-LTE, the mean value is lower than the median, which implies the effect of several negative outliers. BOLA-M exhibits a behavior in line with buffer-based ABR algorithms, boosting the mean bitrate when the buffer capacity is increased. This increase is attributed to the ample availability of buffer which leads MEDUSA’s results for 4G-LTE to a median change from -5.5% to 3.2% compared to BOLA. The highest bitrate increment is reached by BBA-0-M, whose distribution, benefiting from the buffer increase to 40 s, consistently improves the bitrate by up to 27% compared to BBA-0. The wide distribution range for 20 s buffer capacity comes from the different input video content (analyzed in Section 4.2.2).

VMAF. Figure 6(a) and (b) depict the difference in VMAF requested by MEDUSA compared to the baseline (i.e., without MEDUSA) for 20 s and 40 s buffer capacity, respectively. Higher values correspond to higher VMAF values for MEDUSA. In Figure 6(a) and in accordance with the bitrate changes discussed above, for both 20 s and 40 s buffer capacity, MEDUSA results in a VMAF decrease for AGG-M of up to 1 JND, which happens for limited cases, and in a VMAF increase for BBA-0-M which surpasses 1 JND for 20 s and touches 2 JNDs for 40 s. Given the consistent bitrate reduction for BOLA-M and SARA-M, it is interesting to note that MEDUSA can increase the VMAF by up to 1 JND compared to the respective underlying approach, which contrasts the general assumption that a lower bitrate maps to a lower quality. Increasing the buffer capacity leads SARA-M for 4G-LTE to a general decrease in VMAF of up to 1 JND, in accordance with the prominent bitrate reduction presented above. On the other hand, FCC drives SARA-M to obtain a consistent VMAF increase with respect to SARA by up to 5 VMAF points, reflecting an improvement of more than 1 JND.

VMAF instability. Figure 7(a) and (b) show the VMAF instability for 20 s and 40 s buffer capacity, respectively. A negative value implies that MEDUSA achieves a lower instability than the compared underlying ABR technique. In Figure 7(a) we notice higher variability and peaks than in Figure 7(b). This can be explained by the lower buffer range, which influences \(\alpha\) and, hence, the bitrate selection. Since the buffer occupancy is modeled as the number of stored segments, with a higher buffer, the discrete weight values have higher precision, which guarantees a more stable bitrate selection. Furthermore, we observe that the higher the mean VMAF, the lower the mean VMAF instability.
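The effect of the buffer capacity on the granularity of \(\alpha\) can be illustrated with a short sketch. It assumes, as stated above, that the buffer occupancy is counted in whole stored segments of \(\tau = 4\) s; the function name is hypothetical.

```python
def attainable_alphas(buffer_capacity_s, segment_duration_s=4):
    """List the alpha values reachable when the buffer holds whole segments only."""
    slots = buffer_capacity_s // segment_duration_s
    return [n / slots for n in range(slots + 1)]


alphas_20 = attainable_alphas(20)  # 5 segment slots, step 0.2
alphas_40 = attainable_alphas(40)  # 10 segment slots, step 0.1
```

Under this assumption, a 20 s buffer yields six distinct weights, whereas a 40 s buffer yields eleven, so adding or removing one segment changes the weight by 0.1 instead of 0.2.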

For 20 s buffer capacity, the mean VMAF instability difference is contained within -4% and 4%, showing a consistent instability trend with and without MEDUSA. The only exception is BOLA-M using FCC, whose median and mean have very different values, roughly 2% and 8%, respectively. This means that the original distribution includes many positive outliers for which the instability value is much higher than for the rest of the distribution. The reason for these two opposite behaviors is due to the network trace itself.

FCC, unlike 4G-LTE, maintains a stable throughput of 1 Mbps for more than 80 s, which then suddenly fluctuates between 1 Mbps and 7 Mbps. Since MEDUSA relies on the buffer to balance quality and size, if the buffer is low, BOLA’s selection results in low quality, and MEDUSA prioritizes reducing size, which decreases VMAF and increases instability. If the buffer is high, MEDUSA prioritizes high VMAF regardless of size, increasing instability.
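The gap between median and mean reported for BOLA-M on FCC is the typical signature of a few positive outliers. The sketch below reproduces that pattern with hypothetical per-run instability values (not measured data) and the standard `statistics` module; the helper name is an assumption.

```python
from statistics import mean, median


def relative_change_percent(with_medusa, baseline):
    """Per-run change of a metric relative to the baseline, in percent."""
    return [100.0 * (m - b) / b for m, b in zip(with_medusa, baseline)]


# Hypothetical values: four runs change by about 2%, one run by about 32%.
baseline = [5.0, 5.0, 5.0, 5.0, 5.0]
with_medusa = [5.1, 5.1, 5.1, 5.1, 6.6]

changes = relative_change_percent(with_medusa, baseline)
median_change = median(changes)  # robust to the single outlier
mean_change = mean(changes)      # pulled upward by the outlier
```

Here a single outlying run leaves the median at about 2% while lifting the mean to about 8%.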

For 40 s buffer capacity, MEDUSA mostly improves the VMAF instability compared to the baseline algorithms. The maximum VMAF instability reduction of 36.8% is obtained by BOLA-M compared to BOLA. The maximum VMAF instability increase is observed for AGG-M, up to roughly 14% more than AGG.

Transmitted data. Figure 8(a) and (b) represent the comparison in transmitted data between MEDUSA and the underlying ABR technique for 20 s and 40 s buffer capacity. Lower values are preferred, as they refer to a reduction in transmitted data for MEDUSA. Comparing with the bitrate results from Figure 5(a) and (b), we can notice some similarities. However, the trend of bitrate and data changes does not always match, motivating the client’s need to be informed about the segment sizes prior to the segment selection. In fact, although MEDUSA for AGG-M follows the bitrate results reducing transmitted data compared to AGG for both 20 s and 40 s buffer capacity, BBA-0-M has a consistent negative median value for all settings while achieving the highest mean bitrate. The maximum discrepancy occurs for 40 s buffer capacity, when MEDUSA achieves more than 26% bitrate increase compared to BBA-0 while practically limiting the size increase by up to 1%. This shows that the bitrate is a superficial metric and that MEDUSA strikes a real tradeoff between size and quality, according to the mentioned increase in VMAF. Unlike in the 20 s buffer capacity case, MEDUSA for BOLA-M with 40 s buffer capacity and 4G-LTE consistently reduces the size by up to 10% compared to an increase in bitrate of the same percentage. Following the bitrate trend, SARA-M provides a consistent reduction in transmitted data for both 20 s (of up to 19.3%) and 40 s (of up to 41.8%) buffer capacity.

Number and duration of stalls. According to Figure 9(a) and (b), the reduction in bitrate and size has positive effects on the number of stalls, which are consistently decreased for almost all combinations of ABR algorithms and MEDUSA. A negative value means that MEDUSA reduces the number of stall events with respect to the underlying ABR approach. The same holds for the duration of the stalls, shown in Figure 10(a) and (b). The highest visible reduction in the number of stalls is with 20 s buffer capacity for AGG-M, with 1–7 fewer stalls for 4G-LTE and 2–5 fewer stalls for FCC, motivated by the bitrate and size reductions presented above. These results confirm the importance of trading-off between quality and size to reduce the amount of transmitted data while maintaining high video quality. Although the number of stalls decreases, the duration of the stalls remains constant for FCC or slightly increases for 4G-LTE. This means that AGG-M achieves fewer but longer stalls compared to AGG. Increasing the buffer capacity to 40 s has a positive impact on AGG-M, which can better cope with excessive throughput estimation. This leads to a reduction in number and duration of stalls for both network traces. For BBA-0, BOLA, and SARA, the buffer plays a strategic role in assessing decisions. To a lesser degree than for AGG-M, with 20 s buffer capacity, BBA-0-M occasionally reduces the number of stalls to 2. As mentioned above, FCC is a more stable trace than 4G-LTE, allowing BBA-0-M to better manage the buffer during turbulent periods. MEDUSA improves decisions when buffer occupancy is good, but exacerbates issues when decisions are poor. This behavior leads to a similar number of stalls but longer stall durations in 4G-LTE. Increasing the buffer capacity to 40 s does not cause any stalls for BBA-0 and BBA-0-M for most runs. 
Some outliers are pushing the mean value for BBA-0-M toward the negative region, meaning that BBA-0-M occasionally reduces the number of stalls compared to BBA-0. To improve the playback smoothness, BOLA acts more conservatively than BBA-0, requesting lower bitrates. This impacts also MEDUSA, which improves BOLA’s actions and reduces the number of stalls under unstable network conditions. Indeed, for 4G-LTE, we can notice a decrease down to 2 stalls, with a trend similar for BBA-0-M. However, compared to BBA-0-M, the reduction in the duration of stalls for BOLA-M is consistent and reaches 9.3 s. SARA-M encounters fewer stalls than SARA in 4G-LTE with a 20 s buffer capacity. The difference in stall duration ranges from -7.1 s to 4.0 s, depending on the video sequence. For FCC, SARA-M reduces the number of stalls. Occasionally, there is a slight increase in the number of stalls, but a significant decrease in the duration of stalls. Increasing the buffer to 40 s, SARA-M is unable to reduce stalls in the FCC case, although the mean transmitted data volume is lower than for SARA. Therefore, there are no differences in the mean number and duration of stalls. The SARA results for 4G-LTE present a more interesting trend, with a reduction in the number of stalls to 1 and in the duration of the stalls to 22.6 s.

Codec switches. Figure 11(a) and (b) represent the difference in the number of codec switches for MEDUSA compared to the baseline with 20 s and 40 s buffer capacity, where each switch event refers to playing a video segment with a different codec than the previous one. Lower values are preferred, as they indicate that MEDUSA leads to a lower frequency in changing the codec. Higher values, however, depict the need for dynamic codec switching over time. It is interesting to note that AGG-M follows a similar trend for 20 s and 40 s buffer capacity since the underlying AGG does not consider the buffer to select the next segment. The number of codec switches ranges from 14 to 48 for both 4G-LTE and FCC. BBA-0-M is shown to be very sensitive to the buffer, which is detected by BBA-0 for the initial selection and then by MEDUSA to choose the right codec, with a wide distribution from 4 to 54 for 20 s and from 8 to 48 for 40 s buffer capacity. BOLA-M provides a similar distribution for 4G-LTE and FCC with 20 s buffer capacity within 20 and 50 while in the 40 s buffer capacity scenario, 4G-LTE leads to more switches than FCC. In particular, the inter-quartile range is within 22 and 38 for 4G-LTE and within 10 and 21 for FCC. With a behavior similar to BOLA-M for 20 s, SARA-M has a slightly wider distribution for FCC, from 12 to 49, with the median below 30. For 40 s buffer capacity, FCC leads to a tighter distribution from 19 to 39 compared to 4G-LTE, which extends from 20 to 52. The explanation for the high variability in distributions is the different number of switches for different video sequences.

QoE. The metrics explained so far give an overview of MEDUSA’s impact on a video streaming session. Combining them according to [27], we obtain the final QoE measurements, which are presented in Figure 12(a) and (b) as the differences between the QoE scores by MEDUSA and the ones achieved by the underlying ABR algorithm without MEDUSA. Therefore, a positive value implies that MEDUSA achieves a better QoE score than the underlying ABR algorithm. Although reducing the overall volume of requested data compared to AGG, MEDUSA (AGG-M) increases the QoE by up to 30% (for 20 s buffer capacity scenario and 4G-LTE) due to a significant reduction in the number of stalls during the streaming session. A similar trend is observed for FCC. However, it is worth noting that although the bitrate is consistently reduced (by up to 30%), the mean decrease in VMAF is lower than 1 JND, which means that it has no impact on the quality perceived by the user. Since the considered ITU-T P.1203 model maps the bitrate to perceived quality, the expected real QoE improvement is probably even higher. Increasing the buffer capacity to 40 s shows a similar trend with MEDUSA (AGG-M) showing a QoE improvement of up to 35% over AGG. MEDUSA improves the QoE for BBA-0-M and BOLA-M compared to the respective underlying ABR algorithms with a QoE improvement of up to 11% for 20 s buffer capacity, with a similar trend for both network traces. A few samples in the distributions of Gameplay and Rally lie below zero for the 4G-LTE trace, similarly to BBA-0-M and AGG-M for FCC. Similar results are obtained by increasing the buffer capacity. For 20 s buffer capacity, SARA-M is the ABR technique achieving the highest mean QoE score. For 4G-LTE, SARA-M’s distribution reaches up to 41.7% improvement. For FCC, SARA-M performs 50% of the times better and 50% of the times worse than SARA. 
Here the content is extremely important; while SARA-M leads to a QoE improvement compared to SARA for all considered video sequences, it also reduces the QoE by up to 10.5% compared to SARA when streaming Gameplay. With 40 s buffer capacity, MEDUSA (SARA-M) behaves similarly to SARA.

To assert the importance of the content in a video streaming session and the dependence of the ABR techniques’ performance on the content, Section 4.2.2 provides detailed graphs on the performance of each ABR for all video sequences, yet for one network trace and buffer capacity only.

4.2.2 Scenario II.

In Scenario II, we highlight the performance of MEDUSA depending on the streamed video content. Therefore, we analyze a specific scenario and present the metrics explained before, excluding the request bitrates (the transmitted data volume gives us a better overview) and the number of codec switches. The chosen scenario refers to 20 s buffer capacity and the 4G-LTE network trace. Figure 13 illustrates the different metrics for each video content and ABR algorithm (with and without MEDUSA on top).

Fig. 13.

Transmitted data. In Figure 13(a) we can observe that MEDUSA can consistently reduce the mean volume of transmitted data for each ABR algorithm and video content. The largest reduction occurs for AGG-M and ToS1, by up to approximately 24%. It is also interesting to note that for SARA-M the highest reduction is with ToS2 (15%), indicating that MEDUSA is particularly effective for complex video sequences (i.e., high SI/TI for ToS2; cf. Figure 4(c)).

VMAF. Figure 13(b) depicts the mean VMAF values of the requested segments. It is evident that Gameplay achieves a significantly lower VMAF compared to the other video sequences, independently of the chosen ABR strategy (due to its lower VMAF scores compared to other sequences). This impacts the final VMAF score but not the general VMAF trend for the considered ABR strategies, which is similar for all sequences. Based on Figure 13(a), with a single-codec strategy, we would expect the data reduction to correspond to a decrease in VMAF points. However, we can observe that MEDUSA consistently improves the achieved VMAF compared to the underlying ABR technique. This increment can reach a mean of 4.3 VMAF points, over 1 JND, for BBA-0-M with Gameplay. The exception is AGG-M, which is, however, able to keep the VMAF reduction for all video sequences within 3 VMAF points compared to AGG.

VMAF instability. Figure 13(c) shows the mean VMAF instability for the ABR strategies and video sequences. We can observe a similar behavior of MEDUSA and the underlying ABR techniques, with all changes being \(\pm 1\) VMAF points at most. This is expected since MEDUSA does not consider VMAF instability in its objective function and, hence, does not focus on reducing it.

Number and duration of stalls. Figure 13(d) and (e) illustrate the numbers and durations of stall events in the playback for each ABR technique and video content. We can see that MEDUSA can consistently reduce the number of stalls and partially decrease the duration of stalls for each ABR algorithm and video content. The largest decrease in number of stalls occurs for AGG-M with ToS2, the most complex test sequence in evaluation. The reduction in this case reflects 5 stalls on average (-73%). Despite this significant decrease, AGG-M doubles the duration of these stalls from approximately 5 s to 10 s, which means that AGG-M stalls less frequently but notably longer than AGG. Comparing BBA-0 and BBA-0-M, it is interesting to notice a similar trend in the number of stalls, but a different trend in the duration of stalls for Gameplay and Rally. While BBA-0 achieves a lower stall duration (-54%) for Gameplay than BBA-0-M, BBA-0-M reduces the mean value for Rally (-59%) compared to BBA-0. This can be explained by the behavior of MEDUSA. The mentioned stalls occur in proximity to a throughput drop. Although neither MEDUSA nor BBA-0 consider the throughput, the buffer is too short to cope with throughput fluctuations; this results in stall events. The duration of such stalls depends on the buffer status before the request is sent. Therefore, the decisions taken by BBA-0 or BBA-0-M for the previous segments are of vital importance. For Rally, having large segments, the decisions made by BBA-0 keep the buffer occupancy in a risky area, limiting the bitrate for the next segments and, hence, reducing the stall duration if a stall happens. BBA-0-M, on the other hand, aims at optimizing the tradeoff between quality and size, reduces the transmitted data volume and, therefore, also the download time, which results in an increase of the buffer occupancy. 
When the buffer occupancy is high and BBA-0-M prioritizes the quality over the size, if the throughput drops in the middle of the segment request, we inevitably experience a longer stall than for BBA-0. The opposite consideration holds for Gameplay, whose segments are on average smaller than those of other video sequences, which explains the lower VMAF score.

QoE. The aforementioned metrics eventually influence the QoE of users, which is represented in Figure 13(f). MEDUSA consistently enhances the QoE compared to the underlying ABR technique. The largest increase in QoE comes from SARA-M for ToS1 with more than 37%, due to the lower number and duration of stalls compared to SARA, which strongly affects the QoE. The second-largest improvement in QoE comes from AGG-M, which enhances the QoE for ToS2 by 22% compared to AGG. Considering different video sequences for the same ABR approaches, we observe a mostly constant pattern for BOLA and BOLA-M, whereas SARA and SARA-M behave differently depending on the content. For instance, SARA achieves the highest QoE score for Gameplay (1.61) and the lowest for ToS1 (1.35). SARA-M, instead, achieves the highest QoE for ToS2 (1.88) but the worst for Gameplay (1.66).
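To relate the reported scores to relative improvements, the following sketch computes the relative change between two QoE scores. The helper name `relative_gain` is hypothetical; the inputs are the Gameplay scores quoted above for SARA (1.61) and SARA-M (1.66).

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Gameplay scores quoted in the text: SARA 1.61, SARA-M 1.66.
gameplay_gain = relative_gain(1.61, 1.66)
print(f"SARA-M vs. SARA on Gameplay: {gameplay_gain:+.1%}")
```

For Gameplay, the gain is about 3%, far below the gain of more than 37% reported for ToS1, which is in line with the content-dependent behavior of SARA and SARA-M.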

5 Conclusions

In this article, we formulated the centralized optimization problem of selecting the appropriate codec for each user on a per-segment basis. To address the scalability issues associated with this technique, we proposed a novel distributed plug-in ABR algorithm, termed MEDUSA, that acts on top of a state-of-the-art ABR algorithm to enhance the QoE of the user by considering segments’ metadata, like quality and size, when requesting the next representation. Fetching this metadata from the manifest, MEDUSA uses the buffer occupancy to weight quality and size differently in its objective function. After receiving the segment selection from the underlying ABR algorithm, MEDUSA compares the value of the objective function for this segment with that of other representations having the same position in the bitrate ladder but encoded with different codecs. Among these, the segment with the highest value is selected and requested by the client. We compared the impact of MEDUSA when deployed on top of state-of-the-art ABR algorithms with their original versions and analyzed the results for different network traces, video content, and buffer capacities. To motivate the use of different video sequences, we presented a comparison in terms of compression efficiency for the considered codecs.
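The selection step summarized above can be sketched as follows. This is an illustration under stated assumptions and not the objective function defined in the article: the linear mapping from buffer fill to weights, the normalization of quality and size, and all names (`Representation`, `weights`, `objective`, `medusa_select`) and numbers are hypothetical. The only behavior taken from the text is that a fuller buffer shifts importance toward quality, and that the candidates are the representations at the same ladder position as the choice of the underlying ABR algorithm but with different codecs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    codec: str
    ladder_index: int  # position in the bitrate ladder
    vmaf: float        # segment quality from the manifest metadata (0..100)
    size_bytes: int    # segment size from the manifest metadata


def weights(buffer_s: float, capacity_s: float) -> tuple[float, float]:
    """Hypothetical linear mapping: fuller buffer -> more weight on quality."""
    fill = min(max(buffer_s / capacity_s, 0.0), 1.0)
    return fill, 1.0 - fill  # (w_quality, w_size)


def objective(rep, candidates, w_quality, w_size):
    """Hypothetical score: weighted sum of normalized quality and size saving."""
    max_size = max(c.size_bytes for c in candidates)
    quality = rep.vmaf / 100.0
    saving = 1.0 - rep.size_bytes / max_size
    return w_quality * quality + w_size * saving


def medusa_select(abr_choice, segment_reps, buffer_s, capacity_s):
    """Refine the ABR choice among same-ladder-position codec alternatives."""
    candidates = [r for r in segment_reps
                  if r.ladder_index == abr_choice.ladder_index]
    w_quality, w_size = weights(buffer_s, capacity_s)
    return max(candidates,
               key=lambda r: objective(r, candidates, w_quality, w_size))


# Hypothetical metadata for one segment at ladder position 3.
reps = [
    Representation("codec_a", 3, vmaf=80.0, size_bytes=1_000_000),
    Representation("codec_b", 3, vmaf=82.0, size_bytes=700_000),
    Representation("codec_c", 3, vmaf=88.0, size_bytes=900_000),
    Representation("codec_a", 2, vmaf=70.0, size_bytes=500_000),
]
abr_choice = reps[0]  # what the underlying ABR algorithm picked

print(medusa_select(abr_choice, reps, buffer_s=2.0, capacity_s=20.0).codec)   # codec_b
print(medusa_select(abr_choice, reps, buffer_s=18.0, capacity_s=20.0).codec)  # codec_c
```

With a nearly empty buffer, the sketch favors the smallest segment (`codec_b`); with a nearly full buffer, it favors the highest-quality one (`codec_c`). The representation at ladder position 2 is never considered, because the plug-in only refines the codec and leaves the ladder position chosen by the underlying ABR algorithm untouched.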

The experimental results showed that MEDUSA can significantly increase the QoE score while trading off between segment size and quality in almost all test videos and scenarios, compared to the state-of-the-art ABR approaches. The increase in VMAF, which exceeds 1 JND in the best cases, and the smoother playback, with a substantial reduction in the number of stalls, led to a QoE score improvement of up to 42%. Furthermore, with the help of MEDUSA, the transmitted data volume was reduced by up to 40%.

Footnotes

¹ https://caniuse.com/?search=video%20format (accessed 13 July 2023).

² https://github.com/pari685/AStream (accessed 13 July 2023).

³ https://github.com/magnific0/wondershaper (accessed 13 July 2023).

⁴ https://www.youtube.com/watch?v=gkIYZCmD-40 (accessed 13 July 2023).

⁵ https://www.youtube.com/watch?v=OQbDi3PnB2g (accessed 13 July 2023).

⁶ https://ffmpeg.org/

⁷ https://github.com/Telecommunication-Telemedia-Assessment/itu-p1203-codecextension (accessed 20 September 2021).

