Because SoftCast skips the nonlinear digital encoding and decoding operations of motion estimation, quantization, and entropy coding, its reconstruction quality improves linearly with channel quality. In particular, SoftCast has shown outstanding performance compared with conventional digital-based delivery schemes when receivers are highly diverse and/or the channel condition of each receiver varies drastically. However, the design of SoftCast is simple, so there remains much scope for improvement when adopting soft delivery in practical scenarios, including stable channel conditions and band-limited and/or error-prone environments. For this purpose, many studies have been conducted to improve the performance of soft delivery. The existing works on soft delivery schemes can be classified into seven types, as shown in Figure
1: energy compaction, optimal scaling, bandwidth utilization, resilience to packet loss, overhead reduction, hardware implementation, and extension for immersive experiences.
4.1 Energy Compaction of Source Signals
In soft delivery schemes via linear mapping (from source signals to channel signals), the reconstruction quality greatly depends on the performance of the energy compaction technique for the source signals. Specifically, the study by Prabhakaran et al. [
40] clarified that the performance of soft delivery schemes degrades as the ratio of maximum energy to minimum energy of the source component increases. To yield better quality under both stable and unstable channel conditions, existing studies have adopted different energy compaction techniques listed in Table
3 for the source signals.
Typical solutions are to adopt wavelet-based signal decorrelation methods. Specifically, some studies [
41,
42,
43,
44,
45,
46] have adopted a
Motion-Compensated Temporal Filter (MCTF), which is a temporal wavelet transform, to remove inter-frame redundancy by realizing motion compensation in soft delivery. The MCTF recursively decomposes video frames into low- and high-frequency frames up to a predefined decomposition level. For example, WaveCast [
44] adopted a 3D-
Discrete Wavelet Transform (DWT) (i.e., the integration of a 2D-DWT and MCTF) to remove temporal and spatial redundancy. Whereas SoftCast exploits a full-frame 3D-DCT to remove the intra- and inter-frame redundancy for energy compaction, WaveCast further improves the reconstruction quality by fully exploiting the inter-frame redundancy through motion compensation. A detailed discussion on the effects of other decorrelation methods is presented in the work of Xiong et al. [
47,
48]. Trioux et al. [
49] also exploited inter-frame redundancy by designing an adaptive GoP-size mechanism, which controls the GoP size based on shot changes and the spatio-temporal characteristics of the video frames and then applies a full-frame 3D-DCT for energy compaction across the video frames in each GoP.
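As an illustration of full-frame transform-based energy compaction, the following sketch applies a 3D-DCT to a synthetic GoP and measures how much of the energy falls into a low-frequency corner. The GoP contents and the corner "chunk" boundaries are hypothetical, and the multidimensional DCT comes from `scipy.fft.dctn`:

```python
import numpy as np
from scipy.fft import dctn

def gop_energy_compaction(gop):
    """Full-frame 3D-DCT over a GoP (frames x height x width)."""
    return dctn(gop, norm='ortho')  # orthonormal, so signal energy is preserved

# Hypothetical smooth GoP: a slowly varying gradient over 4 frames of 16x16.
t, h, w = 4, 16, 16
gop = np.fromfunction(lambda f, y, x: (x + y + 2 * f) / (w + h + 2 * t), (t, h, w))

c = gop_energy_compaction(gop)
total = np.sum(c ** 2)
low = np.sum(c[:2, :4, :4] ** 2)  # energy in a low-frequency corner "chunk"
# For smooth content, almost all energy concentrates in the low-frequency corner.
```

For natural video the compaction is less extreme than for this synthetic ramp, but the resulting variance profile across chunks is exactly what drives chunk-wise power allocation.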
Another typical solution is to send the large-energy coefficients as metadata, thus avoiding their transmission via pseudo-analog modulation. Lin et al. [
50] designed
Advanced SoftCast (ASoftCast) to send the low-frequency coefficients as metadata. ASoftCast decomposes the original images into frequency components using a 2D-DWT and divides them into two parts: the lowest-frequency sub-band and the remaining sub-bands. The wavelet coefficients in the lowest-frequency sub-band are run-length coded and then channel coded and digitally modulated for the additional metadata transmission. The optimized power allocation for the SoftCast scheme in the work of He et al. [
51] selected and sent high-energy coefficients as the metadata to reduce the energy of the analog-modulated symbols. As a result, a higher transmission power can be assigned to the remaining low-energy coefficients to improve the received quality. Here, determining the high-energy coefficients for each GoP is computationally complex owing to the use of an exhaustive search. To reduce the computational complexity, Trioux et al. [
52] adopted a zigzag scan to select the side information. Other studies [
53,
54,
55,
56,
57,
58] divided the video into BL and ELs, which were coded and sent in digital and pseudo-analog ways, respectively. For example, the BL in gradient-based image SoftCast (G-Cast) [
57] sent the DC and low-frequency coefficients of the image, whereas the EL extracted and sent an image gradient, which represents the edge portion of the image, using a gradient transform. The receiver then created a final estimation of the image via a gradient-based reconstruction procedure, utilizing both the image gradient at the EL and the low-frequency coefficients provided by the BL.
Other solutions adopted a nonlinear encoder and decoder for source signals to decrease the ratio of the maximum to the minimum energy of the analog-modulated symbols. The typical solution for soft delivery is to introduce coset coding [
59,
60], which is a typical technique in distributed source coding. Coset coding partitions the set of possible source values into several cosets and transmits the coset residual codes to the receiver. With the received coset codes and the predictor, the receiver can recover the source value in the coset by choosing the one closest to the predictor. DCast [
61,
62,
63,
64] first introduced coset coding for the soft delivery of inter frames. The coset coding in DCast divides each frequency domain coefficient
\(s_i\) by a coset step
q and obtains the coset residual code
\(l_i\) as
\(l_i = s_i - \lfloor \frac{s_i}{q} + \frac{1}{2} \rfloor q\), where
\(\lfloor \frac{s_i}{q} + \frac{1}{2} \rfloor\) represents the coset index. In this way, the sender only needs to transmit the coset residual code for energy compaction. At the receiver side, with the received coset residual code
\(\hat{l}_i\) and the side information
\(\bar{s}_i\) (i.e., the predicted DCT coefficient obtained from the reference video frame), the receiver reconstructs the DCT coefficients by coset decoding. Given the coset residual code
\(l_i\), there are multiple possible reconstructions of
\(s_i\) that form a coset
\(C = \lbrace \hat{l}_i, \hat{l}_i \pm q, \hat{l}_i \pm 2q, \hat{l}_i \pm 3q, \ldots \rbrace\). DCast then selects the element of the coset
C that is nearest to the side information
\(\bar{s}_i\) as the reconstruction of the DCT coefficient. In this case, the value of each coset step
q is crucial for the coding performance of DCast. The value of
q is calculated by estimating the noise at the receiver end, as shown in the work of Fan et al. [
63,
64]. However, the reconstruction quality of DCast also depends on the side information quality. If the side information
\(\bar{s}_i\) is error prone, the receiver may make wrong decisions with a smaller
q. Huang et al. [
65] introduced a side information refinement algorithm [
66] to refine the side information for the quality enhancement of DCast.
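The coset encoding and decoding steps above can be sketched as follows; the coefficient values and coset step are hypothetical, and recovery is exact only while the side information stays within \(q/2\) of the true coefficient:

```python
import numpy as np

def coset_encode(s, q):
    """Coset residual l = s - floor(s/q + 1/2) * q, i.e. s reduced modulo the coset step q."""
    return s - np.floor(s / q + 0.5) * q

def coset_decode(l_hat, s_bar, q):
    """Pick the coset member l_hat + n*q that is closest to the side information s_bar."""
    n = np.floor((s_bar - l_hat) / q + 0.5)
    return l_hat + n * q

s = np.array([123.0, -47.0, 8.0])       # hypothetical DCT coefficients
q = 16.0                                # coset step
l = coset_encode(s, q)                  # residuals have magnitude at most q/2
s_bar = s + np.array([3.0, -5.0, 2.0])  # predictor within q/2 of the truth
s_rec = coset_decode(l, s_bar, q)       # recovers s exactly in this regime
```

The energy compaction comes from the residuals being bounded by \(q/2\) regardless of the coefficient magnitude; the trade-off discussed above is that a smaller \(q\) shrinks the residuals but makes decoding more sensitive to side-information errors.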
The concept of coset coding has been widely applied in other studies on soft delivery for the same purpose. For example, several works [
67,
68,
69,
70,
71] utilized pseudo-coset coding for the lower-frequency components and sent the coset index using the digital framework, while the residuals of the lowest-frequency components and the other frequency components are sent using pseudo-analog modulation. The main difference between coset coding and pseudo-coset coding is that the latter sends the coset index as additional metadata. Layered coset coding and adaptive coset coding were applied to soft delivery in the work of Fan et al. [
69] and Lv et al. [
70], respectively. LayerCast [
69] introduced layered coset coding to simultaneously accommodate heterogeneous users with diverse SNRs and bandwidths. The layered coset coding used large to small coset steps to obtain coarse to fine layers from each chunk. The coarse layer (i.e., BL) is sufficient to reconstruct a low-quality DCT chunk for narrowband users, whereas each fine layer (i.e., EL) provides refinement information of the DCT chunk for wideband users. Some works [
72,
73,
74] utilized coset coding for cooperative soft delivery systems (i.e., three-node relay networks). A sender broadcasts the DCT coefficients obtained from the video frames to the relay node and the destination node using pseudo-analog modulation. If the channel quality between the sender and the destination node is higher than a threshold, the destination node reconstructs the video frames directly from the softly delivered DCT coefficients. Otherwise, the relay node sends the coset residual code to the destination node, which then reconstructs the video frames using the received coset residual code and the side information obtained from the softly delivered DCT coefficients from the sender.
4.2 Channel-Aware and Perception-Aware Power Allocation
As mentioned in Section
3.2, the power allocation in SoftCast minimizes the MSE between the original and reconstructed video signals over
Additive White Gaussian Noise (AWGN) channels. There are two drawbacks to adopting SoftCast in practical scenarios: (1) practical wireless channels have more complex characteristics (e.g., fading caused by multipath, and impulse noise) than AWGN channels, and (2) MSE is not an effective index of the perceptual fidelity of images/videos. To address these power allocation drawbacks, the existing studies in Table
4 propose the power allocation for practical wireless channels and perceptual considerations.
For the first drawback, the existing studies redesigned the power allocation for practical wireless channels, including fading [
75] and frequency-selective fading (i.e.,
Orthogonal Frequency-Division Multiplexing (OFDM)) [
76,
77], impulse noise [
84],
Multiple-Input and Multiple-Output (MIMO) [
78,
79,
80], and MIMO-OFDM channels [
81,
82,
83]. Cui et al. [
75] designed an optimal power allocation for fading channels. In fading channels, a fading effect (i.e., multiplicative noise) will degrade the reconstruction quality. Although SoftCast assumes that multiplicative noise can be canceled with exact channel estimation at the receiver end, no algorithm can guarantee an error-free channel estimation. In addition to the power allocation design, the authors analyzed the effect of the channel estimation error on the reconstruction quality at the receiver end.
For frequency-selective fading channels, such as OFDM and MIMO-OFDM channels, the key issue is how to match the analog-modulated symbols to the independent subcarriers/subchannels for high-quality image/video reconstruction. Liu et al. [
81,
82] observed similarities between the source and channel characteristics and exploited the similarities for subcarrier/subchannel matching. ParCast [
81] and the extended version of ParCast
\(+\) [
82] assigned the more important DCT coefficients to higher gain channel components and allocated power weights for each DCT coefficient with joint consideration of the source and channel for video unicast systems. ECast [
83] extended the source and channel matching and power allocation to video multicast systems. For multicast systems, it is necessary to deal with the large overhead of channel feedback from multiple receivers. In ECast, multiple users simultaneously send tone signals for channel feedback, and the sender receives the superposition of the tone signals. Although the sender cannot distinguish the individual channel gains, the weighted harmonic mean of the channel gains can be obtained from the superposed tone signals. ECast then utilizes this aggregate channel gain for source and channel matching and power allocation.
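The sorted source-channel matching principle used by ParCast-style schemes can be sketched as follows; the chunk variances and subcarrier gains below are hypothetical:

```python
import numpy as np

def match_chunks_to_subcarriers(chunk_vars, channel_gains):
    """Sorted matching: the highest-variance chunk rides the highest-gain subcarrier."""
    chunk_order = np.argsort(chunk_vars)[::-1]    # chunks, most important first
    gain_order = np.argsort(channel_gains)[::-1]  # subcarriers, strongest first
    assignment = np.empty(len(chunk_vars), dtype=int)
    assignment[chunk_order] = gain_order          # assignment[i] = subcarrier for chunk i
    return assignment

chunk_vars = np.array([10.0, 80.0, 5.0, 40.0])
gains = np.array([0.2, 1.5, 0.9, 0.4])
a = match_chunks_to_subcarriers(chunk_vars, gains)
# Chunk 1 (variance 80) is carried on subcarrier 1 (gain 1.5), and so on down the ranking.
```

The full schemes additionally optimize the per-chunk power weights jointly with this matching; the sketch shows only the pairing step.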
Other studies solved power allocation problems in modern wireless systems, including
Non-Orthogonal Multiple Access (NOMA) [
85,
86], underwater acoustic OFDM [
87],
Unmanned Aerial Vehicle (UAV)-enabled [
88], and mmWave lens MIMO systems [
89]. For example, in NOMA systems, source signals are coded into the BL and ELs and then transmitted simultaneously through superposition coding. With successive interference cancellation, near users with strong channel gains can decode both BL and EL signals, whereas far users with weak channel gains may only decode BL signals. In the existing studies, both the BL and ELs are analog coded in the work of Jiang et al. [
85], whereas the BL and ELs are digital coded and analog coded, respectively, in the work of Wu et al. [
86]. They solved the power allocation across the BL and ELs to minimize the distortion for all receivers with heterogeneous channel conditions. In underwater acoustic OFDM [
87] and mmWave lens MIMO systems [
89], the error behavior differs substantially across the channel components, showing a tendency similar to that of frequency-selective fading channels. These studies solved the source and channel matching and power allocation problems, as discussed above for frequency-selective fading channels, to minimize the distortion at the receiver end.
For the second drawback, some studies [
90,
91,
92,
93] also redesigned the power allocation with perceptual considerations, including
Structural Similarity (SSIM) [
90], foveation [
91], and saliency [
92]. In these studies, determining the perception-aware weights for each source component is challenging. Specifically, in SoftCast, the scaling factor for each coefficient is obtained from its power information to minimize the MSE:
\(g_i \propto \lambda_i^{-1/4}\). These studies considered the perception-aware weight for the
ith coefficient
\(w_i\) in the scaling factor to minimize the perceptual distortion as
\(g_i \propto w_i^{1/4} \lambda_i^{-1/4}\). For this purpose, Zhao et al. [
90] demonstrated the relationship between the MSE in the DCT coefficients and the SSIM distortion to obtain the weight for the
ith DCT coefficients of all chunks
\(w_i\). They found that the weight for the high-frequency coefficients was larger than that for the low-frequency coefficients, which was consistent with the characteristics of the
Human Visual System (HVS). FoveaCast [
91] introduced the foveation-based HVS [
94] and the corresponding HVS-based visual perceptual quality metric, called
Foveated Weighted Distortion (FWD), for the optimization objective. For a given foveation point
\((f_x, f_y)\) in the pixel and frequency domains, the error sensitivity for each pixel/frequency coefficient at location
\((x, y)\) can be defined in the foveation-based HVS. FoveaCast regarded the error sensitivity in the DWT domains as the weight
\(w_i\) and performed foveation-aware power allocation. In the work of Hadizadeh [
92], visual saliency maps were introduced for the perception-aware power allocation. Saliency maps represent the attended regions in an image when a user watches the image owing to the visual attention mechanism of the human brain. In this case, the weight for the
ith pixel
\(w_i\) is based on the normalized visual saliency defined from any arbitrary visual saliency model, such as the Itti–Koch–Niebur model [
95]. Based on these weights, the scheme allocates greater transmission power to salient regions to minimize the
Eye-Tracking Weighted Mean Square Error (EQMSE).
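The scaling rules in this subsection, \(g_i \propto \lambda_i^{-1/4}\) for MSE and \(g_i \propto w_i^{1/4}\lambda_i^{-1/4}\) with perceptual weights, can be sketched numerically; the chunk variances and the total-power normalization below are hypothetical:

```python
import numpy as np

def scaling_factors(lam, w=None, total_power=1.0):
    """Scaling g_i ∝ w_i^{1/4} * λ_i^{-1/4}, normalized so that the
    expected transmit power sum_i g_i^2 * λ_i meets the power budget."""
    w = np.ones_like(lam) if w is None else w
    g = (w ** 0.25) * (lam ** -0.25)
    g *= np.sqrt(total_power / np.sum(g ** 2 * lam))
    return g

lam = np.array([100.0, 25.0, 4.0, 1.0])  # hypothetical chunk variances
g = scaling_factors(lam)
p = g ** 2 * lam  # per-chunk transmit power ends up proportional to sqrt(λ_i)
```

With unit weights this reproduces the MSE-optimal SoftCast allocation; perception-aware schemes simply feed in non-uniform \(w_i\) derived from SSIM, foveation, or saliency models.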
4.3 Bandwidth Utilization
The source bandwidth of soft delivery schemes depends on the number of analog-modulated symbols transmitted per second (i.e., the baud rate). The aforementioned designs mainly assume that the channel bandwidth is sufficient to send all non-zero analog-modulated symbols over the wireless medium. However, when the channel bandwidth is lower than the source bandwidth, some analog-modulated symbols must be discarded at the sender side. Hence, the loss of important coefficients (i.e., the low-frequency coefficients) may have a significant impact on the reconstruction quality. Specifically, the expected distortions of soft delivery schemes for single and multiple content under both the bandwidth and transmission power constraints are discussed in the work of Liu et al. [
111] and He et al. [
112,
113], respectively. Some existing studies have adopted different techniques listed in Table
5 to meet the bandwidth constraint. The typical method is to selectively discard the chunks in higher-frequency components to fill the bandwidth [
11,
96]. When the sender discards some chunks, the receiver regards all coefficients in the discarded chunks as zeros. SoftCast therefore signals the locations of the discarded chunks to the receiver as a bitmap. Although SoftCast assumes equal-size chunks across low- to high-frequency components, Li et al. [
96] adopted smaller chunk sizes in high-frequency components to realize a fine-grained control to meet the bandwidth limitation. Another study [
97] used bandwidth-reducing
Shannon–Kotelnikov (SK) mappings to increase the number of chunks transmitted over bandwidth-constrained channels. The SK mappings are typical
N:1 bandwidth-reducing or 1:
M bandwidth-expanding non-linear mappings. In this study, 2:1 SK mappings were used to encode several pairs of chunks with less energy to send more chunks with medium energy within the channel bandwidth.
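Chunk discarding with a location bitmap, as described above, can be sketched as follows; the chunk variances and bandwidth budget are hypothetical:

```python
import numpy as np

def select_chunks(chunk_vars, budget):
    """Keep the `budget` highest-variance chunks; the rest are discarded, and
    their locations are signaled to the receiver as a bitmap."""
    order = np.argsort(chunk_vars)[::-1]
    bitmap = np.zeros(len(chunk_vars), dtype=bool)
    bitmap[order[:budget]] = True
    return bitmap

chunk_vars = np.array([900.0, 400.0, 90.0, 40.0, 9.0, 4.0, 0.9, 0.4])
bitmap = select_chunks(chunk_vars, budget=5)
# The receiver treats every coefficient in a discarded (False) chunk as zero.
```

Because natural-image variances decay toward high frequencies, the discarded chunks are usually the high-frequency ones, which is why smaller chunk sizes there enable finer-grained rate control.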
Other studies [
98,
99,
100,
101,
102,
103,
104,
105] introduced
Compressive Sensing (CS) techniques [
114,
115] for soft delivery over bandwidth-constrained wireless channels. Notably, CS is a sampling paradigm that allows the simultaneous measurement and compression of signals that are sparse or compressible in some domains. In general, recovering source signals from compressed signals is impossible because the system is underdetermined. However, if the source signals are sufficiently sparse in some domains, CS theory indicates that the source signals can be reconstructed from the compressed signals by solving the
\(\ell _1\) minimization problem. The advantage of CS-based soft delivery is the recovery of chunks in high-frequency coefficients using CS-based signal reconstruction algorithms, such as approximate message passing and iterative thresholding, even though the chunks are discarded at the sender’s end. For high-quality reconstruction, adaptive rate control and reconstruction algorithms are mainly adopted for CS-based soft delivery. For instance, Yami and Hadizadeh [
100] adaptively controlled the compression rate based on visual attention (i.e., both the texture complexity and visual saliency) to satisfy the bandwidth constraint while maintaining better perceptual quality. Liu et al. [
104] adaptively selected reliable columns from the measurement matrix and compressed source signals using the selected columns. In view of the reconstruction algorithm, Hadizadeh and Bajic [
101] designed an adaptive transform for noisy measurement signals to obtain sparser transform coefficients for clean reconstruction. Yin et al. [
102] and Tung and Gunduz [
103] designed grouping methods for measurement signals to utilize the similarity between video frames for the reconstruction.
Other studies utilized stored images/videos on the cloud to reduce the bandwidth requirement in soft delivery. Specifically, data-assisted communications of mobile images (DAC-Mobi) [
106], data-assisted cloud radio access network (DaC-RAN) [
107], and knowledge-enhanced mobile video broadcasting (KMV-Cast) schemes [
108,
109,
110], which are referred to as data-assisted soft delivery schemes, have been proposed for high-quality image/video transmission. The main contributions of the data-assisted soft delivery schemes are (1) a sender sends a limited number of analog-modulated symbols and (2) the receiver reconstructs images/videos using correlated images (i.e., side information) obtained from a cloud.
In DAC-Mobi [
106], successive coset encoders were introduced to divide the DCT coefficients into three layers of bit planes:
Most Significant Bits (MSBs) in low-frequency coefficients, MSBs in other frequency coefficients and middle bits, and
Least Significant Bits (LSBs). Here, MSBs in low-frequency coefficients and LSBs were transmitted to the receiver in digital and pseudo-analog manners, respectively, whereas MSBs in other frequency coefficients and middle bits were discarded. Based on the received MSB in the low-frequency coefficients, the receiver reconstructs a down-sampled image to retrieve correlated images in the cloud. The retrieved correlated images were used as side information to resolve ambiguity due to discarded bits and reconstruct the entire image. DaC-RAN [
107] and the extended version of KMV-Cast [
108,
109,
110] adopted Bayesian reconstruction algorithms that utilize correlated images/videos in the cloud as prior information to reduce the required bandwidth for soft delivery. The main difference between the DaC-RAN and KMV-Cast schemes is that the former assumes that the same images/videos exist in the cloud, whereas the latter does not require that the same images/videos exist at the receiver end by designing prior knowledge broadcasting in a digital manner.
The aforementioned studies considered the channel bandwidth to be lower than the source bandwidth. If the channel bandwidth is greater than the source bandwidth, the soft delivery schemes become less efficient. In this case, the soft delivery schemes utilize the extra bandwidth by retransmission. Lin et al. [
116] and Tan et al. [
117] designed an analog channel coding to use the extra channel bandwidth for quality enhancement. For example, Tan et al. [
117] proposed a chaotic function-based analog encoding [
118] for soft delivery. Because existing chaotic function-based analog coding is designed for uniformly distributed sources, applying it to Gaussian-distributed sources significantly amplifies the source signals and thus consumes unnecessary transmission power. They therefore designed a chaotic map function for Gaussian-distributed source signals to prevent power increments relative to the input power. MCast [
119] also utilized extra bandwidth for quality improvement. As mentioned earlier, the sender can send the source data multiple times if extra bandwidth is available. The key issue is then how to utilize the extra time slots for quality improvement. To this end, MCast optimized the assignment of the chunks of the DCT coefficients to the available channels in multiple time slots to fully exploit the time and frequency diversities.
In contrast to the aforementioned studies, Lan et al. [120] and He et al. [121] dealt with bandwidth variations. When the available bandwidth is less than the bandwidth expected at the sender's end, some important chunks may not be transmitted before the playback deadline. They therefore grouped several chunks into tiles and sent large-variance tiles with high priority to dispatch important coefficients before the playback deadline.
4.4 Packet Loss Resilience
Even when the channel bandwidth is sufficient to send all non-zero analog-modulated symbols, some of the symbols can be lost owing to loss-prone wireless channels. Specifically, packet loss caused by strong fading and interference may have a significant impact on the reconstruction quality if important chunks and coefficients are lost. SoftCast uses a Walsh–Hadamard transform to redistribute the energy of the source signals across all packets for resilience against packet loss. However, each packet still contains a large amount of energy, and thus the degradation owing to packet losses remains considerable.
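The Walsh–Hadamard energy redistribution can be illustrated with a toy sketch; the eight-symbol input with energy concentrated in one position is hypothetical:

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh–Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n) / np.sqrt(n)  # orthonormal, so total energy is preserved
x = np.array([100.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.2, 0.2])  # energy piled on one symbol
y = H @ x                     # energy spread nearly evenly across all positions
# Packetizing y instead of x makes every packet carry a comparable share of energy.
```

Because the transform is orthonormal, the receiver inverts it with \(H^{\mathsf T}\); the per-packet energy becomes roughly uniform, but, as noted above, each packet still carries substantial energy, so losses remain costly.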
To maintain better reconstruction quality in error-prone wireless channels, some related studies [
122,
123,
124] have introduced CS techniques (i.e., block-wise CS [
125]) for packet loss resilience. The CS technique is suitable for wireless transmission with random packet loss owing to its random measurement. Random measurement considers all packets as of equal importance. In contrast to typical CS techniques, block-wise CS can reduce the storage and computational costs of the reconstruction. A pioneering work on packet loss resilience is the distributed compressed sensing-based multicast scheme (DCS-Cast) [
122]. In DCS-Cast, each image is first divided into blocks, and the coefficients in each block are randomized using the same measurement matrix across the blocks. One coefficient from every block is placed in each packet to equalize the importance across packets. Even though some packets may be lost over loss-prone wireless channels, the receiver obtains noisy pixel values using the same measurement matrix as the sender and reconstructs the lost pixel values using a CS reconstruction algorithm in the DCT/DWT domains. Because the lost pixel values can be recovered by the reconstruction algorithm, DCS-Cast maintains high image/video quality in loss-prone channels. To further improve the reconstruction quality, multi-scale [
123] and adaptive [
124] block-wise CS algorithms have been adopted for soft delivery. The multi-scale block-wise CS algorithm [
123] decomposes each video frame via a multi-level 2D-DWT and then optimizes the sampling rate for each DWT level according to its importance. In contrast, the adaptive block-wise CS algorithm [
124] divides several video frames into one reference frame and subsequent non-reference frames and adaptively determines whether direct or predictive sampling should be used for each block in a non-reference frame. Direct sampling randomizes the signals in the block, whereas predictive sampling calculates the residuals between the blocks in the reference and non-reference frames and randomizes residuals to utilize the inter-frame similarity for the reconstruction.
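The one-measurement-per-block packetization used by DCS-Cast can be sketched as follows; the block and measurement counts are hypothetical:

```python
import numpy as np

def dcs_packetize(measurements):
    """measurements has shape (n_blocks, n_meas); packet k carries the k-th
    measurement of every block, so all packets are equally important."""
    return measurements.T.copy()  # shape (n_meas, n_blocks)

rng = np.random.default_rng(0)
blocks = rng.standard_normal((16, 8))  # 16 blocks, 8 CS measurements each
packets = dcs_packetize(blocks)
# Losing one packet removes exactly one measurement from every block instead
# of wiping out any single block, which the CS reconstruction can absorb.
received = np.delete(packets, 3, axis=0)  # e.g., packet 3 is lost
```

This interleaving is what makes random packet loss look like a uniformly reduced sampling rate to the CS decoder, rather than a catastrophic loss of one region.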
4.5 Overhead Reduction
In soft delivery schemes without chunk division, a sender needs to let the receiver know the power information of all the DCT coefficients to demodulate the signals. For the receiver to carry out the MMSE filtering in Equation (
5), the sender needs to transmit
\({\lambda }_{i}\) of all coefficients without errors as metadata, which may constitute a large overhead. For example, when the sender transmits eight video frames with a resolution of
\(352\times 288\), the sender needs to transmit metadata for all DCT coefficients (i.e.,
\(352\times 288\times 8 = 811{,}008\) variables in total) to the receiver. This overhead may induce performance degradation owing to the rate and power losses in the transmission of analog-modulated symbols. To reduce the overhead, SoftCast divides the DCT coefficients into chunks and carries out chunk-wise power allocation using an MMSE filter. However, the overhead remains high, and chunk division itself causes performance degradation owing to suboptimal power allocation.
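The MMSE filtering that requires this metadata can be illustrated with a minimal sketch; the filter below is the standard linear MMSE estimator for \(y_i = g_i s_i + n_i\), consistent with the role of Equation (5), and the chunk statistics are hypothetical:

```python
import numpy as np

def mmse_decode(y, g, lam, sigma2):
    """LMMSE estimate of s from y = g*s + n with Var(s) = lam, Var(n) = sigma2.
    Computing this filter is why the receiver needs lam as metadata."""
    return (g * lam) / (g ** 2 * lam + sigma2) * y

rng = np.random.default_rng(0)
lam, g, sigma2 = 10.0, 0.5, 4.0  # hypothetical chunk variance, scaling, noise power
s = rng.normal(scale=np.sqrt(lam), size=10000)
y = g * s + rng.normal(scale=np.sqrt(sigma2), size=10000)
s_hat = mmse_decode(y, g, lam, sigma2)
mse_mmse = np.mean((s_hat - s) ** 2)  # close to lam*sigma2 / (g^2*lam + sigma2)
mse_zf = np.mean((y / g - s) ** 2)    # zero-forcing baseline, close to sigma2 / g^2
```

The gap between the MMSE and zero-forcing errors is what an erroneous or missing \(\lambda_i\) gives up, which is why accurate yet compact metadata matters.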
To achieve better quality under a low overhead requirement, the related studies can be classified into two types, as shown in Figure
6: sender-side overhead reduction and receiver-side overhead reduction. Studies on the sender-side overhead reduction [
126,
127,
128,
129] designed fitting functions to obtain the power information with fewer parameters. In this case, the sender and receiver share the same fitting function in advance and send the parameters as metadata for overhead reduction. Specifically, Song et al. [
126] designed a fitting function with four parameters for each chunk, and Xiong et al. [
127] designed a log-linear function with two parameters for each chunk. Another study [
128] found that equal-size chunk division was not suitable for chunk-wise fitting, and thus an adaptive chunk division (i.e., L-shaped chunk division) was designed for an accurate fitting. In addition, Fujihashi et al. [
129] exploited a Lorentzian fitting function with seven parameters based on a
Gaussian Markov Random Field (GMRF) for each GoP. These sender-side approaches accurately approximate the metadata using a fitting function with a limited number of parameters (i.e., low overhead), at the cost of the additional computation required for fitting.
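A two-parameter log-linear fit in the spirit of these sender-side schemes can be sketched as follows; the exponentially decaying chunk variances are synthetic, and only the two fitted parameters would be sent as metadata:

```python
import numpy as np

def fit_loglinear(lam):
    """Fit log(lam_i) ~ a + b*i; only (a, b) need to be sent as metadata."""
    i = np.arange(len(lam))
    b, a = np.polyfit(i, np.log(lam), 1)  # polyfit returns highest degree first
    return a, b

def predict_loglinear(a, b, n):
    return np.exp(a + b * np.arange(n))

# Synthetic chunk variances: exponential decay with mild multiplicative noise.
rng = np.random.default_rng(0)
lam = 1000.0 * np.exp(-0.3 * np.arange(16)) * rng.uniform(0.8, 1.2, 16)
a, b = fit_loglinear(lam)
lam_hat = predict_loglinear(a, b, 16)  # receiver-side reconstruction of the metadata
```

Two floats replace sixteen per-chunk variances here; the residual fitting error is exactly the modeling-accuracy effect analyzed in the work of Xiong et al. [127].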
Studies on receiver-side overhead reduction [
130,
131] estimate the power information only from the received signals without any additional computational cost at the sender side. The study of Li et al. [
130] is a pioneering work that estimates the power information directly from the received signals. Blind data detection [
131] was proposed to decode the received analog-modulated symbols without the power information at the receiver. Specifically, blind data detection uses a zero-forcing estimator and the sign of the received signals to approximate the source signals. One typical issue with receiver-side overhead reduction is that the reconstruction quality depends heavily on the quality of the received signals.
We note that both types of overhead reduction cause quality degradation owing to estimation errors. In the work of Xiong et al. [
127], the effect of modeling accuracy on the reconstruction quality in soft delivery was analyzed.
4.6 Implementation
The aforementioned studies mainly discussed performance improvements in theoretical analyses and simulations. Table
6 lists the categories of the existing studies in terms of the performance evaluation. Existing studies [
41,
45,
81,
82,
121,
134] used the software-defined radio platform SORA [
141] to carry out emulations. In contrast to simulations, emulations use channel fading and noise traces obtained from SORA to evaluate performance under real wireless environments.
Some studies implemented a soft delivery scheme on a software-defined radio platform [
13,
103,
117,
135] and a
Field-Programmable Gate Array (FPGA) [
136,
137,
138] to empirically demonstrate the benefits of soft delivery in practical wireless channels. In some works [
13,
103,
135], the authors used
Universal Software Radio Peripheral (USRP) 2, USRP NI2900, and USRP X310 devices, respectively, together with GNU Radio, and evaluated the visual quality of soft delivery. In addition, Tan et al. [
117] built an experimental system based on the
OpenAirInterface (OAI) platform and self-developed
Software Universal Platform (SOUP). They migrated OAI to the self-developed SOUP software-defined radio and implemented the proposed scheme based on the OAI eMBMS code. Meanwhile, other works [
136,
137,
138] exploited the Xilinx Virtex7 FPGA for implementation and tested the reconstruction quality as a function of wireless channel SNRs.
Other studies [
139,
140] implemented soft delivery on the prototypes of multi-user MIMO (MU-MIMO) and
Long-Term Evolution (LTE) systems. For example, in the work of Chen et al. [
139], SoftCast is implemented on BUSH, which is a large-scale MU-MIMO prototype that performs scalable beam user selection with hybrid beamforming for phased-array antennas in legacy WLANs. They performed experiments to evaluate the video quality in terms of
Peak Signal-to-Noise Ratio (PSNR) and SSIM over a lossy MU-MIMO channel.
4.7 Extension for Immersive Experiences
SoftCast and other soft delivery schemes mentioned in the previous sections were designed for conventional images and video signals. In modern wireless and mobile communication scenarios, the streaming of immersive content will be a key application for reconstructing 3D perceptual scenes that provide full parallax and depth information for human eyes. The immersive content can be applied to various applications, such as 3 to 6 degrees-of-freedom entertainment, remote device operation, medical imaging, vehicular perception, VR/AR/MR, and simulated training. The typical immersive content includes free viewpoint video [
142,
143,
144], 360-degree video, and point cloud [
145], and Table
7 lists their features. Even for immersive content, the video frames are compressed in a digital manner, and the compressed bitstream is then channel coded and modulated in sequence. This means that cliff and leveling effects still occur in streaming immersive content owing to variations in channel conditions. To prevent cliff, leveling, and staircase effects, some studies have extended soft delivery schemes to immersive content for future wireless multimedia services.
One of the key advantages of soft delivery schemes for immersive content is that they simplify the optimization problem for image and video quality maximization. In immersive content delivery, the main issue is to maximize the image and video quality considering the user’s perspective. For example, the view synthesis distortion optimization problem and the viewport optimization problem should be solved in free viewpoint video and 360-degree video, respectively. In digital-based delivery schemes, a sender needs to find the best bit and transmission power allocation for video frames. However, it is often cumbersome to derive a solution. Soft delivery schemes simplify the optimization problems by reformulating them into a simple power allocation problem since bit allocation for quantization is not required in soft delivery schemes.
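As a concrete illustration of this reformulation, consider the classic SoftCast-style scaling rule: minimizing the mean-squared error under a total transmission power budget yields per-chunk gains proportional to \(\lambda_i^{-1/4}\), where \(\lambda_i\) is the energy of the \(i\)-th transformed chunk. The following is a minimal sketch; the function name is ours, while the rule itself follows the soft delivery literature:

```python
import numpy as np

def power_allocation(lam, total_power):
    """MSE-optimal linear scaling under a power budget: g_i ∝ lam_i^(-1/4).

    lam: per-chunk energies (variances) of the transformed coefficients.
    Returns gains g satisfying the budget sum(g**2 * lam) == total_power.
    """
    lam = np.asarray(lam, dtype=float)
    c = np.sqrt(total_power / np.sum(np.sqrt(lam)))
    return c * lam ** -0.25

# Toy example: four transform chunks with decaying energies.
lam = np.array([100.0, 25.0, 4.0, 1.0])
g = power_allocation(lam, total_power=1.0)
print(np.sum(g**2 * lam))  # ≈ 1.0: the power budget is met
```

Extending such a closed-form allocation to, e.g., texture and depth chunks with viewpoint-dependent weights is the kind of quadratic problem the schemes discussed below solve, without any combinatorial bit allocation.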
4.7.1 Free Viewpoint Video.
Free viewpoint videos enable us to observe a 3D scene from freely switchable angles/viewpoints. A free viewpoint video is captured by numerous closely spaced RGB and infrared cameras arranged in arrays to record the texture and depth frames of a 3D scene, such as a football game. Even though the number of cameras deployed in the field is limited owing to physical constraints, the receiver can synthesize intermediate virtual viewpoints using rendering techniques (e.g., depth image-based rendering [
146,
147]) to obtain numerous switchable viewpoints. To synthesize intermediate virtual viewpoints using the rendering technique, the sender encodes and transmits the texture and depth frames of two or more adjacent viewpoints, the format of which is known as
Multi-View plus Depth (MVD) [
148].
For conventional MVD video streaming over wireless links, digital video compression for MVD video frames (e.g., MVC+D [
149] or 3D-AVC [
150]) fully exploits inter-camera and texture-depth redundancy for compression. In this case, the streaming schemes need to address view synthesis distortion in addition to cliff and leveling effects to yield better video quality even at synthesized virtual viewpoints. Specifically, the video quality of a virtual viewpoint is determined by the distortion of each texture and depth frame. In digital-based MVD schemes, this distortion depends on the bit and power assignments for each texture and depth frame, and achieving the best quality at a target virtual viewpoint through parameter optimization is often cumbersome because nonlinear quantization makes the problem combinatorial.
Some studies [
151,
152,
153,
154,
155] designed a soft delivery scheme for a free viewpoint video. Specifically, FreeCast [
151,
152] is the first soft delivery scheme for free viewpoint video. Because MVD video frames exhibit inter-view and texture-depth redundancy, FreeCast jointly transforms texture and depth frames using a 5D-DCT to exploit these correlations for energy compaction. In addition, FreeCast simplifies the view synthesis optimization by reformulating it as a simple power assignment problem, because bit allocation (i.e., quantization) is not required. The authors found that the power assignment problem for the texture and depth frames can be solved using a quadratic function to yield the best quality at the desired virtual viewpoint. Furthermore, FreeCast introduces a fitting function obtained from a multi-dimensional GMRF at the sender and receiver to represent the power information with few parameters, thereby reducing overhead.
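The joint multi-dimensional transform at the heart of FreeCast can be sketched with an off-the-shelf separable DCT. The tensor layout below (two view axes, one temporal axis, two spatial axes) and the low-frequency chunk sizes are our assumptions for illustration, not FreeCast's exact configuration:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Hypothetical smooth MVD tensor: (horiz. views, vert. views, time, height, width).
# Texture and depth frames would each form such a tensor.
shape = (4, 4, 8, 16, 16)
grids = np.meshgrid(*[np.linspace(0, 1, n) for n in shape], indexing="ij")
mvd = sum(np.cos(np.pi * g) for g in grids)  # smooth, hence highly correlated

# Joint separable DCT over all five axes decorrelates inter-view,
# temporal, and spatial redundancy in one shot.
coeffs = dctn(mvd, norm="ortho")

# Energy compaction: the low-frequency corner holds most of the energy,
# so a band-limited sender can discard high-frequency chunks.
low = coeffs[:2, :2, :4, :8, :8]
ratio = np.sum(low**2) / np.sum(coeffs**2)
print(f"{ratio:.3f}")  # close to 1.0 for correlated content

# The transform is invertible, so analog transmission of the (scaled)
# coefficients allows graceful, quantization-free reconstruction.
print(np.allclose(idctn(coeffs, norm="ortho"), mvd))  # True
```

The same pattern with a 3D-DCT applied per camera corresponds to the 3DV SoftCast design discussed next.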
3DV SoftCast [
154] focused on the view synthesis problem under per-camera 3D-DCT operations on texture and depth frames and designed a power allocation method to solve it. The main difference from FreeCast is that 3DV SoftCast performs a 3D-DCT for each camera and controls the transmission power to minimize view synthesis distortion, whereas FreeCast performs a 5D-DCT for better energy compaction of the analog-modulated symbols and uses multi-dimensional GMRF-based overhead reduction to reconstruct high-quality MVD frames in band-limited environments. Yang et al. [
155] designed a soft delivery scheme for depth video. They found that a block-based DCT outperforms a full-frame DCT on depth video because depth video has different characteristics from texture video. Although texture video still requires a different soft delivery scheme, the low-distortion depth video obtained by Yang et al. [
155] can provide better virtual viewpoint quality in free viewpoint video.
4.7.2 360-Degree Video.
360-degree video content builds a synthetic virtual environment that mimics the real world and with which users interact. Each user can watch 360-degree videos through a traditional computer-supported VR headset or an all-in-one headset (e.g., Oculus Go). When a user requests a 360-degree video, the sender transmits the 360-degree video frames, and the user plays back a part of each frame, referred to as the viewport, through the headset. 360-degree videos are mainly captured by an omnidirectional camera or a combination of multiple cameras and saved in a spherical format. Before transmission, the spherical frames are mapped onto a 2D plane using a projection method (e.g., equirectangular or cube map projection).
In 360-degree video streaming, the major issue is to yield better video quality in the user’s viewport by effectively reducing perceptual redundancy within 360-degree video frames. Because each user only watches the viewport via the headset at each time instance, excessive video traffic is created if the sender sends the full resolution of the 2D-projected video frames with an identical quantization parameter. One of the simplest methods to reduce perceptual redundancy is viewport-only streaming [
156]. During video playback, the user moves the viewing viewport according to head/eye movement. Based on this movement, the user requests a new viewport from the sender, which sends back the corresponding viewport. Because the sender transmits only one viewport at each time instant, viewport-only streaming mitigates the video traffic. However, the user needs to receive a new viewport from the sender at every viewport switch, which causes a long switching delay. A switching delay longer than approximately 10 ms may cause
simulator sickness [
157]. Owing to long delays over the standard Internet, it is difficult for viewport-only streaming schemes to satisfy this switching delay requirement. To prevent simulator sickness, conventional schemes [
158] divide 360-degree video frames into multiple tiles and independently encode them with different quantization parameters to yield better viewport quality within the bandwidth constraint.
Studies [
159,
160,
161,
162] on soft delivery schemes focus on the quality optimization of the user’s viewport in addition to cliff and leveling effect prevention. Fujihashi et al. [
159] presented the first scheme for viewport-aware soft 360-degree video delivery. According to the viewing viewport, the sender first adopts pixel-wise power allocation to reduce the perceptual redundancy in 360-degree video frames and then applies a combination of a 1D-DCT and a spherical wavelet transform for decorrelation, exploiting redundancy in the spherical and temporal domains. In the work of Zhao et al. [
160], OmniCast further incorporates the features of 360-degree videos into quality optimization. Specifically, the authors analyze, for each projection method, the relationship between distortion in the spherical domain and in the projected 2D domain, and design a power allocation that realizes the optimal quality in the 2D-projected 360-degree videos. 360Cast [
161] and the extended version 360Cast
\(+\) [
162] adopt viewport prediction based on linear regression and foveation-aware power allocation within the predicted viewport to further reduce the perceptual redundancy. They evaluate 360Cast
\(+\) with the existing digital-based schemes in terms of weighted-to-spherically uniform PSNR [
163]. Here, the digital-based schemes use HEVC Test Model 16.20 [
164] and the modulation format of BPSK. They found that 360Cast
\(+\) improves the average weighted-to-spherically uniform PSNR compared with the digital-based schemes by preventing the cliff effect in low-SNR regimes and by gradually improving the received video quality as the wireless channel quality improves.
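As an illustration of the viewport prediction step, a linear-regression predictor extrapolates the head orientation from recent samples. The function name and the (yaw, pitch) parameterization below are our assumptions for a minimal sketch, not the exact 360Cast\(+\) formulation:

```python
import numpy as np

def predict_viewport(history, horizon=1):
    """Predict the next viewport center by linear regression over past samples.

    history: array of shape (T, 2) holding past (yaw, pitch) angles in degrees.
    Returns the extrapolated (yaw, pitch) `horizon` steps ahead.
    """
    history = np.asarray(history, dtype=float)
    t = np.arange(len(history))
    # Fit yaw(t) and pitch(t) independently with a degree-1 polynomial.
    coeffs = [np.polyfit(t, history[:, k], deg=1) for k in range(2)]
    t_next = len(history) - 1 + horizon
    return np.array([np.polyval(c, t_next) for c in coeffs])

# Toy head-motion trace: panning right at 2 degrees/frame, constant pitch.
trace = np.column_stack([2.0 * np.arange(10), np.full(10, -5.0)])
print(predict_viewport(trace, horizon=3))  # ≈ [24., -5.]
```

The predicted viewport center then drives the foveation-aware power allocation: coefficients near the predicted gaze receive more transmission power than peripheral ones.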
4.7.3 Point Cloud.
Volumetric content delivery provides highly immersive experiences for users through XR devices. The point cloud [
145] is arguably the most popular volumetric data structure for representing 3D scenes and objects on holographic displays [
165,
166]. A point cloud typically consists of a set of 3D points, and each point is defined by 3D coordinates (i.e., (X, Y, Z)) and color attributes (i.e., (R, G, B)). In contrast to conventional 2D images and videos, 3D point cloud data are neither well aligned nor uniformly distributed in space.
The major challenge in volumetric delivery over wireless channels is how to efficiently compress and send the numerous, irregularly structured points of a 3D point cloud within a limited bandwidth. Several compression methods have been proposed for delivering 3D point cloud data. Specifically, Draco [
167] employs
kd-tree-based compression [
168], and the Point Cloud Library uses octree-based compression [
169,
170,
171]. To further reduce the amount of data traffic in point cloud delivery, two transform techniques have been proposed for energy compaction of the non-ordered and non-uniformly distributed signals: Fourier-based transform (e.g.,
Graph Fourier Transform (GFT)) and wavelet-based transform (e.g., region-adaptive Haar transform) [
172]. For example, recent studies used GFT for the color components [
173] and 3D coordinates [
174] of graph signals for signal decorrelation, followed by quantization and entropy coding to compress the decorrelated signals.
HoloCast [
175] is a pioneering work on soft 3D point cloud delivery over unstable wireless channels. Specifically, HoloCast regards 3D points as vertices of a graph with edges between nearby vertices to handle the irregular structure of the 3D points, motivated by the works of Rente et al. [
174] and Zhang et al. [
176]. HoloCast applies the GFT to such graph signals to exploit the underlying correlations among adjacent graph signals and directly transmits the linearly transformed graph signals over the channel as pseudo-analog modulation. We compared HoloCast with conventional digital-based delivery, which uses the digital point cloud compression of the Point Cloud Library [
169]. HoloCast gradually improves the reconstruction quality as the wireless channel quality improves. In addition, GFT-based HoloCast achieves better quality than DCT-based HoloCast.
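The GFT used by HoloCast can be sketched as follows: build a neighborhood graph over the 3D points, take the eigenvectors of its graph Laplacian as the transform basis, and project the color attributes onto that basis. The k-nearest-neighbor construction and Gaussian edge weights below are common choices in the graph signal processing literature, not necessarily HoloCast's exact parameters:

```python
import numpy as np

def gft_basis(points, k=4):
    """Graph Fourier Transform basis for a 3D point cloud.

    Builds a k-nearest-neighbor graph over the points, forms the combinatorial
    Laplacian L = D - W, and eigendecomposes it; the eigenvectors are the GFT
    basis, ordered from smooth (low graph frequency) to oscillatory.
    """
    n = len(points)
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:  # skip self (distance 0)
            w = np.exp(-d2[i, j])              # Gaussian edge weight
            W[i, j] = W[j, i] = w
    L = np.diag(W.sum(axis=1)) - W
    _, U = np.linalg.eigh(L)
    return U

rng = np.random.default_rng(1)
pts = rng.random((50, 3))          # toy point cloud: 50 points in 3D
colors = rng.random((50, 3))       # per-point RGB attributes
U = gft_basis(pts)
coeffs = U.T @ colors              # GFT of the color attributes
print(np.allclose(U @ coeffs, colors))  # True: orthonormal basis, exact inverse
```

In a soft delivery pipeline, the coefficients `coeffs` would be power-scaled and transmitted in a pseudo-analog manner rather than quantized and entropy coded.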
However, graph-based coding schemes must send the graph-based transform basis matrix used in the GFT as additional metadata for signal decoding. For example, the sender needs to send
\(N^2\) real elements of the graph-based transform basis matrix as the metadata when the number of 3D points is
N. In some works [
177,
178,
179], Givens rotation [
180,
181] was used for GFT basis matrix compression. Givens rotations selectively introduce zeros into the basis matrix, reducing it to an identity matrix that is fully described by a set of angle parameters. These angle parameters are quantized, either uniformly or non-uniformly, before metadata transmission to reduce the overhead. Evaluations showed that Givens rotation with uniform quantization reduces the overhead by up to 89.8% [
177] compared with HoloCast without overhead reduction. In addition, Givens rotation with non-uniform quantization further reduces the overhead by up to 28.6% [
178] compared with the uniform quantization.
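The Givens-rotation decomposition can be sketched as follows: zeroing the subdiagonal entries of the orthonormal basis matrix column by column reduces it to a signed identity, so an \(N \times N\) basis is fully described by \(N(N-1)/2\) angles (plus \(N\) signs) instead of \(N^2\) matrix elements. Quantizing the angles, which this sketch omits, is where the reported overhead savings come from:

```python
import numpy as np

def givens_angles(U):
    """Reduce an orthonormal basis U to a signed identity with Givens
    rotations, returning the angle parameters (i, j, theta)."""
    R = U.copy()
    n = R.shape[0]
    angles = []
    for j in range(n - 1):
        for i in range(j + 1, n):
            theta = np.arctan2(R[i, j], R[j, j])  # angle zeroing R[i, j]
            c, s = np.cos(theta), np.sin(theta)
            G = np.eye(n)
            G[j, j], G[j, i] = c, s
            G[i, j], G[i, i] = -s, c
            R = G @ R
            angles.append((i, j, theta))
    return angles, np.diag(R)  # R is now diagonal with entries ±1

def rebuild(angles, signs):
    """Receiver side: replay the rotations in reverse to recover U."""
    n = len(signs)
    U = np.diag(signs)
    for i, j, theta in reversed(angles):
        c, s = np.cos(theta), np.sin(theta)
        G = np.eye(n)
        G[j, j], G[j, i] = c, s
        G[i, j], G[i, i] = -s, c
        U = G.T @ U
    return U

# Toy GFT-like basis: eigenvectors of a random symmetric matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
_, U = np.linalg.eigh(A + A.T)
angles, signs = givens_angles(U)          # 6*5/2 = 15 angles
print(np.allclose(rebuild(angles, signs), U))  # True: lossless without quantization
```

With quantized angles, the reconstruction becomes approximate, trading basis fidelity against the metadata overhead discussed above.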