Because SoftCast skips the nonlinear digital encoding and decoding operations of motion estimation, quantization, and entropy coding, its reconstruction quality improves linearly with channel quality. In particular, SoftCast has shown outstanding performance compared with conventional digital-based delivery schemes when receivers are highly diverse and/or the channel condition of each receiver varies drastically. However, the design of SoftCast is simple, so there remains much scope for improvement when adopting soft delivery in practical scenarios, including stable channel conditions and band-limited and/or error-prone environments. For this purpose, many studies have been conducted to improve the performance of soft delivery. The existing works on soft delivery schemes can be classified into seven types, as shown in Figure
1: energy compaction, optimal scaling, bandwidth utilization, resilience to packet loss, overhead reduction, hardware implementation, and extension for immersive experiences.
4.1 Energy Compaction of Source Signals
In soft delivery schemes via linear mapping (from source signals to channel signals), the reconstruction quality greatly depends on the performance of the energy compaction technique for the source signals. Specifically, the study by Prabhakaran et al. [
40] clarified that the performance of soft delivery schemes degrades as the ratio of maximum energy to minimum energy of the source component increases. To yield better quality under both stable and unstable channel conditions, existing studies have adopted different energy compaction techniques listed in Table
3 for the source signals.
Typical solutions are to adopt wavelet-based signal decorrelation methods. Specifically, some studies [
41,
42,
43,
44,
45,
46] have adopted a
Motion-Compensated Temporal Filter (MCTF), which is a temporal wavelet transform, to remove inter-frame redundancy by realizing motion compensation in soft delivery. The MCTF recursively decomposes video frames into low- and high-frequency frames up to a predefined decomposition level. For example, WaveCast [
44] adopted a 3D-
Discrete Wavelet Transform (DWT) (i.e., the integration of a 2D-DWT and MCTF) to remove temporal and spatial redundancy. Whereas SoftCast exploits a full-frame 3D-DCT to remove the intra- and inter-frame redundancy for energy compaction, WaveCast further improves the reconstruction quality by fully exploiting the inter-frame redundancy through motion compensation. A detailed discussion on the effects of other decorrelation methods is presented in the work of Xiong et al. [
47,
48]. Trioux et al. [
49] also exploited inter-frame redundancy by designing an adaptive GoP-size mechanism, which controls the GoP size based on shot changes and the spatio-temporal characteristics of the video frames and then applies a full-frame 3D-DCT for energy compaction across the video frames in each GoP.
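As an illustration of full-frame transform-based energy compaction, the following sketch applies a 3D-DCT to a synthetic GoP and measures how much of the energy falls into a low-frequency corner. The GoP contents and the corner "chunk" boundaries are hypothetical, and the multidimensional DCT comes from `scipy.fft.dctn`:

```python
import numpy as np
from scipy.fft import dctn

def gop_energy_compaction(gop):
    """Full-frame 3D-DCT over a GoP (frames x height x width)."""
    return dctn(gop, norm='ortho')  # orthonormal, so signal energy is preserved

# Hypothetical smooth GoP: a slowly varying gradient over 4 frames of 16x16.
t, h, w = 4, 16, 16
gop = np.fromfunction(lambda f, y, x: (x + y + 2 * f) / (w + h + 2 * t), (t, h, w))

c = gop_energy_compaction(gop)
total = np.sum(c ** 2)
low = np.sum(c[:2, :4, :4] ** 2)  # energy in a low-frequency corner "chunk"
# For smooth content, almost all energy concentrates in the low-frequency corner.
```

For natural video the compaction is less extreme than for this synthetic ramp, but the resulting variance profile across chunks is exactly what drives chunk-wise power allocation.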
Another typical solution is to send the large-energy coefficients as metadata, thus avoiding their transmission via pseudo-analog modulation. Lin et al. [
50] designed
Advanced SoftCast (ASoftCast) to send the low-frequency coefficients as metadata. ASoftCast decomposes the original images into frequency components using a 2D-DWT and divides them into two parts: the lowest-frequency sub-band and the remaining sub-bands. The wavelet coefficients in the lowest-frequency sub-band are run-length coded and then channel coded and digitally modulated for the additional metadata transmission. The optimized power allocation for the SoftCast scheme in the work of He et al. [
51] selected and sent high-energy coefficients as the metadata to reduce the energy of the analog-modulated symbols. As a result, a higher transmission power can be assigned to the remaining low-energy coefficients to improve the received quality. Here, determining the high-energy coefficients for each GoP is computationally complex owing to the use of an exhaustive search. To reduce the computational complexity, Trioux et al. [
52] adopted a zigzag scan to select the side information. Other studies [
53,
54,
55,
56,
57,
58] divided the video into BL and ELs, which were coded and sent in digital and pseudo-analog ways, respectively. For example, the BL in gradient-based image SoftCast (G-Cast) [
57] sent the DC and low-frequency coefficients of the image, whereas the EL extracted and sent an image gradient, which represents the edge portion of the image, using a gradient transform. The receiver then created a final estimation of the image via a gradient-based reconstruction procedure, utilizing both the image gradient at the EL and the low-frequency coefficients provided by the BL.
Other solutions adopted a nonlinear encoder and decoder for source signals to decrease the ratio of the maximum to the minimum energy of the analog-modulated symbols. The typical solution for soft delivery is to introduce coset coding [
59,
60], which is a typical technique in distributed source coding. Coset coding partitions the set of possible source values into several cosets and transmits the coset residual codes to the receiver. With the received coset codes and the predictor, the receiver can recover the source value in the coset by choosing the one closest to the predictor. DCast [
61,
62,
63,
64] first introduced coset coding for the soft delivery of inter frames. The coset coding in DCast divides each frequency domain coefficient
\(s_i\) by a coset step
q and obtains the coset residual code
\(l_i\) as
\(l_i = s_i - \lfloor \frac{s_i}{q} + \frac{1}{2} \rfloor q\), where
\(\lfloor \frac{s_i}{q} + \frac{1}{2} \rfloor\) represents the coset index. In this way, the sender only needs to transmit the coset residual code for energy compaction. At the receiver side, with the received coset residual code
\(\hat{l}_i\) and the side information
\(\bar{s}_i\) (i.e., the predicted DCT coefficient obtained from the reference video frame), the receiver reconstructs the DCT coefficients by coset decoding. Given the coset residual code
\(l_i\), there are multiple possible reconstructions of
\(s_i\) that form a coset
\(C = \lbrace \hat{l}_i, \hat{l}_i \pm q, \hat{l}_i \pm 2q, \hat{l}_i \pm 3q, \ldots \rbrace\). DCast then selects the element of the coset
C that is nearest to the side information
\(\bar{s}_i\) as the reconstruction of the DCT coefficient. In this case, the value of each coset step
q is crucial for the coding performance of DCast. The value of
q is calculated by estimating the noise at the receiver end, as shown in the work of Fan et al. [
63,
64]. However, the reconstruction quality of DCast also depends on the side information quality. If the side information
\(\bar{s}_i\) is error prone, the receiver may make wrong decisions with a smaller
q. Huang et al. [
65] introduced a side information refinement algorithm [
66] to refine the side information for the quality enhancement of DCast.
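The coset encoding and decoding steps above can be sketched as follows; the coefficient values and coset step are hypothetical, and recovery is exact only while the side information stays within \(q/2\) of the true coefficient:

```python
import numpy as np

def coset_encode(s, q):
    """Coset residual l = s - floor(s/q + 1/2) * q, i.e. s reduced modulo the coset step q."""
    return s - np.floor(s / q + 0.5) * q

def coset_decode(l_hat, s_bar, q):
    """Pick the coset member l_hat + n*q that is closest to the side information s_bar."""
    n = np.floor((s_bar - l_hat) / q + 0.5)
    return l_hat + n * q

s = np.array([123.0, -47.0, 8.0])       # hypothetical DCT coefficients
q = 16.0                                # coset step
l = coset_encode(s, q)                  # residuals have magnitude at most q/2
s_bar = s + np.array([3.0, -5.0, 2.0])  # predictor within q/2 of the truth
s_rec = coset_decode(l, s_bar, q)       # recovers s exactly in this regime
```

The energy compaction comes from the residuals being bounded by \(q/2\) regardless of the coefficient magnitude; the trade-off discussed above is that a smaller \(q\) shrinks the residuals but makes decoding more sensitive to side-information errors.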
The concept of coset coding has been widely applied in other studies on soft delivery for the same purpose. For example, several works [
67,
68,
69,
70,
71] utilized pseudo-coset coding for the lower-frequency components and sent the coset index using the digital framework, while the residuals of the lowest-frequency components and the other frequency components are sent using pseudo-analog modulation. The main difference between coset coding and pseudo-coset coding is that the latter sends the coset index as additional metadata. Layered coset coding and adaptive coset coding were applied to soft delivery in the work of Fan et al. [
69] and Lv et al. [
70], respectively. LayerCast [
69] introduced layered coset coding to simultaneously accommodate heterogeneous users with diverse SNRs and bandwidths. The layered coset coding used large to small coset steps to obtain coarse to fine layers from each chunk. The coarse layer (i.e., BL) is sufficient to reconstruct a low-quality DCT chunk for narrowband users, whereas each fine layer (i.e., EL) provides refinement information of the DCT chunk for wideband users. Some works [
72,
73,
74] utilized coset coding for cooperative soft delivery systems (i.e., three-node relay networks). A sender broadcasts the DCT coefficients obtained from the video frames to the relay node and the destination node using pseudo-analog modulation. If the channel quality between the sender and the destination node is higher than a threshold, the destination node reconstructs the video frames directly from the softly delivered DCT coefficients. Otherwise, the relay node sends the coset residual code to the destination node, which then reconstructs the video frames using the received coset residual code and the side information obtained from the softly delivered DCT coefficients from the sender.
4.2 Channel-Aware and Perception-Aware Power Allocation
As mentioned in Section
3.2, the power allocation in SoftCast minimizes the MSE between the original and reconstructed video signals over
Additive White Gaussian Noise (AWGN) channels. There are two drawbacks to adopting SoftCast in practical scenarios: (1) practical wireless channels have more complex characteristics (e.g., fading caused by multipath, and impulse noise) than AWGN channels, and (2) MSE is not an effective index of the perceptual fidelity of images/videos. To address these power allocation drawbacks, the existing studies in Table
4 propose the power allocation for practical wireless channels and perceptual considerations.
For the first drawback, the existing studies redesigned the power allocation for practical wireless channels, including fading [
75] and frequency-selective fading (i.e.,
Orthogonal Frequency-Division Multiplexing (OFDM)) [
76,
77], impulse noise [
84],
Multiple-Input and Multiple-Output (MIMO) [
78,
79,
80], and MIMO-OFDM channels [
81,
82,
83]. Cui et al. [
75] designed an optimal power allocation for fading channels. In fading channels, a fading effect (i.e., multiplicative noise) will degrade the reconstruction quality. Although SoftCast assumes that multiplicative noise can be canceled with exact channel estimation at the receiver end, no algorithm can guarantee an error-free channel estimation. In addition to the power allocation design, the authors analyzed the effect of the channel estimation error on the reconstruction quality at the receiver end.
For frequency-selective fading channels, such as OFDM and MIMO-OFDM channels, the key issue is how to match the analog-modulated symbols to the independent subcarriers/subchannels for high-quality image/video reconstruction. Liu et al. [
81,
82] observed similarities between the source and channel characteristics and exploited the similarities for subcarrier/subchannel matching. ParCast [
81] and the extended version of ParCast
\(+\) [
82] assigned the more important DCT coefficients to higher gain channel components and allocated power weights for each DCT coefficient with joint consideration of the source and channel for video unicast systems. ECast [
83] extended the source and channel matching and power allocation to video multicast systems. For multicast systems, it is necessary to deal with the large overhead of channel feedback from multiple receivers. In ECast, multiple users simultaneously send tone signals for channel feedback, and the sender receives the superposition of the tone signals. Although the sender cannot distinguish the individual channel gains, the weighted harmonic mean of the channel gains can be obtained from the superposed tone signals. ECast then utilizes this aggregate channel gain for source and channel matching and power allocation.
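The sorted source-channel matching principle used by ParCast-style schemes can be sketched as follows; the chunk variances and subcarrier gains below are hypothetical:

```python
import numpy as np

def match_chunks_to_subcarriers(chunk_vars, channel_gains):
    """Sorted matching: the highest-variance chunk rides the highest-gain subcarrier."""
    chunk_order = np.argsort(chunk_vars)[::-1]    # chunks, most important first
    gain_order = np.argsort(channel_gains)[::-1]  # subcarriers, strongest first
    assignment = np.empty(len(chunk_vars), dtype=int)
    assignment[chunk_order] = gain_order          # assignment[i] = subcarrier for chunk i
    return assignment

chunk_vars = np.array([10.0, 80.0, 5.0, 40.0])
gains = np.array([0.2, 1.5, 0.9, 0.4])
a = match_chunks_to_subcarriers(chunk_vars, gains)
# Chunk 1 (variance 80) is carried on subcarrier 1 (gain 1.5), and so on down the ranking.
```

The full schemes additionally optimize the per-chunk power weights jointly with this matching; the sketch shows only the pairing step.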
Other studies solved power allocation problems in modern wireless systems, including
Non-Orthogonal Multiple Access (NOMA) [
85,
86], underwater acoustic OFDM [
87],
Unmanned Aerial Vehicle (UAV)-enabled [
88], and mmWave lens MIMO systems [
89]. For example, in NOMA systems, source signals are coded into the BL and ELs and then transmitted simultaneously through superposition coding. With successive interference cancellation, near users with strong channel gains can decode both BL and EL signals, whereas far users with weak channel gains may only decode BL signals. In the existing studies, both the BL and ELs are analog coded in the work of Jiang et al. [
85], whereas the BL and ELs are digital coded and analog coded, respectively, in the work of Wu et al. [
86]. They solved the power allocation across the BL and ELs to minimize the distortion for all receivers with heterogeneous channel conditions. In underwater acoustic OFDM [
87] and mmWave lens MIMO systems [
89], the error behavior differs substantially across the channel components, showing a tendency similar to that of frequency-selective fading channels. These studies solved the source and channel matching and power allocation problems, as discussed above for frequency-selective fading channels, to minimize the distortion at the receiver end.
For the second drawback, some studies [
90,
91,
92,
93] also redesigned the power allocation with perceptual considerations, including
Structural Similarity (SSIM) [
90], foveation [
91], and saliency [
92]. In these studies, determining the perception-aware weights for each source component is challenging. Specifically, in SoftCast, the scaling factor for each coefficient is obtained from its power information to minimize the MSE:
\(g_i \propto \lambda_i^{-1/4}\). These studies considered the perception-aware weight for the
ith coefficient
\(w_i\) in the scaling factor to minimize the perceptual distortion as
\(g_i \propto w_i^{1/4} \lambda_i^{-1/4}\). For this purpose, Zhao et al. [
90] demonstrated the relationship between the MSE in the DCT coefficients and the SSIM distortion to obtain the weight for the
ith DCT coefficients of all chunks
\(w_i\). They found that the weight for the high-frequency coefficients was larger than that for the low-frequency coefficients, which was consistent with the characteristics of the
Human Visual System (HVS). FoveaCast [
91] introduced the foveation-based HVS [
94] and the corresponding HVS-based visual perceptual quality metric, called
Foveated Weighted Distortion (FWD), for the optimization objective. For a given foveation point
\((f_x, f_y)\) in the pixel and frequency domains, the error sensitivity for each pixel/frequency coefficient at location
\((x, y)\) can be defined in the foveation-based HVS. FoveaCast regarded the error sensitivity in the DWT domains as the weight
\(w_i\) and performed foveation-aware power allocation. In the work of Hadizadeh [
92], visual saliency maps were introduced for the perception-aware power allocation. Saliency maps represent the attended regions in an image when a user watches the image owing to the visual attention mechanism of the human brain. In this case, the weight for the
ith pixel
\(w_i\) is based on the normalized visual saliency defined from any arbitrary visual saliency model, such as the Itti–Koch–Niebur model [
95]. Based on these weights, the scheme allocates greater transmission power to salient regions to minimize the
Eye-Tracking Weighted Mean Square Error (EQMSE).
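The scaling rules in this subsection, \(g_i \propto \lambda_i^{-1/4}\) for MSE and \(g_i \propto w_i^{1/4}\lambda_i^{-1/4}\) with perceptual weights, can be sketched numerically; the chunk variances and the total-power normalization below are hypothetical:

```python
import numpy as np

def scaling_factors(lam, w=None, total_power=1.0):
    """Scaling g_i ∝ w_i^{1/4} * λ_i^{-1/4}, normalized so that the
    expected transmit power sum_i g_i^2 * λ_i meets the power budget."""
    w = np.ones_like(lam) if w is None else w
    g = (w ** 0.25) * (lam ** -0.25)
    g *= np.sqrt(total_power / np.sum(g ** 2 * lam))
    return g

lam = np.array([100.0, 25.0, 4.0, 1.0])  # hypothetical chunk variances
g = scaling_factors(lam)
p = g ** 2 * lam  # per-chunk transmit power ends up proportional to sqrt(λ_i)
```

With unit weights this reproduces the MSE-optimal SoftCast allocation; perception-aware schemes simply feed in non-uniform \(w_i\) derived from SSIM, foveation, or saliency models.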
4.3 Bandwidth Utilization
The source bandwidth of soft delivery schemes depends on the number of analog-modulated symbols transmitted per second (i.e., the baud rate). The aforementioned designs mainly assume that the channel bandwidth is sufficient to send all non-zero analog-modulated symbols over the wireless medium. However, when the channel bandwidth is lower than the source bandwidth, some analog-modulated symbols must be discarded at the sender side. Hence, the loss of important coefficients (i.e., the low-frequency coefficients) may have a significant impact on the reconstruction quality. Specifically, the expected distortions of soft delivery schemes for single and multiple content under both the bandwidth and transmission power constraints are discussed in the work of Liu et al. [
111] and He et al. [
112,
113], respectively. Some existing studies have adopted different techniques listed in Table
5 to meet the bandwidth constraint. The typical method is to selectively discard the chunks in higher-frequency components to fill the bandwidth [
11,
96]. When the sender discards some chunks, the receiver regards all coefficients in the discarded chunks as zeros. SoftCast therefore signals the locations of the discarded chunks to the receiver as a bitmap. Although SoftCast assumes equal-size chunks across low- to high-frequency components, Li et al. [
96] adopted smaller chunk sizes in high-frequency components to realize a fine-grained control to meet the bandwidth limitation. Another study [
97] used bandwidth-reducing
Shannon–Kotelnikov (SK) mappings to increase the number of chunks transmitted over bandwidth-constrained channels. The SK mappings are typical
N:1 bandwidth-reducing or 1:
M bandwidth-expanding non-linear mappings. In this study, 2:1 SK mappings were used to encode several pairs of chunks with less energy to send more chunks with medium energy within the channel bandwidth.
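Chunk discarding with a location bitmap, as described above, can be sketched as follows; the chunk variances and bandwidth budget are hypothetical:

```python
import numpy as np

def select_chunks(chunk_vars, budget):
    """Keep the `budget` highest-variance chunks; the rest are discarded, and
    their locations are signaled to the receiver as a bitmap."""
    order = np.argsort(chunk_vars)[::-1]
    bitmap = np.zeros(len(chunk_vars), dtype=bool)
    bitmap[order[:budget]] = True
    return bitmap

chunk_vars = np.array([900.0, 400.0, 90.0, 40.0, 9.0, 4.0, 0.9, 0.4])
bitmap = select_chunks(chunk_vars, budget=5)
# The receiver treats every coefficient in a discarded (False) chunk as zero.
```

Because natural-image variances decay toward high frequencies, the discarded chunks are usually the high-frequency ones, which is why smaller chunk sizes there enable finer-grained rate control.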
Other studies [
98,
99,
100,
101,
102,
103,
104,
105] introduced
Compressive Sensing (CS) techniques [
114,
115] for soft delivery over bandwidth-constrained wireless channels. Notably, CS is a sampling paradigm that allows the simultaneous measurement and compression of signals that are sparse or compressible in some domains. In general, recovering source signals from compressed signals is impossible because the system is underdetermined. However, if the source signals are sufficiently sparse in some domains, CS theory indicates that the source signals can be reconstructed from the compressed signals by solving the
\(\ell _1\) minimization problem. The advantage of CS-based soft delivery is the recovery of chunks in high-frequency coefficients using CS-based signal reconstruction algorithms, such as approximate message passing and iterative thresholding, even though the chunks are discarded at the sender’s end. For high-quality reconstruction, adaptive rate control and reconstruction algorithms are mainly adopted for CS-based soft delivery. For instance, Yami and Hadizadeh [
100] adaptively controlled the compression rate based on visual attention (i.e., both the texture complexity and visual saliency) to satisfy the bandwidth constraint while maintaining better perceptual quality. Liu et al. [
104] adaptively selected reliable columns from the measurement matrix and compressed source signals using the selected columns. In view of the reconstruction algorithm, Hadizadeh and Bajic [
101] designed an adaptive transform for noisy measurement signals to obtain sparser transform coefficients for clean reconstruction. Yin et al. [
102] and Tung and Gunduz [
103] designed grouping methods for measurement signals to utilize the similarity between video frames for the reconstruction.
Other studies utilized stored images/videos on the cloud to reduce the bandwidth requirement in soft delivery. Specifically, data-assisted communications of mobile images (DAC-Mobi) [
106], data-assisted cloud radio access network (DaC-RAN) [
107], and knowledge-enhanced mobile video broadcasting (KMV-Cast) schemes [
108,
109,
110], which are referred to as data-assisted soft delivery schemes, have been proposed for high-quality image/video transmission. The main contributions of the data-assisted soft delivery schemes are (1) a sender sends a limited number of analog-modulated symbols and (2) the receiver reconstructs images/videos using correlated images (i.e., side information) obtained from a cloud.
In DAC-Mobi [
106], successive coset encoders were introduced to divide the DCT coefficients into three layers of bit planes:
Most Significant Bits (MSBs) in low-frequency coefficients, MSBs in other frequency coefficients and middle bits, and
Least Significant Bits (LSBs). Here, MSBs in low-frequency coefficients and LSBs were transmitted to the receiver in digital and pseudo-analog manners, respectively, whereas MSBs in other frequency coefficients and middle bits were discarded. Based on the received MSB in the low-frequency coefficients, the receiver reconstructs a down-sampled image to retrieve correlated images in the cloud. The retrieved correlated images were used as side information to resolve ambiguity due to discarded bits and reconstruct the entire image. DaC-RAN [
107] and the extended version of KMV-Cast [
108,
109,
110] adopted Bayesian reconstruction algorithms that utilize correlated images/videos in the cloud as prior information to reduce the required bandwidth for soft delivery. The main difference between the DaC-RAN and KMV-Cast schemes is that the former assumes that the same images/videos exist in the cloud, whereas the latter does not require that the same images/videos exist at the receiver end by designing prior knowledge broadcasting in a digital manner.
The aforementioned studies considered the channel bandwidth to be lower than the source bandwidth. If the channel bandwidth is greater than the source bandwidth, the soft delivery schemes become less efficient. In this case, the soft delivery schemes utilize the extra bandwidth by retransmission. Lin et al. [
116] and Tan et al. [
117] designed an analog channel coding to use the extra channel bandwidth for quality enhancement. For example, Tan et al. [
117] proposed a chaotic function-based analog encoding [
118] for soft delivery. Because existing chaotic function-based analog coding is designed for uniformly distributed sources, applying it to Gaussian-distributed sources significantly amplifies the source signals and thus consumes unnecessary transmission power. They therefore designed a chaotic map function for Gaussian-distributed source signals to prevent power increments relative to the input power. MCast [
119] also utilized extra bandwidth for quality improvement. As mentioned earlier, the sender can send the source data multiple times if extra bandwidth is available. The key issue is then how to utilize the extra time slots for quality improvement. To this end, MCast optimized the assignment of the chunks of the DCT coefficients to the available channels in multiple time slots to fully exploit the time and frequency diversities.
In contrast to the aforementioned studies, Lan et al. [120] and He et al. [121] dealt with bandwidth variations. When the available bandwidth is less than the bandwidth expected at the sender's end, some important chunks may not be transmitted before the playback deadline. They therefore grouped several chunks into tiles and sent large-variance tiles with high priority to dispatch important coefficients before the playback deadline.
4.4 Packet Loss Resilience
Even when the channel bandwidth is sufficient to send all non-zero analog-modulated symbols, some of the symbols can be lost owing to loss-prone wireless channels. Specifically, packet loss caused by strong fading and interference may have a significant impact on the reconstruction quality if important chunks and coefficients are lost. SoftCast uses a Walsh–Hadamard transform to redistribute the energy of the source signals across all packets for resilience against packet loss. However, each packet still contains a large amount of energy, and thus the degradation owing to packet losses remains considerable.
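The Walsh–Hadamard energy redistribution can be illustrated with a toy sketch; the eight-symbol input with energy concentrated in one position is hypothetical:

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh–Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n) / np.sqrt(n)  # orthonormal, so total energy is preserved
x = np.array([100.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.2, 0.2])  # energy piled on one symbol
y = H @ x                     # energy spread nearly evenly across all positions
# Packetizing y instead of x makes every packet carry a comparable share of energy.
```

Because the transform is orthonormal, the receiver inverts it with \(H^{\mathsf T}\); the per-packet energy becomes roughly uniform, but, as noted above, each packet still carries substantial energy, so losses remain costly.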
To maintain better reconstruction quality in error-prone wireless channels, some related studies [
122,
123,
124] have introduced CS techniques (i.e., block-wise CS [
125]) for packet loss resilience. The CS technique is suitable for wireless transmission with random packet loss owing to its random measurement. Random measurement considers all packets as of equal importance. In contrast to typical CS techniques, block-wise CS can reduce the storage and computational costs of the reconstruction. A pioneering work on packet loss resilience is the distributed compressed sensing-based multicast scheme (DCS-Cast) [
122]. In DCS-Cast, each image is first divided into blocks, and the coefficients in each block are randomized using the same measurement matrix across the blocks. One coefficient from every block is placed in each packet to equalize the importance across packets. Even though some packets may be lost over loss-prone wireless channels, the receiver obtains noisy pixel values using the same measurement matrix as the sender and reconstructs the lost pixel values using a CS reconstruction algorithm in the DCT/DWT domains. Because the lost pixel values can be recovered by the reconstruction algorithm, DCS-Cast maintains high image/video quality in loss-prone channels. To further improve the reconstruction quality, multi-scale [
123] and adaptive [
124] block-wise CS algorithms have been adopted for soft delivery. The multi-scale block-wise CS algorithm [
123] decomposes each video frame via a multi-level 2D-DWT and then optimizes the sampling rate for each DWT level according to its importance. In contrast, the adaptive block-wise CS algorithm [
124] divides several video frames into one reference frame and subsequent non-reference frames and adaptively determines whether direct or predictive sampling should be used for each block in a non-reference frame. Direct sampling randomizes the signals in the block, whereas predictive sampling calculates the residuals between the blocks in the reference and non-reference frames and randomizes residuals to utilize the inter-frame similarity for the reconstruction.
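The one-measurement-per-block packetization used by DCS-Cast can be sketched as follows; the block and measurement counts are hypothetical:

```python
import numpy as np

def dcs_packetize(measurements):
    """measurements has shape (n_blocks, n_meas); packet k carries the k-th
    measurement of every block, so all packets are equally important."""
    return measurements.T.copy()  # shape (n_meas, n_blocks)

rng = np.random.default_rng(0)
blocks = rng.standard_normal((16, 8))  # 16 blocks, 8 CS measurements each
packets = dcs_packetize(blocks)
# Losing one packet removes exactly one measurement from every block instead
# of wiping out any single block, which the CS reconstruction can absorb.
received = np.delete(packets, 3, axis=0)  # e.g., packet 3 is lost
```

This interleaving is what makes random packet loss look like a uniformly reduced sampling rate to the CS decoder, rather than a catastrophic loss of one region.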
4.5 Overhead Reduction
In soft delivery schemes without chunk division, a sender needs to let the receiver know the power information of all the DCT coefficients to demodulate the signals. For the receiver to carry out the MMSE filtering in Equation (
5), the sender needs to transmit
\({\lambda }_{i}\) of all coefficients without errors as metadata, which may constitute a large overhead. For example, when the sender transmits eight video frames with a resolution of
\(352\times 288\), the sender needs to transmit metadata for all DCT coefficients (i.e.,
\(352\times 288\times 8 = 811{,}008\) variables in total) to the receiver. This overhead may induce performance degradation owing to the rate and power losses in the transmission of analog-modulated symbols. To reduce the overhead, SoftCast divides the DCT coefficients into chunks and carries out chunk-wise power allocation using an MMSE filter. However, the overhead remains high, and chunk division itself causes performance degradation owing to suboptimal power allocation.
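The MMSE filtering that requires this metadata can be illustrated with a minimal sketch; the filter below is the standard linear MMSE estimator for \(y_i = g_i s_i + n_i\), consistent with the role of Equation (5), and the chunk statistics are hypothetical:

```python
import numpy as np

def mmse_decode(y, g, lam, sigma2):
    """LMMSE estimate of s from y = g*s + n with Var(s) = lam, Var(n) = sigma2.
    Computing this filter is why the receiver needs lam as metadata."""
    return (g * lam) / (g ** 2 * lam + sigma2) * y

rng = np.random.default_rng(0)
lam, g, sigma2 = 10.0, 0.5, 4.0  # hypothetical chunk variance, scaling, noise power
s = rng.normal(scale=np.sqrt(lam), size=10000)
y = g * s + rng.normal(scale=np.sqrt(sigma2), size=10000)
s_hat = mmse_decode(y, g, lam, sigma2)
mse_mmse = np.mean((s_hat - s) ** 2)  # close to lam*sigma2 / (g^2*lam + sigma2)
mse_zf = np.mean((y / g - s) ** 2)    # zero-forcing baseline, close to sigma2 / g^2
```

The gap between the MMSE and zero-forcing errors is what an erroneous or missing \(\lambda_i\) gives up, which is why accurate yet compact metadata matters.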
To achieve better quality under a low overhead requirement, the related studies can be classified into two types, as shown in Figure
6: sender-side overhead reduction and receiver-side overhead reduction. Studies on the sender-side overhead reduction [
126,
127,
128,
129] designed fitting functions to obtain the power information with fewer parameters. In this case, the sender and receiver share the same fitting function in advance and send the parameters as metadata for overhead reduction. Specifically, Song et al. [
126] designed a fitting function with four parameters for each chunk, and Xiong et al. [
127] designed a log-linear function with two parameters for each chunk. Another study [
128] found that equal-size chunk division was not suitable for chunk-wise fitting, and thus an adaptive chunk division (i.e., L-shaped chunk division) was designed for an accurate fitting. In addition, Fujihashi et al. [
129] exploited a Lorentzian fitting function with seven parameters based on a
Gaussian Markov Random Field (GMRF) for each GoP. These sender-side approaches accurately approximate the metadata using a fitting function with a limited number of parameters (i.e., low overhead), at the cost of the additional computation required for fitting.
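A two-parameter log-linear fit in the spirit of these sender-side schemes can be sketched as follows; the exponentially decaying chunk variances are synthetic, and only the two fitted parameters would be sent as metadata:

```python
import numpy as np

def fit_loglinear(lam):
    """Fit log(lam_i) ~ a + b*i; only (a, b) need to be sent as metadata."""
    i = np.arange(len(lam))
    b, a = np.polyfit(i, np.log(lam), 1)  # polyfit returns highest degree first
    return a, b

def predict_loglinear(a, b, n):
    return np.exp(a + b * np.arange(n))

# Synthetic chunk variances: exponential decay with mild multiplicative noise.
rng = np.random.default_rng(0)
lam = 1000.0 * np.exp(-0.3 * np.arange(16)) * rng.uniform(0.8, 1.2, 16)
a, b = fit_loglinear(lam)
lam_hat = predict_loglinear(a, b, 16)  # receiver-side reconstruction of the metadata
```

Two floats replace sixteen per-chunk variances here; the residual fitting error is exactly the modeling-accuracy effect analyzed in the work of Xiong et al. [127].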
Studies on receiver-side overhead reduction [
130,
131] estimate the power information only from the received signals without any additional computational cost at the sender side. The study of Li et al. [
130] is a pioneering work that estimates the power information directly from the received signals. Blind data detection [
131] was proposed to decode the received analog-modulated symbols without the power information at the receiver. Specifically, blind data detection uses a zero-forcing estimator and the sign of the received signals to approximate the source signals. One typical issue with receiver-side overhead reduction is that the reconstruction quality depends heavily on the quality of the received signals.
We note that both types of overhead reduction cause quality degradation owing to estimation errors. In the work of Xiong et al. [
127], the effect of modeling accuracy on the reconstruction quality in soft delivery was analyzed.
4.6 Implementation
The aforementioned studies mainly discussed performance improvements in theoretical analyses and simulations. Table
6 lists the categories of the existing studies in terms of the performance evaluation. Existing studies [
41,
45,
81,
82,
121,
134] used the software-defined radio platform SORA [
141] to carry out emulations. In contrast to simulations, emulations use channel fading and noise traces obtained from SORA to evaluate performance under real wireless environments.
Some studies implemented a soft delivery scheme on a software-defined radio platform [
13,
103,
117,
135] and a
Field-Programmable Gate Array (FPGA) [
136,
137,
138] to empirically demonstrate the benefits of soft delivery in practical wireless channels. In some works [
13,
103,
135], the authors used
Universal Software Radio Peripheral (USRP) 2, USRP NI2900, and USRP X310 devices, respectively, together with GNU Radio, and evaluated the visual quality of soft delivery. In addition, Tan et al. [
117] built an experimental system based on the
OpenAirInterface (OAI) platform and self-developed
Software Universal Platform (SOUP). They migrated OAI to the self-developed SOUP software-defined radio and implemented the proposed scheme based on the OAI eMBMS code. Meanwhile, other works [
136,
137,
138] exploited the Xilinx Virtex7 FPGA for implementation and tested the reconstruction quality as a function of wireless channel SNRs.
Other studies [
139,
140] implemented soft delivery on the prototypes of multi-user MIMO (MU-MIMO) and
Long-Term Evolution (LTE) systems. For example, in the work of Chen et al. [
139], SoftCast is implemented on BUSH, which is a large-scale MU-MIMO prototype that performs scalable beam user selection with hybrid beamforming for phased-array antennas in legacy WLANs. They performed experiments to evaluate the video quality in terms of
Peak Signal-to-Noise Ratio (PSNR) and SSIM over a lossy MU-MIMO channel.
4.7 Extension for Immersive Experiences
SoftCast and other soft delivery schemes mentioned in the previous sections were designed for conventional images and video signals. In modern wireless and mobile communication scenarios, the streaming of immersive content will be a key application for reconstructing 3D perceptual scenes that provide full parallax and depth information for human eyes. The immersive content can be applied to various applications, such as 3 to 6 degrees-of-freedom entertainment, remote device operation, medical imaging, vehicular perception, VR/AR/MR, and simulated training. The typical immersive content includes free viewpoint video [
142,
143,
144], 360-degree video, and point cloud [
145], and Table
7 lists their features. Even for immersive content, the video frames are compressed in a digital manner, and the compressed bitstream is then channel coded and modulated in sequence. This means that cliff and leveling effects still occur in streaming immersive content owing to variations in channel conditions. To prevent cliff, leveling, and staircase effects, some studies have extended soft delivery schemes to immersive content for future wireless multimedia services.
One of the key advantages of soft delivery schemes for immersive content is that they simplify the optimization problem for image and video quality maximization. In immersive content delivery, the main issue is to maximize the image and video quality considering the user’s perspective. For example, the view synthesis distortion optimization problem and the viewport optimization problem should be solved in free viewpoint video and 360-degree video, respectively. In digital-based delivery schemes, a sender needs to find the best bit and transmission power allocation for video frames. However, it is often cumbersome to derive a solution. Soft delivery schemes simplify the optimization problems by reformulating them into a simple power allocation problem since bit allocation for quantization is not required in soft delivery schemes.
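As a concrete illustration of this reformulation, consider the classic SoftCast-style scaling rule: minimizing the mean-squared error under a total transmission power budget yields per-chunk gains proportional to \(\lambda_i^{-1/4}\), where \(\lambda_i\) is the energy of the \(i\)-th transformed chunk. The following is a minimal sketch; the function name is ours, while the rule itself follows the soft delivery literature:

```python
import numpy as np

def power_allocation(lam, total_power):
    """MSE-optimal linear scaling under a power budget: g_i ∝ lam_i^(-1/4).

    lam: per-chunk energies (variances) of the transformed coefficients.
    Returns gains g satisfying the budget sum(g**2 * lam) == total_power.
    """
    lam = np.asarray(lam, dtype=float)
    c = np.sqrt(total_power / np.sum(np.sqrt(lam)))
    return c * lam ** -0.25

# Toy example: four transform chunks with decaying energies.
lam = np.array([100.0, 25.0, 4.0, 1.0])
g = power_allocation(lam, total_power=1.0)
print(np.sum(g**2 * lam))  # ≈ 1.0: the power budget is met
```

Extending such a closed-form allocation to, e.g., texture and depth chunks with viewpoint-dependent weights is the kind of quadratic problem the schemes discussed below solve, without any combinatorial bit allocation.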
4.7.1 Free Viewpoint Video.
Free viewpoint videos enable us to observe a 3D scene from freely switchable angles/viewpoints. A free viewpoint video is captured by numerous closely spaced RGB and infrared cameras arranged in arrays to record the texture and depth frames of a 3D scene, such as a football game. Even though the number of cameras deployed in the field is limited owing to physical constraints, the receiver can synthesize intermediate virtual viewpoints using rendering techniques (e.g., depth image-based rendering [
146,
147]) to obtain numerous switchable viewpoints. To synthesize intermediate virtual viewpoints using the rendering technique, the sender encodes and transmits the texture and depth frames of two or more adjacent viewpoints, the format of which is known as
Multi-View plus Depth (MVD) [
148].
For conventional MVD video streaming over wireless links, digital video compression for MVD video frames (e.g., MVC+D [
149] or 3D-AVC [
150]) fully exploits inter-camera and texture-depth redundancy for compression. In this case, the streaming schemes need to address view synthesis distortion in addition to cliff and leveling effects to yield better video quality even at synthesized virtual viewpoints. Specifically, the video quality of a virtual viewpoint is determined by the distortion of each texture and depth frame. In digital-based MVD schemes, this distortion depends on the bit and power assignments for each texture and depth frame, and achieving the best quality at a target virtual viewpoint through parameter optimization is often cumbersome because nonlinear quantization makes the problem combinatorial.
Some studies [
151,
152,
153,
154,
155] designed a soft delivery scheme for a free viewpoint video. Specifically, FreeCast [
151,
152] is the first soft delivery scheme for free viewpoint video. Because MVD video frames exhibit inter-view and texture-depth redundancy, FreeCast jointly transforms texture and depth frames using a 5D-DCT to exploit these correlations for energy compaction. In addition, FreeCast simplifies the view synthesis optimization by reformulating it as a simple power assignment problem, because bit allocation (i.e., quantization) is not required. The authors found that the power assignment problem for the texture and depth frames can be solved using a quadratic function to yield the best quality at the desired virtual viewpoint. Furthermore, FreeCast introduces a fitting function obtained from a multi-dimensional GMRF at the sender and receiver to represent the power information with few parameters, thereby reducing overhead.
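The joint multi-dimensional transform at the heart of FreeCast can be sketched with an off-the-shelf separable DCT. The tensor layout below (two view axes, one temporal axis, two spatial axes) and the low-frequency chunk sizes are our assumptions for illustration, not FreeCast's exact configuration:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Hypothetical smooth MVD tensor: (horiz. views, vert. views, time, height, width).
# Texture and depth frames would each form such a tensor.
shape = (4, 4, 8, 16, 16)
grids = np.meshgrid(*[np.linspace(0, 1, n) for n in shape], indexing="ij")
mvd = sum(np.cos(np.pi * g) for g in grids)  # smooth, hence highly correlated

# Joint separable DCT over all five axes decorrelates inter-view,
# temporal, and spatial redundancy in one shot.
coeffs = dctn(mvd, norm="ortho")

# Energy compaction: the low-frequency corner holds most of the energy,
# so a band-limited sender can discard high-frequency chunks.
low = coeffs[:2, :2, :4, :8, :8]
ratio = np.sum(low**2) / np.sum(coeffs**2)
print(f"{ratio:.3f}")  # close to 1.0 for correlated content

# The transform is invertible, so analog transmission of the (scaled)
# coefficients allows graceful, quantization-free reconstruction.
print(np.allclose(idctn(coeffs, norm="ortho"), mvd))  # True
```

The same pattern with a 3D-DCT applied per camera corresponds to the 3DV SoftCast design discussed next.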
3DV SoftCast [
154] focused on the view synthesis problem under per-camera 3D-DCT operations on texture and depth frames and designed a power allocation method to solve it. The main difference from FreeCast is that 3DV SoftCast performs a 3D-DCT for each camera and controls the transmission power to minimize view synthesis distortion, whereas FreeCast performs a 5D-DCT for better energy compaction of the analog-modulated symbols and uses multi-dimensional GMRF-based overhead reduction to reconstruct high-quality MVD frames in band-limited environments. Yang et al. [
155] designed a soft delivery scheme for depth video. They found that a block-based DCT outperforms a full-frame DCT on depth video because depth video has different characteristics from texture video. Although texture video still requires a different soft delivery scheme, the low-distortion depth video obtained by Yang et al. [
155] can provide better virtual viewpoint quality in free viewpoint video.
4.7.2 360-Degree Video.
360-degree video content builds a synthetic virtual environment that mimics the real world and with which users interact. Each user can watch 360-degree videos through a traditional computer-supported VR headset or an all-in-one headset (e.g., Oculus Go). When a user requests a 360-degree video, the sender transmits the 360-degree video frames, and the user plays back a part of each frame, referred to as the viewport, through the headset. 360-degree videos are mainly captured by an omnidirectional camera or a combination of multiple cameras and saved in a spherical format. Before transmission, the spherical frames are mapped onto a 2D plane using a projection method (e.g., equirectangular or cube map projection).
In 360-degree video streaming, the major issue is to yield better video quality in the user’s viewport by effectively reducing perceptual redundancy within 360-degree video frames. Because each user only watches the viewport via the headset at each time instance, excessive video traffic is created if the sender sends the full resolution of the 2D-projected video frames with an identical quantization parameter. One of the simplest methods to reduce perceptual redundancy is viewport-only streaming [
156]. During video playback, the user moves the viewing viewport according to head/eye movement. Based on this movement, the user requests a new viewport from the sender, which sends back the corresponding viewport. Because the sender transmits only one viewport at each time instant, viewport-only streaming mitigates the video traffic. However, the user needs to receive a new viewport from the sender at every viewport switch, which causes a long switching delay. A switching delay longer than approximately 10 ms may cause
simulator sickness [
157]. Owing to long delays over the standard Internet, it is difficult for viewport-only streaming schemes to satisfy this switching delay requirement. To prevent simulator sickness, conventional schemes [
158] divide 360-degree video frames into multiple tiles and independently encode them with different quantization parameters to yield better viewport quality within the bandwidth constraint.
Studies [
159,
160,
161,
162] on soft delivery schemes focus on the quality optimization of the user’s viewport in addition to cliff and leveling effect prevention. Fujihashi et al. [
159] presented the first scheme for viewport-aware soft 360-degree video delivery. According to the viewing viewport, the sender first adopts pixel-wise power allocation to reduce the perceptual redundancy in 360-degree video frames and then applies a combination of a 1D-DCT and a spherical wavelet transform for decorrelation, exploiting redundancy in the spherical and temporal domains. In the work of Zhao et al. [
160], OmniCast further incorporates the features of 360-degree videos into quality optimization. Specifically, the authors analyze, for each projection method, the relationship between distortion in the spherical domain and in the projected 2D domain, and design a power allocation that realizes the optimal quality in the 2D-projected 360-degree videos. 360Cast [
161] and the extended version 360Cast
\(+\) [
162] adopt viewport prediction based on linear regression and foveation-aware power allocation within the predicted viewport to further reduce the perceptual redundancy. They evaluate 360Cast
\(+\) with the existing digital-based schemes in terms of weighted-to-spherically uniform PSNR [
163]. Here, the digital-based schemes use HEVC Test Model 16.20 [
164] and the modulation format of BPSK. They found that 360Cast
\(+\) improves the average weighted-to-spherically uniform PSNR compared with the digital-based schemes by preventing the cliff effect in low-SNR regimes and by gradually improving the received video quality as the wireless channel quality improves.
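As an illustration of the viewport prediction step, a linear-regression predictor extrapolates the head orientation from recent samples. The function name and the (yaw, pitch) parameterization below are our assumptions for a minimal sketch, not the exact 360Cast\(+\) formulation:

```python
import numpy as np

def predict_viewport(history, horizon=1):
    """Predict the next viewport center by linear regression over past samples.

    history: array of shape (T, 2) holding past (yaw, pitch) angles in degrees.
    Returns the extrapolated (yaw, pitch) `horizon` steps ahead.
    """
    history = np.asarray(history, dtype=float)
    t = np.arange(len(history))
    # Fit yaw(t) and pitch(t) independently with a degree-1 polynomial.
    coeffs = [np.polyfit(t, history[:, k], deg=1) for k in range(2)]
    t_next = len(history) - 1 + horizon
    return np.array([np.polyval(c, t_next) for c in coeffs])

# Toy head-motion trace: panning right at 2 degrees/frame, constant pitch.
trace = np.column_stack([2.0 * np.arange(10), np.full(10, -5.0)])
print(predict_viewport(trace, horizon=3))  # ≈ [24., -5.]
```

The predicted viewport center then drives the foveation-aware power allocation: coefficients near the predicted gaze receive more transmission power than peripheral ones.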
4.7.3 Point Cloud.
Volumetric content delivery provides highly immersive experiences for users through XR devices. The point cloud [
145] is arguably the most popular volumetric data structure for representing 3D scenes and objects on holographic displays [
165,
166]. A point cloud typically consists of a set of 3D points, and each point is defined by 3D coordinates (i.e., (X, Y, Z)) and color attributes (i.e., (R, G, B)). In contrast to conventional 2D images and videos, 3D point cloud data are neither well aligned nor uniformly distributed in space.
The major challenge in volumetric delivery over wireless channels is how to efficiently compress and send the numerous, irregularly structured points of a 3D point cloud within a limited bandwidth. Several compression methods have been proposed for delivering 3D point cloud data. Specifically, Draco [
167] employs
kd-tree-based compression [
168], and the Point Cloud Library uses octree-based compression [
169,
170,
171]. To further reduce the amount of data traffic in point cloud delivery, two transform techniques have been proposed for energy compaction of the non-ordered and non-uniformly distributed signals: Fourier-based transform (e.g.,
Graph Fourier Transform (GFT)) and wavelet-based transform (e.g., region-adaptive Haar transform) [
172]. For example, recent studies used GFT for the color components [
173] and 3D coordinates [
174] of graph signals for signal decorrelation, followed by quantization and entropy coding to compress the decorrelated signals.
HoloCast [
175] is a pioneering work on soft 3D point cloud delivery over unstable wireless channels. Specifically, HoloCast regards 3D points as vertices of a graph with edges between nearby vertices to handle the irregular structure of the 3D points, motivated by the works of Rente et al. [
174] and Zhang et al. [
176]. HoloCast applies the GFT to such graph signals to exploit the underlying correlations among adjacent graph signals and directly transmits the linearly transformed graph signals over the channel as pseudo-analog modulation. We compared HoloCast with conventional digital-based delivery, which uses the digital point cloud compression of the Point Cloud Library [
169]. HoloCast gradually improves the reconstruction quality as the wireless channel quality improves. In addition, GFT-based HoloCast achieves better quality than DCT-based HoloCast.
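The GFT used by HoloCast can be sketched as follows: build a neighborhood graph over the 3D points, take the eigenvectors of its graph Laplacian as the transform basis, and project the color attributes onto that basis. The k-nearest-neighbor construction and Gaussian edge weights below are common choices in the graph signal processing literature, not necessarily HoloCast's exact parameters:

```python
import numpy as np

def gft_basis(points, k=4):
    """Graph Fourier Transform basis for a 3D point cloud.

    Builds a k-nearest-neighbor graph over the points, forms the combinatorial
    Laplacian L = D - W, and eigendecomposes it; the eigenvectors are the GFT
    basis, ordered from smooth (low graph frequency) to oscillatory.
    """
    n = len(points)
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:  # skip self (distance 0)
            w = np.exp(-d2[i, j])              # Gaussian edge weight
            W[i, j] = W[j, i] = w
    L = np.diag(W.sum(axis=1)) - W
    _, U = np.linalg.eigh(L)
    return U

rng = np.random.default_rng(1)
pts = rng.random((50, 3))          # toy point cloud: 50 points in 3D
colors = rng.random((50, 3))       # per-point RGB attributes
U = gft_basis(pts)
coeffs = U.T @ colors              # GFT of the color attributes
print(np.allclose(U @ coeffs, colors))  # True: orthonormal basis, exact inverse
```

In a soft delivery pipeline, the coefficients `coeffs` would be power-scaled and transmitted in a pseudo-analog manner rather than quantized and entropy coded.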
However, graph-based coding schemes must send the graph-based transform basis matrix used in the GFT as additional metadata for signal decoding. For example, the sender needs to send
\(N^2\) real elements of the graph-based transform basis matrix as the metadata when the number of 3D points is
N. In some works [
177,
178,
179], Givens rotation [
180,
181] was used for GFT basis matrix compression. Givens rotations selectively introduce zeros into the basis matrix, reducing it to an identity matrix that is fully described by a set of angle parameters. These angle parameters are quantized, either uniformly or non-uniformly, before metadata transmission to reduce the overhead. Evaluations showed that Givens rotation with uniform quantization reduces the overhead by up to 89.8% [
177] compared with HoloCast without overhead reduction. In addition, Givens rotation with non-uniform quantization further reduces the overhead by up to 28.6% [
178] compared with the uniform quantization.
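The Givens-rotation decomposition can be sketched as follows: zeroing the subdiagonal entries of the orthonormal basis matrix column by column reduces it to a signed identity, so an \(N \times N\) basis is fully described by \(N(N-1)/2\) angles (plus \(N\) signs) instead of \(N^2\) matrix elements. Quantizing the angles, which this sketch omits, is where the reported overhead savings come from:

```python
import numpy as np

def givens_angles(U):
    """Reduce an orthonormal basis U to a signed identity with Givens
    rotations, returning the angle parameters (i, j, theta)."""
    R = U.copy()
    n = R.shape[0]
    angles = []
    for j in range(n - 1):
        for i in range(j + 1, n):
            theta = np.arctan2(R[i, j], R[j, j])  # angle zeroing R[i, j]
            c, s = np.cos(theta), np.sin(theta)
            G = np.eye(n)
            G[j, j], G[j, i] = c, s
            G[i, j], G[i, i] = -s, c
            R = G @ R
            angles.append((i, j, theta))
    return angles, np.diag(R)  # R is now diagonal with entries ±1

def rebuild(angles, signs):
    """Receiver side: replay the rotations in reverse to recover U."""
    n = len(signs)
    U = np.diag(signs)
    for i, j, theta in reversed(angles):
        c, s = np.cos(theta), np.sin(theta)
        G = np.eye(n)
        G[j, j], G[j, i] = c, s
        G[i, j], G[i, i] = -s, c
        U = G.T @ U
    return U

# Toy GFT-like basis: eigenvectors of a random symmetric matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
_, U = np.linalg.eigh(A + A.T)
angles, signs = givens_angles(U)          # 6*5/2 = 15 angles
print(np.allclose(rebuild(angles, signs), U))  # True: lossless without quantization
```

With quantized angles, the reconstruction becomes approximate, trading basis fidelity against the metadata overhead discussed above.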