1 Introduction
The past two decades have witnessed the tremendous success of Artificial Neural Networks (ANNs) in various applications including computer vision [23], speech recognition [32], and natural language processing [20]. More recently, as an emerging and very promising direction in computer graphics, neural rendering successfully uses ANNs to learn scene representations and synthesize photorealistic images of a scene [48]. In particular, the seminal work of Neural Radiance Fields (NeRF) [30] uses volume rendering [29] to project scene representations into images and leads to an “explosion” of developments in the neural rendering field [49]. However, like other applications of ANNs [45], neural rendering has to process a huge number of model parameters and input data during the inference process, and thus incurs substantial energy consumption due to the large amount of computation, which poses a great challenge to the deployment of neural rendering on energy-constrained devices.
Many techniques have been proposed to improve the energy efficiency of neural networks. First, at the algorithm level, researchers have developed techniques such as pruning [14, 52], quantization [8], and knowledge distillation [19] to build light-weight ANN models in the training stage. Second, at the computing hardware level, customized neural network accelerators [7] have been developed and shown to achieve orders of magnitude higher energy efficiency than state-of-the-art Complementary Metal Oxide Semiconductor designs. Third, inspired by biological neurons [17], researchers devise energy-efficient Spiking Neural Networks (SNNs) as a new computing paradigm [18]. SNNs transfer information with a series of discrete binary spikes instead of the continuous activation values used in ANNs, which leads to more energy-efficient computation by substituting multiplication with addition in customized neuromorphic chips [1, 38]. SNNs have recently been applied to image classification tasks and have achieved high accuracy and high energy efficiency, even with deep network architectures (such as VGG and ResNet) and complex datasets (such as CIFAR-10 and ImageNet) [4, 9].
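To make the spike-based computation concrete, the following minimal sketch (our own illustration, not code from the cited works; names are hypothetical) simulates a single Integrate-and-Fire neuron driven by binary input spikes. Because the inputs are 0/1 spikes, the weighted sum reduces to adding up the weights of the inputs that fired, so no multiplications are needed.

```python
import numpy as np

def if_neuron(spike_train, weights, v_th=1.0):
    """Integrate-and-Fire neuron driven by binary input spikes.

    spike_train : (T, n_in) array of 0/1 input spikes over T time steps
    weights     : (n_in,)   synaptic weights
    Returns the binary output spike train of length T.
    """
    v, out = 0.0, []
    for spikes in spike_train:
        v += weights[spikes == 1].sum()   # accumulate: addition only, no multiplication
        if v >= v_th:                     # fire once the membrane potential crosses the threshold
            out.append(1)
            v -= v_th                     # one common reset choice: subtract the threshold
        else:
            out.append(0)
    return np.array(out)
```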
Due to the non-differentiable nature of the spikes in SNNs, the gradient-based back-propagation algorithms used in ANNs cannot be applied directly to train SNNs. Researchers have proposed three training methods for SNNs: the Spike Timing-Dependent Plasticity (STDP) rule-based learning method [42], the spike-based error back-propagation algorithm [25], and the ANN-to-SNN conversion method [5, 10, 21, 41]. The STDP rule updates the parameters of SNNs based on the order in which spikes arrive at the neurons [40]. The spike-based error back-propagation algorithm approximates the non-linear output function of the spiking neurons with a differentiable surrogate and then trains the SNNs using the gradient-based back-propagation algorithm [24, 33]. As illustrated in Figure 1, the conversion method converts the parameters of pre-trained ANNs to those of SNNs and incurs almost no additional computational overhead compared to the training process of ANNs.
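To illustrate the core idea behind the conversion (a simplified sketch under our own assumptions, not the exact procedure of any cited work), an IF layer driven for \(T\) time steps by the copied ANN weights produces firing rates that approximate the corresponding ReLU activations, provided the activations stay below the firing threshold:

```python
import numpy as np

def ann_layer(x, W, b):
    """ReLU layer of the source ANN."""
    return np.maximum(x @ W + b, 0.0)

def converted_if_layer(x_rate, W, b, T=256, v_th=1.0):
    """Simulate the converted IF layer for T time steps and return firing rates.

    The weights W and biases b are copied unchanged from the ANN; with rate-coded
    inputs, the firing rate approximates ReLU(x @ W + b) / v_th as long as the
    activation stays below v_th (hence the need for parameter normalization).
    """
    v = np.zeros_like(b)
    spike_count = np.zeros_like(b)
    for _ in range(T):
        v += x_rate @ W + b               # integrate the input current
        fired = v >= v_th
        spike_count += fired              # at most one spike per neuron per step
        v = np.where(fired, v - v_th, v)  # reset by subtraction
    return spike_count / T                # firing rate
```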
The aforementioned three training methods of SNNs have been successfully applied to image classification tasks. However, the STDP-based training method is only suitable for training shallow SNNs, and the accuracy of STDP-based SNNs on complex datasets such as CIFAR-10 is far lower than that of ANNs [11]. Besides, the spike-based error back-propagation algorithm introduces a surrogate function to perform the back-propagation process, which increases the complexity of the training process and limits its scalability to deep SNNs and complex datasets [24]. In contrast to these two methods, the ANN-to-SNN conversion method applies the scaled parameters of ANNs to SNNs [10], has excellent scalability, and yields the best-performing SNNs [4, 16].
In this article, we focus on developing energy-efficient spiking neural rendering using the ANN-to-SNN conversion method. Although SNNs have been studied thoroughly for image classification tasks, it is more challenging to implement SNNs for neural rendering. The conversion method establishes a proportional relationship between the firing rates of the spiking neurons in the converted SNN and the activation values of the analog neurons in the source ANN. In image classification tasks, the converted SNN only needs to ensure that the output value of the correct class is the largest, so the ANN-to-SNN conversion method can preserve a high classification accuracy. In neural rendering, however, the ANN outputs the numerical values of scene properties, and to guarantee the rendering performance, the converted SNN must output the same values as the source ANN. In addition, the network architecture used for neural rendering differs from ordinary classification-oriented ANNs: there are hidden layers whose inputs include extra inserted values in addition to the outputs of the previous layer (see Figure 2). Considering these two unique properties, the traditional ANN-to-SNN conversion method cannot be directly applied to yield SNNs with good rendering performance.
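As a minimal sketch of the second property (hypothetical names; this is our own simplification of the architecture in Figure 2), a hidden layer in the rendering MLP takes as input the previous layer's output concatenated with the re-injected encoding \(\gamma(\textbf{x})\):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def skip_input_layer(h_prev, gamma_x, W, b):
    """Hidden layer whose input is the previous layer's output concatenated
    with extra inserted values (here, the positional encoding gamma(x))."""
    return relu(np.concatenate([h_prev, gamma_x]) @ W + b)
```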
To overcome these two challenges, in this work, we first propose a precise output decoding scheme for SNNs through the mathematical analysis of the conversion process. We then customize the parameter normalization method for the special network architecture of neural rendering. Moreover, considering that the colors of the sampled points contribute variably to the color of the pixel, we apply an Early Termination Strategy (ETS) to reduce the energy consumption. Combining these three methods, we propose Spiking-NeRF, an energy-efficient neural rendering model based on SNNs.
The contributions of this article can be summarized as follows:
– Through analyzing the mathematical relationship between ANNs and converted SNNs, we propose a precise output decoding scheme for SNNs, since neural rendering requires precise numerical values of scene properties.
– We customize the parameter normalization method for the special network architecture of neural rendering and develop an ETS based on the color weights of the sampled points and the discrete property of spikes.
– We evaluate the performance of Spiking-NeRF on both realistic and synthetic scenes. Experimental results show that Spiking-NeRF achieves comparable rendering performance on all scenes and reduces energy consumption by up to \(2.27\times\) compared to the ANN-based NeRF.
The remainder of this article is organized as follows: In Section 2, we review the progress of research on SNNs based on the ANN-to-SNN conversion method. In Section 3, we first introduce the background of neural rendering techniques and then introduce the work of NeRF in detail. Afterward, in Section 4, we propose a precise output decoding scheme, a customized parameter normalization method, and an ETS to build the energy-efficient Spiking-NeRF. Finally, we present our experimental results in Section 5, followed by conclusions in Section 6.
2 Related Work
ANN-to-SNN conversion is a promising approach for training SNNs, and existing works on the ANN-to-SNN conversion method have yielded SNNs with high accuracy on image classification tasks. Cao et al. [5] first propose the ANN-to-SNN conversion method and apply it to image classification tasks. By mapping the parameters from ANNs to SNNs and replacing the Rectified Linear Unit (ReLU) activation with the Integrate-and-Fire (IF) model, the converted SNNs can achieve accuracy similar to ANNs while reducing the energy consumption by two orders of magnitude. To further reduce the accuracy loss, Diehl et al. [10] propose model-based and data-based parameter normalization methods, but their method results in intolerably low firing rates of the spiking neurons when applied to deep SNNs. Based on the model-based normalization method, Rueckauer et al. [41] present a robust parameter normalization method to improve the firing rates. Besides, they adopt the reset-by-subtraction method to reduce the conversion loss. Combining these two approaches, they successfully convert deep ANNs (such as VGG-16) to SNNs and report good performance of the converted SNNs on MNIST, CIFAR-10, and ImageNet. As deeper networks achieve better performance on image classification tasks, Han et al. [13] propose Residual Membrane Potential (RMP) neurons and convert ResNet-24 and ResNet-34 to SNNs. However, their method requires a large number of time steps for the converted SNNs to achieve accuracy comparable to ANNs. To reduce the number of time steps during the inference process of the converted SNNs, Deng and Gu [9] shift the initial membrane potential of every spiking neuron to increase its firing rate, and Ho and Chang [16] improve the data-based normalization method and propose Trainable Clipping Layers (TCL) to restrict the maximum activation value of each layer. Parameter normalization based on the trainable parameters in the TCL can greatly increase the firing rates of the spiking neurons and reduce the number of time steps required by the converted SNNs. Based on TCL, Bu et al. [4] optimize the initial potentials of spiking neurons, resulting in state-of-the-art classification accuracy. Furthermore, Liu et al. [27] propose an efficient conversion framework and achieve the fewest inference time steps for SNNs on image classification tasks.
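For concreteness, the following sketch outlines the data-based normalization idea in the spirit of [10, 41] (our own simplification; the function names and the percentile value are illustrative, not taken from those works). Each layer of the trained ReLU ANN is rescaled by a robust estimate of its maximum activation, recorded on a calibration set, so that the firing rates of the converted IF neurons stay in a usable range.

```python
import numpy as np

def normalize_parameters(weights, biases, layer_acts, percentile=99.9):
    """Data-based parameter normalization (sketch).

    weights, biases : per-layer parameters of the trained ReLU ANN
    layer_acts      : per-layer activations recorded on a calibration set
    Rescales layer l by lambda_{l-1} / lambda_l, where lambda_l is a robust
    (percentile-based) maximum of that layer's activations.
    """
    lam_prev = 1.0
    for l, acts in enumerate(layer_acts):
        lam = np.percentile(acts, percentile)    # robust per-layer maximum activation
        weights[l] = weights[l] * lam_prev / lam
        biases[l] = biases[l] / lam
        lam_prev = lam
    return weights, biases
```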
Moreover, some works have extended the applications of SNNs by exploiting the ANN-to-SNN conversion method. Kim et al. [21] introduce a channel-wise normalization method for convolutional neural networks and a signed spiking neuron with an imbalanced threshold for the Leaky-ReLU activation function, and then propose the first spike-based energy-efficient object detection model. Tan et al. [46] propose robust firing rates of spiking neurons to reduce the conversion loss and extend the application of SNNs to deep Q-networks. Despite these efforts, the applications of SNNs are still limited at present. In this article, we apply the ANN-to-SNN conversion method to develop an energy-efficient spiking neural rendering model.
3 Preliminary
Synthesizing photo-realistic images and videos is crucial in the field of computer graphics and has been a focus of research for decades. Traditional techniques generate synthetic images of scenes using rendering algorithms such as rasterization or ray tracing. However, these techniques typically require a significant amount of expensive manual effort to synthesize high-quality images [48]. With the remarkable development of artificial intelligence, the computer graphics community has tried to combine basic principles of computer graphics with machine learning to solve rendering problems, an approach known as neural rendering. The emerging neural rendering techniques use neural networks to learn the geometry, appearance, illumination, and other properties of a scene [49]. In comparison to traditional rendering techniques, neural rendering achieves state-of-the-art rendering performance using pre-trained neural network models and represents a leap forward toward the goal of synthesizing photo-realistic image and video content. In recent years, numerous studies have explored various ways of accomplishing rendering using learnable scene properties, contributing to tremendous progress in the field of neural rendering. Among them, the innovative work of NeRF [30] has made a breakthrough in novel view synthesis and led to a significant surge in developments of neural rendering.
Using a Multi-Layer Perceptron (MLP), NeRF represents a static scene as a continuous five-dimensional function [30]. The function \(F:(\textbf{x},\textbf{d})\rightarrow(\textbf{c},\sigma)\) regresses from a three-dimensional (3D) position coordinate \(\textbf{x}=(x,y,z)\) and a two-dimensional viewing direction \(\textbf{d}=(\theta,\phi)\) to an emitted color \(\textbf{c}=(r,g,b)\) and a volume density \(\sigma\).
Figure 2 illustrates an overview of NeRF’s scene representation and neural network architecture. A ray \(\textbf{r}(h)=\textbf{o}+h\textbf{d}\) is emitted from the camera’s center of projection \(\textbf{o}\) along the direction \(\textbf{d}\) and passes through one pixel on the image plane. When rendering the pixel, a set of points with distances \(\textbf{h}\) is sampled along the ray \(\textbf{r}(h)=\textbf{o}+h\textbf{d}\). For each point with \(h_{i}\in\textbf{h}\), its corresponding 3D position coordinates can be obtained by the camera ray \(\textbf{x}=\textbf{r}(h_{i})\). To enable the MLP to learn high-frequency details, NeRF separately preprocesses each of the three coordinate values in \(\textbf{x}\) and the two components of direction \(\textbf{d}\) with the following sinusoidal encoding function:
\[\gamma(p)=\big(\sin(2^{0}\pi p),\cos(2^{0}\pi p),\ldots,\sin(2^{L-1}\pi p),\cos(2^{L-1}\pi p)\big),\tag{1}\]
where \(L\) is an integer hyper-parameter. It is worth noting that in Figure 2, \(\gamma(\textbf{x})\) and \(\gamma(\textbf{d})\) are inserted into stage 2 and stage 3, respectively. This unique network architecture is widely adopted in the field of neural rendering to improve the learning performance [6, 30, 35]. To obtain the color of a certain pixel, the estimated colors and densities of the sampled points along the ray are used to approximate the volume rendering integral by numerical quadrature [29]:
\[\hat{C}(\textbf{r})=\sum_{i=1}^{N}w_{i}\textbf{c}_{i},\tag{2}\]
\[w_{i}=T_{i}\big(1-\exp(-\sigma_{i}\delta_{i})\big),\qquad T_{i}=\exp\Big(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\Big),\tag{3}\]
where \(\hat{C}(\textbf{r})\) is the estimated pixel color, \(N\) is the number of sampled points along the ray, \(\textbf{c}_{i}\) and \(\sigma_{i}\) are the estimated color and density of the \(i\)th point, \(\delta_{i}=h_{i+1}-h_{i}\) is the distance between two adjacent sampled points along the ray, and \(w_{i}\) evaluates the importance of \(\textbf{c}_{i}\) to the pixel.
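As a concrete reading of Equations (2) and (3), the following sketch (our own illustration; the array names are hypothetical) composites the estimated densities and colors of the samples along one ray into a pixel color:

```python
import numpy as np

def composite_ray(sigma, color, delta):
    """Numerical quadrature of the volume rendering integral along one ray.

    sigma : (N,)   estimated densities of the N sampled points
    color : (N, 3) estimated RGB colors of the N sampled points
    delta : (N,)   distances between adjacent sampled points
    Returns the estimated pixel color and the color weights w_i.
    """
    alpha = 1.0 - np.exp(-sigma * delta)                          # per-segment opacity
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]     # accumulated transmittance T_i
    w = T * alpha                                                 # color weights w_i
    return (w[:, None] * color).sum(axis=0), w
```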
To improve the rendering performance, NeRF utilizes both a coarse and a fine neural network for scene reconstruction. During the inference process, NeRF first samples \(N_{c}\) points along each ray and feeds them to the coarse network to perform the forward propagation process. Based on the outputs of the coarse network and Equation (3), the color weight \(w_{i}\) of each sampled point is calculated. The color weights of the \(N_{c}\) sampled points are then normalized as \(\hat{w}_{i}=w_{i}/\sum^{N_{c}}_{j=1}w_{j}\) to estimate the distribution of the color weights along the entire ray. Subsequently, \(N_{f}\) points with larger color weights are sampled according to the estimated distribution, and the resulting \(N_{c}+N_{f}\) sampled points are jointly fed into the fine network for inference. This hierarchical sampling strategy effectively feeds more points that contribute significantly to the pixel color into the fine network, thereby enhancing the rendering performance of NeRF. However, this strategy also dramatically increases the computational overhead of NeRF. For instance, if \(N_{c}=N_{f}=64\) is selected, NeRF must perform forward propagation on about \(3\times 10^{7}\) input sampled points to render an image of size \(400\times 400\) (\(400\times 400\) rays, each with \(N_{c}\) coarse samples plus \(N_{c}+N_{f}\) fine samples). Additionally, dozens or even hundreds of images must be rendered to reconstruct a complete scene, which inevitably results in significant energy consumption. To address this challenge, this article proposes Spiking-NeRF, an energy-efficient neural rendering model that utilizes SNNs to implement neural rendering.
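A simplified sketch of this hierarchical sampling step (our own illustration; NeRF itself uses a piecewise-linear inverse CDF within the coarse bins, which is replaced here by plain bin selection for brevity):

```python
import numpy as np

def hierarchical_sample(h_coarse, w, n_fine, rng=None):
    """Draw N_f additional sample distances according to the normalized
    coarse color weights, then merge them with the coarse samples.

    h_coarse : (N_c,) distances of the coarse samples along the ray
    w        : (N_c,) color weights of the coarse samples
    """
    rng = rng or np.random.default_rng()
    w_hat = w / w.sum()                      # \hat{w}_i = w_i / sum_j w_j
    cdf = np.cumsum(w_hat)
    u = rng.random(n_fine)                   # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u)            # inverse-CDF: pick a coarse bin per sample
    h_fine = h_coarse[idx]
    return np.sort(np.concatenate([h_coarse, h_fine]))
```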