1 Introduction
The past two decades have witnessed the tremendous success of Artificial Neural Networks (ANNs) in various applications including computer vision [23], speech recognition [32], and natural language processing [20]. More recently, as an emerging and very promising direction in computer graphics, neural rendering successfully uses ANNs to learn scene representations and synthesize photorealistic images of a scene [48]. In particular, the seminal work of Neural Radiance Fields (NeRF) [30] uses volume rendering [29] to project scene representations into images and leads to an “explosion” of developments in the neural rendering field [49]. However, like other applications of ANNs [45], neural rendering has to process a huge number of model parameters and input data during the inference process, and thus incurs substantial energy consumption due to the large amount of computation, which poses a great challenge to the deployment of neural rendering on energy-constrained devices.
Many techniques have been proposed to improve the energy efficiency of neural networks. First, at the algorithm level, researchers have developed techniques such as pruning [14, 52], quantization [8], and knowledge distillation [19] to build light-weight ANN models in the training stage. Second, at the computing hardware level, customized neural network accelerators [7] have been developed and shown to achieve orders of magnitude higher energy efficiency than state-of-the-art Complementary Metal Oxide Semiconductor designs. Third, inspired by biological neurons [17], researchers devise energy-efficient Spiking Neural Networks (SNNs) as a new computing paradigm [18]. SNNs transfer information with a series of discrete binary spikes instead of the continuous activation values used in ANNs, which leads to more energy-efficient computation by substituting multiplication with addition in customized neuromorphic chips [1, 38]. SNNs have recently been applied to image classification tasks and have achieved high accuracy and high energy efficiency, even with deep network architectures (such as VGG and ResNet) and complex datasets (such as CIFAR-10 and ImageNet) [4, 9].
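To make the spike-based computation concrete, the following minimal sketch (our own illustration, not code from the cited works; names are hypothetical) simulates a single Integrate-and-Fire neuron driven by binary input spikes. Because the inputs are 0/1 spikes, the weighted sum reduces to adding up the weights of the inputs that fired, so no multiplications are needed.

```python
import numpy as np

def if_neuron(spike_train, weights, v_th=1.0):
    """Integrate-and-Fire neuron driven by binary input spikes.

    spike_train : (T, n_in) array of 0/1 input spikes over T time steps
    weights     : (n_in,)   synaptic weights
    Returns the binary output spike train of length T.
    """
    v, out = 0.0, []
    for spikes in spike_train:
        v += weights[spikes == 1].sum()   # accumulate: addition only, no multiplication
        if v >= v_th:                     # fire once the membrane potential crosses the threshold
            out.append(1)
            v -= v_th                     # one common reset choice: subtract the threshold
        else:
            out.append(0)
    return np.array(out)
```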
Due to the non-differentiable nature of the spikes in SNNs, the gradient-based back-propagation algorithms used in ANNs cannot be applied directly to train SNNs. Researchers have proposed three training methods for SNNs: the Spike Timing-Dependent Plasticity (STDP) rule-based learning method [42], the spike-based error back-propagation algorithm [25], and the ANN-to-SNN conversion method [5, 10, 21, 41]. The STDP rule updates the parameters of SNNs based on the order in which spikes arrive at the neurons [40]. The spike-based error back-propagation algorithm approximates the non-linear output function of the spiking neurons with a differentiable surrogate and then trains the SNNs using the gradient-based back-propagation algorithm [24, 33]. As illustrated in Figure 1, the conversion method converts the parameters of pre-trained ANNs to those of SNNs and incurs almost no additional computational overhead compared to the training process of ANNs.
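To illustrate the core idea behind the conversion (a simplified sketch under our own assumptions, not the exact procedure of any cited work), an IF layer driven for \(T\) time steps by the copied ANN weights produces firing rates that approximate the corresponding ReLU activations, provided the activations stay below the firing threshold:

```python
import numpy as np

def ann_layer(x, W, b):
    """ReLU layer of the source ANN."""
    return np.maximum(x @ W + b, 0.0)

def converted_if_layer(x_rate, W, b, T=256, v_th=1.0):
    """Simulate the converted IF layer for T time steps and return firing rates.

    The weights W and biases b are copied unchanged from the ANN; with rate-coded
    inputs, the firing rate approximates ReLU(x @ W + b) / v_th as long as the
    activation stays below v_th (hence the need for parameter normalization).
    """
    v = np.zeros_like(b)
    spike_count = np.zeros_like(b)
    for _ in range(T):
        v += x_rate @ W + b               # integrate the input current
        fired = v >= v_th
        spike_count += fired              # at most one spike per neuron per step
        v = np.where(fired, v - v_th, v)  # reset by subtraction
    return spike_count / T                # firing rate
```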
The aforementioned three training methods of SNNs have been successfully applied to image classification tasks. However, the STDP-based training method is only suitable for training shallow SNNs, and the accuracy of STDP-based SNNs on complex datasets such as CIFAR-10 is far lower than that of ANNs [11]. Besides, the spike-based error back-propagation algorithm introduces a surrogate function to perform the back-propagation process, which increases the complexity of the training process and limits its scalability to deep SNNs and complex datasets [24]. In contrast to these two methods, the ANN-to-SNN conversion method applies the scaled parameters of ANNs to SNNs [10], has excellent scalability, and yields the best-performing SNNs [4, 16].
In this article, we focus on developing energy-efficient spiking neural rendering using the ANN-to-SNN conversion method. Although SNNs have been studied thoroughly for image classification tasks, it is more challenging to implement SNNs for neural rendering. The conversion method establishes a proportional relationship between the firing rates of the spiking neurons in the converted SNN and the activation values of the analog neurons in the source ANN. In image classification tasks, the converted SNN only needs to ensure that the output value of the correct class is the largest, so the ANN-to-SNN conversion method can preserve a high classification accuracy. In neural rendering, however, the ANN outputs the numerical values of scene properties, and to guarantee the rendering performance, the converted SNN must output the same values as the source ANN. In addition, the network architecture used for neural rendering differs from ordinary classification-oriented ANNs: there are hidden layers whose inputs include extra inserted values in addition to the outputs of the previous layer (see Figure 2). Considering these two unique properties, the traditional ANN-to-SNN conversion method cannot be directly applied to yield SNNs with good rendering performance.
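As a minimal sketch of the second property (hypothetical names; this is our own simplification of the architecture in Figure 2), a hidden layer in the rendering MLP takes as input the previous layer's output concatenated with the re-injected encoding \(\gamma(\textbf{x})\):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def skip_input_layer(h_prev, gamma_x, W, b):
    """Hidden layer whose input is the previous layer's output concatenated
    with extra inserted values (here, the positional encoding gamma(x))."""
    return relu(np.concatenate([h_prev, gamma_x]) @ W + b)
```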
To overcome these two challenges, in this work, we first propose a precise output decoding scheme for SNNs through the mathematical analysis of the conversion process. We then customize the parameter normalization method for the special network architecture of neural rendering. Moreover, considering that the colors of the sampled points contribute variably to the color of the pixel, we apply an Early Termination Strategy (ETS) to reduce the energy consumption. Combining these three methods, we propose Spiking-NeRF, an energy-efficient neural rendering model based on SNNs.
The contributions of this article can be summarized as follows:
– Through analyzing the mathematical relationship between ANNs and converted SNNs, we propose a precise output decoding scheme for SNNs, since neural rendering requires precise numerical values of scene properties.
– We customize the parameter normalization method for the special network architecture of neural rendering and develop an ETS based on the color weights of the sampled points and the discrete property of spikes.
– We evaluate the performance of Spiking-NeRF on both realistic and synthetic scenes. Experimental results show that Spiking-NeRF achieves comparable rendering performance on all scenes and reduces energy consumption by up to \(2.27\times\) compared to the ANN-based NeRF.
The remainder of this article is organized as follows: In Section 2, we review the progress of research on SNNs based on the ANN-to-SNN conversion method. In Section 3, we first introduce the background of neural rendering techniques and then introduce the work of NeRF in detail. Afterward, in Section 4, we propose a precise output decoding scheme, a customized parameter normalization method, and an ETS to build the energy-efficient Spiking-NeRF. Finally, we present our experimental results in Section 5, followed by conclusions in Section 6.
2 Related Work
ANN-to-SNN conversion is a promising approach for training SNNs, and existing works on the ANN-to-SNN conversion method have yielded SNNs with high accuracy on image classification tasks. Cao et al. [5] first propose the ANN-to-SNN conversion method and apply it to image classification tasks. By mapping the parameters from ANNs to SNNs and replacing the Rectified Linear Unit (ReLU) activation with the Integrate-and-Fire (IF) model, the converted SNNs can achieve accuracy similar to ANNs while reducing the energy consumption by two orders of magnitude. To further reduce the accuracy loss, Diehl et al. [10] propose model-based and data-based parameter normalization methods, but their method results in intolerably low firing rates of the spiking neurons when applied to deep SNNs. Based on the model-based normalization method, Rueckauer et al. [41] present a robust parameter normalization method to improve the firing rates. Besides, they adopt the reset-by-subtraction method to reduce the conversion loss. Combining these two approaches, they successfully convert deep ANNs (such as VGG-16) to SNNs and report good performance of the converted SNNs on MNIST, CIFAR-10, and ImageNet. As deeper networks achieve better performance on image classification tasks, Han et al. [13] propose Residual Membrane Potential (RMP) neurons and convert ResNet-24 and ResNet-34 to SNNs. However, their method requires a large number of time steps for the converted SNNs to achieve accuracy comparable to ANNs. To reduce the number of time steps during the inference process of the converted SNNs, Deng and Gu [9] shift the initial membrane potential of every spiking neuron to increase its firing rate, and Ho and Chang [16] improve the data-based normalization method and propose Trainable Clipping Layers (TCL) to restrict the maximum activation value of each layer. Parameter normalization based on the trainable parameters in the TCL can greatly increase the firing rates of the spiking neurons and reduce the number of time steps required by the converted SNNs. Based on TCL, Bu et al. [4] optimize the initial potentials of spiking neurons, resulting in state-of-the-art classification accuracy. Furthermore, Liu et al. [27] propose an efficient conversion framework and achieve the fewest inference time steps for SNNs on image classification tasks.
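For concreteness, the following sketch outlines the data-based normalization idea in the spirit of [10, 41] (our own simplification; the function names and the percentile value are illustrative, not taken from those works). Each layer of the trained ReLU ANN is rescaled by a robust estimate of its maximum activation, recorded on a calibration set, so that the firing rates of the converted IF neurons stay in a usable range.

```python
import numpy as np

def normalize_parameters(weights, biases, layer_acts, percentile=99.9):
    """Data-based parameter normalization (sketch).

    weights, biases : per-layer parameters of the trained ReLU ANN
    layer_acts      : per-layer activations recorded on a calibration set
    Rescales layer l by lambda_{l-1} / lambda_l, where lambda_l is a robust
    (percentile-based) maximum of that layer's activations.
    """
    lam_prev = 1.0
    for l, acts in enumerate(layer_acts):
        lam = np.percentile(acts, percentile)    # robust per-layer maximum activation
        weights[l] = weights[l] * lam_prev / lam
        biases[l] = biases[l] / lam
        lam_prev = lam
    return weights, biases
```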
Moreover, some works have extended the applications of SNNs by exploiting the ANN-to-SNN conversion method. Kim et al. [21] introduce a channel-wise normalization method for convolutional neural networks and a signed spiking neuron with an imbalanced threshold for the Leaky-ReLU activation function, and then propose the first spike-based energy-efficient object detection model. Tan et al. [46] propose robust firing rates of spiking neurons to reduce the conversion loss and extend the application of SNNs to deep Q-networks. Despite these efforts, the applications of SNNs are still limited at present. In this article, we apply the ANN-to-SNN conversion method to develop an energy-efficient spiking neural rendering model.
3 Preliminary
Synthesizing photo-realistic images and videos is crucial in the field of computer graphics and has been a focus of research for decades. Traditional techniques generate synthetic images of scenes using rendering algorithms such as rasterization or ray tracing. However, these techniques typically require a significant amount of expensive manual effort to synthesize high-quality images [48]. With the remarkable development of artificial intelligence, the computer graphics community has tried to combine basic principles of computer graphics with machine learning to solve rendering problems, an approach known as neural rendering. The emerging neural rendering techniques use neural networks to learn the geometry, appearance, illumination, and other properties of a scene [49]. In comparison to traditional rendering techniques, neural rendering achieves state-of-the-art rendering performance using pre-trained neural network models and represents a leap forward toward the goal of synthesizing photo-realistic image and video content. In recent years, numerous studies have explored various ways of accomplishing rendering using learnable scene properties, contributing to tremendous progress in the field of neural rendering. Among them, the innovative work of NeRF [30] has made a breakthrough in novel view synthesis and led to a significant surge in developments of neural rendering.
Using a Multi-Layer Perceptron (MLP), NeRF represents a static scene as a continuous five-dimensional function [30]. The function \(F:(\textbf{x},\textbf{d})\rightarrow(\textbf{c},\sigma)\) regresses from a three-dimensional (3D) position coordinate \(\textbf{x}=(x,y,z)\) and a two-dimensional viewing direction \(\textbf{d}=(\theta,\phi)\) to an emitted color \(\textbf{c}=(r,g,b)\) and a volume density \(\sigma\).
Figure 2 illustrates an overview of NeRF’s scene representation and neural network architecture. A ray \(\textbf{r}(h)=\textbf{o}+h\textbf{d}\) is emitted from the camera’s center of projection \(\textbf{o}\) along the direction \(\textbf{d}\) and passes through one pixel on the image plane. When rendering the pixel, a set of points with distances \(\textbf{h}\) is sampled along the ray \(\textbf{r}(h)=\textbf{o}+h\textbf{d}\). For each point with \(h_{i}\in\textbf{h}\), its corresponding 3D position coordinates can be obtained by the camera ray \(\textbf{x}=\textbf{r}(h_{i})\). To enable the MLP to learn high-frequency details, NeRF separately preprocesses each of the three coordinate values in \(\textbf{x}\) and the two components of direction \(\textbf{d}\) with the following sinusoidal encoding function:
\[\gamma(p)=\big(\sin(2^{0}\pi p),\cos(2^{0}\pi p),\ldots,\sin(2^{L-1}\pi p),\cos(2^{L-1}\pi p)\big),\tag{1}\]
where \(L\) is an integer hyper-parameter. It is worth noting that in Figure 2, \(\gamma(\textbf{x})\) and \(\gamma(\textbf{d})\) are inserted into stage 2 and stage 3, respectively. This unique network architecture is widely adopted in the field of neural rendering to improve the learning performance [6, 30, 35]. To obtain the color of a certain pixel, the estimated colors and densities of the sampled points along the ray are used to approximate the volume rendering integral by numerical quadrature [29]:
\[\hat{C}(\textbf{r})=\sum_{i=1}^{N}w_{i}\textbf{c}_{i},\tag{2}\]
\[w_{i}=T_{i}\big(1-\exp(-\sigma_{i}\delta_{i})\big),\qquad T_{i}=\exp\Big(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\Big),\tag{3}\]
where \(\hat{C}(\textbf{r})\) is the estimated pixel color, \(N\) is the number of sampled points along the ray, \(\textbf{c}_{i}\) and \(\sigma_{i}\) are the estimated color and density of the \(i\)th point, \(\delta_{i}=h_{i+1}-h_{i}\) is the distance between two adjacent sampled points along the ray, and \(w_{i}\) evaluates the importance of \(\textbf{c}_{i}\) to the pixel.
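As a concrete reading of Equations (2) and (3), the following sketch (our own illustration; the array names are hypothetical) composites the estimated densities and colors of the samples along one ray into a pixel color:

```python
import numpy as np

def composite_ray(sigma, color, delta):
    """Numerical quadrature of the volume rendering integral along one ray.

    sigma : (N,)   estimated densities of the N sampled points
    color : (N, 3) estimated RGB colors of the N sampled points
    delta : (N,)   distances between adjacent sampled points
    Returns the estimated pixel color and the color weights w_i.
    """
    alpha = 1.0 - np.exp(-sigma * delta)                          # per-segment opacity
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]     # accumulated transmittance T_i
    w = T * alpha                                                 # color weights w_i
    return (w[:, None] * color).sum(axis=0), w
```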
To improve the rendering performance, NeRF utilizes both a coarse and a fine neural network for scene reconstruction. During the inference process, NeRF first samples \(N_{c}\) points along each ray and feeds them to the coarse network to perform the forward propagation process. Based on the outputs of the coarse network and Equation (3), the color weight \(w_{i}\) of each sampled point is calculated. The color weights of the \(N_{c}\) sampled points are then normalized as \(\hat{w}_{i}=w_{i}/\sum^{N_{c}}_{j=1}w_{j}\) to estimate the distribution of the color weights along the entire ray. Subsequently, \(N_{f}\) points with larger color weights are sampled according to the estimated distribution, and the resulting \(N_{c}+N_{f}\) sampled points are jointly fed into the fine network for inference. This hierarchical sampling strategy effectively feeds more points that contribute significantly to the pixel color into the fine network, thereby enhancing the rendering performance of NeRF. However, this strategy also dramatically increases the computational overhead of NeRF. For instance, if \(N_{c}=N_{f}=64\) is selected, NeRF must perform forward propagation on about \(3\times 10^{7}\) input sampled points to render an image of size \(400\times 400\) (\(400\times 400\) rays, each with \(N_{c}\) coarse samples plus \(N_{c}+N_{f}\) fine samples). Additionally, dozens or even hundreds of images must be rendered to reconstruct a complete scene, which inevitably results in significant energy consumption. To address this challenge, this article proposes Spiking-NeRF, an energy-efficient neural rendering model that utilizes SNNs to implement neural rendering.
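A simplified sketch of this hierarchical sampling step (our own illustration; NeRF itself uses a piecewise-linear inverse CDF within the coarse bins, which is replaced here by plain bin selection for brevity):

```python
import numpy as np

def hierarchical_sample(h_coarse, w, n_fine, rng=None):
    """Draw N_f additional sample distances according to the normalized
    coarse color weights, then merge them with the coarse samples.

    h_coarse : (N_c,) distances of the coarse samples along the ray
    w        : (N_c,) color weights of the coarse samples
    """
    rng = rng or np.random.default_rng()
    w_hat = w / w.sum()                      # \hat{w}_i = w_i / sum_j w_j
    cdf = np.cumsum(w_hat)
    u = rng.random(n_fine)                   # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u)            # inverse-CDF: pick a coarse bin per sample
    h_fine = h_coarse[idx]
    return np.sort(np.concatenate([h_coarse, h_fine]))
```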