Article

A Step Towards Neuroplasticity: Capsule Networks with Self-Building Skip Connections

by Nikolai A. K. Steur and Friedhelm Schwenker *
Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Baden-Württemberg, Germany
* Author to whom correspondence should be addressed.
Submission received: 2 November 2024 / Revised: 15 December 2024 / Accepted: 18 December 2024 / Published: 24 December 2024
Figure 1. Visualization of the degradation problem in relation to the network depth based on (a) plain networks and (b) CapsNets with distinct activation functions, using the MNIST classification dataset. A plain network contains 32 neurons per layer, while a CapsNet consists of eight capsules with four neurons each. Network depth is stated as the number of intermediate blocks, including an introducing convolutional layer and a closing classification head. Each block consists of a fully connected layer followed by BN and the application of the activation function. In the case of CapsNets, signal flow between consecutive capsule layers is controlled by a specific routing procedure. The final loss (as cross-entropy) and accuracy, both based on the training set, are reported as an average over five runs with random network initialization. Each run comprises $2n$ training epochs, where $n$ equals the number of intermediate blocks.

Figure 2. Shortcut and skip connections (highlighted in red) in residual learning. (a) Original definition of a shortcut connection with projection matrix based on [5]. (b) Pattern for self-building skip connections in a CapsNet with SR and an activation function with a suitable linear interval.

Figure 3. Replacement of the static signal propagation in a CapsNet with a nonlinear routing procedure to form parametric information flow gates. (a) Basic pattern with a single routing gate. (b) Exemplary skip path (highlighted in red) crossing multiple layers and routing gates.

Figure 4. Customizing the initialization scheme for BN$(\beta, \gamma)$ allows the training of deeper networks by constraining the input distribution (in blue) of an activation function to be positioned in a mostly linear section. Exemplary initializations are shown for (a) sigmoid with BN$(0, 0.5)$, and (b) Leaky ReLU with BN$(-2, 1)$.

Figure 5. Parametric versions of ReLU with (a) single and (b) four degree(s) of freedom using an exemplary parameter range of $\rho_i \in [0, 1]$. (a) PReLU learns a nonlinearity specification $\rho$ for input values below zero and directly passes signals above zero. (b) SReLU applies the identity function within the interval $[t_{\min}, t_{\max}]$, and learns two individual nonlinearity specifications $\rho_1$ and $\rho_2$ outside of the centered interval.

Figure 6. (a) Generic model architecture with (b) one-layer Feature Extractor (FE), a classification head with $z$ classes and (c) intermediate blocks consisting of fully-connected layers. Dense blocks are specified via capsules or scalar neurons (plain) for the fully-connected units.

Figure 7. First two rows: Mean (first row) and best (second row) training loss progressions over five runs for each BN$(\beta, \gamma)$ initialization scheme per activation function. Last two rows: Mean deviation per BN layer of the final $\beta_i$ and $\gamma_i$ parameters from their initial values, using the identified superior BN initialization scheme for each activation function. Per plot, the model parameter deviations are shown for the best run and as an average over all five runs.

Figure 8. (a) Mean and (b) best training loss development over five runs using 90 intermediate blocks, AMSGrad and the superior BN$(\beta, \gamma)$ initialization strategy per activation function. Both subfigures provide an inset as a zoom-in for tight regions.

Figure 9. (a) Percentage gain in accuracy for the remaining epochs measured in relation to the final accuracy. Accuracy gains below one percentage point (red line) are gray. (b) Mean training loss development over five runs for varying network depths using ReLU, AMSGrad and the BN$(2, 1)$ initialization strategy.

Figure 10. Each row summarizes the experiment results of the parametric activation functions PReLU, SReLU/D-PReLU and APLU, respectively. First two columns: Mean (first column) and best (second column) training loss development over five runs using AMSGrad and varying initialization strategies for BN$(\beta, \gamma)$ and the activation function parameters. Insets are provided as zoom-ins for tight regions. Second two columns: Mean parameter deviations per layer from their initial values with respect to BN and the parametric activation function. In each case, the identified superior configuration strategy is used. For APLU the configuration with $s = 1$ is preferred against $s = 5$ for the benefit of proper visualization. Last column: Mean training loss progress over five runs for varying network depths using the identified superior configuration strategy.

Figure 11. Mean training loss development over five runs using CapsNets with a depth of 500 intermediate blocks and varying routing procedures, activation functions and BN initializations.

Figure 12. (a) Mean training (solid) and validation (dotted) loss progressions over five runs for the pure capsule-driven architecture. (b) Mean bias parameter deviation of GR after training from their initial value of $-3$.

Figure A1. Row-wise 20 random samples for each dataset in Table A1.

Figure A2. Final training loss (left) and training accuracy (right) averaged over five runs using CapsNets with increasing network depth and distinct configurations.

Figure A3. Convolutional capsule unit with GR between two layers of identical dimensionality and image downsampling using grouped convolutions.

Abstract

Background: Integrating nonlinear behavior into the architecture of artificial neural networks is regarded as an essential requirement for constituting their effectual learning capacity when solving complex tasks. This claim seems to be true for moderate-sized networks, i.e., with a lower double-digit number of layers. However, going deeper with neural networks regularly leads to destructive tendencies of gradual performance degeneration during training. To circumvent this degradation problem, the prominent neural architectures Residual Network and Highway Network establish skip connections with additive identity mappings between layers. Methods: In this work, we unify the mechanics of both architectures into Capsule Networks (CapsNets) by showing their inherent ability to learn skip connections. As a necessary precondition, we introduce the concept of Adaptive Nonlinearity Gates (ANGs), which dynamically steer and limit the usage of nonlinear processing. We propose practical methods for the realization of ANGs, including biased batch normalization, the Doubly-Parametric ReLU (D-PReLU) activation function, and Gated Routing (GR) dedicated to extremely deep CapsNets. Results: Our comprehensive empirical study using MNIST substantiates the effectiveness of the developed methods and delivers valuable insights for the training of very deep nets of any kind. The final experiments on Fashion-MNIST and SVHN demonstrate the potential of pure capsule-driven networks with GR.

1. Introduction

The depth of neural architectures is known as a crucial factor for the ease of learning complex tasks by enriching the representational capacity of neural systems with further abstraction levels. Hence, the exceptional role of depth is particularly manifested in the term deep learning itself. Despite the great potential of going deeper with neural networks, one encounters various difficulties in practice that negatively influence or even completely stall network optimization during the course of training. A de facto standard strategy against such difficulties is Batch Normalization (BN) [1], which normalizes feature distributions in order to prevent exponentially growing or decreasing gradients caused by an enlarged number of stacked layers. Yet, although BN provides numerical stability in the optimization process of deeper neural networks, the resulting model performance gradually degenerates with an increase in network depth, which is commonly referred to as the degradation problem [2,3,4,5,6,7].
Figure 1 and Table 1 serve as first empirical evidence of the degradation problem for conventional neural networks and Capsule Networks (CapsNets) [8,9], based on the simple MNIST [10,11] classification dataset of handwritten digits. Specifically, CapsNets organize their network layers in capsules [12] consisting of equal-sized neuron groups and conduct a predefined routing procedure to steer signal propagation from lower-layer capsules to higher-layer ones [8,9]. CapsNets have previously been applied to various tasks such as image recognition [8,9], text classification [13,14,15], visual smoke and fire detection [16], or medical research [17]. Performance degradation occurs with a gain in network depth independent of the neural architecture and across a variety of activation functions (i.e., Rectified Linear Unit (ReLU) [18,19], Leaky ReLU [20], Exponential Linear Unit (ELU) [21], the hyperbolic tangent and the logistic sigmoid). To isolate potential overfitting effects from our consideration of the degradation problem, we report model accuracy at this point solely on the training set. We will address the generalization ability of a pure capsule-driven architecture in the final experiment. However, the concrete activation function appears to be a critical hyperparameter for governing the intensity of degenerative effects. The same observation holds for the selected routing procedure in CapsNets (including Static Routing (SR) [13], k-Means Routing (k-MR) [14], Dynamic Routing (DR) [8] and the squash [8] nonlinearity). In accordance with previous work [22,23], the degradation problem also arises for the linear case where the activation function equals the identity function.
An intuition of this circumstance can be conveyed with the recursive expression of a network block with linear activation using BN:
$$\bar{h}_i(\mathbf{x}; l) := \mathrm{BN}\Big(\textstyle\sum_j w_{ij}^{(l)}\, \bar{h}_j(\mathbf{x}; l-1)\Big), \qquad \bar{h}_i(\mathbf{x}; 1) := \mathrm{BN}\big(\mathbf{w}_i^{T}\mathbf{x}\big). \tag{1}$$
Since BN applies a static nonlinear transformation per batch, the neural activity $\bar{h}_i(\mathbf{x}; l)$ of the $i$-th neuron in the $l$-th layer for an input sample $\mathbf{x} \in \mathbb{R}^d$ can be straightforwardly interpreted as the iterative weighted mean over all previous layers. Assuming a random weight vector initialization, the linear neural network degenerates to a single path of repeated linear mixtures with nonlinear distortions induced by BN. We denominate the destructive tendency to dilute highly informative neuron outputs through iterative summation over all available signals as information diffusion. We further ascribe information diffusion to the inability of a neural network to select a subset of relevant input signals per neuron. This problem formulation explicitly includes nonlinear effects (e.g., from BN and activation functions), as opposed to theoretical analyses [22,23]. Figure 1 demonstrates that specific activation functions can mitigate or amplify information diffusion up to a certain extent of network depth.
The most promising technique to effectively overcome the degradation problem is the implementation of skip connections that bypass outputs from previous layers as additive components to the input of deeper layers in the network hierarchy. The idea of skip connections essentially underlies two distinct types of network architectures, namely Residual Networks [5,6] and Highway Networks [3,4]. In residual networks, skip connections are realized as static shortcuts between predefined blocks comprising several stacked network layers [5,6], whereas highway networks dynamically create skip connections via parametric neural gates between regular network layers [3,4]. The rationale behind both types is the facilitation of identity mappings which limit the effective depth of a neural network to the representational capacity demanded by a specific application task. At the level of network layers, skip connections constitute implicit paths of neural information processing that increase exponentially with a gain in network depth and behave like a network ensemble [24]. It has been observed that the presence of skip connections at training time attenuates degradation effects on the performance of single-path feed-forward networks during inference [7]. A current explanation for this beneficial behavior is the solving of the near-singularity problem, where skip connections prevent neural computations from collapsing into singular feature spaces [22,23]. Hence, it is not surprising that other techniques even mimic the functionality of skip connections using specific initialization strategies [25] or learnable scaling parameters in combination with kernel updates [26].
Interestingly, the inherent structure of CapsNets in combination with a suitable routing procedure provides the potential to learn parametric skip connections during network optimization. Encouraging CapsNets to establish self-building skip connections leads to one implicit path of neural information processing for each capsule in a layer, while classical residual and highway networks are limited to a single implicit path per layer. This is in line with the next evolutionary stage of residual networks, namely ResNeXt [27], which aggregates residual transformations over multiple paths within predefined blocks and demonstrated superior performance over its single-path predecessor. Again, CapsNets embody the more generic variant of ResNeXt without the need for explicit block design or static bottleneck structures between residual parts. As a natural consequence, CapsNets with self-building skip connections take a step towards structural neuroplasticity, following the desired credo "one architecture fits it all". The central contribution of our paper is threefold:
  • Firstly, we theoretically unify the skip connection mechanisms of residual networks and highway networks into CapsNets by showing their functional equivalence under certain conditions. Moreover, we identify the necessary preconditions to facilitate the shaping of self-building skip connections within CapsNets. Our theoretical findings provide direct implications for the design of arbitrary neural architectures by demystifying specific properties of their dynamics.
  • Secondly, we introduce the concept of Adaptive Nonlinearity Gates by means of practical methods that fulfill the necessary preconditions and help to stabilize the training of very deep networks in general. These methods comprise straightforward strategies like biased BN, parametric activation functions and adaptive signal propagation. In particular, we present the novel Doubly-Parametric ReLU activation function and design the Gated Routing procedure dedicated to the training of enormously deep CapsNets.
  • Thirdly, we supply a comprehensive experimental study to substantiate our theoretical findings and the proposed methods. The empirical results reveal valuable insights for the optimization of very deep neural networks of any kind. Specifically, our strategies prove to effectively mitigate the degradation problem.
The paper is structured according to our contributions: Section 2 starts with a theoretical derivation of an equivalence relationship between the computations in residual networks, highway networks and CapsNets. Section 3 comprises the definition of practical methods to bring very deep neural networks in a desirable condition to promote stable optimization, and presents a novel routing algorithm between layers in a CapsNet to resist the degradation problem. In Section 4, we systematically conduct experiments to validate our theoretical insights and empirically prove the effectiveness of our devised approaches. The study design is driven by the formulation and subsequent answering of dedicated research questions. Section 5 draws the important implications of our work in the context of related topics. Section 6 concludes with the major findings of the paper.

2. Theory

2.1. Skip Connection Pattern

In Residual Learning [5,6], a network layer is formulated as a residual unit that conducts the mapping
$$H_G^{(R)}(\mathbf{x}) = \phi\big(F(\mathbf{x}) + G(\mathbf{x})\big) \tag{2}$$
where $F(\mathbf{x})$ denotes a residual function given as the output of a single or several stacked conventional layers, $\mathbf{x}$ denotes the input signal to the residual unit, and $\phi$ equals the activation function. The mapping $G(\mathbf{x})$ usually corresponds to the identity function to support stable gradient flow over arbitrary paths of neural information processing [6]. Introducing a projection matrix $W$ is a valid strategy to ensure the same dimensionality in the summation for the general case [5], which specifies Equation (2) as
$$H^{(R)}(\mathbf{x}) = \phi\big(F(\mathbf{x}) + W\mathbf{x}\big). \tag{3}$$
The computational graph for Equation (3) is visualized in subfigure (a) of Figure 2.
Subfigure (b) transfers the residual mechanism from (a) to CapsNets by presenting an atomic pattern for a self-building skip connection between consecutive capsule layers. We prefer the term skip connection [4,6] over shortcut connection [5] since the CapsNet rather passes signals along silently than abbreviating processing paths. The CapsNet snippet shows three fully-connected layers with $\{1, 2, 1\}$ capsules. Specifically, we declare the left path as a residual function $F(\mathbf{x})$ and the right path, emphasized in red, as a skip connection. For simplicity, bias weights are completely omitted. The exemplary CapsNet utilizes Static Routing (SR) [13] as inter-layer dynamics, leading to the computation rule for a capsule $\mathbf{c}_i$ in the $(l+1)$-th layer of
$$\mathbf{c}_i^{(l+1)} = \phi\Big(\textstyle\sum_j W_{ij}^{(l+1)}\, \mathbf{c}_j^{(l)}\Big) \tag{4}$$
where $\phi$ denotes the used nonlinearity and $W_{ij}^{(l+1)}$ represents an individual transformation matrix between a higher-layer capsule and a lower-layer one. As its name indicates, SR simply propagates all signals from a previous layer to each capsule in the subsequent layer. Contrary to the classic description of SR [13], we generalize the nonlinearity $\phi$ instead of applying the squash [8] function. In the course of this section, we will explain this design decision. First, we want to harmonize Equation (4) with Equation (3). Denominating the first capsule in subfigure (b) of Figure 2 as input vector $\mathbf{x}$ and the final capsule as mapping $H(\mathbf{x})$, we obtain the computation rule
$$H^{(C)}(\mathbf{x}) = \phi\Big(W_F^{(l)}\,\phi\big(W_F^{(l-1)}\mathbf{x}\big) + W_S^{(l)}\,\phi\big(W_S^{(l-1)}\mathbf{x}\big)\Big) = \phi\Big(F(\mathbf{x}) + W_S^{(l)}\,\phi\big(W_S^{(l-1)}\mathbf{x}\big)\Big) \tag{5}$$
where the subscripts $F$ and $S$ of the weight matrices signal affiliation with the residual function or the skip connection, respectively. The superscript $(C)$ specifies $H(\mathbf{x})$ in the context of CapsNets. If we now assume the following conditions hold:
$$(1)\ \ \forall x \in [I_1, I_2]:\ \phi(x) \approx \lambda x, \qquad (2)\ \ \big(W_S^{(l-1)}\mathbf{x}\big)_i \in [I_1, I_2], \tag{6}$$
meaning that the activation function $\phi$ contains a suitable interval $[I_1, I_2]$ with exact or approximately linear progression, and the first transformation matrix $W_S^{(l-1)}$ of the skip connection projects the components of $\mathbf{x}$ onto the interval $[I_1, I_2]$. From these assumptions, we can deduce an equivalence relationship between Equations (5) and (3) by
$$H^{(C)}(\mathbf{x}) = \phi\big(F(\mathbf{x}) + \lambda \cdot W_S^{(l)} W_S^{(l-1)}\mathbf{x}\big) = \phi\big(F(\mathbf{x}) + W_S\mathbf{x}\big) \equiv H^{(R)}(\mathbf{x}) \tag{7}$$
where $W_S$ subsumes both weight matrices from the skip connection and the scaling factor $\lambda \in \mathbb{R}\setminus\{0\}$. This leads to an equivalent version of a regular shortcut connection with the projection matrix, as displayed in Equation (3). It is now clear why the nonlinearity $\phi$ in the CapsNet needs to be abstracted to an activation function that satisfies the conditions in Equation (6).
It is noteworthy that the use of a projection matrix weakens the intention behind residual learning to facilitate identity mappings. Moreover, the experiments in [6] suggest a clear advance of exact identity mappings against linear approximations. Nevertheless, we will resolve this apparent contradiction by systematically identifying the necessary preconditions to avert obstacles in approaching identity mappings within CapsNets during network optimization. Note that in the primary paper [5] about residual learning the additional parameter investment in projection matrices improved model performance for moderate-sized residual networks (≈30 units), while successive work [6] reported degenerative effects when linearly manipulating skip connection signals in enlarged networks (≈100 units). This situation indicates conceptual flaws in conventional network design which amplify with an increase in depth.

2.2. Multi-Layered Skip Paths

Highway Networks [3,4] integrate parametric neural gates in their architecture to establish direct information flow across multiple consecutive layers. Contrary to residual learning, highway networks, implicitly rather than explicitly, steer information flows in the direction of the vertical depth in a neural network [3,4]:
$$H^{(H)}(\mathbf{x}) = F(\mathbf{x}) \odot T(\mathbf{x}) + \mathbf{x} \odot C(\mathbf{x}) \tag{8}$$
where the vector $\mathbf{x}$ means the input to a highway layer and the operator $\odot$ denotes the element-wise product. $F(\mathbf{x})$, $T(\mathbf{x})$ and $C(\mathbf{x})$ are arbitrarily parameterized nonlinear mappings, e.g., defined as regular network layers, with the same output dimensionality. In particular, the mappings $T(\mathbf{x})$ and $C(\mathbf{x})$ implement the transform and carry gates of a highway layer [3,4]. Highway networks can be simplified by constraining their layer-wise gates over a probability distribution, i.e., $C(\mathbf{x}) = \mathbf{1} - T(\mathbf{x})$, which reduces Equation (8) to contain a single neural gate [3,4].
Interestingly, CapsNet’s prominent inter-layer communication scheme routing-by-agreement [8,9] allows a similar probability assignment to grouped input signals through a variety of iterative clustering procedures (cf. [15]). An important property of routing dynamics is the individual processing of all capsules from a preceding layer to determine the state of a capsule from the consecutive layer [8,9,13,14,15]. We generalize this routing procedure with probability assignments to the mapping
$$\psi: \{\mathbf{x}_i \in \mathbb{R}^d\}_m \mapsto \textstyle\sum_i \boldsymbol{\alpha}_i \odot \mathbf{x}_i, \qquad A \cdot \mathbf{1}_{m\times 1} = \mathbf{1}_{d\times 1}, \tag{9}$$
where a routing function $\psi$ receives as input a collection of $m$ vectors $\mathbf{x}_i$ of the same dimensionality and outputs a weighted sum over all incoming vectors. The matrix $A$ comprises the weight vectors $\boldsymbol{\alpha}_i$ column-wise and $\mathbf{1}$ means the one vector with the dimensionality specified in each subscript. Note that the sum of the elements in a row of matrix $A$ must be one to constitute a valid probability distribution per feature dimension. The determination of the weight vectors $\boldsymbol{\alpha}_i$ is usually subject to a nonlinear process defined as a routing-by-agreement implementation [8,9,14,15]. Although routing-by-agreement procedures commonly apply their weighting of an input vector as element-wise multiplication with a scalar value [8,14], we prefer the more general definition in Equation (9) with a single weight per vector element. Subfigure (a) in Figure 3 integrates a nonlinear routing gate before applying the activation function of the final capsule. If we again assume that elements of an input vector $\mathbf{x}$ are continuously projected onto a linear interval of the used activation functions per layer, we compute the representation vector of the last capsule as
$$H^{(C)}(\mathbf{x}) = \phi\big(\psi(F(\mathbf{x}),\, W_S\mathbf{x})\big) = \phi\big[\boldsymbol{\alpha} \odot F(\mathbf{x}) + (\mathbf{1} - \boldsymbol{\alpha}) \odot W_S\mathbf{x}\big] \tag{10}$$
where $\boldsymbol{\alpha}$ contains the element-wise weights determined by the routing gate, and $\mathbf{1}$ means the one vector with the same dimensionality as $\boldsymbol{\alpha}$. By expanding the general mapping of a routing gate $\psi$ from Equation (9), we reveal a strong coherence between Equation (10) and the definition of a highway layer in Equation (8). The key difference between both formulae is the application of the final activation function $\phi$, which could, in turn, perform a linear projection for a subset of or all the input values. In particular, $\phi$ could be chosen as identity mapping without loss of model expressiveness if a suitable nonlinear routing procedure is utilized and enough input capsules are available. It is noteworthy that Equation (10) allows for an instance-based equal weighting of the residual part and the projected input vector as
$$H^{(C)}(\mathbf{x}) = \phi\Big(\tfrac{1}{2}\big(F(\mathbf{x}) + W_S\mathbf{x}\big)\Big), \tag{11}$$
reducing to an equivalent formulation of residual learning in Equation (3) except for a simple scaling factor proportional to the number of incoming capsule signals. We conclude that following our defined preconditions, a CapsNet can unify the mechanisms in residual learning and highway networks, according to Equation (10):
$$H^{(C)} \equiv H^{(H)} \equiv H^{(R)}. \tag{12}$$
Previous work [24,25] already characterized highway layers as equivalent to residual units with parametric scaling, but our capsule-based formulation generalizes to an arbitrary mixture of residual and skipped signals, as expressed in Equation (13). Subfigure (b) in Figure 3 visualizes a more extensive network with three layers consisting of three capsules each and with routing between all consecutive layers. Moreover, the displayed network shows an exemplary skip path (in red) which could be active in a trained model for a certain type of input data. In this example, the capsule representation vector $\mathbf{x}_3$ is projected first. Then, the routing gate completely opens the flow of $\mathbf{x}_3$ to the second capsule in the next layer. Finally, each capsule in the last layer computes a weighted sum over its projected lower-layer capsule states $F_1(X)$, $F_3(X)$ and $F_2(X) = W_S\mathbf{x}_3$, and eventually applies a nonlinearity $\phi$ on the resulting vector.

2.3. Horizontal Network Scaling

The capsule representing the mapping $H_i(X)$ consumes the $m = p + q$ capsules from the previous layer in a scenario of two consecutive hidden layers:
$$H_i^{(C)}(X) = \phi\Big(\psi\big(\{F_j(X)\}_p,\ \{W_k\mathbf{x}_k\}_q\big)\Big). \tag{13}$$
Specifically, the capsules from the first hidden layer can correspond to residual functions $F_j(X)$ or projected input vectors $W_k\mathbf{x}_k$. This two-layer pattern can arise several times in a CapsNet, whereby the parametric gating mechanism $\psi$ creates arbitrarily long skip paths across the network depth (if the necessary preconditions are supported). This innate ability substantially differs from the predefined blocks in ResNeXt with static path structure.
An interesting interpretation is that each capsule can be viewed as a conventional network layer with scalar-valued neurons, where the parametric routing between adjacent capsule layers steers signal propagation. In that sense, a CapsNet can modulate its active network architecture on a per-instance basis to act as a parametric network ensemble. Although a CapsNet requires solely two capsules per layer to realize skip connections, we expect beneficial effects on the stability of training very deep networks and the capability of representation learning when enriching the horizontal network dimension with further capsules.
In general, a residual function $F(X)$ can take arbitrary forms. Evidently, a multi-layered CapsNet with at least two capsules per layer is able to compose $(i - z)$ previous nonlinear mappings to
$$F_i(X) = F_{i-1} \circ F_{i-2} \circ \cdots \circ F_z(X) \tag{14}$$
by approximately routing single signals from the preceding layer to the respective capsules from the subsequent layer. This means a strong advantage compared to classical residual learning, where the nonlinear mappings of the residual part represent predefined blocks of stacked layers [5,6]. Of course, CapsNets also allow the formulation of residual functions based on the weighted summation of multiple nonlinear mappings from previous capsule layers, as stated in Equation (13).
Apart from the strict definition of CapsNets, residual functions may correspond to other types of layers (e.g., convolutional layers) or even constitute entire subnetworks as long as output dimensionalities are suitable and gradients can still flow unhindered. But such kinds of residual functions are out of the scope of this work and remain open for future research.

3. Methods

3.1. Adaptive Nonlinearity Gates

An essential prerequisite for the feasibility of self-building skip connections is the ability of an activation function $\phi$ to switch between linear and nonlinear processing on demand. In this section, we present two distinct strategies for involving Adaptive Nonlinearity Gates (ANGs) within neural networks that control and restrict the use of nonlinear behavior originating from activation functions. The looks linear initialization of [25] is strongly related to our idea of ANGs; however, its derivation is narrowed to adequate gradient flow, whereas ANGs manipulate network dynamics beyond an effectual initialization.

3.1.1. Biased Batch Normalization

BN transforms the activation distribution $\mathcal{D}[x_i]$ for each feature $x_i$ from an emitting linear layer to have zero mean and unit variance, followed by the introduction of an auxiliary fully connected layer [1]:
$$y_i = \phi\left(\gamma_i\, \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_i\right), \tag{15}$$
where $x_i$ and $y_i$ equal the $i$-th input and the $i$-th output feature of a BN layer, respectively. Moreover, $\mu_i = \mathbb{E}[x_i]$ and $\sigma_i^2 = \mathrm{Var}[x_i]$ correspond to the mean and variance of the $i$-th feature dimension, and $\epsilon$ serves as a smoothing term for securing numerical stability. The involved parameters $\beta_i$ and $\gamma_i$ constitute learnable weights per feature dimension to rescale and position the normalized distribution for maintaining network capability [1]. Note that the simplification to entirely omit bias terms in our derivation of self-building skip connections is coherent with Equation (15), since the bias of a linear capsule layer preceding a BN layer is directly eliminated through the centering of each feature dimension. In that sense, BN integrates into Equation (13) after the routing procedure $\psi$ and before the application of the activation function $\phi$. In particular, BN and skip connections are complementary in preserving structural neural activity patterns within randomly initialized very deep neural networks [25]. In our concept of self-building skip connections with CapsNets, skipped signals are also normalized with BN, which prevents exploding or vanishing gradients caused by scaling factor accumulation, as mentioned in [6].
Besides BN’s key attribute to stabilize signal propagation at the time of network initialization, it especially controls the processing window of the subsequent activation function. This means a crucial impact on the feasibility of training very deep neural architectures where network degradation typically originates from a vast amount of stacked nonlinearities (cf. Figure 1 and Equation (1)). In addition, initializing a network towards linear processing suggests a positive influence on the stability and effectualness of gradients [25]. For instance, point-symmetric activation functions such as sigmoid or the hyperbolic tangent require initialization of the BN parameters similar to $\beta_i = 0$ and $\gamma_i = 0.5$ to locate the input distribution within their unsaturated, mostly linear sections. On the contrary, for asymmetric functions like ReLU and its variants, an initialization scheme that largely shifts the input distribution into a continuous interval is preferred, e.g., using $\beta_i = \pm 2$ and $\gamma_i = 1$. Both exemplary initialization schemes are visualized in Figure 4. Apart from distribution positioning, the scaling parameter $\gamma_i$ within BN has the important property of influencing the severity of feature discrimination through multiplicative pairwise point-distance expansion per feature dimension, i.e.,
$$\forall\, \tilde{x}_1, \tilde{x}_2 \in \gamma X + \beta:\quad \tilde{x}_1 - \tilde{x}_2 = \gamma\,(x_1 - x_2) \tag{16}$$
where $x_1, x_2 \in X$ and the random variable $X$ represents the incoming feature of an arbitrary dimension to its subsequent BN layer with current parameterization $\beta$ and $\gamma$.
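To illustrate how such a biased initialization can be realized with the library stack used later in Section 4, the following sketch (our own example, not the authors' reference implementation) builds a fully connected block whose BN layer is initialized according to the BN(2, 1) scheme discussed above for ReLU; the layer width of 32 units is taken from the analysis model, everything else is an assumption:

```python
# Minimal sketch of a "biased BN" block, assuming the BN(beta, gamma) = BN(2, 1)
# initialization scheme discussed above for ReLU.
from tensorflow import keras

def biased_bn_block(units: int = 32, beta_init: float = 2.0,
                    gamma_init: float = 1.0, activation: str = "relu"):
    """Fully connected layer -> BN biased towards a linear region -> activation."""
    return keras.Sequential([
        keras.layers.Dense(units, use_bias=False),
        keras.layers.BatchNormalization(
            beta_initializer=keras.initializers.Constant(beta_init),
            gamma_initializer=keras.initializers.Constant(gamma_init)),
        keras.layers.Activation(activation),
    ])
```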
We hypothesize that permitting a small overlap between the input distribution and the nonlinear regions of the utilized activation function may be helpful for network optimization since the network is still able to perceive some gradual changes in the use of a nonlinearity without the risk of strong degradation effects. For the opposite case, with input distributions entirely positioned in regions without nonlinear behavior, we expect rather haphazard adjustments and slower overall convergence.
The use of BN as a generic extension to a broad range of activation functions satisfies our defined preconditions on self-building skip connections in Equation (6) if the applied activation function provides a suitable linear interval. In more detail, each properly initialized BN layer serves as the ANG per feature dimension by successively learning when to pass signals through and when to make use of nonlinear regions. It is noteworthy that particularly in the context of CapsNets the prominent squash [8] activation function represents a holistic nonlinear mapping which is incompatible with BN’s gating mechanism to fulfill the conditions in Equation (6). Nevertheless, the squash function seems to realize a semantic identity mapping in which feature encodings are sustained as vector orientations despite its nonlinear computation.

3.1.2. Parametric Activation Functions

Although BN’s parameterization already permits the implicit definition of ANGs in combination with partially linear activation functions, a more explicit way is the use of adaptive activation functions that directly specify the kind and degree of nonlinearity. Parametric ReLU (PReLU) [28] is a natural choice for an adaptive activation function since it abstracts the Leaky ReLU with a learnable parameter $\rho \in \mathbb{R}$ per feature dimension of the input vector $\mathbf{x}$, leading to
$$\mathrm{prelu}(x) = \begin{cases} x, & \text{if } x > 0 \\ \rho x, & \text{otherwise}. \end{cases} \tag{17}$$
Despite PReLU’s property of constituting a piecewise linear function, its learnable parameter ρ especially controls the sensitivity of weight adjustments. In addition, we adopt the S-shaped ReLU (SReLU) [29] activation function:
$$\mathrm{srelu}(x) = \begin{cases} \rho_1 x + (1 - \rho_1)\, t_{\min}, & \text{if } x < t_{\min} \\ \rho_2 x + (1 - \rho_2)\, t_{\max}, & \text{if } x > t_{\max} \\ x, & \text{otherwise}. \end{cases} \tag{18}$$
SReLU strengthens our idea of an ANG by assigning the identity function to the interval $[t_{\min}, t_{\max}]$ between two parametric nonlinearity specifications $\rho_i \in \mathbb{R}$. The nonlinear behavior again appears as the transition between the piecewise linear functions if the respective $\rho_i$ deviates from one. Both PReLU and SReLU are displayed in Figure 5 with an exemplary parameter range of $\rho_i \in [0, 1]$ for ease of visualization. With its variable interval thresholds, SReLU involves three more learnable parameters than PReLU. As an intermediate variant, we design the activation function Doubly-Parametric ReLU (D-PReLU),
$$\mathrm{d\text{-}prelu}(x) = \begin{cases} \rho_1 x, & \text{if } x > 0 \\ \rho_2 x, & \text{otherwise}, \end{cases} \tag{19}$$
with two learnable scaling parameters $\rho_i$ around zero. The unique characteristic of SReLU and D-PReLU is their fair treatment of the left and the right parts of an input distribution, as opposed to regular rectified units that are usually biased to favor positive input signals. Specifically, the function class of D-PReLU contains $\max(0, x)$ and $\min(0, x)$. The same holds for SReLU if the edge case $t_{\min}, t_{\max} \to 0$ emerges. A possible alternative to SReLU is the Concatenated Rectified Linear Unit (CReLU) [30], which also maintains positive and negative input signals (and can be initialized towards linear processing [25]), but at the price of doubling layer-wise parameters. Evidently, initializing PReLU’s $\rho$ parameter with one leads to the identity function [25]. The same situation holds for both nonlinearity specifications $\rho_i$ involved in SReLU and D-PReLU.
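A D-PReLU layer can be expressed in a few lines of Keras code; the following is our own illustrative sketch (not the authors' released code), with both slopes initialized to one so that the layer starts as the identity function in the spirit of an ANG:

```python
# Illustrative sketch of D-PReLU (Equation (19)) as a Keras layer; both slopes
# are initialized to 1, i.e., the layer starts as the identity function.
import tensorflow as tf
from tensorflow import keras

class DPReLU(keras.layers.Layer):
    """Doubly-Parametric ReLU: rho_1 * x for x > 0 and rho_2 * x otherwise."""

    def build(self, input_shape):
        # One learnable slope pair per feature dimension.
        self.rho_1 = self.add_weight(
            name="rho_1", shape=(input_shape[-1],), trainable=True,
            initializer=keras.initializers.Constant(1.0))
        self.rho_2 = self.add_weight(
            name="rho_2", shape=(input_shape[-1],), trainable=True,
            initializer=keras.initializers.Constant(1.0))

    def call(self, x):
        # relu(x) covers the positive part, -relu(-x) the negative part.
        return self.rho_1 * tf.nn.relu(x) - self.rho_2 * tf.nn.relu(-x)
```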
In contrast to parametric rectified functions, Adaptive Piecewise-Linear Units (APLUs) [31] can shape arbitrary continuous functions with
$$\mathrm{aplu}(x) = \max(0, x) + \sum_{i=1}^{s} \rho_i \max(0, -x + \xi_i) \tag{20}$$
depending on the number $s$ of additive piecewise-linear components, where $\rho_i$ and $\xi_i$ act as learnable parameters of the $i$-th function segment during network training. An APLU with an arbitrary number $s$ of additive components corresponds to the identity function for the parameter configuration $\rho_i = -1/s$ and $\xi_i = 0$, which allows its direct interpretation as an ANG.
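Under this formulation, an APLU layer and its identity-preserving initialization can be sketched as follows (again our own illustrative Keras code and our reading of Equation (20), not the authors' implementation):

```python
# Illustrative APLU layer with s hinge components, initialized to the identity
# function via rho_i = -1/s and xi_i = 0 (assumes the form of Equation (20)).
import tensorflow as tf
from tensorflow import keras

class APLU(keras.layers.Layer):
    def __init__(self, s: int = 1, **kwargs):
        super().__init__(**kwargs)
        self.s = s

    def build(self, input_shape):
        shape = (self.s, input_shape[-1])
        self.rho = self.add_weight(
            name="rho", shape=shape, trainable=True,
            initializer=keras.initializers.Constant(-1.0 / self.s))
        self.xi = self.add_weight(name="xi", shape=shape, trainable=True,
                                  initializer="zeros")

    def call(self, x):
        out = tf.nn.relu(x)
        for i in range(self.s):
            # Each component adds a learnable hinge rho_i * max(0, -x + xi_i).
            out = out + self.rho[i] * tf.nn.relu(-x + self.xi[i])
        return out
```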
Note that we prefer the use of piecewise linear activation functions as implementations of ANGs over other parametric nonlinearities such as the Parametric Exponential Linear Unit (PELU) [32] or the Bendable Linear Unit (BLU) [33] to keep nonlinear distortions on input distributions minimal while preserving selective capability. More precisely, parameter adjustments on PELU and BLU lead to global changes in the resulting activation function. Note that in [33], it is argued that BLUs could be resilient against the removal of skip connections in moderate-sized networks (with 40 layers) due to their ability to perform linear processing. We will experimentally demonstrate that the linearity property is required for self-building skip connections, but it is not sufficient.

3.2. Gated Routing with Self-Attention

In Algorithm 1, we propose a Gated Routing (GR) procedure that reflects our theoretical insights about obviating the degradation problem. More precisely, GR pursues the two goals of allowing identity mappings of capsule vectors and restricting signal propagation from lower-layer capsules to higher-layer ones. Due to our overarching objective to enable neuroplasticity following the credo "one architecture fits it all", we refrain from a predefined variation in neurons per capsule between layers and entirely leave the path finding of neural information flow to the optimization process. Therefore, we apply GR only on two consecutive layers with the same capsule dimensionality. In cases with differing dimensionalities between layers, GR reduces to SR. Another special characteristic of GR is the additional bypassing of the lower-layer capsule vectors $\mathbf{c}_j^{(l)}$ around the connection-specific transformation matrices $W_{ij}$. This proceeding offers the direct realization of neural gates as known from highway networks [3,4]. Specifically, GR reuses the transformed inputs $\tilde{\mathbf{c}}_j^{(l)}$ in combination with a learnable bias vector $\mathbf{b}_{ij}$ to constitute the degree of original signal preservation. This generally requires one auxiliary bias parameter per neuron, i.e., per capsule dimension, multiplied by the number of higher-layer capsules. A possible relaxation to this parameter investment is the use of solely one parameter per lower-layer capsule. Following [3,4], we also suggest a bias parameter initialization such as $-3$ for configuring a CapsNet to favor identity mappings at the beginning of the training course. The selection of a subset of lower-layer capsules is implemented with Self-Attention (SA) [34], so that each higher-layer capsule learns to ignore irrelevant input signals. We categorize our attention-based mechanism as SA due to the fusion of backpropagation information of all lower-layer capsules within the weight vector $\mathbf{z}_i$ and the use of already transformed capsule outputs $\tilde{\mathbf{c}}_j$. GR particularly conducts dot-product attention [35] using a learnable weight vector $\mathbf{z}_i$ per higher-layer capsule. Initializing the weight vector components as $z_k \sim \mathcal{N}(0, 1)$ leads to dot-product values with variance $v$. (Here, we imply that transformed capsule outputs $\tilde{\mathbf{c}}_j$ follow a distribution with zero mean and unit variance. We will ensure this property by applying BN on capsule outputs $\mathbf{c}_j$ and initializing transformation matrices $W_{ij}$ by sampling from $\mathcal{N}(0, 1)$.) For that reason, we use scaled dot-product attention [34], which we scale down by $1/\sqrt{v}$. As an option to make SA even more selective, we introduce the softmax temperature $\tau$ [36] as a model hyperparameter which we will keep constant during our experiments. A temperature value between zero and one sharpens the softmax distribution, whereas an increasing value above one approaches the uniform distribution (cf. [36]). SA can be deactivated for network architectures with a small number of capsules per layer, where distortions through iterative summation over lower-level capsules are negligible.
Algorithm 1 Gated Routing (GR) from all $n$ capsules $\mathbf{c}_j^{(l)}$ in the preceding layer to an individual capsule $\mathbf{c}_i^{(l+1)}$ in the next-higher layer. The symbols $\oplus$ and $\odot$ refer to element-wise addition and multiplication, respectively.
Require:
  • SA: bool — flag for activating self-attention;
  • $\tau = 1.0$ — softmax temperature with default value;
  • $\mathbf{z}_i \in \mathbb{R}^v$ — learnable weight vector for self-attention;
  • $\phi$ — activation function of capsule $\mathbf{c}_i^{(l+1)}$;
  • $\{\mathbf{c}_j^{(l)} \in \mathbb{R}^d\}_{j=1}^{n}$ — all capsules from the $l$-th layer;
  • $\{\tilde{\mathbf{c}}_j^{(l)} \in \mathbb{R}^v\}_{j=1}^{n}$ — transformed capsules, i.e., $\tilde{\mathbf{c}}_j^{(l)} = W_{ij}\mathbf{c}_j^{(l)}$;
  • $\{\mathbf{b}_{ij} \in \mathbb{R}^v\}_{j=1}^{n}$ — learnable bias vectors of the neural gate.
 1: if $d = v$ then
 2:   for all lower-layer capsules $\mathbf{c}_j^{(l)}$ do
 3:     # reuse transformed inputs for neural gating
 4:     $\boldsymbol{\alpha}_j \leftarrow \mathrm{sigmoid}(\tilde{\mathbf{c}}_j^{(l)} \oplus \mathbf{b}_{ij})$
 5:     $\tilde{\mathbf{c}}_j^{(l)} \leftarrow \boldsymbol{\alpha}_j \odot \tilde{\mathbf{c}}_j^{(l)} + (\mathbf{1} - \boldsymbol{\alpha}_j) \odot \mathbf{c}_j^{(l)}$
 6:     # restrict information flow with self-attention
 7:     if SA is activated then
 8:       $\tau \leftarrow \tau \cdot \sqrt{v}$
 9:       $\alpha_j \leftarrow \mathrm{softmax}(\mathbf{z}_i^{T}\tilde{\mathbf{c}}_j^{(l)} / \tau)$
10:       $\tilde{\mathbf{c}}_j^{(l)} \leftarrow \alpha_j \cdot \tilde{\mathbf{c}}_j^{(l)}$
11:     end if
12:   end for
13: end if
14: $\mathbf{c}_i^{(l+1)} \leftarrow \phi\big(\sum_{j=1}^{n} \tilde{\mathbf{c}}_j^{(l)}\big)$
15: return $\mathbf{c}_i^{(l+1)}$
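To make the data flow of Algorithm 1 concrete, the following TensorFlow sketch (our own illustration under the stated assumptions, not the authors' released code) implements the GR step for a single higher-layer capsule with $d = v$; the tensor shapes, the identity default for `phi` and the scaling convention are assumptions taken from the algorithm's inputs:

```python
# Illustrative TensorFlow sketch of one Gated Routing (GR) step (Algorithm 1)
# for a single higher-layer capsule, assuming d = v. `c` holds the n original
# lower-layer capsules, `c_tilde` the transformed capsules W_ij c_j, `b` the
# gate bias vectors (suggested initialization -3) and `z` the attention vector.
import tensorflow as tf

def gated_routing_single_capsule(c, c_tilde, b, z, phi=tf.identity,
                                 use_sa=True, tau=1.0):
    """c, c_tilde, b: tensors of shape (n, v); z: tensor of shape (v,)."""
    v = int(c_tilde.shape[-1])
    # Neural gate: blend transformed and original capsule vectors element-wise.
    alpha = tf.sigmoid(c_tilde + b)                        # (n, v)
    c_tilde = alpha * c_tilde + (1.0 - alpha) * c          # (n, v)
    if use_sa:
        # Scaled dot-product self-attention over the n lower-layer capsules.
        logits = tf.linalg.matvec(c_tilde, z) / (tau * tf.sqrt(float(v)))
        weights = tf.nn.softmax(logits)                    # (n,)
        c_tilde = weights[:, None] * c_tilde               # (n, v)
    # Higher-layer capsule: sum over the gated inputs, then the activation.
    return phi(tf.reduce_sum(c_tilde, axis=0))             # (v,)
```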

4. Results

4.1. Preliminaries

4.1.1. Global Setup

The implementation of this paper is realized in the programming language Python [37] using the machine learning library Keras [38] (including KerasCV [39] for image augmentation) with a TensorFlow (TF) [40] backend. If not otherwise stated, programming functionality is utilized with TF’s default parameters. Within the training process of the neural architectures in this paper, the gradient descent optimization algorithm Adam [41] is applied. Specifically, Adam uses a learning rate of $\eta = 0.001$ without weight decay. In addition, each batch contains 128 randomized samples during training. Each configuration per analysis is conducted for five runs to obtain statistically meaningful empirical results.
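For orientation, the global setup corresponds to a training routine of roughly the following shape (our own sketch; the loss choice for the classification heads is an assumption):

```python
# Sketch of the global training configuration described above; `model` is any
# Keras classification model built as in Section 4.1.3 (see the sketch there).
from tensorflow import keras

def compile_and_train(model: keras.Model, train_ds, epochs: int):
    """Default optimizer settings and batch-wise training used in this paper."""
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),  # no weight decay
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    # train_ds is expected to yield batches of 128 samples, as stated above.
    return model.fit(train_ds, epochs=epochs)
```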

4.1.2. Datasets

For our experiments, we utilize the image classification datasets MNIST [10,11], Fashion-MNIST [42] and SVHN [43,44]. The dataset versions supplied by TensorFlow Datasets [45] are used. As a single preprocessing step, we map each pixel (per channel) of a sample image through division by 255 in order to obtain floating point values in the range of $[0, 1]$. MNIST contains grayscale images of handwritten digits. Fashion-MNIST consists of grayscale images of fashion products in the same dimensionality as MNIST. SVHN comprises real photos of house-number signs from a street-level view. We employ the dataset version with cropped images which cover a single digit in the center corresponding to the respective class label. SVHN describes a much more severe digit classification task than MNIST since SVHN’s images are subject to real-world phenomena like lighting conditions, view-angle distortions and coloring effects. As usual in realistic applications, a strong dispersion in the number of data instances per class can be observed in the training and test sets of SVHN. The central dataset characteristics, including sample images, can be found in Appendix A.
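The described pipeline essentially reduces to loading a split from TensorFlow Datasets and rescaling the pixel values, for example as follows (our own sketch; the shuffle buffer size is an assumption):

```python
# Sketch of the dataset pipeline: load a split via TensorFlow Datasets and map
# pixel values into [0, 1] as the single preprocessing step.
import tensorflow as tf
import tensorflow_datasets as tfds

def load_split(name: str = "mnist", split: str = "train", batch_size: int = 128):
    ds = tfds.load(name, split=split, as_supervised=True)
    ds = ds.map(lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
    return ds.shuffle(10_000).batch(batch_size)  # buffer size is an assumption
```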

4.1.3. Generic Analysis Model

As a basis for our experiments (except for the last one), we design a generalized analysis model as displayed in Figure 6. The analysis model is composed of a one-layer Feature Extractor (FE) component, k intermediate blocks of fully-connected layers and a classification layer for z classes. The convolutional layer (Conv) is specified by its kernel size, activation function and the number of filters. In addition, Conv uses a stride factor of 2, same padding and BN with initial values of β = 0 and γ = 1 before applying its activation function. In the case of CapsNets, the introducing capsule layer, known as Primary Capsules [8,9], builds two-dimensional capsules over the available filters from the FE. Intermediate blocks of fully connected layers are divided into a capsule and scalar-neuron (plain) variant. A dense capsule block follows our defined mapping in Equation (13). Both block variants come with an effective bandwidth of 32 scalar elements per layer, which are grouped into eight vectors with a dimensionality of 4 for a capsule-based block. The same bandwidth implies comparable representative capacity and ensures an equal amount of parameters for both block types. Reshaping operations take place to flatten convolutional or capsule dimensions if necessary.
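For the plain variant, the generic analysis model can be sketched roughly as follows (our own reading of Figure 6; the input shape, kernel size and filter count of the FE are assumptions, since only its stride, padding and BN initialization are stated above):

```python
# Sketch of the plain (scalar-neuron) analysis model: one convolutional FE,
# k intermediate fully connected blocks of 32 units with BN and activation,
# and a softmax classification head with z classes.
from tensorflow import keras

def build_plain_analysis_model(k: int, z: int = 10, activation: str = "relu",
                               beta_init: float = 0.0, gamma_init: float = 1.0):
    inputs = keras.Input(shape=(28, 28, 1))        # assumption: MNIST-sized input
    x = keras.layers.Conv2D(32, kernel_size=3,     # kernel size/filters assumed
                            strides=2, padding="same", use_bias=False)(inputs)
    x = keras.layers.BatchNormalization()(x)       # BN(0, 1) for the FE
    x = keras.layers.Activation(activation)(x)
    x = keras.layers.Flatten()(x)
    for _ in range(k):                             # k intermediate blocks
        x = keras.layers.Dense(32, use_bias=False)(x)
        x = keras.layers.BatchNormalization(
            beta_initializer=keras.initializers.Constant(beta_init),
            gamma_initializer=keras.initializers.Constant(gamma_init))(x)
        x = keras.layers.Activation(activation)(x)
    outputs = keras.layers.Dense(z, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```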

4.2. Biased-BN Strategies

In our first experiment, we investigate the impacts of using diverse BN initialization strategies on the training process of a plain neural network with a depth of 90 intermediate blocks, as defined in Figure 6. According to our introductory examination of the degradation problem in Figure 1, a plain network with a depth of 90 blocks noticeably suffers from degenerative effects for the standard BN(0, 1) initialization with conventional activation functions. Specifically, we want to address in our experiment the following Research Questions (RQs):
RQ 1: 
Can BN layers act as ANGs for subsequent activation functions to enable the training of deeper neural networks?
RQ 1a: 
What are preferable BN initializations?
RQ 1b: 
What is the appearance of BN’s parameters after successful model training?
To be consistent with the analysis setup in Figure 1, all configurations are applied on the MNIST classification dataset for five runs with 180 epochs each. For the default BN(0, 1) initialization, we re-use the previous analysis results, which are emphasized with an asterisk appended to the configuration name. The analysis results of testing varying BN initialization strategies are illustrated in Figure 7. In addition, Table 2 and Table 3 summarize the final accuracies reached after the application of all training epochs.
RQ 1a: The development of the training losses in Figure 7 reveals the importance of an appropriate harmonization between the initialization of BN’s parameters and the utilized activation function. In accordance with our former thoughts, shifting the input distribution of rectified activation functions into one of their linear regions drastically improves the performance for the best model and all models on average without additional costs. In particular, the analysis results in Figure 7 and Table 2 support our hypothesis that a small overlap between the input distribution and the nonlinear regions of an activation function may be helpful for steering network optimization. In that sense, ReLU and ELU effectively mitigate the degradation problem with BN(2, 1) but become progressively inferior when further increasing the bias parameter $\beta$. However, Leaky ReLU counterintuitively works best for BN(3, 1) on average, indicating a latent interdependence between the type of nonlinearity and the degree of its usage through the input distribution.
The training progressions for the sigmoid and the hyperbolic tangent illustrate that activation functions which are point-symmetric about the origin potentially profit from input distribution scaling, which also controls the usage of nonlinear regions. Although sigmoid and tanh can effectively prevent network degeneration for single runs with a parameter initialization of $\gamma < 1$, their performance on average remains unstable, as displayed in Table 3. This circumstance can be explained with Equation (16), which emphasizes the impact of the scaling factor $\gamma$ on the variance in feature values. Thus, $\gamma$ is a more critical parameter to adjust than $\beta$ because of its direct influence on the severity of layer-wise feature discrimination.
The analysis results imply that biasing the initialization of BN$(\beta, \gamma)$ towards linear processing regions of the activation function is crucial if the nonlinear capacity of a neural network exceeds the required representational expressiveness determined by the application task complexity. Since the degradation problem occurs in very deep networks even with a linear activation function (cf. Figure 1), we recommend selecting the parameter $\beta$ such that the normalized input distribution is located in a region of linear progression with small overlaps into nonlinear behavior. We assume that a limited degree of nonlinear behavior allows a neural network to transfer only a relevant subset of output signals per layer, which prevents information diffusion.
RQ 1b: The last two rows in Figure 7 display the mean parameter deviations per BN layer from their initial values. For obtaining meaningful results, the parameter constellation of the resulting models from the superior BN initialization scheme is considered for each activation function. Parameter deviations are visualized for the best run and as average over all runs per subplot. The computation rule for the mean deviation of the bias parameter β is stated as
$$\Delta\bar{\beta}^{(l)} = \frac{1}{n}\sum_{i=1}^{n}\big|\,\beta_0 - \beta_i^{(l)}\,\big| \tag{21}$$
where $l$ specifies the index of the considered BN layer, $n$ equals the number of features in this BN layer and $\beta_0$ corresponds to the initial parameter value. Since we apply the same initialization scheme to all BN layers and each intermediate block in our network architecture contains an equal number of neurons, $n$ and $\beta_0$ stay identical for all BN layers. The computation rule in Equation (21) analogously holds for the scaling parameter $\gamma$.
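In code, this per-layer statistic amounts to a short loop over the trained model's BN layers, for example (our own sketch):

```python
# Sketch of the mean deviation in Equation (21): average absolute change of the
# BN beta parameters per layer, given the common initial value beta_0.
import numpy as np
from tensorflow import keras

def mean_beta_deviation_per_layer(model: keras.Model, beta_0: float):
    deviations = []
    for layer in model.layers:
        if isinstance(layer, keras.layers.BatchNormalization):
            beta = layer.beta.numpy()                       # shape: (n,)
            deviations.append(float(np.mean(np.abs(beta_0 - beta))))
    return deviations  # one value per BN layer, ordered by depth
```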
Regarding the mean deviations for the parameter $\beta$, we observe a similar pattern of parameter usage independent of the concrete activation function. In more detail, the neural networks tend to increase their parameter adjustments beginning from the centered layers, at half of the network depth, towards the first and the last network layers. These parameter adjustments form a valley-like structure with, typically, a higher mountain arising in the lower network area. A straightforward explanation for the strong bias parameter modifications in the lower network layers is the process of representation learning, in which the selective power of nonlinearities is accessed to differentiate between sample features in order to establish higher-order object classes. We assume that this process continues until a suitable granularity level of object classes is reached for achieving the training objective. In the case of the MNIST dataset, the highest-order entities probably correspond to the ten digit classes. The continual decrease in parameter adjustments from the lower network layers to the centered layers visualizes the diminishing of feature/concept diversity with increasing abstraction level, i.e., few higher-order concepts are composed of many lower-order concepts.
Since the network potential exceeds the needed representational capacity for solving the MNIST classification task, the final entity encodings are passed via almost linear processing through the centered network layers. The absence of degradation effects despite the long sequence of nearly linear processing layers around half of the network depth supports our former assumption of preventing information diffusion by eliminating individual neuron outputs per layer. We hypothesize that the increasing parameter modifications after the centered layers serve the task-specific interpretation of the learned representations with a rising degree of detail. Possibly, this circumstance also holds for shallower neural networks but is usually not observable because of an almost full exploitation of nonlinear capacities.
The last row in Figure 7 highlights the sparse usage of the scaling parameter $\gamma$ in contrast to the bias parameter $\beta$. The main purpose of optimizing the parameter $\gamma$ appears to be controlling the feature variances according to Equation (16). This explanation is supported by the slight parameter deviations in the lower layers, which we assign to the process of representation learning, and the strong parameter adaptation in the last layer for increasing the sensitivity to feature discrimination with the aim of solving the classification task.
RQ 1: The analysis results point out that BN layers can indeed act as ANGs for subsequent activation functions to enable the training of deeper neural networks. A necessary requirement, for example, is a BN initialization scheme that biases the network mechanics towards linear processing at the start of the training procedure. On the one hand, adjusting the bias parameter $\beta$ is a cost-free option to effectively mitigate the degradation problem with relatively stable performance over several runs. On the other hand, the scaling parameter $\gamma$ has a limited ability to restrict the usage of nonlinear behavior and significantly influences the severity of feature discrimination, leading to a fragile performance over multiple runs. The experiment results indicate that the centric linear regions in the point-symmetric activation functions sigmoid and tanh may be insufficiently large to establish ANGs through BN layers. Interestingly, ReLU achieves superior performance with the desirable properties of fast convergence and high accuracy with small variance over several runs, but only for the BN(2, 1) initialization scheme.
Despite the stabilizing effects of a proper BN initialization on the resulting performance of deeper neural networks, the training progressions in Figure 7 rarely constitute monotonically decreasing functions. Specifically, we assume that the heavy peaks in the training losses occur as a consequence of parameter adjustments in the lower layers that cause accumulated modifications in the neural activity of the following layers, leading to strong gradient changes. This situation is probably even amplified by the disruptive nature of nonlinearities. Consistent with these observations, the training process takes a long duration on average to consolidate an adequate direction of optimization. Nevertheless, the results suggest that network depth remains an important factor for the required number of training epochs due to the time needed for learning to pass the relevant output signals per layer. In the subsequent experiment, we will investigate how the training process of very deep neural networks can be further stabilized.

4.3. Handling Salient Gradients with AMSGrad

AMSGrad [46] extends gradient descent optimizers that are based on moving averages, such as Adam, by normalizing their updates with the maximum second-moment estimate observed over the training course. This procedure in particular prevents temporary increases in the effective learning rate for rare but salient gradients [46]; a minimal sketch of the mechanism follows the experimental setup below. Our motivation for equipping the Adam optimizer with AMSGrad is summarized in the following research question:
RQ 2: 
Does AMSGrad improve training loss convergence for deeper neural networks through its more conservative handling of salient gradients?
To answer this question, we investigate the smoothness of the training loss progress when Adam is enhanced with AMSGrad. For this purpose, we run the same setup as in our previous experiment with 90 intermediate blocks but apply solely the superior BN ( β , γ ) initialization strategy per activation function. The analysis results are visualized in Figure 8 and Table 4.
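For reference, the sketch below shows the one-line change that enables AMSGrad in Keras, together with a simplified per-parameter illustration of the update rule (bias correction omitted); it is an illustration of the mechanism, not the training code used for the experiments.

```python
import numpy as np
import tensorflow as tf

# Enabling AMSGrad in Keras is a single flag on the Adam optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)

# Simplified per-parameter update (bias correction omitted for brevity):
def amsgrad_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m, v, v_hat = state
    m = b1 * m + (1 - b1) * grad        # first moment, as in Adam
    v = b2 * v + (1 - b2) * grad ** 2   # second moment, as in Adam
    v_hat = np.maximum(v_hat, v)        # AMSGrad: keep the running maximum
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat)
```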
RQ 2: According to Figure 8, the slight modification of the Adam optimizer by AMSGrad substantially smooths the training loss progress and promotes fast convergence in the early training stages for all considered activation functions. Table 4 shows that AMSGrad leads to higher final accuracies with significantly smaller variance over multiple runs for the rectified functions. In fact, AMSGrad improves the best-reached accuracies for ReLU, Leaky ReLU and ELU with superior BN initialization by { 0.56 , 1.75 , 0.88 } percentage points, respectively. The corresponding mean accuracies are even more convincing, with an improvement of { 3.39 , 6.94 , 12.96 } percentage points and a vanishingly small variance, each time below a quarter of a percentage point. These results indicate that AMSGrad's handling of salient gradients allows properly initialized deeper neural networks to make use of their extended depth in fulfilling the defined training objective. Specifically, AMSGrad prevents network degeneration by keeping the optimization from falling into a network configuration with an excessive integration of nonlinear behavior. It is noteworthy that the high accuracies in Table 4 result from a network with a width of only 32 neurons in each fully connected hidden layer. This observation supports the well-known intuition of the representational power induced by the vertical depth of a neural network. The symmetric activation functions tanh and sigmoid still suffer from their insufficiently large linear interval around the origin, which is observable through their fragile performance over multiple runs. The exceptionally high accuracy in the best run for tanh may be explained by a beneficial network parameter constellation that was reached mainly by chance. Nevertheless, AMSGrad tends to reduce the occurrence of disruptive parameter changes during network optimization. We assume that this tendency accounts for the degraded performance of the sigmoid activation function in the current experiment.
All experiments conducted so far imply that adding nonlinear behavior works similarly to a trapdoor function: introducing further nonlinearities into a neural network is controllable by the learning procedure, but the intentional reverse operation appears to be a disproportionately harder or even impossible task. In general, our empirical results suggest that a network initialization biased towards linearity and the smoothing of salient gradients with AMSGrad not only retain performance but enable deeper neural networks to translate auxiliary nonlinear capacities into performance improvements. Since the optimization of deeper neural networks provokes strong gradient changes initiated by the accumulated effects of parameter adjustments over the vertical network size, we identify the appropriate handling of salient gradients, e.g., with AMSGrad, as an essential requirement for training success.
Motivated by the positive results with AMSGrad, we investigate in our next experiment how training convergence is affected by further increasing the network depth. Again, we formulate the rationale of the analysis as a research question:
RQ 3: 
What are the limitations, with respect to increasing network depth, of scalar-neuron networks that are initially biased towards linear processing?
In this experiment, we aggressively increase the network depth with { 120 ,   150 ,   200 ,   250 ,   300 ,   400 ,   500 } intermediate blocks to reveal whether performance degeneration still occurs and how precisely it appears. As the rectified functions performed similarly well for a depth of 90 blocks, the analysis is restricted to the use of ReLU with BN(2,1). Our previous experiment emphasized the fast convergence of the training loss if AMSGrad is applied. To save valuable computational resources, we first determine the required number of epochs until convergence. The accuracy gain g(t) that remains achievable after epoch t can be defined as
g(t) = (ā_∞ − ā(t)) / ā_∞,
where ā(t) denotes the expected accuracy after epoch t and ā_∞ the expected accuracy value at convergence. For the sake of simplicity, we assume that all training courses produce monotonically increasing accuracy functions. The gain in accuracy is approximated based on the mean training loss development from the preceding experiment. The left subfigure (a) of Figure 9 presents the percentage gain in accuracy for the activation functions ReLU, Leaky ReLU and ELU. In each case, ā_∞ is chosen as the final accuracy value after the termination of the 180 training epochs. ReLU and ELU reach 99% of the final accuracy value after circa 70 epochs, whereas Leaky ReLU requires about 30 additional epochs. Since we expect a network with further increased depth and an initialization biased towards linear processing to converge within a similar time window as the identical network with fewer layers, we restrict the training process for all network depths to 80 epochs. The experiment results are summarized in subfigure (b) of Figure 9 and in the first section of Table 5.
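The epoch budget can be derived from a recorded accuracy curve as in the following small sketch; the curve passed to the function is assumed to be the (approximately monotone) mean training accuracy per epoch, and the 1% threshold mirrors the criterion used above.

```python
import numpy as np

def accuracy_gain(acc_curve):
    """g(t) = (a_inf - a(t)) / a_inf, with a_inf taken as the final
    accuracy of the recorded curve (assumed to be near convergence)."""
    acc = np.asarray(acc_curve, dtype=float)
    a_inf = acc[-1]
    return (a_inf - acc) / a_inf

def epochs_until_gain_below(acc_curve, threshold=0.01):
    """First epoch (1-indexed) whose remaining gain drops below 1%."""
    g = accuracy_gain(acc_curve)
    return int(np.argmax(g < threshold)) + 1
```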
RQ 3: In subfigure (b) of Figure 9, we still encounter the degradation problem: network performance gradually degenerates with an increase in network depth. In accordance with our expectations, all losses seem to converge at an early stage of the training process. Moreover, AMSGrad effectively smooths the training loss curves, even for the deepest network with 500 intermediate blocks. Interestingly, the use of batch normalization biased towards linear processing together with AMSGrad enables a network with 300 fully-connected layers, in the best run, to attain an accuracy of over 90% on the MNIST classification task after only 80 training epochs, as reported in Table 5. Performance degeneration probably still happens, despite the initially biased linear processing capabilities, because of the conceptual inability of single-path plain networks to establish direct identity mappings.

4.4. Linear-Initialized Parametric Activation Functions

Now we want to examine whether parametric activation functions can create ANGs in a similar manner to BN, and whether both strategies can cooperate to enhance the resulting network performance. For this purpose, we formulate the following research questions:
RQ 4: 
Do parametric activation functions constitute a valid alternative for realizing ANGs?
RQ 4a: 
Can BN and parametric activation functions operate complementarily?
RQ 4b: 
How do the parameters of BN and the parametric activation function appear after successful model training?
To answer these questions, we sequentially investigate the effects of the parametric activation functions PReLU, SReLU/D-PReLU and APLU on the resulting model performance. All configurations train a neural network with a depth of 90 intermediate blocks on the MNIST classification dataset over five runs with 180 epochs each. The Adam optimizer is equipped with AMSGrad for smoothing salient gradients. To explore the influence of PReLU and APLU on network stability in isolation from BN (apart from BN's distribution normalization property), we test both activation functions with BN(0,1) and BN(2,1). Following the empirical results in [31], we choose s ∈ { 1 , 3 , 5 } piecewise-linear components for APLU. In the case of D-PReLU, the biased BN initialization uses a random choice between the values { −2 , +2 } with equal probability per neuron to ensure a balanced usage of the nonlinear regions. For a fair comparison between SReLU and D-PReLU, SReLU's adaptive thresholds follow the initialization t_min = −2 and t_max = 2. In contrast to the use of solely biased BN, initializing parametric activation functions towards linear processing creates network symmetry without overlaps of nonlinear regions. To counteract this condition, we additionally try an initialization of PReLU's learnable parameter with slight variation, i.e., ρ 0 ∼ N ( 1 , 0.01 ). The experimental results are reported in the first four columns of Figure 10 and in Table 6, Table 7 and Table 8.
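Two of these initialization details could be sketched in Keras as follows: a β-initializer for BN({±2},1) that assigns −2 or +2 per neuron with equal probability, and a PReLU slope initialized close to the identity. Whether N(1, 0.01) denotes the variance or the standard deviation is an assumption here (stddev = 0.1 corresponds to a variance of 0.01), and D-PReLU itself, being our own activation, is not re-implemented in this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers, initializers

class RandomSignConstant(initializers.Initializer):
    """Beta initializer for BN({+-2}, 1): every neuron receives -2 or +2
    with equal probability, balancing the usage of both nonlinear regions."""
    def __init__(self, magnitude=2.0):
        self.magnitude = magnitude

    def __call__(self, shape, dtype=None):
        signs = tf.where(tf.random.uniform(shape) < 0.5, -1.0, 1.0)
        return self.magnitude * tf.cast(signs, dtype or tf.float32)

bn_pm2 = layers.BatchNormalization(beta_initializer=RandomSignConstant(2.0))

# PReLU with its slope initialized near the identity, rho_0 ~ N(1, 0.01):
prelu = layers.PReLU(
    alpha_initializer=initializers.RandomNormal(mean=1.0, stddev=0.1))
```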
RQ 4: The first two columns of Figure 10 visualize the mean and best training loss developments for the parametric activation functions. First of all, involving a small variation in the initial parameter values of PReLU slightly accelerates training convergence independently of the concrete BN initialization. For that reason, the parameters of SReLU (thresholds excluded), D-PReLU and APLU were also initialized with the same degree of variation in all configurations. Beyond that, we see a differentiated picture insofar as PReLU suffers from slightly degraded accuracy with higher result variance if BN is initially biased, whereas APLU significantly benefits from biased BN. D-PReLU with its biased BN strategy actually achieves the highest accuracy on average with marginal variance. Surprisingly, although SReLU possesses auxiliary threshold parameters that are jointly learned during training, performance still degenerates at certain points and cannot be fully recovered. Despite APLU's expressive power for function approximation, increasing its number of piecewise-linear components drastically slows training convergence and reduces the resulting performance when unbiased BN is used. Thus, parametric activation functions can potentially constitute a valid alternative for realizing ANGs, but they provide less smooth transitions for integrating nonlinear behavior than biased BN.
RQ 4a: According to Table 7 and Table 8, D-PReLU and APLU indeed operate complementarily with biased BN. Result variance is kept minimal and final accuracies are at least slightly enhanced. In particular, APLU only benefits from an increased number of components if BN is initially biased. PReLU shows the inverse relationship in Table 6; however, its training loss also converges faster and more smoothly with biased BN.
RQ 4b: The third and fourth columns of Figure 10 display the final parameter deviations of BN and the parametric activation functions from their initial values, indicating the usage level of nonlinear behavior per network layer. For each activation function, the superior model configuration in terms of the highest average accuracy is used, except for APLU, where the single-component variant is preferred for better visualization. The parameter modifications agree with the results from our previous experiments, where mostly linear processing occurs around the centered layers at half of the network depth. Again, these results imply that initially biasing a neural network towards mostly linear processing helps the network restrict nonlinear behavior before degradation effects arise.
Finally, we elaborate on the robustness of parametric activation functions against the degradation problem by gradually increasing the network depth:
RQ 5: 
To what extent do parametric activation functions alleviate the degradation problem with respect to an increase in network depth?
Per parametric activation function, the superior configuration from the preceding experiment is utilized. The analysis results are summarized in the last column of Figure 10 and in the corresponding sections of Table 5.
RQ 5: The mean training loss progress in the last column of Figure 10 reveals a particular resilience of parametric activation functions against the degradation problem, provided that the number of adjustable parameters per activation function is moderate. In that sense, PReLU and D-PReLU outperform APLU with its five piecewise-linear components, which results in ten learnable parameters per neuron. It is remarkable that PReLU and D-PReLU keep performance degeneration relatively small in vast networks with five hundred stacked fully connected layers. This observation leads to the insight that parametric activation functions could constitute a crucial factor in facilitating the training of enormously deep neural networks. The insight is also supported by the consistently high accuracies achieved in the best run per network depth, as stated in Table 5. In fact, the minor performance reductions of PReLU and D-PReLU in the best runs can probably be attributed to the constant budget of 80 training epochs for all network depths. However, the degradation problem still arises as an average performance deficit with a significant gain in result variability. One possible explanation that reconciles these two contradictory observations is the disruptive nature of activation function modifications, which might be helpful for faster exploration of the available solution space.

4.5. Vast CapsNets with Gated Routing

In our preceding experiments, we revealed that the resilience of plain networks against the degradation problem can generally be improved by including ANGs in the neural architecture. Regarding static and parametric activation functions, the strategies of combining ReLU with BN(2,1) and D-PReLU with BN({±2},1) demonstrated superior performance. Both strategies satisfy our identified preconditions on activation functions to allow for self-building skip connections within CapsNets. In Appendix B, we experimentally verify the absence of the degradation problem for CapsNets with a depth of up to 90 intermediate blocks using SR with the ReLU and D-PReLU strategies. Moreover, GR consistently averts the degradation problem in the case of a linear activation function. Therefore, in our next experiment we investigate the effectiveness of GR in extremely deep CapsNets using the above ANGs, which is summarized in the subsequent research question:
RQ 6: 
Can vast CapsNets with GR and a suitable ANG resist the degradation problem during training?
To keep computation time manageable, we immediately train each configuration with the largest depth from our previous experiments, i.e., 500 intermediate blocks. In addition, we employ only two capsules per layer with 16 dimensions each, again leading to 32 scalar elements in total. To obtain convincing statements about GR's capability, we also run each CapsNet configuration with SR. The analysis results are presented in Figure 11 and Table 9.
RQ 6: In Figure 11, both CapsNet variants with GR greatly outperform their counterparts with SR, which indicates the ability of CapsNets with GR to autonomously shape skip connections. This observation supplies direct evidence for the effectiveness of our theoretical framework and its practical implementation. Thus, we conclude that CapsNets can indeed unify the mechanics in residual and highway networks to ensure the successful training of very deep architectures.

4.6. Convolutional Capsule Network

In our final experiment, we investigate the potential of a purely capsule-driven network with GR on the task of image classification. For this purpose, we apply a convolutional CapsNet to the datasets Fashion-MNIST and SVHN. In short, the network is composed of capsule-based feature maps with GR between feature maps of the same dimensionality, grouped convolutions [27] for downsampling feature dimensions, and 30 dense capsule blocks. Network scaling is modest in order to keep computation time feasible. A detailed architecture description can be found in Appendix C. To the best of our knowledge, this is the first purely capsule-driven architecture; usually, capsules appear only in the final two layers of a larger network structure. We formulate the following research question:
RQ 7: 
Do capsules provide the potential to embody entities of arbitrary level?
As a strategy against overfitting, dataset-specific data augmentation is used. The data augmentation for Fashion-MNIST includes a horizontal flip ( p = 0.5 ) as well as random zoom and translation with factors of 0.1. The data augmentation for SVHN imitates the natural variances in the data using random zoom, translation and shear with factors of 0.1. In addition, random rotation with a maximal magnitude of 5% of 2π is employed. For each dataset, five randomly initialized CapsNets are trained over 100 epochs. The experimental results are illustrated in Figure 12 and Table 10.
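The two augmentation pipelines could be assembled from Keras preprocessing layers roughly as follows; the exact layer choices and factor conventions (e.g., the KerasCV RandomShear arguments) are assumptions for illustration rather than the original training code.

```python
import tensorflow as tf
from tensorflow.keras import layers
import keras_cv  # provides RandomShear, cf. [39]

# Fashion-MNIST: horizontal flip (p = 0.5), zoom and translation of 0.1.
fashion_mnist_aug = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# SVHN: zoom, translation and shear of 0.1, rotation up to 5% of 2*pi.
svhn_aug = tf.keras.Sequential([
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    keras_cv.layers.RandomShear(x_factor=0.1, y_factor=0.1),
    layers.RandomRotation(0.05),
])
```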
RQ 7: The training and validation loss developments in Figure 12 point out that entirely capsule-oriented networks can operate on low-level as well as high-level entities. This is also supported by the adaptation of GR's bias parameter, similar to our previous results. However, our pure CapsNet does not achieve state-of-the-art test accuracy, as stated in Table 10. We attribute this deficit to the major limitation of CapsNets, their intense computational complexity, which results in long training sessions even for small architectures. We hope that future research into efficient routing algorithms and their low-level implementations will make CapsNets applicable to a broader range of machine-learning tasks.

5. Discussion

5.1. Remarks on the Degradation Problem

The experiments in this paper revealed that the degradation problem is versatile in its appearance. The most obvious form is degenerative performance with respect to model accuracy during inference and training. This aspect can be subdivided into global and local degenerative performance, where global refers to the average performance over several runs (including result variance) and local to the best performance reached within multiple runs. For example, we found global degradation to be stronger for parametric activation functions; however, local degradation remains noticeably small with a gain in network depth. Another facet of the degradation problem concerns the convergence behavior of the optimization process. We showed that without the application of AMSGrad (with the Adam optimizer), the learning curves of deep nets tend to be fragile and often fail to converge at all. We also introduced the idea of information diffusion as a crucial risk factor that promotes the degradation problem. In fact, our experimental results on plain networks at least indicate an entanglement between the kind and degree of nonlinearities and the magnitude of the degradation problem. Specifically, we hypothesize that GR's adaptive signal selection mechanism helps to overcome the degradation problem by reducing the effective risk of information diffusion.

5.2. Increased Depth in Plain Nets

Similar to previous work [25,47], our empirical results confirm that the successful training of very deep feedforward nets can be realized with a deliberate network design and initialization. We found that initially biasing neural nets towards linear processing and the proper handling of salient gradients (here: Adam with AMSGrad) constitute universal ingredients for the robust optimization of very deep networks. An outstanding insight of our experiments is the resistance of certain parametric activation functions (here: PReLU and D-PReLU) against the degradation problem despite several hundred layers. We attribute this ability to the disruptive re-forming of the loss function until the network is set into a proper condition.

5.3. Ensemble of Neural Processing Paths

In [24], the authors argued that residual networks with n units (i.e., layers) produce O(2^n) implicit paths of neural information processing, which delivers a primary explanation for their gain in performance. We connect to this view by arguing that CapsNets consisting of k capsules per layer and applying GR between consecutive layers can even amplify the positive effects on network performance by engendering O(2^(nk)) implicit paths. Although ResNeXts theoretically yield the same number of implicit paths, their bottleneck structure after the residual parts impedes the selection of individual paths, especially at the time of network initialization.
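As a purely illustrative calculation with hypothetical sizes: for n = 30 units, a residual network yields on the order of 2^30 ≈ 10^9 implicit paths, whereas a CapsNet with k = 2 capsules per layer and GR yields on the order of 2^(nk) = 2^60 ≈ 10^18, i.e., the gap grows exponentially in k.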

5.4. Impact of Capsule Quantity

ResNeXts enhance classification accuracy by introducing grouped convolutions with an auxiliary dimension in network design, termed cardinality, which denotes the number of groups [27]. Grouped convolutions are inspired by the work in [48] on dividing convolutional operations over multiple GPUs with subsequent merging in order to reduce memory requirements [27]. We ascribe the performance gain of grouped convolutions to their capsule-like mechanics, where self-regularization occurs as restricted entity recognition per capsule (as opposed to complete scene classification in the layers of plain nets), which results in stronger representations (cf. [27]). Thus, we equate cardinality with the number of capsules per layer and width with the neuron count per capsule.

5.5. Computational Efficiency Considerations

In general, the integration of ANGs into a neural network potentially involves an additional parameter investment or increases the time complexity of optimization, resulting in longer training sessions. However, biased BN probably causes no auxiliary resource consumption in most cases because of its standard usage in very deep networks for preserving numerical stability. In the worst case, BN adds one bias and one scaling parameter per neuron to a network, although the bias parameters act as substitutes for the neurons' bias weights. The computational overhead induced by BN is justified by its beneficial impact on stable training. Using parametric activation functions as ANG realization, we observed superior performance for low-parameterized variants like PReLU or D-PReLU. The parameter investment for these variants remains negligible, even in the useful combination with biased BN. The outcome of our experiments actually indicates that parametric activation functions can promote faster exploration of the solution space through their disruptive nature, leading to an accelerated optimization process. Despite GR's reuse of transformation matrices, it requires extra D × N × M bias parameters per layer for gating neuron signals, where N and M correspond to the number of lower-layer and higher-layer capsules, respectively, and D denotes the neuron count per lower-layer capsule. An alternative option is the use of N × M parameters with only one bias weight per lower-layer capsule, which could possibly mitigate overfitting issues. Note that CapsNets usually involve D × V × N × M weights for the transformation matrices between two consecutive capsule layers, with V representing the higher-layer capsule dimensionality. Although the additional parameter investment appears relatively high, comparable gating mechanisms for skip connection creation, as in highway networks, scale analogously. If SA is activated within GR, a further V × M parameters per layer are needed to select between incoming capsule signals. We argue that SA should be activated in the general case to mitigate noise accumulation over capsule layers. In comparison to regular routing algorithms such as DR and k-MR with their iterative computations, GR induces a lower time complexity for signal propagation. Evidently, CapsNets generally suffer from higher computational efforts and often include more parameters than plain neural networks, but they also offer special properties for representation learning. Furthermore, CapsNets with self-building skip connections theoretically lead to a significant gain in implicit information processing paths compared to residual and highway networks. Thus, the choice of using CapsNets with self-building skip connections is determined by a trade-off between the representational capacity needed for a specific application task and the resulting computational efficiency.
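The stated parameter counts can be made concrete with a small helper; the layer sizes in the example correspond to the vast CapsNet configuration above (D = V = 16, N = M = 2), and the function is purely illustrative rather than part of any library.

```python
def capsule_layer_params(D, V, N, M, full_gr_bias=True, self_attention=False):
    """Parameters between two consecutive capsule layers, following the
    counts stated in the text. D: neurons per lower-layer capsule,
    V: higher-layer capsule dimensionality, N/M: number of lower-/higher-
    layer capsules."""
    transform = D * V * N * M                        # transformation matrices
    gr_bias = D * N * M if full_gr_bias else N * M   # GR gating biases
    sa = V * M if self_attention else 0              # optional SA parameters
    return transform + gr_bias + sa

# Vast CapsNet configuration: 2 capsules of 16 dimensions per layer.
print(capsule_layer_params(16, 16, 2, 2, self_attention=True))  # 2144
```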

5.6. Limitations of Our Empirical Study

The present paper serves as an introductory work for enabling the successful training of enlarged CapsNets by facilitating the learning of self-building skip connections. For this purpose, we provide ANG strategies for stabilizing the optimization process of very deep networks to effectively mitigate the degradation problem. Although we conducted a comprehensive experimental analysis to validate our theoretical framework and proposed methods, several limitations of our empirical study restrict the scope of our work. First, our experiments are limited to neural networks with extended depths but small neuron counts per layer to keep memory and computational efforts minimal. To obtain meaningful results, most of the experiments were applied to the simple MNIST dataset. Note that the use of simple datasets especially intensifies the degradation problem by magnifying the overparametrization of our models with extended depths. Thus, the empirical results are sufficient for demonstrating the preservation of a stable optimization process in terms of training curves and accuracy, but they lack a state-of-the-art performance comparison with specialized models on more complex datasets and a broad range of application tasks. Specifically, we categorize our proposed approaches as complementary techniques to enhance existing state-of-the-art models and to contribute to the understanding of the dynamics in very deep networks. In GR, we included a softmax temperature hyperparameter for enhanced control over the selectivity towards incoming capsule signals. We hypothesize that the importance of the softmax temperature increases with the capsule count per layer. Since we used very deep but narrow CapsNets in our experiments, we kept the softmax temperature constant at a value of one. However, a proper hyperparameter search, including other weight initialization schemes, is recommended for future investigations of advanced CapsNet models targeting state-of-the-art performance.
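For completeness, the effect of such a temperature can be illustrated with a generic tempered softmax; this is a general illustration of the hyperparameter, not the GR implementation itself.

```python
import numpy as np

def tempered_softmax(logits, temperature=1.0):
    """Softmax with temperature: T < 1 sharpens the selection between
    incoming capsule signals, T > 1 flattens it (T = 1 in our experiments)."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(tempered_softmax([2.0, 1.0, 0.5], temperature=0.5))
```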

5.7. Future Research Directions

Following the versatile results and the limitations of our empirical study, we identify several aspects that need to be addressed in future research efforts. First of all, a straightforward next step is the integration of our proposed methods into existing state-of-the-art models and the conduction of thorough performance comparisons for individual application domains. Since we introduced the concept of ANGs as a necessary precondition to facilitate the forming of self-building skip connections, we also suggest a thorough theoretical analysis in future work that formalizes ANGs and systematically investigates their effects on the training stability of diverse neural architectures. Moreover, we limited our practical ANG realizations to biased BN and parametric activation functions; the design of other ANG realizations for various network types and application tasks constitutes a promising research direction. Although we supplied implicit evidence for the creation of self-building skip connections based on the model's resistance against the degradation problem and visualized the controlled access of nonlinear capacities within very deep networks, we highly recommend the future development of objective metrics for quantifying the existence and effectiveness of skip connection mechanisms. We believe that such metrics could deliver valuable insights and significantly contribute to better model comparability in this field.

6. Conclusions

In this work, we systematically analyzed the impact of the degradation problem on the training of very deep plain networks and CapsNets up to a depth of 500 fully-connected layers. Through the unification of residual and highway networks into CapsNets, we provide a theoretical basis for beneficial properties in neural architectures to overcome limitations on network depth. Moreover, we complement our abstract concepts, such as ANGs, with practical methods like initially biased BN or parametric activation functions that can be easily integrated into arbitrary net structures to effectively mitigate the degradation problem. Finally, the promising empirical results with CapsNets using GR open the door to the successful training of extended capsule-driven architectures, which will hopefully motivate future research efforts in this direction.

Author Contributions

Conceptualization, N.A.K.S. and F.S.; methodology, N.A.K.S.; software, N.A.K.S.; validation, N.A.K.S. and F.S.; formal analysis, N.A.K.S.; investigation, N.A.K.S.; resources, F.S.; data curation, N.A.K.S.; writing—original draft preparation, N.A.K.S.; writing—review and editing, F.S.; visualization, N.A.K.S.; supervision, F.S.; project administration, N.A.K.S. and F.S.; funding acquisition, N.A.K.S. and F.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Nikolai A. K. Steur is supported with a Landesgraduiertenförderungsgesetz (LGFG) scholarship by Ulm University. The work of Friedhelm Schwenker is supported by the German Research Foundation (DFG) under Grant SCHW 623/7-1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The used datasets are publicly available. Code is available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Datasets Characteristics

Table A1 summarizes the central properties of the datasets used in this paper. In the case of rounded values, we simply truncate the remaining decimals. In addition, Figure A1 displays 20 random samples of each dataset in a separate row, in the same order as in Table A1.
Table A1. Basic characteristics of the used datasets in the experiments (with Fashion-MNIST abbreviated as F-MNIST).
| Dataset | # Classes | Sample Format | Train Set Size | Test Set Size | Samples per Class, Train (AVG / MIN / MAX) | Samples per Class, Test (AVG / MIN / MAX) |
| MNIST | 10 | 28 × 28 × 1 | 60 K | 10 K | 6.0 K / 5.4 K / 6.7 K | 1.0 K / 0.8 K / 1.1 K |
| F-MNIST | 10 | 28 × 28 × 1 | 60 K | 10 K | 6.0 K / 6.0 K / 6.0 K | 1.0 K / 1.0 K / 1.0 K |
| SVHN | 10 | 32 × 32 × 3 | 73 K | 26 K | 7.3 K / 4.6 K / 13.8 K | 2.6 K / 1.5 K / 5.0 K |
Figure A1. Row-wise 20 random samples for each dataset in Table A1.

Appendix B. CapsNet Resistance Against the Degradation Problem

To empirically verify that our gathered knowledge about the resistance of certain network configurations against the degradation problem can be directly transferred to CapsNets, we copy the experimental setup of our introductory experiment on CapsNets but use the found superior configurations. More precisely, we combine SR with the two configurations of ReLU with BN ( 2 , 1 ) and D-PReLU with BN ( { ± 2 } , 1 ) . To evaluate if GR prevents the degradation problem independent from specific activation functions, we additionally test CapsNets with GR and linear activation. This configuration is subdivided into the use of Self-Attention (SA). The corresponding analysis results are documented in Figure A2 and Table A2.
Figure A2. Final training loss (left) and training accuracy (right) averaged over five runs using CapsNets with increasing network depth and distinct configurations.
Table A2. Final training accuracies (in %) for distinct CapsNets using 90 intermediate blocks and 180 training epochs.
| Configuration | Best | AVG |
| SR + ReLU + BN(2,1) | 99.47 | 99.33 ± 0.08 |
| SR + D-PReLU + BN({±2},1) | 99.54 | 99.47 ± 0.07 |
| GR + linear + BN(0,1) + SA | 99.33 | 99.21 ± 0.11 |
| GR + linear + BN(0,1) + ¬SA | 99.23 | 99.08 ± 0.13 |
Figure A2 shows that our proposed methods effectively avert the degradation problem for CapsNets with moderate network sizes of up to 90 fully-connected layers. Since GR enables a CapsNet to cut off irrelevant signals, information diffusion does not arise. Table A2 emphasizes that GR with linear activation performs on par with the other configurations. Finally, we observe a slight advantage when SA is used, which we do not attribute to the additional parameter investment due to the overparameterization of the analysis model.

Appendix C. Pure Capsule-Driven Architecture

The pure capsule-driven architecture comprises three convolutional capsule units, followed by 30 dense capsule blocks and a classification head, following the definition in Figure 6. A convolutional capsule unit is visualized in Figure A3. First, a two-dimensional convolutional capsule layer (ConvCaps2D) is created via a regular convolutional layer, as known from primary capsule layers. Afterward, a ConvCaps2D with identical dimensionality is constructed, which applies GR between capsules at the same spatial position. Finally, a grouped convolution layer halves the spatial feature dimensions and doubles the capsule size. Concretely, we keep the number of c = 4 capsule-based feature maps fixed and start with m = 4 capsule dimensions in the first ConvCaps2D. The 30 dense capsule blocks conduct GR between adjacent blocks and use as ANG the activation function D-PReLU with BN({±2},1) initialization.
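The downsampling step of such a unit could be sketched with a grouped convolution that uses one group per capsule-based feature map; the kernel size, padding and the plain Conv2D realization are assumptions for illustration, not the exact layer configuration of the architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def capsule_downsampling(x, c=4, m=4):
    """Grouped convolution over c capsule-based feature maps of dimension m:
    halves the spatial feature dimensions and doubles the capsule size.
    Input channels c * m, output channels c * 2m, groups = c so that each
    capsule feature map is processed independently."""
    return layers.Conv2D(filters=c * 2 * m, kernel_size=3, strides=2,
                         padding="same", groups=c)(x)

# Example: 28 x 28 feature maps with c = 4 capsules of dimension m = 4.
x = tf.random.normal((1, 28, 28, 16))
y = capsule_downsampling(x)   # shape (1, 14, 14, 32)
```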
Figure A3. Convolutional capsule unit with GR between two layers of identical dimensionality and image downsampling using grouped convolutions.

References

  1. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  2. He, K.; Sun, J. Convolutional Neural Networks at Constrained Time Cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
  3. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway Networks. In Proceedings of the Deep Learning Workshop at the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1–6. [Google Scholar] [CrossRef]
  4. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2015, 28, 1–9. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar] [CrossRef]
  7. Monti, R.P.; Tootoonian, S.; Cao, R. Avoiding degradation in deep feed-forward networks by phasing out skip-connections. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Rhodes, Greece, 4–7 October 2018; pp. 447–456. [Google Scholar]
  8. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar] [CrossRef]
  9. Hinton, G.; Sabour, S.; Frosst, N. Matrix Capsules with EM Routing. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–15. [Google Scholar]
  10. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  11. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits. ATT Labs [Online]. 2010. Volume 2. Available online: http://yann.lecun.com/exdb/mnist (accessed on 5 June 2024).
  12. Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming Auto-Encoders. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Espoo, Finland, 14–17 June 2011; pp. 44–51. [Google Scholar] [CrossRef]
  13. Kim, J.; Jang, S.; Choi, S.; Park, E. Text Classification using Capsules. arXiv 2018, arXiv:1808.03976v2. [Google Scholar] [CrossRef]
  14. Ren, H.; Lu, H. Compositional coding capsule network with k-means routing for text classification. Pattern Recognit. Lett. 2022, 160, 1–8. [Google Scholar] [CrossRef]
  15. Steur, N.A.K.; Schwenker, F. Next-Generation Neural Networks: Capsule Networks with Routing-by-Agreement for Text Classification. IEEE Access 2021, 9, 125269–125299. [Google Scholar] [CrossRef]
  16. Shakhnoza, M.; Sabina, U.; Sevara, M.; Cho, Y.I. Novel Video Surveillance-Based Fire and Smoke Classification Using Attentional Feature Map in Capsule Networks. Sensors 2022, 22, 98. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, Y.; Huang, L.; Jiang, S.; Wang, Y.; Zou, J.; Fu, H.; Yang, S. Capsule Networks Showed Excellent Performance in the Classification of hERG Blockers/Nonblockers. Front. Pharmacol. 2020, 10, 1631. [Google Scholar] [CrossRef]
  18. Fukushima, K. Cognitron: A Self-organizing Multilayered Neural Network. Biol. Cybern. 1975, 20, 121–136. [Google Scholar] [CrossRef] [PubMed]
  19. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 1–8. [Google Scholar]
  20. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1–6. [Google Scholar]
  21. Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–14. [Google Scholar] [CrossRef]
  22. Oyedotun, O.K.; Ismaeil, K.A.; Aouada, D. Training very deep neural networks: Rethinking the role of skip connections. Neurocomputing 2021, 441, 105–117. [Google Scholar] [CrossRef]
  23. Oyedotun, O.K.; Ismaeil, K.A.; Aouada, D. Why Is Everyone Training Very Deep Neural Network with Skip Connections? IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5961–5975. [Google Scholar] [CrossRef]
  24. Veit, A.; Wilber, M.; Belongie, S. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2016, 29, 1–9. [Google Scholar]
  25. Balduzzi, D.; Frean, M.; Leary, L.; Lewis, J.P.; Ma, K.W.D.; McWilliams, B. The Shattered Gradients Problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 342–350. [Google Scholar]
  26. Zagoruyko, S.; Komodakis, N. DiracNets: Training Very Deep Neural Networks Without Skip-Connections. arXiv 2017, arXiv:1706.00388v2. [Google Scholar] [CrossRef]
  27. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
  29. Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep Learning with S-Shaped Rectified Linear Activation Units. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
  30. Shang, W.; Sohn, K.; Almeida, D.; Lee, H. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2217–2225. [Google Scholar]
  31. Agostinelli, F.; Hoffman, M.; Sadowski, P.; Baldi, P. Learning Activation Functions to Improve Deep Neural Networks. In Proceedings of the Workshop Contribution at the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–9. [Google Scholar] [CrossRef]
  32. Trottier, L.; Giguère, P.; Chaib-draa, B. Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 207–214. [Google Scholar] [CrossRef]
  33. Godfrey, L.B. An Evaluation of Parametric Activation Functions for Deep Learning. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3006–3011. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–15. [Google Scholar] [CrossRef]
  35. Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar] [CrossRef]
  36. Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–13. [Google Scholar] [CrossRef]
  37. Python Programming Language. Available online: https://www.python.org/ (accessed on 5 June 2024).
  38. Chollet, F. Keras. Available online: https://keras.io (accessed on 5 June 2024).
  39. Wood, L.; Tan, Z.; Stenbit, I.; Bischof, J.; Zhu, S.; Chollet, F.; Sreepathihalli, D.; Sampath, R. KerasCV. 2022. Available online: https://github.com/keras-team/keras-cv (accessed on 5 June 2024).
  40. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467v2. [Google Scholar] [CrossRef]
  41. Kingma, D.P.; Lei Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar] [CrossRef]
  42. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747v2. [Google Scholar] [CrossRef]
  43. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proceedings of the Workshop on Deep Learning and Unsupervised Feature Learning at the 25th Conference on Neural Information Processing Systems (NIPS), Granada, Spain, 12–15 December 2011; pp. 1–9. [Google Scholar]
  44. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y.; The Street View House Numbers (SVHN) Dataset. Stanford University [Online]. 2011. Available online: http://ufldl.stanford.edu/housenumbers (accessed on 5 June 2024).
  45. TensorFlow Datasets: A Collection of Ready-to-Use Datasets. Available online: https://www.tensorflow.org/datasets (accessed on 5 June 2024).
  46. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–9. [Google Scholar]
  47. Oyedotun, O.K.; Shabayek, A.E.R.; Aouada, D.; Ottersten, B. Going Deeper With Neural Networks Without Skip Connections. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1756–1760. [Google Scholar] [CrossRef]
  48. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2012, 25, 1–9. [Google Scholar] [CrossRef]
Figure 1. Visualization of the degradation problem in relation to the network depth based on (a) plain networks and (b) CapsNets with distinct activation functions, using the MNIST classification dataset. A plain network contains 32 neurons per layer, while a CapsNet consists of eight capsules with four neurons each. Network depth is stated as the number of intermediate blocks, including an introducing convolutional layer and a closing classification head. Each block consists of a fully connected layer followed by BN and the application of the activation function. In the case of CapsNets, signal flow between consecutive capsule layers is controlled by a specific routing procedure. The final loss (as cross-entropy) and accuracy, both based on the training set, are reported as an average over five runs with random network initialization. Each run comprises 2 n training epochs, where n equals the number of intermediate blocks.
Figure 2. Shortcut and skip connections (highlighted in red) in residual learning. (a) Original definition of a shortcut connection with projection matrix based on [5]. (b) Pattern for self-building skip connections in a CapsNet with SR and an activation function with a suitable linear interval.
Figure 3. Replacement of the static signal propagation in a CapsNet with a nonlinear routing procedure to form parametric information flow gates. (a) Basic pattern with a single routing gate. (b) Exemplary skip path (highlighted in red) crossing multiple layers and routing gates.
Figure 4. Customizing the initialization scheme for BN ( β , γ ) allows the training of deeper networks by constraining the input distribution (in blue) of an activation function to be positioned in a mostly linear section. Exemplary initializations are shown for (a) sigmoid with BN ( 0 , 0.5 ) , and (b) Leaky ReLU with BN ( −2 , 1 ) .
Figure 5. Parametric versions of ReLU with (a) single and (b) four degree(s) of freedom using an exemplary parameter range of ρ i [ 0 , 1 ] . (a) PReLU learns a nonlinearity specification ρ for input values below zero and directly passes signals above zero. (b) SReLU applies the identity function within the interval [ t min , t max ] , and learns two individual nonlinearity specifications ρ 1 and ρ 2 outside of the centered interval.
Figure 6. (a) Generic model architecture with (b) one-layer Feature Extractor (FE), a classification head with z classes and (c) intermediate blocks consisting of fully-connected layers. Dense blocks are specified via capsules or scalar neurons (plain) for the fully-connected units.
Figure 7. First two rows: Mean (first row) and best (second row) training loss progressions over five runs for each BN ( β , γ ) initialization scheme per activation function. Last two rows: Mean deviation per BN layer of the final β i and γ i parameters from their initial values, using the identified superior BN initialization scheme for each activation function. Per plot the model parameter deviations are shown for the best run and as average over all five runs.
Figure 8. (a) Mean and (b) best training loss development over five runs using 90 intermediate blocks, AMSGrad and the superior BN ( β , γ ) initialization strategy per activation function. Both subfigures provide an inset as a zoom-in for tight regions.
Figure 9. (a) Percentage gain in accuracy for the remaining epochs, measured in relation to the final accuracy. Accuracy gains below one percentage point (red line) are shown in gray. (b) Mean training loss development over five runs for varying network depths using ReLU, AMSGrad and the BN ( 2 , 1 ) initialization strategy.
Figure 10. Each row summarizes the experiment results of the parametric activation functions PReLU, SReLU/D-PReLU and APLU, respectively. First two columns: Mean (first column) and best (second column) training loss development over five runs using AMSGrad and varying initialization strategies for BN ( β , γ ) and the activation function parameters. Insets are provided as zoom-in for tight regions. Second two columns: Mean parameter deviations per layer from their initial values with respect to BN and the parametric activation function. In each case, the identified superior configuration strategy is used. For APLU the configuration with s = 1 is preferred against s = 5 for the benefit of proper visualization. Last column: Mean training loss progress over five runs for varying network depths using the identified superior configuration strategy.
Figure 11. Mean training loss development over five runs using CapsNets with a depth of 500 intermediate blocks and varying routing procedures, activation functions and BN initializations.
Figure 12. (a) Mean training (solid) and validation (dotted) loss progressions over five runs for the pure capsule-driven architecture. (b) Mean bias parameter deviation of GR after training from their initial value of 3 .
Table 1. Final training accuracies (in %) for distinct activation functions using a network depth of 90 intermediate blocks.
| Model | Configuration | Best | AVG |
| Plain Net | linear | 47.52 | 41.92 ± 9.21 |
| Plain Net | ReLU | 33.77 | 20.15 ± 7.13 |
| Plain Net | Leaky ReLU | 44.94 | 31.70 ± 11.75 |
| Plain Net | ELU | 77.38 | 56.98 ± 14.09 |
| Plain Net | tanh | 19.65 | 18.43 ± 1.09 |
| Plain Net | sigmoid | 85.21 | 68.00 ± 20.80 |
| CapsNet | SR + linear | 43.67 | 41.68 ± 2.43 |
| CapsNet | SR + sigmoid | 72.93 | 68.75 ± 3.98 |
| CapsNet | SR + Leaky ReLU | 63.17 | 37.90 ± 13.99 |
| CapsNet | SR + squash | 89.48 | 57.66 ± 30.92 |
| CapsNet | k-MR + squash | 11.23 | 11.23 ± 0.00 |
| CapsNet | DR + squash | 11.23 | 11.23 ± 0.00 |
Table 2. Final training accuracies (in %) for different BN initialization strategies using 90 intermediate blocks.
| Activation ϕ | BN(0,1) * Best | BN(0,1) * AVG | BN(2,1) Best | BN(2,1) AVG | BN(3,1) Best | BN(3,1) AVG | BN(4,1) Best | BN(4,1) AVG |
| ReLU | 33.77 | 20.15 ± 7.13 | 98.71 | 95.77 ± 2.85 | 87.52 | 79.77 ± 9.28 | 87.10 | 56.69 ± 26.65 |
| Leaky ReLU | 44.94 | 31.70 ± 11.75 | 98.86 | 76.17 ± 13.28 | 97.18 | 91.74 ± 6.54 | 83.69 | 78.18 ± 6.34 |
| ELU | 77.38 | 56.98 ± 14.09 | 98.60 | 86.40 ± 14.75 | 96.95 | 70.50 ± 26.19 | 84.16 | 73.51 ± 10.05 |
Table 3. Final training accuracies (in %) for different BN initialization strategies using 90 intermediate blocks.
Activation ϕ    BN(0,1)*                 BN(0,0.6)                BN(0,0.2)
                Best    AVG              Best    AVG              Best    AVG
tanh            19.65   18.43 ±  1.09    48.61   48.43 ±  2.97    91.09   51.37 ± 20.97
sigmoid         85.21   68.00 ± 20.80    98.12   76.02 ± 23.20    11.23   11.23 ±  0.00
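The BN(β, γ) notation in Tables 2 and 3 refers to the initial values of the batch-normalization shift β and scale γ. A minimal PyTorch sketch of such an initialization, assuming nn.BatchNorm1d, whose learnable bias and weight correspond to β and γ:

```python
import torch.nn as nn

def make_bn(num_features, beta=0.0, gamma=1.0):
    """BatchNorm1d whose affine parameters start at the given beta and gamma."""
    bn = nn.BatchNorm1d(num_features)
    nn.init.constant_(bn.bias, beta)     # beta: initial shift
    nn.init.constant_(bn.weight, gamma)  # gamma: initial scale
    return bn

bn_relu = make_bn(32, beta=2.0, gamma=1.0)   # BN(2, 1), e.g. paired with ReLU
bn_tanh = make_bn(32, beta=0.0, gamma=0.2)   # BN(0, 0.2), e.g. paired with tanh
```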
Table 4. Final training accuracies (in %) for the AMSGrad optimizer extension using 90 intermediate blocks.
Activation ϕ    Best    AVG
ReLU            99.27   99.16 ±  0.07
Leaky ReLU      98.93   98.68 ±  0.24
ELU             99.48   99.36 ±  0.10
tanh            99.20   79.15 ± 18.89
sigmoid         69.61   65.55 ± 13.80
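The configurations in Table 4 only change the optimizer extension; AMSGrad is available as a flag of the Adam implementation in PyTorch. A minimal sketch, where the model and learning rate are illustrative placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))  # placeholder model

# AMSGrad keeps the running maximum of the second-moment estimate
# instead of its exponentially decayed average.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```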
Table 5. Final training accuracies (in %) for plain networks with increasing depth trained over 80 epochs using AMSGrad.
Network Depth    ReLU + BN(2,1)           PReLU + BN(0,1)          D-PReLU + BN({±2},1)     APLU (s=5) + BN(2,1)
(in # Blocks)    Best    AVG              Best    AVG              Best    AVG              Best    AVG
120              98.49   98.15 ±  0.22    98.91   78.33 ± 28.17    98.97   96.64 ±  4.34    98.94   96.50 ±  3.88
150              98.36   96.18 ±  3.83    98.17   91.83 ±  7.39    98.92   94.82 ±  4.91    97.75   81.51 ± 21.70
200              96.67   96.44 ±  0.14    98.47   90.03 ±  9.59    98.60   84.44 ± 12.03    96.15   69.19 ± 22.04
250              95.22   94.07 ±  1.50    98.20   78.74 ± 15.48    98.16   89.68 ±  8.35    42.36   30.48 ±  7.54
300              92.32   84.51 ±  4.96    97.96   79.26 ± 17.45    98.28   89.77 ± 11.15    55.82   35.63 ± 12.01
400              77.97   67.31 ±  9.11    96.86   74.26 ± 12.19    98.65   81.07 ± 20.83    44.27   31.29 ±  9.25
500              58.24   48.22 ±  7.81    93.46   67.98 ± 19.48    94.31   73.34 ± 19.34    21.78   19.17 ±  4.37
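The depth sweep in Table 5 only varies the number of intermediate blocks. The following is a simplified sketch of such a plain network, under the assumption that each intermediate block is a fully connected layer followed by BN and the activation; the width, input shape and the introducing and closing layers are placeholders rather than the paper's exact architecture:

```python
import torch.nn as nn

def plain_network(depth, width=32, in_features=784, num_classes=10,
                  activation=nn.ReLU):
    """Plain network with `depth` intermediate blocks of Linear -> BN -> activation."""
    layers = [nn.Flatten(), nn.Linear(in_features, width)]
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), activation()]
    layers.append(nn.Linear(width, num_classes))  # classification head
    return nn.Sequential(*layers)

net = plain_network(depth=120)  # first depth setting of Table 5
```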
Table 6. Final training accuracies (in %) for PReLU with 90 intermediate blocks using AMSGrad.
Initialization                  Best    AVG
BN(0,1) + ρ₀ = 1                99.27   99.07 ± 0.12
BN(0,1) + ρ₀ ∼ N(1, 0.01)       99.34   99.19 ± 0.09
BN(2,1) + ρ₀ = 1                99.52   97.51 ± 3.92
BN(2,1) + ρ₀ ∼ N(1, 0.01)       99.55   97.62 ± 3.57
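The two PReLU initializations in Table 6 differ only in whether the slope parameters start exactly at ρ₀ = 1 or are perturbed with Gaussian noise. A minimal sketch on top of nn.PReLU, assuming per-feature slope parameters and reading the noise specification as a standard deviation of 0.01:

```python
import torch
import torch.nn as nn

def make_prelu(num_features, rho0=1.0, noise_std=None):
    """PReLU whose slopes start at rho0, optionally perturbed by Gaussian noise."""
    prelu = nn.PReLU(num_parameters=num_features, init=rho0)
    if noise_std is not None:
        with torch.no_grad():
            prelu.weight.normal_(mean=rho0, std=noise_std)
    return prelu

prelu_const = make_prelu(32)                  # rho0 = 1
prelu_noisy = make_prelu(32, noise_std=0.01)  # rho0 ~ N(1, 0.01)
```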
Table 7. Final training accuracies (in %) for SReLU and D-PReLU with 90 intermediate blocks using AMSGrad.
Configuration               Best    AVG
SReLU + BN(0,1)             99.44   93.32 ± 8.09
D-PReLU + BN({±2},1)        99.41   99.39 ± 0.05
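SReLU, as used in Table 7, keeps an identity regime on a centered interval [t_min, t_max] and learns separate slopes outside of it (cf. Figure 5b). Below is a simplified sketch with fixed interval boundaries and two learnable slopes ρ₁ and ρ₂; the paper's exact parameterization of SReLU, and in particular of D-PReLU, may differ:

```python
import torch
import torch.nn as nn

class SimpleSReLU(nn.Module):
    """Identity on [t_min, t_max]; learnable slopes rho1/rho2 outside the interval."""
    def __init__(self, t_min=-1.0, t_max=1.0, rho_init=0.25):
        super().__init__()
        self.t_min, self.t_max = t_min, t_max
        self.rho1 = nn.Parameter(torch.tensor(rho_init))  # slope below t_min
        self.rho2 = nn.Parameter(torch.tensor(rho_init))  # slope above t_max

    def forward(self, x):
        below = self.t_min + self.rho1 * (x - self.t_min)  # continuous at t_min
        above = self.t_max + self.rho2 * (x - self.t_max)  # continuous at t_max
        return torch.where(x < self.t_min, below,
                           torch.where(x > self.t_max, above, x))
```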
Table 8. Final training accuracies (in %) for APLU with 90 intermediate blocks using AMSGrad.
Initialization        Best    AVG
BN(0,1) + s = 1       98.58   96.13 ±  4.46
BN(0,1) + s = 3       98.64   96.50 ±  2.95
BN(0,1) + s = 5       87.88   63.85 ± 14.04
BN(2,1) + s = 1       99.39   99.22 ±  0.14
BN(2,1) + s = 3       99.36   99.32 ±  0.04
BN(2,1) + s = 5       99.52   99.38 ±  0.12
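APLU in Table 8 adds s learnable hinge units on top of ReLU. A minimal sketch assuming the common adaptive piecewise linear formulation f(x) = max(0, x) + Σᵢ aᵢ · max(0, −x + bᵢ), where the slopes aᵢ and hinge positions bᵢ are learned per feature; the initial values below are illustrative:

```python
import torch
import torch.nn as nn

class SimpleAPLU(nn.Module):
    """ReLU plus s learnable hinges: f(x) = relu(x) + sum_i a_i * relu(-x + b_i)."""
    def __init__(self, num_features, s=5):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(s, num_features))  # hinge slopes a_i
        self.b = nn.Parameter(torch.randn(s, num_features))  # hinge positions b_i

    def forward(self, x):                 # x: (batch, num_features)
        out = torch.relu(x)
        for a_i, b_i in zip(self.a, self.b):
            out = out + a_i * torch.relu(-x + b_i)
        return out
```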
Table 9. Final training accuracies (in %) for distinct CapsNets using 500 intermediate blocks and 80 training epochs.
Configuration                      Best    AVG
SR + ReLU + BN(2,1)                50.59   46.17 ±  5.58
SR + D-PReLU + BN({±2},1)          91.57   58.61 ± 23.85
GR + ReLU + BN(2,1)                99.32   99.21 ±  0.10
GR + D-PReLU + BN({±2},1)          99.22   99.21 ±  0.05
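The GR configurations in Table 9 replace static signal propagation with parametric information-flow gates whose bias parameters are initialized to 3 (cf. Figure 12b). The exact GR procedure is defined in the main text; the snippet below is only a schematic, hypothetical illustration of a sigmoid gate with such a bias initialization, not the paper's formulation:

```python
import torch
import torch.nn as nn

class SigmoidGate(nn.Module):
    """Hypothetical gate blending a routed signal with its layer input."""
    def __init__(self, num_features, bias_init=3.0):
        super().__init__()
        self.linear = nn.Linear(num_features, num_features)
        nn.init.constant_(self.linear.bias, bias_init)  # sigmoid(3) is roughly 0.95

    def forward(self, routed, skip):
        gate = torch.sigmoid(self.linear(skip))
        # With bias_init = 3 the gate initially favors the skip path, so a very
        # deep stack of such gates starts close to an identity mapping.
        return gate * skip + (1.0 - gate) * routed
```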
Table 10. Final training and test accuracies (in %) of the convolutional CapsNet on Fashion-MNIST and SVHN.
Set      Fashion-MNIST             SVHN
         Best    AVG               Best    AVG
Train    86.11   85.77 ± 0.22      83.33   81.53 ± 1.97
Test     87.08   85.61 ± 1.02      86.59   85.38 ± 1.80
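The evaluations in Table 10 use Fashion-MNIST and SVHN, both of which are available through torchvision. A minimal loading sketch, where the transform is a placeholder and normalization or augmentation would be added as needed:

```python
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # placeholder preprocessing

fmnist_train = torchvision.datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transform)
fmnist_test = torchvision.datasets.FashionMNIST(
    root="data", train=False, download=True, transform=transform)

svhn_train = torchvision.datasets.SVHN(
    root="data", split="train", download=True, transform=transform)
svhn_test = torchvision.datasets.SVHN(
    root="data", split="test", download=True, transform=transform)
```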
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
