A Step Towards Neuroplasticity: Capsule Networks with Self-Building Skip Connections
Figure 1. Visualization of the degradation problem in relation to network depth for (a) plain networks and (b) CapsNets with distinct activation functions, using the MNIST classification dataset. A plain network contains 32 neurons per layer, while a CapsNet consists of eight capsules with four neurons each. Network depth is stated as the number of intermediate blocks, including an introductory convolutional layer and a closing classification head. Each block consists of a fully connected layer followed by BN and the application of the activation function. In the case of CapsNets, the signal flow between consecutive capsule layers is controlled by a specific routing procedure. The final loss (cross-entropy) and accuracy, both based on the training set, are reported as an average over five runs with random network initialization. Each run comprises 2n training epochs, where n equals the number of intermediate blocks.
Figure 2. Shortcut and skip connections (highlighted in red) in residual learning. (a) Original definition of a shortcut connection with projection matrix based on [5]. (b) Pattern for self-building skip connections in a CapsNet with SR and an activation function with a suitable linear interval.
Figure 3. Replacement of the static signal propagation in a CapsNet with a nonlinear routing procedure to form parametric information flow gates. (a) Basic pattern with a single routing gate. (b) Exemplary skip path (highlighted in red) crossing multiple layers and routing gates.
Figure 4. Customizing the initialization scheme of BN(β, γ) allows the training of deeper networks by constraining the input distribution (in blue) of an activation function to be positioned in a mostly linear section. Exemplary initializations are shown for (a) sigmoid with BN(0, 0.5) and (b) Leaky ReLU with BN(−2, 1).
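The biased-BN idea of Figure 4 translates almost directly into code. Below is a minimal sketch, assuming the Keras/TensorFlow stack cited in the references; the helper name is ours, and the example values simply mirror the two panels of the figure.

```python
import tensorflow as tf

def biased_bn(beta: float, gamma: float) -> tf.keras.layers.BatchNormalization:
    # BN layer whose learnable shift/scale start at (beta, gamma) instead of (0, 1),
    # so the subsequent activation initially sees inputs from a mostly linear section.
    return tf.keras.layers.BatchNormalization(
        beta_initializer=tf.keras.initializers.Constant(beta),
        gamma_initializer=tf.keras.initializers.Constant(gamma),
    )

bn_before_sigmoid = biased_bn(beta=0.0, gamma=0.5)      # BN(0, 0.5), panel (a)
bn_before_leaky_relu = biased_bn(beta=-2.0, gamma=1.0)  # BN(-2, 1), panel (b)
```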
Figure 5. Parametric versions of ReLU with (a) a single degree and (b) four degrees of freedom, using an exemplary parameter range of ρ_i ∈ [0, 1]. (a) PReLU learns a nonlinearity specification ρ for input values below zero and directly passes signals above zero. (b) SReLU applies the identity function within the interval [t_min, t_max] and learns two individual nonlinearity specifications ρ_1 and ρ_2 outside of this centered interval.
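Panel (b)'s behaviour is fully described by the caption, and the following Keras sketch mirrors it; the class name, the fixed thresholds and the per-channel slope parameterization are illustrative assumptions rather than the paper's exact SReLU/D-PReLU definitions. Initializing both slopes to 1 makes the unit start out as the identity, in line with the linear-initialized activations studied in Section 4.4.

```python
import tensorflow as tf

class SReLULike(tf.keras.layers.Layer):
    # Identity inside [t_min, t_max]; learnable slopes rho_1 / rho_2 outside (cf. Figure 5b).
    def __init__(self, t_min=-1.0, t_max=1.0, rho_init=1.0, **kwargs):
        super().__init__(**kwargs)
        self.t_min, self.t_max, self.rho_init = t_min, t_max, rho_init

    def build(self, input_shape):
        init = tf.keras.initializers.Constant(self.rho_init)
        # One slope per channel for each side of the central linear interval.
        self.rho_1 = self.add_weight(name="rho_1", shape=(input_shape[-1],), initializer=init)
        self.rho_2 = self.add_weight(name="rho_2", shape=(input_shape[-1],), initializer=init)

    def call(self, x):
        below = self.t_min + self.rho_1 * (x - self.t_min)  # left branch, slope rho_1
        above = self.t_max + self.rho_2 * (x - self.t_max)  # right branch, slope rho_2
        return tf.where(x < self.t_min, below, tf.where(x > self.t_max, above, x))
```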
Figure 6. (a) Generic model architecture with (b) one-layer Feature Extractor (FE), a classification head with z classes and (c) intermediate blocks consisting of fully-connected layers. Dense blocks are specified via capsules or scalar neurons (plain) for the fully-connected units.
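For the plain (scalar-neuron) variant of this generic architecture, a minimal functional-API sketch could look as follows. The block width of 32 and the MNIST-sized input follow the caption of Figure 1 and the datasets appendix; the FE kernel size, ReLU activation and softmax head are assumptions, not the paper's exact hyperparameters.

```python
import tensorflow as tf

def build_plain_model(num_blocks: int, width: int = 32, num_classes: int = 10,
                      beta: float = 0.0, gamma: float = 1.0) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(28, 28, 1))                    # MNIST-sized input (illustrative)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)  # one-layer Feature Extractor (FE)
    x = tf.keras.layers.Flatten()(x)
    for _ in range(num_blocks):                                   # intermediate blocks: Dense -> BN -> activation
        x = tf.keras.layers.Dense(width)(x)
        x = tf.keras.layers.BatchNormalization(
            beta_initializer=tf.keras.initializers.Constant(beta),
            gamma_initializer=tf.keras.initializers.Constant(gamma))(x)
        x = tf.keras.layers.ReLU()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)  # classification head (z classes)
    return tf.keras.Model(inputs, outputs)
```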
Figure 7. First two rows: Mean (first row) and best (second row) training loss progressions over five runs for each BN(β, γ) initialization scheme per activation function. Last two rows: Mean deviation per BN layer of the final β_i and γ_i parameters from their initial values, using the identified superior BN initialization scheme for each activation function. Per plot, the model parameter deviations are shown for the best run and as an average over all five runs.
Figure 8. (a) Mean and (b) best training loss development over five runs using 90 intermediate blocks, AMSGrad and the superior BN(β, γ) initialization strategy per activation function. Both subfigures provide an inset as a zoom-in for tight regions.
Figure 9. (a) Percentage gain in accuracy for the remaining epochs, measured in relation to the final accuracy. Accuracy gains below one percentage point (red line) are grayed out. (b) Mean training loss development over five runs for varying network depths using ReLU, AMSGrad and the BN(2, 1) initialization strategy.
Figure 10. Each row summarizes the experimental results for the parametric activation functions PReLU, SReLU/D-PReLU and APLU, respectively. First two columns: Mean (first column) and best (second column) training loss development over five runs using AMSGrad and varying initialization strategies for BN(β, γ) and the activation function parameters. Insets are provided as zoom-ins for tight regions. Second two columns: Mean parameter deviations per layer from their initial values with respect to BN and the parametric activation function. In each case, the identified superior configuration strategy is used. For APLU, the configuration with s = 1 is preferred over s = 5 for the benefit of proper visualization. Last column: Mean training loss progress over five runs for varying network depths using the identified superior configuration strategy.
Figure 11. Mean training loss development over five runs using CapsNets with a depth of 500 intermediate blocks and varying routing procedures, activation functions and BN initializations.
Figure 12. (a) Mean training (solid) and validation (dotted) loss progressions over five runs for the pure capsule-driven architecture. (b) Mean bias parameter deviation of GR after training from their initial value of −3.
Figure A1. Row-wise 20 random samples for each dataset in Table A1.
Figure A2. Final training loss (left) and training accuracy (right) averaged over five runs using CapsNets with increasing network depth and distinct configurations.
Figure A3. Convolutional capsule unit with GR between two layers of identical dimensionality and image downsampling using grouped convolutions.
Abstract
1. Introduction
- Firstly, we theoretically unify the skip connection mechanisms of residual networks and highway networks into CapsNets by showing their functional equivalence under certain conditions. Moreover, we identify the preconditions necessary for the formation of self-building skip connections within CapsNets. Our theoretical findings have direct implications for the design of arbitrary neural architectures by demystifying specific properties of their dynamics.
- Secondly, we introduce the concept of Adaptive Nonlinearity Gates (ANGs) through practical methods that fulfill these preconditions and help stabilize the training of very deep networks in general. These methods comprise straightforward strategies such as biased BN, parametric activation functions and adaptive signal propagation. In particular, we present the novel Doubly-Parametric ReLU (D-PReLU) activation function and design the Gated Routing (GR) procedure dedicated to the training of extremely deep CapsNets.
- Thirdly, we provide a comprehensive experimental study to substantiate our theoretical findings and the proposed methods. The empirical results reveal valuable insights for the optimization of very deep neural networks of any kind. In particular, our strategies effectively mitigate the degradation problem.
2. Theory
2.1. Skip Connection Pattern
2.2. Multi-Layered Skip Paths
2.3. Horizontal Network Scaling
3. Methods
3.1. Adaptive Nonlinearity Gates
3.1.1. Biased Batch Normalization
3.1.2. Parametric Activation Functions
3.2. Gated Routing with Self-Attention
Algorithm 1. Gated Routing (GR) from all n capsules in the preceding layer to an individual capsule in the next-higher layer. The symbols ⊕ and ⊙ refer to element-wise addition and multiplication, respectively.
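Algorithm 1 itself is not reproduced in this extract. Purely as an illustration, the sketch below shows one plausible reading of a single GR step with a self-attention-style gate and a strongly negative gate-bias initialization (cf. Figure 12b). The layer name, score function and aggregation details are assumptions and may differ from the paper's exact procedure.

```python
import tensorflow as tf

class GatedRoutingCell(tf.keras.layers.Layer):
    # Hedged sketch: all n input capsules cast votes for one higher-level capsule,
    # and a learned scalar gate per vote decides how much of it passes through.
    def __init__(self, capsule_dim: int, gate_bias_init: float = -3.0, **kwargs):
        super().__init__(**kwargs)
        self.capsule_dim = capsule_dim
        self.gate_bias_init = gate_bias_init  # strongly negative start, cf. Figure 12b

    def build(self, input_shape):
        n, d_in = input_shape[-2], input_shape[-1]
        # One transformation matrix per input capsule (produces the votes).
        self.w = self.add_weight(name="w", shape=(n, d_in, self.capsule_dim),
                                 initializer="glorot_uniform")
        # Scoring vector and per-capsule bias for the scalar gates.
        self.a = self.add_weight(name="a", shape=(self.capsule_dim,),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(n,),
                                 initializer=tf.keras.initializers.Constant(self.gate_bias_init))

    def call(self, capsules):                                     # capsules: (batch, n, d_in)
        votes = tf.einsum("bnd,nde->bne", capsules, self.w)       # (batch, n, capsule_dim)
        scores = tf.einsum("bne,e->bn", votes, self.a) + self.b   # one score per vote
        gates = tf.sigmoid(scores)[..., None]                     # (batch, n, 1)
        return tf.reduce_sum(gates * votes, axis=-2)              # gate (⊙) then aggregate (⊕)
```

A full capsule layer would apply one such cell per output capsule, or a vectorized equivalent over all output capsules.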
4. Results
4.1. Preliminaries
4.1.1. Global Setup
4.1.2. Datasets
4.1.3. Generic Analysis Model
4.2. Biased-BN Strategies
- RQ 1:
- Can BN layers act as ANGs for subsequent activation functions to enable the training of deeper neural networks?
- RQ 1a:
- What are preferable BN initializations?
- RQ 1b:
- What do BN’s parameters look like after successful model training?
4.3. Handling Salient Gradients with AMSGrad
- RQ 2:
- Does AMSGrad improve training loss convergence for deeper neural networks through a more conservative handling of salient gradients? (A configuration sketch follows this list.)
- RQ 3:
- What are the limitations, with increasing network depth, of scalar-neuron networks that are initially biased towards linear processing?
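For RQ 2, switching the optimizer to AMSGrad in the Keras/TensorFlow stack used here amounts to a single flag; the learning rate below is illustrative only.

```python
import tensorflow as tf

# Adam with the AMSGrad correction: the running maximum of past second-moment estimates
# is kept, which prevents the effective step size from growing again after rare, salient gradients.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)
```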
4.4. Linear-Initialized Parametric Activation Functions
- RQ 4:
- Do parametric activation functions constitute a valid alternative for realizing ANGs?
- RQ 4a:
- Can BN and parametric activation functions operate in a complementary manner?
- RQ 4b:
- What do the parameters of BN and the parametric activation function look like after successful model training?
- RQ 5:
- To what extent do parametric activation functions alleviate the degradation problem as network depth increases?
4.5. Vast CapsNets with Gated Routing
- RQ 6:
- Can vast CapsNets with GR and a suitable ANG resist the degradation problem during training?
4.6. Convolutional Capsule Network
- RQ 7:
- Do capsules provide the potential to embody arbitrary-level entities?
5. Discussion
5.1. Remarks on the Degradation Problem
5.2. Increased Depth in Plain Nets
5.3. Ensemble of Neural Processing Paths
5.4. Impact of Capsule Quantity
5.5. Computational Efficiency Considerations
5.6. Limitations of Our Empirical Study
5.7. Future Research Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Dataset Characteristics
| Dataset | # Classes | Sample Format | Total Size of Train Set | Total Size of Test Set | Samples per Class in Train Set (AVG / MIN / MAX) | Samples per Class in Test Set (AVG / MIN / MAX) |
|---|---|---|---|---|---|---|
| MNIST | 10 | | | | | |
| F-MNIST | 10 | | | | | |
| SVHN | 10 | | | | | |
Appendix B. CapsNet Resistance Against the Degradation Problem
| Configuration | Best | AVG |
|---|---|---|
| SR + ReLU + BN | | |
| SR + D-PReLU + BN | 99.54 | |
| GR + linear + BN + SA | | |
| GR + linear + BN + ¬SA | | |
Appendix C. Pure Capsule-Driven Architecture
References
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- He, K.; Sun, J. Convolutional Neural Networks at Constrained Time Cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway Networks. In Proceedings of the Deep Learning Workshop at the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1–6. [Google Scholar] [CrossRef]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2015, 28, 1–9. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar] [CrossRef]
- Monti, R.P.; Tootoonian, S.; Cao, R. Avoiding degradation in deep feed-forward networks by phasing out skip-connections. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Rhodes, Greece, 4–7 October 2018; pp. 447–456. [Google Scholar]
- Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar] [CrossRef]
- Hinton, G.; Sabour, S.; Frosst, N. Matrix Capsules with EM Routing. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–15. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits. ATT Labs [Online]. 2010. Volume 2. Available online: http://yann.lecun.com/exdb/mnist (accessed on 5 June 2024).
- Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming Auto-Encoders. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Espoo, Finland, 14–17 June 2011; pp. 44–51. [Google Scholar] [CrossRef]
- Kim, J.; Jang, S.; Choi, S.; Park, E. Text Classification using Capsules. arXiv 2018, arXiv:1808.03976v2. [Google Scholar] [CrossRef]
- Ren, H.; Lu, H. Compositional coding capsule network with k-means routing for text classification. Pattern Recognit. Lett. 2022, 160, 1–8. [Google Scholar] [CrossRef]
- Steur, N.A.K.; Schwenker, F. Next-Generation Neural Networks: Capsule Networks with Routing-by-Agreement for Text Classification. IEEE Access 2021, 9, 125269–125299. [Google Scholar] [CrossRef]
- Shakhnoza, M.; Sabina, U.; Sevara, M.; Cho, Y.I. Novel Video Surveillance-Based Fire and Smoke Classification Using Attentional Feature Map in Capsule Networks. Sensors 2022, 22, 98. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Huang, L.; Jiang, S.; Wang, Y.; Zou, J.; Fu, H.; Yang, S. Capsule Networks Showed Excellent Performance in the Classification of hERG Blockers/Nonblockers. Front. Pharmacol. 2020, 10, 1631. [Google Scholar] [CrossRef]
- Fukushima, K. Cognitron: A Self-organizing Multilayered Neural Network. Biol. Cybern. 1975, 20, 121–136. [Google Scholar] [CrossRef] [PubMed]
- Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 1–8. [Google Scholar]
- Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1–6. [Google Scholar]
- Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–14. [Google Scholar] [CrossRef]
- Oyedotun, O.K.; Ismaeil, K.A.; Aouada, D. Training very deep neural networks: Rethinking the role of skip connections. Neurocomputing 2021, 441, 105–117. [Google Scholar] [CrossRef]
- Oyedotun, O.K.; Ismaeil, K.A.; Aouada, D. Why Is Everyone Training Very Deep Neural Network with Skip Connections? IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5961–5975. [Google Scholar] [CrossRef]
- Veit, A.; Wilber, M.; Belongie, S. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2016, 29, 1–9. [Google Scholar]
- Balduzzi, D.; Frean, M.; Leary, L.; Lewis, J.P.; Ma, K.W.D.; McWilliams, B. The Shattered Gradients Problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 342–350. [Google Scholar]
- Zagoruyko, S.; Komodakis, N. DiracNets: Training Very Deep Neural Networks Without Skip-Connections. arXiv 2017, arXiv:1706.00388v2. [Google Scholar] [CrossRef]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
- Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep Learning with S-Shaped Rectified Linear Activation Units. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
- Shang, W.; Sohn, K.; Almeida, D.; Lee, H. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2217–2225. [Google Scholar]
- Agostinelli, F.; Hoffman, M.; Sadowski, P.; Baldi, P. Learning Activation Functions to Improve Deep Neural Networks. In Proceedings of the Workshop Contribution at the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Trottier, L.; Giguère, P.; Chaib-draa, B. Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 207–214. [Google Scholar] [CrossRef]
- Godfrey, L.B. An Evaluation of Parametric Activation Functions for Deep Learning. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3006–3011. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–15. [Google Scholar] [CrossRef]
- Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar] [CrossRef]
- Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–13. [Google Scholar] [CrossRef]
- Python Programming Language. Available online: https://www.python.org/ (accessed on 5 June 2024).
- Chollet, F. Keras. Available online: https://keras.io (accessed on 5 June 2024).
- Wood, L.; Tan, Z.; Stenbit, I.; Bischof, J.; Zhu, S.; Chollet, F.; Sreepathihalli, D.; Sampath, R. KerasCV. 2022. Available online: https://github.com/keras-team/keras-cv (accessed on 5 June 2024).
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467v2. [Google Scholar] [CrossRef]
- Kingma, D.P.; Lei Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar] [CrossRef]
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747v2. [Google Scholar] [CrossRef]
- Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proceedings of the Workshop on Deep Learning and Unsupervised Feature Learning at the 25th Conference on Neural Information Processing Systems (NIPS), Granada, Spain, 12–15 December 2011; pp. 1–9. [Google Scholar]
- Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. The Street View House Numbers (SVHN) Dataset. Stanford University [Online]. 2011. Available online: http://ufldl.stanford.edu/housenumbers (accessed on 5 June 2024).
- TensorFlow Datasets: A Collection of Ready-to-Use Datasets. Available online: https://www.tensorflow.org/datasets (accessed on 5 June 2024).
- Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–9. [Google Scholar]
- Oyedotun, O.K.; Shabayek, A.E.R.; Aouada, D.; Ottersten, B. Going Deeper With Neural Networks Without Skip Connections. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1756–1760. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. (NIPS) 2012, 25, 1–9. [Google Scholar] [CrossRef]
| | Configuration | Best | AVG |
|---|---|---|---|
| Plain Net | linear | | |
| | ReLU | | |
| | Leaky ReLU | | |
| | ELU | | |
| | tanh | | |
| | sigmoid | 85.21 | |
| CapsNet | SR + linear | | |
| | SR + sigmoid | | |
| | SR + Leaky ReLU | | |
| | SR + squash | | |
| | k-MR + squash | | |
| | DR + squash | | |
| Activation | BN* (Best) | BN* (AVG) | BN(β,γ) (Best) | BN(β,γ) (AVG) | BN(β,γ) (Best) | BN(β,γ) (AVG) | BN(β,γ) (Best) | BN(β,γ) (AVG) |
|---|---|---|---|---|---|---|---|---|
| ReLU | 98.71 | | | | | | | |
| Leaky ReLU | | | | | | | | |
| ELU | | | | | | | | |
| Activation | BN* (Best) | BN* (AVG) | BN(β,γ) (Best) | BN(β,γ) (AVG) | BN(β,γ) (Best) | BN(β,γ) (AVG) |
|---|---|---|---|---|---|---|
| tanh | 91.09 | | | | | |
| sigmoid | | | | | | |
| Activation | Best | AVG |
|---|---|---|
| ReLU | | |
| Leaky ReLU | | |
| ELU | 99.48 | |
| tanh | | |
| sigmoid | | |
| Network Depth (in # Blocks) | ReLU + BN (Best) | ReLU + BN (AVG) | PReLU + BN (Best) | PReLU + BN (AVG) | D-PReLU + BN (Best) | D-PReLU + BN (AVG) | APLU + BN (Best) | APLU + BN (AVG) |
|---|---|---|---|---|---|---|---|---|
| 120 | 98.15 ± 0.22 | | | | | | | |
| 150 | | | | | | | | |
| 200 | | | | | | | | |
| 250 | | | | | | | | |
| 300 | | | | | | | | |
| 400 | | | | | | | | |
| 500 | | | | | | | | |
| Initialization | Best | AVG |
|---|---|---|
| BN + | | |
| BN + | 99.19 ± 0.09 | |
| BN + | | |
| BN + | | |
| Configuration | Best | AVG |
|---|---|---|
| SReLU + BN | 99.44 | |
| D-PReLU + BN | | |
| Initialization | Best | AVG |
|---|---|---|
| BN + | | |
| BN + | | |
| BN + | | |
| BN + | | |
| BN + | | |
| BN + | 99.52 | |
| Configuration | Best | AVG |
|---|---|---|
| SR + ReLU + BN | | |
| SR + D-PReLU + BN | | |
| GR + ReLU + BN | 99.32 | |
| GR + D-PReLU + BN | | |
| Set | Fashion-MNIST (Best) | Fashion-MNIST (AVG) | SVHN (Best) | SVHN (AVG) |
|---|---|---|---|---|
| Train | | | | |
| Test | | | | |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).