SATRN: Spiking Audio Tagging Robust Network
Overview of our proposed spike-based audio processing pipeline. The framework illustrates the complete processing flow from the raw audio input through time flow coding to the final classification, highlighting the integration of temporal feature extraction and spike-based neural processing.
Overview of the proposed SATRN architecture. The model consists of three key components: (1) the Spiking Potential Layer (SPL), which encodes input data into spike representations; (2) hierarchical feature fusion (HFF), which integrates global and local features using Potential Feature Fusion (PFF) modules; and (3) Spatio-Temporal Self-Attention (STSA), which captures spatial and temporal dependencies for enhanced feature representation. These components work together to enable efficient and effective spike-based neural network processing.
Robustness evaluation on FSD50K across different SNR levels. Results demonstrate a consistent performance advantage of time flow coding over static encoding, especially in challenging noise conditions.
Performance comparison under varying noise levels on UrbanSound8K. Our time flow coding approach maintained higher accuracy across all SNR values, with particularly significant advantages in high-noise conditions (SNR < 0 dB).
Abstract
1. Introduction
1.1. Key Challenges in Audio Processing
1.2. Advantages of SNN-Based Approach
1.3. Our Contributions
2. Related Work
2.1. Advances in Spiking Neural Networks
Training Methodologies
2.2. Audio Event Detection and Classification
2.3. Neural Architectures for Audio Processing
Integration of Attention Mechanisms
2.4. Research Opportunities and Our Approach
3. Materials and Methods
3.1. Spiking Neuron Model
3.2. Architectural Comparison with Recurrent Networks
3.3. Time Flow Coding
Novel Encoding Strategies
Algorithm 1: Time flow coding algorithm.
Input: audio sequence X of a given length. Parameters: window size, buffer Y. Output: feature sequence.
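As a concrete illustration, the following minimal Python sketch shows how an audio clip could be split into one unit per SNN time step in the spirit of time flow coding; the parameter names (`num_steps`, `unit_s`) and the tail-repetition padding are assumptions added for illustration, not the authors' exact implementation.

```python
import numpy as np

def time_flow_coding(audio, sr=16000, num_steps=8, unit_s=0.5):
    """Split a waveform into num_steps consecutive units of unit_s seconds,
    one unit per SNN time step, instead of repeating one static input."""
    unit_len = int(sr * unit_s)
    needed = num_steps * unit_len
    if audio.size < needed:
        # pad by repeating the final sample (a stand-in for the paper's
        # "repeat the last valid frame" rule, which operates on features)
        pad = np.repeat(audio[-1:], needed - audio.size)
        audio = np.concatenate([audio, pad])
    else:
        audio = audio[:needed]
    return audio.reshape(num_steps, unit_len)   # shape: (num_steps, unit_len)
```

Each row of the returned array would then be converted to a Mel spectrogram and fed to the network at the corresponding time step.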
3.4. SATRN Architecture
Potential Feature Fusion
- denotes the concatenated feature representation;
- represents the attention network;
- ⊙ denotes element-wise multiplication (a minimal sketch of this fusion step follows the list).
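To make the fusion step concrete, here is a minimal PyTorch sketch of a PFF-style block, assuming both inputs share the same channel count and that the attention network is a 1×1 convolution followed by a sigmoid; the real module's layer sizes and attention design may differ.

```python
import torch
import torch.nn as nn

class PotentialFeatureFusion(nn.Module):
    """Illustrative PFF-style block: concatenate global and local features,
    pass them through a small attention network, and re-weight the fused
    features by element-wise multiplication."""

    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),  # attention weights in [0, 1]
        )
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, global_feat, local_feat):
        fused = torch.cat([global_feat, local_feat], dim=1)  # concatenated representation
        weights = self.attn(fused)                            # attention network
        return self.proj(fused) * weights                     # element-wise (⊙) re-weighting
```

The sigmoid-gated weights re-scale the fused features so that informative channel–frequency positions are emphasized, while the 1×1 projection keeps the output dimensionality unchanged.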
3.5. Spike Time–Space Attention
3.5.1. Architectural Design
3.5.2. Attention Mechanism Analysis
- B represents the batch size;
- T denotes the temporal length;
- C indicates the channel dimension;
- represents the temporal map;
- F denotes the frequency.
- Complementary Processing: The temporal path captures sequence patterns, while the spatial path focuses on channel–frequency relationships.
- Computational Efficiency: Decomposed attention reduces the complexity relative to full joint attention over the combined time–frequency dimensions, since the temporal and spatial maps are computed separately.
- Binary Attention: The spike activation function maintains SNN characteristics while providing natural regularization.
- Feature Enhancement: The residual connection preserves the original information while enriching the feature representation (a simplified sketch of this decomposed design follows this list).
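A simplified PyTorch sketch of the decomposed design is given below, assuming an input spike tensor of shape (B, T, C, F); the linear temporal map, the 1×1 spatial convolution, and the hard threshold are illustrative stand-ins for the paper's exact STSA operators (in training, the threshold would require a surrogate gradient).

```python
import torch
import torch.nn as nn

class SpikeTimeSpaceAttention(nn.Module):
    """Sketch of decomposed spatio-temporal attention over a spike tensor of
    shape (B, T, C, F): one path attends over the temporal axis, the other
    over channel-frequency, and a residual connection preserves the input."""

    def __init__(self, t_len, channels, threshold=0.5):
        super().__init__()
        self.temporal_fc = nn.Linear(t_len, t_len)
        self.spatial_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.threshold = threshold

    def forward(self, x):                                       # x: (B, T, C, F)
        b, t, c, f = x.shape
        # temporal path: attend over T for every (channel, frequency) position
        temporal = self.temporal_fc(x.permute(0, 2, 3, 1))      # (B, C, F, T)
        temporal = (torch.sigmoid(temporal) > self.threshold).float()
        temporal = temporal.permute(0, 3, 1, 2)                 # back to (B, T, C, F)
        # spatial path: attend over channel-frequency for every time step
        spatial = self.spatial_conv(x.reshape(b * t, c, f))     # (B*T, C, F)
        spatial = (torch.sigmoid(spatial) > self.threshold).float().reshape(b, t, c, f)
        # residual connection enriches rather than replaces the input
        return x + x * temporal * spatial
```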
3.5.3. Spike-Based Processing
3.6. Computational Efficiency Analysis
- Traditional approach: Processing the entire 8 s segment (25 ms hop length) generates one large Mel spectrogram (roughly 320 frames for 8 s of audio).
- TFC approach: Dividing the segment into 4 segments of 2 s each yields smaller individual spectrograms (roughly 80 frames each), with reduced peak memory usage during processing and more efficient cache utilization.
- Traditional Processing:
  - Memory requirement: proportional to the full spectrogram.
  - Computation: operations over the full feature map.
- TFC Processing:
  - Memory requirement: proportional to a single segment (roughly 1/n of the full spectrogram), where n is the number of segments.
  - Computation: operations over one segment at a time.
  - An additional benefit comes from better cache locality (a back-of-the-envelope snippet follows this list).
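The following back-of-the-envelope Python snippet, using the values assumed above (25 ms hop, 128 Mel bins, an 8 s clip split into 4 segments), illustrates the reduction in the working set that has to be held at once:

```python
# Assumed values from the comparison above; the constants are illustrative.
hop_s, mel_bins, total_s, n_segments = 0.025, 128, 8.0, 4

full_frames = round(total_s / hop_s)                  # ~320 frames for the whole clip
segment_frames = round(total_s / n_segments / hop_s)  # ~80 frames per 2 s segment

full_cells = full_frames * mel_bins                   # cells held at once, traditional
peak_cells = segment_frames * mel_bins                # cells held at once, TFC

print(f"traditional peak: {full_cells} cells, TFC peak: {peak_cells} cells "
      f"({full_cells / peak_cells:.0f}x smaller working set per segment)")
```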
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
- FSD50K [11]: A large-scale multi-label dataset featuring the following:
  - 200 distinct audio categories.
  - 108 h of total content.
  - Training set: 80.4 h (average duration of 7.1 s).
  - Testing set: 27.9 h (average duration of 9.8 s).
- UrbanSound8K [10]: A single-label dataset comprising the following:
  - 8.8 h of urban environmental sounds.
  - 10 distinct categories (air conditioner, car horn, children playing, etc.).
4.1.2. Implementation Details
- Audio resampling to 16 kHz for standardization.
- Segmentation into fixed-length units whose durations (in seconds) match the unit values reported in the result tables.
- Short-time Fourier transform with a 10 ms window size.
- 128-dimensional Mel filtering producing the input feature maps (a minimal preprocessing sketch follows this list).
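A minimal sketch of this front end using torchaudio is shown below; the FFT size, hop length, and input file name are assumptions added for illustration, and only the 16 kHz sampling rate, 10 ms window, and 128 Mel bins come from the settings above.

```python
import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("example.wav")            # hypothetical input clip
waveform = T.Resample(orig_freq=sr, new_freq=16000)(waveform)

mel = T.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,                       # assumed FFT size
    win_length=int(0.010 * 16000),    # 10 ms analysis window = 160 samples
    hop_length=160,                   # assumed hop length
    n_mels=128,                       # 128-dimensional Mel filter bank
)(waveform)                           # result: (channels, 128, frames)
```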
4.2. Performance Analysis
4.2.1. Evaluation on UrbanSound8K
4.2.2. Evaluation on FSD50K
- Adaptive Segmentation: The audio was divided into a predefined number of time-step segments, with each segment length determined by dividing the clip duration by the number of time steps. This ensured consistency in the segment size across different audio clips.
- Padding and Truncation: When an audio clip was shorter than the required segment length, we applied padding by repeating the last valid frame to maintain the required size. Conversely, if an audio clip was longer, we truncated it to the maximum length allowed in the dataset to ensure uniform processing.
- Buffer-Based Supplementation: For segments that did not align perfectly due to non-integer multiples of the time step, we introduced a buffer mechanism that extended the last segment with relevant feature information from the preceding segments (a code sketch of these three rules follows this list).
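The sketch below (NumPy, with an illustrative `segment_features` helper operating on a frame-level feature matrix) shows one way the three rules above could be combined; the exact padding and buffer policy in the paper may differ, and truncation to the dataset-wide maximum length is assumed to happen before this step.

```python
import numpy as np

def segment_features(frames, num_steps):
    """Split a (n_frames, n_mels) feature matrix into num_steps segments,
    repeating the last valid frame when a clip is too short and borrowing
    trailing frames from the previous segment when the split is not exact."""
    seg_len = int(np.ceil(frames.shape[0] / num_steps))   # adaptive segment length
    segments = []
    for i in range(num_steps):
        seg = frames[i * seg_len:(i + 1) * seg_len]
        if seg.shape[0] < seg_len:
            if seg.shape[0] == 0:
                # clip shorter than required: repeat the last valid frame
                seg = np.repeat(frames[-1:], seg_len, axis=0)
            else:
                # buffer-based supplementation: extend with preceding frames
                deficit = seg_len - seg.shape[0]
                seg = np.concatenate([frames[i * seg_len - deficit:i * seg_len], seg])
        segments.append(seg)
    return np.stack(segments)   # (num_steps, seg_len, n_mels)

# e.g. segment_features(np.random.randn(750, 128), num_steps=8) -> shape (8, 94, 128)
```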
4.3. Robustness Analysis
4.3.1. Experimental Setup
4.3.2. Key Findings
4.3.3. Performance Analysis
5. Conclusions
5.1. Methodological Innovations
5.2. Practical Implications
5.3. Future Research Directions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Liu, K.; Cui, X.; Ji, X.; Kuang, Y.; Zou, C.; Zhong, Y.; Xiao, K.; Wang, Y. Real-Time Target Tracking System with Spiking Neural Networks Implemented on Neuromorphic Chips. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 1590–1594. [Google Scholar] [CrossRef]
- Hu, S.G.; Qiao, G.C.; Liu, X.K.; Liu, Y.H.; Zhang, C.M.; Zuo, Y.; Zhou, P.; Liu, Y.A.; Ning, N.; Yu, Q.; et al. A Co-Designed Neuromorphic Chip with Compact (17.9K F2) and Weak Neuron Number-Dependent Neuron/Synapse Modules. IEEE Trans. Biomed. Circuits Syst. 2022, 16, 1250–1260. [Google Scholar] [CrossRef] [PubMed]
- Barchid, S.; Allaert, B.; Aissaoui, A.; Mennesson, J.; Djéraba, C. Spiking-Fer: Spiking Neural Network for Facial Expression Recognition with Event Cameras. arXiv 2023, arXiv:2304.10211. [Google Scholar]
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
- Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going Deeper in Spiking Neural Networks: VGG and Residual Architectures. Front. Neurosci. 2019, 13, 95. [Google Scholar] [CrossRef]
- Zhou, Z.; Zhu, Y.; He, C.; Wang, Y.; Yan, S.; Tian, Y.; Yuan, L. Spikformer: When Spiking Neural Network Meets Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Dellaferrera, G.; Martinelli, F.; Cernak, M. A Bin Encoding Training of a Spiking Neural Network Based Voice Activity Detection. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
- Ustubioglu, B.; Tahaoglu, G.; Ulutas, G. Detection of audio copy-move-forgery with novel feature matching on Mel spectrogram. Expert Syst. Appl. 2023, 213, 118963. [Google Scholar] [CrossRef]
- Salamon, J.; Jacoby, C. A Dataset and Taxonomy for Urban Sound Research. 2014. Available online: https://zenodo.org/record/1203745/ (accessed on 13 February 2025).
- Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 829–852. [Google Scholar] [CrossRef]
- Qawaqneh, Z.; Mallouh, A.A.; Barkana, B.D. Deep neural network framework and transformed MFCCs for speaker’s age and gender classification. Knowl.-Based Syst. 2017, 115, 5–14. [Google Scholar] [CrossRef]
- Izhikevich, E.M. Simple model of spiking neurons. IEEE Trans. Neural Netw. 2003, 14, 1569–1572. [Google Scholar] [CrossRef]
- Na, B.; Mok, J.; Park, S.; Lee, D.; Choe, H.; Yoon, S. AutoSNN: Towards Energy-Efficient Spiking Neural Networks. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S., Eds.; 2022; Volume 162, pp. 16253–16269. [Google Scholar]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
- Kim, Y.; Panda, P. Optimizing deeper spiking neural networks for dynamic vision sensing. Neural Netw. 2021, 144, 686–698. [Google Scholar] [CrossRef] [PubMed]
- Ding, J.; Yu, Z.; Tian, Y.; Huang, T. Optimal ANN-SNN Conversion for Fast and Accurate Inference in Deep Spiking Neural Networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 19–27 August 2021; pp. 2328–2336. [Google Scholar]
- Han, B.; Srinivasan, G.; Roy, K. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13558–13567. [Google Scholar]
- Deng, L.; Wu, Y.; Hu, X.; Liang, L.; Ding, Y.; Li, G.; Zhao, G.; Li, P.; Xie, Y. Rethinking the performance comparison between SNNS and ANNS. Neural Netw. 2020, 121, 294–307. [Google Scholar] [CrossRef] [PubMed]
- Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Shi, L. Spatio-Temporal Backpropagation for Training High-Performance Spiking Neural Networks. Front. Neurosci. 2018, 12, 331. [Google Scholar] [CrossRef] [PubMed]
- Lee, J.H.; Delbruck, T.; Pfeiffer, M. Training Deep Spiking Neural Networks Using Backpropagation. Front. Neurosci. 2016, 10, 508. [Google Scholar] [CrossRef] [PubMed]
- Shrestha, S.B.; Orchard, G. Slayer: Spike layer error reassignment in time. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
- Zenke, F.; Ganguli, S. Superspike: Supervised learning in multilayer spiking neural networks. Neural Comput. 2018, 30, 1514–1541. [Google Scholar] [CrossRef]
- Fang, W.; Yu, Z.; Chen, Y.; Masquelier, T.; Huang, T.; Tian, Y. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2661–2671. [Google Scholar]
- Xu, Y.; Huang, Q.; Wang, W.; Foster, P.; Sigtia, S.; Jackson, P.J.B.; Plumbley, M.D. Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1230–1241. [Google Scholar] [CrossRef]
- Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of the ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 208–211. [Google Scholar] [CrossRef]
- Cohen, I.; Berdugo, B. Speech enhancement for non-stationary noise environments. Signal Process. 2001, 81, 2403–2418. [Google Scholar] [CrossRef]
- Fang, H.; Carbajal, G.; Wermter, S.; Gerkmann, T. Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar] [CrossRef]
- Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR—Half-baked or well done? In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 626–630. [Google Scholar] [CrossRef]
- Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
- Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar] [CrossRef]
- Desimone, R.; Duncan, J. Neural mechanisms of selective visual attention. Annu. Rev. Neurosci. 1995, 18, 193–222. [Google Scholar] [CrossRef]
- Yu, C.; Gu, Z.; Li, D.; Wang, G.; Wang, A.; Li, E. STSC-SNN: Spatio-Temporal Synaptic Connection with temporal convolution and attention for spiking neural networks. Front. Neurosci. 2022, 16, 1079357. [Google Scholar] [CrossRef]
- Yao, M.; Zhao, G.; Zhang, H.; Hu, Y.; Deng, L.; Tian, Y.; Xu, B.; Li, G. Attention Spiking Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9393–9410. [Google Scholar] [CrossRef] [PubMed]
- Jiang, C.; Zhang, Y. KLIF: An optimized spiking neuron unit for tuning surrogate gradient slope and membrane potential. arXiv 2023, arXiv:2302.09238. [Google Scholar]
- He, W.; Wu, Y.; Deng, L.; Li, G.; Wang, H.; Tian, Y.; Ding, W.; Wang, W.; Xie, Y. Comparing SNNs and RNNs on neuromorphic vision datasets: Similarities and differences. Neural Netw. 2020, 132, 108–120. [Google Scholar] [CrossRef] [PubMed]
- Liu, X.; Lu, H.; Yuan, J.; Li, X. CAT: Causal Audio Transformer for Audio Classification. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Gong, Y.; Chung, Y.A.; Glass, J. PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3292–3306. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
- Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Sci. Adv. 2023, 9, eadi1480. [Google Scholar] [CrossRef]
- Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Medhat, F.; Chesmore, D.; Robinson, J. Masked Conditional Neural Networks for Environmental Sound Classification. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 21–33. [Google Scholar] [CrossRef]
- Wu, H.H.; Seetharaman, P.; Kumar, K.; Bello, J.P. Wav2CLIP: Learning Robust Audio Representations From CLIP. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 7–13 May 2022. [Google Scholar]
Block | Structure | Output Size |
---|---|---|
Input Conv1 | conv, 32 | |
Stage 1 | conv, 128, Spiking neuron; conv, 128, Spiking neuron; conv, 64 | |
Stage 2 | conv, 256, Spiking neuron; conv, 256, Spiking neuron; conv, 128 | |
Stage 3 | conv, 512, Spiking neuron; conv, 512, Spiking neuron; conv, 256 | |
Stage 4 | conv, 1024, Spiking neuron; conv, 1024, Spiking neuron; conv, 512 | |
Stage 5 | conv, 2048, Spiking neuron; conv, 2048, Spiking neuron; conv, 1024 | |
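For orientation, a highly simplified PyTorch sketch of one stage in this table follows; the surrogate-gradient spike function, kernel sizes, and padding are assumptions (they are not recoverable from the table), so this illustrates the conv + spiking-neuron pattern rather than the authors' exact blocks.

```python
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient, standing in
    for the LIF neurons used in the paper."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out * (v.abs() < 0.5).float()

class Spiking(nn.Module):
    def forward(self, v):
        return SpikeFn.apply(v)

def spiking_stage(in_ch, mid_ch, out_ch):
    """One stage following the table above: two conv + spiking-neuron pairs
    followed by a channel-reducing conv (3x3/1x1 kernels are assumptions)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1), Spiking(),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), Spiking(),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

# e.g. Stage 1 from the table: 32 input channels, 128 hidden, 64 output channels
stage1 = spiking_stage(32, 128, 64)
```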
Related Works | Model | Step | Unit (s) | UrbanSound8K (Accuracy) |
---|---|---|---|---|
[7] | Spikeformer-TFC | 8 | 0.5 | 93.8% |
[7] | Spikeformer-TSC | 8 | 4 | 92.9% |
This work | SATRN-TFC-STSA | 8 | 0.5 | 95.5% |
This work | SATRN-TSC-STSA | 8 | 4 | 95.0% |
This work | SATRN-TFC-STSA | 4 | 0.5 | 93.0% |
This work | SATRN-TSC-STSA | 4 | 2 | 92.7% |
Model | Acc (%) |
---|---|
ESResNeXt [41] | 89.14 |
CAT [37] | 95.90 |
MCLNN [42] | 73.30 |
SATRN-TSC-STSA | 95.0 |
SATRN-TFC-STSA | 95.5 |
Related Works | Model | Step | Unit (s) | FSD50K (mAP) |
---|---|---|---|---|
[7] | Spikeformer-TFC | 4 | 2 | 43.3% |
[7] | Spikeformer-TSC | 4 | 8 | 43.0% |
This work | SATRN-TFC-STSA | 4 | 2 | 45.5% |
This work | SATRN-TSC-STSA | 4 | 8 | 44.5% |
This work | SATRN-TFC-STSA | 8 | 1 | 45.0% |
This work | SATRN-TSC-STSA | 8 | 8 | 44.5% |
Related Works | Model | FSD50K (mAP) |
---|---|---|
[11] | CRNN | 0.417 |
[11] | VGG-like | 0.434 |
[11] | ResNet-18 | 0.373 |
[11] | DenseNet-121 | 0.425 |
[43] | Wav2CLIP | 0.431 |
This work | SATRN-TFC-STSA | 0.455 |
Related Works | Model | UrbanSound8K (Accuracy) | FSD50K (mAP) |
---|---|---|---|
[7] | Spikeformer | 93.8% | 43.3% |
This work | SATRN-TFC-SSA | 95.1% | 44.7% |
This work | SATRN-TFC-STSA | 95.5% | 45.5% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Gao, S.; Deng, X.; Fan, X.; Yu, P.; Zhou, H.; Zhu, Z. SATRN: Spiking Audio Tagging Robust Network. Electronics 2025, 14, 761. https://doi.org/10.3390/electronics14040761