DOI:10.1145/3394171.3413603
Research article

A Unified Framework for Detecting Audio Adversarial Examples

Published: 12 October 2020

Abstract

Adversarial attacks are widely recognized as a security vulnerability of deep neural networks, especially in deep automatic speech recognition (ASR) systems. Existing detection methods mainly pre-process the input audio to weaken the adversarial noise. Although these methods can detect some simple adversarial attacks, they fail against robust, complex attacks, especially when the attacker knows the details of the detection method. In this paper, we propose a unified adversarial detection framework for detecting adaptive audio adversarial examples, which combines noise padding with sound reverberation. Specifically, we propose an adaptive artificial utterances generator that balances design complexity, so that the artificial utterances (speech with reverberation) are determined efficiently and reduce both the false positive rate and the false negative rate of detection. Moreover, to destroy the continuity of the adversarial noise, we develop a novel multi-noise padding strategy, which uses a voice activity detector to locate the silent fragments of the input speech and implants Gaussian noise in them. Furthermore, our method can effectively tackle robust adaptive attacks in an adaptive learning manner. Importantly, the system can be easily embedded into any ASR model without additional retraining or modification. Experimental results show that our method consistently outperforms state-of-the-art audio defense methods, even against adaptive and robust attacks.
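The noise padding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the energy-threshold voice activity detector, the function names, and all parameter values (frame length, threshold, noise standard deviation) are assumptions chosen only to show the principle of adding Gaussian noise to silent fragments while leaving speech untouched.

```python
import math
import random


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def pad_silence_with_noise(samples, frame_len=160, energy_threshold=1e-4,
                           noise_std=0.01, seed=0):
    """Add Gaussian noise only to frames a simple energy VAD marks as silent.

    Hypothetical helper: the energy-based VAD and every default value here
    are illustrative assumptions, not taken from the paper.
    Returns (padded_samples, silent_flags), one flag per frame.
    """
    rng = random.Random(seed)
    padded = list(samples)
    silent_flags = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        is_silent = frame_energy(frame) < energy_threshold
        silent_flags.append(is_silent)
        if is_silent:
            # Implant Gaussian noise in the silent fragment only.
            for i in range(start, start + len(frame)):
                padded[i] += rng.gauss(0.0, noise_std)
    return padded, silent_flags


# Toy input: silence, a 440 Hz tone standing in for speech, then silence.
speech = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
signal = [0.0] * 320 + speech + [0.0] * 320
padded, flags = pad_silence_with_noise(signal)
```

In this toy input the tone frames are left unchanged and only the leading and trailing silent frames receive noise. A real system would presumably use a stronger voice activity detector than a fixed energy threshold.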

Supplementary Material

MP4 File (3394171.3413603.mp4)
Presentation Video.

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial attacks
  2. multi-noise padding

Qualifiers

  • Research-article

Conference

MM '20
Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 76
  • Downloads (last 6 weeks): 6
Download counts reflect downloads up to 13 Dec 2024.

Cited By

  • (2024) Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer. Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems, 47-55. DOI:10.1145/3665451.3665532. Online publication date: 2-Jul-2024
  • (2024) DP-RAE: A Dual-Phase Merging Reversible Adversarial Example for Image Privacy Protection. Proceedings of the 32nd ACM International Conference on Multimedia, 671-680. DOI:10.1145/3664647.3681291. Online publication date: 28-Oct-2024
  • (2024) Toward Robust ASR System against Audio Adversarial Examples using Agitated Logit. ACM Transactions on Privacy and Security 27:2, 1-26. DOI:10.1145/3661822. Online publication date: 26-Apr-2024
  • (2024) AdvReverb: Rethinking the Stealthiness of Audio Adversarial Examples to Human Perception. IEEE Transactions on Information Forensics and Security 19, 1948-1962. DOI:10.1109/TIFS.2023.3345639. Online publication date: 2024
  • (2024) Incentive Mechanism Design Toward a Win–Win Situation for Generative Art Trainers and Artists. IEEE Transactions on Computational Social Systems 11:6, 7528-7540. DOI:10.1109/TCSS.2024.3415631. Online publication date: Dec-2024
  • (2024) Efficient physical image attacks using adversarial fast autoaugmentation methods. Knowledge-Based Systems 304, 112576. DOI:10.1016/j.knosys.2024.112576. Online publication date: Nov-2024
  • (2024) Adaptive unified defense framework for tackling adversarial audio attacks. Artificial Intelligence Review 57:8. DOI:10.1007/s10462-024-10863-7. Online publication date: 26-Jul-2024
  • (2023) Toward Intrinsic Adversarial Robustness Through Probabilistic Training. IEEE Transactions on Image Processing 32, 3862-3872. DOI:10.1109/TIP.2023.3290532. Online publication date: 2023
  • (2023) Adversarial Example Detection Techniques in Speech Recognition Systems: A review. 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM), 1-7. DOI:10.1109/IC2EM59347.2023.10419688. Online publication date: 28-Nov-2023
  • (2023) Exploring Diverse Feature Extractions for Adversarial Audio Detection. IEEE Access 11, 2351-2360. DOI:10.1109/ACCESS.2023.3234110. Online publication date: 2023
