research-article

A Unified Framework for Detecting Audio Adversarial Examples

Authors:

Zheng Zhang

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 3986 - 3994

https://doi.org/10.1145/3394171.3413603

Published: 12 October 2020

Abstract

Adversarial attacks are widely recognized as a security vulnerability of deep neural networks, especially in deep automatic speech recognition (ASR) systems. Existing advanced detection methods against adversarial attacks mainly pre-process the input audio to alleviate the threat of adversarial noise. Although these methods can detect some simple adversarial attacks, they fail to handle robust, complex attacks, especially when the attacker knows the detection details. In this paper, we propose a unified adversarial detection framework for detecting adaptive audio adversarial examples, which combines noise padding with sound reverberation. Specifically, we propose an adaptive artificial utterances generator that balances design complexity, so that the artificial utterances (speech with reverberation) are determined efficiently and reduce both the false positive rate and the false negative rate of detection. Moreover, to destroy the continuity of the adversarial noise, we develop a novel multi-noise padding strategy, which uses a voice activity detector to locate the silent fragments of the input speech and implants Gaussian noise in them. Furthermore, our proposed method can effectively tackle robust adaptive attacks in an adaptive learning manner. Importantly, the proposed system is easily embedded into any ASR model without additional retraining or modification. Experimental results show that our method consistently outperforms state-of-the-art audio defense methods, even against adaptive and robust attacks.
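
The two pre-processing ideas named in the abstract, noise padding in silent fragments and sound reverberation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the voice activity detector, the noise parameters, or how the reverberation is generated. The energy-threshold detector, the synthetic decaying impulse response, and every name and default value below (`energy_vad`, `pad_noise_in_silence`, `add_reverb`, `frame_len`, `sigma`, `rt60`) are assumptions made for illustration only.

```python
import numpy as np


def energy_vad(signal, frame_len=400, threshold_ratio=0.1):
    """Toy voice activity detector (an assumption, not the paper's VAD).

    Splits the signal into non-overlapping frames and marks a frame as
    speech when its mean energy exceeds a fraction of the loudest frame.
    """
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[: n_frames * frame_len], dtype=np.float64)
    energy = np.mean(frames.reshape(n_frames, frame_len) ** 2, axis=1)
    return energy > threshold_ratio * energy.max()


def pad_noise_in_silence(signal, speech_mask, frame_len=400, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise only to frames marked as non-speech."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(signal, dtype=np.float64).copy()
    for i, is_speech in enumerate(speech_mask):
        if not is_speech:
            start = i * frame_len
            out[start:start + frame_len] += rng.normal(0.0, sigma, frame_len)
    return out


def add_reverb(signal, sample_rate=16000, rt60=0.3, rng=None):
    """Convolve with a synthetic impulse response (hypothetical stand-in).

    The response is white noise under an exponential envelope that falls
    by about 60 dB after rt60 seconds; a measured or simulated room
    impulse response could be substituted.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    ir = rng.normal(size=n) * np.exp(-np.log(1000.0) * t / rt60)
    ir[0] = 1.0                       # direct path
    ir /= np.sqrt(np.sum(ir ** 2))    # unit-energy normalization
    return np.convolve(signal, ir)[: len(signal)]


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr // 2) / sr
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)          # stand-in for speech
    silence = np.zeros(sr // 2)
    audio = np.concatenate([silence, tone, silence])

    mask = energy_vad(audio)
    padded = pad_noise_in_silence(audio, mask, rng=np.random.default_rng(0))
    reverberant = add_reverb(audio, sample_rate=sr, rng=np.random.default_rng(0))
    print("speech frames:", int(mask.sum()), "of", len(mask))
```

In a detection setting, the modified audio would then be passed to the ASR model alongside the original input. How the resulting outputs are compared is not described in the abstract, so that step is left out of the sketch.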

Supplementary Material

MP4 File (3394171.3413603.mp4)

Presentation Video.

Index Terms

A Unified Framework for Detecting Audio Adversarial Examples
1. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Privacy protections
  2. Software and application security

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

October 2020

4889 pages

ISBN:9781450379885

DOI:10.1145/3394171

General Chairs:
Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China)
Rita Cucchiara (UNIMORE, Italy)
Xian-Sheng Hua (Alibaba Group, China)

Program Chairs:
Guo-Jun Qi (Futurewei Technologies, USA)
Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy)
Zhengyou Zhang (Tencent, China)
Roger Zimmermann (National University of Singapore, Singapore)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MM '20

Sponsor:

SIGMM

MM '20: The 28th ACM International Conference on Multimedia

October 12 - 16, 2020

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

Jin W, Cao Y, Su J, Shen Q, Ye K, Wang D, Hao J, Liu Z (2024). Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer. Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems, 47-55. Online publication date: 2-Jul-2024.
https://dl.acm.org/doi/10.1145/3665451.3665532
Zhu J, Du X, Zhou J, Pun C, Xu Q, Liu X, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D (2024). DP-RAE: A Dual-Phase Merging Reversible Adversarial Example for Image Privacy Protection. Proceedings of the 32nd ACM International Conference on Multimedia, 671-680. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681291
Park N, Kim J (2024). Toward Robust ASR System against Audio Adversarial Examples using Agitated Logit. ACM Transactions on Privacy and Security 27:2, 1-26. Online publication date: 26-Apr-2024.
https://dl.acm.org/doi/10.1145/3661822
Chen M, Lu L, Yu J, Ba Z, Lin F, Ren K (2024). AdvReverb: Rethinking the Stealthiness of Audio Adversarial Examples to Human Perception. IEEE Transactions on Information Forensics and Security 19, 1948-1962. Online publication date: 2024.
https://doi.org/10.1109/TIFS.2023.3345639
Duan H, Saddik A, Cai W (2024). Incentive Mechanism Design Toward a Win–Win Situation for Generative Art Trainers and Artists. IEEE Transactions on Computational Social Systems 11:6, 7528-7540. Online publication date: Dec-2024.
https://doi.org/10.1109/TCSS.2024.3415631
Du X, Pun C, Zhou J (2024). Efficient physical image attacks using adversarial fast autoaugmentation methods. Knowledge-Based Systems 304, 112576. Online publication date: Nov-2024.
https://doi.org/10.1016/j.knosys.2024.112576
Du X, Zhang Q, Zhu J, Liu X (2024). Adaptive unified defense framework for tackling adversarial audio attacks. Artificial Intelligence Review 57:8. Online publication date: 26-Jul-2024.
https://doi.org/10.1007/s10462-024-10863-7
Dong J, Yang L, Wang Y, Xie X, Lai J (2023). Toward Intrinsic Adversarial Robustness Through Probabilistic Training. IEEE Transactions on Image Processing 32, 3862-3872. Online publication date: 2023.
https://doi.org/10.1109/TIP.2023.3290532
Noureddine K, Kheddar H, Maazouz M (2023). Adversarial Example Detection Techniques in Speech Recognition Systems: A review. 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM), 1-7. Online publication date: 28-Nov-2023.
https://doi.org/10.1109/IC2EM59347.2023.10419688
Choi Y, Park J, Lee J, Kim H (2023). Exploring Diverse Feature Extractions for Adversarial Audio Detection. IEEE Access 11, 2351-2360. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3234110
