A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters

Published: 01 June 2008

Abstract

Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of "good" words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain.
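The bag-level decision rule described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the word weights in `WEIGHTS`, the bias, the fixed-length splitting in `to_bag`, and the 0.5 threshold are hypothetical, and the instance scorer is a hand-set logistic model rather than the multiple instance logistic regression the authors train. The sketch only shows why a spam segment can survive a good word attack once an email is scored segment by segment.

```python
import math

# Hypothetical hand-set word weights (positive = spammy, negative = "good").
# A real filter would learn these from labeled data.
WEIGHTS = {
    "viagra": 3.0, "cheap": 2.0, "pills": 2.5,
    "meeting": -2.0, "agenda": -2.0, "thanks": -1.5, "report": -2.0,
}
BIAS = -1.0  # assumed prior leaning toward "legitimate"


def spam_probability(tokens):
    """Logistic score of one instance (a list of tokens)."""
    z = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def to_bag(tokens, segment_len=3):
    """Split an email into a bag of fixed-length token segments.

    Fixed-length splitting is an assumed scheme chosen for simplicity.
    """
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def classify_single(tokens, threshold=0.5):
    """Single instance baseline: score the whole email at once."""
    return spam_probability(tokens) >= threshold


def classify_bag(tokens, threshold=0.5, segment_len=3):
    """Multiple instance rule: spam if at least one segment is spam,
    legitimate only if every segment is legitimate."""
    bag = to_bag(tokens, segment_len)
    return any(spam_probability(seg) >= threshold for seg in bag)


spam = "cheap viagra pills".split()
# Good word attack: append words common in legitimate email.
attacked = spam + "meeting agenda thanks report".split()
ham = "thanks meeting agenda report".split()

for name, email in [("spam", spam), ("attacked", attacked), ("ham", ham)]:
    print(name, "single:", classify_single(email), "bag:", classify_bag(email))
```

In this toy setup the appended good words pull the whole-message score below the threshold, while the segment that holds the spam content still scores as spam on its own. The segment boundary happens to align with the spam content here; how well the rule holds up in practice depends on the splitting method and on a classifier trained on real data.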

Published In

The Journal of Machine Learning Research, Volume 9
6/1/2008
1964 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org
