A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters

Authors:

Zach Jorgensen,

Meador Inge

The Journal of Machine Learning Research, Volume 9

Pages 1115 - 1146

Published: 01 June 2008

Abstract

Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of "good" words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain.
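The bag-level decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word weights, the bias, the fixed-size segmentation, and all function names are hypothetical and chosen only for illustration, and a hand-set logistic scorer stands in for a trained multiple instance logistic regression model.

```python
import math

# Hypothetical word weights (not learned, not taken from the paper).
# Positive values point toward spam, negative values toward legitimate email.
WEIGHTS = {
    "viagra": 3.0, "cheap": 2.0, "pills": 2.0,
    "meeting": -2.0, "agenda": -2.0, "thanks": -1.5, "report": -1.5,
}
BIAS = -1.0


def spam_probability(tokens):
    """Logistic score for one instance, given as a list of tokens."""
    z = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def split_into_bag(tokens, segment_size):
    """Turn an email into a bag of non-overlapping, fixed-size segments.

    Fixed-size splitting is an assumption made for this sketch.
    """
    return [tokens[i:i + segment_size]
            for i in range(0, len(tokens), segment_size)]


def is_spam_single_instance(tokens, threshold=0.5):
    """Baseline: score the whole email as a single instance."""
    return spam_probability(tokens) > threshold


def is_spam_multi_instance(tokens, segment_size=3, threshold=0.5):
    """Bag rule: spam if at least one segment is spam, legitimate otherwise."""
    bag = split_into_bag(tokens, segment_size)
    return any(spam_probability(segment) > threshold for segment in bag)


spam = ["cheap", "viagra", "pills"]
good_words = ["meeting", "agenda", "thanks", "report"]
attacked = spam + good_words  # good word attack: append "good" words

print(is_spam_single_instance(spam))      # whole-message score on the plain spam
print(is_spam_single_instance(attacked))  # whole-message score after the attack
print(is_spam_multi_instance(attacked))   # bag rule after the attack
```

With these illustrative weights, the appended good words pull the whole-message score below the threshold, while the segment that still contains the spam terms keeps the bag labeled as spam. How well this carries over to real mail depends on how the segments are formed and on the trained model, which the sketch does not address.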

Index Terms

A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. World Wide Web
    1. Web applications
      1. Internet communications tools
        Email

Published In

The Journal of Machine Learning Research, Volume 9

1964 pages

ISSN: 1532-4435

EISSN: 1533-7928

Publisher

JMLR.org
