research-article

Robust detection of comment spam using entropy rate

Authors:

Alex Kantchelian,

Anthony Joseph,

J. D. Tygar

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence

Pages 59 - 70

https://doi.org/10.1145/2381896.2381907

Published: 19 October 2012

Abstract

In this work, we design a method for blog comment spam detection based on the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language- and tokenization-independent metric, which we call content complexity, that provides a normalized answer to the informal question "how much information does this text contain?" We leverage this metric to create a small set of features well suited to comment spam detection by computing the content complexity over groupings of messages sharing the same author, the same sender IP, the same included links, etc.
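The abstract does not spell out how content complexity is computed, so the following is only a minimal sketch of the general idea: it uses the compression ratio of a group of messages as a rough proxy for information content. The choice of Python's standard `lzma` module, the ratio-based normalization, and the names `content_complexity` and `complexity_by_group` are all assumptions made for illustration, not the authors' definition.

```python
import lzma
from collections import defaultdict

# Assumption: a raw LZMA2 stream (no container header) is used as the
# compressor, so that short inputs are not dominated by header bytes.
_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 9}]


def content_complexity(messages):
    """Compressed size divided by raw size for a group of messages.

    Illustrative proxy only: lower values indicate more redundancy,
    i.e. less information per byte of text.
    """
    raw = "\n".join(messages).encode("utf-8")
    if not raw:
        return 0.0
    packed = lzma.compress(raw, format=lzma.FORMAT_RAW, filters=_FILTERS)
    return len(packed) / len(raw)


def complexity_by_group(comments, key):
    """Group comment dicts by `key` (e.g. author or sender IP) and score each group."""
    groups = defaultdict(list)
    for comment in comments:
        groups[comment[key]].append(comment["text"])
    return {value: content_complexity(texts) for value, texts in groups.items()}
```

Under these assumptions, a group of near-identical comments posted from one IP address compresses very well and receives a low score, while a group of varied comments receives a higher one. Because the score is computed on raw bytes, it needs no tokenizer and no language-specific resources.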

We evaluate our method against an exact set of tens of millions of comments collected over a four-month period from a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from an industry-competitive source. Nevertheless, the initial spam labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling-tolerant logistic regression algorithm based on expectation-maximization, which we show generally outperforms the plain version in precision-recall space.
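The abstract names the technique but not its details, so the sketch below shows one generic way to combine expectation-maximization with logistic regression when training labels may be wrong. It assumes class-conditional label noise with two unknown flip rates and treats the true label as a latent variable. The function names, the noise model, the clamping bounds, and the plain gradient-ascent M-step are illustrative assumptions, not the algorithm derived in the paper.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict(w, x):
    # w[0] is the bias; w[1:] pairs with the features in x.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_soft(X, q, w, steps=100, lr=0.5):
    # M-step: gradient ascent on the log-likelihood with soft targets q.
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, qi in zip(X, q):
            err = qi - predict(w, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def em_noisy_logreg(X, y_noisy, iters=10, fp=0.1, fn=0.1):
    """Hypothetical noise model: fp = P(label 1 | truly 0), fn = P(label 0 | truly 1)."""
    w = [0.0] * (len(X[0]) + 1)
    w = fit_soft(X, [float(y) for y in y_noisy], w)  # start from plain LR
    for _ in range(iters):
        # E-step: posterior probability that each example is truly positive.
        q = []
        for x, y in zip(X, y_noisy):
            p = predict(w, x)
            a = p * ((1 - fn) if y == 1 else fn)          # truly 1
            b = (1 - p) * (fp if y == 1 else (1 - fp))    # truly 0
            q.append(a / (a + b) if a + b > 0 else p)
        # M-step: refit the weights, then re-estimate the flip rates.
        w = fit_soft(X, q, w)
        pos = sum(q)
        neg = len(q) - pos
        fn = sum(qi for qi, y in zip(q, y_noisy) if y == 0) / max(pos, 1e-9)
        fp = sum(1 - qi for qi, y in zip(q, y_noisy) if y == 1) / max(neg, 1e-9)
        # Clamp so the observed labels always stay informative.
        fn = min(max(fn, 1e-3), 0.49)
        fp = min(max(fp, 1e-3), 0.49)
    return w, fp, fn
```

In this sketch the plain logistic regression is the special case where both flip rates are fixed at zero, so the soft targets equal the observed labels. Letting the rates be estimated allows the classifier to discount examples whose given label disagrees strongly with what the features suggest.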

By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrarily high precision level, and that it significantly dominates the original labeling in terms of both precision and recall, despite being trained on it alone.

The content complexity metric, the use of a noise-tolerant logistic regression, and the evaluation methodology are thus the three central contributions of this work.

Published In

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence

October 2012

116 pages

ISBN: 9781450316644

DOI: 10.1145/2381896

General Chair: Ting Yu (North Carolina State University, USA)

Program Chairs: V. N. Venkatakrishan (University of Illinois at Chicago, USA), Apu Kapadia (Indiana University, Bloomington, USA)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CCS'12: the ACM Conference on Computer and Communications Security

Sponsor: SIGSAC

October 19, 2012

Raleigh, North Carolina, USA

Acceptance Rates

AISec '12 Paper Acceptance Rate: 10 of 24 submissions, 42%

Overall Acceptance Rate: 94 of 231 submissions, 41%

Bibliometrics & Citations

Bibliometrics

Total Citations: 25

Total Downloads: 331

Downloads (last 12 months): 5

Downloads (last 6 weeks): 1

Reflects downloads up to 24 Dec 2024

Citations

Cited By

Shabadi L, L C, P S, L V, Kashyap U (2023). Youtube Spam Detection Scheme Using Stacked Ensemble Machine Learning Model. 2023 International Conference on Network, Multimedia and Information Technology (NMITCON), pp. 1-7. Online publication date: 1-Sep-2023.
https://doi.org/10.1109/NMITCON58196.2023.10276002
Valpadasu H, Chakri P, Harshitha P, Tarun P (2023). Machine Learning based Spam Comments Detection on YouTube. 2023 7th International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 1234-1239. Online publication date: 17-May-2023.
https://doi.org/10.1109/ICICCS56967.2023.10142608
Samineni L, Peddi A, Kasukurthi A, Rao M, Niharika G, Chereddy S (2023). Evaluation of AI Techniques for Detecting Deceptive Reviews in Cyberspace: A Study of Pre- and Post-COVID-19 Trends. 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), pp. 961-967. Online publication date: 2-Mar-2023.
https://doi.org/10.1109/ICEARS56392.2023.10085689
Oh H (2021). A YouTube Spam Comments Detection Scheme Using Cascaded Ensemble Machine Learning Model. IEEE Access, 9, pp. 144121-144128. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3121508
Shetty A, Abreo B, D’Souza A, Kondana A, Karimbi K (2021). Video Description Based YouTube Comment Classification. Applications of Artificial Intelligence in Engineering, pp. 667-678. Online publication date: 11-May-2021.
https://doi.org/10.1007/978-981-33-4604-8_51
Avin C, Ghobadi M, Griner C, Schmid S (2020). On the Complexity of Traffic Traces and Implications. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4:1, pp. 1-29. Online publication date: 5-Jun-2020.
https://dl.acm.org/doi/10.1145/3379486
Shaukat K, Luo S, Varadharajan V, Hameed I, Xu M (2020). A Survey on Machine Learning Techniques for Cyber Security in the Last Decade. IEEE Access, 8, pp. 222310-222354. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3041951
Kavitha K, Shetty A, Abreo B, D’Souza A, Kondana A (2020). Analysis and Classification of User Comments on YouTube Videos. Procedia Computer Science, 177, pp. 593-598. Online publication date: 2020.
https://doi.org/10.1016/j.procs.2020.10.084
Joseph A, Nelson B, Rubinstein B, Tygar J (2019). Adversarial Machine Learning. Online publication date: 14-Mar-2019.
https://doi.org/10.1017/9781107338548
Aiyar S, Shetty N (2018). N-Gram Assisted Youtube Spam Comment Detection. Procedia Computer Science, 132, pp. 174-182. Online publication date: 2018.
https://doi.org/10.1016/j.procs.2018.05.181
