DOI: 10.1145/1390749.1390760
research-article

A comparative study of statistical features of language in blogs-vs-splogs

Published: 24 July 2008

Abstract

Language usage in blogs deviates from the language used in traditional corpora, largely because of noise from sources such as spelling errors, grammatical irregularities, and the overuse of abbreviations and symbolic characters like emoticons. Spam blogs, or splogs, are the subset of blogs that are usually written to target a specific audience for marketing promotions and are mostly generated by software that readily imitates the Zipfian distribution of words. It is therefore difficult to separate splogs from non-splogs using only the frequency distribution of unigrams. In this detailed comparative study we present and highlight several additional statistical features of language that are hard to imitate and serve as good discriminators between splogs and blogs.
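
To make the point about unigram statistics concrete, here is a minimal sketch of how a Zipfian rank-frequency profile might be measured. It is not the authors' method: the tokenizer, the function names `rank_frequency` and `zipf_slope`, and the synthetic input are all assumptions chosen for illustration. A slope near -1 on a log-log plot is the usual signature of a Zipfian distribution, and generated text can be built to reproduce it, which is why this feature alone may not separate splogs from blogs.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return unigram counts sorted from most to least frequent."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic (hypothetical) text whose word counts are exactly 60 / rank,
# standing in for machine-generated content that mimics Zipf's law.
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
synthetic = " ".join(
    word for rank, word in enumerate(words, start=1) for _ in range(60 // rank)
)

# The fitted slope is close to -1.0 for this input.
print(zipf_slope(rank_frequency(synthetic)))
```

Because a generator can hit this target by construction, a classifier would presumably need features beyond the rank-frequency slope, which is the motivation the abstract gives for its additional statistical features.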


Published In

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
July 2008
130 pages
ISBN: 9781605581965
DOI: 10.1145/1390749

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Naïve Bayes
  2. blogs
  3. splogs
  4. svm

Qualifiers

  • Research-article

Conference

AND '08

Acceptance Rates

Overall Acceptance Rate 15 of 22 submissions, 68%

Cited By

  • (2017) Mining of Social Media data of University students. Education and Information Technologies 22:4 (1515-1526). DOI: 10.1007/s10639-016-9501-1. Online publication date: 1-Jul-2017
  • (2010) Learning age and gender using co-occurrence of non-dictionary words from stylistic variations. Proceedings of the 7th international conference on Rough sets and current trends in computing (544-550). DOI: 10.5555/1876210.1876278. Online publication date: 28-Jun-2010
  • (2010) Learning Age and Gender Using Co-occurrence of Non-dictionary Words from Stylistic Variations. Rough Sets and Current Trends in Computing (544-550). DOI: 10.1007/978-3-642-13529-3_58. Online publication date: 2010
  • (2009) Studying the effects of noisy text on text mining applications. Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data (107-114). DOI: 10.1145/1568296.1568314. Online publication date: 23-Jul-2009
  • (2009) Learning Age and Gender of Blogger from Stylistic Variation. Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence (205-212). DOI: 10.1007/978-3-642-11164-8_33. Online publication date: 15-Dec-2009
