DOI: 10.1145/2046684.2046687

Topic modeling of freelance job postings to monitor web service abuse

Published: 21 October 2011

Abstract

Web services such as Google, Facebook, and Twitter are recurring victims of abuse, and their plight will only worsen as more attackers are drawn to their large user bases. Many attackers hire cheap human labor to carry out their schemes, connecting with potential workers via crowdsourcing and freelancing sites such as Mechanical Turk and Freelancer.com. To identify solicitations for abuse jobs, these crowdsourcing and freelancing sites need ways to distinguish such tasks from ordinary jobs. In this paper, we show how to discover clusters of abuse tasks using latent Dirichlet allocation (LDA), an unsupervised method for topic modeling in large corpora of text. Applying LDA to hundreds of thousands of unlabeled job postings from Freelancer.com, we find that it discovers clusters of related abuse jobs and identifies the prevalent words that distinguish them. Finally, we use the clusters from LDA to profile the population of workers who bid on abuse jobs and the population of buyers who post the project descriptions.
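As a rough illustration of the idea in the abstract, the following minimal sketch fits LDA to a handful of made-up job postings and reports, for each topic, its most frequent words and, for each posting, its topic mixture. Several things here are assumptions rather than details from the paper: the toy postings are hypothetical, the function name fit_lda and its parameters (alpha, beta, iters, seed) are chosen for illustration, and inference uses collapsed Gibbs sampling simply because it is short to write down. The abstract does not say which inference procedure the authors used.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed per-document topic mixtures.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    return theta, top_words

# Hypothetical toy postings, invented for illustration only.
docs = [
    "create bulk email accounts verified accounts".split(),
    "need verified email accounts created in bulk".split(),
    "design logo for restaurant website".split(),
    "website logo design and banner design".split(),
]
theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, mix in enumerate(theta):
    print("posting", d, [round(p, 2) for p in mix])
```

On a real corpus one would first remove stopwords and would likely use an established library rather than this loop; the sketch is only meant to show how per-topic word lists and per-posting topic mixtures come out of the same fit.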

Published In

AISec '11: Proceedings of the 4th ACM workshop on Security and artificial intelligence
October 2011
124 pages
ISBN: 9781450310031
DOI: 10.1145/2046684

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crowdsourcing
  2. latent dirichlet allocation

Qualifiers

  • Research-article

Conference

CCS'11

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%

Cited By

  • (2018) Connecting Human to Cyber-World: Security and Privacy Issues in Mobile Crowdsourcing Networks. Security and Privacy for Next-Generation Wireless Networks, pages 65-100. DOI: 10.1007/978-3-030-01150-5_4. Online publication date: 23-Nov-2018
  • (2017) National Leaders’ Twitter Speech to Infer Political Leaning and Election Results in 2015 Venezuelan Parliamentary Elections. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 866-871. DOI: 10.1109/ICDMW.2017.118. Online publication date: Nov-2017
  • (2017) Inference Algorithms in Latent Dirichlet Allocation for Semantic Classification. Applied Computational Intelligence and Mathematical Methods, pages 173-184. DOI: 10.1007/978-3-319-67621-0_16. Online publication date: 5-Sep-2017
  • (2014) Doppelgänger Finder. Proceedings of the 2014 IEEE Symposium on Security and Privacy, pages 212-226. DOI: 10.1109/SP.2014.21. Online publication date: 18-May-2014
