Abstract
Cybercriminals have been using the Internet to accomplish illegitimate activities and to execute catastrophic attacks. Computer-Mediated Communication such as online chat provides an anonymous channel for predators to exploit victims. In order to prosecute criminals in a court of law, an investigator often needs to extract evidence from a large volume of chat messages. Most of the existing search tools are keyword-based, and the search terms are provided by an investigator. The quality of the retrieved results depends on the search terms provided. Due to the large volume of chat messages and the large number of participants in public chat rooms, the process is often time-consuming and error-prone. This paper presents a topic search model to analyze archives of chat logs for segregating crime-relevant logs from others. Specifically, we propose an extension of the Latent Dirichlet Allocation-based model to extract topics, compute the contribution of authors in these topics, and study the transitions of these topics over time. In addition, we present a special model for characterizing authors-topics over time. This is crucial for investigation because it provides a view of the activity in which authors are involved in certain topics. Experiments on two real-life datasets suggest that the proposed approach can discover hidden criminal topics and the distribution of authors to these topics.
Similar content being viewed by others
References
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: Proceedings of the 20th UAI, pp 487–494
Wang X, Mohanty N, McCallum A (2005) Group and topic discovery from relations and text. In: Proceedings of the 3rd ACM LinkKDD, pp 28–35
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 EMNLP, vol 1, pp 248–256
Hong L, Davison BD (2010) Empirical study of topic modeling in twitter. In: Proceedings of the 1st SOMA, pp 80–88
Banerjee S, Agarwal N (2012) Analyzing collective behavior from blogs using swarm intelligence. KAIS, pp 1–25
Blei D, McAuliffe J (2008) Supervised topic models. Adv Neural Inf Process Syst 20:121–128
Lacoste-julien S, Sha F, Jordan MI (2008) DiscLDA: discriminative learning for dimensionality reduction and classification. In: Proceedings of the 22nd NIPS, pp 897–904
Ramage D, Heymann P, Manning CD, Garcia-Molina H (2009) Clustering the tagged web. In: Proceedings of the 2nd ACM WSDM, pp 54–63
Rubin T, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88:157–208
Chang J, Boyd-Graber J, Blei DM (2009) Connections between the lines: augmenting social networks with text. In: Proceedings of the 15th ACM SIGKDD, pp 169–178
Song X, Lin CY, Tseng BL, Sun MT (2005) Modeling and predicting personal information dissemination behavior. In: Proceedings of the 11th ACM SIGKDD, pp 479–488
Wang X, McCallum A (2006) Topics over time: a non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD, pp 424–433
Wang C, Blei DM, Heckerman D (2008) Continuous time dynamic topic models. In: UAI’08, pp 579–586
Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd ICML, pp 113–120
AlSumait L, Barbará D, Domeniconi C (2008) On-line lda: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Proceedings of the 8th IEEE ICDM, pp 3–12
Du L, Buntine W, Jin H, Chen C (2012) Sequential latent dirichlet allocation. KAIS 31:475–503
Manning CD, Raghavan P, Schtze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: Proceedings of the 18th UAI, pp 352–359
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101:5228–5235
Heinrich G (2004) Parameter estimation for text analysis. Technical Report
Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan H, Li X (2011) Comparing twitter and traditional media using topic models. In: Proceedings of the 33rd ECIR. Springer, Berlin, pp 338–349
PJF Inc. Chat log conviction numbers. Available: http://www.ciise.concordia.ca/~fung/pub/convictions.txt
Teh YW, Jordan MI, Beal MJ, Blei DM (2004) Sharing clusters among related groups: hierarchical dirichlet processes. In: Proceedings of the 19th NIPS, pp 1385–1392
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments that greatly helped improve this paper. The research is supported in part by research grants from Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT) new researchers start-up program, Concordia ENCS seed funding program, and the National Cyber-Forensics and Training Alliance Canada (NCFTA Canada).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
M. A. Basher, A.R., Fung, B.C.M. Analyzing topics and authors in chat logs for crime investigation. Knowl Inf Syst 39, 351–381 (2014). https://doi.org/10.1007/s10115-013-0617-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-013-0617-y