Abstract
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
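As a rough illustration of the two input conditions mentioned in the abstract, the sketch below splits a message into its four sections and then combines per-section classifier scores linearly. It is not the authors' implementation: the function names, the example message, the scores, and the weights are all hypothetical, and in the paper the combination weights are described as regression weights rather than values set by hand.

```python
from email import message_from_string

# Header fields used as separate classifier inputs; the body is handled apart.
HEADER_FIELDS = ("From", "To", "Subject")


def split_sections(raw):
    """Split a raw message into from/to/subject/body strings (hypothetical helper)."""
    msg = message_from_string(raw)
    sections = {name.lower(): (msg.get(name) or "") for name in HEADER_FIELDS}
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch only keeps plain string bodies.
    sections["body"] = payload if isinstance(payload, str) else ""
    return sections


def combine_scores(field_scores, weights, bias=0.0):
    """Linearly combine per-section scores for one folder (assumed weighting scheme)."""
    return bias + sum(weights[name] * score for name, score in field_scores.items())


raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 budget\n"
    "\n"
    "Please review the attached numbers.\n"
)
sections = split_sections(raw)

# Made-up per-section scores, standing in for the output of one classifier per section.
scores = {"from": 1.0, "to": 0.0, "subject": 2.0, "body": -1.0}
# Made-up weights; in practice these would be fitted on held-out data.
weights = {"from": 0.5, "to": 0.5, "subject": 0.25, "body": 1.0}
combined = combine_scores(scores, weights)
```

Using a single section alone corresponds to scoring with only one entry of `scores`; the combined condition uses all four with their weights.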
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klimt, B., Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Machine Learning: ECML 2004. ECML 2004. Lecture Notes in Computer Science, vol 3201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30115-8_22
DOI: https://doi.org/10.1007/978-3-540-30115-8_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23105-9
Online ISBN: 978-3-540-30115-8
eBook Packages: Springer Book Archive