DOI: 10.1145/3038912.3052631
research-article

Template Induction over Unstructured Email Corpora

Published: 03 April 2017

Abstract

Unsupervised template induction over email data is a central component in applications such as information extraction, document classification, and auto-reply. The benefits of automatically generating such templates are known for structured data, e.g., machine-generated HTML emails. However, much less work has been done on performing the same task over unstructured email data.
We propose a technique for inducing high-quality templates from plain-text emails at scale, based on the suffix array data structure. We evaluate this method against an industry-standard approach for finding similar content based on shingling, running both algorithms over two corpora: a synthetically created email corpus for a high level of experimental control, and user-generated emails from the well-known Enron email corpus. Our experimental results show that the proposed method is more robust to variations in cluster quality than the baseline, and that its templates contain more text from the emails, which would benefit extraction tasks by identifying transient parts of the emails.
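To make the suffix array idea concrete, here is a minimal sketch that finds the longest word sequence shared by two plain-text emails. It is only an illustration under stated assumptions, not the method evaluated in the paper: the function names are hypothetical, the suffix array is built by naive sorting rather than a scalable construction, and it handles two emails rather than a whole cluster.

```python
def suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy example; real systems use faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(tokens, i, j):
    # Number of leading tokens shared by the suffixes starting at i and j.
    n = 0
    while (i + n < len(tokens) and j + n < len(tokens)
           and tokens[i + n] == tokens[j + n]):
        n += 1
    return n


def longest_common_phrase(email_a, email_b):
    # Hypothetical helper: whitespace tokenization is an assumption.
    a, b = email_a.split(), email_b.split()
    # Unique sentinels keep a match from running across the email boundary.
    tokens = a + ["\x00"] + b + ["\x01"]
    boundary = len(a)
    sa = suffix_array(tokens)
    best_len, best_start = 0, 0
    # Suffixes sharing a long prefix sit next to each other in the array,
    # so only neighbours that come from different emails need checking.
    for prev, cur in zip(sa, sa[1:]):
        if (prev < boundary) != (cur < boundary):
            n = common_prefix_len(tokens, prev, cur)
            if n > best_len:
                best_len, best_start = n, cur
    return " ".join(tokens[best_start:best_start + best_len])


phrase = longest_common_phrase(
    "Hi Bob please find attached the report for March Thanks",
    "Hi Alice please find attached the invoice for April Thanks",
)
print(phrase)
```

In this toy pair the shared phrase would be the fixed part of a template, while the names, document type, and month are the varying parts an extraction task would target.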
Our study indicates that templates induced using suffix arrays contain approximately half as much noise (measured as entropy) as templates induced using shingling. Furthermore, the suffix array approach is substantially more scalable, proving to be an order of magnitude faster than shingling even for modestly sized training clusters.
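The abstract does not say how entropy is computed over a template. As an assumption, the sketch below uses standard Shannon entropy over whitespace-token frequencies, so a template dominated by a few repeated tokens scores lower than one spread over many distinct tokens. The function name is hypothetical.

```python
import math
from collections import Counter


def token_entropy(text):
    # Shannon entropy (in bits) of the token frequency distribution.
    # Assumption: this is one plausible reading of "noise measured as
    # entropy"; the paper's exact definition may differ.
    counts = Counter(text.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = token_entropy("a a b b")   # two equally likely tokens
constant = token_entropy("a a a a")  # a single repeated token
print(uniform, constant)
```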
Analysis of the public corpus shows that email clusters contain on average 4 segments of common phrases, each containing on average 9 words. Templatization could therefore help users reduce email-writing effort by an average of 35 words per email in an assistance or auto-reply task.
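As a rough check on these figures, multiplying the two reported averages gives a value close to the reported saving. The small gap is presumably due to rounding of the averages, which is an assumption rather than something the abstract states.

```latex
\[
\underbrace{4}_{\text{segments per cluster}} \times
\underbrace{9}_{\text{words per segment}} = 36 \approx 35
\ \text{words per email}
\]
```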

Information & Contributors

Information

Published In

WWW '17: Proceedings of the 26th International Conference on World Wide Web
April 2017
1678 pages
ISBN:9781450349130

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Author Tags

  1. enron corpus
  2. fixed phrase extraction
  3. human-generated email
  4. structural template
  5. suffix array generalization
  6. templatization

Qualifiers

  • Research-article

Conference

WWW '17
Sponsor:
  • IW3C2

Acceptance Rates

WWW '17 Paper Acceptance Rate 164 of 966 submissions, 17%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 19 Dec 2024

Citations

Cited By

  • (2021) Email Clustering & Generating Email Templates Based on Their Topics. Proceedings of the 2021 5th International Conference on Information System and Data Mining (96-103). DOI: 10.1145/3471287.3471298. Online publication date: 27-May-2021
  • (2021) Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (4845-4848). DOI: 10.1145/3459637.3482027. Online publication date: 26-Oct-2021
  • (2021) Sequence encoding incorporated CNN model for Email document sentiment classification. Applied Soft Computing (107104). DOI: 10.1016/j.asoc.2021.107104. Online publication date: Jan-2021
  • (2019) Bridging Text Visualization and Mining: A Task-Driven Survey. IEEE Transactions on Visualization and Computer Graphics 25:7 (2482-2504). DOI: 10.1109/TVCG.2018.2834341. Online publication date: 1-Jul-2019
  • (2019) Large-Scale Information Extraction from Emails with Data Constraints. Big Data Analytics (124-139). DOI: 10.1007/978-3-030-37188-3_8. Online publication date: 12-Dec-2019
  • (2018) More than Threads. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1711-1714). DOI: 10.1145/3269206.3269255. Online publication date: 17-Oct-2018
  • (2018) Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (734-743). DOI: 10.1145/3219819.3219901. Online publication date: 19-Jul-2018
  • (2018) Hidden in Plain Sight. Proceedings of the 2018 World Wide Web Conference (1865-1874). DOI: 10.1145/3178876.3186167. Online publication date: 10-Apr-2018
  • (2018) Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction. 2018 IEEE International Conference on Big Data (Big Data) (1846-1855). DOI: 10.1109/BigData.2018.8622048. Online publication date: Dec-2018
  • (2018) Template Trees: Extracting Actionable Information from Machine Generated Emails. Database and Expert Systems Applications (3-18). DOI: 10.1007/978-3-319-98812-2_1. Online publication date: 9-Aug-2018
