research-article

Open access

Hierarchical Label Propagation and Discovery for Machine Generated Email

Authors: James B. Wendt, Michael Bendersky, Lluis Garcia-Pueyo, Vanja Josifovski, Amitabh Saikia, Marc-Allen Cartright, Sujith Ravi

WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining

Pages 317 - 326

https://doi.org/10.1145/2835776.2835780

Published: 08 February 2016

Abstract

Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%.
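
The two-stage idea in the abstract (propagate labels over a template graph, then down to the matching documents) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the update rule, or any thresholds, so everything here is an assumption made for illustration: cosine similarity over term counts for the graph edges, a generic clamped neighbor-averaging update, the function names `build_template_graph`, `propagate`, and `label_documents`, and the toy templates and cutoff values. It is not the authors' algorithm.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_template_graph(template_terms, threshold=0.3):
    """Link templates whose term vectors are similar (assumed term-based edges)."""
    vectors = {t: Counter(words) for t, words in template_terms.items()}
    graph = {t: {} for t in vectors}
    names = sorted(vectors)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            w = cosine(vectors[u], vectors[v])
            if w >= threshold:
                graph[u][v] = w
                graph[v][u] = w
    return graph


def propagate(graph, seeds, labels, iterations=20):
    """Average neighbor label scores repeatedly; seed templates stay clamped."""
    scores = {t: {l: 0.0 for l in labels} for t in graph}
    for t, l in seeds.items():
        scores[t][l] = 1.0
    for _ in range(iterations):
        updated = {}
        for t, nbrs in graph.items():
            total = sum(nbrs.values())
            if t in seeds or total == 0:
                updated[t] = scores[t]  # clamped seed or isolated template
                continue
            updated[t] = {
                l: sum(w * scores[n][l] for n, w in nbrs.items()) / total
                for l in labels
            }
        scores = updated
    return scores


def label_documents(doc_to_template, scores, min_score=0.5):
    """Push each template's best label down to every document matching it."""
    result = {}
    for doc, template in doc_to_template.items():
        label, score = max(scores[template].items(), key=lambda kv: kv[1])
        result[doc] = label if score >= min_score else None
    return result


# Hypothetical toy data: five templates, two of them labeled by a baseline classifier.
template_terms = {
    "t_receipt_a": ["order", "receipt", "total", "shipping"],
    "t_receipt_b": ["order", "receipt", "total", "tracking"],
    "t_flight_a": ["flight", "boarding", "gate", "itinerary"],
    "t_flight_b": ["flight", "itinerary", "gate", "seat"],
    "t_news": ["newsletter", "weekly", "digest", "unsubscribe"],
}
seeds = {"t_receipt_a": "purchase", "t_flight_a": "travel"}
doc_to_template = {"msg1": "t_receipt_b", "msg2": "t_flight_b", "msg3": "t_news"}

graph = build_template_graph(template_terms)
scores = propagate(graph, seeds, labels=["purchase", "travel"])
labels_by_doc = label_documents(doc_to_template, scores)
print(labels_by_doc)
```

In this toy run the unlabeled receipt and flight templates inherit labels from their labeled neighbors, and the isolated newsletter template stays unlabeled. Working at the template level means the graph has one node per template rather than one per message, which is the compactness advantage the abstract describes.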

Index Terms

Hierarchical Label Propagation and Discovery for Machine Generated Email
1. Information systems
  1. World Wide Web
    1. Web applications
      1. Internet communications tools
        Email

Published In

WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining

February 2016

746 pages

ISBN: 9781450337168

DOI: 10.1145/2835776

General Chairs: Paul N. Bennett (Microsoft Research), Vanja Josifovski (Pinterest)

Program Chairs: Jennifer Neville (Purdue University), Filip Radlinski (Microsoft)

Copyright © 2016 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM 2016: Ninth ACM International Conference on Web Search and Data Mining

February 22 - 25, 2016

San Francisco, California, USA

Acceptance Rates

WSDM '16 Paper Acceptance Rate 67 of 368 submissions, 18%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics

Total Citations: 19

Total Downloads: 4,313

Downloads (Last 12 months): 3,293

Downloads (Last 6 weeks): 169

Reflects downloads up to 05 Mar 2025

Cited By

Early K, O'Hare N, Luvogt C, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). Content-Based Email Classification at Scale. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (4559-4566). Online publication date: 21-Oct-2023.
https://dl.acm.org/doi/10.1145/3583780.3615462

Jang J, Park C, Jang C, Kim G, Kang U (2022). Finding Key Structures in MMORPG Graph with Hierarchical Graph Summarization. ACM Transactions on Knowledge Discovery from Data 16:6 (1-21). Online publication date: 30-Jul-2022.
https://dl.acm.org/doi/10.1145/3522691

Gupta R, Kondapally R, Demartini G, Zuccon G, Culpepper J, Huang Z, Tong H (2021). Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (4845-4848). Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3459637.3482027

Whittaker M, Edmonds N, Tata S, Wendt J, Najork M (2019). Online template induction for machine-generated emails. Proceedings of the VLDB Endowment 12:11 (1235-1248). Online publication date: 1-Jul-2019.
https://dl.acm.org/doi/10.14778/3342263.3342264

Gupta R, Kondapally R, Guha S (2019). Large-Scale Information Extraction from Emails with Data Constraints. Big Data Analytics (124-139). Online publication date: 12-Dec-2019.
https://doi.org/10.1007/978-3-030-37188-3_8

Pal A, Chakrabarti D, Cuzzocrea A, Allan J, Paton N, Srivastava D, Agrawal R, Broder A, Zaki M, Candan S, Labrinidis A, Schuster A, Wang H (2018). Label Propagation with Neural Networks. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1671-1674). Online publication date: 17-Oct-2018.
https://dl.acm.org/doi/10.1145/3269206.3269322

Avigdor-Elgrabli N, Gelbhart R, Grabovitch-Zuyev I, Raviv A, Cuzzocrea A, Allan J, Paton N, Srivastava D, Agrawal R, Broder A, Zaki M, Candan S, Labrinidis A, Schuster A, Wang H (2018). More than Threads. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1711-1714). Online publication date: 17-Oct-2018.
https://dl.acm.org/doi/10.1145/3269206.3269255

Sheng Y, Tata S, Wendt J, Xie J, Zhao Q, Najork M, Guo Y, Farooq F (2018). Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (734-743). Online publication date: 19-Jul-2018.
https://dl.acm.org/doi/10.1145/3219819.3219901

Di Castro D, Gamzu I, Grabovitch-Zuyev I, Lewin-Eytan L, Pundir A, Sahoo N, Viderman M, Champin P, Gandon F, Médini L, Lalmas M, Ipeirotis P (2018). Automated Extractions for Machine Generated Mail. Companion Proceedings of the The Web Conference 2018 (655-662). Online publication date: 23-Apr-2018.
https://dl.acm.org/doi/10.1145/3184558.3186582

Potti N, Wendt J, Zhao Q, Tata S, Najork M, Champin P, Gandon F, Médini L, Lalmas M, Ipeirotis P (2018). Hidden in Plain Sight. Proceedings of the 2018 World Wide Web Conference (1865-1874). Online publication date: 10-Apr-2018.
https://dl.acm.org/doi/10.1145/3178876.3186167
