Identifying the original contribution of a document via language modeling

B Shaparenko, T Joachims - Proceedings of the 32nd international ACM SIGIR conference on Research and …, 2009 - dl.acm.org
One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.
ACM Digital Library