Coverage-Based Methods for Distributional Stopword Selection in Text Segmentation

Joe Vasak & Fei Song

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6231)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Unlike the common stopwords used in information retrieval, distributional stopwords are document-specific: they are the words that are more or less evenly distributed across a document. Isolating distributional stopwords has been shown to be useful for text segmentation, since it improves the representation of a segment by reducing the words that overlap between neighboring segments. In this paper, we propose three new measures for distributional stopword selection and extend the notion of distributional stopwords from the document level to the topic level. Two of the new measures are based on the distributional coverage of a word. The third extends an existing measure, distribution difference, by relying on the density of words, in a way similar to another existing measure, distribution significance. Our experiments show that the new measures are efficient to compute and are more accurate than, or comparable to, the existing measures for distributional stopword selection. They also show that selecting distributional stopwords at the topic level is more accurate than document-level selection for subtopic segmentation.
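
As an informal illustration of what "evenly distributed across a document" can mean, the sketch below scores each word by the fraction of equal-sized blocks of the document in which it occurs, and treats words above a threshold as distributional stopwords. This is only a hypothetical stand-in, not one of the measures proposed in the paper, whose definitions are not given in the abstract. The function names `block_coverage` and `select_distributional_stopwords`, the block count, and the threshold are all assumptions made for the example.

```python
from collections import defaultdict


def block_coverage(tokens, num_blocks=10):
    """Map each word to the fraction of equal-sized blocks it occurs in.

    Hypothetical coverage score for illustration only; it is not the
    paper's definition of distributional coverage.
    """
    if not tokens:
        return {}
    # Never use more blocks than there are tokens.
    num_blocks = min(num_blocks, len(tokens))
    blocks_seen = defaultdict(set)
    for position, token in enumerate(tokens):
        # Index of the block that this token position falls into.
        block = position * num_blocks // len(tokens)
        blocks_seen[token].add(block)
    return {word: len(blocks) / num_blocks for word, blocks in blocks_seen.items()}


def select_distributional_stopwords(tokens, num_blocks=10, threshold=0.8):
    """Return the words whose block coverage meets an assumed threshold."""
    coverage = block_coverage(tokens, num_blocks)
    return {word for word, score in coverage.items() if score >= threshold}


if __name__ == "__main__":
    tokens = ["the", "cat", "the", "dog", "the", "fish", "the", "bird"]
    # "the" occurs in all four blocks; every other word occurs in only one.
    print(select_distributional_stopwords(tokens, num_blocks=4))  # {'the'}
```

Under the same assumptions, passing only the tokens of a single topic segment instead of the whole document would give a rough topic-level analogue of the selection.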



Author information

Authors and Affiliations

Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada, N1G 2W1
Joe Vasak & Fei Song


Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Karel Pala


About this paper

Cite this paper

Vasak, J., Song, F. (2010). Coverage-Based Methods for Distributional Stopword Selection in Text Segmentation. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2010. Lecture Notes in Computer Science, vol 6231. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15760-8_27


DOI: https://doi.org/10.1007/978-3-642-15760-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15759-2
Online ISBN: 978-3-642-15760-8
eBook Packages: Computer Science, Computer Science (R0)
