Abstract
This paper considers state-of-the-art methods of automatic text summarization that build summaries in the form of generic extracts. The original text is represented as a numerical matrix whose columns correspond to the sentences of the text, each sentence being represented as a vector in the term space. Latent semantic analysis is then applied to this matrix to construct a representation of the sentences in the topic space, whose dimensionality is much lower than that of the original term space. The most important sentences are selected on the basis of their representation in the topic space, and the number of selected sentences is determined by the required summary length. The paper also presents a new generic text summarization method that uses nonnegative matrix factorization to estimate sentence relevance. The proposed relevance estimate is based on normalizing the topic space and then weighting each topic using the representation of the sentences in the topic space. The proposed method shows better summarization quality and performance than state-of-the-art methods on the standard DUC 2001 and DUC 2002 data sets.
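The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: raw term counts as matrix entries, Lee-Seung multiplicative updates for the factorization, unit-length normalization of topic vectors, and topic weights taken as each topic's share of the total sentence loadings. The function names (`term_sentence_matrix`, `nmf`, `summarize`) are hypothetical, and pure Python is used only to keep the example self-contained; a practical implementation would likely rely on a numerical library.

```python
import random
import re
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences; entries are raw term counts (assumed weighting)."""
    counts = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for c in counts] for t in vocab]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(a, k, iters=200, seed=0, eps=1e-9):
    """Approximate A (m x n) by W (m x k) times H (k x n) using multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return w, h


def summarize(sentences, k=2, length=2):
    """Return the `length` highest-scoring sentences in their original order."""
    a = term_sentence_matrix(sentences)
    w, h = nmf(a, k)
    n = len(sentences)
    # Assumed normalization: give each topic vector (column of W) unit L2 norm
    # and move the scale into the matching row of H.
    norms = [sum(row[t] ** 2 for row in w) ** 0.5 for t in range(k)]
    h = [[h[t][j] * norms[t] for j in range(n)] for t in range(k)]
    # Assumed topic weight: the topic's share of all sentence loadings.
    total = sum(sum(row) for row in h) or 1.0
    weights = [sum(row) / total for row in h]
    # Sentence relevance: weighted sum of the sentence's topic loadings.
    scores = [sum(weights[t] * h[t][j] for t in range(k)) for j in range(n)]
    top = sorted(range(n), key=lambda j: scores[j], reverse=True)[:length]
    return [sentences[j] for j in sorted(top)]
```

Calling `summarize(sentences, k=2, length=2)` on a list of sentence strings returns two of them as an extract; here `k` plays the role of the topic-space dimensionality and `length` the required summary length.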
Additional information
Original Russian Text © I.V. Mashechkin, M.I. Petrovskiy, D.S. Popov, D.V. Tsarev, 2011, published in Programmirovanie, 2011, Vol. 37, No. 6.
Cite this article
Mashechkin, I.V., Petrovskiy, M.I., Popov, D.S. et al. Automatic text summarization using latent semantic analysis. Program Comput Soft 37, 299–305 (2011). https://doi.org/10.1134/S0361768811060041
DOI: https://doi.org/10.1134/S0361768811060041