Some Experiments on Clustering Similar Sentences of Texts in Portuguese

Eloize Rossi Marques Seno¹ &
Maria das Graças Volpe Nunes¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5190)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language


Abstract

Identifying similar text passages plays an important role in many NLP applications, such as paraphrase generation and automatic summarization. This paper presents experiments on detecting and clustering similar sentences in Brazilian Portuguese texts. We propose an evaluation framework based on an incremental, unsupervised clustering method combined with statistical similarity metrics that measure the semantic distance between sentences. Experiments show that the method is robust even on small data sets. In the best case, it achieved an F-measure of 86%, a Purity of 93%, and an Entropy of 0.037.
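
To make the idea of incremental, unsupervised sentence clustering more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the similarity metrics or any threshold, so everything here is an assumption made for illustration: the whitespace bag-of-words representation, the cosine similarity, the threshold value of 0.5, the centroid update, and the function names `incremental_cluster`, `purity`, and `entropy`. The entropy shown is one common unnormalized, natural-log variant, which may differ from the definition used in the paper.

```python
import math
from collections import Counter


def bow(sentence):
    """Bag-of-words vector; whitespace tokenization is an assumption."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def incremental_cluster(sentences, threshold=0.5):
    """Single pass: each sentence joins the most similar existing cluster
    if the similarity reaches the (hypothetical) threshold, otherwise it
    starts a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, sentence in enumerate(sentences):
        vec = bow(sentence)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Summed term counts serve as the centroid; cosine ignores scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


def purity(clusters, gold):
    """Fraction of sentences that belong to the majority gold class of
    their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(gold[i] for i in c).values()) for c in clusters)
    return majority / total


def entropy(clusters, gold):
    """Size-weighted average of per-cluster class entropy (natural log,
    unnormalized); lower is better, 0 means every cluster is pure."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(gold[i] for i in c)
        h = -sum((n / len(c)) * math.log(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Toy sentences invented for this example (not from the paper's corpus).
sentences = [
    "o presidente visitou a cidade ontem",
    "o presidente visitou a cidade na segunda",
    "a bolsa caiu dois por cento",
    "a bolsa de valores caiu dois por cento",
]
gold = ["A", "A", "B", "B"]
clusters = incremental_cluster(sentences)
print(clusters, purity(clusters, gold), entropy(clusters, gold))
```

Because the pass is incremental, the result can depend on the order in which sentences arrive and on the chosen threshold; a higher threshold yields more, smaller clusters.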


Author information

Authors and Affiliations

NILC-ICMC, University of São Paulo, CP 668P, 13560-970, São Carlos, SP, Brazil
Eloize Rossi Marques Seno & Maria das Graças Volpe Nunes

Editor information

António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira, Paulo Quaresma

About this paper

Cite this paper

Seno, E.R.M., Nunes, M.d.G.V. (2008). Some Experiments on Clustering Similar Sentences of Texts in Portuguese. In: Teixeira, A., de Lima, V.L.S., de Oliveira, L.C., Quaresma, P. (eds) Computational Processing of the Portuguese Language. PROPOR 2008. Lecture Notes in Computer Science, vol 5190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85980-2_14

DOI: https://doi.org/10.1007/978-3-540-85980-2_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85979-6
Online ISBN: 978-3-540-85980-2
eBook Packages: Computer Science, Computer Science (R0)
