
Some Experiments on Clustering Similar Sentences of Texts in Portuguese

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2008)

Abstract

Identifying similar text passages plays an important role in many NLP applications, such as paraphrase generation and automatic summarization. This paper presents some experiments on detecting and clustering similar sentences in Brazilian Portuguese texts. We propose an evaluation framework based on an incremental, unsupervised clustering method combined with statistical similarity metrics that measure the semantic distance between sentences. Experiments show that the method is robust even on small data sets. In the best case, it achieved an F-measure of 86%, a Purity of 93%, and an Entropy of 0.037.
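As a rough illustration of the kind of approach the abstract describes, the sketch below implements a single-pass incremental clustering of sentences and two of the evaluation measures it names (Purity and Entropy). The abstract does not say which similarity metrics, thresholds, or preprocessing the authors used, so everything here is an assumption: cosine similarity over raw word counts, a hypothetical threshold of 0.5, whitespace tokenization, unnormalized base-2 entropy, and made-up toy sentences with made-up reference labels.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercased whitespace tokens as a term-frequency vector (assumed preprocessing)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_cluster(sentences, threshold=0.5):
    """Single pass: each sentence joins the most similar cluster centroid if the
    similarity reaches the (hypothetical) threshold, else it starts a new cluster."""
    centroids = []  # summed term counts per cluster
    clusters = []   # sentence indices per cluster
    for i, sentence in enumerate(sentences):
        vec = bag_of_words(sentence)
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            centroids.append(Counter(vec))
            clusters.append([i])
        else:
            centroids[best].update(vec)
            clusters[best].append(i)
    return clusters


def purity(clusters, labels):
    """Fraction of sentences that carry the majority reference label of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(labels[i] for i in c).values()) for c in clusters) / total


def entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy (base 2, unnormalized)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Toy data, invented for illustration only.
sentences = [
    "o presidente visitou a cidade ontem",
    "o presidente visitou a capital ontem",
    "a chuva causou enchentes no sul",
    "a chuva forte causou enchentes no sul",
]
labels = ["A", "A", "B", "B"]

clusters = incremental_cluster(sentences)
print(clusters, purity(clusters, labels), entropy(clusters, labels))
```

Higher Purity and lower Entropy both indicate clusters that agree more closely with the reference grouping, which matches the direction of the figures reported in the abstract. Because the method is incremental, the result can depend on the order in which sentences arrive.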




Editor information

António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira, Paulo Quaresma


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Seno, E.R.M., Nunes, M.d.G.V. (2008). Some Experiments on Clustering Similar Sentences of Texts in Portuguese. In: Teixeira, A., de Lima, V.L.S., de Oliveira, L.C., Quaresma, P. (eds) Computational Processing of the Portuguese Language. PROPOR 2008. Lecture Notes in Computer Science, vol 5190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85980-2_14


  • DOI: https://doi.org/10.1007/978-3-540-85980-2_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85979-6

  • Online ISBN: 978-3-540-85980-2

  • eBook Packages: Computer Science, Computer Science (R0)
