Abstract
This paper proposes a new method for computing text similarity in Chinese text processing, based on concept similarity. The method first converts a text into a word vector space model and then decomposes each word into a set of concepts. The similarity between two words is obtained from the inner products between their concepts, and the similarity between texts is then computed from the word similarities. The contributions of the paper are: 1) a new formula for computing the similarity between words; 2) a new text similarity method based on word similarity; 3) a successful application of the method to similarity computation for Web news; and 4) validation of the method through extensive experiments.
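The pipeline outlined in the abstract (words, then concepts, then word similarity, then text similarity) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the `CONCEPTS` dictionary and its weights are hypothetical toy data, word similarity is taken to be a cosine-normalized inner product of concept vectors, and text similarity uses a soft cosine measure as a stand-in for the paper's own method. Texts are assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical word-to-concept dictionary with made-up weights (illustrative
# only; a real system would draw concepts from a lexical resource).
CONCEPTS = {
    "car": {"vehicle": 1.0, "machine": 0.5},
    "truck": {"vehicle": 1.0, "machine": 0.5, "cargo": 0.5},
    "banana": {"fruit": 1.0, "food": 0.5},
    "apple": {"fruit": 1.0, "food": 0.5},
}


def word_similarity(w1, w2):
    """Cosine-normalized inner product of two words' concept vectors.

    Assumption: this stands in for the paper's word similarity formula.
    Identical words score 1.0; words without shared concepts score 0.0.
    """
    if w1 == w2:
        return 1.0
    c1, c2 = CONCEPTS.get(w1, {}), CONCEPTS.get(w2, {})
    dot = sum(weight * c2.get(concept, 0.0) for concept, weight in c1.items())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def text_similarity(words_a, words_b):
    """Soft cosine between two bag-of-words vectors (an assumed aggregation)."""
    a, b = Counter(words_a), Counter(words_b)

    def bilinear(x, y):
        # Sum of frequency products weighted by pairwise word similarity.
        return sum(
            fx * fy * word_similarity(wx, wy)
            for wx, fx in x.items()
            for wy, fy in y.items()
        )

    denom = math.sqrt(bilinear(a, a) * bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0
```

Under these assumptions, `text_similarity(["car", "apple"], ["truck", "banana"])` scores well above zero even though the two texts share no words, which is the effect a concept-based measure aims for and a plain word-overlap cosine would miss.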
Additional information
Supported by the China Postdoctoral Science Foundation (Grant No. 20060400002), the Sichuan Youth Science and Technology Foundation of China (Grant No. 08JJ0109), the National Natural Science Foundation of China (Grant Nos. 60473051, 60503037), the National High-tech Research and Development Program of China (Grant No. 2006AA01Z230), and the Beijing Natural Science Foundation (Grant No. 4062018).
About this article
Cite this article
Peng, J., Yang, D., Tang, S. et al. A new similarity computing method based on concept similarity in Chinese text processing. Sci. China Ser. F-Inf. Sci. 51, 1215–1230 (2008). https://doi.org/10.1007/s11432-008-0103-4
DOI: https://doi.org/10.1007/s11432-008-0103-4