
Splitting-merging model of Chinese word tokenization and segmentation

Published: 01 December 1998

Abstract

Word tokenization and segmentation remain active research topics in natural language processing, especially for languages such as Chinese, in which words are not delimited by blank spaces. Three major problems arise: (1) tokenizing direction and efficiency; (2) an insufficient tokenization dictionary and new words; and (3) ambiguity in tokenization and segmentation. Most existing tokenization and segmentation methods do not address these problems together. To tackle all three at once, this paper presents a novel dictionary-based method, the Splitting-Merging Model (SMM), for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and non-boundaries of Chinese words, and then arrives at a word segmentation by resolving ambiguities and detecting new words.
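The abstract does not give the SMM procedure itself, so the following is only a minimal sketch of the general idea it names: using the mutual information of adjacent characters to decide where to split and where to merge. Everything here is an assumption made for illustration: the function names (`train_counts`, `pmi`, `segment`), the toy corpus, the use of pointwise mutual information over adjacent character pairs, and the threshold value. The dictionary lookup, ambiguity resolution, and new-word detection described in the abstract are not modelled.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log2(P(ab) / (P(a) * P(b))).

    Returns -inf when the pair (or either character) was never observed.
    """
    if bigrams[(a, b)] == 0 or unigrams[a] == 0 or unigrams[b] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = bigrams[(a, b)] / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold):
    """Merge adjacent characters whose PMI reaches the threshold; split otherwise."""
    if not text:
        return []
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        if pmi(a, b, unigrams, bigrams) >= threshold:
            current += b           # strong association: treat as a non-boundary
        else:
            words.append(current)  # weak association: treat as a word boundary
            current = b
    words.append(current)
    return words


# Hypothetical toy corpus; real statistics would come from a large text collection.
corpus = ["我们学习", "我们工作", "我们休息", "学习工作", "工作休息", "休息学习"]
unigrams, bigrams = train_counts(corpus)

# The threshold of 2.5 is an arbitrary value chosen for this toy corpus.
print(segment("我们工作", unigrams, bigrams, threshold=2.5))
```

In this toy corpus, character pairs inside a word co-occur three times while pairs spanning two words co-occur once, so their scores fall on opposite sides of the chosen threshold. A fixed threshold is the simplest possible decision rule; it is used here only to keep the sketch short.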


Cited By

  • (2018) The head-modifier principle and multilingual term extraction. Natural Language Engineering 11:2 (129-157). DOI: 10.1017/S1351324904003535. Online publication date: 21-Dec-2018
  • (2010) Experience mining Google's production console logs. Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques (5-5). DOI: 10.5555/1928991.1928999. Online publication date: 3-Oct-2010
Published In

Natural Language Engineering, Volume 4, Issue 4
December 1998
234 pages

Publisher

Cambridge University Press
United States
