research-article

Visually and Phonologically Similar Characters in Incorrect Chinese Words: Analyses, Identification, and Applications

Published: 01 June 2011

Abstract

Information about students’ mistakes opens a window onto their learning processes and helps us design course work that keeps students from repeating the same errors. Learning from mistakes matters not only in human learning; it is also a crucial ingredient in techniques for developing student models. In this article, we report findings from our study of 4,100 erroneous Chinese words. Of these errors, 76% were related to phonological similarity between the correct and the incorrect characters, 46% were due to visual similarity, and 29% involved both factors. We propose an algorithm that aims to reproduce incorrect Chinese words. The algorithm extends the principles of decomposing Chinese characters with Cangjie codes to judge the visual similarity between characters, and it employs empirical rules to determine the degree of similarity between Chinese phonemes. To show its effectiveness, we ran the algorithm to select and rank a list of about 100 candidate characters, from more than 5,100 characters, for the incorrectly written character in each of the 4,100 errors. We then checked whether the incorrect character was included in the candidate list and whether it was ranked at the top of the list. Experimental results show that our algorithm captured 97% of the incorrect characters for the 4,100 errors when the average length of the candidate lists was 104. Further analyses showed that the incorrect character ranked among the top 10 candidates in 89% of the phonologically similar errors and in 80% of the visually similar errors.
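The candidate-ranking and evaluation procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the article's scoring rules, so the sketch substitutes a generic string-similarity ratio over Cangjie codes (Python's `difflib.SequenceMatcher`) as a stand-in for the visual-similarity measure, and it leaves out the phonological side entirely. The six-character lexicon, the error pairs, and the function names `visual_similarity`, `rank_candidates`, and `capture_rate` are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Toy lexicon: character -> Cangjie code. Six entries for illustration,
# not the 5,100-character set used in the study.
CANGJIE = {
    "明": "AB",
    "昌": "AA",
    "晶": "AAA",
    "朋": "BB",
    "林": "DD",
    "森": "DDD",
}


def visual_similarity(a, b):
    """Stand-in score in [0, 1]: similarity ratio of two Cangjie code strings.

    This is NOT the article's measure; it is a generic substitute.
    """
    return SequenceMatcher(None, CANGJIE[a], CANGJIE[b]).ratio()


def rank_candidates(correct, k):
    """Return the k characters most similar to `correct`, best first."""
    others = [c for c in CANGJIE if c != correct]
    others.sort(key=lambda c: visual_similarity(correct, c), reverse=True)
    return others[:k]


def capture_rate(errors, k):
    """Fraction of errors whose incorrect character is in the top-k list."""
    hits = sum(1 for correct, wrong in errors
               if wrong in rank_candidates(correct, k))
    return hits / len(errors)


# Hypothetical (correct, incorrect) pairs, invented for this sketch.
ERRORS = [("林", "森"), ("明", "朋"), ("昌", "晶")]

if __name__ == "__main__":
    print(rank_candidates("林", 1))
    print(capture_rate(ERRORS, 2))  # inclusion in a 2-candidate list
    print(capture_rate(ERRORS, 1))  # ranked first
```

Calling `capture_rate` with a large `k` corresponds to the inclusion check in the abstract (candidate lists of about 100), while a small `k` corresponds to the top-10 check; a fuller version would combine this visual score with a phonological one before ranking.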

Published In

ACM Transactions on Asian Language Information Processing, Volume 10, Issue 2
June 2011
111 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1967293
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2011
Accepted: 01 January 2011
Revised: 01 December 2010
Received: 01 September 2010
Published in TALIP Volume 10, Issue 2

Author Tags

  1. Error analysis of written Chinese text
  2. computer-assisted language learning
  3. psycholinguistics
  4. simplified Chinese
  5. student modeling
  6. traditional Chinese

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 50
  • Downloads (Last 6 weeks): 11

Reflects downloads up to 11 Dec 2024

Cited By

  • (2024) Notice-to-Airmen Text-Error-Correction Method Based on Ccf-MacBERT and Knowledge Graph. Journal of Aerospace Information Systems (1-9). DOI: 10.2514/1.I011481. Online publication date: 13-Dec-2024
  • (2024) An Unsupervised Domain-Adaptive Framework for Chinese Spelling Checking. ACM Transactions on Asian and Low-Resource Language Information Processing 23:11 (1-16). DOI: 10.1145/3689821. Online publication date: 21-Nov-2024
  • (2024) Explore the Textual Perception Ability on the Images for Multimodal Large Language Models. Natural Language Processing and Chinese Computing (300-311). DOI: 10.1007/978-981-97-9443-0_26. Online publication date: 2-Nov-2024
  • (2023) PYGC: A PinYin Language Model Guided Correction Model for Chinese Spell Checking. Neural Information Processing (224-239). DOI: 10.1007/978-981-99-8073-4_18. Online publication date: 20-Nov-2023
  • (2023) Overview of the NLPCC 2023 Shared Task: Chinese Spelling Check. Natural Language Processing and Chinese Computing (337-345). DOI: 10.1007/978-3-031-44699-3_30. Online publication date: 12-Oct-2023
  • (2022) The Effect of Visual Mnemonics and the Presentation of Character Pairs on Learning Visually Similar Characters for Chinese-As-Second-Language Learners. Frontiers in Psychology 13. DOI: 10.3389/fpsyg.2022.783898. Online publication date: 9-May-2022
  • (2022) An Adversarial Multi-task Learning Method for Chinese Text Correction with Semantic Detection. Artificial Neural Networks and Machine Learning – ICANN 2022 (159-173). DOI: 10.1007/978-3-031-15931-2_14. Online publication date: 7-Sep-2022
  • (2022) Prompt as a Knowledge Probe for Chinese Spelling Check. Knowledge Science, Engineering and Management (516-527). DOI: 10.1007/978-3-031-10989-8_41. Online publication date: 19-Jul-2022
  • (2021) WordErrorSim: An Adversarial Examples Generation Method in Chinese by Erroneous Knowledge. Proceedings of the 2021 5th International Conference on Compute and Data Analysis (155-161). DOI: 10.1145/3456529.3456556. Online publication date: 2-Feb-2021
  • (2021) Chinese Spelling Error Detection Using a Fusion Lattice LSTM. ACM Transactions on Asian and Low-Resource Language Information Processing 20:2 (1-11). DOI: 10.1145/3426882. Online publication date: 5-May-2021
