DOI: 10.1145/2595188.2595196
research-article

Construction of a text digitization system for Nom historical documents

Published: 19 May 2014

Abstract

This paper presents a text digitization system for Nom historical documents that combines image binarization, character segmentation and character recognition. It incorporates two versions of offline character recognition: one for automatic classification and the other for verification and correction by an operator. Both versions use the same recognition method but are trained on different sets of training patterns, covering 7,601 and 32,733 categories. The recognition method uses the Generalized Learning Vector Quantization (GLVQ) algorithm for coarse classification and the Modified Quadratic Discriminant Function (MQDF2) for fine classification. Because ground-truthed sample patterns are not available, sample character patterns are generated artificially from 27 fonts of Chinese, Japanese and Nom characters. To accelerate large-scale recognition, the coarse classification process uses the kd-tree algorithm. The system also provides an interface through which an operator can verify and correct the results of image binarization, character segmentation and character recognition.
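
The coarse-to-fine recognition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prototypes are hand-picked stand-ins for vectors that GLVQ would learn, and the fine stage uses a simplified diagonal-covariance quadratic score rather than MQDF2. All names (`build_kdtree`, `k_nearest`, `fine_score`, `classify`) and the toy two-dimensional features are hypothetical.

```python
import heapq
import math

class KDNode:
    """One kd-tree node holding a prototype vector and its class label."""
    def __init__(self, point, label, axis, left, right):
        self.point, self.label, self.axis = point, label, axis
        self.left, self.right = left, right

def build_kdtree(items, depth=0):
    """Build a kd-tree from (point, label) pairs, cycling the split axis."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    point, label = items[mid]
    return KDNode(point, label, axis,
                  build_kdtree(items[:mid], depth + 1),
                  build_kdtree(items[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest(root, query, k):
    """Coarse stage: labels of the k nearest prototypes, nearest first."""
    heap = []  # max-heap on distance, stored as (-distance, label)

    def visit(node):
        if node is None:
            return
        d = sq_dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.label))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, node.label))
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Search the far side only if it could still hold a closer prototype.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return [label for _, label in sorted(heap, reverse=True)]

def fine_score(x, mean, var):
    """Simplified quadratic discriminant (diagonal covariance); lower is better."""
    return sum((xi - m) ** 2 / v + math.log(v) for xi, m, v in zip(x, mean, var))

def classify(root, models, query, k=2):
    """Shortlist k candidate classes with the kd-tree, then rescore them."""
    candidates = set(k_nearest(root, query, k))
    return min(candidates, key=lambda c: fine_score(query, *models[c]))

# Toy data: class label -> (mean vector, per-dimension variance). Assumed values.
MODELS = {
    "A": ((0.0, 0.0), (0.1, 0.1)),
    "B": ((3.0, 0.0), (4.0, 4.0)),
    "C": ((10.0, 10.0), (1.0, 1.0)),
}
ROOT = build_kdtree([(mean, label) for label, (mean, _) in MODELS.items()])

if __name__ == "__main__":
    query = (1.4, 0.0)
    print(k_nearest(ROOT, query, 2))
    print(classify(ROOT, MODELS, query, k=2))
```

In this toy setup the prototype nearest to the query belongs to class A, but the fine stage prefers class B because B has a much larger variance. This is the kind of correction a second, more expensive classifier is meant to provide, while the kd-tree keeps the number of classes it has to score small.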

Published In

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN:9781450325882
DOI:10.1145/2595188
Sponsors

  • Succeed: The Support Action Centre of Competence in Digitisation

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Chu Nôm
  2. Nom character
  3. Vietnamese
  4. area Voronoi diagram
  5. binarization
  6. character segmentation
  7. document image analysis
  8. historical documents
  9. offline character recognition
  10. recursive xy cut
  11. text digitization

Qualifiers

  • Research-article

Conference

DATeCH 2014
Sponsor:
  • Succeed

Acceptance Rates

DATeCH '14 Paper Acceptance Rate 31 of 49 submissions, 63%;
Overall Acceptance Rate 60 of 86 submissions, 70%

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 81
  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 04 Jan 2025
