Abstract
In this paper, we propose a keyword spotting system for Korean document images and compare it with an OCR-based document retrieval system. The system consists of three steps: character segmentation, feature extraction for the query keyword, and word-to-word matching. In the character segmentation step, we propose an effective method for resolving connections between adjacent characters. In the query creation step, a feature vector for the query is constructed by combining the features of its constituent characters. In the matching step, word-to-word matching is performed on the basis of character matching. We demonstrate that the proposed keyword spotting system is more efficient than the OCR-based one at searching for a keyword in Korean document images, especially when the quality of the documents is poor.
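The query creation and matching steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the distance measure, or the decision rule, so the toy 3-dimensional feature table `CHAR_FEATURES`, the Euclidean character distance, the mean-distance word score, and the `threshold` value are all assumptions made for illustration. Character segmentation is assumed to have already produced one feature vector per character.

```python
from math import sqrt

# Hypothetical per-character features (toy 3-dimensional vectors).
# A real system would extract these from rendered character images.
CHAR_FEATURES = {
    "한": [0.9, 0.1, 0.4],
    "국": [0.2, 0.8, 0.5],
    "어": [0.6, 0.3, 0.7],
}


def build_query(keyword):
    """Combine the features of the constituent characters into a query.

    Assumption: the "combination" is simply the ordered sequence of
    per-character feature vectors.
    """
    return [CHAR_FEATURES[ch] for ch in keyword]


def char_distance(a, b):
    """Euclidean distance between two character feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def word_distance(query, candidate):
    """Word-to-word distance built on character matching.

    Assumption: the score is the mean character distance, and words with
    a different number of characters never match.
    """
    if len(query) != len(candidate):
        return float("inf")
    total = sum(char_distance(q, c) for q, c in zip(query, candidate))
    return total / len(query)


def spot(keyword, words, threshold=0.2):
    """Return indices of segmented words whose distance is below threshold."""
    query = build_query(keyword)
    return [i for i, w in enumerate(words) if word_distance(query, w) < threshold]


# Three segmented words from a hypothetical page, each a list of
# per-character feature vectors (slightly noisy, as from a poor scan).
page_words = [
    [[0.88, 0.12, 0.41], [0.21, 0.79, 0.50]],  # close to "한국"
    [[0.60, 0.30, 0.70], [0.20, 0.80, 0.50]],  # "어국": first character differs
    [[0.90, 0.10, 0.40]],                      # single character "한"
]
hits = spot("한국", page_words)
```

Matching at the character level lets a query be built for any keyword from a fixed set of character features, without recognizing the page text first, which is the property that distinguishes this kind of approach from OCR-based retrieval.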
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, S.H., Park, S.C., Jeong, C.B., Kim, J.S., Park, H.R., Lee, G.S. (2005). Keyword Spotting on Korean Document Images by Matching the Keyword Image. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_18
DOI: https://doi.org/10.1007/11599517_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)