Abstract
This paper describes a method for extracting words from table regions in document images. The proposed approach consists of two stages: cell detection and word extraction. In the cell detection module, the table frame is first extracted by analyzing connected components, and intersection points are then detected by applying masks to the table frame. False intersections are corrected, and the locations of the cells within the table are determined. In the word extraction module, the text region in each cell is located using the connected-component information obtained during cell detection and is segmented into text lines using projection profiles. Finally, the segmented lines are divided into words using gap clustering and special symbol detection. Because the method correctly assigns character components that touch the table frame to their words, experimental results show that more than 99% of words were successfully extracted from table regions.
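The word extraction stage can be illustrated with a minimal sketch. It assumes a cell is already available as a binary image (1 = ink, 0 = background), splits it into text lines with a horizontal projection profile, and then splits one line into words by clustering the gaps between character runs. The abstract does not say which clustering method is used, so the two-cluster 1-D k-means here is an assumption, as are all function names; special symbol detection and the cell detection stage are not covered.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(cell: Image) -> List[Tuple[int, int]]:
    """Split a cell into text lines using the horizontal projection profile."""
    return runs([sum(row) for row in cell])


def gap_threshold(gaps: List[int]) -> float:
    """Two-cluster 1-D k-means on gap widths (an assumed clustering choice).

    Returns the midpoint between the two cluster centres.
    Requires min(gaps) < max(gaps).
    """
    lo, hi = float(min(gaps)), float(max(gaps))
    for _ in range(20):
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        new_lo, new_hi = sum(small) / len(small), sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def segment_words(cell: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Split the text line occupying rows [top, bottom) into word spans."""
    width = len(cell[0])
    # Vertical projection profile restricted to this text line.
    profile = [sum(cell[y][x] for y in range(top, bottom)) for x in range(width)]
    chars = runs(profile)
    if len(chars) < 2:
        return chars
    gaps = [b[0] - a[1] for a, b in zip(chars, chars[1:])]
    if min(gaps) == max(gaps):
        # Assumption: uniform gaps give no evidence of a word boundary.
        return [(chars[0][0], chars[-1][1])]
    threshold = gap_threshold(gaps)
    words, (start, end) = [], chars[0]
    for (s, e), gap in zip(chars[1:], gaps):
        if gap > threshold:  # wide gap: close the current word
            words.append((start, end))
            start = s
        end = e
    words.append((start, end))
    return words
```

Calling `segment_lines(cell)` yields row spans, and each span can be passed to `segment_words(cell, top, bottom)` to obtain column spans for the words on that line. A line whose gaps are all the same width is treated as a single word, which is a limitation of this sketch rather than a property of the published method.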
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jeong, CB., Park, SC., Son, HJ., Kim, SH. (2005). Word Extraction from Table Regions in Document Images. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_25
DOI: https://doi.org/10.1007/11599517_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)