Multi-page document analysis based on format consistency and clustering

Published: 01 August 2010

Abstract

In multi-page documents, document elements belonging to the same component usually share format regularity. We call this regularity 'document component intrinsic format consistency' (DCIFC). We present a new document analysis method based on DCIFC, which is complementary to the traditional document analysis methods based on the visual characteristics of document elements. One key advantage of our method is that DCIFC is stable from document to document, and thus is not impacted by layout variability, which is a major challenge in document analysis. Our method uses clustering techniques to build statistical models and then applies the models to labelling document components. In this way, the method can adapt to specific documents using formal specificities of components. We apply our method to several document recognition tasks and show its superior performance.
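
The cluster-then-label idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Element` fields, the greedy clustering rule, the tolerances, and the page-coverage threshold are all assumptions chosen for illustration. It groups elements that share a format (font size and vertical position) across pages, then labels a group as recurring when it appears on most pages, as a page header would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical document element; the fields are illustrative assumptions."""
    page: int
    text: str
    font_size: float   # points
    y: float           # vertical position, 0.0 (top) to 1.0 (bottom)


def cluster_by_format(elements, size_tol=0.5, y_tol=0.02):
    """Greedy single-pass clustering (an assumed stand-in technique):
    an element joins the first cluster whose seed element has a
    similar font size and vertical position."""
    clusters = []  # each cluster is a list; cluster[0] is its seed
    for el in elements:
        for cluster in clusters:
            seed = cluster[0]
            if (abs(el.font_size - seed.font_size) <= size_tol
                    and abs(el.y - seed.y) <= y_tol):
                cluster.append(el)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([el])
    return clusters


def label_recurring(clusters, n_pages, min_coverage=0.8):
    """Label every element of a cluster 'recurring' if the cluster has
    members on at least min_coverage of the pages, otherwise 'body'."""
    labels = {}
    for cluster in clusters:
        pages = {el.page for el in cluster}
        label = "recurring" if len(pages) / n_pages >= min_coverage else "body"
        for el in cluster:
            labels[el] = label
    return labels


# Toy three-page document with a running header on every page.
elements = [
    Element(1, "Annual Report", 9.0, 0.03),
    Element(1, "Introduction", 14.0, 0.20),
    Element(2, "Annual Report", 9.0, 0.03),
    Element(2, "Some body text", 11.0, 0.45),
    Element(3, "Annual Report", 9.0, 0.035),
    Element(3, "More body text", 11.0, 0.60),
]
clusters = cluster_by_format(elements)
labels = label_recurring(clusters, n_pages=3)
```

In this toy input, the three "Annual Report" elements fall into one cluster that covers every page, so they are labelled as recurring. The point of the sketch is that the grouping depends only on consistency within the document, not on a layout template fixed in advance.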

Cited By

  • (2018) Semantic similarity-based PageRank using wordnet. International Journal of Computer Applications in Technology, 46:2 (101-112). DOI: 10.1504/IJCAT.2013.052292. Online publication date: 28-Dec-2018

Published In

International Journal of Computer Applications in Technology, Volume 38, Issue 4
August 2010
107 pages
ISSN:0952-8091
EISSN:1741-5047

Publisher

Inderscience Publishers

Geneva 15, Switzerland

Author Tags

  1. clustering
  2. component labelling
  3. document analysis
  4. document recognition
  5. format consistency
  6. information retrieval
  7. multi-page documents
  8. multiple pages

Qualifiers

  • Article
