research-article

Improving information extraction from visually rich documents using visual span representations

Published: 01 January 2021

Abstract

Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) methods perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis, a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is twofold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. Artemis identifies the visual span(s) containing a named entity by leveraging this learned representation, followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1 score.
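
To make the idea of a fixed-length representation of a visual span concrete, the sketch below concatenates a hashed bag-of-words vector for the span's text with its normalized bounding-box coordinates. This is not the Artemis model: the abstract describes a learned deep neural network, whereas this toy uses hand-written features. The names `VisualSpan`, `hashed_bag_of_words`, `layout_features`, and `encode_span`, the bounding-box fields, and the vector sizes are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class VisualSpan:
    """Hypothetical container: a text snippet plus its bounding box in page pixels."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def hashed_bag_of_words(text: str, dim: int = 8) -> List[float]:
    """Map text of any length to a `dim`-sized vector of token frequencies (hashing trick)."""
    vec = [0.0] * dim
    tokens = text.lower().split()
    for tok in tokens:
        # md5 gives a hash that is stable across runs, unlike the built-in hash().
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    n = len(tokens) or 1  # avoid division by zero for empty text
    return [v / n for v in vec]


def layout_features(span: VisualSpan, page_w: float, page_h: float) -> List[float]:
    """Position and size of the span, normalized by the page dimensions."""
    return [
        span.x0 / page_w,
        span.y0 / page_h,
        (span.x1 - span.x0) / page_w,
        (span.y1 - span.y0) / page_h,
    ]


def encode_span(span: VisualSpan, page_w: float, page_h: float,
                text_dim: int = 8) -> List[float]:
    """Concatenate textual and layout features into one vector of length text_dim + 4."""
    return hashed_bag_of_words(span.text, text_dim) + layout_features(span, page_w, page_h)
```

Under these assumptions, `encode_span(VisualSpan("Total due: $42.00", 100, 200, 300, 220), 1000, 800)` and `encode_span(VisualSpan("Invoice", 50, 40, 150, 60), 1000, 800)` both return vectors of length 12, so spans with different amounts of text can be fed to the same downstream classifier.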

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 5
January 2021
142 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
