research-article

Computer Science Diagram Understanding with Topology Parsing

Published: 30 July 2022

Abstract

Diagrams are a special form of visual expression for representing complex concepts, logic, and knowledge, and they appear widely in educational settings such as textbooks, blogs, and encyclopedias. Current research on diagrams focuses primarily on natural-science disciplines such as Biology and Geography, whose diagrams still resemble natural images. In this article, we construct the first dataset of geometric-type diagrams in the Computer Science field, which have more abstract expressions and more complex logical relations. The dataset provides exhaustive annotations of objects and relations for about 1,300 diagrams and 3,500 question-answer pairs. We introduce the tasks of diagram classification (DC) and diagram question answering (DQA) on the new dataset, and propose the Diagram Parsing Net (DPN), which focuses on analyzing the topological structure and text information of diagrams. We use DPN-based models to solve the DC and DQA tasks and compare their performance with well-known natural-image classification models and visual question answering models. Our experiments show the effectiveness of the proposed DPN-based models on diagram understanding tasks and also indicate that our dataset is more complex than previous natural image understanding datasets. The presented dataset opens new challenges for research in diagram understanding, and the DPN method provides a novel perspective for studying such data. Our dataset is available at https://github.com/WayneWong97/CSDia.
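
To make the idea of a diagram's "topological structure" concrete, the following is a minimal sketch of how object and relation annotations might be turned into a graph. It is not the DPN model and not the CSDia annotation format: the dictionary layout, the field names (`objects`, `relations`, `id`, `text`), and the helper functions are all hypothetical, chosen only for illustration. The sketch also assumes that object ids are the integers 0 to n-1.

```python
# Hypothetical annotation for one diagram. The field names are illustrative
# assumptions, not the actual CSDia schema.
diagram = {
    "objects": [
        {"id": 0, "text": "A"},
        {"id": 1, "text": "B"},
        {"id": 2, "text": "C"},
    ],
    # Each relation is a directed edge (source id, target id), e.g. an arrow.
    "relations": [(0, 1), (0, 2)],
}


def adjacency_matrix(diagram):
    """Return an n x n 0/1 matrix with entry [i][j] = 1 for an edge i -> j.

    Assumes object ids are exactly 0..n-1.
    """
    n = len(diagram["objects"])
    matrix = [[0] * n for _ in range(n)]
    for src, dst in diagram["relations"]:
        matrix[src][dst] = 1
    return matrix


def degree_signature(diagram):
    """Return the sorted list of (in-degree, out-degree) pairs.

    The result depends only on how objects are connected, not on where
    they are drawn, so it is a crude topology-only descriptor.
    """
    n = len(diagram["objects"])
    in_deg = [0] * n
    out_deg = [0] * n
    for src, dst in diagram["relations"]:
        out_deg[src] += 1
        in_deg[dst] += 1
    return sorted(zip(in_deg, out_deg))


if __name__ == "__main__":
    print(adjacency_matrix(diagram))
    print(degree_signature(diagram))
```

A descriptor like this ignores pixel appearance entirely, which is the contrast the abstract draws with natural-image models; a real system would combine such structure with the recognized text of each object.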




    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 6
    December 2022
    631 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3543989

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 July 2022
    Online AM: 18 March 2022
    Accepted: 01 February 2022
    Revised: 01 October 2021
    Received: 01 March 2021
    Published in TKDD Volume 16, Issue 6


    Author Tags

    1. Diagram understanding
    2. topology
    3. Computer Science

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key Research and Development Program
    • National Natural Science Foundation of China
    • Innovative Research Group of the National Natural Science Foundation of China
    • Innovation Research Team of Ministry of Education
    • Xi’an Jiaotong University, China Postdoctoral Science Foundation
    • China Knowledge Centre for Engineering Science and Technology, the Fundamental Research Funds for the Central Universities


