research-article

Computer Science Diagram Understanding with Topology Parsing

Authors:

Lingling Zhang,

Jun Liu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 6

Article No.: 114, Pages 1 - 20

https://doi.org/10.1145/3522689

Published: 30 July 2022

Abstract

Diagrams are a special form of visual expression for representing complex concepts, logic, and knowledge, and they appear widely in educational materials such as textbooks, blogs, and encyclopedias. Current research on diagrams focuses mainly on natural-science disciplines such as Biology and Geography, whose diagrams still resemble natural images. In this article, we construct the first dataset of geometric-type diagrams in the Computer Science field, which have more abstract expressions and more complex logical relations. The dataset provides exhaustive annotations of objects and relations for about 1,300 diagrams and 3,500 question-answer pairs. We introduce the tasks of diagram classification (DC) and diagram question answering (DQA) on the new dataset and propose the Diagram Parsing Net (DPN), which focuses on analyzing the topological structure and text information of diagrams. We use DPN-based models to solve the DC and DQA tasks and compare their performance with well-known natural image classification models and visual question answering models. Our experiments show the effectiveness of the proposed DPN-based models on diagram understanding tasks and also indicate that our dataset is more complex than previous natural image understanding datasets. The presented dataset opens new challenges for research in diagram understanding, and the DPN method provides a novel perspective for studying such data. Our dataset is available at https://github.com/WayneWong97/CSDia.

Index Terms

Computer Science Diagram Understanding with Topology Parsing
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 6

December 2022

631 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3543989

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 July 2022

Online AM: 18 March 2022

Accepted: 01 February 2022

Revised: 01 October 2021

Received: 01 March 2021

Published in TKDD Volume 16, Issue 6

Qualifiers

Research-article
Refereed

Funding Sources

National Key Research and Development Program
National Natural Science Foundation of China
Innovative Research Group of the National Natural Science Foundation of China
Innovation Research Team of Ministry of Education
Xi’an Jiaotong University, China Postdoctoral Science Foundation
China Knowledge Centre for Engineering Science and Technology, the Fundamental Research Funds for the Central Universities

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 620
Downloads (Last 12 months): 159
Downloads (Last 6 weeks): 14

Reflects downloads up to 06 Jan 2025

Citations

Cited By

An J, Gao M, Tang J (2024). MvStHgL: Multi-view Hypergraph Learning with Spatial-temporal Periodic Interests for Next POI Recommendation. ACM Transactions on Information Systems. Online publication date: 10-May-2024.
https://dl.acm.org/doi/10.1145/3664651
Zhang X, Zhang L, Hu X, Liu J, Wang S, Wang Q (2024). Alignment Relation is What You Need for Diagram Parsing. IEEE Transactions on Image Processing 33 (2131-2144). Online publication date: 2024.
https://doi.org/10.1109/TIP.2024.3374511
Wang S, Zhang L, Zhu L, Qin T, Yap K, Zhang X, Liu J (2024). CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (13969-13979). Online publication date: 16-Jun-2024.
https://doi.org/10.1109/CVPR52733.2024.01325
dos Reis E, Caneppa S, Vasconcelos P, de Lima Santos P (2024). Advancing pharmacogenomics research: automated extraction of insights from PubMed using SpaCy NLP framework. Pharmacogenomics 25:14-15 (573-578). Online publication date: 20-Nov-2024.
https://doi.org/10.1080/14622416.2024.2429946
He S, Du W, Zhang Y, Chen L, Chen Z, Chen N (2024). Next location prediction using heterogeneous graph-based fusion network with physical and social awareness. International Journal of Geographical Information Science 38:10 (1965-1990). Online publication date: 10-Jul-2024.
https://doi.org/10.1080/13658816.2024.2375725
Yu H, Rahimi H, Janz C, Wang D, Li Z, Yang C, Zhao Y (2024). Building a Comprehensive Intent-Based Networking Framework: A Practical Approach from Design Concepts to Implementation. Journal of Network and Systems Management 32:3. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s10922-024-09819-7
Hu Z, Hou W, Liu X (2024). Deep learning for named entity recognition: a survey. Neural Computing and Applications 36:16 (8995-9022). Online publication date: 28-Mar-2024.
https://dl.acm.org/doi/10.1007/s00521-024-09646-6
Schorlemmer M, Ballout M, Kühnberger K (2024). Generating Qualitative Descriptions of Diagrams with a Transformer-Based Language Model. Diagrammatic Representation and Inference (61-75). Online publication date: 9-Sep-2024.
https://doi.org/10.1007/978-3-031-71291-3_5
Jiang S, Wu J (2023). Temporal-geographical attention-based transformer for point-of-interest recommendation. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:6 (12243-12253). Online publication date: 2-Dec-2023.
https://dl.acm.org/doi/10.3233/JIFS-234824
Qin Y, Wu H, Ju W, Luo X, Zhang M (2023). A Diffusion Model for POI Recommendation. ACM Transactions on Information Systems 42:2 (1-27). Online publication date: 8-Nov-2023.
https://dl.acm.org/doi/10.1145/3624475
