research-article

Making It Tractable to Detect and Correct Errors in Graphs

Authors:

Chao Tian

ACM Transactions on Database Systems, Volume 49, Issue 4

Article No.: 16, Pages 1 - 75

https://doi.org/10.1145/3702315

Published: 16 December 2024

Abstract

This article develops Hercules, a system for entity resolution (ER), conflict resolution (CR), timeliness deduction (TD), and missing value/link imputation (MI) in graphs. It proposes GCR⁺s, a class of graph cleaning rules (GCRs) that support not only predicates for ER and CR but also temporal orders to deduce timeliness and data extraction to impute missing data. In contrast to previous graph rules, GCR⁺s are defined with a dual graph pattern to accommodate the irregular structures of schemaless graphs, and they adopt star-shaped patterns to reduce complexity. We show that while the implication and satisfiability problems are intractable for GCR⁺s, errors can be detected and corrected with GCR⁺s in polynomial time. Underlying Hercules, we train a ranking model to predict temporal orders on attributes and embed it as a predicate of GCR⁺s. We provide an algorithm for discovering GCR⁺s that combines the generation of patterns with the generation of predicates. We also develop a method that conducts ER, CR, TD, and MI in the same process, improving the overall quality of graphs by leveraging their interactions and chasing with GCR⁺s; we show that the method has the Church–Rosser property under certain conditions. Using real-life and synthetic graphs, we empirically verify that Hercules is 53% more accurate than the state-of-the-art graph cleaning systems and performs comparably in efficiency and scalability.
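The interaction between cleaning tasks during a chase can be illustrated with a toy example. The sketch below is not the Hercules algorithm and does not use GCR⁺ syntax, neither of which is given in this abstract; the rule functions (`make_er_rule`, `rule_mi`), the `chase` loop, and the sample `GRAPH` are all hypothetical. It only shows, under those assumptions, how an ER match can enable an imputation that in turn enables a further ER match, and that on this conflict-free input two different rule orders reach the same result.

```python
import copy
from itertools import combinations

def find(parent, x):
    # Follow parent links to the representative of x's merged class.
    while parent[x] != x:
        x = parent[x]
    return x

def make_er_rule(keys):
    """Hypothetical ER rule: nodes agreeing on all (non-missing) `keys` are one entity."""
    def rule(nodes, parent):
        changed = False
        for a, b in combinations(sorted(nodes), 2):
            va = tuple(nodes[a].get(k) for k in keys)
            vb = tuple(nodes[b].get(k) for k in keys)
            if None in va or va != vb:
                continue
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # merge the two classes
                changed = True
        return changed
    return rule

def rule_mi(nodes, parent):
    """Hypothetical MI rule: fill a missing value from a node already matched by ER."""
    changed = False
    for a, b in combinations(sorted(nodes), 2):
        if find(parent, a) != find(parent, b):
            continue
        for x, y in ((a, b), (b, a)):
            for k, v in nodes[y].items():
                if v is not None and nodes[x].get(k) is None:
                    nodes[x][k] = v  # only fills gaps, never overwrites
                    changed = True
    return changed

def chase(graph, rules):
    """Apply the rules repeatedly until none of them changes anything (a fixpoint)."""
    nodes = copy.deepcopy(graph)       # leave the input untouched
    parent = {n: n for n in nodes}     # every node starts as its own entity
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule(nodes, parent):
                changed = True
    classes = {}
    for n in nodes:
        classes.setdefault(find(parent, n), []).append(n)
    return nodes, sorted(sorted(c) for c in classes.values())

# Toy data (assumed): p2 can only be matched after p1 and p3 exchange values.
GRAPH = {
    "p1": {"name": "Ann", "email": "a@x.org", "phone": None},
    "p2": {"name": "Ann", "email": None, "phone": "555"},
    "p3": {"name": None, "email": "a@x.org", "phone": "555"},
}
er_email = make_er_rule(("email",))
er_name_phone = make_er_rule(("name", "phone"))

if __name__ == "__main__":
    for order in ([er_email, er_name_phone, rule_mi],
                  [rule_mi, er_name_phone, er_email]):
        print(chase(GRAPH, order))
```

In this toy setting both rules are monotone (they only add merges and fill gaps), which is why the order of application does not matter here. With conflicting values the outcome could depend on the order, which is the kind of situation the conditions attached to the Church–Rosser result in the abstract would have to address.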

References

[1]

2017. Wikidata Vandalism Dataset. Retrieved from https://www.wsdm-cup-2017.org/vandalism-detection.html

[2]

2021. DBLP Collaboration Network. Retrieved from https://snap.stanford.edu/data/com-DBLP.html

[3]

2021. IMDB. Retrieved from https://www.imdb.com/interfaces

[4]

2022. DBpedia. Retrieved from http://www.dbpedia.org

[5]

2022. WikiData. Retrieved from https://www.wikidata.org/

[6]

Ziawasch Abedjan, Patrick Schulze, and Felix Naumann. 2014. DFD: Efficient functional dependency discovery. In CIKM. 949–958.

Digital Library

[7]

Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.

Digital Library

[8]

Naser Ahmadi, Viet-Phi Huynh, Venkata Vamsikrishna Meduri, Stefano Ortona, and Paolo Papotti. 2020. Mining expressive rules in knowledge graphs. ACM J. Data Inf. Qual. 12, 2 (2020), 8:1–8:27.

[9]

João Paulo Aires and Felipe Meneguzzi. 2017. Norm conflict identification using deep learning. In AAMAS Workshops. 194–207.

[10]

Waseem Akhtar, Alvaro Cortés-Calabuig, and Jan Paredaens. 2010. Constraints in RDF. In SDKB.

Digital Library

[11]

Arvind Arasu, Michaela Götz, and Raghav Kaushik. 2010. On active learning of record matching packages. In SIGMOD. 783–794.

Digital Library

[12]

Arvind Arasu, Christopher Ré, and Dan Suciu. 2009. Large-scale deduplication with constraints using dedupalog. In ICDE. 952–963.

Digital Library

[13]

Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In PODS. 68–79.

Digital Library

[14]

Abdallah Arioua and Angela Bonifati. 2018. User-guided repairing of inconsistent knowledge bases. In EDBT.

[15]

Hiba Arnaout, Trung-Kien Tran, Daria Stepanova, Mohamed Hassan Gad-Elrab, Simon Razniewski, and Gerhard Weikum. 2022. Utilizing language model probes for knowledge graph repair. In Wiki Workshop 2022.

[16]

Rayhana Baghli and Bruno Traverson. 2014. Verbalization of business rules—Application to OCL constraints in the utility domain. In MODELSWARD. 348–355.

[17]

Zeinab Bahmani, Leopoldo E. Bertossi, and Nikolaos Vasiloglou. 2017. ERBlox: Combining matching dependencies with machine learning for entity resolution. Int. J. Approx. Reason. 83 (2017), 118–141.

Digital Library

[18]

Leopoldo Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers.

[19]

Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2013. Data cleaning and query answering with matching dependencies and matching functions. Theory Comput. Syst. 52, 3 (2013), 441–482.

Digital Library

[20]

Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 5.

Digital Library

[21]

Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. 2017. Efficient denial constraint discovery with hydra. Proc. VLDB 11, 3 (2017), 311–323.

Digital Library

[22]

Aleksandar Bojchevski and Stephan Günnemann. 2019. Certifiable robustness to graph perturbations. In NeurIPS.

[23]

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL. 10–21.

[24]

David A. Bright, Russell Brewer, and Carlo Morselli. 2021. Using social network analysis to study crime: Navigating the challenges of criminal justice records. Soc. Netw. 66 (2021), 50–64.

[25]

Businesswire. 2022. Over 80 Percent of Companies Rely on Stale Data for Decision-Making. Retrieved from https://www.businesswire.com/news/home/20220511005403/en/Over-80-Percent-of-Companies-Rely-on-Stale-Data-for-Decision-Making

[26]

Yang Cao, Wenfei Fan, and Wenyuan Yu. 2013. Determining the relative accuracy of attributes. In SIGMOD. 565–576.

Digital Library

[27]

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI. 1306–1313.

Digital Library

[28]

Karel Cemus and Tomas Cerny. 2017. Automated extraction of business documentation in enterprise information systems. ACM SIGAPP Appl. Comput. Rev. 16, 4 (2017), 5–13.

Digital Library

[29]

Lihan Chen, Sihang Jiang, Jingping Liu, Chao Wang, Sheng Zhang, Chenhao Xie, Jiaqing Liang, Yanghua Xiao, and Rui Song. 2022. Rule mining over knowledge graphs via reinforcement learning. Knowl. Based Syst. 242 (2022).

Digital Library

[30]

Meiqi Chen, Yuan Zhang, Xiaoyu Kou, Yuntao Li, and Yan Zhang. 2021. r-GAT: Relational graph attention network for multi-relational graphs. CoRR abs/2109.05922 (2021).

[31]

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP. 1724–1734.

[32]

Xu Chu, Ihab F. Ilyas, and Paraschos Koutris. 2016. Distributed data deduplication. Proc. VLDB 9, 11 (2016), 864–875.

Digital Library

[33]

Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Discovering denial constraints. Proc. VLDB 6, 13 (2013), 1498–1509.

Digital Library

[34]

Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, and Shuai Ma. 2007. Improving data quality: Consistency and accuracy. In VLDB. 315–326.

Digital Library

[35]

Alvaro Cortés-Calabuig and Jan Paredaens. 2012. Semantics of constraints in RDFS. In AMW.

[36]

Pádraig Cunningham and Sarah Jane Delany. 2020. Underestimation bias and underfitting in machine learning. In TAILOR. 20–31.

[37]

Sanjib Das, Paul Suganthan G. C., AnHai Doan, Jeffrey F. Naughton, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, and Youngchoon Park. 2017. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD. 1431–1446.

Digital Library

[38]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

[39]

Benjamin Doerr. 2020. Probabilistic tools for the analysis of randomized optimization heuristics. In Theory of Evolutionary Computation. 1–87.

[40]

Mohamad Dolatshah, Mathew Teoh, Jiannan Wang, and Jian Pei. 2018. Cleaning crowdsourced labels using oracles for statistical classification. Proc. VLDB 12, 4 (2018), 376–389.

Digital Library

[41]

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proc. VLDB 11, 11 (2018), 1454–1467.

Digital Library

[42]

Jonathan A. Edlow and Peter J. Pronovost. 2023. Misdiagnosis in the emergency department: Time for a system solution. J. Am. Med. Assoc. 329, 8 (2023), 631–632.

[43]

Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. 2014. GRAMI: Frequent subgraph and pattern mining in a single large graph. Proc. VLDB 7, 7 (2014), 517–528.

Digital Library

[44]

Christos Faloutsos, Danai Koutra, and Joshua T. Vogelstein. 2013. DELTACON: A principled massive-graph similarity function. In SDM. 162–170.

[45]

Jicong Fan, Yuqian Zhang, and Madeleine Udell. 2020. Polynomial matrix completion for missing data imputation and transductive learning. In IAAI. 3842–3849.

[46]

Wenfei Fan. 2022. Big graphs: Challenges and opportunities. Proc. VLDB 15, 12 (2022), 3782–3797.

Digital Library

[47]

Wenfei Fan, Zhe Fan, Chao Tian, and Xin Luna Dong. 2015. Keys for graphs. Proc. VLDB 8, 12 (2015), 1590–1601.

Digital Library

[48]

Wenfei Fan, Wenzhi Fu, Ruochun Jin, Muyang Liu, Ping Lu, and Chao Tian. 2023. Making it tractable to catch duplicates and conflicts in graphs. Proc. ACM Manag. Data 1, 1 (2023), 86:1–86:28.

Digital Library

[49]

Wenfei Fan, Wenzhi Fu, Ruochun Jin, Ping Lu, and Chao Tian. 2022. Discovering association rules from big graphs. Proc. VLDB 15, 7 (2022), 1479–1492.

Digital Library

[50]

Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. 2011. Dynamic constraints for record matching. VLDB J. 20, 4 (2011), 495–520.

Digital Library

[51]

Wenfei Fan, Ling Ge, Ruochun Jin, Ping Lu, and Wenyuan Yu. 2022. Linking entities across relations and graphs. In ICDE.

[52]

Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33, 1 (2008), 6:1–6:48.

[53]

Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong. 2011. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23, 5 (2011), 683–698.

Digital Library

[54]

Wenfei Fan, Floris Geerts, Nan Tang, and Wenyuan Yu. 2014. Conflict resolution with data currency and consistency. J. Data Inf. Qual. 5, 1–2 (2014), 6:1–6:37.

[55]

Wenfei Fan, Ziyan Han, Yaoshu Wang, and Min Xie. 2023. Discovering Top-k rules using subjective and objective criteria. Proc. ACM Manag. Data 1, 1 (2023), 70:1–70:29.

Digital Library

[56]

Wenfei Fan, Chunming Hu, Xueli Liu, and Ping Lu. 2020. Discovering graph functional dependencies. ACM Trans. Database Syst. 45, 3 (2020), 15:1–15:42.

Digital Library

[57]

Wenfei Fan, Ruochun Jin, Muyang Liu, Ping Lu, Chao Tian, and Jingren Zhou. 2020. Capturing associations in graphs. Proc. VLDB 13, 11 (2020), 1863–1876.

Digital Library

[58]

Wenfei Fan, Ruochun Jin, Ping Lu, Chao Tian, and Ruiqi Xu. 2022. Towards event prediction in temporal graphs. Proc. VLDB 15, 9 (2022), 1861–1874.

Digital Library

[59]

Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2012. Towards certain fixes with editing rules and master data. VLDB J. 21, 2 (2012), 213–238.

Digital Library

[60]

Wenfei Fan, Xueli Liu, Ping Lu, and Chao Tian. 2020. Catching numeric inconsistencies in graphs. ACM Trans. Database Syst. 45, 2 (2020), 9:1–9:47.

Digital Library

[61]

Wenfei Fan and Ping Lu. 2019. Dependencies for graphs. ACM Trans. Database Syst. 44, 2 (2019), 5:1–5:40.

Digital Library

[62]

Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin, and Wenyuan Yu. 2024. Linking entities across relations and graphs. ACM Trans. Database Syst. 49, 1 (2024), 2:1–2:50.

Digital Library

[63]

Wenfei Fan, Ping Lu, and Chao Tian. 2020. Unifying logic rules and machine learning for entity enhancing. Sci. Chin. Inf. Sci. 63, 7 (2020).

[64]

Wenfei Fan, Ping Lu, Chao Tian, and Jingren Zhou. 2019. Deducing certain fixes to graphs. Proc. VLDB 12, 7 (2019), 752–765.

Digital Library

[65]

Wenfei Fan and Chao Tian. 2022. Incremental graph computations: Doable and undoable. ACM Trans. Database Syst. 47, 2 (2022), 6:1–6:44.

Digital Library

[66]

Wenfei Fan, Chao Tian, Yanghao Wang, and Qiang Yin. 2021. Parallel discrepancy detection and incremental detection. Proc. VLDB 14, 8 (2021), 1351–1364.

Digital Library

[67]

Wenfei Fan, Resul Tugay, Yaoshu Wang, Min Xie, and Muhammad Asif Ali. 2023. Learning and deducing temporal orders. Proc. VLDB 16, 8 (2023), 1944–1957.

Digital Library

[68]

Wenfei Fan, Yinghui Wu, and Jingbo Xu. 2016. Functional dependencies for graphs. In SIGMOD. 1843–1857.

Digital Library

[69]

Nausheen Fatma, Manoj Chinnakotla, and Manish Shrivastava. 2017. The unusual suspects: Deep learning based mining of interesting entity trivia from knowledge graphs. In AAAI.

[70]

Annamaria Ficara, Lucia Cavallaro, Francesco Curreri, Giacomo Fiumara, Pasquale De Meo, Ovidiu Bagdasar, Wei Song, and Antonio Liotta. 2021. Criminal networks analysis in missing data scenarios through graph distances. PLoS One 16, 8 (2021), e0255067.

[71]

Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139–160.

Digital Library

[72]

Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. 2015. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. 24, 6 (2015), 707–730.

Digital Library

[73]

Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian Suchanek. 2013. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW.

Digital Library

[74]

Kun Gao, Katsumi Inoue, Yongzhi Cao, and Hanpin Wang. 2022. Learning first-order rules with differentiable logic program semantics. In IJCAI. 3008–3014.

[75]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In EMNLP. 6894–6910.

[76]

Alberto García-Durán, Sebastijan Dumancic, and Mathias Niepert. 2018. Learning sequence encoders for temporal knowledge graph completion. In EMNLP. 4816–4821.

[77]

Michael Garey and David Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company.

Digital Library

[78]

Congcong Ge, Yunjun Gao, Honghui Weng, Chong Zhang, Xiaoye Miao, and Baihua Zheng. 2020. KGClean: An embedding powered knowledge graph cleaning framework. CoRR abs/2004.14478 (2020).

[79]

Liqiang Geng and Howard J. Hamilton. 2006. Interestingness measures for data mining: A survey. ACM Comput. Surv. 38, 3 (2006), 9.

Digital Library

[80]

Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava, and Bei Yu. 2008. On generating near-optimal tableaux for conditional functional dependencies. Proc. VLDB 1, 1 (2008), 376–390.

Digital Library

[81]

Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac. 2010. Record linkage with uniqueness constraints and erroneous values. Proc. VLDB 3, 1 (2010), 417–428.

Digital Library

[82]

Mahboubeh Haddad, Fereshte Sheybani, HamidReza Naderi, Mohammad Saeed Sasan, Mona Najaf Najafi, Malihe Sedighi, and Atena Seddigh. 2021. Errors in diagnosing infectious diseases: A physician survey. Front. Med. 8 (2021), 779454.

[83]

Shuang Hao, Chengliang Chai, Guoliang Li, Nan Tang, Ning Wang, and Xiang Yu. 2023. HOFD: An outdated fact detector for knowledge bases. IEEE Trans. Know. Data Eng. (2023), 1–14.

Digital Library

[84]

Yuan He, Jiaoyan Chen, Denvar Antonyrajah, and Ian Horrocks. 2022. BERTMap: A BERT-Based ontology alignment system. In AAAI. 5684–5691.

[85]

Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, and Theodoros Rekatsinas. 2019. HoloDetect: Few-shot learning for error detection. In SIGMOD. 829–846.

Digital Library

[86]

Jelle Hellings, Marc Gyssens, Jan Paredaens, and Yuqing Wu. 2016. Implication and axiomatization of functional and constant constraints. Ann. Math. Artif. Intell. 76, 3–4 (2016), 251–279.

Digital Library

[87]

Wenjie Hu, Yang Yang, Ziqiang Cheng, Carl Yang, and Xiang Ren. 2021. Time-series event prediction with evolutionary state graph. In WSDM.

Digital Library

[88]

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In WSDM. 105–113.

Digital Library

[89]

Yka Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100–111.

[90]

Eyke Hüllermeier and Stijn Vanderlooy. 2010. Combining predictions in pairwise classification: An optimal adaptive voting strategy and its relation to weighted voting. Pattern Recogn. 43, 1 (2010), 128–142.

Digital Library

[91]

Witold Jacak and Karin Pröll. 2011. Neural networks based system for cancer diagnosis support. In EUROCAST.

[92]

Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. 2020. Graph structure learning for robust graph neural networks. In KDD. 66–74.

Digital Library

[93]

Seyed Mehran Kazemi and David Poole. 2018. SimplE: Embedding for link prediction in knowledge graphs. In NeurIPS. 4289–4300.

[94]

Anthony C. Klug. 1988. On conjunctive queries containing inequalities. J. ACM 35, 1 (1988), 146–160.

Digital Library

[95]

Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Dedoop: Efficient deduplication with hadoop. Proc. VLDB 5, 12 (2012), 1878–1881.

Digital Library

[96]

Lingzhen Kong, Lina Wang, Wenwen Gong, Chao Yan, Yucong Duan, and Lianyong Qi. 2022. LSH-aware multitype health data prediction with privacy preservation in edge environment. World Wide Web 25, 5 (2022), 1793–1808.

Digital Library

[97]

Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB 3, 1 (2010), 484–493.

Digital Library

[98]

Clyde P. Kruskal, Larry Rudolph, and Marc Snir. 1990. A complexity theory of efficient parallel algorithms. Theor. Comput. Sci. 71, 1 (1990), 95–132.

Digital Library

[99]

Selasi Kwashie, Jixue Liu, Jiuyong Li, Lin Liu, Markus Stumptner, and Lujing Yang. 2019. Certus: An effective entity resolution approach with graph differential dependencies (GDDs). Proc. VLDB 12, 6 (2019), 653–666.

Digital Library

[100]

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.

[101]

Manuel Leone, Stefano Huber, Akhil Arora, Alberto García-Durán, and Robert West. 2022. A critical re-evaluation of neural methods for entity alignment. Proc. VLDB 15, 8 (2022), 1712–1725.

Digital Library

[102]

Bing Li, Wei Wang, Yifang Sun, Linhan Zhang, Muhammad Asif Ali, and Yi Wang. 2020. GraphER: Token-centric entity resolution with graph convolutional neural networks. In AAAI.

[103]

Manling Li, Qi Zeng, Ying Lin, Kyunghyun Cho, Heng Ji, Jonathan May, Nathanael Chambers, and Clare Voss. 2020. Connecting the dots: Event graph schema induction with path language modeling. In EMNLP. 684–695.

[104]

Zixuan Li, Xiaolong Jin, Wei Li, Saiping Guan, Jiafeng Guo, Huawei Shen, Yuanzhuo Wang, and Xueqi Cheng. 2021. Temporal knowledge graph reasoning based on evolutional representation learning. In SIGIR.

Digital Library

[105]

Yi Heng Lim, Qi Zhu, Joshua Selfridge, and Muhammad Firmansyah Kasim. 2024. Parallelizing non-linear sequential models over the sequence length. In ICLR.

[106]

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2018. Multi-hop knowledge graph reasoning with reward shaping. In EMNLP.

[107]

Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015. Modeling relation paths for representation learning of knowledge bases. In EMNLP. 705–714.

[108]

Ying Lin, Han Wang, Jiangning Chen, Tong Wang, Yue Liu, Heng Ji, Yang Liu, and Premkumar Natarajan. 2021. Personalized entity resolution with dynamic heterogeneous knowledge graph representations. CoRR abs/2104.02667 (2021).

[109]

Ashley Little. 2020. Outdated Data: Worse Than No Data? Retrieved from https://info.aldensys.com/joint-use/outdated-data-is-worse-than-no-data

[110]

Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. 2000. Efficient discovery of functional dependencies and Armstrong relations. In EDBT. Springer, 350–364.

Digital Library

[111]

Mohammad Mahdavi and Ziawasch Abedjan. 2020. Baran: Effective error correction via a unified context representation and transfer learning. Proc. VLDB 13, 11 (2020), 1948–1961.

Digital Library

[112]

Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A configuration-free error detection system. In SIGMOD. 865–882.

Digital Library

[113]

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In ICLR.

[114]

Meta. 2023. https://about.meta.com

[115]

Michael Mitzenmacher and Eli Upfal. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.

[116]

Euhyun Moon and Eric C. Cyr. 2022. Parallel training of GRU networks with a multi-grid solver for long sequences. In ICLR.

[117]

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In SIGMOD. 19–34.

Digital Library

[118]

Boris Muzellec, Julie Josse, Claire Boyer, and Marco Cuturi. 2020. Missing data imputation using optimal transport. In ICML.

[119]

Mohammad Hossein Namaki, Yinghui Wu, Qi Song, Peng Lin, and Tingjian Ge. 2017. Discovering graph temporal association rules. In CIKM.

Digital Library

[120]

Noel Novelli and Rosine Cicchetti. 2001. Fun: An efficient algorithm for mining functional and embedded dependencies. In ICDT. 189–203.

Digital Library

[121]

Daniel Obraczka, Jonathan Schuchart, and Erhard Rahm. 2021. EAGER: Embedding-assisted entity resolution for knowledge graphs. CoRR abs/2101.06126 (2021).

[122]

Karolina Okrasa and Pawel Rzazewski. 2021. Fine-grained complexity of the graph homomorphism problem for bounded-treewidth graphs. SIAM J. Comput. 50, 2 (2021), 487–508.

Digital Library

[123]

Stefano Ortona, Venkata Vamsikrishna Meduri, and Paolo Papotti. 2018. Robust discovery of positive and negative rules in knowledge bases. In ICDE. 1168–1179.

[124]

Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In SIGMOD.

Digital Library

[125]

Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. 2020. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In AAAI. 5363–5370.

[126]

Dongwon Park, Dong Un Kang, Jisoo Kim, and Se Young Chun. 2020. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In ECCV, Vol. 12351. 327–343.

Digital Library

[127]

Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semant. Web 8, 3 (2017), 489–508.

Digital Library

[128]

Eduardo H. M. Pena, Eduardo C. de Almeida, and Felix Naumann. 2019. Discovery of approximate (and exact) denial constraints. Proc. VLDB 13, 3 (2019), 266–278.

Digital Library

[129]

Maksim Podkorytov, Daniel Bis, and Xiuwen Liu. 2021. How can the [MASK] know? The sources and limitations of knowledge in BERT. In IJCNN. 1–8.

[130]

Kun Qian, Lucian Popa, and Prithviraj Sen. 2017. Active learning for large-scale entity resolution. In CIKM. 1379–1388.

Digital Library

[131]

Meng Qu, Junkun Chen, Louis-Pascal A. C. Xhonneux, Yoshua Bengio, and Jian Tang. 2021. RNNLogic: Learning logic rules for reasoning on knowledge graphs. In ICLR.

[132]

Kashif Rabbani, Matteo Lissandrini, and Katja Hose. 2023. Extraction of validating shapes from very large knowledge graphs. Proc. VLDB 16, 5 (2023), 1023–1032.

Digital Library

[133]

Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB 10, 11 (2017), 1190–1201.

Digital Library

[134]

Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visualization. In AAAI.

[135]

Tara Safavi and Danai Koutra. 2020. CoDEx: A comprehensive knowledge graph completion benchmark. In EMNLP. 8328–8350.

[136]

Marcus Schaefer and Christopher Umans. 2002. Completeness in the polynomial-time hierarchy: A compendium. SIGACT News 33, 3 (2002), 32–49.

Digital Library

[137]

Philipp Schirmer, Thorsten Papenbrock, Ioannis Koumarelas, and Felix Naumann. 2020. Efficient discovery of matching dependencies. ACM Trans. Database Syst. 45, 3 (2020), 1–33.

Digital Library

[138]

Philipp Schirmer, Thorsten Papenbrock, Sebastian Kruse, Felix Naumann, Dennis Hempfing, Torben Mayer, and Daniel Neuschäfer-Rube. 2019. DynFD: Functional dependency discovery in dynamic datasets. In EDBT.

[139]

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In ESWC.

Digital Library

[140]

Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. 2019. ConvTransE Implementation. Retrieved from https://github.com/JD-AI-Research-Silicon-Valley/SACN

[141]

Pengpeng Shao, Dawei Zhang, Guohua Yang, Jianhua Tao, Feihu Che, and Tong Liu. 2022. Tucker decomposition-based temporal knowledge graph completion. Knowl. Based Syst. 238 (2022), 107841.

Digital Library

[142]

Victor S. Sheng and Jing Zhang. 2019. Machine learning with crowdsourcing: A brief summary of the past research and future directions. In AAAI. 9837–9843.

Digital Library

[143]

Kartik Shenoy, Filip Ilievski, Daniel Garijo, Daniel Schwabe, and Pedro A. Szekely. 2022. A study of the quality of Wikidata. J. Web Semant. 72 (2022), 100679.

Digital Library

[144]

Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. Proc. VLDB 11, 2 (2017), 189–202.

Digital Library

[145]

Julie Smiley. 2016. Missing Data and its Impact on Clinical Research. Retrieved from https://blogs.oracle.com/health-sciences/post/missing-data-and-its-impact-on-clinical-research

[146]

Indro Spinelli, Simone Scardapane, and Aurelio Uncini. 2020. Missing data imputation with adversarially-trained graph convolutional networks. Neural Netw. 129 (2020), 249–260.

[147]

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In WWW. 697–706.

Digital Library

[148]

Katia P. Sycara. 1993. Machine learning for intelligent support of conflict resolution. Decis. Supp. Syst. 10, 2 (1993), 121–136.

Digital Library

[149]

Xianfeng Tang, Yandong Li, Yiwei Sun, Huaxiu Yao, Prasenjit Mitra, and Suhang Wang. 2020. Transferring robustness for graph neural network against poisoning attacks. In WSDM. 600–608.

Digital Library

[150]

Thomas Pellissier Tanon and Fabian M. Suchanek. 2021. Neural knowledge base repairs. In ESWC.

[151]

Yufei Tao. 2018. Massively parallel entity matching with linear classification in low dimensional space. In ICDT. 20:1–20:19.

[152]

Thong Tran and Tru H. Cao. 2013. Automatic detection of outdated information in wikipedia infoboxes. Res. Comput. Sci. 70 (2013), 211–222.

[153]

Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (1990), 103–111.

Digital Library

[154]

Ron van der Meyden. 1997. The complexity of querying indefinite data about linearly ordered domains. J. Comput. Syst. Sci. 54, 1 (1997), 113–135.

Digital Library

[155]

Larysa Visengeriyeva and Ziawasch Abedjan. 2018. Metadata-driven error detection. In SSDBM. 1:1–1:12.

Digital Library

[156]

Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.

Digital Library

[157]

Binghui Wang, Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. 2021. Certified robustness of graph neural networks against adversarial structural perturbation. In KDD. 1645–1653.

Digital Library

[158]

Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing entity resolution. Proc. VLDB 5, 11 (2012), 1483–1494.

Digital Library

[159]

Tobias Weller and Heiko Paulheim. 2021. Evidential relational-graph convolutional networks for entity classification in knowledge graphs. In CIKM. 3533–3537.

Digital Library

[160]

Steven Euijong Whang and Hector Garcia-Molina. 2013. Joint entity resolution on multiple datasets. VLDB J. 22, 6 (2013), 773–795.

Digital Library

[161]

Eric Wong and J. Zico Kolter. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML. 5283–5292.

[162]

Richard Wu, Aoqian Zhang, Ihab F. Ilyas, and Theodoros Rekatsinas. 2020. Attention-based learning for missing data imputation in HoloClean. In MLSys.

[163]

Yuting Wu, Xiao Liu, Yansong Feng, Zheng Wang, Rui Yan, and Dongyan Zhao. 2019. Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI. 5278–5284.

[164]

Catharine M. Wyss, Chris Giannella, and Edward L. Robertson. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances—Extended abstract. In DaWak.

[165]

Yuhao Yang, Chao Huang, Lianghao Xia, and Chenliang Li. 2022. Knowledge graph contrastive learning for recommendation. In SIGIR. 1434–1443.

Digital Library

[166]

H. Yao, H. Hamilton, and C. Butz. 2002. FD_Mine: Discovering functional dependencies in a database using equivalences. In IEEE ICDM. 1–15.

[167]

Rex Ying, A. Wang, Jiaxuan You, and Jure Leskovec. 2020. Frequent subgraph mining by walking in order embedding space. In ICML Workshops.

[168]

Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: Missing data imputation using generative adversarial nets. In ICML. 5675–5684.

[169]

Jiaxuan You, Xiaobai Ma, Daisy Yi Ding, Mykel J. Kochenderfer, and Jure Leskovec. 2020. Handling missing data with graph representation learning. In NeurIPS.

[170]

Xiangxiang Zeng, Xinqi Tu, Yuansheng Liu, Xiangzheng Fu, and Yansen Su. 2022. Toward better drug discovery with knowledge graph. Curr. Opin. Struct. Biol. 72 (2022), 114–126.

[171]

Dongxiang Zhang, Long Guo, Xiangnan He, Jie Shao, Sai Wu, and Heng Tao Shen. 2018. A graph-theoretic fusion framework for unsupervised entity resolution. In ICDE. 713–724.

[172]

Ge Zhang, Jia Wu, Jian Yang, Amin Beheshti, Shan Xue, Chuan Zhou, and Quan Z. Sheng. 2021. FRAUDRE: Fraud detection dual-resistant to graph inconsistency and imbalance. In ICDM.

[173]

Kai Zhang, Qian Yu, Kai Lei, and Kuai Xu. 2014. Characterizing tweeting behaviors of sina weibo users via public data streaming. In WAIM, Vol. 8485. 294–297.

[174]

Qinggang Zhang, Junnan Dong, Keyu Duan, Xiao Huang, Yezi Liu, and Linchuan Xu. 2022. Contrastive knowledge graph error detection. In CIKM.

Digital Library

[175]

Yunjia Zhang, Zhihan Guo, and Theodoros Rekatsinas. 2020. A statistical perspective on discovering functional dependencies in noisy data. In SIGMOD. 861–876.

Digital Library

[176]

Jing Zheng, Jian Liu, Chuan Shi, Fuzhen Zhuang, Jingzhi Li, and Bin Wu. 2017. Recommendation in heterogeneous information network via dual similarity regularization. Int. J. Data Sci. Anal. 3 (2017), 35–48.

[177]

Zheng Zheng, Tri Minh Quach, Ziyi Jin, Fei Chiang, and Mostafa Milani. 2019. CurrentClean: Interactive change exploration and cleaning of stale data. In CIKM. 2917–2920.

Digital Library

[178]

Ziyue Zhong, Meihui Zhang, Ju Fan, and Chenxiao Dou. 2022. Semantics driven embedding learning for effective entity alignment. In ICDE. 2127–2140.

[179]

Linhong Zhu, Majid Ghasemi-Gol, Pedro Szekely, Aram Galstyan, and Craig A. Knoblock. 2016. Unsupervised entity resolution on multi-type graphs. In ISWC.

[180]

Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In WWW. 2069–2080.

[181]

Daniel Zügner and Stephan Günnemann. 2019. Certifiable robustness and robust training for graph convolutional networks. In KDD. 246–256.

Index Terms

Making It Tractable to Detect and Correct Errors in Graphs
1. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Incomplete, inconsistent, and uncertain databases

Information & Contributors

Information

Published In

ACM Transactions on Database Systems Volume 49, Issue 4

December 2024

198 pages

EISSN:1557-4644

DOI:10.1145/3613725

Editor:
Christopher Jermaine
Rice University, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2024

Online AM: 02 November 2024

Accepted: 01 August 2024

Revised: 13 May 2024

Received: 27 December 2023

Published in TODS Volume 49, Issue 4

Qualifiers

Research-article

Funding Sources

Royal Society Wolfson Research Merit Award
Fundamental Research Funds for the Central Universities, NSFC
NSFC
