Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond

Published: 18 June 2021
DOI: 10.1145/3448016.3457258

Abstract

Deep learning is revolutionizing almost all fields of computer science, including data management. However, the demand for high-quality training data is slowing down the wider adoption of deep neural nets. To this end, data augmentation (DA), which generates more labeled examples from existing ones, has become a common technique. Meanwhile, the risk of creating noisy examples and the large space of hyper-parameters make DA less attractive in practice. We introduce Rotom, a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification. Rotom features InvDA, a new DA operator that generates natural yet diverse augmented examples by formulating DA as a sequence-to-sequence (seq2seq) task. The key technical novelty of Rotom is a meta-learning framework that automatically learns a policy for combining examples from different DA operators, which combinatorially reduces the hyper-parameter space. Our experimental results show that Rotom effectively improves a model's performance by combining multiple DA operators, even when applying them individually yields no improvement. With this strength, Rotom outperforms state-of-the-art entity matching and data cleaning systems in low-resource settings, as well as two recently proposed DA techniques for text classification.
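To make the DA setting concrete, here is a minimal, hypothetical sketch of two classic token-level DA operators applied to a serialized entity record. The serialization format, operator choices, and function names are illustrative assumptions, not Rotom's implementation (and in particular not InvDA). Each operator derives a new labeled example from an existing one, and both hint at the noise problem noted above: a random edit can delete or garble exactly the tokens that determine a match.

```python
# Hypothetical token-level DA operators for entity matching; illustrative only.
import random

def serialize(record: dict) -> str:
    """Flatten an entity record into the token sequence a matcher consumes."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

def token_delete(tokens: list, p: float = 0.1) -> list:
    """Drop each token with probability p; may remove informative tokens."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens[:]  # never return an empty sequence

def token_swap(tokens: list) -> list:
    """Swap two random positions; cheap, but can garble attribute values."""
    out = tokens[:]
    if len(out) >= 2:
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

record = {"title": "iPhone 12 Pro 128GB", "brand": "Apple", "price": "999"}
tokens = serialize(record).split()
# Each augmented string inherits the label of its source example (here: 1).
augmented = [(" ".join(op(tokens)), 1) for op in (token_delete, token_swap)]
for text, label in augmented:
    print(label, text)
```

InvDA sidesteps this fragility by learning a seq2seq model that rewrites the record into a natural variant instead of editing tokens blindly.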

Supplementary Material

MP4 File (3448016.3457258.mp4)
Deep learning is revolutionizing almost all fields of computer science, including databases and data management. However, the high demand for high-quality labeled training data is slowing down deep neural nets' wider adoption. To this end, data augmentation (DA), which generates more labeled examples from existing ones, has become a common technique. Meanwhile, the risk of creating noisy or unnatural examples and the large space of hyper-parameters make DA less attractive in practice. We introduce Rotom, a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification. Rotom features InvDA, a new DA operator that generates natural yet diverse augmented examples by formulating DA as a seq2seq task. The key technical novelty of Rotom is a meta-learning framework that automatically learns a policy model for selecting and combining examples generated by different DA operators, which combinatorially reduces the hyper-parameter search space. Our experimental results show that Rotom can effectively improve a model's performance by combining multiple DA operators, even when applying them individually does not yield performance improvement. With this strength, Rotom outperforms a previous deep entity matching system using less than 9% of its training data, achieves new state-of-the-art results in low-resource data cleaning, and improves on two recently proposed data augmentation techniques from NLP.
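One way to picture the meta-learned policy: a small network scores each candidate example and weights its contribution to the training loss, so low-quality augmentations are suppressed automatically rather than filtered by hand-tuned rules. The PyTorch sketch below is a hedged illustration of that weighting idea under assumed names (PolicyNet, weighted_loss); Rotom's actual policy, and the meta-learning outer loop that trains it against validation feedback, are more involved.

```python
# Hypothetical sketch of a learned example-weighting policy; not Rotom's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Scores an example representation; a sigmoid maps the score to (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(reps)).squeeze(-1)  # shape: (batch,)

def weighted_loss(logits, labels, reps, policy):
    """Cross-entropy in which each (original or augmented) example is
    re-weighted by the policy. The meta-learning outer loop that trains
    the policy itself, e.g., against held-out validation loss, is omitted."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = policy(reps)
    return (weights * per_example).sum() / weights.sum().clamp_min(1e-8)

# Toy usage: random tensors stand in for encoder outputs of a batch that
# mixes original and augmented examples.
policy = PolicyNet(dim=32)
reps = torch.randn(8, 32)                       # example representations
logits = torch.randn(8, 2, requires_grad=True)  # match / no-match logits
labels = torch.randint(0, 2, (8,))
loss = weighted_loss(logits, labels, reps, policy)
loss.backward()  # gradients reach both the logits and the policy weights
```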


Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data augmentation
  2. deep learning
  3. entity matching
  4. error detection

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 349
  • Downloads (last 6 weeks): 14
Reflects downloads up to 12 Dec 2024

Cited By

  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17(11), 3109–3123. DOI: 10.14778/3681954.3681987
  • (2024) Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3690–3701. DOI: 10.1145/3637528.3671873
  • (2024) Rock: Cleaning Data by Embedding ML in Logic Rules. Companion of the 2024 International Conference on Management of Data, 106–119. DOI: 10.1145/3626246.3653372
  • (2024) LRER: A Low-Resource Entity Resolution Framework with Hybrid Information. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. DOI: 10.1109/IJCNN60899.2024.10651166
  • (2024) MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3421–3434. DOI: 10.1109/ICDE60146.2024.00264
  • (2024) Low-resource entity resolution with domain generalization and active learning. Neurocomputing 599, 128131. DOI: 10.1016/j.neucom.2024.128131
  • (2024) SETEM. Knowledge-Based Systems 293(C). DOI: 10.1016/j.knosys.2024.111708
  • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering 31(2). DOI: 10.1007/s10515-024-00453-w
  • (2024) Using Data Augmentation to Support AI-Based Requirements Evaluation in Large-Scale Projects. Systems, Software and Services Process Improvement, 97–111. DOI: 10.1007/978-3-031-71139-8_7
  • (2023) Blocker and Matcher Can Mutually Benefit: A Co-Learning Framework for Low-Resource Entity Resolution. Proceedings of the VLDB Endowment 17(3), 292–304. DOI: 10.14778/3632093.3632096
