Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond

Published: 18 June 2021
DOI: 10.1145/3448016.3457258

Abstract

Deep learning is revolutionizing almost all fields of computer science, including data management. However, the demand for high-quality training data is slowing down the wider adoption of deep neural nets. To this end, data augmentation (DA), which generates more labeled examples from existing ones, has become a common technique. Meanwhile, the risk of creating noisy examples and the large space of hyper-parameters make DA less attractive in practice. We introduce Rotom, a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification. Rotom features InvDA, a new DA operator that generates natural yet diverse augmented examples by formulating DA as a sequence-to-sequence (seq2seq) task. The key technical novelty of Rotom is a meta-learning framework that automatically learns a policy for combining examples from different DA operators, which combinatorially reduces the hyper-parameter space. Our experimental results show that Rotom effectively improves a model's performance by combining multiple DA operators, even when applying them individually yields no improvement. With this strength, Rotom outperforms state-of-the-art entity matching and data cleaning systems in low-resource settings, as well as two recently proposed DA techniques for text classification.
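To make the DA setting concrete, here is a minimal, hypothetical sketch of two classic token-level DA operators applied to a serialized entity record. The serialization format, operator choices, and function names are illustrative assumptions, not Rotom's implementation (and in particular not InvDA). Each operator derives a new labeled example from an existing one, and both hint at the noise problem noted above: a random edit can delete or garble exactly the tokens that determine a match.

```python
# Hypothetical token-level DA operators for entity matching; illustrative only.
import random

def serialize(record: dict) -> str:
    """Flatten an entity record into the token sequence a matcher consumes."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

def token_delete(tokens: list, p: float = 0.1) -> list:
    """Drop each token with probability p; may remove informative tokens."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens[:]  # never return an empty sequence

def token_swap(tokens: list) -> list:
    """Swap two random positions; cheap, but can garble attribute values."""
    out = tokens[:]
    if len(out) >= 2:
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

record = {"title": "iPhone 12 Pro 128GB", "brand": "Apple", "price": "999"}
tokens = serialize(record).split()
# Each augmented string inherits the label of its source example (here: 1).
augmented = [(" ".join(op(tokens)), 1) for op in (token_delete, token_swap)]
for text, label in augmented:
    print(label, text)
```

InvDA sidesteps this fragility by learning a seq2seq model that rewrites the record into a natural variant instead of editing tokens blindly.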

Supplementary Material

MP4 File (3448016.3457258.mp4)
Deep learning is revolutionizing almost all fields of computer science, including databases and data management. However, the high demand for high-quality labeled training data is slowing down deep neural nets' wider adoption. To this end, data augmentation (DA), which generates more labeled examples from existing ones, has become a common technique. Meanwhile, the risk of creating noisy or unnatural examples and the large space of hyper-parameters make DA less attractive in practice. We introduce Rotom, a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification. Rotom features InvDA, a new DA operator that generates natural yet diverse augmented examples by formulating DA as a seq2seq task. The key technical novelty of Rotom is a meta-learning framework that automatically learns a policy model for selecting and combining examples generated by different DA operators, which combinatorially reduces the hyper-parameter search space. Our experimental results show that Rotom can effectively improve a model's performance by combining multiple DA operators, even when applying them individually does not yield performance improvement. With this strength, Rotom outperforms a previous deep entity matching system using less than 9% of its training data, achieves new state-of-the-art results in low-resource data cleaning, and improves on two recently proposed data augmentation techniques from NLP.
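One way to picture the meta-learned policy: a small network scores each candidate example and weights its contribution to the training loss, so low-quality augmentations are suppressed automatically rather than filtered by hand-tuned rules. The PyTorch sketch below is a hedged illustration of that weighting idea under assumed names (PolicyNet, weighted_loss); Rotom's actual policy, and the meta-learning outer loop that trains it against validation feedback, are more involved.

```python
# Hypothetical sketch of a learned example-weighting policy; not Rotom's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Scores an example representation; a sigmoid maps the score to (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(reps)).squeeze(-1)  # shape: (batch,)

def weighted_loss(logits, labels, reps, policy):
    """Cross-entropy in which each (original or augmented) example is
    re-weighted by the policy. The meta-learning outer loop that trains
    the policy itself, e.g., against held-out validation loss, is omitted."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = policy(reps)
    return (weights * per_example).sum() / weights.sum().clamp_min(1e-8)

# Toy usage: random tensors stand in for encoder outputs of a batch that
# mixes original and augmented examples.
policy = PolicyNet(dim=32)
reps = torch.randn(8, 32)                       # example representations
logits = torch.randn(8, 2, requires_grad=True)  # match / no-match logits
labels = torch.randint(0, 2, (8,))
loss = weighted_loss(logits, labels, reps, policy)
loss.backward()  # gradients reach both the logits and the policy weights
```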


Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data augmentation
  2. deep learning
  3. entity matching
  4. error detection

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 349
  • Downloads (last 6 weeks): 14
Reflects downloads up to 12 Dec 2024

Cited By

  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17(11), 3109–3123. DOI: 10.14778/3681954.3681987
  • (2024) Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3690–3701. DOI: 10.1145/3637528.3671873
  • (2024) Rock: Cleaning Data by Embedding ML in Logic Rules. Companion of the 2024 International Conference on Management of Data, 106–119. DOI: 10.1145/3626246.3653372
  • (2024) LRER: A Low-Resource Entity Resolution Framework with Hybrid Information. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. DOI: 10.1109/IJCNN60899.2024.10651166
  • (2024) MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3421–3434. DOI: 10.1109/ICDE60146.2024.00264
  • (2024) Low-resource entity resolution with domain generalization and active learning. Neurocomputing 599, 128131. DOI: 10.1016/j.neucom.2024.128131
  • (2024) SETEM. Knowledge-Based Systems 293(C). DOI: 10.1016/j.knosys.2024.111708
  • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering 31(2). DOI: 10.1007/s10515-024-00453-w
  • (2024) Using Data Augmentation to Support AI-Based Requirements Evaluation in Large-Scale Projects. Systems, Software and Services Process Improvement, 97–111. DOI: 10.1007/978-3-031-71139-8_7
  • (2023) Blocker and Matcher Can Mutually Benefit: A Co-Learning Framework for Low-Resource Entity Resolution. Proceedings of the VLDB Endowment 17(3), 292–304. DOI: 10.14778/3632093.3632096
