
BABOONS: black-box optimization of data summaries in natural language

Published: 01 July 2022

Abstract

BABOONS (BlAck BOx Optimization of Natural language data Summaries) optimizes text data summaries for an arbitrary, user-defined utility function. It primarily targets scenarios in which utility is evaluated via large language models. Users describe their utility function in natural language or provide a model trained to score text summaries in a specific domain.
BABOONS uses reinforcement learning to explore the space of possible descriptions. In each iteration, it generates summaries and evaluates their utility. To reduce data processing overheads during summary generation, BABOONS uses a proactive processing strategy that dynamically merges current queries with likely future queries for efficient evaluation. BABOONS also supports scenario-specific sampling and batch processing strategies. These mechanisms make it possible to scale processing to large data and item sets. The experiments show that BABOONS scales significantly better than the baselines and that its summaries receive higher average grades from users in a large survey.
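
To make the optimization loop from the abstract concrete, below is a minimal sketch (an assumed structure, not the paper's actual algorithm or code): an epsilon-greedy policy stands in for the reinforcement learner, the candidate facts and the utility function are placeholders, and summary generation is reduced to joining facts. In BABOONS itself, utility would typically be scored by a large language model or a user-provided domain model, and summaries would be derived from query results over the data.

```python
import random

# Hypothetical candidate facts about a data set; in practice these would be
# derived from queries over the data rather than hard-coded.
CANDIDATE_FACTS = [
    "average delay is lowest on Tuesdays",
    "carrier X cancels twice as often as the mean",
    "delays grew over the observation period",
]

def utility(summary: str) -> float:
    """Placeholder utility: prefers short summaries that mention delays.
    A real deployment would call an LLM or a trained scoring model here."""
    score = 1.0 if "delay" in summary else 0.0
    return score - 0.01 * len(summary.split())

def generate_summary(facts) -> str:
    """Placeholder generation: joins the selected facts into one sentence."""
    return "In this data set, " + "; ".join(facts) + "."

# Epsilon-greedy bandit over single facts, standing in for the RL policy.
values = {f: 0.0 for f in CANDIDATE_FACTS}   # running utility estimates
counts = {f: 0 for f in CANDIDATE_FACTS}
epsilon = 0.2

for step in range(200):
    if random.random() < epsilon:
        fact = random.choice(CANDIDATE_FACTS)        # explore
    else:
        fact = max(values, key=values.get)           # exploit best estimate
    reward = utility(generate_summary([fact]))
    counts[fact] += 1
    values[fact] += (reward - values[fact]) / counts[fact]  # incremental mean

best = max(values, key=values.get)
print("Best fact:", best)
print("Summary:", generate_summary([best]))
```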


Cited By

  • Demonstrating NaturalMiner: Searching Large Data Sets for Abstract Patterns Described in Natural Language. In Companion of the 2023 International Conference on Management of Data (2023), 139-142. https://doi.org/10.1145/3555041.3589694
  • From BERT to GPT-3 codex. Proceedings of the VLDB Endowment 15, 12 (2022), 3770-3773. https://doi.org/10.14778/3554821.3554896

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 11 (July 2022), 980 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
