
BABOONS: black-box optimization of data summaries in natural language

Published: 01 July 2022

Abstract

BABOONS (BlAck BOx Optimization of Natural language data Summaries) optimizes text data summaries for an arbitrary, user-defined utility function. It primarily targets scenarios in which utility is evaluated via large language models. Users describe their utility function in natural language or provide a model trained to score text summaries in a specific domain.
BABOONS uses reinforcement learning to explore the space of possible descriptions. In each iteration, it generates summaries and evaluates their utility. To reduce data processing overheads during summary generation, BABOONS uses a proactive processing strategy that dynamically merges current queries with likely future queries for efficient evaluation. BABOONS also supports scenario-specific sampling and batch processing strategies. These mechanisms make it possible to scale processing to large data and item sets. The experiments show that BABOONS scales significantly better than the baselines and that its summaries receive higher average grades from users in a large survey.
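
To make the optimization loop from the abstract concrete, below is a minimal sketch (an assumed structure, not the paper's actual algorithm or code): an epsilon-greedy policy stands in for the reinforcement learner, the candidate facts and the utility function are placeholders, and summary generation is reduced to joining facts. In BABOONS itself, utility would typically be scored by a large language model or a user-provided domain model, and summaries would be derived from query results over the data.

```python
import random

# Hypothetical candidate facts about a data set; in practice these would be
# derived from queries over the data rather than hard-coded.
CANDIDATE_FACTS = [
    "average delay is lowest on Tuesdays",
    "carrier X cancels twice as often as the mean",
    "delays grew over the observation period",
]

def utility(summary: str) -> float:
    """Placeholder utility: prefers short summaries that mention delays.
    A real deployment would call an LLM or a trained scoring model here."""
    score = 1.0 if "delay" in summary else 0.0
    return score - 0.01 * len(summary.split())

def generate_summary(facts) -> str:
    """Placeholder generation: joins the selected facts into one sentence."""
    return "In this data set, " + "; ".join(facts) + "."

# Epsilon-greedy bandit over single facts, standing in for the RL policy.
values = {f: 0.0 for f in CANDIDATE_FACTS}   # running utility estimates
counts = {f: 0 for f in CANDIDATE_FACTS}
epsilon = 0.2

for step in range(200):
    if random.random() < epsilon:
        fact = random.choice(CANDIDATE_FACTS)        # explore
    else:
        fact = max(values, key=values.get)           # exploit best estimate
    reward = utility(generate_summary([fact]))
    counts[fact] += 1
    values[fact] += (reward - values[fact]) / counts[fact]  # incremental mean

best = max(values, key=values.get)
print("Best fact:", best)
print("Summary:", generate_summary([best]))
```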


Cited By

  • Demonstrating NaturalMiner: Searching Large Data Sets for Abstract Patterns Described in Natural Language. In Companion of the 2023 International Conference on Management of Data (2023), 139-142. https://doi.org/10.1145/3555041.3589694
  • From BERT to GPT-3 codex. Proceedings of the VLDB Endowment 15, 12 (2022), 3770-3773. https://doi.org/10.14778/3554821.3554896

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 11 (July 2022), 980 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
