[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

BlinkFill: semi-supervised programming by example for syntactic string transformations

Published: 01 June 2016 Publication History

Abstract

The recent Programming By Example (PBE) techniques such as FlashFill have shown great promise for enabling end-users to perform data transformation tasks using input-output examples. Since examples are inherently an under-specification, there are typically a large number of hypotheses conforming to the examples, and the PBE techniques suffer from scalability issues for finding the intended program amongst the large space.
We present a semi-supervised learning technique to significantly reduce this ambiguity by using the logical information present in the input data to guide the synthesis algorithm. We develop a data structure InputDataGraph to succinctly represent a large set of logical patterns that are shared across the input data, and use this graph to efficiently learn substring expressions in a new PBE system BlinkFill. We evaluate BlinkFill on 207 real-world benchmarks and show that BlinkFill is significantly faster (on average 41x) and requires fewer input-output examples (1.27 vs 1.53) to learn the desired transformations in comparison to FlashFill.

References

[1]
Z. Abedjan, J. Morcos, M. N. Gubanov, I. F. Ilyas, M. Stonebraker, P. Papotti, and M. Ouzzani. Dataxformer: Leveraging the web for semantic transformations. In CIDR, 2015.
[2]
D. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: Extracting relational data from semi-structured spreadsheets using examples. In PLDI, pages 218--228, 2015.
[3]
X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. TEGRA: table extraction by global record alignment. In SIGMOD, pages 1713--1728, 2015.
[4]
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, Inc., New York, NY, USA, 1 edition, 2003.
[5]
H. Elmeleegy, J. Madhavan, and A. Y. Halevy. Harvesting relational tables from lists on the web. VLDB J., 20(2):209--226, 2011.
[6]
M. Gualtieri. Deputize end-user developers to deliver business agility and reduce costs. Forrester Report for Program Management Professionals, 2009.
[7]
S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, pages 317--330, 2011.
[8]
S. Gulwani, W. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8):97--105, 2012.
[9]
Y. He, K. Ganjam, and X. Chu. SEMA-JOIN: joining semantically-related tables using big table corpora. PVLDB, 8(12):1358--1369, 2015.
[10]
J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.
[11]
S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, pages 3363--3372, 2011.
[12]
T. Lau, S. Wolfman, P. Domingos, and D. Weld. Programming by demonstration using version space algebra. Machine Learning, 53(1--2), 2003.
[13]
T. A. Lau, P. Domingos, and D. S. Weld. Version space algebra and its application to programming by demonstration. In ICML, pages 527--534, 2000.
[14]
V. Le and S. Gulwani. Flashextract: a framework for data extraction by examples. In PLDI, 2014.
[15]
D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: helping to navigate the api jungle. In PLDI, pages 48--61, 2005.
[16]
N. Meng, M. Kim, and K. S. McKinley. Lase: Locating and applying systematic edits by learning from examples. In ICSE, pages 502--511, 2013.
[17]
R. C. Miller and B. A. Myers. Outlier finding: focusing user attention on possible errors. In UIST, pages 81--90, 2001.
[18]
R. C. Miller and B. A. Myers. LAPIS: Smart editing with text structure. In CHI, pages 496--497, 2002.
[19]
T. M. Mitchell. Generalization as search. Artif. Intell., 18(2), 1982.
[20]
J. Morcos, Z. Abedjan, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker. Dataxformer: An interactive data transformation tool. In SIGMOD, pages 883--888, 2015.
[21]
A. Polozov and S. Gulwani. Flashmeta: A framework for inductive program synthesis. In OOPSLA, pages 107--126, 2015.
[22]
V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001.
[23]
J. Rissanen. Paper: Modeling by shortest data description. Automatica, 14(5):465--471, 1978.
[24]
C. Scaffidi, B. A. Myers, and M. Shaw. Topes: reusable abstractions for validating data. In ICSE, pages 1--10, 2008.
[25]
C. Scaffidi, B. A. Myers, and M. Shaw. Intelligently creating and recommending reusable reformatting rules. In IUI, pages 297--306, 2009.
[26]
R. Singh and S. Gulwani. Learning semantic string transformations from examples. PVLDB, 5(8):740--751, 2012.
[27]
R. Singh and S. Gulwani. Predicting a correct program in programming by example. In CAV, pages 398--414, 2015.
[28]
R. Singh and S. Gulwani. Transforming spreadsheet data types using examples. In POPL, pages 343--356, 2016.

Cited By

View all
  • (2024)Data-Driven Insight Synthesis for Multi-Dimensional DataProceedings of the VLDB Endowment10.14778/3641204.364121117:5(1007-1019)Online publication date: 1-Jan-2024
  • (2024)Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by ExampleProceedings of the ACM on Software Engineering10.1145/36437551:FSE(631-653)Online publication date: 12-Jul-2024
  • (2024)DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language ModelsProceedings of the ACM on Management of Data10.1145/36392792:1(1-24)Online publication date: 26-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 9, Issue 10
June 2016
132 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 June 2016
Published in PVLDB Volume 9, Issue 10

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)26
  • Downloads (Last 6 weeks)3
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Data-Driven Insight Synthesis for Multi-Dimensional DataProceedings of the VLDB Endowment10.14778/3641204.364121117:5(1007-1019)Online publication date: 1-Jan-2024
  • (2024)Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by ExampleProceedings of the ACM on Software Engineering10.1145/36437551:FSE(631-653)Online publication date: 12-Jul-2024
  • (2024)DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language ModelsProceedings of the ACM on Management of Data10.1145/36392792:1(1-24)Online publication date: 26-Mar-2024
  • (2024)Exploring Data Preparation Modules by ExamplesIntelligent Information and Database Systems10.1007/978-981-97-4982-9_5(52-69)Online publication date: 15-Apr-2024
  • (2024)SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical AttributesMachine Learning and Knowledge Discovery in Databases. Research Track10.1007/978-3-031-70359-1_13(213-229)Online publication date: 8-Sep-2024
  • (2023)Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-VProceedings of the VLDB Endowment10.14778/3583140.358316916:6(1587-1600)Online publication date: 20-Apr-2023
  • (2023)Explainable Program Synthesis by Localizing SpecificationsProceedings of the ACM on Programming Languages10.1145/36228747:OOPSLA2(2171-2195)Online publication date: 16-Oct-2023
  • (2023)Automated Translation of Functional Big Data Queries to SQLProceedings of the ACM on Programming Languages10.1145/35860477:OOPSLA1(580-608)Online publication date: 6-Apr-2023
  • (2023)PyEvolve: Automating Frequent Code Changes in Python ML SystemsProceedings of the 45th International Conference on Software Engineering10.1109/ICSE48619.2023.00091(995-1007)Online publication date: 14-May-2023
  • (2023)Data Preparation: A Technological Perspective and ReviewSN Computer Science10.1007/s42979-023-01828-84:4Online publication date: 2-Jun-2023
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media