research-article
Open access

Structure interpretation of text formats

Published: 13 November 2020

Abstract

Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, and mixtures of formats in which data in one format is embedded in another. Parsing these files into a structured tabular form, which is needed to enable any downstream data processing, is therefore a significant challenge.
We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc and mixed-format files. The framework is easily extensible to previously unseen formats, and it supports user interaction in the form of examples that guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combinations of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which the file structure can be shared and refined by different techniques. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation, in which the variety of techniques supported by our system is exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel extracts the intended table in many more cases than state-of-the-art techniques.
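
The idea of a shared partial structure that several techniques refine in turn can be illustrated with a toy sketch. This is not Unravel's implementation or API: `PartialRow`, `split_timestamp`, `parse_json` and `interpret` are hypothetical names invented for this illustration, and the log-line format is an assumed example of a mixed-format file (a date prefix followed by an embedded JSON object).

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PartialRow:
    """Hypothetical partial structure: columns found so far plus uninterpreted text."""
    fields: Dict[str, str] = field(default_factory=dict)
    rest: str = ""


# A technique either refines a partial row or reports that it does not apply.
Technique = Callable[[PartialRow], Optional[PartialRow]]


def split_timestamp(row: PartialRow) -> Optional[PartialRow]:
    """Assumed rule: peel a leading 'YYYY-MM-DD' token off the uninterpreted text."""
    head, sep, tail = row.rest.partition(" ")
    if sep and len(head) == 10 and head[4] == "-" and head[7] == "-":
        return PartialRow({**row.fields, "date": head}, tail)
    return None


def parse_json(row: PartialRow) -> Optional[PartialRow]:
    """Assumed rule: interpret the remaining text as a flat JSON object."""
    try:
        obj = json.loads(row.rest)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    columns = {key: str(value) for key, value in obj.items()}
    return PartialRow({**row.fields, **columns}, "")


def interpret(lines: List[str], techniques: List[Technique]) -> List[PartialRow]:
    """Apply techniques repeatedly until no technique can refine a row further."""
    result = []
    for line in lines:
        row = PartialRow(rest=line)
        changed = True
        while changed and row.rest:
            changed = False
            for technique in techniques:
                refined = technique(row)
                if refined is not None:
                    row, changed = refined, True
                    break
        result.append(row)
    return result


lines = [
    '2020-11-13 {"level": "info", "msg": "start"}',
    '2020-11-14 {"level": "warn", "msg": "disk"}',
]
rows = interpret(lines, [split_timestamp, parse_json])
for row in rows:
    print(row.fields)
```

In this sketch neither rule can interpret a whole line by itself: the JSON rule fails on the date prefix, and the timestamp rule leaves the JSON payload untouched. Only because both rules read and write the same intermediate representation does the line end up fully interpreted, which is the sense in which combining techniques can be stronger than running them independently.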

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p421-p-video.mp4)



Information & Contributors

Information

Published In

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA
November 2020
3108 pages
EISSN: 2475-1421
DOI: 10.1145/3436718
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. data extraction
  2. format diversity
  3. program synthesis

Qualifiers

  • Research-article

