DOI: 10.1145/3183713.3193546

GeoFlux: Hands-Off Data Integration Leveraging Join Key Knowledge

Published: 27 May 2018

Abstract

Data integration is frequently required to obtain the full value of data from multiple sources. In spite of extensive research on tools to assist users, data integration remains hard, particularly for users with limited technical proficiency. To address this barrier, we study how much we can do with no user guidance. Our vision is that the user should merely specify two input datasets to be joined and get a meaningful integrated result. It turns out that our vision can be realized if the system can correctly determine the join key, for example based on domain knowledge.
We demonstrate this notion by considering a broad domain: socioeconomic data aggregated by geography, a widespread category that accounts for 80% of the data published by government agencies. Intuitively, two such datasets can be integrated by joining on the geographic unit column. Although this sounds easy, the task poses many challenges: How can we automatically identify which columns correspond to geographic units, other dimension variables, and measure variables? If multiple geographic types exist, which one should be chosen for the join? How can we join tables with idiosyncratic schemas, different geographic units of aggregation, or no aggregation at all?
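To make the first of these challenges concrete, the following is a minimal sketch of one possible heuristic, not GeoFlux's actual method: score each column by the fraction of its values found in a gazetteer of known place names, pick the best-scoring column in each table, and join on it. The gazetteer `KNOWN_COUNTIES`, the 0.8 threshold, the function names, and the sample rows are all hypothetical and chosen only for illustration.

```python
# Hypothetical gazetteer: a tiny set of county names, for illustration only.
KNOWN_COUNTIES = {"albany", "kings", "queens", "erie"}


def normalize(value):
    """Normalize a cell value so that 'Albany ' and 'albany' compare equal."""
    return str(value).strip().lower()


def geo_match_score(values, gazetteer=KNOWN_COUNTIES):
    """Fraction of non-empty values in a column that appear in the gazetteer."""
    cleaned = [normalize(v) for v in values if normalize(v)]
    if not cleaned:
        return 0.0
    return sum(v in gazetteer for v in cleaned) / len(cleaned)


def find_geo_column(rows, threshold=0.8):
    """Return the column with the highest gazetteer score, or None if too low."""
    if not rows:
        return None
    best_col, best_score = None, 0.0
    for col in rows[0]:
        score = geo_match_score([row[col] for row in rows])
        if score > best_score:
            best_col, best_score = col, score
    return best_col if best_score >= threshold else None


def join_on_geo(left, right):
    """Inner-join two tables (lists of dicts) on their detected geographic columns."""
    lcol, rcol = find_geo_column(left), find_geo_column(right)
    if lcol is None or rcol is None:
        raise ValueError("no geographic column detected")
    index = {}
    for row in right:
        index.setdefault(normalize(row[rcol]), []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(normalize(lrow[lcol]), []):
            merged = dict(lrow)
            # Copy every right-hand column except the redundant join key.
            merged.update({k: v for k, v in rrow.items() if k != rcol})
            joined.append(merged)
    return joined


# Made-up sample tables whose schemas disagree on the name of the geographic column.
income = [
    {"County": "Albany", "median_income": 1},
    {"County": "Kings", "median_income": 2},
]
population = [
    {"area": "kings", "population": 20},
    {"area": "Albany ", "population": 10},
]
result = join_on_geo(income, population)
```

A value-based score like this sidesteps idiosyncratic column names, but it does not address the harder cases raised above, such as choosing among several geographic types or joining tables aggregated at different geographic units.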
We have developed GeoFlux, a data integration system that handles all these challenges and joins tabular data by automatically aggregating geographic information with a new, advanced crosswalk algorithm. In this demo paper, we give an overview of the system's architecture and its user-friendly interfaces, and then demonstrate via a real-world example that it is general, fully automatic, and easy to use. In the demonstration, we invite users to interact with GeoFlux to integrate more sample socioeconomic data from data.ny.gov.
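As an illustration of what a crosswalk does in general, the sketch below redistributes an additive measure from one set of geographic units to another using fixed allocation weights. It is a generic weighted crosswalk under stated assumptions, not the new algorithm used in GeoFlux; the function name, the unit identifiers, the weights, and the values are all hypothetical.

```python
from collections import defaultdict


def crosswalk_aggregate(source_values, crosswalk):
    """Redistribute additive values from source units to target units.

    source_values: {source_unit: value}
    crosswalk: {source_unit: {target_unit: weight}}, where the weights for
               each source unit are assumed to sum to 1.
    """
    target_totals = defaultdict(float)
    for unit, value in source_values.items():
        weights = crosswalk.get(unit)
        if not weights:
            raise KeyError(f"no crosswalk entry for {unit!r}")
        for target, weight in weights.items():
            # Each target unit receives its weighted share of the source value.
            target_totals[target] += value * weight
    return dict(target_totals)


# Made-up example: three fine-grained units mapped onto two coarser units.
# Unit "Z2" straddles both targets and is split 25% / 75%.
fine_counts = {"Z1": 100.0, "Z2": 40.0, "Z3": 60.0}
fine_to_coarse = {
    "Z1": {"A": 1.0},
    "Z2": {"A": 0.25, "B": 0.75},
    "Z3": {"B": 1.0},
}
coarse_counts = crosswalk_aggregate(fine_counts, fine_to_coarse)
```

Once both tables are expressed at the same geographic unit in this way, they can be joined on that unit. Fixed weights are only appropriate for additive measures such as counts; rates or medians would need a different treatment.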

Cited By

  • (2023) Design of a data processing method for the farmland environmental monitoring based on improved Spark components. Frontiers in Big Data 6. DOI: 10.3389/fdata.2023.1282352. Online publication date: 20-Nov-2023
  • (2021) Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study. Information and Software Technologies, pp. 84-95. DOI: 10.1007/978-3-030-88304-1_7. Online publication date: 7-Oct-2021
  • (2020) Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction. Information and Software Technologies, pp. 147-158. DOI: 10.1007/978-3-030-59506-7_13. Online publication date: 8-Oct-2020

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crosswalk
  2. geographic data
  3. automatic data integration
  4. multi-dimensional aggregate interpolation
  5. socioeconomic data

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18
Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%


