Abstract
Data integration provides users with a uniform interface to multiple heterogeneous data sources. The problem has attracted considerable attention from both research and industry. In this paper, we survey state-of-the-art approaches to data integration, which we roughly divide into five parts: schema matching, entity resolution, data fusion, integration systems, and newly emerging problems.
Acknowledgment
This work was supported by NSFC (61602159, 61370222) and the Program for Group of Science Harbin Technological Innovation (2015RAXXJ004).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Hui, J., Li, L., Zhang, Z. (2018). Integration of Big Data: A Survey. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_9
DOI: https://doi.org/10.1007/978-981-13-2203-7_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2202-0
Online ISBN: 978-981-13-2203-7
eBook Packages: Computer Science (R0)