A Probabilistic Data Fusion Modeling Approach for Extracting True Values from Uncertain and Conflicting Attributes
Figure 1. Data integration levels and their corresponding conflict resolution tasks.
Figure 2. The pair-wise source-to-target matching and formulation process based on the best-effort data integration framework. (a) The participating iDO instances as observed from three structured data sources, $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (b) The pair-wise source-to-target matching process for the three structured data sources $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (c) The pair-wise source-to-target matching process for the three iDO instances according to their local instance ($pDO_{wx}^{ih}$) or reference instance ($rDO_{w}^{ih}$) formulations.
Figure 3. Example of possible worlds with associated probability values.
Figure 4. Example of the probabilistic data fusion problem under the existence of multi-valued attributes. (a) The participating iDO instances. (b) The pair-wise probabilistic linkages. (c) The probabilistic entity-merging alternatives from the instances in Figure 4b. (d) The populated data values for the global attributes of the participating instances in Figure 4c.
Figure 5. Example of City data fusion alternatives with their updated reliability scores based on the C2.1.1 case. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the City attribute at each possible entity-merging alternative, as produced based on Figure 5b.
Figure 6. Example of the updated City data fusion alternatives based on the newly added evidence from the fourth data source. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the City attribute at each possible entity-merging alternative, as produced based on Figure 6b.
Figure 7. Example of Phone data fusion alternatives with their updated reliability scores based on the C2.2.1 case. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the Phone attribute at each possible entity-merging alternative, as produced based on Figure 7b.
Abstract
1. Introduction
- (1) Handling the data fusion representational challenge of accepting probabilistic data, i.e., reliability scores of data values and probabilistic similarity value sets, to generate probabilistic global entities with their fused data value alternatives.
- (2) A formal representation of varied data fusion cases that fit the probabilistic data fusion problem. The cases are formulated by observing the origin of the generated data values and their uncertainty.
- (3) Incorporating data lineage into the data fusion model to trace the source of data, which offers additional information for understanding the conflict and uncertainty among the observed data and facilitates the on-demand data fusion process.
- (4) The construction of a data fusion computational method that conditionally calculates the posterior (updated) reliability score for a possible world of true fused value(s).
- (5) The implementation of the data fusion method within a probabilistic data integration system, showing its applicability to on-demand probabilistic fusion for different data conflict types based on the merging of probabilistic entity alternatives.
- (6) Finally, our proposed data fusion approach is designed to cope with a dynamic, volatile, and on-demand fusion environment and to support efficient re-execution after modifications. It provides a new Decision Model (DM) logic that isolates the matching logic from the decision logic, yielding an additional efficiency advantage when dealing with dynamic data. The constructed model can be used in two ways: as a complete probabilistic DM, or as a replacement for the manual DM sub-stage in traditional DM logic.
2. Data Fusion Background and Related Work
2.1. Data Integration
2.2. Uncertainty Modeling
2.3. Data Fusion Related Work
- Conflict Resolution by Considering Accuracy of Source: Data sources have different accuracies, and some are considered more trustworthy than others. Therefore, a more precise decision can be obtained by considering the sources’ reliability, as proposed in [67,69,71,78,79]. This technique requires a probability model that iteratively computes the sources’ accuracy to decide the true values. Panse and Ritter [69] investigated how a set of probabilistic tuples designated as duplicates can be merged by considering instance-level uncertainty and source reliability. In [67], a general optimization framework called CRH is proposed that seamlessly integrates the truth-finding process across various data types.
- Conflict Resolution by Considering Freshness of Sources: Data often change dynamically in the real world. A value may be neither simply true nor false but, more subtly, out of date. Accordingly, the research in [24,80] addressed data sources’ freshness and treated incorrect and out-of-date values differently in the probabilistic model.
- Conflict Resolution by Considering Dependency between Sources: In many domains, especially on the Web, data sources may copy some of their data from each other. Dependency among sources in the form of source copying has been considered in research work such as [23,24,25,70,73,81,82]. In [23,24], source accuracy is considered in the analysis of source dependencies. The work in [70] introduced a two-stage fusion approach based on Markov Logic Networks. Unlike data fusion research that focuses on resolving data conflict on a single attribute, this effort considered the interrelationship of data conflicts on different attributes to improve the accuracy of the fused results. These research efforts are effective in detecting positive correlation on false data but are not effective with positive correlation on true data or with negative correlation. Their models also rely on a single-truth assumption, e.g., that everyone has a unique birthplace. In practice, there can be multiple truths for certain facts; for example, a person may have multiple professions. Accordingly, [25] introduced a data fusion approach with correlation, in which the correlation among sources can be much broader than copying: it can be positive or negative and can arise for different reasons.
3. Preliminaries
3.1. Informative Digital Object (iDO) Concept
3.2. The Best-Effort Data Integration Framework
- is a target data source that belongs to a set of target sources, i.e., A source can be in type of A reference instance is denoted as for
- is a local data source that belongs to a set of local sources, i.e., A source can be in a type of or A local instance is denoted as for
- is a triple of mapping is a set of one-to-one probabilistic matching for each reference attribute value against a local attribute value if initially the similarity value between a pair of main parameter attribute’s values originated from a reference instance with its corresponding local instance is greater than a specified threshold value, i.e., and is the similarity threshold value for considering the matching between the pairs of main parameter’s data values. Thus, for each instance pairs from there is against local source-to-target entities matching in the form of and denotes the pair-wise matching operation.
- is a set of mutual probabilistic global entity-merging alternatives that are generated from merging their possible corresponding iDOs, as they have pair-wise linkage results in the form of a reference instance linked to its possible local instances, i.e., . A probabilistic global entity is a set of instances merged from a reference instance with its possible local instances, i.e., . Each possible entity-merging alternative has an assigned probability value obtained by multiplying the linkage probabilities of its linked instances, i.e., . For each requested attribute and within each possible merge, there could be a multi-valued attribute in which each possible attribute value alternative is assigned a probabilistic data fusion value obtained from updating and conditionally computing the reliability scores of its attribute’s values, i.e., . An possible world may contain single or multi-possible true values, i.e., .
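The statement that a merging alternative's probability is obtained by multiplying the linkage probabilities of its linked instances can be illustrated with a minimal sketch. The function name, the instance labels, and the numeric probabilities below are hypothetical and only serve the illustration; how the paper treats instances that are not linked in an alternative is not covered here.

```python
from math import prod


def merge_alternative_probability(linkage_probs):
    """Multiply the linkage probabilities of the instances linked in one alternative.

    `linkage_probs` is any iterable of probabilities in [0, 1]
    (hypothetical input format, not taken from the paper).
    """
    probs = list(linkage_probs)
    if not all(0.0 <= p <= 1.0 for p in probs):
        raise ValueError("linkage probabilities must lie in [0, 1]")
    return prod(probs)


# Hypothetical pair-wise linkages of one reference instance to two local instances.
alternative = {"r1-p1": 0.9, "r1-p2": 0.8}
probability = merge_alternative_probability(alternative.values())
print(f"alternative probability: {probability:.2f}")
```

Under these assumed values, the alternative that merges both local instances with the reference instance receives the product of the two linkage probabilities.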
3.3. Data Lineage
3.4. Possible-World Semantic
3.5. Probabilistic Entity Linkage Definition
3.6. Probabilistic Data Fusion Assumptions
4. The Probabilistic Data Fusion Problem
4.1. Data Fusion Cases
4.2. Problem Formulation
- A set of participating sources. Each source can be either a local source , or a target source , and contains a set of objects that represent a particular aspect of an RWO.
- A participating object is a triple comprising sets of attribute names, values, and types, i.e., , where the type of a participating attribute is obtained from the type of its corresponding global attribute, i.e., , and . An attribute may have a single data value, i.e., , or multiple data values, i.e., .
- entities merging set obtained from linkage set. consists of all the possible subsets of the entities merging alternatives . A subset is depicted as , and . The assigned instances differ from one alternative to another. Because of the instances assigned to each entity-merging alternative, each attribute may include a different set of data values.
- A confidence degree, referred to as a reliability score and denoted by , indicates the probability that a specific value provided by attribute is true and associated with a particular global attribute. Accordingly, a source reliability score of is associated with each data value.
- A matching function returns a precise (Match or Not Match) decision between a pair of participating data values obtained from iDOs that belong to a particular merging set. For a specific attribute, the generated data values are the union of all distinct values. Each obtained data value could be derived from multiple similar values originating from multiple iDOs, as they are observed from a participating entity with its corresponding instances. The matching outputs produce a domain of data values based on the obtained global schema/attributes and the participating data sources. This domain of data values is depicted below, and Figure 4d shows an example of this domain generation:
  - depicts a populated data value obtained from its corresponding data values that exist in the participating and sources. may assemble a single value of , or a combined data value from its corresponding similar values as .
  - depicts the data lineage of each iDO in the set that has the attribute value. At each generated global data value, indicates the union of the data lineages of those similar data values that originate from the participating data sources and relate to one global data value, i.e., . Because of the similar values, could indicate one or many lineages.
  - or for representation simplicity, represents the set of reliability scores for all iDO data values included in a global data value. Depending on the observed data lineage for the data value, the set may have a single or multiple reliability scores. That is, a generated global attribute value can be obtained from one or more data sources or iDOs, and hence it can be assigned one or more reliability scores.
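The domain generation described above (distinct values, the union of their lineages, and the collected reliability scores) can be sketched as follows. This is only an illustration under stated assumptions: the `GlobalValue` class, the `is_match` function (a case- and whitespace-insensitive comparison standing in for the paper's matching function), and the sample observations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalValue:
    value: str                                   # representative populated data value
    lineage: set = field(default_factory=set)    # ids of the iDOs that supplied a similar value
    scores: list = field(default_factory=list)   # one reliability score per contributing iDO


def is_match(a, b):
    # Hypothetical stand-in for the Match / Not Match decision.
    return a.strip().lower() == b.strip().lower()


def build_domain(observations):
    """Group (ido_id, value, reliability_score) observations into global data values."""
    domain = []
    for ido_id, value, score in observations:
        for gv in domain:
            if is_match(gv.value, value):
                gv.lineage.add(ido_id)       # lineage union for similar values
                gv.scores.append(score)      # keep every contributing reliability score
                break
        else:
            domain.append(GlobalValue(value, {ido_id}, [score]))
    return domain


# Hypothetical City observations from three iDOs.
observations = [("S1.o1", "Amman", 0.8), ("S2.o4", "amman ", 0.6), ("S3.o2", "Irbid", 0.7)]
for gv in build_domain(observations):
    print(gv.value, sorted(gv.lineage), gv.scores)
```

In this sketch, a global value supported by several iDOs carries several lineages and several reliability scores, while a value seen once carries exactly one of each.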
5. The Probabilistic Data Fusion Model
5.1. The Probabilistic Data Fusion Sample Space and Possible-Worlds Generation
5.1.1. The Data Fusion Sample Space Production
  - denotes the event in which the generated data value that belongs to is a true fused value, i.e., , with a probability equal to the union of the original reliability scores for all lineages existing in the event, i.e., .
  - denotes the event in which the generated data value that belongs to is a false fused value, i.e., , with a probability equal to the union of the complements of the reliability scores for all existing in , i.e., .
  - set indicates either the set of original reliability scores or the set of complement scores for its associated data value’s event .
  - denotes the union of the true and false data value events that are contained in a world. Depending on , the set denotes one or more of the actual data value events, such that none, some, or all of them can be combined data values.
  - denotes the set of reliability scores for a world, obtained from the union of the reliability score sets of all the distinct events included in the world. Depending on the participating events in an alternative, i.e., all are true, some are true, or none is true (all false), the reliability score set for a data fusion alternative may contain the original and/or the complement reliability score sets.
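The event definitions above can be made concrete with a small sketch: a true event keeps the original reliability scores of its data value, a false event takes their complements, and the score set of a world is the union of the score sets of its events. The dictionary layout, function names, and sample scores are assumptions made for illustration; the sketch does not reproduce the paper's computational formula.

```python
def true_event(value, scores):
    # The value is assumed to be a true fused value: keep the original scores.
    return {"value": value, "is_true": True, "scores": list(scores)}


def false_event(value, scores):
    # The value is assumed to be a false fused value: use the complement of each score.
    return {"value": value, "is_true": False, "scores": [1.0 - s for s in scores]}


def world_scores(events):
    """Collect the reliability scores of every event contained in one world."""
    collected = []
    for event in events:
        collected.extend(event["scores"])
    return collected


# Hypothetical world: "Amman" is true (supported by two iDOs), "Irbid" is false.
world = [true_event("Amman", [0.8, 0.6]), false_event("Irbid", [0.7])]
print(world_scores(world))
```

A list is used instead of a Python set so that equal scores coming from different lineages are not collapsed into one entry.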
5.1.2. The Obtained Possible-Worlds Based on the Data Fusion Cases
- If the data fusion case is MTC-OWA (C1.2), then multiple data values can be true at the same time, and it is possible that the true value does not exist in the domain. Therefore, the generated possible worlds will be equal to the sample space of as shown below:
- If the data fusion case is MTC-CWA (C1.1), then multiple data values can be true at the same time, and it is not possible to have a true value that does not exist in the domain. Therefore, the generated possible worlds would be as shown below:
- If the data fusion case is ITCC-CWA (C2.1.1), then one data value can be true at a time, and it is not possible to have the true value from outside the domain. Therefore, the generated possible worlds would be as shown below:
- If the data fusion case is ITCC-OWA (C2.1.2), then one data value can be true at a time, and it is possible to have the true value from outside the domain. Therefore, the generated possible worlds would be as shown below:
- If the data fusion case is ITCU-CWA (C2.2.1), then one data value can be true at a time, and it is not possible to have the true value from outside the domain. Therefore, the generated possible worlds would be as shown below:
- If the data fusion case is ITCU-OWA (C2.2.2), then one data value can be true at a time, and it is possible to have the true value from outside the domain. Therefore, the generated possible worlds would be as shown below:
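The way the data fusion cases restrict the set of possible worlds can be sketched by filtering the full sample space of true/false assignments. The sketch rests on two assumptions that are not taken from the paper's formulas: an open-world case is modeled by allowing the world in which no domain value is true (the true value lies outside the domain), and a single-truth case is modeled by allowing at most one true value per world. The function and parameter names are hypothetical.

```python
from itertools import product


def possible_worlds(domain, multi_truth, open_world):
    """Enumerate true/false assignments over distinct domain values, filtered by case.

    multi_truth: several values may be true in the same world (C1.x cases).
    open_world:  the true value may lie outside the domain, modeled here
                 as the world in which every domain value is false.
    """
    worlds = []
    for assignment in product([True, False], repeat=len(domain)):
        n_true = sum(assignment)
        if not multi_truth and n_true > 1:
            continue  # single-truth cases allow at most one true value
        if not open_world and n_true == 0:
            continue  # closed-world cases require a true value inside the domain
        worlds.append(dict(zip(domain, assignment)))
    return worlds


domain = ["Amman", "Irbid", "Zarqa"]  # hypothetical City domain
for label, multi, open_ in [("multi-truth, OWA", True, True),
                            ("multi-truth, CWA", True, False),
                            ("single-truth, CWA", False, False),
                            ("single-truth, OWA", False, True)]:
    print(label, len(possible_worlds(domain, multi, open_)))
```

For a domain of n distinct values this filter yields 2^n worlds for the multi-truth open-world case, 2^n - 1 for multi-truth closed-world, n for single-truth closed-world, and n + 1 for single-truth open-world, under the modeling assumptions stated above.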
5.2. The Probabilistic Data Fusion Computational Method
5.3. Probability to Possibility Transformation Method
6. Proof of Concept: Model Implementation and Discussion
7. Limitations
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Derivation of Data Fusion Computation Formula
- I. From Assumption 2,
- II. From Assumption 3,
- III. From both Assumptions 2 and 3, we have:
- Based on Equations (6) and (8), we get the following:
References
- Almutairi, M.M.; Yamin, M.; Halikias, G. An Analysis of Data Integration Challenges from Heterogeneous Databases. In Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021; pp. 352–356.
- Aggoune, A. Intelligent data integration from heterogeneous relational databases containing incomplete and uncertain information. Intell. Data Anal. 2022, 26, 75–99.
- Jaradat, A.; Halimeh, A.A.; Deraman, A.; Safieddine, F. A best-effort integration framework for imperfect information spaces. Int. J. Intell. Inf. Database Syst. 2018, 11, 296–314.
- Beneventano, D.; Bergamaschi, S.; Gagliardelli, L.; Simonini, G. Entity resolution and data fusion: An integrated approach. In Proceedings of the SEBD 2019: 27th Italian Symposium on Advanced Database Systems, Grosseto, Italy, 16–19 June 2019.
- Sampri, A.; Geifman, N.; Le Sueur, H.; Doherty, P.; Couch, P.; Bruce, I.; Peek, N. Probabilistic Approaches to Overcome Content Heterogeneity in Data Integration: A Study Case in Systematic Lupus Erythematosus. Stud. Health Technol. Inform. 2020, 270, 387–391.
- Zhao, X.; Jia, Y.; Li, A.; Jiang, R.; Song, Y. Multi-source knowledge fusion: A survey. World Wide Web 2020, 23, 2567–2592.
- Zhang, M.; Wang, H.; Li, J.; Gao, H. One-pass inconsistency detection algorithms for big data. IEEE Access 2019, 7, 22377–22394.
- Bakhtouchi, A. Data reconciliation and fusion methods: A survey. Appl. Comput. Inform. 2020, 18, 182–194.
- Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 31.
- Papadakis, G.; Ioannou, E.; Palpanas, T. Entity resolution: Past, present and yet-to-come: From structured to heterogeneous, to crowd-sourced, to deep learned. In Proceedings of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, 30 March 2020.
- Munir, A.; Blasch, E.; Kwon, J.; Kong, J.; Aved, A. Artificial intelligence and data fusion at the edge. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 62–78.
- Stonebraker, M.; Bruckner, D.; Ilyas, I.F.; Beskales, G.; Cherniack, M.; Zdonik, S.B.; Pagan, A.; Xu, S. Data Curation at Scale: The Data Tamer System. In Proceedings of the CIDR, Asilomar, CA, USA, 6–9 January 2013.
- Golshan, B.; Halevy, A.; Mihaila, G.; Tan, W.-C. Data integration: After the teenage years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Raleigh, CA, USA, 14–19 May 2017; pp. 101–106.
- De Sa, C.; Ratner, A.; Ré, C.; Shin, J.; Wang, F.; Wu, S.; Zhang, C. Deepdive: Declarative knowledge base construction. ACM SIGMOD Rec. 2016, 45, 60–67.
- Stonebraker, M.; Ilyas, I.F. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 2018, 41, 3–9.
- Miller, R.J. Open data integration. Proc. VLDB Endow. 2018, 11, 2130–2139.
- Lau, B.P.L.; Marakkalage, S.H.; Zhou, Y.; Hassan, N.U.; Yuen, C.; Zhang, M.; Tan, U.-X. A survey of data fusion in smart city applications. Inf. Fusion 2019, 52, 357–374.
- Blanco, L.; Crescenzi, V.; Merialdo, P.; Papotti, P. Probabilistic models to reconcile complex data from inaccurate data sources. In Proceedings of the International Conference on Advanced Information Systems Engineering, Hammamet, Tunisia, 7–9 June 2010; pp. 83–97.
- Magnani, M.; Montesi, D. A survey on uncertainty management in data integration. J. Data Inf. Qual. (JDIQ) 2010, 2, 1–33.
- Liu, Y.; Bao, T.; Sang, H.; Wei, Z. A Novel Method for Conflict Data Fusion Using an Improved Belief Divergence Measure in Dempster–Shafer Evidence Theory. Math. Probl. Eng. 2021, 2021, 6558843.
- Yuan, Q.; Pi, Y.; Kou, L.; Zhang, F.; Li, Y.; Zhang, Z. Multi-source data processing and fusion method for power distribution internet of things based on edge intelligence. arXiv 2022, arXiv:2203.17230.
- Barbedo, J.G.A. Data Fusion in Agriculture: Resolving Ambiguities and Closing Data Gaps. Sensors 2022, 22, 2285.
- Dong, X.L.; Naumann, F. Data fusion: Resolving data conflicts for integration. Proc. VLDB Endow. 2009, 2, 1654–1655.
- Dong, X.L.; Berti-Equille, L.; Srivastava, D. Data fusion: Resolving conflicts from multiple sources. In Handbook of Data Quality; Springer: Berlin/Heidelberg, Germany, 2013; pp. 293–318.
- Pochampally, R.; Das Sarma, A.; Dong, X.L.; Meliou, A.; Srivastava, D. Fusing data with correlations. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 433–444.
- Ioannou, E.; Nejdl, W.; Niederée, C.; Velegrakis, Y. LinkDB: A probabilistic linkage database system. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 12–16 June 2011; pp. 1307–1310.
- Wang, H.; Ding, X.; Li, J.; Gao, H. Rule-based entity resolution on database with hidden temporal information. IEEE Trans. Knowl. Data Eng. 2018, 30, 2199–2212.
- Halevy, A.; Rajaraman, A.; Ordille, J. Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 12–15 September 2006; pp. 9–16.
- Papadakis, G.; Ioannou, E.; Palpanas, T. Entity Resolution: Past, Present and Yet-to-Come. In Proceedings of the EDBT, Lisbon, Portugal, 26–29 March 2020; pp. 647–650.
- Li, L.; Wang, H.; Li, J.; Gao, H. A Survey of Uncertain Data Management. Front. Comput. Sci. 2020, 4, 162–190.
- Dumpa, I.K.; Kota, R.S.; Sadri, F. Information Integration with Uncertainty: Performance. DBKDA 2014 2014, 15, 15.
- Sarma, A.D.; Dong, X.L.; Halevy, A.Y. Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping; Springer: Berlin/Heidelberg, Germany, 2011; pp. 75–108.
- Deng, D.; Fernandez, R.C.; Abedjan, Z.; Wang, S.; Stonebraker, M.; Elmagarmid, A.K.; Ilyas, I.F.; Madden, S.; Ouzzani, M.; Tang, N. The Data Civilizer System. In Proceedings of the CIDR, Chaminade, CA, USA, 8–11 January 2017.
- Bilke, A.; Bleiholder, J.; Böhm, C.; Draba, K.; Naumann, F.; Weis, M. Automatic Data Fusion with HumMer; Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II: Trondheim, Norway, 2005.
- Bleiholder, J.; Draba, K.; Naumann, F. FuSem-Exploring Different Semantics of Data Fusion. In Proceedings of the VLDB, Vienna, Austria, 23–27 September 2007; pp. 1350–1353.
- Mirza, A.; Siddiqi, I. Data level conflicts resolution for multi-sources heterogeneous databases. In Proceedings of the 2016 Sixth International Conference on Innovative Computing Technology (INTECH), Dublin, Ireland, 24–26 August 2016; pp. 36–40.
- Dong, X.L.; Berti-Equille, L.; Srivastava, D. Integrating conflicting data: The role of source dependence. Proc. VLDB Endow. 2009, 2, 550–561.
- Ioannou, E.; Garofalakis, M. Query analytics over probabilistic databases with unmerged duplicates. IEEE Trans. Knowl. Data Eng. 2015, 27, 2245–2260.
- Papadakis, G.; Ioannou, E.; Niederée, C.; Palpanas, T.; Nejdl, W. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, New York, NY, USA, 8–12 February 2012; pp. 53–62.
- Papadakis, G.; Ioannou, E.; Palpanas, T.; Niederée, C.; Nejdl, W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 2012, 25, 2665–2682.
- Papadakis, G.; Koutrika, G.; Palpanas, T.; Nejdl, W. Meta-blocking: Taking entity resolution to the next level. IEEE Trans. Knowl. Data Eng. 2013, 26, 1946–1960.
- Papenbrock, T.; Heise, A.; Naumann, F. Progressive duplicate detection. IEEE Trans. Knowl. Data Eng. 2014, 27, 1316–1329.
- Papadakis, G.; Svirsky, J.; Gal, A.; Palpanas, T. Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 2016, 9, 684–695.
- Papadakis, G.; Tsekouras, L.; Thanos, E.; Giannakopoulos, G.; Palpanas, T.; Koubarakis, M. The return of jedai: End-to-end entity resolution for structured and semi-structured data. Proc. VLDB Endow. 2018, 11, 1950–1953.
- Panse, F.; Naumann, F. Evaluation of Duplicate Detection Algorithms: From Quality Measures to Test Data Generation. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2373–2376.
- Panse, F.; Düjon, A.; Wingerath, W.; Wollmer, B. Generating Realistic Test Datasets for Duplicate Detection at Scale Using Historical Voter Data. In Proceedings of the EDBT, Nicosia, Cyprus, 23–26 March 2021; pp. 570–581.
- Vidal, M.-E.; Jozashoori, S.; Sakor, A. Semantic data integration techniques for transforming big biomedical data into actionable knowledge. In Proceedings of the 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 5–7 June 2019; pp. 563–566.
- Ayat, N.; Akbarinia, R.; Afsarmanesh, H.; Valduriez, P. Entity resolution for probabilistic data. Inf. Sci. 2014, 277, 492–511.
- Motro, A. Imprecision and uncertainty in database systems. In Fuzziness in Database Management Systems; Springer: Berlin/Heidelberg, Germany, 1995; pp. 3–22.
- Clark, D.A. Verbal uncertainty expressions: A critical review of two decades of research. Curr. Psychol. 1990, 9, 203–235.
- Smets, P. Imperfect information: Imprecision and uncertainty. In Uncertainty Management in Information Systems; Springer: Berlin/Heidelberg, Germany, 1997; pp. 225–254.
- Zimanyi, E.; Pirotte, A. Imperfect knowledge in relational databases. In Uncertainty Management in Information Systems; Motro, A., Smets, P., Eds.; Springer: Boston, MA, USA, 1997; pp. 35–87.
- Suciu, D. Probabilistic databases for all. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Portland, OR, USA, 14–19 June 2020; pp. 19–31.
- Suciu, D.; Olteanu, D.; Ré, C.; Koch, C. Probabilistic Databases, Synthesis Lectures on Data Management; Morgan Claypool: San Rafael, CA, USA, 2011.
- Ceylan, I.I.; Darwiche, A.; Van den Broeck, G. Open-world probabilistic databases: Semantics, algorithms, complexity. Artif. Intell. 2021, 295, 103474.
- Sarma, A.D.; Benjelloun, O.; Halevy, A.; Widom, J. Working models for uncertain data. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 3–7 April 2006; p. 7.
- Chen, R.; Mao, Y.; Kiringa, I. GRN model of probabilistic databases: Construction, transition and querying. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 291–302.
- Dalvi, N.; Suciu, D. Management of probabilistic data: Foundations and challenges. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 26–28 June 2007; pp. 1–12.
- Sen, P.; Deshpande, A.; Getoor, L. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 2009, 18, 1065–1090.
- Mauritz, R.; Nijweide, F.; Goseling, J.; van Keulen, M. Autoencoder-Based Cleaning in Probabilistic Databases. ACM J. Data Inf. Qual 2021. Available online: https://ris.utwente.nl/ws/portalfiles/portal/256093655/arxiv_preprint_2106.09764.pdf (accessed on 26 September 2022).
- Antova, L.; Koch, C.; Olteanu, D. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. VLDB J. 2009, 18, 1021–1040.
- Widom, J. Trio: A System for Integrated Management of Data, Accuracy, and Lineage; Stanford InfoLab: Stanford, CA, USA, 2004.
- Jampani, R.; Xu, F.; Wu, M.; Perez, L.L.; Jermaine, C.; Haas, P.J. Mcdb: A monte carlo approach to managing uncertain data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2008; pp. 687–700.
- De Keijzer, A.; Van Keulen, M. IMPrECISE: Good-is-good-enough data integration. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Washington, DC, USA, 7–12 April 2008; pp. 1548–1551.
- Van Keulen, M.; De Keijzer, A. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB J. 2009, 18, 1191–1217.
- Grohe, M.; Lindner, P. Infinite probabilistic databases. arXiv 2020, arXiv:2011.14860.
- Li, Y.; Li, Q.; Gao, J.; Su, L.; Zhao, B.; Fan, W.; Han, J. Conflicts to harmony: A framework for resolving conflicts in heterogeneous data by truth discovery. IEEE Trans. Knowl. Data Eng. 2016, 28, 1986–1999.
- Xu, J.; Zadorozhny, V.; Grant, J. IncompFuse: A logical framework for historical information fusion with inaccurate data sources. J. Intell. Inf. Syst. 2020, 54, 463–481.
- Panse, F.; Ritter, N. Relational data completeness in the presence of maybe-tuples. Ingénierie Systèmes D’information (2001) 2010, 15, 85–104.
- Yong-Xin, Z.; Qing-Zhong, L.; Zhao-Hui, P. A novel method for data conflict resolution using multiple rules. Comput. Sci. Inf. Syst. 2013, 10, 215–235.
- Cooper, R.; Devenny, L. A Database System for Absorbing Conflicting and Uncertain Information from Multiple Correspondents. In Proceedings of the British National Conference on Databases, Birmingham, UK, 7–9 July 2009; pp. 199–202.
- Dong, X.L.; Gabrilovich, E.; Heitz, G.; Horn, W.; Murphy, K.; Sun, S.; Zhang, W. From data fusion to knowledge fusion. arXiv 2015, arXiv:1503.00302.
- Liu, X.; Dong, X.L.; Ooi, B.C.; Srivastava, D. Online data fusion. Proc. VLDB Endow. 2011, 4, 932–943.
- Singh, Y.; Kaur, A.; Suri, B.; Singhal, S. Systematic Literature Review on Regression Test Prioritization Techniques. Informatica 2012, 36, 379–408.
- Zhang, L.; Xie, Y.; Xidao, L.; Zhang, X. Multi-source heterogeneous data fusion. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May 2018; pp. 47–51.
- Yang, Y.; Gu, L.; Zhu, X. Conflicts Resolving for Fusion of Multi-source Data. In Proceedings of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019; pp. 354–360. [Google Scholar]
- Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. (CSUR) 2009, 41, 1–41. [Google Scholar] [CrossRef]
- Yin, X.; Han, J.; Philip, S.Y. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 2008, 20, 796–808. [Google Scholar]
- Jiang, Z. Reconciling Continuous Attribute Values from Multiple Data Sources. PACIS 2008 Proc. 2008, 264. Available online: https://aisel.aisnet.org/pacis2008/264/ (accessed on 26 September 2022).
- Dellis, E.; Seeger, B. Efficient Computation of Reverse Skyline Queries. In Proceedings of the VLDB, Vienna, Austria, 16 February 2007; pp. 291–302. [Google Scholar]
- Slaney, J.; Paleo, B.W. Conflict resolution: A first-order resolution calculus with decision literals and conflict-driven clause learning. J. Autom. Reason. 2018, 60, 133–156. [Google Scholar] [CrossRef] [Green Version]
- Maunder, M.N.; Piner, K.R. Dealing with data conflicts in statistical inference of population assessment models that integrate information from multiple diverse data sets. Fish. Res. 2017, 192, 16–27. [Google Scholar] [CrossRef]
- Pasternack, J.; Roth, D. Making better informed trust decisions with generalized fact-finding. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011. [Google Scholar]
- Yin, X.; Tan, W. Semi-supervised truth discovery. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 217–226. [Google Scholar]
- Zhao, B.; Rubinstein, B.I.; Gemmell, J.; Han, J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 2012, 5, 550–561. [Google Scholar] [CrossRef] [Green Version]
- Galland, A.; Abiteboul, S.; Marian, A.; Senellart, P. Corroborating information from disagreeing views. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, NY, USA, 3–6 February 2010; pp. 131–140. [Google Scholar]
- Jaradat, A.; Deraman, A.; Idris, S.; Din, L.; Said, N. Pemodelan maklumat biodiversiti: Pendekatan objek digital informative. In Proceedings of the 6th ITB-UKM joint Seminar on Chemistry, Bali, Indonesia, 17–18 May 2005. [Google Scholar]
- Deraman, A.; Yahaya, J.; Salim, J.; Idris, S.; Jambari, D.I.; Komoo, A.J.I.; Leman, M.S.; Unjah, T.; Sarman, M.; Sian, L.C. The development of myGeo-RS: A knowledge management system of geodiversity data for tourism industries. Commun. IBIMA 2009, 8, 142–146. [Google Scholar]
- Peng, L. Research on Data Uncertainty and Lineage Through Trio. In Proceedings of the 2019 The World Symposium on Software Engineering, Wuhan, China, 20–23 September 2019; pp. 73–77. [Google Scholar]
- Roy, S. Uncertain Data Lineage. Encycl. Database Syst. 2018, 4280–4286. [Google Scholar] [CrossRef]
- Kimmig, A.; De Raedt, L. Probabilistic logic programs: Unifying program trace and possible world semantics. In Proceedings of the Workshop on Probabilistic Programming Semantics, Paris, France, 1 January 2017. [Google Scholar]
- Fan, W.; Geerts, F.; Tang, N.; Yu, W. Conflict resolution with data currency and consistency. J. Data Inf. Qual. (JDIQ) 2014, 5, 1–37. [Google Scholar] [CrossRef] [Green Version]
- Klir, G.J. Uncertainty and Information: Foundations of Generalized Information Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006. [Google Scholar]
- Kuicheu, N.C.; Wang, N.; Fanzou Tchuissang, G.N.; Xu, D.; Dai, G.; Siewe, F. Managing uncertain mediated schema and semantic mappings automatically in dataspace support platforms. Comput. Inform. 2013, 32, 175–202. [Google Scholar]
- Doucouliagos, C. A note on the evolution of homo economicus. J. Econ. Issues 1994, 28, 877–883. [Google Scholar] [CrossRef]
Multi-Valued Attribute Cases | Sub-case | CWA Case | CWA Representation | OWA Case | OWA Representation |
---|---|---|---|---|---|
Multi-True values Case (MTC) | | MTC-CWA | | MTC-OWA | |
Inconsistent True values Case (ITC) | Contradiction case (ITCC) | ITCC-CWA | | ITCC-OWA | |
Inconsistent True values Case (ITC) | Uncertainty case (ITCU) | ITCU-CWA | | ITCU-OWA | |
Lineage for w.j Merging Alternatives

Equation | |
---|---|
(4) | |
(5) | |
(6) | |
Data Fusion Cases | Possible Worlds Generation |
---|---|
C1.2: MTC-OWA | |
C1.1: MTC-CWA | |
C2.1.1: ITCC-CWA | |
C2.1.2: ITCC-OWA | |
C2.2.1: ITCU-CWA | |
C2.2.2: ITCU-OWA | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jaradat, A.; Safieddine, F.; Deraman, A.; Ali, O.; Al-Ahmad, A.; Alzoubi, Y.I. A Probabilistic Data Fusion Modeling Approach for Extracting True Values from Uncertain and Conflicting Attributes. Big Data Cogn. Comput. 2022, 6, 114. https://doi.org/10.3390/bdcc6040114