Article

A Probabilistic Data Fusion Modeling Approach for Extracting True Values from Uncertain and Conflicting Attributes

1 Management Information Systems Department, College of Business Administration, American University of the Middle East, Egaila 54200, Kuwait
2 School of Architecture, Computing, and Engineering, University of East London, London E16 2RD, UK
3 Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Terengganu 21030, Terengganu, Malaysia
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2022, 6(4), 114; https://doi.org/10.3390/bdcc6040114
Submission received: 6 July 2022 / Revised: 25 September 2022 / Accepted: 27 September 2022 / Published: 13 October 2022
Figure 1. Data integration levels and their corresponding conflict resolution tasks.
Figure 2. The pair-wise source-to-target matching and formulation process based on the best-effort data integration framework. (a) The participating iDO instances as observed from three structured data sources $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (b) The pair-wise source-to-target matching process for the three structured data sources. (c) The pair-wise source-to-target matching process for the three iDO instances according to their local instance $(pDO_{wx}^{ih})$ or reference instance $(rDO_w^{ih})$ formulations.
Figure 3. Possible-worlds example with associated probability values.
Figure 4. Example of the probabilistic data fusion problem in the presence of multi-valued attributes. (a) The participating iDO instances. (b) The pair-wise probabilistic linkages. (c) The probabilistic entity merging alternatives for the instances in (b). (d) The populated data values for the global attributes of the participating instances in (c).
Figure 5. City data fusion alternatives example with their updated reliability scores based on case C2.1.1. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the City attribute at each possible entity merging alternative, as produced from Figure 5b.
Figure 6. The updated City data fusion alternatives example based on the newly added evidence from the fourth data source. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the City attribute at each possible entity merging alternative, as produced from Figure 6b.
Figure 7. Phone data fusion alternatives example with their updated reliability scores based on case C2.2.1. (a) The participating entities with their original matching tree. (b) The probabilistic entity linkage/merging results. (c) The possible fused data values for the Phone attribute at each possible entity merging alternative, as produced from Figure 7b.

Abstract

Real-world data obtained by integrating heterogeneous data sources are often multi-valued, uncertain, imprecise, error-prone, outdated, and of varying degrees of accuracy and correctness. It is critical to resolve data uncertainty and conflicts to present quality data that reflect actual real-world values; this task is called data fusion. In this paper, we deal with the problem of data fusion based on probabilistic entity linkage and uncertainty management in conflicting data. Data fusion has been widely explored in the research community. However, concerns such as explicit uncertainty management and on-demand data fusion, which can cope with dynamic data sources, have not been well studied. This paper proposes a new probabilistic data fusion modeling approach that attempts to find true data values under conditions of uncertain or conflicting multi-valued attributes. These attributes are generated from the probabilistic linkage and merging alternatives of multi-corresponding entities. Consequently, the paper identifies and formulates several data fusion cases and sample spaces that require further conditional computation using our computational fusion method. The identification is established to fit a real-world data fusion problem, in which there is always the possibility of heterogeneous data sources, the integration of probabilistic entities, single or multiple truth values for certain attributes, and different combinations of attribute values as alternatives for each generated entity. We validate our probabilistic data fusion approach through a mathematical representation based on three data sources with different reliability scores. The validity of the approach was further assessed by implementing it in our probabilistic integration system to show how it can manage and resolve different cases of data conflicts and inconsistencies. The outcome showed improved accuracy in identifying true values due to the association of constructive evidence.

1. Introduction

In this era of information technology advancement, individuals and organizations must manage, process, exchange, and share information in heterogeneous environments [1,2]. To address this need, the foremost requirement is to integrate data originating from autonomous and heterogeneous sources. Data integration is the general process of producing a unified (mediated) repository from a set of heterogeneous and autonomous sources that may contain (semi-)structured or unstructured data [3,4,5,6]. Data integration represents a significant part of the activities of the global data management industry [3,7]. It provides a comprehensive yet concise overview of all available data without requiring the user to view each data source individually [1,3].
The need to access information from different sources through a uniform interface has been the driving force behind much research in the area of data integration [1]. Many approaches and tools have been proposed in recent decades to achieve integration at the schema and instance levels, with varying degrees of accuracy and success [8,9,10,11]. Such proposals have taken the form of data integration systems, data management tools and techniques, and conflict resolution approaches [12,13,14,15]. This diversity of technological solutions results from adaptation to diverse application domains with different needs and degrees of heterogeneity, such as the consideration of scientific data, online data, big data, and the combination of structured and unstructured data [13,16]. Issues related to overcoming data heterogeneity, or the consequences of having the same information stored in different ways, remain major concerns for data integration researchers [3,8,9,10,17,18]. According to Magnani and Montesi [19], Jaradat, Halimeh, Deraman and Safieddine [3], Bakhtouchi [8], and Papadakis, Skoutas, Thanos and Palpanas [9], schema mapping and entity linkage (also known as “duplicate detection”) across diverse sources have been recognized as two major issues that must be addressed in order to integrate data into a single, consistent representation. A third major issue, named “data fusion” or “data conflict resolution”, has been recognized; it requires further research to achieve consistent, integrated data [8,9,10,20]. Data fusion is about finding a true value from contradicting attribute values when integrating data from several sources [17,21,22]. Although the resolution of data conflicts has been largely neglected, several approaches and techniques have evolved [2,17,23,24,25]; many of them attempt to “prevent” data conflicts by focusing solely on the uncertainty of missing values, while others employ a variety of resolution strategies to “resolve” conflicts [3,9,23,24]. Accordingly, the lack of a proper data conflict resolution modeling approach with explicit uncertainty management and correlations has been identified in [3,8,19,24,26]. In particular, an approach is required that tackles the data fusion problem based on probabilistic data linkage results and addresses its representational and computational uncertainty management challenges. This motivates us to study the probabilistic data fusion problem based on source accuracy, probabilistic entity merging, data conflict, and uncertainty management.
This paper looks closely at a probabilistic data fusion approach for extracting true values from uncertain and conflicting attributes. To this end, we propose a new probabilistic data fusion modeling approach that represents and manages the uncertain and conflicting multi-valued attributes of a generated entity named “network digital object” (nDO). An obtained nDO, with its possible alternatives, is generated from the probabilistic merging of multiple corresponding entities. Our approach considers both the Open-World Assumption (OWA) and the Closed-World Assumption (CWA) while constructing the possible worlds of the fused values and their probabilistic formulations. It also considers both single and multiple truth assumptions. Based on these considerations, more meaningful probabilistic data fusion answers can be obtained to fit real-world data integration and fusion scenarios. The main contributions of this paper are as follows:
(1) Handling the data fusion representational challenge of accepting probabilistic data, i.e., reliability scores of data values and probabilistic similarity value sets, to generate probabilistic global entities with their fused data value alternatives.
(2) A formal representation of varied data fusion cases to fit the probabilistic data fusion problem. The data fusion cases are formulated by observing the origin of the generated data values and their uncertainty.
(3) Incorporating data lineage into the data fusion model to trace the source of data, which offers additional information with which to understand the conflict and uncertainty among the observed data and facilitates the on-demand data fusion process.
(4) The construction of a data fusion computational method that conditionally calculates the posterior/updated reliability score for a possible world of true fused value(s).
(5) The implementation of the data fusion method within the probabilistic data integration system, showing the applicability of addressing on-demand probabilistic fusion for different data conflict types based on the merging of probabilistic entity alternatives.
(6) Finally, our proposed data fusion approach is designed to cope with a dynamic, volatile and on-demand fusion environment and to support efficient modification and re-execution. It provides a new Decision Model (DM) logic that isolates the matching logic from the decision logic, yielding an additional efficiency advantage when dealing with dynamic data. The constructed model offers a two-fold benefit: it can work as a complete probabilistic DM or replace the manual DM sub-stage in traditional DM logic.
The rest of the paper is organized as follows. Section 2 briefly reviews some literature and related work. Section 3 introduces the proposed approach. Section 4 describes our data fusion problem based on the probabilistic integration of heterogeneous data sources. In Section 5, we develop our probabilistic data fusion model and its constructed computational method. Section 6 presents a proof of concept through system implementation and mathematical proof to illustrate the validity of the proposed model. Section 7 outlines the limitations related to the proposed approach. Finally, the conclusions of this paper are given in Section 8, and some future directions for our work are identified.

2. Data Fusion Background and Related Work

In this section, we first highlight data integration in general and its related conflict resolution tasks. After that, uncertainty modeling and probabilistic databases are discussed.

2.1. Data Integration

The field of data integration has expanded in several directions as the technology and application landscapes have evolved over the years [13]. This expansion can be observed in the diverse schema and/or instance integration challenges and solution proposals. These proposals include data integration systems, conflict resolution approaches, and data management tools and techniques, and they have evolved from precise integration systems to uncertainty management mechanisms [13]. A diverse range of technological solutions has also been required to adapt to the needs and environments of a wide range of application domains, such as the consideration of scientific data, online data, or big data, and the combination of structured and unstructured data [3,13,16]. The data integration field has been motivated toward the utilization and development of combined techniques from the artificial intelligence and data uncertainty management fields [13]. The contributions of these fields were to propose data representation formalisms for the contents of heterogeneous data sources and to manage and resolve the heterogeneity of the integrated outcomes [13].
As a general view, data integration is the overall process of producing a unified (mediated) repository from a set of multiple heterogeneous and autonomous sources that may contain (semi-)structured or unstructured data [3,4,5,27]. Data integration contributes a significant solution to the global data management industry and the data quality field [3,7]. It facilitates information access and reuse through a single point, enables users to focus on specifying what they want while freeing them from the tedious task of determining how to obtain the answers, and helps users gain more comprehensive information that satisfies their needs and interests [28]. Traditional integration goals are to increase the completeness, conciseness and correctness of the information available to users and to free them from knowledge about the data sources themselves [8,23]. However, since data from various independent sources are integrated, schema and instance uncertainty are naturally inevitable at all stages of the integration process; hence, they need to be handled and managed accordingly [8,19,29]. To meet this general requirement, data integration must perform two integration levels (i.e., the schema and instance integration levels) and three formidable conflict resolution tasks (i.e., schema mapping, entity linkage and data fusion). Figure 1 shows these integration levels and their corresponding conflict resolution tasks as the integration process proceeds from multiple heterogeneous data sources.
The data integration industry has approached these tasks of generating integrated outcomes with two broad resolution strategies: (i) the precise integration strategy and (ii) the uncertainty integration management strategy. Classical data integration (CDI) approaches, such as traditional mediation, Peer-to-Peer (P2P), and warehousing, are based on the first strategy. This strategy aims to provide high-quality end data for well-defined and well-understood precise integration tasks but does not cope well with uncertainty management. From the traditional perspective, CDI treats uncertain data as something that can be avoided through unrealistically simplified and restrictive assumptions, such as shared universal key identification and the availability of enough data [13]. Moreover, uncertain information is ignored, avoided, manually resolved, or removed at some point through defuzzification, viewing only consistent answers, using predefined thresholds, or semi-automatically choosing only the most likely outcome [8,18,19]. The recent generation of data integration, represented by the dataspace vision [20] and pay-as-you-go data management, with their principles of automation, probabilistic data representation and manipulation, and best-effort integration, is based on the uncertainty management strategy. This strategy and its approaches recognize that uncertainty is inherently unavoidable in the integration process, yet uncertain information is valuable and more useful retained than discarded; hence, it is an important result to be managed and presented to the users [5,7,30,31,32].
Several data integration systems have been developed by the research community and the commercial industry, such as Big-Gorilla [13], Tamer [12,15], Deepdive [14], Data Civilizer [33], HUMMER [34], and FUSEM [35]. For instance, Big-Gorilla is an open-source system consisting of data integration and preparation operator modules such as string matching, entity matching, and schema matching. HUMMER and FUSEM, in turn, can be viewed as data fusion systems that contain tools for resolving data conflicts (i.e., addressing the data fusion task) [13,36]. Moreover, research approaches related to probabilistic schema integration, entity linkage and data fusion have also been developed by the data integration research community. For instance, Dong and her collaborators proposed a probabilistic schema integration approach in [32], as well as data fusion approaches that consider data source correlation in the form of data duplication [23,24,25,37]. Research efforts related to probabilistic entity linkage and resolution can be found in [9,10,38,39,40,41,42,43,44,45,46].
The uncertainty management strategy and its corresponding resolution approaches were proposed for several reasons and motivations identified in the data integration literature [13]. One reason is information space complexity [5,11,27,47]; this complexity relates to the expanding growth, heterogeneous nature, and diverse data quality of the related online information sources. Another reason is the limitations of the precise integration strategy [18,32]: precise integration remains a major undertaking that requires significant upfront human intervention. The data integration literature has documented that CDI methods entail high setup and maintenance costs, information loss, and misleading, error-prone results. CDI methods are also less effective for numerous or rapidly changing data resources [26,32,48]. Finally, the challenging nature of the instance integration level, which comprises the entity linkage and data fusion tasks for resolving the entity identification and data conflict problems, is another motive for adopting the second strategy [9,18]. Instance integration serves the main purpose of the integration process: providing users with comprehensive and complementary information about their requested entities through the generation of global entities [3]. According to the data integration literature, more efforts have been focused on schema integration and entity linkage, whereas fewer efforts have been devoted to the data fusion task (or data conflict resolution) based on uncertainty management [6,20,27].

2.2. Uncertainty Modeling

Uncertainty is used as a general and specific label for information items that represent multiple possible states of the external world when it is unknown which state corresponds to the actual situation of the world. Such information therefore cannot be asserted with complete confidence of being correct. Uncertainty may concern the actual data relating to the occurrence of an event in the world and the relations among various events. Under this label, several distinct types can be observed, such as the imprecise, incomplete, erroneous, vague, ambiguous and redundant natures of the integrated datasets, as explained further in [49]. In fact, there have been many attempts to classify the various possible types of uncertain information, such as those presented in [50,51]. In this paper, we are concerned with four basic types, i.e., imprecise, inconsistent, incomplete and uncertain data, as illustrated next.
Imprecision means the available information is not specific enough and denotes a set of possible alternatives, where the real value is one of those alternatives. For instance, “Ali’s age is either 33 or 38” (disjunction); it is 34 ± 1 years (error margins); or we simply do not know Ali’s age (Ali’s age is NULL). A null value usually denotes that the information is unavailable (incomplete). However, it could be regarded as imprecise information drawn from the entire domain of legal values; it may also indicate unknown or inapplicable values. Inconsistency denotes that the available information reflects conflicting representations of a particular aspect that cannot be true at the same time, for example, “Ali’s age is between 33 and 38” and “Ali’s age is 40”. Finally, uncertainty expresses doubt about whether our information about a particular aspect is true, as in “Ali’s age is probably 33”. The doubt can be quantified with a confidence value, as in “Ali’s age is 33 with probability 0.6” [30,51,52].
Imprecision and inconsistency are relevant to the content of an attribute value of a data object, whereas uncertainty is relevant to the degree of truth of its existence and its attribute values. Therefore, imprecision and inconsistency induce uncertainty, and probabilistic information can combine uncertainty with imprecision or inconsistency. For instance, consider the above examples “33 or 38” and “33 with probability 0.6”. The first example is a form of impreciseness, whereas the second is a form of uncertainty. By combining the two, “the age of Ali is 33 with a probability of 0.6 and 38 with a probability of 0.4”, the information expresses certainty in terms of a probability measure on each alternative, and hence uncertainty is reduced [19].
The AI and database communities have considered several approaches, such as rule-based systems, Dempster–Shafer theory, and fuzzy sets, for modeling uncertain information [21,30,53,54]. For instance, some efforts used NULL value representation to replace all partial information about the imperfect parts of a description, and hence uncertainty was ignored. Others suggested the use of meta-information (soundness and completeness) to separate the perfect from the imperfect portion of the information, and hence uncertainty was avoided. In addition, solutions were developed using certainty factors, such as weights, to denote confidence about the stated facts and rules that describe real-world objects. However, declaring confidence factors is a subjective issue, and objections have been raised showing that the lack of firm semantics may lead to unintuitive results [49].
In addition, approaches based on probability theory were developed to provide a model that supports alternative results [30]. The probabilistic model has been dominant in the database community for developing probabilistic databases that capture, manage and represent uncertain data [53,54,55]. These proposals were mainly based on symbolic-qualitative and/or numeric-quantitative models [53]. Among them are the uncertain data model, the probabilistic graphical model, and the possible-world model [30,53]. One of the most common uncertain data models is the extension of the relational model to include the c-table, ?-table, or-set table, or or-set-? table, which represent uncertainty at the tuple and attribute levels [30,56]. The probabilistic graphical model handles the database as a two-tuple (R, P), where R is a set of relations and P is a probabilistic graphical model presented either as a Bayesian network or a Markov network [30]. The dependency relationships between uncertain data can be addressed in this graphical representation. Each tuple or attribute in R is associated with a random variable that represents its existence probability, and each node of P is a random variable that may correspond to a certain tuple in R; the edges represent the relationships between random variables [30,57]. The typical database with uncertainty management is the probabilistic database, where the possible-world model is the basis for such probabilistic representation [30,53,54,58]. Each possible-world instance corresponds to a certain database, where the uncertain attributes take certain values satisfying the constraints [30,55]. Typical uncertain databases and possible-world instances are formed due to the existence of attribute-level uncertainty with continuous attribute values, attribute-level uncertainty with discrete attribute values, and record-level uncertainty with or without generating rules [30,59]. Correspondingly, various probabilistic databases have been proposed, such as DCAE [60], MayBMS [61], Trio [62], MCDB [63], and IMPrECISE [64,65]. For example, MayBMS, Trio, DCAE, and MCDB are probabilistic database approaches for uncertain relational data, while IMPrECISE is an approach for the uncertain XML data format [60,66]. Moreover, MayBMS focuses on tuple-level uncertainty, while MCDB and DCAE focus on attribute-level uncertainty. The possible-world model is the mathematical backbone of these approaches, and they are mainly designed for a single type of data [66,67].
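As a concrete illustration of the possible-world semantics sketched above, the following minimal Python sketch (our own, not tied to any particular system such as MayBMS or Trio) represents attribute-level uncertainty as discrete (value, probability) alternatives and enumerates the resulting possible worlds, assuming the alternatives of different attributes are independent.

```python
from itertools import product

# Minimal sketch: attribute-level uncertainty as a list of (value, probability)
# alternatives per attribute, expanded into possible worlds under the
# possible-world model (independence between attributes is assumed).
uncertain_tuple = {
    "name": [("Ali", 1.0)],           # certain attribute
    "age":  [(33, 0.6), (38, 0.4)],   # uncertain, discrete alternatives
}

def possible_worlds(tup):
    """Enumerate possible worlds and their probabilities."""
    attrs = list(tup)
    for combo in product(*(tup[a] for a in attrs)):
        world = {a: v for a, (v, _) in zip(attrs, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

for world, prob in possible_worlds(uncertain_tuple):
    print(world, round(prob, 3))
# {'name': 'Ali', 'age': 33} 0.6
# {'name': 'Ali', 'age': 38} 0.4
```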
However, many systems based on probabilistic databases still have certain semantic deficiencies that limit their potential applications [55]. This deficiency is mainly related to adopting the Closed-World Assumption (CWA) only, rather than also considering the Open-World Assumption (OWA), while creating the probabilistic database. Thus, these probabilistic databases lack a suitable handling of data incompleteness [55]. CWA assigns a probability of zero to facts that do not appear in the database. This means that CWA assumes the facts of a specific database domain are complete, which is not always true in real-world applications.
In contrast, the OWA does not assume the completeness of a specific database domain. It considers the possibility that an unknown value, or values outside the database domain, may be true; hence, an unknown fact (also known as an open fact) that does not appear in a specific database might have a probability greater than zero. Accordingly, incompleteness can be represented and handled in such probabilistic database efforts [55]. Our proposed data fusion approach considers both assumptions (i.e., OWA and CWA) while constructing the possible worlds of the fused values and their probabilistic formulations, thus addressing these concerns. Based on this consideration, more meaningful probabilistic data fusion answers can be obtained to fit real-world data integration and fusion scenarios.

2.3. Data Fusion Related Work

Data fusion is the process of reformulating objects from multiple observations [68]. It can be defined as the process of generating a single fused representation over two or more probably linked entities, which are merged and replaced in a central repository by a new entity whose attribute values are the union of their respective sets [11,22,26,69]. It can also be defined as providing users with more trustworthy data of higher quality [26,70]. As a result, for each obtained attribute there are likely to be several possible values, each of which originated from independent or correlated sources and is asserted with a different degree of confidence by correspondents with differing reliability levels [71]. Therefore, rather than taking an ad hoc or precise decision on which version is correct, it is better to absorb all opinions and make a probabilistic estimation of the likely correct one [3,18,71].
Estimating the truth while forming this fused representation requires managing several types of conflict and uncertainty at the entity and data levels, since the new entity may have multiple possible worlds [69]. In fact, fusing multiple valid or invalid attribute values from probabilistically merged entities represents a major reconciliation challenge [3,8,19]. Therefore, proposing advanced fusion methods capable of accepting and handling uncertain data from probabilistic entity merging, considering the role of source dependence, and storing versions of data values with an associated likelihood of correctness is a crucial issue in the current age of information integration and data quality research [3,8,11,17,19,21,26].
Over the last two decades, intensive efforts toward proposing precise and/or probabilistic data fusion approaches have been observed in the data integration community [8]. Related works that address the management of uncertain data values through probabilistic data fusion solutions can be found in [8,18,21,24,25,67,68,69,70,72,73,74,75,76]. This intensive effort has been driven by the importance of obtaining quality information for the intended users. The work by Bleiholder and Naumann [77] and Dong and Naumann [23] includes early surveys of data fusion approaches; typical data fusion approaches were mainly rule-based. Based on the results they produce, those approaches can also be divided into deciding and mediating strategies: a deciding strategy picks a preferred value among the existing values, while a mediating strategy produces an entirely new value, such as the average of conflicting numbers. More recently, Bakhtouchi [8] and Xu, Zadorozhny and Grant [68] presented some of the latest surveys in the field of data fusion. In their papers, three data fusion or conflict handling strategies were highlighted based on the earlier categorization of Bleiholder and Naumann [77]: conflict ignorance, conflict avoidance and conflict resolution [8,23,77]. Conflict ignorance performs no resolution over conflicting data; sometimes the existence of conflicts is not even recognized, so inconsistent results are highly likely to be produced. Pass It On and Consider All Possibilities are two representative ignoring strategies. Conflict avoidance strategies acknowledge the existence of possible conflicts, yet they do not perform an individual resolution for each conflict. Instead, a single decision is made, such as preferring one source, and applied to all conflicts; in this case, only the data coming from the preferred source are retained. Take the Information, No Gossiping, and Trust Your Friends are all avoiding strategies. Finally, conflict resolution strategies provide the means for an individual fusion decision for each conflict. Such a decision can be instance-based, i.e., it regards the actual conflicting values, using the Cry with the Wolves, Roll the Dice, or Meet in the Middle strategies, or it can be metadata-based, such as the Keep Up to Date strategy.
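As a small illustration of the difference between the deciding and mediating styles described above, the following sketch (with hypothetical values, not drawn from the surveyed systems) either picks one of the existing conflicting values or produces an entirely new averaged one.

```python
from collections import Counter
from statistics import mean

# Illustrative sketch of the two broad resolution styles described above.
conflicting_ages = [33, 33, 38, 34]

# Deciding strategy: pick one of the existing values, e.g. the most common one
# (voting in the spirit of "Cry with the Wolves").
decided = Counter(conflicting_ages).most_common(1)[0][0]   # -> 33

# Mediating strategy: produce an entirely new value, e.g. the average
# (in the spirit of "Meet in the Middle").
mediated = mean(conflicting_ages)                          # -> 34.5

print(decided, mediated)
```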
According to [23,37], none of the above strategies is perfect, and they all fall short in some or all of the following aspects. (i) Data sources are of different quality, and we often trust data from more accurate sources, but accurate sources can make mistakes as well; thus, neither treating all sources the same nor taking all data from accurate sources without verification is appropriate. (ii) The real world is dynamic, and the true value often evolves over time, but it is hard to distinguish an incorrect value from an outdated one; thus, taking the most common value may yield an out-of-date value, while taking the most recent value may yield a wrong value. (iii) Data sources can copy from each other, and errors can propagate quickly; thus, ignoring possible dependency among sources can lead to biased decisions. Finally, (iv) linkage decisions might be probabilistic, in which case merging depends on the linkage context; ignoring these differences leads to conflict avoidance and the consideration of wrong values [26,69]. For these reasons, researchers were motivated to develop advanced strategies that rely on uncertainty management and probability modeling to resolve conflicts. Such advanced resolution strategies are based on considering the accuracy, freshness and/or correlation of the participating sources. Three categories of advanced fusion strategies, which conditionally consider source accuracy, data freshness, and source correlation in terms of copying data from each other, have been outlined in [8,25].
  • Conflict Resolution by Considering Accuracy of Source: Data sources have different accuracies, and some are considered more trustworthy. Therefore, a more precise decision can be obtained by considering sources’ reliability, as proposed in [67,69,71,78,79]. Using this technique requires a probability model that iteratively computes the sources’ accuracy to decide the true values. Research by Panse and Ritter [69] investigated how a set of probabilistic tuples designated as duplicates can be merged by considering uncertainty at the instance level and source reliability. In [67], a general optimization framework called CRH was proposed that seamlessly integrates the truth-finding process across various data types.
  • Conflict Resolution by Considering Freshness of Sources: Data often change dynamically in the real world, and a value that was once true can become out-of-date. Accordingly, the research in [24,80] addressed data source freshness and treated incorrect and out-of-date values differently by describing their probabilistic model accordingly.
  • Conflict Resolution by Considering Dependency between Sources: In many domains, especially on the Web, data sources may copy some of their data from each other. Dependency among sources in the form of source copying has been considered in research such as [23,24,25,70,73,81,82]. In [23,24], source accuracy is considered in the analysis of source dependencies. The research in [70] introduced a two-stage fusion approach based on Markov Logic Networks; unlike most data fusion research, which focuses on resolving data conflict on a single attribute, this effort considered the interrelationship of data conflicts on different attributes to improve the accuracy of the fused results. These research efforts are effective in detecting positive correlation on false data but are not effective with positive correlation on true data or with negative correlation. Their models also rely on a single truth assumption, e.g., that everyone has a unique birthplace. In practice, there can be multiple truths for certain facts; for example, a person may have multiple professions. Accordingly, [25] introduced a data fusion approach with correlation, i.e., the correlation among sources can be much broader than copying; it can be positive or negative and arise for different reasons.
Furthermore, these advanced resolution solutions were classified in [24] into three categories: voting, quality-based, and relation-based. Voting is the main baseline category, where the value with the highest vote count, i.e., provided by the largest number of data sources, is selected from the conflicting values. The quality-based category is based on evaluating data sources’ trustworthiness; accordingly, values voted for by high-quality data sources receive higher weight. Based on how source trustworthiness is measured, the quality-based category can be subdivided into Web-link-based methods, IR-based methods, graphical-model methods, and Bayesian methods. Approaches based on the Bayesian method, such as [23,24,37,78], measure a source’s trustworthiness by its accuracy, which primarily indicates the probability of each of its values being true [19]. The relation-based category extends the quality-based methods by considering the relationships between the sources; the relationship can be in the form of copying between a pair of sources or correlation among a subset of sources. Further discussion of these advanced methods and approaches can be found in [6,20,23,24,67,72,78,83,84,85,86].
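The following minimal sketch illustrates the quality-based (accuracy-weighted) voting idea for a single-truth attribute. The source names, claims and accuracies are assumed for illustration; the surveyed Bayesian methods [23,24,37,78] estimate source accuracy iteratively from the data rather than taking it as given.

```python
from collections import defaultdict

# Accuracy-weighted voting sketch (assumed accuracies, single-truth attribute).
claims = {"S1": "Cairo", "S2": "Cairo", "S3": "Giza"}   # source -> claimed value
accuracy = {"S1": 0.9, "S2": 0.6, "S3": 0.7}            # assumed source accuracies

scores = defaultdict(float)
for source, value in claims.items():
    scores[value] += accuracy[source]    # each vote weighted by source accuracy

total = sum(scores.values())
posterior = {value: score / total for value, score in scores.items()}
print(max(posterior, key=posterior.get), posterior)
# 'Cairo' wins with normalized score 1.5 / 2.2 ≈ 0.68
```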
Nevertheless, most of these advanced methods follow an iterative approach to truth-finding, data/source quality evaluation, and source correlation detection [72]. They are primarily unsupervised or semi-supervised, as they do not use training data [84]. They mostly consider probabilistic handling of the merged uncertain data based on precise, certain linkage and entity merging results [69]. As a result, their fusion techniques were mainly designed to support offline fusion and static information that does not evolve over time [8,24,70,73]; the research in [73] built the first online data fusion system, called SOLARIS. Moreover, they mostly resolve conflicting data by relying on either a single truth assumption or a multiple truth assumption [25]. In practice, an entity may contain attributes with a single truth value and/or multiple truth values; for example, a person may have a single birth date and multiple professions. Furthermore, fusing multiple valid or invalid attribute values arising from probabilistically linked and merged entities, and over dynamic and volatile information sources, has only rarely and individually been addressed, in [69,70,73].
By considering source accuracy, the probabilistic nature of entity linkage, and the on-demand data fusion process, our data fusion approach is related to the approaches proposed in [20,26,67,69,70,73,79]. However, it differs from the above works in the following aspects. First, while our approach follows the quality-based strategy, it manages and resolves two major fusion cases: multiple true values (i.e., the multiple truth assumption) and inconsistent true values (i.e., the single truth assumption), under both closed-world and open-world assumptions. Second, the data fusion is processed over probabilistic entity linkages and multiple merging alternatives. Our approach can also support on-demand fusion and cope with dynamic and volatile conflicting and uncertain data by keeping and storing the reliability scores alongside the actual data values and by treating the matching and computation processes as a chain of separate processes, each with its own input and output data [87]. Upon a user request, the fusion process for a selected attribute of a probabilistic global entity is initiated by matching its values from their corresponding sources and entities to form a global domain of merged attribute value pairs, after which the computational fusion is executed separately.

3. Preliminaries

This section introduces some background of the proposed model, including the definitions of the informative digital object (iDO) concept, the integration framework, the lineage, and the rules of possible-world semantic.

3.1. Informative Digital Object (iDO) Concept

The term informative Digital Object (iDO) describes a Real-World Object (RWO), such as a person or a place, that has a distinct local identity. The iDO is the foundation for the network Digital Object (nDO) concept, which represents a particular global entity generated from the probabilistic merging of its corresponding local iDOs.
We define the iDO concept as a uniquely identifiable container that aggregates and presents the relevant components of multiple entities, in terms of their actual content and as a compound digital existence, by mapping the relationships among them into a coherent and cohesive representation of the information context of a specific RWO [87]. The iDO is an extended content-based model of the Digital Object (DO) concept, consisting of diverse information units constructed from a sequence of transformations of meta-level components of relevant data. The actual contents are grouped under varied categories to give an organized representation of an RWO’s features and to convey meaningful and comprehensive information. Thus, multi-part components can be managed and viewed as a single entity [87].
In the iDO model, an object’s features are classified under three major categories, i.e., identification, descriptive, and supportive, where each category consists of various subcategories according to the correlation and mapping type between a specific attribute and its corresponding iDO [3]. This categorization provides domain-independent entity resolution rules [3]. Detailed information about the iDO concept for representing entities from the participating sources within a probabilistic data integration framework, about the attribute types and categories, and about the possible-world generation rules and mapping types can be found in our previous research papers [3,88]. The iDO representation of RWOs is given in the following definitions.
Definition 1.
The data sources to be integrated form a set of $n$ sources, where each source is either (semi-)structured ($tp_1$) or unstructured ($tp_2$), i.e., $S = \{S_1, S_2, \ldots, S_n\} : \forall i \in [1, n],\; S_i.tp \in \{tp_1, tp_2\}$. This representation helps distinguish the matching comparison process between the two types. For example, the participating $tp_1$ sources are assumed to contain unique objects; hence, no internal matching comparison is required, as no duplicated objects can be found. In contrast, $tp_2$ sources may contain duplicated objects, so the comparison process also proceeds internally. Detailed information about this pair-wise source-to-target matching process can be found in [3].
Definition 2.
Entities of the participating data sources are assumed to belong to the same domain and to be modeled as iDOs representing persons, restaurants or any other RWOs. Each source $S_i$ contains a set of $m$ iDOs, where $m$ may differ across the participating sources, i.e., $S_i = \{iDO^{i1}, iDO^{i2}, \ldots, iDO^{im}\} : 1 \le h \le m$. An $iDO^{ih}$ is a triple comprising a set of attribute names $(A_k^{ih} : 1 \le k \le c)$, attribute values $(a_{k.g}^{ih} : 1 \le g \le q)$, and attribute types $(ty \in \{Idn, Desc, Supp\})$, i.e., $A_k^{ih}.a_{k.g}^{ih}.ty$. These attributes describe the shared features of iDOs, where an attribute may contain single or multiple data values and belongs to a specific category that encodes the attribute type $(ty)$. According to [3], a specific mapping rule based on the attribute type can be applied to obtain the possible world.
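To make the structure in Definition 2 concrete, the following minimal Python sketch (illustrative only, not the authors’ implementation; the class and field names are our own) models an iDO as a container of typed, possibly multi-valued attributes drawn from {Idn, Desc, Supp}.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of the iDO structure in Definition 2.
@dataclass
class Attribute:
    name: str            # A_k
    values: List[str]    # a_{k.g}, single- or multi-valued
    ty: str              # 'Idn', 'Desc' or 'Supp'

@dataclass
class IDO:
    source_id: str       # lineage: which S_i the object comes from
    object_id: str       # identifier h within that source
    attributes: List[Attribute] = field(default_factory=list)

restaurant = IDO("S1", "iDO_11", [
    Attribute("Name", ["Blue Nile"], "Idn"),          # main matching parameter
    Attribute("City", ["Cairo"], "Desc"),
    Attribute("Phone", ["111-222", "111-333"], "Supp"),
])
```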

3.2. The Best-Effort Data Integration Framework

Our probabilistic data fusion approach is formulated on the best-effort data integration framework presented in [3]. This framework assumes that global schema generation and its mapping production are performed a priori. Hence, a global schema $(GS)$ consisting of global attributes is obtained before the instance integration process is initiated, i.e., $A_k^{ih} \in (A_1^{GS}.ty, A_2^{GS}.ty, \ldots, A_c^{GS}.ty)$. These generated global attributes correspond to specific attributes that exist in the participating data sources. Each $A_k^{GS}$, with its mapped data-source attributes, must belong to a specific attribute type $(ty)$, such that one of these global attributes represents the main parameter, e.g., the name of an author or a restaurant. The type of this main parameter attribute is thus $(ty = Idn)$.
Despite the precise integration at the schema level, the framework treats instance integration as a non-trivial process that requires probability management capabilities. Therefore, a probabilistic global entity named network digital object $(nDO_w : 1 \le w \le z)$ is added to the traditional framework formulation. The framework aims to remove manual interventions by allowing less precise but automatic instance integration (i.e., entity linkage and data fusion) answers. It also corresponds to the pair-wise source-to-target matching process. In this process, a participating data source $(S_i)$ can act as a target data source $(Ts_t : Ts_t \in S)$ or a local source $(Ls_s : Ls_s \in S)$. Accordingly, an $iDO^{ih}$ that belongs to a $Ts_t$ data source is denoted as a reference entity/instance, i.e., $rDO_w^{ih} : 1 \le w \le z$, while an $iDO^{ih}$ that belongs to an $Ls_s$ data source is denoted as a possible local entity/instance that may link with a specific $rDO_w^{ih}$, i.e., $pDO_{wx}^{ih} : 1 \le x \le y$. This matching process allows a set of possible local instances to be compared and probabilistically matched against a specific reference instance based on their shared attribute values. In correspondence to the global schema generation, each participating $iDO^{ih}$ must have an attribute that represents the main parameter for the matching between reference and local instance pairs, i.e., $(A_k^{ih}.Idn)$. Figure 2 illustrates the matching formulation and process between three iDOs obtained from three structured data sources $(tp = tp_1)$.
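The following short sketch illustrates the main-parameter matching step just described: a reference-to-local pair is retained only when the similarity between their Idn values reaches the threshold $\delta$. The similarity measure (difflib’s SequenceMatcher) and the threshold value are placeholders of our own choosing; the framework does not prescribe a particular measure.

```python
from difflib import SequenceMatcher

# Sketch of the pair-wise main-parameter (Idn) matching step with threshold delta.
DELTA = 0.75

def match_main_parameter(reference_name: str, local_name: str, delta: float = DELTA):
    """Return the similarity if the pair qualifies for probabilistic linkage, else None."""
    similarity = SequenceMatcher(None, reference_name.lower(), local_name.lower()).ratio()
    return similarity if similarity >= delta else None

print(match_main_parameter("Blue Nile Restaurant", "Blue Nile Rest."))    # kept (similarity >= delta)
print(match_main_parameter("Blue Nile Restaurant", "Green Valley Cafe"))  # None (pair discarded)
```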
From the matching process, a probabilistic pair-wise entity linkage result is obtained as $rDO_w = \{rDO_w^{ih} : pDO_{w1}^{ih}[\Pr(L_{w1})], pDO_{w2}^{ih}[\Pr(L_{w2})], \ldots, pDO_{wy}^{ih}[\Pr(L_{wy})]\} : 1 \le x \le y,\; 0 \le \Pr(L_{wx}) \le 1$, where $\Pr(L_{wx})$ is the linkage probability that a pair of $(rDO_w^{ih} : pDO_{wx}^{ih})$ instances represents the same RWO. By considering the possible-worlds generation rules and the probability distribution, the probabilistic entity merging can be computed to generate a global merged entity denoted as $(nDO_w)$. Correspondingly, the best-effort data integration framework consists of four components $(Ls_s, Ts_t, M_{s,t}, nDO_w)$, where:
  • $Ts_t$ is a target data source that belongs to a set $Ts$ of $n$ target sources, i.e., $Ts = (Ts_1, Ts_2, \ldots, Ts_n) : t \in [1, n],\; Ts_t \in Ts \subseteq S$. A $Ts_t$ source can be of type $tp_1$ or $tp_2$. A reference instance is denoted as $rDO_w^{ih} = \{a_1^{ih}.idn, a_{2.1}^{ih}.ty, a_{2.2}^{ih}.ty, \ldots, a_{2.q}^{ih}.ty, a_{3.1}^{ih}.ty, \ldots, a_{c.q}^{ih}.ty\} : g = 1$ for $a_{1.g}^{ih}.idn$.
  • $Ls_s$ is a local data source that belongs to a set $Ls$ of $n$ local sources, i.e., $Ls = (Ls_1, Ls_2, \ldots, Ls_n) : s \in [1, n],\; Ls_s \in Ls \subseteq S$. An $Ls_s.tp$ source can be of type $tp_1$ or $tp_2$. A local instance is denoted as $pDO_{wx}^{ih} = \{a_1^{ih}.idn, a_{2.1}^{ih}.ty, a_{2.2}^{ih}.ty, \ldots, a_{2.q}^{ih}.ty, a_{3.1}^{ih}.ty, \ldots, a_{c.q}^{ih}.ty\} : g = 1$ for $a_{1.g}^{ih}.idn$.
  • $M_{s,t}$ is a triple $(Ts_t.rDO_w^{ih}.(a_{k.g}^{ih}.idn, \ldots, a_{c.q}^{ih}.ty);\; Ls_s.pDO_{wx}^{ih}.(a_{k.g}^{ih}.idn, \ldots, a_{c.q}^{ih}.ty);\; m_{s.k \sim t.k})$. The $M_{s,t}$ mapping is a set of one-to-one probabilistic matchings of each reference attribute value $a_{k.g}^{ih}.ty \in rDO_w^{ih}$ against a local attribute value $a_{k.g}^{ih}.ty \in pDO_{wx}^{ih}$, provided that the similarity $(m_{s.k \sim t.k})$ between the pair of main parameter attribute values originating from a reference instance and its corresponding local instance is greater than a specified threshold, i.e., $m_{s.k \sim t.k}(pDO_{wx}^{ih}(a_k^{ih}.idn)) \sim (rDO_w^{ih}(a_k^{ih}.idn)) \ge \delta : rDO_w^{ih} \ne pDO_{wx}^{ih}$, where $(\delta)$ is the similarity threshold for considering a match between pairs of main parameter data values. Thus, for each instance pair from $iDO^{ih} = \bigcup_{i=1}^{n} \bigcup_{h=1}^{m} \bigcup_{k=1}^{c} \bigcup_{g=1}^{q} (iDO^{ih}.a_{k.g}^{ih}.ty)$, there is an $Ls_s.iDO^{ih}$ against $Ts_t.iDO^{ih}$ local source-to-target entity matching of the form $pDO_{wx}^{ih}.a_{k.g}^{ih}.ty \sim rDO_w^{ih}.a_{k.g}^{ih}.ty : rDO_w^{ih} \ne pDO_{wx}^{ih}$, where $(\sim)$ denotes the pair-wise matching operation.
  • $nDO_w$ is one of $z$ mutual probabilistic global entities whose merging alternatives are generated from merging their possible corresponding iDOs, given pair-wise linkage results of a reference instance against its possible local instances, i.e., $rDO_w = (rDO_w^{ih} : pDO_{w1}^{ih}[\Pr(L_{w1})], pDO_{w2}^{ih}[\Pr(L_{w2})], \ldots, pDO_{wy}^{ih}[\Pr(L_{wy})]) : 1 \le w \le z,\; 1 \le x \le y$. A probabilistic global entity is a set of possible instances merged from a reference instance with its possible local instances, i.e., $nDO_w = \{(nDO_{w.1}, \Pr(M_{w.1})), (nDO_{w.2}, \Pr(M_{w.2})), \ldots, (nDO_{w.f}, \Pr(M_{w.f}))\} : 1 \le j \le f,\; \sum_{j=1}^{f} \Pr(M_{w.j}) = 1$. Each possible entity merging alternative has an assigned probability distribution value obtained from multiplying the linkage probabilities of its linked instances, i.e., $nDO_{w.j} = ((rDO_w^{ih} : pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}), \Pr(M_{w.j}))$. For each requested attribute and within each possible merge, there can be a multi-valued attribute in which each possible attribute value alternative is assigned a probabilistic data fusion value obtained by updating and conditionally computing the reliability scores of its attribute values, i.e., $nDO_{w.j}.A_k^{GS} = \{(a(Pws_{tv.1}), \mu(a(Pws_{tv.1}))), (a(Pws_{tv.2}), \mu(a(Pws_{tv.2}))), \ldots, (a(Pws_{tv.P}), \mu(a(Pws_{tv.P})))\} : 1 \le l \le L,\; \sum_{p=1}^{P} \mu(a(Pws_{tv.p})) = 1$. An $a(Pws_{tv.p})$ possible world may contain single or multiple possible true values, i.e., $a(Pws_{tv.p}) = \{a_{k.1}^{w.j}, a_{k.2}^{w.j}, \ldots, a_{k.g}^{w.j}\}$.
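The following simplified sketch shows one way the possible merging alternatives of an $nDO_w$ and their probabilities $\Pr(M_{w.j})$ could be enumerated from independent pair-wise linkage probabilities $\Pr(L_{wx})$, using the complement $1 - \Pr(L_{wx})$ for instances left out of an alternative so that the probabilities sum to one. This is an illustrative reading under an independence assumption; the framework additionally applies possible-world generation rules (e.g., lineage constraints), which are omitted here.

```python
from itertools import product

# Enumerate merging alternatives of an nDO from independent pair-wise linkages.
linkages = {"pDO_w1": 0.8, "pDO_w2": 0.6}   # Pr(L_wx) for each candidate local instance

alternatives = []
for included in product([True, False], repeat=len(linkages)):
    merged, prob = ["rDO_w"], 1.0
    for (local, p), keep in zip(linkages.items(), included):
        if keep:
            merged.append(local)
            prob *= p            # instance is linked into this alternative
        else:
            prob *= 1.0 - p      # instance is left out of this alternative
    alternatives.append((merged, prob))

for merged, prob in alternatives:
    print(merged, round(prob, 2))
assert abs(sum(p for _, p in alternatives) - 1.0) < 1e-9   # Pr(M_w.j) sum to 1
```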

3.3. Data Lineage

Data lineage represents a very important perspective in the integration process, in particular when trying to resolve and manage uncertain and conflicting data contributed by matched and integrated entities that originate from heterogeneous and volatile data sources [89,90]. Data lineage provides information not only on the origin of entities and data values but also explanations for any generated information and returned results. In this approach, we combine lineage and uncertainty management into one data model, because lineage is closely related to uncertainty and conflicts and is a powerful mechanism for tracing the origin of uncertainty [89].
A model supporting lineage can trace an iDO and its data values back to their origin in a data source. This offers additional information that helps understand conflict and uncertainty, and it also facilitates capturing correlations among the participating iDOs. For instance, suppose that entities generated from a particular structured source are distinct real-world objects. If two matched iDOs originate from the same structured source, then we know that a possible world cannot contain both references in one possible merged alternative. Thus, impossible worlds are excluded by construction rules that consider the lineage of the data source and its type.
The proposed approach uses the lineage to identify and resolve conflicts that arise in the linkage and data fusion tasks. Data lineage is a convenient mechanism for computing the probability of fused data values when multiple values can exist at different sources [89,90]. We can obtain the correct reliability score $(\mu_{a_{k.g}^{ih}})$ of a datum by encapsulating lineage with its actual value; for example, a value whose lineage spans two data sources is a combined value, and its probabilistic fusion must be computed accordingly. The lineage $\ell^{ih}$ records the origin of a specific data value $(a_{k.g}^{ih})$ as obtained from certain $S_i.tp$ data source(s) and belonging to a certain $iDO^{ih}$ object. When similar $a_{k.g}^{ih}$ values are obtained from multiple merged iDOs, $\ell^{ih}$ may denote the union of these iDOs’ lineages. Thus, the lineage of a data value is $\ell^{ih} = (\ell_1, \ell_2, \ldots, \ell_{|\ell^{ih}|})$, attached to $a_{k.g}^{ih}$.
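As an illustration of keeping lineage next to each data value, the following sketch (with hypothetical names, not the paper’s implementation) attaches a set of origin identifiers to a value and unions the lineages when the same value is observed in several merged iDOs; the reliability combination shown is only a placeholder for the conditional fusion computation developed later in the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Keep lineage next to each data value so the origin of a fused value stays traceable.
@dataclass(frozen=True)
class ValueWithLineage:
    value: str
    lineage: FrozenSet[str]   # identifiers of the originating data sources / iDOs
    reliability: float        # mu, the reliability score of this value

def merge_same_values(a: ValueWithLineage, b: ValueWithLineage) -> ValueWithLineage:
    """Union the lineages of identical values observed in merged iDOs.
    The combined reliability here is a placeholder; the fusion method recomputes it."""
    assert a.value == b.value
    return ValueWithLineage(a.value, a.lineage | b.lineage,
                            max(a.reliability, b.reliability))

v1 = ValueWithLineage("Cairo", frozenset({"S1.iDO_11"}), 0.7)
v2 = ValueWithLineage("Cairo", frozenset({"S3.iDO_35"}), 0.6)
print(merge_same_values(v1, v2))
```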

3.4. Possible-World Semantic

The possible world is a fundamental concept in uncertainty management research, as most related work is based on it. It helps manage the odds of matching outputs, linkage answers, merging results, and multi-valued data [19,26,91]. As the participating data sources and the matching outputs contain incomplete, imprecise or uncertain information, they implicitly represent a collection of possible appearances in a sample space, called possible worlds or alternatives (Pws). Possible worlds describe an object or item for which many possibilities may exist as an answer to that description [3].
A possible world is a hypothetical state about an object or item that represents certain and ordinary information. It is obtained by choosing one alternative among a collection of representations that form the total sample space for each item containing uncertain data [52,92]. If there is a possibility that none of the given answers exists or is true, this should be treated as an alternative; in order for a database to recognize this alternative, the OWA should be taken into consideration [55]. For instance, referring to the example presented in Figure 3, the disjunctive information “John teaches Physics or Calculus” shows imprecise information about the subject that John teaches. Does this show uncertainty about John’s attributes? If the disjunction is interpreted inclusively under the CWA, then it represents three possible worlds: John teaches Physics, he teaches Calculus, or he teaches both. Under the OWA, however, there is a fourth possible world: John teaches neither Calculus nor Physics, meaning there is Some Other unknown Value (SOV) that John teaches. The disjunctive facts state that one of these worlds represents the actual situation, but it is unknown which one. Assigning probability values to these worlds gives them confidence degrees that measure the certainty level of each alternative being true. For example, if we know John teaches Physics with probability 0.8 or Calculus with probability 0.6, then it is most likely that John teaches both subjects, as this alternative has the highest confidence. Figure 3 illustrates these possible-world examples under both the CWA and the OWA.
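A worked version of the John example is sketched below, assuming the two facts are independent so that world probabilities are simple products. Under the CWA only the first three worlds exist; under the OWA the “neither” world (SOV) is also possible.

```python
from itertools import product

# Enumerate the possible worlds for the John example (independence assumed).
p_physics, p_calculus = 0.8, 0.6

worlds = {}
for teaches_physics, teaches_calculus in product([True, False], repeat=2):
    prob = (p_physics if teaches_physics else 1 - p_physics) * \
           (p_calculus if teaches_calculus else 1 - p_calculus)
    label = {(True, True): "Physics and Calculus",
             (True, False): "Physics only",
             (False, True): "Calculus only",
             (False, False): "neither (SOV)"}[(teaches_physics, teaches_calculus)]
    worlds[label] = prob

print(worlds)
# {'Physics and Calculus': 0.48, 'Physics only': 0.32,
#  'Calculus only': 0.12, 'neither (SOV)': 0.08}  -> 'both' is the most likely world
```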
The presence of a null value in the outcome means we have a missing value, i.e., a value that exists in the real world but is, for some reason, unavailable or unknown. Moreover, a missing value is characterized with respect to the presence and meaning of null values and to the validity of the CWA or the OWA [52,92]. The CWA states that only the values actually observed from the participating data sources, and no other values, represent facts of the real world [55]. Thus, the correct value must be contained in the participating data sources. In contrast, the OWA asserts neither the truth nor the falsity of facts not represented in the participating data sources. Therefore, the primitive sources are not necessarily complete, as the correct value may not be contained in them [52,92].
Null or missing values come in five different types, as stated by [52]. These types represent different interpretations: a value can be missing because it exists but is unknown, because it does not exist at all, or because it may exist but it is not actually known whether it exists or not. Due to the scope of this research and the existence of multi-valued attributes, we restrict null value manipulation to the atomic existential null value, which corresponds to the uncertainty conflict situation under the CWA and the OWA. An atomic existential null means that only one value is possible, such as when the age of a person is unknown ($Null$). For an existential null, there may be supplementary knowledge concerning the unknown value in the form of a domain of possible values that the attribute may take [3,52]. As this paper deals with uncertain data and results, several sample spaces ($\Omega_{w.j.k}$) and possible worlds ($Pws_{Tv}^{Case.w.j.k}$) will be produced to manage the uncertainty and conflicts of the data fusion results.

3.5. Probabilistic Entity Linkage Definition

In this research paper, instance integration is approached as a non-trivial integration problem in which a collective iteration process is practically inapplicable and prohibitively expensive, and manual intervention is unacceptable (impossible or hard to achieve). It also incorporates uncertainty management into the entity linkage and data fusion tasks by using probability theory to manage uncertain and conflicting items.
Probabilistic management is the process of starting from imperfect data and manipulating correlated data to generate a new probabilistic global entity, by probabilistically linking and merging its reference entity with its possible local entities and by probabilistically fusing its generated attribute values (i.e., single-valued and multi-valued attribute values) [3]. The probabilistic entity is a generated global entity that contains a set of possible instances merged from the reference instance and its possible local instances, i.e., $nDO_w = \{(nDO_{w.1}, \Pr(M_{w.1})), (nDO_{w.2}, \Pr(M_{w.2})), \ldots, (nDO_{w.f}, \Pr(M_{w.f}))\}$. Each possible merge is assigned a probability distribution value, $\Pr(M_{w.j})$, generated by multiplying the linkage probabilities of its possible linked local instances, i.e., $nDO_{w.j} = \{(rDO_w^{ih}: pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}), \Pr(M_{w.j})\}$, with $\sum_{j=1}^{f}\Pr(M_{w.j}) = 1$ [3].
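As an illustration of how the merge probabilities $\Pr(M_{w.j})$ can be derived from the pair-wise linkage probabilities, the following sketch (our own illustration, assuming independent linkage decisions; the linkage probabilities 0.97 and 0.85 are taken from the restaurant example in Figure 4b) enumerates the merging alternatives of one reference instance and shows that their probabilities sum to one.

from itertools import product

def merging_alternatives(linkage_probs):
    """Enumerate entity merging alternatives nDO_{w.j} from the pair-wise
    linkage probabilities of local instances to one reference instance,
    assuming independent linkage decisions."""
    alternatives = []
    for flags in product([True, False], repeat=len(linkage_probs)):
        linked = [name for (name, _), f in zip(linkage_probs.items(), flags) if f]
        pr = 1.0
        for (_, p), f in zip(linkage_probs.items(), flags):
            pr *= p if f else 1.0 - p
        alternatives.append((linked, pr))
    return alternatives  # the probabilities sum to 1 by construction

# Hypothetical linkage probabilities Pr(L_w1) = 0.97, Pr(L_w2) = 0.85 (Figure 4b).
for linked, pr in merging_alternatives({"pDO11": 0.97, "pDO12": 0.85}):
    print(linked, round(pr, 4))
# ['pDO11', 'pDO12'] 0.8245  <- the alternative nDO_{1.1} used later in Example 3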
The probabilistic data fusion task addressed in this paper explicitly considers the pair-wise probabilistic entity linkages and their corresponding probabilistic entity merging results. Section 4 formulates the probabilistic data fusion problem based on these linkage and merging results.

3.6. Probabilistic Data Fusion Assumptions

The participating sources are assumed to have independent causes of error. This is a valid assumption, since the participating sources are independently maintained, meaning the data values in one source are not derived from the data values in other sources. In addition, the data values obtained for multiple merged instances follow the categorical value condition, so values that do not match exactly are considered distinct. Thus, the accuracy of the supplied data values in our fusion problem implies the assumptions below:
Assumption 1.
Each participating data value is assumed to be associated with a confidence value that indicates its probability of correctness, i.e., $\mu_{a_{k.g}^{ih}}$.
Assumption 2.
Since data source correlation in the form of data copying is outside the scope of this research, the errors of attribute values are independent across different data sources. In other words, for any distinct data sources $(S_1, \ldots, S_n)$, the value $a_{k.1}^{1h}$ recorded in $S_1$ does not depend on the values $(a_{k.2}^{2h}, \ldots, a_{k.q}^{ih})$ recorded in $(S_2, \ldots, S_n)$, and so on, once we know the true value $a_{k.Tv}^{w.j}$ of the $A_k^{w.j}$ attribute (and vice versa). This means that the reliability score of an attribute value does not depend on the reliability scores of other values once we know the event that corresponds to the true world of the fused value, i.e., $(a_{k.Tv}^{w.j} = a_{k.q}^{ih}.T \oplus a_{k.Tv}^{w.j} = a_{k.q}^{ih}.F)$, where $a_{k.Tv}^{w.j}$ is the possible true data value, $T$ denotes being true, $F$ denotes being false, and $\oplus$ represents an exclusive OR (XOR).
Assumption 3.
The prior probabilities of $a_{k.Tv}^{w.j} = a_{k.1}^{1h}.T$ or $a_{k.1}^{1h}.F$, $a_{k.Tv}^{w.j} = a_{k.2}^{2h}.T$ or $a_{k.2}^{2h}.F$, ..., $a_{k.Tv}^{w.j} = a_{k.q}^{ih}.T$ or $a_{k.q}^{ih}.F$ are equal and independent if we have no a priori reason to believe that one value is more likely to occur than another. These priors are assigned to $a_{k.Tv}^{w.j}$ before the reliability scores of $A_k^{w.j}.a_{k.g}^{ih}$ are considered.

4. The Probabilistic Data Fusion Problem

The probabilistic data fusion problem that this paper addresses explicitly considers the entity linkage and merging decisions that can be resolved using probabilistic decisions. This means that the representation of the entity merging answer may produce multiple possible alternatives, where each alternative may have a different combination of possible linked instances, which in turn may contain similar and/or distinct data values in their shared attributes. Moreover, as attribute values originate from heterogeneous data sources, they can be distinct, misspelt, incomplete, outdated, or may even contain typographical errors. They can also be more or less certain and reliable.
In fact, the quality of the participating sources affects our belief in the correctness of their information items. Often, such sources provide a degree of confidence for their contributed information, generated by statistical tools, thereby lending their predictions a probabilistic interpretation [24,25]. Consequently, multiple valid or invalid global attribute values can exist. The data values assigned to each global attribute can differ from one alternative to another, and each distinct value can originate from one or multiple data sources. This brings an additional representational and computational challenge to the development of the data fusion model. Accordingly, the accuracy of the attribute values as originating from their data sources, and their combined reliability scores based on the instances included in an entity merging alternative, must be considered.
To present our fusion problem clearly and to facilitate the understanding of the solution space that follows, we first highlight the data fusion cases that this paper aims to identify and address. Then, we provide a formal definition of our data fusion problem, i.e., the probabilistic fusion of multi-valued attributes based on probabilistic entity merging alternatives. The obtained attributes may contain valid or invalid data, as they originate from different sources or instances with different reliability scores.

4.1. Data Fusion Cases

Among the different values that could be merged for a particular global attribute $(A_k^{Gs})$ within a specific entity merging alternative $(nDO_{w.j})$, i.e., $\mathrm{Dom}\ A_k^{w.j}$, either only one value can be true or multiple values can be true [7,8,19,71,77]. For example, a person can have only one true age, yet they can have multiple true phone numbers. These differences reflect two fusion cases: (i) an attribute with multiple true values, such as the phone numbers of a global $(nDO_{w.j})$ entity $(768\ 4001, 567\ 3211)$; (ii) an attribute with inconsistent values. The latter case forms a conflict, classified by previous research into two types: uncertainty and contradiction [8,23,37]. Contradiction is a conflict between two or more inconsistent non-null values that refer to the same object attribute, such as merged Age values of $(35, 37)$ for a person who can have only one true age, while uncertainty represents the conflict between one or more non-null values and one or more null values that refer to the same object attribute.
Considering the above discussion of the data fusion cases, the non-null and null values, and the CWA and OWA assumptions, the data fusion challenge can be split into six different cases, as shown in Table 1. Each case depicts a specific multi-valued attribute challenge that must be handled accordingly. Each case is also given a specific label; for example, (C1.1) denotes the multi-true-value case under the closed-world assumption (MTC-CWA), while (C2.1.2) denotes the contradiction case under the open-world assumption (ITCC-OWA).

4.2. Problem Formulation

In this paper, the data fusion problem addresses the probabilistic fusion of multiple valid or invalid attribute values within a particular $nDO_w$ set. This set may contain multiple alternatives of probabilistic entity merging subsets, i.e., $nDO_w = \{(nDO_{w.1}, \Pr(M_{w.1})), (nDO_{w.2}, \Pr(M_{w.2})), \ldots, (nDO_{w.f}, \Pr(M_{w.f}))\} : 1 \leq j \leq f$, and the goal is to decide which attribute value, or alternative set of values, is most likely the correct answer for the $nDO_{w.j}$ entity merging alternative. The $nDO_w$ alternatives set is obtained from the pair-wise probabilistic linkage decisions of local instances $(pDO_{wx}^{ih})$ to a reference instance $(rDO_w^{ih})$, i.e., $(rDO_w^{ih}: pDO_{w1}^{ih}[\Pr(L_{w1})], pDO_{w2}^{ih}[\Pr(L_{w2})], \ldots, pDO_{wy}^{ih}[\Pr(L_{wy})]) : 1 \leq w \leq z,\ 1 \leq x \leq y$. $pDO_{wx}^{ih}$ is a possible instance from an $Ls_s$ local data source that is probably linked to a particular $rDO_w^{ih}$ reference instance, and $rDO_w^{ih}$ is an iDO instance from a $Ts_t$ target data source. Moreover, as the actual data values are assigned reliability scores, an $A_k^{Gs}.a_{k.g}$ global attribute value that is populated from similar values originating from multiple data sources, i.e., $A_k^{Gs}.a_{k.g} = \bigcup A_k^{ih}.a_{k.g}^{ih}$, may have multiple reliability scores at once $(\mu_{A_k^{Gs}.a_{k.g}} = \bigcup \mu_{a_{k.g}^{ih}})$ [3]. Identifying the probabilistic true value or values of data items obtained from independent data sources and populated according to multiple probabilistic entity merging alternatives starts from the following inputs:
• A set of participating sources $S = \{S_1, \ldots, S_n\}$. Each source $S_i, i \in [1, n]$, can be either a local source $Ls_s$ or a target source $Ts_t$, and it contains a set of $iDO^{ih}$ objects that represent a particular aspect of a RWO.
• A participating $iDO^{ih}$ object is a triple comprising a set of attribute names, values, and types, i.e., $(A_k^{ih}, a_{k.g}^{ih}, ty)$, where the type of a participating attribute is obtained from the type of its corresponding global attribute, i.e., $A_k^{ih}.ty = A_k^{Gs}.ty : ty \in \{Idn, Desc, Supp\}$, and $\exists A_k^{ih}.ty = A_k^{GS}.Idn$. An attribute may have a single data value, i.e., $(A_k^{ih}.a_{k.g}^{ih} = A_k^{ih}.a_{k.1}^{ih} = A_k^{ih}.a_k^{ih})$, or multiple data values, i.e., $(A_k.a_{k.g} : 1 < g \leq q)$.
• An $nDO_w$ entity merging set obtained from the $rDO_w$ linkage set. $nDO_w$ consists of all the possible subsets of entity merging alternatives $\{(nDO_{w.1}, \Pr(M_{w.1})), (nDO_{w.2}, \Pr(M_{w.2})), \ldots, (nDO_{w.f}, \Pr(M_{w.f}))\}$. An $nDO_{w.j}$ subset is depicted as $((rDO_w^{ih}: pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}), \Pr(M_{w.j}))$, with $\sum_{j=1}^{f}\Pr(M_{w.j}) = 1$. The $pDO_{wx}^{ih}$ instances assigned in each alternative differ from one alternative to another; consequently, each attribute may include different sets of data values.
• A confidence degree, referred to as a reliability score and denoted by $\mu_{a_{k.g}^{ih}}$, indicating the probability that a specific value provided by attribute $A_k^{ih}$ is true and associated with a particular $A_k^{Gs}$ global attribute. Accordingly, a source reliability score $\mu_{a_{k.g}^{ih}}$ is associated with each $a_{k.g}^{ih}$ data value.
• A matching function returning a precise (Match or Not Match) decision between a pair of participating data values obtained from iDOs that belong to a particular $nDO_w$ merging set. For a specific attribute, the generated data values are the union of all distinct values. Each obtained data value may be derived from multiple similar values originating from multiple iDOs, as observed from a participating $rDO_w^{ih}$ entity and its corresponding $pDO_{wx}^{ih}$ instances. The matching outputs produce a data value domain based on the obtained global schema/attributes and the participating data sources. This domain of data values is depicted in Equation (1) below, and Figure 4d shows an example of its generation:
$$\mathrm{DOM}\ A_k^{Gs} = \bigcup_{g_w=1}^{q_w}\Big(a_{k.g_w}^{\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_w}^{\ell_{ih}}}\Big) : \forall i \in (Ts_t, Ls_s) \quad (1)$$
where, within a specific $A_k^{GS}$ global attribute and $nDO_w$ entity merging set:
- $a_{k.g_w}^{\ell_{ih}}$ depicts a populated data value derived from its corresponding data values in the participating $Ts_t$ and $Ls_s$ sources. $a_{k.g_w}^{\ell_{ih}}$ may assemble a single value, $a_{k.g_w}^{\ell_{ih}} = a_{k.1}^{ih}$, or a data value combined from its corresponding similar values, $a_{k.g_w}^{\ell_{ih}} = (a_{k.1}^{ih}, \ldots, a_{k.q}^{ih}) : \forall a_{k.g}^{ih} \in nDO_w,\ a_{k.1}^{ih} = a_{k.2}^{ih} = \cdots = a_{k.q}^{ih}$.
- $\ell_{ih}$ depicts the data lineage of each iDO in the $rDO_w$ set that holds the attribute value. For each generated global data value, $\ell_{ih}$ indicates the union of the data lineages of the similar data values that originate from the participating data sources and relate to one $a_{k.g_w}^{\ell_{ih}}$ global data value, i.e., $\ell_{ih} = (\ell_1, \ldots, \ell_{|\ell_{ih}|}),\ \ell_{ih} \in a_{k.g_w}^{\ell_{ih}}$. Due to the similar $a_{k.g}^{ih}$ values, $\ell_{ih}$ may indicate one or many lineages.
- $\bigcup_{g_{ih}=1_{11}}^{q_{nm}}(\mu_{a_{k.g_w}^{\ell_{ih}}})$, or $\mu_{a_{k.g_w}^{\ell_{ih}}}$ for simplicity of representation, denotes the set of reliability scores of all the iDO data values included in a global data value $a_{k.g_w}^{\ell_{ih}}$. Depending on the observed data lineage $\ell_{ih}$ of the value $a_{k.g_w}^{\ell_{ih}}$, the set $\mu_{a_{k.g_w}^{\ell_{ih}}}$ may contain a single reliability score or multiple scores. This means a generated global attribute value can be obtained from one or more data sources or iDOs and, hence, can be assigned one or more reliability scores.
Considering a specific $A_k^{GS}$ attribute and an $nDO_{w.j}$ merging alternative, the actual combination of its data values contains the union of all distinct values that are generated by the matching function and that belong to the instances assigned in that specific $nDO_{w.j}$ alternative. This combination of data values for an attribute produces a data value domain, denoted by the following formulation:
$$\mathrm{Dom}\ A_k^{w.j} = \bigcup_{g_{w.j}=1}^{q_{w.j}}\Big(a_{k.g_{w.j}}^{\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}\big)\Big) : \mathrm{Dom}\ A_k^{w.j} \subseteq \mathrm{Dom}\ A_k^{Gs.w} \quad (2)$$
In this domain, $a_{k.g_{w.j}}^{\ell_{ih}}$, $\ell_{ih}$, and $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}$ refer, respectively, to the populated data values, lineages, and reliability scores of a specific merging alternative $nDO_{w.j} = ((rDO_w^{ih}: pDO_{w1}^{ih}, pDO_{w2}^{ih}, \ldots, pDO_{wy}^{ih}), \Pr(M_{w.j}))$. The existence of null values introduces an additional challenge for generating the $\mathrm{Dom}\ A_k^{w.j}$ domain. This occurs when one of the data values belonging to $\mathrm{Dom}\ A_k^{w.j}$ is an atomic null value, i.e., $a_{k.g.Nl}^{w.j.\ell_{ih}}$. In this case, the null value is denoted by a domain of possible values, $\mathrm{Dom}\ a_{k.g.Nl}^{w.j.\ell_{ih}}$, such that either one or none of these possible values can be the true value. Thus, $\mathrm{Dom}\ A_k^{w.j}$ in Equation (2) can be reformulated as stated in Equation (3).
$$\begin{aligned}
\mathrm{Dom}\ A_k^{w.j} &= \Big(\bigcup_{g_{w.j}=1}^{q_{w.j}-|g.Nl_{w.j}|}\big(a_{k.g_{w.j}}^{\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}\big),\ \big(a_{k.g.Nl}^{w.j.\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g.Nl}^{w.j.\ell_{ih}}}\big)\Big)\\
&= \Big(\bigcup_{g_{w.j}=1}^{q_{w.j}-|g.Nl_{w.j}|}\big(a_{k.g_{w.j}}^{\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}\big),\ \bigcup_{Nl_g=Nl_1}^{Nl_q}\big(a_{k.g.Nl_g}^{w.j.\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g.Nl_g}^{w.j.\ell_{ih}}}\big)\Big) : \exists\, a_{k.g.Nl}^{w.j.\ell_{ih}} \in \mathrm{Dom}\ A_k^{w.j} \quad (3)
\end{aligned}$$
An example of a probabilistic instance integration process for three restaurant instances is given in Figure 4 to illustrate our data fusion problem. These iDOs originate from three structured sources ($S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$), where the reliability scores of their data values are $\mu_{a_{1.1}^{13}} = \mu_{a_{2.1}^{13}} = \mu_{a_{3.1}^{13}} = 0.9$, $\mu_{a_{1.1}^{21}} = \mu_{a_{2.1}^{21}} = \mu_{a_{3.1}^{21}} = 0.6$, and $\mu_{a_{1.1}^{31}} = \mu_{a_{3.1}^{31}} = 0.45$. Moreover, the Name attribute has been recognized as the main parameter for the instance matching and integration process; hence, $ty = Idn$ for this attribute (i.e., $Name_{1.1}^{GS}.idn$). In this example, the participating iDO instances are shown in Figure 4a, whereas the produced probabilistic entity linkages and merging alternatives are shown in Figure 4b,c, respectively. In addition, the data value domains of the global Phone and Address attributes are presented in Figure 4d.
From Figure 4a, we see that the Name and Address global attributes are obtained from the corresponding restaurant Name and Address attributes of the three participating iDOs, where each iDO originates from a different data source, i.e., $(Name_1^{13}, Name_1^{21}, Name_1^{31}) \equiv Name_{1.1}^{Gs.idn}$ and $(Address_3^{13}, Address_3^{21}, Address_2^{31}) \equiv Address_3^{Gs.ty}$. In contrast, the global Phone attribute is generated from the corresponding Phone attributes of only two participating iDOs, each belonging to a different source, i.e., $iDO^{13}$ from $S_1$ and $iDO^{21}$ from $S_2$: $(Phone_2^{13}, Phone_2^{21}) \equiv Phone_2^{Gs.ty}$. Due to these correspondences, $\mathrm{Dom}\ A_2^{Gs}$ in Figure 4d does not contain a null value, since $iDO^{31}$ from $S_3$ does not have a Phone attribute; rather, it consists of the single value $(818/762\ 1221_{2.1}^{(13,21)}, (0.9, 0.6))$. This value carries two lineages and two reliability scores, as obtained from the $iDO^{13}$ and $iDO^{21}$ instances. The $\mathrm{Dom}\ A_2^{Gs}$ domain presents the combined data values of the global Phone attribute as populated by Equation (1). In contrast, $\mathrm{Dom}\ A_3^{Gs}$ has two data values with different cardinalities of lineages and reliability scores, i.e., $((12224\ Ventura\ Blvd_{3.1}^{(13,21)}), (0.9, 0.6))$ and $((12335\ Fienega\ Blvd_{3.2}^{(31)}), (0.45))$. The $\mathrm{Dom}\ A_3^{Gs}$ domain depicts the combined data values of the global Address attribute as populated using Equation (1). Basically, these data value domains comprise all distinct Phone and Address values obtained from running a matching function over the pairs of $(iDO^{13} \sim iDO^{21})$ and $(iDO^{13} \sim iDO^{31})$ instances to generate the linkage set $\{rDO_1^{13}: (pDO_{11}^{21}, 0.97), (pDO_{12}^{31}, 0.85)\}$ shown in Figure 4b. These reference and local instances correspond to the participating iDOs as $rDO_1^{13} \equiv iDO^{13}$, $pDO_{11}^{21} \equiv iDO^{21}$, and $pDO_{12}^{31} \equiv iDO^{31}$. Based on our data fusion problem, the data values in these domains need to be fused, i.e., the true value or alternative values must be found based on the entity merging alternatives in Figure 4c. Accordingly, the $\mathrm{Dom}\ A_k^{w.j}$ data value domains to be fused are populated based on Equation (2) and are shown in Table 2. If the data fusion case belongs to the ITCU categories presented in Table 1, the $\mathrm{Dom}\ A_k^{w.j}$ data value domains to be fused are populated based on Equation (3).
From $\mathrm{Dom}\ A_3^{Gs}$ in Figure 4d and $\mathrm{Dom}\ A_3^{1.3}$ in Table 2, we notice that these domains share the same values, yet the lineages of their first value differ. The lineage of $(12224\ Ventura\ Blvd_{3.1}^{\ell_{ih}})$ in $\mathrm{Dom}\ A_3^{Gs}$ is $\ell_{ih} = (13, 21)$, but in $\mathrm{Dom}\ A_3^{1.3}$ it is $\ell_{ih} = (13)$ only. This occurs because $pDO_{11}^{21}$ does not exist in the $nDO_{1.3}$ merging alternative, as the figure above shows. The data fusion computation method is executed over these domains to obtain the possible true fused value or values.
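The construction of such data value domains can be sketched as follows (an illustrative sketch only; the observation tuples are transcribed from the Address values and reliability scores of Figure 4a): equal values are grouped, and their lineages and reliability scores are unioned, as in Equation (1).

def build_domain(observations):
    """Group the attribute values observed from several iDOs into a data value
    domain: equal values are merged and their lineages and reliability scores
    are unioned, following Equation (1)."""
    domain = {}
    for value, lineage, score in observations:
        lineages, scores = domain.setdefault(value, ([], []))
        lineages.append(lineage)
        scores.append(score)
    return [(value, tuple(l), tuple(s)) for value, (l, s) in domain.items()]

# Address observations (value, lineage, reliability score) from Figure 4a.
address_obs = [
    ("12224 Ventura Blvd", "13", 0.9),   # from iDO^13 (source S1)
    ("12224 Ventura Blvd", "21", 0.6),   # from iDO^21 (source S2)
    ("12335 Fienega Blvd", "31", 0.45),  # from iDO^31 (source S3)
]
for value, lineages, scores in build_domain(address_obs):
    print(value, lineages, scores)
# 12224 Ventura Blvd ('13', '21') (0.9, 0.6)
# 12335 Fienega Blvd ('31',) (0.45,)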
It is worth stating that this problem definition copes with dynamic and volatile data values that may evolve over time. It also corresponds to the online data fusion problem, since the prior entity linkage and merging stages are processed separately, with each stage keeping and storing the obtained probability outcomes alongside their actual data. The fusion process is carried out on these outcomes upon a user's request, as the obtained probabilities are stored alongside the actual data values.

5. The Probabilistic Data Fusion Model

In this section, we formally describe the data fusion solution and show how we leverage the trustworthiness of data sources and their values in truth discovery. To determine $a_{k.Tv}^{w.j}$, which might be among the value(s) observed from the participating sources, the production of the data fusion sample space $(\Omega_{w.j.k})$ and possible worlds $(Pws_{Tv}^{Case.w.j.k})$ within a particular $A_k^{w.j}$ attribute's values and $nDO_{w.j}$ merging alternative is discussed next. Then, the probabilistic data fusion method is constructed to compute the conditional probability (i.e., the updated reliability score) of a possible world of data values that is probably the true fused answer, using the $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}$ scores and a given data fusion case, over a domain $\mathrm{Dom}\ A_k^{w.j}$.

5.1. The Probabilistic Data Fusion Sample Space and Possible-Worlds Generation

The form of a data value and its associated reliability scores affects the production of the data fusion sample space. In addition, recognizing a possible-worlds set over its sample space depends on the applied data fusion case. Accordingly, the production of the data fusion sample space and the generation of possible worlds are discussed next, based on the cases identified in Table 1.

5.1.1. The Data Fusion Sample Space Production

Each value in $\mathrm{Dom}\ A_k^{w.j}$ is assigned a reliability score $(0 \leq \mu_{a_{k.g}^{ih}} \leq 1)$ that indicates its probability of being a true fused value, i.e., $(a_{k.g}^{ih}.T : \mu_{a_{k.g}^{ih}.T} = \mu_{a_{k.g}^{ih}})$. Since $\mu_{a_{k.g}^{ih}} \leq 1$, its complement $(\neg\mu_{a_{k.g}^{ih}} = 1 - \mu_{a_{k.g}^{ih}})$ indicates the probability of the data value being a false fused value, i.e., $(a_{k.g}^{ih}.F : \mu_{a_{k.g}^{ih}.F} = \neg\mu_{a_{k.g}^{ih}.T} = \neg\mu_{a_{k.g}^{ih}})$. Thus, a pair $(a_{k.g_{w.j}}^{\ell_{ih}}, \mu_{a_{k.g_{w.j}}^{\ell_{ih}}})$ in $\mathrm{Dom}\ A_k^{w.j}$ can be interpreted as a pair set of two mutually exclusive events. This is depicted as:
$$\Big(a_{k.g_{w.j}}^{\ell_{ih}},\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_{w.j}}^{\ell_{ih}}}\Big) \Rightarrow \Big\{\big(a_{k.g_{w.j}}^{\ell_{ih}}.T,\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T}\big),\ \big(a_{k.g_{w.j}}^{\ell_{ih}}.F,\ \bigcup_{g_{ih}=1_{11}}^{q_{nm}}\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.F}\big)\Big\} \quad (4)$$
where:
- $a_{k.g_{w.j}}^{\ell_{ih}}.T$ denotes the event in which the generated data value belonging to $\mathrm{Dom}\ A_k^{w.j}$ is a true fused value, i.e., $a_{k.Tv}^{w.j} = a_{k.g_{w.j}}^{\ell_{ih}} \in \mathrm{Dom}\ A_k^{w.j}$, with a probability equal to the union of the original reliability scores of all lineages $\ell_{ih}$ included in the event, i.e., $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T} = (\mu_{a_{k.1}^{ih}}, \mu_{a_{k.2}^{ih}}, \ldots, \mu_{a_{k.q}^{ih}}),\ \forall \ell_{ih} \in a_{k.g_{w.j}}^{\ell_{ih}}$.
- $a_{k.g_{w.j}}^{\ell_{ih}}.F$ denotes the event in which the generated data value belonging to $\mathrm{Dom}\ A_k^{w.j}$ is a false fused value, i.e., $a_{k.Tv}^{w.j} \neq a_{k.g_{w.j}}^{\ell_{ih}},\ a_{k.g_{w.j}}^{\ell_{ih}} \in \mathrm{Dom}\ A_k^{w.j}$, with a probability equal to the union of the reliability score complements of all $\ell_{ih}$ included in the event, i.e., $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.F} = \neg\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T} = (\neg\mu_{a_{k.1}^{ih}}, \ldots, \neg\mu_{a_{k.q}^{ih}}),\ \forall \ell_{ih} \in a_{k.g_{w.j}}^{\ell_{ih}}$.
- The set $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F}$ indicates either the set of original reliability scores or the set of complement scores of its associated data value event $a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F$.
Depending on the observed $\ell_{ih}$, the data value in an event may be a single data value, i.e., $a_{k.g_{w.j}}.T\backslash F = a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F$, or a value combined from its similar $a_{k.g}^{ih}$ values, i.e., $a_{k.g_{w.j}}.T\backslash F = a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F = (a_{k.1}^{ih}.T\backslash F, a_{k.2}^{ih}.T\backslash F, \ldots, a_{k.q}^{ih}.T\backslash F),\ \forall a_{k.g}^{ih} = a_{k.g_{w.j}}^{\ell_{ih}} \in nDO_{w.j}$. Likewise, the reliability score set assigned to a data value event $(a_{k.g_{w.j}}.T\backslash F)$ may contain a single reliability score, $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F}$, or multiple reliability scores, $\mu_{a_{k.g_{w.j}}^{\ell_{ih}}.T\backslash F} = (\mu_{a_{k.1}^{ih}.T\backslash F}, \mu_{a_{k.2}^{ih}.T\backslash F}, \ldots, \mu_{a_{k.q}^{ih}.T\backslash F})$.
For a domain holding a single data value pair, i.e., $\mathrm{Dom}\ A_k^{w.j} = \{(a_{k.g_{w.j}}^{\ell_{ih}}, \mu_{a_{k.g_{w.j}}^{\ell_{ih}}})\}$, the two events $(a_{k.g_{w.j}}^{\ell_{ih}}.T,\ a_{k.g_{w.j}}^{\ell_{ih}}.F)$ comprise the data fusion sample space, where each event represents a possible world of the true fused value, i.e., $(a_{k.Tv}^{w.j} = a_{k.g_{w.j}}^{\ell_{ih}}.T \oplus a_{k.Tv}^{w.j} = a_{k.g_{w.j}}^{\ell_{ih}}.F)$, where $\oplus$ represents an exclusive OR (XOR). In general, the data fusion sample space of true alternatives is obtained as $\Omega_{w.j.k} = \bigcup_{l=1}^{L}\Omega_{tv.l} : 1 \leq l \leq L,\ L = 2^{|\mathrm{Dom}\ A_k^{w.j}|}$ [3].
When multiple distinct data values are observed in $\mathrm{Dom}\ A_k^{w.j}$, generating the data fusion sample space requires a Cartesian product over the event sets obtained from $\mathrm{Dom}\ A_k^{w.j}$ in Equation (4). Therefore, the sample space produced over $\mathrm{Dom}\ A_k^{w.j}$ is the result of the Cartesian product over the event sets of its data value pairs, as presented in Equation (5):
$$\begin{aligned}
\Omega_{w.j.k} &= \big\{(a_{k.1}^{w.j.\ell_{ih}}.T,\ \mu_{a_{k.1}^{w.j.\ell_{ih}}.T}), (a_{k.1}^{w.j.\ell_{ih}}.F,\ \mu_{a_{k.1}^{w.j.\ell_{ih}}.F})\big\} \times \cdots \times \big\{(a_{k.q}^{w.j.\ell_{ih}}.T,\ \mu_{a_{k.q}^{w.j.\ell_{ih}}.T}), (a_{k.q}^{w.j.\ell_{ih}}.F,\ \mu_{a_{k.q}^{w.j.\ell_{ih}}.F})\big\}\\
&= \times_{g_{w.j}=1}^{q_{w.j}}\big\{(a_{k.g}^{\ell_{ih}}.T,\ \mu_{a_{k.g}^{\ell_{ih}}.T}), (a_{k.g}^{\ell_{ih}}.F,\ \mu_{a_{k.g}^{\ell_{ih}}.F})\big\} : \forall a_{k.g_{w.j}} \in \mathrm{Dom}\ A_k^{w.j},\ |\ell_{ih}| \geq 1 \quad (5)
\end{aligned}$$
$\Omega_{w.j.k}$ will contain multiple mutually exclusive data fusion worlds $\Omega_{tv.l}$, such that each data value pair included in an $\Omega_{tv.l}$ world is encoded by either its true or its false event. Moreover, since each event is a pair consisting of the true or false form of the actual data value together with its reliability scores, an $\Omega_{tv.l}$ world is depicted as a pair comprising multiple data value events with their associated reliability score sets. This pair representation of alternatives is given in Equation (6), which shows the mutually exclusive alternatives produced for a particular data fusion sample space:
$$\begin{aligned}
\Omega_{w.j.k} &= \Big\{\big(a(\Omega_{tv.l}),\ \mu(\Omega_{tv.l})\big)\ \Big|\ a(\Omega_{tv.l}) = \bigcup_{g_{w.j}=1}^{q_{w.j}}\big(a_{k.g}^{w.j.\ell_{ih}}.T\backslash F\big),\ \mu(\Omega_{tv.l}) = \bigcup_{g_{w.j}=1}^{q_{w.j}}\bigcup_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g}^{w.j.\ell_{ih}}.T\backslash F}\big)\Big\}\\
&= \Big\{\big\{(a_{k.1}^{\ell_{ih}}.T, \ldots, a_{k.q-1}^{\ell_{ih}}.T, a_{k.q}^{\ell_{ih}}.T),\ (\mu_{a_{k.1}^{\ell_{ih}}.T}, \ldots, \mu_{a_{k.q-1}^{\ell_{ih}}.T}, \mu_{a_{k.q}^{\ell_{ih}}.T})\big\}, \ldots,\\
&\qquad \big\{(a_{k.1}^{\ell_{ih}}.F, \ldots, a_{k.q-1}^{\ell_{ih}}.F, a_{k.q}^{\ell_{ih}}.T),\ (\mu_{a_{k.1}^{\ell_{ih}}.F}, \ldots, \mu_{a_{k.q-1}^{\ell_{ih}}.F}, \mu_{a_{k.q}^{\ell_{ih}}.T})\big\},\\
&\qquad \big\{(a_{k.1}^{\ell_{ih}}.F, \ldots, a_{k.q-1}^{\ell_{ih}}.F, a_{k.q}^{\ell_{ih}}.F),\ (\mu_{a_{k.1}^{\ell_{ih}}.F}, \ldots, \mu_{a_{k.q-1}^{\ell_{ih}}.F}, \mu_{a_{k.q}^{\ell_{ih}}.F})\big\}\Big\} \quad (6)
\end{aligned}$$
where:
- $a(\Omega_{tv.l})$ denotes the union of the true and false data value events $(a_{k.g}^{w.j.\ell_{ih}}.T\backslash F)$ contained in an $\Omega_{tv.l}$ world. Depending on $\mathrm{Dom}\ A_k^{w.j}$, the $a(\Omega_{tv.l})$ set denotes one or more of the actual data value events, of which none, some, or all may be combined data values.
- $\mu(\Omega_{tv.l})$ denotes the reliability score set of an $\Omega_{tv.l}$ world, obtained as the union of the reliability score sets of all the distinct $a_{k.g}^{w.\ell_{ih}}.T/F$ events included in the $\Omega_{tv.l}$ world. Depending on the events participating in an $\Omega_{tv.l}$ alternative (all true, some true, or none true), the reliability score set $\mu(\Omega_{tv.l})$ may contain the original scores, the complement scores, or both.
Table 3 illustrates the sample space production, outlined based on $\mathrm{Dom}\ A_3^{1.1}$ from Table 2.
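A compact sketch of the Cartesian-product construction of Equation (5) is given below (our own illustration; the domain literals are the Address values of $\mathrm{Dom}\ A_3^{1.1}$ from Table 2). Each world records the values asserted true together with the reliability scores, or their complements, used to weight it.

from itertools import product

def sample_space(domain):
    """Build Omega_{w.j.k} as the Cartesian product of the true/false events
    of every (value, scores) pair in Dom A_k^{w.j} (Equation (5))."""
    event_sets = []
    for value, scores in domain:
        true_event = (value, scores)                        # original scores
        false_event = (None, tuple(1 - s for s in scores))  # complement scores
        event_sets.append((true_event, false_event))
    worlds = []
    for combo in product(*event_sets):
        values = tuple(v for v, _ in combo if v is not None)
        scores = tuple(s for _, ss in combo for s in ss)
        worlds.append((values or ("Unknown",), scores))
    return worlds

# Dom A_3^{1.1}: Address values with their reliability score sets (Table 2).
dom = [("12224 Ventura Blvd", (0.9, 0.6)), ("12335 Fienega Blvd", (0.45,))]
for values, scores in sample_space(dom):
    print(values, tuple(round(s, 2) for s in scores))
# Reproduces the four worlds Omega_tv.1 ... Omega_tv.4 of Example 2 below.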
The existence of a null value introduces an additional challenge for generating the $\Omega_{w.j.k}$ sample space. This challenge is related to the $\mathrm{Dom}\ A_k^{w.j}$ generation in Equation (3). The sample space for this domain is generated through Cartesian product operations over the event pair sets of the non-null and null data value sets, following Equation (5). However, since the atomic null indicates that at most one data value can be true, any generated world that represents more than one value of the null's domain by its true event is an impossible world, i.e., $Iws^{w.j.k(Null)}$, and must be eliminated from the $\Omega_{w.j.k(Null)}$ sample space. Thus, the sample space production in the presence of null values is formulated in Equation (7).
$$\begin{aligned}
\Omega_{w.j.k} &= \big(\Omega_{w.j.k(\mathrm{NonNull})} \times \Omega_{w.j.k(\mathrm{Null})}\big) : \forall\, \Omega_{tv.l}^{w.j.k(\mathrm{Null})} \in \Omega_{w.j.k(\mathrm{Null})},\ \Big|\bigcup_{Nl_g=Nl_1}^{Nl_q} a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T \in a\big(\Omega_{tv.l}^{w.j.k(\mathrm{Null})}\big)\Big| \leq 1,\\
&\quad \Omega_{w.j.k(\mathrm{Null})} = \bigcup_{l_{Nl}=1}^{L_{Nl}}\Omega_{tv.l_{Nl}}^{w.j.k},\ L_{Nl} = |\mathrm{Dom}\ a_{k.g.Nl}^{w.j.\ell_{ih}}| + 1 \quad (7)
\end{aligned}$$
where:
- $\Omega_{w.j.k(\mathrm{NonNull})} = \times_{g_{w.j}=1}^{q_{w.j}-|g.Nl_{w.j}|}\big\{(a_{k.g}^{w.j.\ell_{ih}}.T,\ \mu_{a_{k.g}^{w.j.\ell_{ih}}.T}), (a_{k.g}^{w.j.\ell_{ih}}.F,\ \mu_{a_{k.g}^{w.j.\ell_{ih}}.F})\big\}$.
- $\Omega_{w.j.k(\mathrm{Null})} = \times_{Nl_g=Nl_1}^{Nl_q}\big\{(a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T,\ \mu_{a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T}), (a_{k.g.Nl_g}^{w.j.\ell_{ih}}.F,\ \mu_{a_{k.g.Nl_g}^{w.j.\ell_{ih}}.F})\big\} - Iws^{w.j.k(\mathrm{Null})}$, where $Iws^{w.j.k(\mathrm{Null})}$ contains every world $\Omega_{tv.l}^{w.j.k(\mathrm{Null})}$ whose true events cover two or more of the null's possible values, i.e., $a(\Omega_{tv.l}^{w.j.k(\mathrm{Null})})$ with $\big|\bigcup_{Nl_g=Nl_1}^{Nl_q} a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T\big| \geq 2$.
Each world obtained from this operation is represented as a pair of multiple possible value events with their associated reliability score sets, as previously stated in Equation (6), such that each world consists of $a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T/F$ and $\mu_{a_{k.g.Nl_g}^{w.j.\ell_{ih}}.T/F}$ events.

5.1.2. The Obtained Possible-Worlds Based on the Data Fusion Cases

The sample spaces stated in Equations (6) and (7) are utilized to obtain the possible-worlds sets for the data fusion cases. A possible-worlds set can be equal to or smaller than its sample space due to the presence of impossible worlds $(Iws^{Case.w.j.k})$. For instance, alternatives containing multiple true events are considered possible worlds only under the MTC cases, whereas alternatives with at most one true event are recognized as possible worlds under the ITC cases. Given the data fusion cases, the production of a possible-worlds set from an $\Omega_{w.j.k}$ sample space is presented in Equation (8), while Table 4 shows the possible-worlds production rules for each data fusion case presented earlier in Table 1.
$$\begin{aligned}
Pws_{Tv}^{Case.w.j.k} &= \Omega_{w.j.k} - Iws^{Case.w.j.k} = \bigcup_{p=1}^{P}Pws_{tv.p} : 1 \leq p \leq P,\ Pws_{Tv}^{Case.w.j.k} \subseteq \Omega_{w.j.k},\ \&\ Pws_{tv.p} \in (\Omega_{tv.l} - Iws^{w.j.k}),\\
&\quad (Pws_{Tv}^{Case.w.j.k} \ni Pws_{tv.p} \in \Omega_{w.j.k}),\ \forall Pws_{tv.p} \notin Iws^{w.j.k},\ \text{and}\ a(\Omega_{tv.l}) \equiv a(Pws_{tv.p}),\ \mu(\Omega_{tv.l}) \equiv \mu(Pws_{tv.p})\\
&\quad \Leftrightarrow \Omega_{tv.l} = Pws_{tv.p}\ \&\ \Omega_{tv.l} \in Pws_{Tv}^{Case.w.j.k} \quad (8)
\end{aligned}$$
Example 2: To illustrate the sample space and possible-worlds production based on the data fusion cases, the data value domain and sample space of $\mathrm{Dom}\ A_3^{1.1}$ in Table 2 and Table 3 are used. Using the $\mathrm{Dom}\ A_3^{1.1}$ domain in Table 2 and the sample space production in Table 3, the following sample space alternatives are generated: $\Omega_{1.1.3} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.3}, \Omega_{tv.4}\}$, where
• $\Omega_{tv.1} = \big((a_{3.1}^{1.1.(13,21)}.T,\ a_{3.2}^{1.1.(31)}.T),\ (\mu_{a_{3.1}^{1.1.(13,21)}.T},\ \mu_{a_{3.2}^{1.1.(31)}.T})\big) = \big(a(\Omega_{tv.1}),\ \mu(\Omega_{tv.1})\big) = \big((12224\ Ventura\ Blvd_{3.1}^{1.1.(13,21)},\ 12335\ Fienega\ Blvd_{3.2}^{1.1.(31)}),\ (0.9, 0.6, 0.45)\big)$.
• $\Omega_{tv.2} = \big((a_{3.1}^{1.1.(13,21)}.T,\ a_{3.2}^{1.1.(31)}.F),\ (\mu_{a_{3.1}^{1.1.(13,21)}.T},\ \mu_{a_{3.2}^{1.1.(31)}.F})\big) = \big(a(\Omega_{tv.2}),\ \mu(\Omega_{tv.2})\big) = \big((12224\ Ventura\ Blvd_{3.1}^{1.1.(13,21)}),\ (0.9, 0.6, 0.55)\big)$.
• $\Omega_{tv.3} = \big((a_{3.1}^{1.1.(13,21)}.F,\ a_{3.2}^{1.1.(31)}.T),\ (\mu_{a_{3.1}^{1.1.(13,21)}.F},\ \mu_{a_{3.2}^{1.1.(31)}.T})\big) = \big(a(\Omega_{tv.3}),\ \mu(\Omega_{tv.3})\big) = \big((12335\ Fienega\ Blvd_{3.2}^{1.1.(31)}),\ (0.1, 0.4, 0.45)\big)$.
• $\Omega_{tv.4} = \big((a_{3.1}^{1.1.(13,21)}.F,\ a_{3.2}^{1.1.(31)}.F),\ (\mu_{a_{3.1}^{1.1.(13,21)}.F},\ \mu_{a_{3.2}^{1.1.(31)}.F})\big) = \big(a(\Omega_{tv.4}),\ \mu(\Omega_{tv.4})\big) = \big((Unknown),\ (0.1, 0.4, 0.55)\big)$.
After producing the above sample space, the possible-worlds production can be generated for a data fusion case based on the production rules presented in Table 4. Accordingly, the possible-worlds sets that can be observed in a given data fusion case are listed below:
-
If the data fusion case is MTC-OWA (C1.2), then multiple data values can be true at the same time, and it is possible to have the true value that does not exist in D O M   A 3 1.1 domain. Therefore, the generated possible worlds ( P w s ( C 1.2 ) . ( 1.1.3 ) ) will be equal to the sample space of Ω 1.1.3 as shown below:
  • P w s T v ( C 1.2 ) . ( 1.1.3 ) = Ω 1.1.3 = { Ω t v . 1 , Ω t v .2 , Ω t v .3 , Ω t v .4 } = { P w s t v . 1 ,   P w s t v .2 ,   P w s t v .3 ,   P w s t v .4 } :   P w s t v . 1 = Ω t v . 1 = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) , 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.9 , 0.6 , 0.45 ) ) ,   P w s t v .2 = Ω t v .2 = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 , 0.55 ) ) ,   P w s t v .3 = Ω t v .3 = ( ( 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 , 0.45 ) ) ,   &   P w s t v .4 = Ω t v .4 = ( (   U n k n o w n ) , ( 0.1 , 0.4 , 0.55 ) ) .
-
If the data fusion case is MTC-CWA (C1.1), then multiple data values can be true at the same time, and it is not possible to have a true value that does not exist in the $\mathrm{DOM}\ A_3^{1.1}$ domain. Therefore, the generated possible worlds ( P w s ( C 1.1 ) . ( 1.1.3 ) ) would be as shown below:
  • P w s T v ( C 1.1 ) . ( 1.1.3 ) = Ω 1.1.3 I w s ( C 1.1 ) . ( 1.1.3 ) :   I w s C 1.1 . w . j . k = { Ω t v .4 } P w s T v ( C 1.1 ) . ( 1.1.3 ) = { P w s t v . 1 ,   P w s t v .2 ,   P w s t v .3 } :   P w s t v . 1 =   Ω t v . 1 = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) , 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.9 , 0.6 , 0.45 ) ) ,   P w s t v .2 = Ω t v .2 = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 , 0.55 ) ) ,   P w s t v .3 = Ω t v .3 = ( ( 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 , 0.45 ) ) .
-
If the data fusion case is ITCC-CWA (C2.1.1), then one data value can be true at a time, and it is not possible to have the true value from outside D O M   A 3 1.1 domain. Therefore, the generated possible worlds ( P w s ( C 2.1.1 ) . ( 1.1.3 ) ) would be as shown below:
  • P w s T v ( C 2.1.1 ) . ( 1.1.3 ) = Ω 1.1.3 I w s ( C 2.1.1 ) . ( 1.1.3 ) : I w s ( C 2.1.1 ) . ( 1.1.3 ) = { Ω t v . 1 , Ω t v .4 }   P w s T v ( C 2.1.1 ) . ( 1.1.3 ) = { P w s t v . 1 ,   P w s t v .2 } :   P w s t v . 1 = Ω t v .2 = ( ( 12224   Ventura   Blvd 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 , 0.55 ) ) ,   P w s t v .2 = Ω t v .3 = ( ( 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 , 0.45 ) ) .
-
If the data fusion case is ITCC-OWA (C2.1.2), then one data value can be true at a time, and it is possible to have the true value from outside D O M   A 3 1.1 domain. Therefore, the generated possible worlds ( P w s ( C 2.1.2 ) . ( 1.1.3 ) ) would be as shown below:
  • P w s T v ( C 2.1.2 ) . ( 1.1.3 ) = Ω 1.1.3 I w s ( C 2.1.2 ) . ( 1.1.3 ) : I w s ( C 2.1.2 ) . ( 1.1.3 ) = { Ω t v . 1 }   P w s T v ( C 2.1.2 ) . ( 1.1.3 ) = { P w s t v . 1 ,   P w s t v .2 , P w s t v .3 } :   P w s t v . 1 = Ω t v .2 = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 , 0.55 ) ) ,   P w s t v .2 = Ω t v .3 = ( ( 12335   F i e n e g a   B l v d 3.2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 , 0.45 ) ) ,   &   P w s t v .3 = Ω t v .4 = ( (   U n k n o w n ) , ( 0 . 1 , 0 . 4 , 0 . 55 ) )
To illustrate the uncertainty data fusion cases, we assume that the data value $a_{3.2}^{31}$ in the $\mathrm{DOM}\ A_3^{1.1}$ domain is null and comes with additional knowledge about its possible domain of values, $\mathrm{Dom}\ a_{3.2.Nl}^{1.1.(31)} = \{a_{3.2.Nl_1}, a_{3.2.Nl_2}\}$, such that $a_{3.2.Nl_1} = 114\ Eqaila\ Blvd$ and $a_{3.2.Nl_2} = 105\ Eqaila\ Blvd$. Based on Equation (7), the data fusion sample space sets under uncertainty are outlined below:
Ω 1.1.3 = ( Ω 1.1.3 ( NonNull ) × Ω 1.1.3 ( Null ) ) : Where
Ω 1.1.3 ( NonNull ) = { ( a 3.1 1.1 . T , μ a 3.1 1.1 . T ) , ( a 3.1 1.1 . F , μ a 3.1 1.1 . F ) }         = { ( 12224   V e n t u r a   B l v d 3.1 1.1 , ( 0.9 , 0.6 ) ) , (   U n k n o w n , ( 0.1 , 0.4 ) ) }
Ω 1.1.3 ( Null ) = { ( ( a 3.2 . N l 1 1.1 . ( 31 ) . T ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.2 . N l 1 1.1 . ( 31 ) . T , μ a 3.2 . N l 1 1.1 . ( 31 ) . F ) ) , ( ( a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . T ) , ( μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 1 1.1 . ( 31 ) . T ) ) , ( ( a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 1 1.1 . ( 31 ) . F ) ) }
Ω 1.1.3 ( Null ) = { ( ( 114   E q a i l a   B l v d 3.2 . N l 1 ) , ( 0.45 , 0.55 ) ) , ( ( 105   E q a i l a   B l v d 3.2 . N l 2 ) , ( 0.55 , 0.45 ) ) , ( (   U n k n o w n ) , ( 0.55 , 0.55 ) ) } Ω 1.1.3 = { Ω t v . 1 , Ω t v .2 , Ω t v .3 , Ω t v .4 , Ω t v .5 , Ω t v .6 } :
-
Ω t v . 1 = ( ( a 3.1 1.1 . ( 13 , 21 ) . T , a 3.2 . N l 1 1.1 . ( 31 ) . T ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . T , μ a 3.2 . N l 1 1.1 . ( 31 ) . T , μ a 3.2 . N l 2 1.1 . ( 31 ) . F ) ) = ( ( 12224   V e n t u r a   B l v d 3.1 1.1 . ( 13 , 21 ) , 114   E q a i l a   B l v d 3.2 . N l 1 1.1 . ( 31 )   ) , ( 0.9 , 0.6 ,   0.45 , 0.55 ) ) .
-
Ω t v .2 = ( ( a 3.1 1.1 . ( 13 , 21 ) . T , a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . T ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . T , μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 2 1.1 . ( 31 ) . T ) ) = ( ( 12224   Ventura   Blvd 3.1 1.1 . ( 13 , 21 ) , 105   Eqaila   Blvd 3.2 . Nl 2 1.1 . ( 31 )   ) , ( 0.9 , 0.6 ,   0.55 , 0.45 ) ) . Ω t v .3 = ( ( a 3.1 1.1 . ( 13 , 21 ) . T , a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . T , μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 2 1.1 . ( 31 ) . F ) ) = ( ( 12224   Ventura   Blvd 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 ,   0.55 , 0.55 ) ) .
-
Ω t v .4 = ( ( a 3.1 1.1 . ( 13 , 21 ) . F , a 3.2 . N l 1 1.1 . ( 31 ) . T ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . F , μ a 3.2 . N l 1 1.1 . ( 31 ) . T , μ a 3.2 . N l 2 1.1 . ( 31 ) . F ) ) = ( ( 114   E q a i l a   B l v d 3.2 . N l 1 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.45 , 0.55 ) ) .
-
Ω t v .5 = ( ( a 3.1 1.1 . ( 13 , 21 ) . F , a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . T ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . F , μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 2 1.1 . ( 31 ) . T ) ) = ( ( 105   E q a i l a   B l v d 3.2 . N l 2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.55 , 0.45 ) ) .
-
Ω t v .6 = ( ( a 3.1 1.1 . ( 13 , 21 ) . F , a 3.2 . N l 1 1.1 . ( 31 ) . F ,   a 3.2 . N l 2 1.1 . ( 31 ) . F ) , ( μ a 3.1 1.1 . ( 13 , 21 ) . F , μ a 3.2 . N l 1 1.1 . ( 31 ) . F , μ a 3.2 . N l 2 1.1 . ( 31 ) . F ) ) = ( (   U n k n o w n ) , ( 0.1 , 0.4 ,   0.55 , 0.55 ) )
Below are possible worlds for the data fusion’s uncertainty cases as observed based on the above sample space sets:
-
If the data fusion case is ITCU-CWA (C2.2.1), then one data value can be true at a time, and it is not possible to have the true value from outside D O M   A 3 1.1 domain. Therefore, the generated possible worlds ( P w s ( C 2.2.1 ) . ( 1.1.3 ) ) would be as shown below:
P w s T v ( C 2.2.1 ) . ( 1.1.3 ) = Ω 1.1.3 I w s ( C 2.2.1 ) . ( 1.1.3 ) : I w s ( C 2.2.1 ) . ( 1.1.3 ) = { Ω t v . 1 , Ω t v .2 , Ω t v .4 } P w s T v ( C 2.2.1 ) . ( 1.1.3 ) = { P w s t v . 1 ,   P w s t v .2 ,   P w s t v .3 } :
  • P w s t v . 1 = Ω t v .3 = ( ( 12224   Ventura   Blvd 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 ,   0.55 , 0.55 ) ) ,  
  • P w s t v .2 = Ω t v .4 = ( ( 114   E q a i l a   B l v d 3.2 . N l 1 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.45 , 0.55 ) ) ,
  • P w s t v .3 = Ω t v .5 = ( ( 105   Eqaila   Blvd 3.2 . Nl 2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.55 , 0.45 ) )
-
If the data fusion case is ITCU-OWA (C2.2.2), then one data value can be true at a time and it is possible to have the true value from outside D O M   A 3 1.1 domain. Therefore, the generated possible worlds ( P w s ( C 2.2.2 ) . ( 1.1.3 ) ) would be as shown below:
P w s T v ( C 2.2.2 ) . ( 1.1.3 ) = Ω 1.1.3 I w s ( C 2.2.2 ) . ( 1.1.3 ) : I w s ( C 2.2.2 ) . ( 1.1.3 ) = { Ω t v . 1 , Ω t v .2 }   P w s T v ( C 2.2.2 ) . ( 1.1.3 ) = { P w s t v . 1 ,   P w s t v .2 , P w s t v .3 , P w s t v .4 } :  
  • P w s t v . 1 = Ω t v .3 = ( ( 12224   Ventura   Blvd 3.1 1.1 . ( 13 , 21 ) ) , ( 0.9 , 0.6 ,   0.55 , 0.55 ) ) ,  
  • P w s t v .2 = Ω t v .4 = ( ( 114   E q a i l a   B l v d 3.2 . N l 1 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.45 , 0.55 ) ) ,
  • P w s t v .3 = Ω t v .5 = ( ( 105   Eqaila   Blvd 3.2 . Nl 2 1.1 . ( 31 ) ) , ( 0.1 , 0.4 ,   0.55 , 0.45 ) ) ,
  • P w s t v .4 = Ω t v .6 = ( (   U n k n o w n ) , ( 0.1 , 0.4 ,   0.55 , 0.55 ) )
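The impossible-world elimination illustrated above can be sketched as a simple filter over the produced sample space (our own reading of the production rules, inferred from the case definitions and the worlds kept or dropped in Example 2; Table 4 remains the authoritative statement). A world is characterised by how many values it asserts true, with the all-false world standing for the SOV/Unknown alternative.

def filter_worlds(omega, case):
    """Drop the impossible worlds of a sample space for a given fusion case
    (Equation (8)). A world is (values asserted true, scores); ("Unknown",)
    marks the all-false world."""
    def n_true(world):
        values, _ = world
        return 0 if values == ("Unknown",) else len(values)
    rules = {
        "MTC-CWA":  lambda w: n_true(w) >= 1,  # at least one listed value is true
        "MTC-OWA":  lambda w: True,            # every world is possible
        "ITCC-CWA": lambda w: n_true(w) == 1,  # exactly one listed value is true
        "ITCC-OWA": lambda w: n_true(w) <= 1,  # one listed value or SOV is true
        "ITCU-CWA": lambda w: n_true(w) == 1,  # as ITCC-CWA over the null-expanded domain
        "ITCU-OWA": lambda w: n_true(w) <= 1,  # as ITCC-OWA over the null-expanded domain
    }
    return [w for w in omega if rules[case](w)]

# The four worlds of Omega_{1.1.3} from Example 2 (values asserted true, scores).
omega = [
    (("12224 Ventura Blvd", "12335 Fienega Blvd"), (0.9, 0.6, 0.45)),
    (("12224 Ventura Blvd",), (0.9, 0.6, 0.55)),
    (("12335 Fienega Blvd",), (0.1, 0.4, 0.45)),
    (("Unknown",), (0.1, 0.4, 0.55)),
]
print(len(filter_worlds(omega, "ITCC-OWA")))  # 3 worlds, matching Pws_Tv(C2.1.2).(1.1.3)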
Therefore, a probabilistic global entity with its actual data values and reliability scores is created, and the updated reliability scores for those data values can be computed based on the obtained possible-worlds sets and their recognized data fusion cases. The next section shows the constructed probabilistic data fusion computational method.

5.2. The Probabilistic Data Fusion Computational Method

Having conceptually represented the merging of the actual data values of a probabilistic global entity as multiple alternatives of possible true data fusion answers within a $Pws_{Tv}^{Case.w.j.k}$ set, in this section we formally construct the data fusion computation method and show how we leverage the trustworthiness of sources in truth discovery.
The data fusion method computes the conditional probability, i.e., the updated reliability score, of a possible world of data values that is most likely the true fused answer, using the $\mu(\Omega_{tv.l})$ scores and a given data fusion case. Thus, the conditional form $\mu\big(a_{k.Tv}^{w.j} = a(\Omega_{tv.l}) \mid Case\big) : \Omega_{tv.l} \in Pws_{Tv}^{Case.w.j.k},\ \Omega_{tv.l} \equiv Pws_{tv.p},\ a(\Omega_{tv.l}) \equiv a(Pws_{tv.p}),\ \&\ \mu(\Omega_{tv.l}) \equiv \mu(Pws_{tv.p})$ is used to represent a possible world of data values that is most likely the true fusion answer, together with its conditional probability value, which requires further computation. In this form, the $Case$ condition refers to a given possible-worlds set with respect to the fusion cases presented in Table 1, $Case \in \{\text{MTC-CWA}, \text{MTC-OWA}, \text{ITCC-CWA}, \text{ITCC-OWA}, \text{ITCU-CWA}, \text{ITCU-OWA}\}$. To this end, the assumptions listed in Section 3.6 are considered.
To obtain a single reliability score, i.e., the conditional probability value, for each possible data fusion alternative within a particular $\mathrm{Dom}\ A_k^{w.j}$ and $nDO_{w.j}$ alternative, the data fusion computational formula is constructed in Equation (9). This formula is built using Bayes' theorem, the probability distribution from $\mu(Pws_{tv.p})$, and Assumptions 1 to 3 (refer to Section 3.6); the detailed derivations are given in Appendix A:
$$\begin{aligned}
\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big) &= \frac{\mu\big((a_{k.Tv}^{w.j} = a(Pws_{tv.p})) \cap Pws_{Tv}^{Case.w.j.k}\big)}{\mu\big(Case = Pws_{Tv}^{Case.w.j.k}\big)} = \frac{\mu\big((a_{k.Tv}^{w.j} = a(Pws_{tv.p})) \cap Pws_{Tv}^{Case.w.j.k}\big)}{\mu\big(Pws_{Tv}^{Case.w.j.k}\big)}\\
&= \frac{\mu(Pws_{tv.p})}{\sum_{p=1}^{P}\mu(Pws_{tv.p})} = \frac{\mu(Pws_{tv.p})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \cdots + \mu(Pws_{tv.P})}
\end{aligned}$$
Substituting for $\mu(\Omega_{tv.l}^{w.j.k}) \equiv \mu(Pws_{tv.p})$ from Equations (6) and (8), we get:
$$\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big) = \frac{\prod_{g_{w.j}=1}^{q_{w.j}}\prod_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g}^{w.j.\ell_{ih}}.T\backslash F}\big)_{tv.p}}{\prod_{g_{w.j}=1}^{q_{w.j}}\prod_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g}^{w.j.\ell_{ih}}.T\backslash F}\big)_{tv.1} + \prod_{g_{w.j}=1}^{q_{w.j}}\prod_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g}^{w.j.\ell_{ih}}.T\backslash F}\big)_{tv.2} + \cdots + \prod_{g_{w.j}=1}^{q_{w.j}}\prod_{g_{ih}=1_{11}}^{q_{nm}}\big(\mu_{a_{k.g}^{w.j.\ell_{ih}}.T\backslash F}\big)_{tv.P}} \quad (9)$$
where $\sum_{p=1}^{P}\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big) = 1$.
This computation is applied to all worlds $Pws_{tv.1}, Pws_{tv.2}, \ldots, Pws_{tv.P} \in Pws_{Tv}^{Case.w.j.k}$ observed from the $\Omega_{w.1.k}, \ldots, \Omega_{w.f.k}$ sets. Thus, the possible-worlds sets of correct values and their posterior reliability scores are obtained for all merging alternatives that assemble the processed $nDO_w$ entity. In other words, a probabilistic global entity with its actual values is obtained, and the updated reliability scores for the requested attribute values are computed accordingly.
Example 3. To illustrate the computation method for finding the probabilistic true data fusion answer within a specific $Pws_{Tv}^{Case.w.j.k}$ set, this example continues from the possible-worlds sets obtained in Example 2 under the ITCC-OWA (C2.1.2) data fusion case.
Since $Pws_{(C2.1.2).(1.1.3)} = \{Pws_{tv.1}, Pws_{tv.2}, Pws_{tv.3}\}$, where $Pws_{tv.1} = ((12224\ Ventura\ Blvd_{3.1}^{1.1.(13,21)}), (0.9, 0.6, 0.55))$, $Pws_{tv.2} = ((12335\ Fienega\ Blvd_{3.2}^{1.1.(31)}), (0.1, 0.4, 0.45))$, and $Pws_{tv.3} = ((Unknown), (0.1, 0.4, 0.55))$, the conditional probability, i.e., the updated reliability score, of a specific possible world's data value being true is computed as follows.
$$Pws_{Tv}^{(C2.1.2).(1.1.3)} = \big\{a(Pws_{tv.p}),\ \mu(Pws_{tv.p})\big\} = \Big\{\big((12224\ Ventura\ Blvd_{3.1}^{1.1.(13,21)}),\ (0.9, 0.6, 0.55)\big),\ \big((12335\ Fienega\ Blvd_{3.2}^{1.1.(31)}),\ (0.1, 0.4, 0.45)\big),\ \big((Unknown),\ (0.1, 0.4, 0.55)\big)\Big\}$$
• $\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}\big) = \frac{\mu(Pws_{tv.1})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \frac{0.9 \times 0.6 \times 0.55}{(0.9 \times 0.6 \times 0.55) + (0.1 \times 0.4 \times 0.45) + (0.1 \times 0.4 \times 0.55)} = \frac{0.297}{0.297 + 0.018 + 0.022} = \frac{0.297}{0.337} \approx 0.881$.
• $\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.2}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}\big) = \frac{\mu(Pws_{tv.2})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \frac{0.018}{0.337} \approx 0.053$.
• $\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.3}) \mid Pws_{Tv}^{(C2.1.2).(1.1.3)}\big) = \frac{\mu(Pws_{tv.3})}{\mu(Pws_{tv.1}) + \mu(Pws_{tv.2}) + \mu(Pws_{tv.3})} = \frac{0.022}{0.337} \approx 0.065$.
Based on the updated reliability score of each data value, the probability of $12224\ Ventura\ Blvd$ being the true value equals 0.881, the probability of $12335\ Fienega\ Blvd$ being the true value equals 0.053, and the probability that none of these values is true equals 0.065. We conclude that the most likely correct address value, i.e., the data fusion answer for the global entity $nDO_{1.1} = (rDO_1^{13}: pDO_{11}^{21}, pDO_{12}^{31})[0.8245] \equiv (Arts\ Delicate^{(13,21,31)}, 0.8245)$, is $12224\ Ventura\ Blvd$ with probability 0.881.
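The computation of Example 3 can be reproduced with the short sketch below (our own illustration of Equation (9); the possible worlds are transcribed from $Pws_{Tv}^{(C2.1.2).(1.1.3)}$ above): the weight of a world is the product of its reliability scores, normalised over the possible-worlds set.

from math import prod

def fuse(possible_worlds):
    """Updated reliability score of each possible world (Equation (9)):
    the product of its scores, normalised over the whole set."""
    weights = [prod(scores) for _, scores in possible_worlds]
    total = sum(weights)
    return [(values, w / total) for (values, _), w in zip(possible_worlds, weights)]

# Pws_Tv(C2.1.2).(1.1.3) from Example 2 (ITCC-OWA case).
pws = [
    (("12224 Ventura Blvd",), (0.9, 0.6, 0.55)),
    (("12335 Fienega Blvd",), (0.1, 0.4, 0.45)),
    (("Unknown",),            (0.1, 0.4, 0.55)),
]
for values, p in fuse(pws):
    print(values, round(p, 3))
# ('12224 Ventura Blvd',) 0.881
# ('12335 Fienega Blvd',) 0.053
# ('Unknown',) 0.065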

5.3. Probability to Possibility Transformation Method

The main reason for the transformation method is to allow a user to choose the range of true fused value alternatives that they are willing to retrieve and view. This method has the advantage of offering effective retrieval: fewer but more likely alternatives can be retrieved using a threshold value chosen by the user. In addition, possibility theory has the advantage over probability theory of providing a more efficient information retrieval and ranking strategy by using a possibility threshold value instead of a probability threshold value [93,94].
In fact, determining a probabilistic threshold value for retrieving alternatives is difficult due to the variety of probability distribution value ranges. By using a possibility threshold, the user can select any possibility value as the threshold ($\beta$) for retrieving the data fusion answer alternatives. Therefore, only alternatives whose possibility values equal or exceed the selected threshold are retrieved.
This transformation method divides the probability value of each true fused value alternative belonging to a particular possible-worlds set of $(nDO_{w.j})$ by the highest probability value among them. Thus, the transformed possibility value of the alternative with the maximum probability distribution value equals one. The formula below shows the transformation computation.
$$Pos\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big) = \frac{\mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big)}{\max_{p}\ \mu\big(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}\big)} : Pws_{tv.p} \in Pws_{Tv}^{Case.w.j.k} \quad (10)$$
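A minimal sketch of this transformation (our own illustration of Equation (10), reusing the fused probabilities of Example 3 and a hypothetical threshold $\beta$) is shown below.

def to_possibility(fused):
    """Transform updated reliability (probability) scores into possibility
    values by dividing each score by the maximum score (Equation (10))."""
    top = max(p for _, p in fused)
    return [(value, p / top) for value, p in fused]

# Fused alternatives and probabilities from Example 3 (Equation (9)).
fused = [("12224 Ventura Blvd", 0.881),
         ("12335 Fienega Blvd", 0.053),
         ("Unknown", 0.065)]
beta = 0.1  # hypothetical user-chosen possibility threshold
for value, pos in to_possibility(fused):
    if pos >= beta:
        print(value, round(pos, 2))
# Only "12224 Ventura Blvd" (possibility 1.0) is retrieved for beta = 0.1.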

6. Proof of Concept: Model Implementation and Discussion

The data fusion computational method has been mathematically proved, as discussed in Appendix A. This proof is established by considering the integration of data values associated with different reliability scores and obtained from three different data sources. The data fusion method has been implemented in our probabilistic integration system as extended merging and computation functions that operate at the attribute level to handle the two inconsistent true value cases under the CWA (i.e., ITCC-CWA (C2.1.1) and ITCU-CWA (C2.2.1)). In this implementation, the data values of a global attribute are matched and grouped under one domain. The decision model of this data fusion method generates a set of possible worlds of fused data value alternatives, in which each possible world is associated with an updated reliability value, i.e., $(a(Pws_{tv.p}), \mu(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \mid Pws_{Tv}^{Case.w.j.k}))$. The updated value is computed based on Equation (9). The updated reliability values within each entity merging alternative are then transformed into possibility values using Equation (10). This transformation is constructed to facilitate the retrieval process based on the user's possibility threshold values.
To assess the feasibility of the proposed data fusion approach and to show how the data fusion results may change as new pieces of evidence are obtained, Figure 5 and Figure 6 are presented. Figure 5 illustrates the decision outputs of the data fusion method under the ITCC-CWA case. Figure 5c shows the possible fused values for the City attribute at each possible entity merging alternative, as produced from the entity merging set {(RO1(1,1,1)): PO1(2,1,1) [1.0], PO2(3,1,2) [1.0]}, where the participating entities with their original matching tree are presented in Figure 5a and their probabilistic entity merging result is shown in Figure 5b. These entities originate from three participating data sources, where the reliability score of the "Amsterdam" value from the first source equals 0.8, the reliability score of the "Enschede" value from the second source equals 0.8, and the reliability score of the "Georgia" value from the third source equals 0.4, i.e., $a_{2.1}^{11} = \text{Amsterdam},\ \mu_{a_{2.1}^{11}} = 0.8$; $a_{2.2}^{21} = \text{Enschede},\ \mu_{a_{2.2}^{21}} = 0.8$; and $a_{2.3}^{32} = \text{Georgia},\ \mu_{a_{2.3}^{32}} = 0.4$. Using the merged entity alternative nDO1.1.1 and the reliability scores of the City values that correspond to it, the updated reliability score for each possible fused alternative is computed using Equations (9) and (10), and the corresponding values are presented in the attribute's probability and possibility columns in Figure 5c.
From Figure 5b,c, we can state that RO1, PO1, and PO2 represent the same RWO, named "Peter Pan", with three different possible alternatives for the name of the city where "Peter Pan" lives. The original reliability scores, as obtained from the data sources, are 0.8 for "Amsterdam" and "Enschede" and 0.4 for "Georgia". The updated reliability score, i.e., the conditional probability of a certain attribute value being the true fused value as obtained from Equation (9), is approximately 0.46 for the "Amsterdam" and "Enschede" values and approximately 0.08 for "Georgia". Based on these updated reliability scores, the possibility values obtained from Equation (10) indicate that "Amsterdam" and "Enschede" share the same possibility of being the true fused value of where "Peter Pan" lives, i.e., $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.1}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.2}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) = 1$, while the possibility of "Peter Pan" living in "Georgia" is low, i.e., $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.3}) \mid Pws_{Tv}^{(C2.1.1).1.1.2}) \approx 0.08/0.46 \approx 0.17$. Given the evidence from the participating entities and their attribute value reliability scores, we can state that "Peter Pan" most likely lives in either "Amsterdam" or "Enschede", each with a probability of 0.46 and a possibility of 1.0.
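The probabilities reported above can be reproduced with the following sketch (our own illustration of the ITCC-CWA computation; the City values and source scores are those stated for Figure 5): for each candidate, the exactly-one-true world multiplies the candidate's scores by the complements of all the other candidates' scores before normalising, as in Equation (9).

from math import prod

def itcc_cwa(domain):
    """Exactly-one-true fusion (ITCC-CWA): the weight of each candidate value
    is the product of its scores and the complements of all other values'
    scores, normalised as in Equation (9)."""
    weights = []
    for i, (value, scores) in enumerate(domain):
        others = [s for j, (_, ss) in enumerate(domain) if j != i for s in ss]
        weights.append((value, prod(scores) * prod(1 - s for s in others)))
    total = sum(w for _, w in weights)
    return [(value, w / total) for value, w in weights]

# City domain of the Figure 5 example (value, reliability scores).
city = [("Amsterdam", (0.8,)), ("Enschede", (0.8,)), ("Georgia", (0.4,))]
for value, p in itcc_cwa(city):
    print(value, round(p, 2))
# Amsterdam 0.46, Enschede 0.46, Georgia 0.08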
To show how the data fusion computational results may change with new information/evidence, a new data source containing information about a person named "P. Perter" is added to the participating data sources and entities of the example presented in Figure 5. This new entity has a city value of "Amsterdam" with a 0.7 original reliability score. The data fusion alternative results after considering the additional information from the new data source are presented in Figure 6.
In Figure 6, the decision outputs of the data fusion method under the ITCC-CWA case are illustrated. Figure 6c shows the possible fused values for the City attribute at each possible entity merging alternative, as produced from the entity merging set {(RO1(1,1,1)): PO1(2,1,1) [1.0], PO2(3,1,2) [1.0], PO3(4,1,2) [0.96]}, where the participating entities with their original matching tree are presented in Figure 6a and their probabilistic entity merging result is shown in Figure 6b. These entities originate from four participating data sources, where the reliability score of the "Amsterdam" value is 0.8 from the first source and 0.7 from the fourth source, the reliability score of the "Enschede" value from the second source is 0.8, and the reliability score of the "Georgia" value from the third source is 0.4, i.e., $a_{2.1}^{11,42} = \text{Amsterdam},\ \mu_{a_{2.1}^{11,42}} = (0.8, 0.7)$; $a_{2.2}^{21} = \text{Enschede},\ \mu_{a_{2.2}^{21}} = 0.8$; and $a_{2.3}^{32} = \text{Georgia},\ \mu_{a_{2.3}^{32}} = 0.4$. Using the merged entity alternatives nDO1.1.1 and nDO1.1.2 and the reliability scores of the City values that correspond to them, the updated reliability score under each entity merging alternative and for each possible fused alternative is computed using Equations (9) and (10), and the corresponding values are presented in the attribute's probability and possibility columns in Figure 6c.
Figure 6c shows two possible entity-merging alternatives, i.e., nDO1.1.1 and nDO1.1.2. For nDO1.1.1, the four participating entities correspond to the same person, "Peter Pan", where the "Amsterdam" value is obtained from the first source and the newly added fourth source, with a union reliability score of (0.8, 0.7). Because of the new information from the fourth data source, the updated reliability scores shown in Figure 6c differ from those presented in Figure 5c. Accordingly, the updated reliability score of "Amsterdam" being the true fused city value for the "Peter Pan" entity is approximately 0.67, while the probability of "Enschede" becomes approximately 0.28 and that of "Georgia" approximately 0.08. Therefore, "Amsterdam" has the highest possibility of being the true fused value of where "Peter Pan" lives, i.e., $Pos(a_{2.Tv}^{1.1} = a(Pws_{tv.1}) \mid PwsTv_{(C2.1.1).1.1.2}) = 1$, whereas the possibility of "Enschede" or "Georgia" being the true fused value becomes 0.4 or 0.07, respectively. Based on this information, we can state that "Peter Pan" most likely lives in "Amsterdam", with a probability of 0.67 and a possibility of 1.0. For the second entity-merging alternative, nDO1.1.2, three participating entities correspond to the same person "Peter Pan", while the PO3(4,1,2) entity originating from the fourth source does not belong to the generated global entity of "Peter Pan", i.e., {(RO1(1,1,1)): PO1(2,1,1), PO2(3,1,2)},{PO3(4,1,2)}[0.037]. The participating entities in nDO1.1.2 are the same ones presented in nDO1.1.1 of Figure 5; hence, the updated reliability scores for the City value alternatives within the nDO1.1.2 merging alternative are the same as those presented in Figure 5c.
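Continuing the hypothetical sketch given earlier, the arrival of the fourth source is captured simply by appending its reliability score to the list of scores for "Amsterdam"; re-running the same illustrative function shifts most of the probability mass toward "Amsterdam", in line with the nDO1.1.1 alternative of Figure 6c.

```python
# New evidence: the "P. Perter" record from the fourth source also reports Amsterdam (0.7)
prob, poss = itcc_cwa_fusion({"Amsterdam": [0.8, 0.7], "Enschede": [0.8], "Georgia": [0.4]})
print(prob)   # Amsterdam ≈ 0.67; the remaining mass is split between Enschede and Georgia
print(poss)   # Amsterdam now carries the maximal possibility of 1.0
```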
Figure 7 illustrates the decision outputs of the data fusion method under the ITCU-CWA case for the Phone attribute, where two possible entity-merging alternatives were observed, i.e., nDO1.3.1 and nDO1.3.2. Figure 7c shows the possible fused values for the Phone attribute under each possible entity-merging alternative, as produced from the entity-merging set {(RO3(1,1,3)): PO1(2,1,3) [1.0], PO2(3,1,3) [0.73]}; the participating entities with their original matching tree are presented in Figure 7a, and their probabilistic entity-merging result is shown in Figure 7b. These entities originate from three participating data sources: the reliability score of the "622,222,222" value is 0.8 as obtained from the first data source and 0.7 as obtained from the second, while the phone value of the record obtained from the third data source is unknown, i.e., NULL, with a reliability score of 0.8; that is, $a_{3.1}^{13} = a_{3.1}^{23} = 622222222$, $\mu_{a_{3.1}^{13}} = 0.8$, $\mu_{a_{3.1}^{23}} = 0.7$, $a_{3.2}^{33} = \text{NULL}$, $\mu_{a_{3.2}^{33}} = 0.8$. Using the merged-entities alternatives nDO1.3.1 and nDO1.3.2 and the reliability scores of the Phone values that correspond to them, the updated reliability score under each entity-merging alternative and for each possible fused-value alternative is computed using Equations (9) and (10); the resulting values appear in the attribute's probability and possibility columns of Figure 7c.
In Figure 7c, two possible entity-merging alternatives are stated, and the fused value of the Phone attribute is conditionally computed for each. For instance, the nDO1.3.1 alternative indicates that the three participating entities correspond to the same person, "John Doe", with an approximate probability of 0.73 and a possibility of 1.0. In this alternative, the "622,222,222" phone value is obtained from the two records belonging to the first and second sources, so its original reliability score is (0.8, 0.7), while that of the unknown "NULL" value obtained from the third data source is 0.8. Accordingly, the updated reliability scores computed from Equation (9) are 0.7 for the "622,222,222" value and 0.3 for the unknown "NULL" value. Based on these updated scores, the possibility values obtained from Equation (10) indicate that "622,222,222" has the highest possibility of being the true phone number of the "John Doe" global entity, i.e., $Pos(a_{3.Tv}^{3.1} = a(Pws_{tv.1}) \mid PwsTv_{(C2.2.1).3.1.3}) = 1$, whereas the alternative of not knowing John Doe's number has a low possibility of being true, i.e., $Pos(a_{3.Tv}^{3.2} = a(Pws_{tv.1}) \mid PwsTv_{(C2.2.1).3.2.3}) = 0.43$. The fused Phone value under the nDO1.3.2 alternative differs from the first alternative because the third record, i.e., PO2(3,1,3), does not belong to the generated global entity of "John Doe". This global entity is generated from the {(RO3(1,1,3)): PO1(2,1,3) [1.0]} records only; hence, "622,222,222" is the only Phone value observed, with reliability scores of (0.8, 0.7). Accordingly, its updated reliability score as computed from Equation (9) is 1.0.
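The uncertainty case can be traced with the same hypothetical sketch by treating the unknown (NULL) value as one more candidate carrying its own reliability score. This reuse of the single-truth sketch is our simplification; the paper's ITCU-CWA handling follows its possible-worlds construction, but the resulting 0.7/0.3 split matches the numbers reported above.

```python
# Phone example of Figure 7, alternative nDO1.3.1:
# "622222222" reported by S1 (0.8) and S2 (0.7); the S3 record's phone is NULL (0.8)
prob, poss = itcc_cwa_fusion({"622222222": [0.8, 0.7], "NULL (unknown)": [0.8]})
print(prob)   # ≈ {'622222222': 0.7, 'NULL (unknown)': 0.3}
print(poss)   # possibility of the unknown alternative ≈ 0.43
```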
Based on the implementation of our fusion method within our developed probabilistic integration system and using a sample dataset, we showed how the data fusion method operates in different fusion cases and how it copes with the dynamic arrival of new information or evidence. We also conclude that, with the offered fusion method, a data value with a higher confidence score and higher cardinality is more likely to be the true value; this was observed by running varied examples covering the data conflict cases. The claim holds because independence among the participating sources is assumed, so positive evidence accumulates for data values that have high confidence scores and appear in multiple data sources. Moreover, by considering the possibility transformation, a better retrieval mechanism for the probable true data fusion answers can be achieved.
In terms of considering source accuracy, the probabilistic nature of entity linkage, and an on-demand fusion process, our data fusion approach is comparable to the approaches proposed in [26,32,48,73,76,82]. Nevertheless, several aspects distinguish it from these previous works. First, while our approach follows a quality-based strategy, it manages and resolves two major fusion cases, multi-true values (i.e., the multiple-truth assumption) and inconsistent-true values (i.e., the single-truth assumption), under both closed-world and open-world assumptions. Second, data fusion is processed over probabilistic entity linkage and multiple merging alternatives. Our approach can also support on-demand fusion and cope with dynamic and volatile conflicting and uncertain data by storing the reliability scores alongside the actual data values and by treating matching and computation as a chain of separate processes, each with its own input and output data [3]. Upon a user request, the fusion process for a selected attribute over a probabilistic global entity is initiated by matching the attribute's values from their corresponding sources and entities to form a global domain of merged attribute-value pairs, after which the computational fusion is executed separately, as sketched below.
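The following sketch is our illustrative rendering of that chained, on-demand process, not the system's actual API. It reuses the hypothetical itcc_cwa_fusion function introduced earlier; the record layout and function names are assumptions made only for this example.

```python
from collections import defaultdict

def build_attribute_domain(records, attribute):
    """Step 1: upon a user request, collect the selected attribute's values from
    the records linked to one probabilistic global entity, keeping each value
    together with the reliability scores of the sources that report it."""
    domain = defaultdict(list)
    for record in records:
        value = record["values"].get(attribute)
        if value is not None:
            domain[value].append(record["reliability"][attribute])
    return dict(domain)

def fuse_on_demand(records, attribute):
    """Step 2: run the computational fusion separately over the merged domain."""
    return itcc_cwa_fusion(build_attribute_domain(records, attribute))

# Hypothetical linked records for the "Peter Pan" global entity
records = [
    {"values": {"City": "Amsterdam"}, "reliability": {"City": 0.8}},
    {"values": {"City": "Enschede"},  "reliability": {"City": 0.8}},
    {"values": {"City": "Georgia"},   "reliability": {"City": 0.4}},
]
print(fuse_on_demand(records, "City"))
```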

7. Limitations

It is worth noting that entity integration with uncertainty management is hard in general and comes in many forms, owing to the variety of ways it can be defined and processed and the variety of uncertainty types that may appear; hence, no single solution addresses all challenges [19]. It does not seem possible to address the computational and representational challenges in full generality. Nevertheless, we can still study these challenges under specific formalizations, uncertainty cases, and scope.
Accordingly, our proposed model is constructed on a precise, mediated, and centralized schema structure; probabilistic schema integration and decentralized integration concerns are outside this paper's scope. Moreover, while the experiment demonstrated that our proposed process works in theory, the formulation needs to be applied in several real case scenarios, such as scientific collaborations or personal information, to determine its challenges, accuracy, and margin of error. Another limitation of our proposed model is that it ignores correlation among data sources: sources can copy from each other, and errors can propagate quickly, so ignoring possible dependencies among data sources can lead to biased decisions. A further limitation concerns the implementation of our proposed model, which uses a text-matching function only; it could be enhanced by adding a function to compare and match images, so that fused data values can be found for different forms of data.

8. Conclusions

This paper presented a new probabilistic data fusion model. It described a specific scenario of a probabilistic fusion problem and its solution space, in which several representational and computational challenges were identified and formulated. The problem scenario corresponds to the need to manage and resolve uncertain and conflicting data for multiple attribute values, where such values originate from probabilistic entity integration over heterogeneous and autonomous data sources. The proposed data fusion method was implemented within the probabilistic integration system to verify its efficiency and feasibility in resolving different data conflict cases. This implementation demonstrated the system's ability to manage data from a variety of sources in both static and dynamic settings. The method automates homo economicus decision making in selecting the most probable, true value [95]. While the experiment demonstrated that our proposed process works in theory, the formulation needs to be applied in several real case scenarios to determine its challenges, accuracy, and margin of error.
Several challenges related to the data integration and fusion problem under uncertainty management, data correlations, multiple correspondences, and probabilistic merging of correlated entities still need to be addressed and will continue to occupy the information integration community for a long time to come. Future work includes exploring our method within other data fusion strategies, such as capturing source and attribute correlations to identify positive and negative evidence while conditionally computing the probability of a value being true. It also includes implementing the remaining fusion cases and allocating a suitable benchmark for evaluation purposes. With emerging approaches to data fusion, the industry needs a standardized testing mechanism, which could also be explored; such a mechanism would assess the output quality of these approaches in the form of a success-rate index or margin of error for a given fusion process.

Author Contributions

Conceptualization, A.J. and A.D.; methodology, A.J.; software, A.J.; validation, A.J., F.S., O.A., and A.A.-A.; formal analysis, A.J.; investigation, A.J. and A.D.; resources, A.J.; data curation, A.J. and A.A.-A.; writing—original draft, A.J., F.S., O.A., A.A.-A., and Y.I.A.; writing—review and editing, A.J., F.S., A.D., and A.A.-A.; visualization, A.J. and A.A.-A.; supervision, A.D. and A.J.; project administration, A.J.; funding acquisition, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data accessed on 1 April 2022 from RIDDLE: Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty at https://www.cs.utexas.edu/users/ml/riddle/data.html (accessed on 1 April 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of Data Fusion Computation Formula

Consider three data sources $S_1$, $S_2$, and $S_3$. We denote by $A_{s1}^{w.j.k}$ the value of attribute $A_k$ for a particular $nDO_{w.j}$ entity-merging alternative as observed in $S_1$, by $A_{s2}^{w.j.k}$ the value of attribute $A_k$ for the same alternative as observed in $S_2$, and by $A_{s3}^{w.j.k}$ the value of attribute $A_k$ for the same alternative as observed in $S_3$. For instance, we may find $A.a_{k.g}^{w.j} = a_x$ in $S_1$ ($A_{s1} = a_x$), $A.a_{k.g}^{w.j} = a_z$ in $S_2$ ($A_{s2} = a_z$), and $A.a_{k.g}^{w.j} = a_x$ in $S_3$ ($A_{s3} = a_x$). For any number of reasons, the data in these sources may be wrong. Hence, $\mu_{a_x.T} = (\mu_{A_{s1}}, \mu_{A_{s3}})$, $\mu_{a_x.F} = (\neg\mu_{A_{s1}}, \neg\mu_{A_{s3}})$, $\mu_{a_z.T} = \mu_{A_{s2}}$, and $\mu_{a_z.F} = \neg\mu_{A_{s2}}$. We would like to determine the probability that a specific value (or set of values), which may or may not be a value observed from a data source, is indeed the true fused value of the attribute.
Depending on $Dom\,A_k^{w.j}$ and the set $PwsTv_{Case.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ observed over the sample space $\Omega_{w.j.k}$ of the given fusion case, the required probability can be expressed as:
$$\mu\!\left(a_{k.Tv}^{w.j} = a(Pws_{tv.p}) \,\middle|\, A_{s1}=a_x, A_{s2}=a_z, A_{s3}=a_x, PwsTv_{Case.w.j.k}\right) = \frac{\mu\!\left(A_{s1,3}=a_x, A_{s2}=a_z \,\middle|\, a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}\right)\cdot \mu\!\left(a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}\right)}{\sum_{p=1}^{P}\mu\!\left(A_{s1,3}=a_x, A_{s2}=a_z \,\middle|\, a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}\right)\cdot \mu\!\left(a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}\right)}$$
What is available?
Given the background information of the data fusion sub-cases and situations, we have:
  • $\mu(A_{s1,3}^{w.j.k} = a_x \mid a_{k.Tv} = a_x, PwsTv_{Case.w.j.k}) = (\mu_{A_{s1}}, \mu_{A_{s3}}) = (\mu_{a_x^{1}.T}, \mu_{a_x^{3}.T}) = \mu_{a_x^{1,3}.T}$
  • $\mu(A_{s1,3}^{w.j.k} = a_x \mid a_{k.Tv} \neq a_x, PwsTv_{Case.w.j.k}) = (\neg\mu_{A_{s1}}, \neg\mu_{A_{s3}}) = (\mu_{a_x^{1}.F}, \mu_{a_x^{3}.F}) = \mu_{a_x^{1,3}.F}$
  • $\mu(A_{s2}^{w.j.k} = a_z \mid a_{k.Tv} = a_z, PwsTv_{Case.w.j.k}) = \mu_{A_{s2}} = \mu_{a_z.T}$
  • $\mu(A_{s2}^{w.j.k} = a_z \mid a_{k.Tv} \neq a_z, PwsTv_{Case.w.j.k}) = \neg\mu_{A_{s2}} = \mu_{a_z.F}$
From Assumption 3 and within a data fusion case, we also have:
  • $\mu(a_{k.Tv}^{w.j} = a_x.T\backslash F, PwsTv_{Case.w.j.k}) = \mu(a_{k.Tv}^{w.j} = a_z.T\backslash F, PwsTv_{Case.w.j.k}) = \mu_{a_{k.Tv}} = 0.5$
Based on the above Bayes formula, the given reliability information about the attribute in the three sources, and Assumptions 2 and 3, the data fusion method presented in Equation (9) is derived as follows:
I. From Assumption 2,
$$\mu(A_{s1,3}=a_x, A_{s2}=a_z \mid a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}) = \mu(A_{s1,3}=a_x \mid a_{k.Tv}^{w.j}=a_x.T\backslash F, PwsTv_{Case.w.j.k}) \cdot \mu(A_{s2}=a_z \mid a_{k.Tv}^{w.j}=a_z.T\backslash F, PwsTv_{Case.w.j.k}) = \mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}$$
where $\mu_{a_x^{1,3}.T\backslash F}$ stands for either $\mu_{a_x^{1,3}.T}$ or $\mu_{a_x^{1,3}.F}$, with $\mu_{a_x^{1,3}.T} = (\mu_{a_x^{1}.T}, \mu_{a_x^{3}.T}) = \mu_{a_x^{1}.T}\cdot\mu_{a_x^{3}.T}$ and $\mu_{a_x^{1,3}.F} = (\mu_{a_x^{1}.F}, \mu_{a_x^{3}.F}) = \mu_{a_x^{1}.F}\cdot\mu_{a_x^{3}.F}$.
II. From Assumption 3,
$$\mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}) = \mu(a_{k.Tv}^{w.j}=a_x.T\backslash F, PwsTv_{Case.w.j.k})\cdot\mu(a_{k.Tv}=a_z.T\backslash F, PwsTv_{Case.w.j.k}) = \mu_{a_{k.Tv}}^{2} = 0.5^{2}$$
III. From both Assumptions 2 and 3, summing over the four combinations of $a_x$ and $a_z$ being true or false, we have:
$$\sum_{p=1}^{P}\mu(A_{s1,3}=a_x, A_{s2}=a_z \mid a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k})\cdot\mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}), PwsTv_{Case.w.j.k}) = \left(\mu_{a_x^{1,3}.T}\cdot\mu_{a_z.T} + \mu_{a_x^{1,3}.T}\cdot\mu_{a_z.F} + \mu_{a_x^{1,3}.F}\cdot\mu_{a_z.T} + \mu_{a_x^{1,3}.F}\cdot\mu_{a_z.F}\right)\cdot\mu_{a_{k.Tv}}^{2}$$
Substituting I, II, and III into the Bayes formula above, we obtain:
$$\mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}) \mid A_{s1}=a_x, A_{s2}=a_z, A_{s3}=a_x, PwsTv_{Case.w.j.k}) = \frac{\left(\mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}\cdot\mu_{a_{k.Tv}}^{2}\right)_{tv.p}}{\left(\mu_{a_x^{1,3}.T}\cdot\mu_{a_z.T}+\mu_{a_x^{1,3}.T}\cdot\mu_{a_z.F}+\mu_{a_x^{1,3}.F}\cdot\mu_{a_z.T}+\mu_{a_x^{1,3}.F}\cdot\mu_{a_z.F}\right)\mu_{a_{k.Tv}}^{2}} = \frac{\left(\mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}\right)_{tv.p}}{\sum_{p=1}^{P}\left(\mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}\right)_{tv.p}}$$
Based on Equations (6) and (8), we obtain:
$$\left(\mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}\right)_{tv.p} = \mu(Pws_{tv.p}) = \prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.p}$$
and
$$\sum_{p=1}^{P}\left(\mu_{a_x^{1,3}.T\backslash F}\cdot\mu_{a_z.T\backslash F}\right)_{tv.p} = \sum_{p=1}^{P}\mu(Pws_{tv.p}) = \prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.1} + \prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.2} + \dots + \prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.P}$$
Hence,
$$\mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}) \mid A_{s1}=a_x, A_{s2}=a_z, A_{s3}=a_x, PwsTv_{Case.w.j.k}) = \mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}) \mid PwsTv_{Case.w.j.k}) = \frac{\mu(Pws_{tv.p})}{\mu(Pws_{tv.1}) + \dots + \mu(Pws_{tv.P})} = \frac{\prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.p}}{\sum_{p=1}^{P}\prod_{g^{w.j}=1}^{q^{w.j}}\prod_{g^{ih}=1}^{q^{nm}}\left(\mu_{a_{k.g}^{w.j.ih}.T\backslash F}\right)_{tv.p}}$$
where
$$\sum_{p=1}^{P}\mu(a_{k.Tv}^{w.j}=a(Pws_{tv.p}) \mid PwsTv_{Case.w.j.k}) = 1, \quad \text{and in this example } P = 4.$$
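As a sanity check on the derivation, the short script below, written under assumed reliability scores (0.8 and 0.6 for $a_x$ in $S_1$ and $S_3$, and 0.7 for $a_z$ in $S_2$; these numbers are illustrative only), enumerates the four combinations of $a_x$ and $a_z$ being true or false and confirms that the uniform prior of 0.5 cancels out, so the posterior equals the normalized product of the per-world reliability terms, as in the closing formula above.

```python
from itertools import product

# Assumed reliability scores (illustrative only)
mu_ax = [0.8, 0.6]   # a_x as observed in S1 and S3
mu_az = [0.7]        # a_z as observed in S2
prior = 0.5          # Assumption 3: uniform prior on each value being true

def world_weight(ax_true, az_true):
    ax = 1.0
    for m in mu_ax:
        ax *= m if ax_true else 1.0 - m
    az = mu_az[0] if az_true else 1.0 - mu_az[0]
    return ax * az

worlds = {(ax, az): world_weight(ax, az) for ax, az in product([True, False], repeat=2)}

# Bayes posterior with the prior kept explicitly ...
posterior = {w: (v * prior**2) / sum(x * prior**2 for x in worlds.values())
             for w, v in worlds.items()}
# ... equals the normalized world weights, since the prior term cancels
normalized = {w: v / sum(worlds.values()) for w, v in worlds.items()}
assert all(abs(posterior[w] - normalized[w]) < 1e-12 for w in worlds)
print(normalized)   # probabilities of the P = 4 possible worlds, summing to 1
```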

References

  1. Almutairi, M.M.; Yamin, M.; Halikias, G. An Analysis of Data Integration Challenges from Heterogeneous Databases. In Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021; pp. 352–356. [Google Scholar]
  2. Aggoune, A. Intelligent data integration from heterogeneous relational databases containing incomplete and uncertain information. Intell. Data Anal. 2022, 26, 75–99. [Google Scholar] [CrossRef]
  3. Jaradat, A.; Halimeh, A.A.; Deraman, A.; Safieddine, F. A best-effort integration framework for imperfect information spaces. Int. J. Intell. Inf. Database Syst. 2018, 11, 296–314. [Google Scholar] [CrossRef]
  4. Beneventano, D.; Bergamaschi, S.; Gagliardelli, L.; Simonini, G. Entity resolution and data fusion: An integrated approach. In Proceedings of the SEBD 2019: 27th Italian Symposium on Advanced Database Systems, Grosseto, Italy, 16–19 June 2019. [Google Scholar]
  5. Sampri, A.; Geifman, N.; Le Sueur, H.; Doherty, P.; Couch, P.; Bruce, I.; Peek, N. Probabilistic Approaches to Overcome Content Heterogeneity in Data Integration: A Study Case in Systematic Lupus Erythematosus. Stud. Health Technol. Inform. 2020, 270, 387–391. [Google Scholar] [PubMed]
  6. Zhao, X.; Jia, Y.; Li, A.; Jiang, R.; Song, Y. Multi-source knowledge fusion: A survey. World Wide Web 2020, 23, 2567–2592. [Google Scholar] [CrossRef] [Green Version]
  7. Zhang, M.; Wang, H.; Li, J.; Gao, H. One-pass inconsistency detection algorithms for big data. IEEE Access 2019, 7, 22377–22394. [Google Scholar] [CrossRef]
  8. Bakhtouchi, A. Data reconciliation and fusion methods: A survey. Appl. Comput. Inform. 2020, 18, 182–194. [Google Scholar] [CrossRef]
  9. Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 31. [Google Scholar] [CrossRef] [Green Version]
  10. Papadakis, G.; Ioannou, E.; Palpanas, T. Entity resolution: Past, present and yet-to-come: From structured to heterogeneous, to crowd-sourced, to deep learned. In Proceedings of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, 30 March 2020. [Google Scholar]
  11. Munir, A.; Blasch, E.; Kwon, J.; Kong, J.; Aved, A. Artificial intelligence and data fusion at the edge. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 62–78. [Google Scholar] [CrossRef]
  12. Stonebraker, M.; Bruckner, D.; Ilyas, I.F.; Beskales, G.; Cherniack, M.; Zdonik, S.B.; Pagan, A.; Xu, S. Data Curation at Scale: The Data Tamer System. In Proceedings of the Cidr, Asilomar, CA, USA, 6–9 January 2013. [Google Scholar]
  13. Golshan, B.; Halevy, A.; Mihaila, G.; Tan, W.-C. Data integration: After the teenage years. In Proceedings of the Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Raleigh, CA, USA, 14–19 May 2017; pp. 101–106. [Google Scholar]
  14. De Sa, C.; Ratner, A.; Ré, C.; Shin, J.; Wang, F.; Wu, S.; Zhang, C. Deepdive: Declarative knowledge base construction. ACM SIGMOD Rec. 2016, 45, 60–67. [Google Scholar] [CrossRef]
  15. Stonebraker, M.; Ilyas, I.F. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 2018, 41, 3–9. [Google Scholar]
  16. Miller, R.J. Open data integration. Proc. VLDB Endow. 2018, 11, 2130–2139. [Google Scholar] [CrossRef]
  17. Lau, B.P.L.; Marakkalage, S.H.; Zhou, Y.; Hassan, N.U.; Yuen, C.; Zhang, M.; Tan, U.-X. A survey of data fusion in smart city applications. Inf. Fusion 2019, 52, 357–374. [Google Scholar] [CrossRef]
  18. Blanco, L.; Crescenzi, V.; Merialdo, P.; Papotti, P. Probabilistic models to reconcile complex data from inaccurate data sources. In Proceedings of the International Conference on Advanced Information Systems Engineering, Hammamet, Tunisia, 7–9 June 2010; pp. 83–97. [Google Scholar]
  19. Magnani, M.; Montesi, D. A survey on uncertainty management in data integration. J. Data Inf. Qual. (JDIQ) 2010, 2, 1–33. [Google Scholar] [CrossRef]
  20. Liu, Y.; Bao, T.; Sang, H.; Wei, Z. A Novel Method for Conflict Data Fusion Using an Improved Belief Divergence Measure in Dempster–Shafer Evidence Theory. Math. Probl. Eng. 2021, 2021, 6558843. [Google Scholar] [CrossRef]
  21. Yuan, Q.; Pi, Y.; Kou, L.; Zhang, F.; Li, Y.; Zhang, Z. Multi-source data processing and fusion method for power distribution internet of things based on edge intelligence. arXiv 2022, arXiv:2203.17230. [Google Scholar] [CrossRef]
  22. Barbedo, J.G.A. Data Fusion in Agriculture: Resolving Ambiguities and Closing Data Gaps. Sensors 2022, 22, 2285. [Google Scholar] [CrossRef] [PubMed]
  23. Dong, X.L.; Naumann, F. Data fusion: Resolving data conflicts for integration. Proc. VLDB Endow. 2009, 2, 1654–1655. [Google Scholar] [CrossRef] [Green Version]
  24. Dong, X.L.; Berti-Equille, L.; Srivastava, D. Data fusion: Resolving conflicts from multiple sources. In Handbook of Data Quality; Springer: Berlin/Heidelberg, Germany, 2013; pp. 293–318. [Google Scholar]
  25. Pochampally, R.; Das Sarma, A.; Dong, X.L.; Meliou, A.; Srivastava, D. Fusing data with correlations. In Proceedings of the Proceedings of the 2014 ACM SIGMOD International Conference on Management of data, Snowbird, UT, USA, 22–27 June 2014; pp. 433–444. [Google Scholar]
  26. Ioannou, E.; Nejdl, W.; Niederée, C.; Velegrakis, Y. LinkDB: A probabilistic linkage database system. In Proceedings of the Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, Snowbird, UT, USA, 12–16 June 2011; pp. 1307–1310. [Google Scholar]
  27. Wang, H.; Ding, X.; Li, J.; Gao, H. Rule-based entity resolution on database with hidden temporal information. IEEE Trans. Knowl. Data Eng. 2018, 30, 2199–2212. [Google Scholar] [CrossRef]
  28. Halevy, A.; Rajaraman, A.; Ordille, J. Data integration: The teenage years. In Proceedings of the Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 12–15 September 2006; pp. 9–16. [Google Scholar]
  29. Papadakis, G.; Ioannou, E.; Palpanas, T. Entity Resolution: Past, Present and Yet-to-Come. In Proceedings of the EDBT, Lisbon, Portugal, 26–29 March 2020; pp. 647–650. [Google Scholar]
  30. Li, L.; Wang, H.; Li, J.; Gao, H. A Survey of Uncertain Data Management. Front. Comput. Sci. 2020, 4, 162–190. [Google Scholar] [CrossRef]
  31. Dumpa, I.K.; Kota, R.S.; Sadri, F. Information Integration with Uncertainty: Performance. DBKDA 2014 2014, 15, 15. [Google Scholar]
  32. Sarma, A.D.; Dong, X.L.; Halevy, A.Y. Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping; Springer: Berlin/Heidelberg, Germany, 2011; pp. 75–108. [Google Scholar]
  33. Deng, D.; Fernandez, R.C.; Abedjan, Z.; Wang, S.; Stonebraker, M.; Elmagarmid, A.K.; Ilyas, I.F.; Madden, S.; Ouzzani, M.; Tang, N. The Data Civilizer System. In Proceedings of the Cidr, Chaminade, CA, USA, 8–11 January 2017. [Google Scholar]
  34. Bilke, A.; Bleiholder, J.; Böhm, C.; Draba, K.; Naumann, F.; Weis, M. Automatic Data Fusion with HumMer; Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II: Trondheim, Norway, 2005. [Google Scholar]
  35. Bleiholder, J.; Draba, K.; Naumann, F. FuSem-Exploring Different Semantics of Data Fusion. In Proceedings of the VLDB, Vienna, Austria, 23–27 September 2007; pp. 1350–1353. [Google Scholar]
  36. Mirza, A.; Siddiqi, I. Data level conflicts resolution for multi-sources heterogeneous databases. In Proceedings of the 2016 Sixth International Conference on Innovative Computing Technology (INTECH), Dublin, Ireland, 24–26 August 2016; pp. 36–40. [Google Scholar]
  37. Dong, X.L.; Berti-Equille, L.; Srivastava, D. Integrating conflicting data: The role of source dependence. Proc. VLDB Endow. 2009, 2, 550–561. [Google Scholar] [CrossRef] [Green Version]
  38. Ioannou, E.; Garofalakis, M. Query analytics over probabilistic databases with unmerged duplicates. IEEE Trans. Knowl. Data Eng. 2015, 27, 2245–2260. [Google Scholar] [CrossRef]
  39. Papadakis, G.; Ioannou, E.; Niederée, C.; Palpanas, T.; Nejdl, W. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proceedings of the Proceedings of the fifth ACM International Conference on Web Search and Data Mining, New York, NY, USA, 8–12 February 2012; pp. 53–62. [Google Scholar]
  40. Papadakis, G.; Ioannou, E.; Palpanas, T.; Niederée, C.; Nejdl, W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 2012, 25, 2665–2682. [Google Scholar] [CrossRef] [Green Version]
  41. Papadakis, G.; Koutrika, G.; Palpanas, T.; Nejdl, W. Meta-blocking: Taking entity resolutionto the next level. IEEE Trans. Knowl. Data Eng. 2013, 26, 1946–1960. [Google Scholar] [CrossRef]
  42. Papenbrock, T.; Heise, A.; Naumann, F. Progressive duplicate detection. IEEE Trans. Knowl. Data Eng. 2014, 27, 1316–1329. [Google Scholar] [CrossRef] [Green Version]
  43. Papadakis, G.; Svirsky, J.; Gal, A.; Palpanas, T. Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 2016, 9, 684–695. [Google Scholar] [CrossRef] [Green Version]
  44. Papadakis, G.; Tsekouras, L.; Thanos, E.; Giannakopoulos, G.; Palpanas, T.; Koubarakis, M. The return of jedai: End-to-end entity resolution for structured and semi-structured data. Proc. VLDB Endow. 2018, 11, 1950–1953. [Google Scholar] [CrossRef]
  45. Panse, F.; Naumann, F. Evaluation of Duplicate Detection Algorithms: From Quality Measures to Test Data Generation. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2373–2376. [Google Scholar]
  46. Panse, F.; Düjon, A.; Wingerath, W.; Wollmer, B. Generating Realistic Test Datasets for Duplicate Detection at Scale Using Historical Voter Data. In Proceedings of the EDBT, Nicosia, Cyprus, 23–26 March 2021; pp. 570–581. [Google Scholar]
  47. Vidal, M.-E.; Jozashoori, S.; Sakor, A. Semantic data integration techniques for transforming big biomedical data into actionable knowledge. In Proceedings of the 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 5–7 June 2019; pp. 563–566. [Google Scholar]
  48. Ayat, N.; Akbarinia, R.; Afsarmanesh, H.; Valduriez, P. Entity resolution for probabilistic data. Inf. Sci. 2014, 277, 492–511. [Google Scholar] [CrossRef] [Green Version]
  49. Motro, A. Imprecision and uncertainty in database systems. In Fuzziness in Database Management Systems; Springer: Berlin/Heidelberg, Germany, 1995; pp. 3–22. [Google Scholar]
  50. Clark, D.A. Verbal uncertainty expressions: A critical review of two decades of research. Curr. Psychol. 1990, 9, 203–235. [Google Scholar] [CrossRef]
  51. Smets, P. Imperfect information: Imprecision and uncertainty. In Uncertainty Management in Information Systems; Springer: Berlin/Heidelberg, Germany, 1997; pp. 225–254. [Google Scholar]
  52. Zimanyi, E.; Pirotte, A. Imperfect knowledge in relational databases. In Uncertainty Management in Information Systems; Motro, A., Smets, P., Eds.; Springer: Boston, MA, USA, 1997; pp. 35–87. [Google Scholar] [CrossRef]
  53. Suciu, D. Probabilistic databases for all. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Portland, OR, USA, 14–19 June 2020; pp. 19–31. [Google Scholar]
  54. Suciu, D.; Olteanu, D.; Ré, C.; Koch, C. Probabilistic Databases, Synthesis Lectures on Data Management; Morgan Claypool: San Rafael, CA, USA, 2011. [Google Scholar]
  55. Ceylan, I.I.; Darwiche, A.; Van den Broeck, G. Open-world probabilistic databases: Semantics, algorithms, complexity. Artif. Intell. 2021, 295, 103474. [Google Scholar] [CrossRef]
  56. Sarma, A.D.; Benjelloun, O.; Halevy, A.; Widom, J. Working models for uncertain data. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 3–7 April 2006; p. 7. [Google Scholar]
  57. Chen, R.; Mao, Y.; Kiringa, I. GRN model of probabilistic databases: Construction, transition and querying. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 291–302. [Google Scholar]
  58. Dalvi, N.; Suciu, D. Management of probabilistic data: Foundations and challenges. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 26–28 June 2007; pp. 1–12. [Google Scholar]
  59. Sen, P.; Deshpande, A.; Getoor, L. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 2009, 18, 1065–1090. [Google Scholar] [CrossRef]
  60. Mauritz, R.; Nijweide, F.; Goseling, J.; van Keulen, M. Autoencoder-Based Cleaning in Probabilistic Databases. ACM J. Data Inf. Qual 2021. Available online: https://ris.utwente.nl/ws/portalfiles/portal/256093655/arxiv_preprint_2106.09764.pdf (accessed on 26 September 2022).
  61. Antova, L.; Koch, C.; Olteanu, D. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. VLDB J. 2009, 18, 1021–1040. [Google Scholar] [CrossRef]
  62. Widom, J. Trio: A System for Integrated Management of Data, Accuracy, and Lineage; Stanford InfoLab: Stanford, CA, USA, 2004. [Google Scholar]
  63. Jampani, R.; Xu, F.; Wu, M.; Perez, L.L.; Jermaine, C.; Haas, P.J. Mcdb: A monte carlo approach to managing uncertain data. In Proceedings of the Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2008; pp. 687–700. [Google Scholar]
  64. De Keijzer, A.; Van Keulen, M. IMPrECISE: Good-is-good-enough data integration. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Washington, DC, USA, 7–12 April 2008; pp. 1548–1551. [Google Scholar]
  65. Van Keulen, M.; De Keijzer, A. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB J. 2009, 18, 1191–1217. [Google Scholar] [CrossRef]
  66. Grohe, M.; Lindner, P. Infinite probabilistic databases. arXiv 2020, arXiv:2011.14860. [Google Scholar] [CrossRef]
  67. Li, Y.; Li, Q.; Gao, J.; Su, L.; Zhao, B.; Fan, W.; Han, J. Conflicts to harmony: A framework for resolving conflicts in heterogeneous data by truth discovery. IEEE Trans. Knowl. Data Eng. 2016, 28, 1986–1999. [Google Scholar] [CrossRef]
  68. Xu, J.; Zadorozhny, V.; Grant, J. IncompFuse: A logical framework for historical information fusion with inaccurate data sources. J. Intell. Inf. Syst. 2020, 54, 463–481. [Google Scholar] [CrossRef]
  69. Panse, F.; Ritter, N. Relational data completeness in the presence of maybe-tuples. Ingénierie Systèmes D’information (2001) 2010, 15, 85–104. [Google Scholar] [CrossRef]
  70. Yong-Xin, Z.; Qing-Zhong, L.; Zhao-Hui, P. A novel method for data conflict resolution using multiple rules. Comput. Sci. Inf. Syst. 2013, 10, 215–235. [Google Scholar] [CrossRef]
  71. Cooper, R.; Devenny, L. A Database System for Absorbing Conflicting and Uncertain Information from Multiple Correspondents. In Proceedings of the British National Conference on Databases, Birmingham, UK, 7–9 July 2009; pp. 199–202. [Google Scholar]
  72. Dong, X.L.; Gabrilovich, E.; Heitz, G.; Horn, W.; Murphy, K.; Sun, S.; Zhang, W. From data fusion to knowledge fusion. arXiv 2015, arXiv:1503.00302. [Google Scholar] [CrossRef] [Green Version]
  73. Liu, X.; Dong, X.L.; Ooi, B.C.; Srivastava, D. Online data fusion. Proc. VLDB Endow. 2011, 4, 932–943. [Google Scholar] [CrossRef]
  74. Singh, Y.; Kaur, A.; Suri, B.; Singhal, S. Systematic Literature Review on Regression Test Prioritization Techniques. Informatica 2012, 36, 379–408. [Google Scholar]
  75. Zhang, L.; Xie, Y.; Xidao, L.; Zhang, X. Multi-source heterogeneous data fusion. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May 2018; pp. 47–51. [Google Scholar]
  76. Yang, Y.; Gu, L.; Zhu, X. Conflicts Resolving for Fusion of Multi-source Data. In Proceedings of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019; pp. 354–360. [Google Scholar]
  77. Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. (CSUR) 2009, 41, 1–41. [Google Scholar] [CrossRef]
  78. Yin, X.; Han, J.; Philip, S.Y. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 2008, 20, 796–808. [Google Scholar]
  79. Jiang, Z. Reconciling Continuous Attribute Values from Multiple Data Sources. PACIS 2008 Proc. 2008, 264. Available online: https://aisel.aisnet.org/pacis2008/264/ (accessed on 26 September 2022).
  80. Dellis, E.; Seeger, B. Efficient Computation of Reverse Skyline Queries. In Proceedings of the VLDB, Vienna, Austria, 16 February 2007; pp. 291–302. [Google Scholar]
  81. Slaney, J.; Paleo, B.W. Conflict resolution: A first-order resolution calculus with decision literals and conflict-driven clause learning. J. Autom. Reason. 2018, 60, 133–156. [Google Scholar] [CrossRef] [Green Version]
  82. Maunder, M.N.; Piner, K.R. Dealing with data conflicts in statistical inference of population assessment models that integrate information from multiple diverse data sets. Fish. Res. 2017, 192, 16–27. [Google Scholar] [CrossRef]
  83. Pasternack, J.; Roth, D. Making better informed trust decisions with generalized fact-finding. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011. [Google Scholar]
  84. Yin, X.; Tan, W. Semi-supervised truth discovery. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 217–226. [Google Scholar]
  85. Zhao, B.; Rubinstein, B.I.; Gemmell, J.; Han, J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 2012, 5, 550–561. [Google Scholar] [CrossRef] [Green Version]
  86. Galland, A.; Abiteboul, S.; Marian, A.; Senellart, P. Corroborating information from disagreeing views. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, NY, USA, 3–6 February 2010; pp. 131–140. [Google Scholar]
  87. Jaradat, A.; Deraman, A.; Idris, S.; Din, L.; Said, N. Pemodelan maklumat biodiversiti: Pendekatan objek digital informative. In Proceedings of the 6th ITB-UKM joint Seminar on Chemistry, Bali, Indonesia, 17–18 May 2005. [Google Scholar]
  88. Deraman, A.; Yahaya, J.; Salim, J.; Idris, S.; Jambari, D.I.; Komoo, A.J.I.; Leman, M.S.; Unjah, T.; Sarman, M.; Sian, L.C. The development of myGeo-RS: A knowledge management system of geodiversity data for tourism industries. Commun. IBIMA 2009, 8, 142–146. [Google Scholar]
  89. Peng, L. Research on Data Uncertainty and Lineage Through Trio. In Proceedings of the 2019 The World Symposium on Software Engineering, Wuhan, China, 20–23 September 2019; pp. 73–77. [Google Scholar]
  90. Roy, S. Uncertain Data Lineage. Encycl. Database Syst. 2018, 4280–4286. [Google Scholar] [CrossRef]
  91. Kimmig, A.; De Raedt, L. Probabilistic logic programs: Unifying program trace and possible world semantics. In Proceedings of the Workshop on Probabilistic Programming Semantics, Paris, France, 1 January 2017. [Google Scholar]
  92. Fan, W.; Geerts, F.; Tang, N.; Yu, W. Conflict resolution with data currency and consistency. J. Data Inf. Qual. (JDIQ) 2014, 5, 1–37. [Google Scholar] [CrossRef] [Green Version]
  93. Klir, G.J. Uncertainty and Information: Foundations of Generalized Information Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006. [Google Scholar]
  94. Kuicheu, N.C.; Wang, N.; Fanzou Tchuissang, G.N.; Xu, D.; Dai, G.; Siewe, F. Managing uncertain mediated schema and semantic mappings automatically in dataspace support platforms. Comput. Inform. 2013, 32, 175–202. [Google Scholar]
  95. Doucouliagos, C. A note on the evolution of homo economicus. J. Econ. Issues 1994, 28, 877–883. [Google Scholar] [CrossRef]
Figure 1. Data Integration level and its corresponding conflict resolution tasks.
Figure 2. The pair-wise-source-to-target matching and formulation process based on the best-effort data integration framework. (a) The participating iDO instances as observed from three structured data sources, $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (b) The pair-wise-source-to-target matching process for the three structured data sources $S_1.tp_1$, $S_2.tp_1$, and $S_3.tp_1$. (c) The pair-wise-source-to-target matching process for the three iDO instances according to their local instance ($pDO_{wx}^{ih}$) or reference instance ($rDO_{w}^{ih}$) formulations.
Figure 3. Possible world’s example with associated probability values.
Figure 4. Example of the probabilistic data fusion problem under the existence of multi-valued attributes. (a) The participated iDO instances. (b) The pair-wise probabilistic Linkages. (c) The probabilistic entities merging alternatives from 4b instances. (d) The populated data values for the global attributes of the participated instances in 4c.
Figure 5. City data fusion alternatives’ example with their updated reliability score based on the C2.1.1 case. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for City attribute at each possible entities merging alternative as produced based on Figure 5b.
Figure 6. The updated-City data fusion alternatives’ example is based on the newly added evidence from the fourth data source. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for City attribute at each possible entities merging alternative as produced based on Figure 6b.
Figure 7. Phone data fusion alternatives’ example with their updated reliability score based on the C2.2.1 case. (a) The participated entities with their original matching tree. (b) The probabilistic entities linkage/merging results. (c) The possible data fused values for Phone attribute at each possible entities merging alternative as produced based on Figure 7b.
Table 1. The data fusion cases are based on the CWA or the OWA assumptions.
Multi-True values Case (MTC): MTC-CWA, represented as $C_{1.1}$, under the CWA; MTC-OWA, represented as $C_{1.2}$, under the OWA.
Inconsistent True values Case (ITC), Contradiction sub-case (ITCC): ITCC-CWA, represented as $C_{2.1.1}$, under the CWA; ITCC-OWA, represented as $C_{2.1.2}$, under the OWA.
Inconsistent True values Case (ITC), Uncertainty sub-case (ITCU): ITCU-CWA, represented as $C_{2.2.1}$, under the CWA; ITCU-OWA, represented as $C_{2.2.2}$, under the OWA.
Table 2. Example of the probabilistic data fusion problem under the existence of multi-valued attributes.
Each row lists a merging alternative $nDO_{1.j}$ with its probability, the lineage of the participating instances, and the populated domains $Dom\,A_2^{1.j}$ and $Dom\,A_3^{1.j}$:
$nDO_{1.1} = ((rDO_1^{13}: pDO_{11}^{21}, pDO_{12}^{31})\ [0.8245])$; lineage $(13, 21, 31)$; $Dom\,A_2^{1.1} = \{(818/7621221_{2.1}^{(13,21)}, (0.9, 0.6))\}$; $Dom\,A_3^{1.1} = \{(12224\ \text{Ventura Blvd}_{3.1}^{(13,21)}, (0.9, 0.6)), (12335\ \text{Fienega Blvd}_{3.2}^{31}, 0.45)\}$.
$nDO_{1.2} = ((rDO_1^{13}: pDO_{11}^{21})\ [0.1455])$; lineage $(13, 21)$; $Dom\,A_2^{1.2} = \{(818/7621221_{2.1}^{(13,21)}, (0.9, 0.6))\}$; $Dom\,A_3^{1.2} = \{(12224\ \text{Ventura Blvd}_{3.1}^{(13,21)}, (0.9, 0.6))\}$.
$nDO_{1.3} = ((rDO_1^{13}: pDO_{12}^{31})\ [0.0255])$; lineage $(13, 31)$; $Dom\,A_2^{1.3} = \{(818/7621221_{2.1}^{(13)}, (0.9))\}$; $Dom\,A_3^{1.3} = \{(12224\ \text{Ventura Blvd}_{3.1}^{(13)}, 0.9), (12335\ \text{Fienega Blvd}_{3.2}^{31}, 0.45)\}$.
$nDO_{1.4} = ((rDO_1^{13}: -)\ [0.0045])$; lineage $(13)$; $Dom\,A_2^{1.4} = \{(818/7621221_{2.1}^{(13)}, (0.9))\}$; $Dom\,A_3^{1.4} = \{(12224\ \text{Ventura Blvd}_{3.1}^{(13)}, 0.9)\}$.
Table 3. An example for sample space production.
The sample space $\Omega_{1.1.3}$ is produced from $Dom\,A_3^{1.1}$ as follows:
Equation (4): $\{(a_{3.1}^{1.1}.T = 12224\ \text{Ventura Blvd}, \mu_{a_{3.1}^{1.1}.T} = (0.9, 0.6)), (a_{3.1}^{1.1}.F \neq 12224\ \text{Ventura Blvd}, \mu_{a_{3.1}^{1.1}.F} = (0.1, 0.4))\} \cdot \{(a_{3.2}^{1.1}.T = 12335\ \text{Fienega Blvd}, \mu_{a_{3.2}^{1.1}.T} = (0.45)), (a_{3.2}^{1.1}.F \neq 12335\ \text{Fienega Blvd}, \mu_{a_{3.2}^{1.1}.F} = (0.55))\}$.
Equation (5): $\Omega_{1.1.3} = 2^2 = 4 = \times_{g=1}^{2}\{(a_{3.1}^{1.1.(13,21)}.T, \mu_{a_{3.1}^{1.1.(13,21)}.T}), (a_{3.1}^{1.1.(13,21)}.F, \mu_{a_{3.1}^{1.1.(13,21)}.F})\} \times \{(a_{3.2}^{1.1.31}.T, \mu_{a_{3.2}^{1.1.31}.T}), (a_{3.2}^{1.1.31}.F, \mu_{a_{3.2}^{1.1.31}.F})\}$.
Equation (6): $\Omega_{1.1.3} = \{\Omega_{tv.1}, \Omega_{tv.2}, \Omega_{tv.3}, \Omega_{tv.4}\}$, where $\Omega_{tv.l} = (a(\Omega_{tv.l}), \mu(\Omega_{tv.l}))$:
  • $a(\Omega_{tv.1}) = (a_{3.1}^{1.1.(13,21)}.T, a_{3.2}^{1.1.31}.T) = (12224\ \text{Ventura Blvd}_{3.1}^{1.1.(13,21)}, 12335\ \text{Fienega Blvd}_{3.2}^{1.1.(31)})$, and $\mu(\Omega_{tv.1}) = (\mu_{a_{3.1}^{1.1.(13,21)}.T}, \mu_{a_{3.2}^{1.1.31}.T}) = (0.9, 0.6, 0.45)$
  • $a(\Omega_{tv.2}) = (a_{3.1}^{1.1.(13,21)}.T, a_{3.2}^{1.1.31}.F) = (12224\ \text{Ventura Blvd}_{3.1}^{1.1.(13,21)})$, and $\mu(\Omega_{tv.2}) = (\mu_{a_{3.1}^{1.1.(13,21)}.T}, \mu_{a_{3.2}^{1.1.31}.F}) = (0.9, 0.6, 0.55)$
  • $a(\Omega_{tv.3}) = (a_{3.1}^{1.1.(13,21)}.F, a_{3.2}^{1.1.31}.T) = (12335\ \text{Fienega Blvd}_{3.2}^{1.1.(31)})$, and $\mu(\Omega_{tv.3}) = (\mu_{a_{3.1}^{1.1.(13,21)}.F}, \mu_{a_{3.2}^{1.1.31}.T}) = (0.1, 0.4, 0.45)$
  • $a(\Omega_{tv.4}) = (a_{3.1}^{1.1.(13,21)}.F, a_{3.2}^{1.1.31}.F) = (\varnothing)$, and $\mu(\Omega_{tv.4}) = (\mu_{a_{3.1}^{1.1.(13,21)}.F}, \mu_{a_{3.2}^{1.1.31}.F}) = (0.1, 0.4, 0.55)$
Table 4. The possible-worlds generations given a data fusion case.
Each row gives, for a data fusion case, the possible-worlds set $PwsTv_{Case.w.j.k}$ generated from the sample space $\Omega_{w.j.k}$:
C1.2 (MTC-OWA): if $\Omega_{w.j.k}$ is observed under case $C_{1.2}$, all worlds are kept, i.e., $PwsTv_{C1.2.w.j.k} = \Omega_{w.j.k}$ with $Pws_{tv.p} \equiv \Omega_{tv.l}$ and $P = L$, so $PwsTv_{C1.2.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p} = \bigcup_{l=1}^{L} \Omega_{tv.l}$.
C1.1 (MTC-CWA): only the worlds $\Omega_{tv.l}$ in which at least one attribute value is true are kept, i.e., $|a_{k.g}^{w.j}.T|_{\Omega_{tv.l}} \geq 1$, giving $PwsTv_{C1.1.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ with $P < L$.
C2.1.1 (ITCC-CWA): only the worlds $\Omega_{tv.l}$ in which exactly one attribute value is true are kept, i.e., $|a_{k.g}^{w.j}.T|_{\Omega_{tv.l}} = 1$, giving $PwsTv_{C2.1.1.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ with $P < L$.
C2.1.2 (ITCC-OWA): only the worlds $\Omega_{tv.l}$ in which at most one attribute value is true are kept, giving $PwsTv_{C2.1.2.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ with $P < L$.
C2.2.1 (ITCU-CWA): only the worlds $\Omega_{tv.l}$ in which exactly one value is true, with the null (unknown) alternative $a_{k.g.Nl}^{w.j.ih}$ participating in its true or false form ($T\backslash F$), are kept, giving $PwsTv_{C2.2.1.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ with $P < L$.
C2.2.2 (ITCU-OWA): only the worlds $\Omega_{tv.l}$ in which at most one value is true, with the null alternative $a_{k.g.Nl}^{w.j.ih}$ participating in its true or false form ($T\backslash F$), are kept, giving $PwsTv_{C2.2.2.w.j.k} = \bigcup_{p=1}^{P} Pws_{tv.p}$ with $P < L$.
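To make the constructions in Tables 3 and 4 concrete, the sketch below (our illustration, written in Python; the function names and data layout are assumptions) enumerates the sample space for the two address alternatives of Table 3, weighting each world by the product of the corresponding reliability terms, and then applies the single-truth filter used by the C2.1.1 (ITCC-CWA) row of Table 4.

```python
from itertools import product

# Value alternatives of Dom A3^{1.1} with their reliability scores (Table 3)
alternatives = {
    "12224 Ventura Blvd": [0.9, 0.6],
    "12335 Fienega Blvd": [0.45],
}

def omega(alts):
    """Equations (4)-(6) sketch: cross-product of each alternative being true or false."""
    names = list(alts)
    worlds = []
    for flags in product([True, False], repeat=len(names)):
        weight = 1.0
        for name, is_true in zip(names, flags):
            for score in alts[name]:
                weight *= score if is_true else 1.0 - score
        worlds.append((tuple(n for n, f in zip(names, flags) if f), weight))
    return worlds

sample_space = omega(alternatives)   # Omega_{1.1.3}: 2^2 = 4 possible worlds
# C2.1.1 (ITCC-CWA) filter of Table 4: keep only worlds with exactly one true value
single_truth_worlds = [(values, w) for values, w in sample_space if len(values) == 1]
print(sample_space)
print(single_truth_worlds)
```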
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
