Open Access Article

Autonomous Data Association and Intelligent Information Discovery Based on Multimodal Fusion Technology

Wei Wang 1, Jingwen Li 1,2,*, Jianwu Jiang, Bo Wang 1, Qingyang Wang, Ertao Gao and Tao Yue

1 College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541004, China

2 Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin 541004, China

* Author to whom correspondence should be addressed.

Symmetry 2024, 16(1), 81; https://doi.org/10.3390/sym16010081

Submission received: 15 November 2023 / Revised: 16 December 2023 / Accepted: 4 January 2024 / Published: 8 January 2024

Abstract

The effective association of multimodal data is the basis of massive multi-source heterogeneous data sharing in the era of big data. How to realize data autonomous association between massive multimodal databases and the automatic intelligent screening of valuable information from associated data, so as to provide a reliable data source for artificial intelligence (AI), is an urgent problem to be solved. In this paper, a data autonomous association method based on the organizational structure of data cells is proposed, including transaction abstraction based on information nucleuses, symmetric and asymmetric data association based on strategies and data pipes, and information generation based on big data. To screen meaningful data associations, an information-driven intelligent information discovery method and a task-driven intelligent information discovery method are proposed. The former screens meaningful data associations by training the reward and punishment model to simulate the manual scoring of data associations. The latter is task-oriented and screens meaningful data associations by training the reward and punishment model to simulate the manual ranking of data associations related to the task requests. Through the above work, autonomous data association and intelligent information discovery are effectively realized based on multimodal fusion technology, which provides a novel data source mining approach using multimodal data sharing and intelligent information discovery.

Keywords:

multi-source heterogeneous data; multimodal fusion technology; autonomous data association; symmetric and asymmetric; intelligent information discovery; data sharing

1. Introduction

In this age of constant innovation, AI and automation are reshaping our lives, challenging us to think, work smarter, and enhance our creativity in ways we never have before [1]. What has constrained many of the major breakthroughs in AI is not a lack of advanced algorithms, but a lack of high-quality data sets [2]. Data is critical in developing state-of-the-art machine learning techniques. To ensure the proper operation of AI solutions, large amounts of high-quality data are needed to train the underlying neural networks [3]. A good example is multilingual natural language processing (NLP), which relies on the input of thousands of artificial sample data in various languages [4].

In recent years, the emergence and refinement of many data organization and management techniques have provided new ways to access complex data; representative examples are data virtualization methods [5,6,7] and intelligent data warehouse methods [8,9,10,11]. Data virtualization is a class of data access methods that separates the logical view from physical storage. Client data requests are automatically routed to multiple storage databases via the virtual layer, which compiles the returned data and feeds it back to users. However, constructing the virtual layer depends on many manual definitions, because it is difficult for computers to understand the contextual associations between data [12]. Data warehouses are central repositories for storing and managing large amounts of structured data; they integrate data from multiple data sources and store them uniformly in the warehouse layer. Classical intelligent data warehouse modeling methods include Entity-Relationship modeling [13], Dimensional Modeling [14], and Data Vault modeling [15], which enable trend analysis and fast complex querying over historical data, but these modeling methods are also manually predefined. Therefore, whether a method provides data processing logic, as data virtualization does, or stores complex data directly, as data warehouses do, the key step is to provide the logic of association between data [11]. However, as society evolves and massive databases emerge, the demand for accurate data for AI is also rising [16,17]. Ideally, an accurate data association could be created manually for each data requirement, but this is too time-consuming. Moreover, as new databases emerge over time, new complex data requirements constantly arise, and such a huge amount of work cannot be accomplished by manual labor alone. The association between data can generate various kinds of new information [18,19,20]. The problem to be solved is how to realize the autonomous association of extensive multimodal data and the intelligent selection of meaningful associations, so as to provide AI with abundant, high-quality complex data in real time and further enhance the application of AI [21,22].

The aim of this paper is to achieve autonomous association between multimodal data and to train a deep learning model that intelligently screens the association information. Compared with traditional data processing methods, the proposed methods should be more automatic, less constrained by data format, and better at automatic real-time updating. The intelligent knowledge discovery method proposed in this paper should not only provide high-quality data sources for AI in real time, but also provide theoretical support for data processing methods that currently rely on many manual definitions of multimodal data.

The main contributions of this paper are summarized as follows:

Propose a data autonomous association method for multi-source heterogeneous data based on the organizational structure of data cells. Initial data cells are constructed according to the definition of the organizational structure of data cells; the concepts of symmetric random association and asymmetric random association of data pipes are defined by analyzing the symmetric and asymmetric strategic associations of data. Information nucleuses of combined data cells containing independent association information are generated using multi-level association of data pipes. All information nucleuses of combined data cells are obtained to form the set of information nucleuses of combined data cells.
Propose an information-driven intelligent information discovery method. A reward and punishment model is constructed and the model is trained by the manual scoring data of information nucleuses of combined data cells. Ultimately, the trained model replaces the manual labor in intelligent scoring of information nucleuses and realizes the intelligent screening and storage of meaningful information nucleuses, which contain a large number of meaningful associated information.
Propose a task-driven intelligent information discovery method. Based on realistic task requests, the collection of subject headings related to the task request is obtained by the topic model of natural language understanding (NLU). Subject headings are used to search for information nucleuses of initial data cells to obtain the set of matched initial data cells. The data pipes of matched initial data cells are randomly associated to form combined data cells, from which the set of information nucleuses of the matched combined data cells related to task requests can be obtained. A reward and punishment model is constructed and the model is trained using the manually sorted data of information nucleuses of combined data cells related to task requests. The trained model achieves intelligent sorting of information nucleuses, realizing task-driven intelligent information discovery.

The remainder of this paper is mainly organized as follows: Section 2 introduces the related works of this paper; Section 3 introduces the methods of this paper, including the organizational structure of data cells, the data autonomous association method based on the organizational structure of data cells, the information-driven intelligent information discovery method, and the task-driven intelligent information discovery method; Section 4 gives examples of the proposed method; Section 5 discusses the contribution, innovation, application fields, and feasibility of the proposed method; Section 6 introduces the conclusion of this paper.

2. Related Works

The related works involved in this paper cover multi-source heterogeneous data fusion, data association and autonomous knowledge discovery, and data self-intelligence. This section analyzes the current state of research on each of these topics.

2.1. Multi-Source Heterogeneous Data Fusion

Multi-source heterogeneous data fusion is a process of cognition, synthesis, and judgment from a large number of multi-source heterogeneous data to obtain more comprehensive, accurate, and reliable information. The key to multi-source heterogeneous data fusion is to deal with the differences between different types of data, including text, pictures, audio, video, and mixed data [23].

Data fusion methods can be divided into three categories according to the fusion level [24]: the data-level methods, the feature-level methods, and the decision-level methods. The data-level fusion methods integrate the original multi-source heterogeneous data to form a new unified data set, which retains the original data information but relies on a lot of computing, so the real-time performance is poor. Liu et al. converted all data to the resource description framework (RDF) data format and input the converted data into the data fusion framework to realize information fusion [25]. Ji et al. proposed a BPR-Review-Score-Social (BRScS) data fusion model to fuse social relationships and user comments, which realized user merchant recommendations based on multi-source and heterogeneous information [26]. The feature-level fusion methods use feature fusion algorithms to generate new representative feature vectors by extracting all kinds of original data feature vectors. Compared with the data-level fusion methods, the feature-level fusion methods are more efficient because they simplify the original data, but the disadvantage is that feature extraction accompanies information loss. Liu et al. mapped the Mel-scale Frequency Cepstral Coefficient (MFCC) features of sound and the image features extracted using Convolutional Neural Networks (CNNs) using subspace, and retrieved them with Euclidean distance, thus realizing the solution of the cross-modal surface material retrieval of the hearing to the vision [27]. The decision-level fusion methods are also called post-fusion methods (decision-level fusion after classifier coding), which conduct coordination and joint decision by finding out the credibility of each mode. The commonly used methods are averaging, voting, weighting, adaptive enhancement, dynamic Bayesian network, and so on. For example, Meng et al. 
proposed a decision-level fusion method based on a binary classification model and evidence theory [28], which uses logistic regression and a Support Vector Machine (SVM) model to solve the binary classification problem and then applies evidence theory for decision fusion.

2.2. Data Association and Autonomous Knowledge Discovery

Knowledge discovery [29] is a process of automatically extracting novel and understandable potential patterns and trends from massive data, and it is an important field for discovering new knowledge via information organization. Data association is an important means of potential knowledge discovery via the integrated expression, management, and retrieval of data with different sources, structures, formats, and characteristics [30].

In recent years, scholars have put forward a series of innovative ideas on knowledge discovery based on data association. For example, Abdulahi Hasan et al. extract knowledge from large data sets using data mining methods based on clustering sampling and realize future prediction and decision analysis based on knowledge discovery [31]. With the rise and development of machine learning, the research of knowledge discovery based on data association has become more extensive. For example, Mollaei et al. realize knowledge discovery using association rules (ARs) and natural language processing (NLP) and effectively extract hidden patterns from different types of data sharing records [32]. The knowledge discovery of machine learning is becoming more and more popular, and knowledge graphs are widely used in knowledge discovery methods based on data association. Cao et al. use a knowledge graph to organize interdisciplinary knowledge and reveal potential knowledge relevance, providing strong support for interdisciplinary scientific research and innovation [20]. Deng et al. attempted the knowledge discovery of oral historical archive resources in the field of digital humanities, carried out multi-dimensional knowledge discovery research on the aspects of the overall overview of the project and the relationship between events and themes [33], and constructed a knowledge graph for experimental data sources, which formed a high-dimensional knowledge representation space which was closely related to the data examples of oral historical archives, which provided a new way to mine the hidden content of human resources and enriched the research system of autonomous knowledge discovery.

2.3. Data Self-Intelligence

The data self-intelligent methods are a series of intelligent methods that can learn and optimize data independently. Data intelligence uses technologies such as data mining and machine learning to extract valuable implication patterns from data [34] and has two characteristics: driven by big data and guided by the application scenario. The data self-intelligent methods use machine learning and data analysis technology to achieve automatic and intelligent data effective information acquisition [35], which is widely used in various fields.

At present, the data self-intelligent methods include knowledge-driven types and data-driven types [36]. Knowledge-driven-type methods use existing models, algorithms, and a series of prior knowledge to carry out multi-dimensional and interdisciplinary analyses of data on complex topics [37]. Zhang et al. use a knowledge graph to obtain the embedded representation of the data structure via the TransR model and obtain the hidden information in the data via the inner product calculation [38]. The data-driven-type methods guide decision-making via data information and patterns [39]. For example, Sarker put forward the viewpoint of “data science” from the point of view of data-driven and carried out data-driven intelligent computing and decision-making according to different application scenarios to provide advanced data analysis references for 10 practical application fields [40]. To integrate the advantages of two kinds of methods, scholars have built knowledge-driven and data-driven collaborative models [41,42,43,44,45,46]. Through the complementary advantages of the two paradigms, hybrid models are established to further realize data intelligence while following the scientific mechanism.

3. Methodology

This paper proposes methods of autonomous data association and intelligent information discovery for multi-source heterogeneous data (the flowchart is shown in Figure 1). (1) The organizational structure of data cells is defined for multi-source heterogeneous data. (2) Initial data cells are constructed according to this organizational structure: information nucleuses of initial data cells are built by extracting field names and key-value names from databases, or the names of the techniques used to parse other critical information; strategies are built from multiple predefined algorithms and scripts; and data pipes are built for the strategies. According to the symmetric and asymmetric random association concepts of data pipes, information nucleuses of combined data cells containing associated information are generated, and strategies and data pipes are then constructed to complete the combined data cells. (3) Meaningful information nucleuses are extracted in two ways. The first constructs a reward and punishment model and trains it using the manual scoring data of combined data cells; the trained model replaces manual scoring to screen information nucleuses intelligently, and the cloud brain stores the meaningful nucleuses to achieve information-driven intelligent information discovery. The second starts with a series of task requests. The collection of subject headings for a task request is obtained by the topic model of NLU, and the subject headings are used to search the information nucleuses of initial data cells to obtain the set of matched initial data cells; the data pipes of the matched initial data cells are randomly associated to form the set of information nucleuses of combined data cells. A reward and punishment model is constructed and trained using the manually sorted data of information nucleuses of combined data cells related to task requests. Finally, the trained model achieves intelligent sorting of information nucleuses for the given task request, and the cloud brain stores the optimal matching information nucleuses to realize task-driven intelligent information discovery.

3.1. Data Cells

This part introduces the organizational structure of data cells. Data cells achieve autonomous data association via independent cognitive mechanisms and data processing patterns (see Figure 2).

A data cell is composed of an information nucleus, a series of strategies, and data pipes that correspond one-to-one with the strategies. Data cells have spatial-temporal properties: the temporality is reflected in the birth, existence, and apoptosis of data cells, and the spatiality is reflected in their spatial location attribute. Data pipes, which depend on strategies, can be associated with each other to generate combined data cells, so data cells are hierarchical. The association of data pipes has a primary and a secondary side: the association request is sent by the primary data pipe and accepted by the secondary data pipe. Unassociated data cells are called initial data cells, and data cells formed by the association of multiple initial data cells are called combined data cells. A combined data cell composed of two initial data cells is called a binary combined data cell, and a combined data cell composed of $\lambda$ initial data cells is called a $\lambda$-tuple combined data cell. The structure of the initial data cell is shown in Figure 2a and the structure of the binary combined data cell is shown in Figure 2b. In Figure 2a, $c_{u}$ represents the information nucleus of the initial data cell $u$, $s_{u,i}$ represents the $i$-th strategy of $u$, and $oc_{u,i}$ represents the data pipe corresponding to $s_{u,i}$. The initial data cell $u$ can be expressed as:

$$u = \{c_{u}, S_{u}, O_{u}\} \tag{1}$$

$$S_{u} = \{s_{u,i}\}, \quad 1 \leq i \leq n \tag{2}$$

$$O_{u} = \{oc_{u,i}\}, \quad 1 \leq i \leq n \tag{3}$$

where $S_{u}$ represents the set of strategies and $O_{u}$ represents the set of data pipes. Figure 2b shows the structure of the binary combined data cell $(u, v)$, in which the information nucleus is generated by the association of the data pipe $oc_{u,i}$ of $u$ (the primary data pipe) and the data pipe $oc_{v,j}$ of $v$ (the secondary data pipe). For convenience of representation, a sketch of the binary combined data cell $(u, v)$ is shown in Figure 3.

Without loss of generality, the association of data pipes is expressed as:

$$r(oc_{u,i}^{*}, oc_{v,j}^{*}), \quad oc_{u,i}^{*} \in u, \; oc_{v,j}^{*} \in v \tag{4}$$

where $u$ and $v$ are data cells ($u$ and $v$ can be the same); $oc_{u,i}^{*}$ is the primary data pipe; $oc_{v,j}^{*}$ is the secondary data pipe; and $r(oc_{u,i}^{*}, oc_{v,j}^{*})$ represents the association of the data pipe $oc_{u,i}^{*}$ and the data pipe $oc_{v,j}^{*}$.

Since the combined data cells are composed of initial data cells, the elements of a data cell and the relationships among them are illustrated here using the initial data cell as an example, including the information nucleus, the strategies, and the data pipes:

Information nucleus: The information nucleus is the core constituent of an initial data cell and carries the intelligent implementation of the data. Strategies and data pipes exist on the basis of the information nucleus. An information nucleus does not correspond one-to-one to a thing in the real world; rather, it is an abstraction of transactions with the same characteristics. The information nucleus is embeddable and objective: embeddability refers to the semi-automatic or automatic association between the information nucleus and the database system; objectivity means that although information nucleuses do not correspond to the real world one by one, they all have practical significance.
Strategies: Strategies are a set of algorithms and scripts that define the method and process of realizing specific transactions based on the data and logic provided by the information nucleus. Strategies are divided into internal strategies and external strategies. Because data cells are hierarchical, as explained earlier, internal and external strategies are relative. For example, strategies within a combined data cell are internal strategies relative to that combined data cell and external strategies relative to the initial data cells that make it up.
Data pipes: Data pipes are channels through which the initial data cells transfer information and communicate. Data pipes are not independent but have a one-to-one relationship with strategies. The construction of the information nucleuses of combined data cells is based on the association of data pipes of initial data cells.
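The structure defined by Equations (1)–(3) can be sketched as a small data structure. The following is a minimal illustration only, not the authors' implementation: the class names `DataCell` and `DataPipe`, the method `add_strategy`, and the use of plain strings for nucleuses and strategies are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DataPipe:
    """Channel bound to exactly one strategy of one data cell (hypothetical type)."""
    nucleus: str   # information nucleus of the owning cell
    strategy: str  # name of the strategy this pipe corresponds to


@dataclass
class DataCell:
    """u = {c_u, S_u, O_u}: a nucleus, its strategies, and one pipe per strategy."""
    nucleus: str
    strategies: List[str] = field(default_factory=list)
    pipes: List[DataPipe] = field(default_factory=list)

    def add_strategy(self, name: str) -> DataPipe:
        # Adding a strategy always creates its pipe, so |S_u| == |O_u| holds.
        pipe = DataPipe(self.nucleus, name)
        self.strategies.append(name)
        self.pipes.append(pipe)
        return pipe


# Example: an initial data cell with the nucleus "weight" and two strategies.
weight = DataCell("weight")
weight.add_strategy("output")
add_pipe = weight.add_strategy("add")
```

Creating the pipe inside `add_strategy` is one simple way to enforce the one-to-one relationship between strategies and data pipes described above.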

3.2. Construction of Initial Data Cells

Initial data cells are directly linked to databases and their construction includes the information nucleus construction, strategy construction, and data pipe construction.

3.2.1. Information Nucleus Construction

Information nucleuses are common properties or characteristics of the data in a database; each is an abstraction of a class of transactions with the same characteristics (see Figure 4). Some databases have fields or key-values with specific meanings, such as table-type relational databases and document-type non-relational databases (Figure 4a,b). For these databases, the field names or key-value names are directly extracted as information nucleuses. However, some special databases, such as image-type or video-type databases, have no original specific attributes or descriptive information. For this kind of database, pre-processing technologies such as Image/Video Semantic Analysis and Understanding [47], Image/Video Description [48], Image/Video Comprehension [49], Video Caption [50], and Video Target Relationships Mining [51] are used to transform the original data into various types of implicit information that replace the original information, and the technical names are used as information nucleuses, indicating that the data are processed using the specified technology (Figure 4c,d). The information nucleuses of initial data cells and the field names/key-value names/technical names have an n-to-n relationship:

$$B : \{c_{1}, c_{2}, \dots, c_{\alpha}\} \leftarrow \{k_{1}, k_{2}, \dots, k_{\alpha}\} \tag{5}$$

where $\{k_{1}, k_{2}, \dots, k_{\alpha}\}$ represents the set of key-value names/field names/technical names, and $\{c_{1}, c_{2}, \dots, c_{\alpha}\}$ represents the set of information nucleuses.
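For the two database types that carry named fields, the extraction step of Equation (5) can be sketched as follows. This is an illustrative sketch under stated assumptions: an in-memory SQLite table stands in for a table-type relational database, a Python dictionary stands in for a document-type record, and the function names and the `person` table are hypothetical.

```python
import sqlite3
from typing import Any, Dict, List


def nucleuses_from_table(conn: sqlite3.Connection, table: str) -> List[str]:
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # The table name is interpolated directly, so it must come from trusted code.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


def nucleuses_from_document(doc: Dict[str, Any]) -> List[str]:
    # For a document-type record, the top-level key names act as nucleuses.
    return list(doc.keys())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, height REAL, weight REAL)")

table_nucleuses = nucleuses_from_table(conn, "person")
doc_nucleuses = nucleuses_from_document({"name": "x", "age": 30})
```

Image-type and video-type databases are not covered by this sketch, since their nucleuses are the names of the pre-processing technologies rather than anything read from a schema.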

3.2.2. Strategy Construction

Strategy construction includes underlying strategy construction, advanced strategy construction, and the combination strategy construction.

Underlying strategies include input and output strategies, data normalization strategies, mathematical operation strategies, logical operation strategies, and so on.

1. Input strategy and output strategy.

The input strategy and output strategy are the most basic and most frequently applied strategies in initial data cells, and are expressed as:

$$F_{I} = \mathit{Input}() \tag{6}$$

$$F_{O} = \mathit{Output}(\mathit{self}) \tag{7}$$

where the $\mathit{Input}()$ function requests external input, and the $\mathit{Output}(\mathit{self})$ function outputs the data associated with the information nucleus.

2. Data normalization strategies.

Data normalization strategies map the data to a unified dimension to keep the data on the same metric scale, which is the premise of scale-independent operations on data:

$$F_{D}(\mathit{obj}) = \mathit{dim}_{\mathit{self}}(\mathit{obj}) \tag{8}$$

where $\mathit{dim}_{\mathit{self}}(\mathit{obj})$ represents the function that maps $\mathit{self}$ to the dimension of $\mathit{obj}$, such as the input layer data normalization method, the hidden layer data normalization method, the stream data normalization method, the stream data input layer normalization method, and so on [52].

3. Mathematical operation strategies.

Mathematical operation strategies refer to the four operations of addition, subtraction, multiplication, and division:

$$F_{\mathit{Math}}(\mathit{obj}) = \mathit{self} \; M \; \mathit{obj} \tag{9}$$

where $M$ represents the “addition ($+$)”, “subtraction ($-$)”, “multiplication ($\times$)”, and “division ($/$)” operations of numerical values, the logical operations of the database, and so on.

4. Logical operation strategies.

Logical operation strategies include the AND, OR, and NOT operations on data:

$$F_{\mathit{Logic}}(\mathit{obj}) = \mathit{self} \; L \; \mathit{obj} \quad (\text{or } F_{\mathit{Logic}}(\mathit{obj}) = L \; \mathit{obj}) \tag{10}$$

where $L$ represents binary logical operations (e.g., the “AND”, “OR”, “XOR”, “XNOR”, “NAND”, and “NOR” operations) and unary logical operations (e.g., the “NOT” operation).
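The underlying strategies of Equations (7)–(10) can be illustrated with a few plain functions. This is a hedged sketch rather than the paper's implementation: the function names are hypothetical, only a subset of the logical operators is shown, and min-max rescaling is assumed here as one possible choice for the normalization strategy.

```python
import operator
from typing import List, Sequence

# Symbol tables for the binary operators M (Eq. 9) and L (Eq. 10).
MATH_OPS = {"+": operator.add, "-": operator.sub,
            "*": operator.mul, "/": operator.truediv}
LOGIC_OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}


def f_output(self_value):
    # Output strategy (Eq. 7): return the data attached to the nucleus.
    return self_value


def f_math(symbol: str, self_value: float, obj: float) -> float:
    # Mathematical operation strategy (Eq. 9): self M obj.
    return MATH_OPS[symbol](self_value, obj)


def f_logic(symbol: str, self_value: bool, obj: bool) -> bool:
    # Binary logical operation strategy (Eq. 10): self L obj.
    return LOGIC_OPS[symbol](self_value, obj)


def f_not(obj: bool) -> bool:
    # Unary logical operation strategy (Eq. 10, second form): L obj.
    return not obj


def f_normalize(self_values: Sequence[float],
                obj_values: Sequence[float]) -> List[float]:
    # Normalization strategy (Eq. 8), assumed here to be min-max rescaling
    # of self_values onto the value range spanned by obj_values.
    lo, hi = min(self_values), max(self_values)
    t_lo, t_hi = min(obj_values), max(obj_values)
    if hi == lo:
        return [t_lo for _ in self_values]
    return [t_lo + (v - lo) * (t_hi - t_lo) / (hi - lo) for v in self_values]
```

For example, `f_math("+", 2, 3)` adds two numeric values, and `f_normalize([0, 5, 10], [0.0, 1.0])` rescales the first sequence onto the unit interval.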

Advanced strategies refer to complex algorithms or scripts, which include association rule strategies, spatio-temporal relationship strategies, empirical model strategies, and so on.

1. Association rule strategies.

The association rule strategies are used to mine frequent items and the corresponding association rules in a series of transaction sets.

Suppose the itemset is $C = \{C_{1}, C_{2}, \dots, C_{m}\}$ and every transaction satisfies $t_{i} \subseteq C$. Given a transaction set $T = \{t_{i}\}, 1 \leq i \leq n$, an association rule is an implication $X \to Y$ in the transaction set $T$, which is characterized by its support and confidence:

$$\mathit{Support}(X \to Y) = \frac{p(X \cap Y)}{n} \tag{11}$$

where $X$ and $Y$ represent events, $p(X \cap Y)$ represents the number of transactions in set $T$ in which events $X$ and $Y$ occur simultaneously, $n$ represents the total number of transactions in set $T$, and $\mathit{Support}(X \to Y)$ represents the support of the event $X \to Y$.

$$\mathit{Confidence}(X \to Y) = \frac{p(X \cap Y)}{p(X)} \tag{12}$$

where $p(X)$ represents the number of transactions in set $T$ in which event $X$ occurs, and $\mathit{Confidence}(X \to Y)$ represents the confidence of the event $X \to Y$.

2. Spatio-temporal relationship strategies.

The data are spatial and temporal. Spatiality refers to the absolute or relative position of data. Temporality refers to the presentation of data in a time series. Integrating space and time takes into account how the transactions described by the data change over time, to realize spatio-temporal relationship mining, spatio-temporal data analysis, spatio-temporal evolution analysis, and so on.

$$\mathit{Space}, \mathit{Time} = \mathit{ST}(\mathit{self}) \tag{13}$$

where $\mathit{ST}(\mathit{self})$ represents the spatio-temporal relationship analysis of the data and returns the spatial and temporal attributes of the transaction, and $\mathit{Space}$ and $\mathit{Time}$ represent the spatial and temporal properties of the data, respectively.

3. Empirical model strategies.

Empirical model strategies, also known as “black box models” [53], are methods that do not analyze the actual process mechanism but instead perform mathematical statistical analysis on the actual data related to the process and predict the relationship between the parameters and variables according to the principle of minimum error.
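As a worked illustration of the association rule strategy, the support and confidence measures of Equations (11) and (12) can be computed from co-occurrence counts. The sketch below is an assumption-laden example, not the authors' code: events are modeled as item sets, a rule's joint occurrence means that all items of both sides appear in a transaction, and the transaction data and function names are invented for the example.

```python
from typing import Iterable, List, Set


def count(transactions: List[Set[str]], items: Iterable[str]) -> int:
    # Number of transactions that contain every item in `items`.
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    # Eq. (11): joint occurrences of X and Y divided by the n transactions.
    return count(transactions, x | y) / len(transactions)


def confidence(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    # Eq. (12): joint occurrences of X and Y divided by occurrences of X.
    occurrences_x = count(transactions, x)
    if occurrences_x == 0:
        return 0.0
    return count(transactions, x | y) / occurrences_x


# Hypothetical transaction set T with n = 4.
T = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]

s = support(T, {"bread"}, {"milk"})
c = confidence(T, {"bread"}, {"milk"})
```

Here bread and milk appear together in two of four transactions, and bread appears in three, so the rule "bread → milk" has support 2/4 and confidence 2/3.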

Combination strategies are nested calls to the strategies introduced in this section to achieve complex calculations and analyses:

$$F_{\mathit{combination}}(*) = F_{1}(F_{2}(*)) \tag{14}$$

where $F_{1}$ and $F_{2}$ represent any strategies described in this subsection.
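Equation (14) is ordinary function composition, which can be sketched in a few lines. The helper `combine` and the two example strategies below are hypothetical names chosen for illustration.

```python
from typing import Callable


def combine(f1: Callable, f2: Callable) -> Callable:
    # F_combination(*) = F_1(F_2(*)): apply f2 first, then f1 to its result.
    return lambda *args: f1(f2(*args))


def add(self_value: float, obj: float) -> float:
    # A mathematical operation strategy: self + obj.
    return self_value + obj


def halve(value: float) -> float:
    # A second strategy applied to the result of the first.
    return value / 2


# Nesting "add" inside "halve" yields the mean of two values.
mean_of_two = combine(halve, add)
result = mean_of_two(4, 6)
```

Because `combine` returns a new callable, combined strategies can themselves be nested again to build deeper compositions.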

3.2.3. Data Pipe Construction

Data pipes are constructed in one-to-one correspondence with strategies. Data pipes are meaningless on their own and generate new information by associating with each other. The details of the association of data pipes are shown later.

3.3. Autonomous Association of Multi-Source Heterogeneous Data Based on the Organizational Structure of Data Cells

This section introduces the autonomous association of multi-source heterogeneous data based on the organizational structure of data cells. Data autonomous association is the process of multilevel random association of data pipes and continually generating high-level combined information nucleuses of data cells containing new information to construct high-level combined data cells. This section introduces the mode of the random association of data pipes of initial data cells and elicits the general rules of random association of data pipes and the construction of combined data cells.

Using the construction method of initial data cells introduced in Section 3.2, field names, key-value names, or technical names of multimodal databases are extracted to realize the construction of information nucleuses of initial data cells, followed by the construction of strategies and data pipes of initial data cells to realize the construction of initial data cells. All data pipes of initial data cells form the set of data pipes of initial data cells:

$$OC = \{oc_{(1)}, oc_{(2)}, \dots, oc_{(a)}, \dots, oc_{(l)}\} \tag{15}$$

where $oc_{(a)}$ represents a data pipe of initial data cells, and $OC$ represents the set of data pipes of initial data cells. Suppose $oc_{(a)}$ corresponds to the strategy $s_{(a)}$ and $s_{(a)}$ corresponds to the information nucleus $c_{(a)}$. The random association of data pipes means that each data pipe $oc_{(p)} \in OC$ acts as the primary data pipe, and each data pipe $oc_{(q)} \in OC$ ($oc_{(p)}$ can be the same as $oc_{(q)}$) is associated as the secondary data pipe:

$$\{r(oc_{(p)}, oc_{(q)}) \mid \forall\, oc_{(p)} \in OC \text{ and } \forall\, oc_{(q)} \in OC\} \tag{16}$$

where $oc_{(p)}$ and $oc_{(q)}$ each represent any data pipe in $OC$; $r(oc_{(p)}, oc_{(q)})$ indicates the association result of $oc_{(p)}$ and $oc_{(q)}$, which is a unified expression formed by the sequential combination or connection of $c_{(p)}$, $s_{(p)}$, $c_{(q)}$, and $s_{(q)}$. Figure 5 is an example showing the association of the data pipe of “add” from the initial data cell with the information nucleus “weight” and the data pipe of “output” from the initial data cell with the information nucleus “height”; the result of the association is the unified expression “weight + height”.
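The enumeration in Equation (16) amounts to taking every ordered pair of data pipes, including a pipe paired with itself. The sketch below illustrates this under simplifying assumptions that are not from the paper: each pipe is represented as a (nucleus, strategy symbol) tuple, the output strategy contributes an empty symbol, and the unified expression is formed by joining the four parts in order.

```python
from itertools import product
from typing import Dict, Tuple

Pipe = Tuple[str, str]  # (information nucleus, strategy symbol)

# Hypothetical pipe set OC: an "add" pipe and an "output" pipe per nucleus.
OC = [("weight", "+"), ("weight", ""), ("height", "+"), ("height", "")]


def associate(primary: Pipe, secondary: Pipe) -> str:
    # r(oc_p, oc_q): concatenate c_p, s_p, c_q, s_q in order, skipping the
    # empty symbol contributed by the output strategy.
    c_p, s_p = primary
    c_q, s_q = secondary
    return " ".join(part for part in (c_p, s_p, c_q, s_q) if part)


# Every ordered (primary, secondary) pair, as in Eq. (16).
associations: Dict[Tuple[Pipe, Pipe], str] = {
    (p, q): associate(p, q) for p, q in product(OC, repeat=2)
}

figure5 = associations[(("weight", "+"), ("height", ""))]
swapped = associations[(("height", ""), ("weight", "+"))]
```

With four pipes, the enumeration yields sixteen ordered pairs. The pair from Figure 5 gives "weight + height", while exchanging the primary and secondary roles gives "height weight +", which shows why the order of association matters.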

After the random association of data pipes, three types of unified expressions are produced: Expressions conform to arithmetic rules and have practical significance; expressions fail to conform to arithmetic rules; expressions conform to arithmetic rules but have no practical significance. The following example explains the three types of unified expressions:

Suppose the data pipe $oc_{(p)} \in c_{u}$, where the information nucleus $c_{u}$ is “the score of subject A (numerical type)”, and the data pipe $oc_{(q)} \in c_{v}$, where the information nucleus $c_{v}$ is “the score of subject B (numerical type)”:

If the data pipe $oc_{(p)} \in c_{u}$ is the logical addition operation strategy $F_{Plus}(obj) = self + obj$ and the data pipe $oc_{(q)} \in c_{v}$ is the output strategy $F_{O} = Output(self)$, the association of $oc_{(p)}$ with $oc_{(q)}$ produces the unified expression $r(oc_{(p)}, oc_{(q)})$: “subject A score + subject B score”, which is a “numeric A + numeric B” type operation and means the sum of the subject A score and the subject B score. The unified expression conforms to arithmetic rules and has practical significance.

If the data pipe $oc_{(p)} \in c_{u}$ is the output strategy $F_{O} = Output(self)$ and the data pipe $oc_{(q)} \in c_{v}$ is the logical addition operation strategy $F_{Plus}(obj) = self + obj$, the association of $oc_{(p)}$ with $oc_{(q)}$ produces the unified expression $r(oc_{(p)}, oc_{(q)})$: “subject A score subject B score +”. The unified expression does not conform to arithmetic rules.

If the data pipe $oc_{(p)} \in c_{u}$ is the multiplication strategy $F_{Multiply}(obj) = self \times obj$ and the data pipe $oc_{(q)} \in c_{v}$ is the output strategy $F_{O} = Output(self)$, the association of $oc_{(p)}$ with $oc_{(q)}$ produces the unified expression $r(oc_{(p)}, oc_{(q)})$: “subject A score × subject B score”, which is a “numeric A × numeric B” type operation and means the product of the subject A score and the subject B score. The unified expression conforms to arithmetic rules but has no practical significance.
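The second case, an expression that fails to conform to arithmetic rules, can be detected mechanically. The sketch below assumes a hypothetical token representation and a hypothetical `OPERATORS` set, and checks only one predefined rule (operands and binary operators must alternate in infix order). It cannot separate the first and third cases, since practical significance is a subjective judgment.

```python
# Hypothetical set of binary operator tokens produced by strategies.
OPERATORS = {"+", "-", "×", "÷"}


def conforms_to_arithmetic_rules(tokens):
    """Return True if tokens form operand (operator operand)* in infix order."""
    if len(tokens) % 2 == 0:
        # Infix expressions over binary operators have an odd token count.
        return False
    for i, tok in enumerate(tokens):
        is_operator = tok in OPERATORS
        # Even positions must hold operands, odd positions operators.
        if (i % 2 == 0) == is_operator:
            return False
    return True


case_1 = ["subject A score", "+", "subject B score"]
case_2 = ["subject A score", "subject B score", "+"]
case_3 = ["subject A score", "×", "subject B score"]
print([conforms_to_arithmetic_rules(c) for c in (case_1, case_2, case_3)])
```

Cases 1 and 3 pass the check and case 2 is rejected; distinguishing case 1 from case 3 is left to the scoring described in Section 3.4.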

The generation of binary combined data cells is based on the random association of data pipes of initial data cells. Among these random associations of data pipes, some are symmetric while others are asymmetric. An association is symmetric if exchanging the primary and secondary data pipes yields an association that is identical to, or has the same meaning as, the original association. Some examples are shown in Figure 6, where Figure 6a,b are symmetric associations. The association $A + A$ shown in Figure 6a is unchanged when the primary and secondary data pipes are exchanged. For Figure 6b, because $A \,\&\, B = B \,\&\, A$, the association $A \,\&\, B$ is also symmetric. On the contrary, Figure 6c,d are asymmetric associations because the meanings of $A - B$ and $B - A$ are different, and if the variables $A$ and $B$ are exchanged in the confidence calculation, the meaning will be changed.

Nucleuses of binary combined data cells are the unified expressions that are produced by the random association of data pipes of initial data cells and that conform to arithmetic rules. To prevent redundancy, when an association of data pipes is symmetric, either one of the two symmetric expressions is selected to construct the nucleus of the binary combined data cell; when the association is asymmetric, both expressions are selected. The construction of external strategies and pipes of binary combined data cells is the same as that of initial data cells. The information nucleus of an N-tuple combined data cell (N > 2) is formed by the random association of the data pipes of (N−1)-tuple data cells with the data pipes of initial data cells, and its external strategies and data pipes follow the same construction principle as those of initial data cells.
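The redundancy rule for symmetric associations can be illustrated as follows. The sketch assumes that symmetry is decided by a manually predefined table of commutative operators (the hypothetical `COMMUTATIVE` set) and that each expression is available as a `(left, operator, right)` triple; both are assumptions made for illustration.

```python
# Hypothetical, manually predefined set of operators whose associations
# are symmetric (exchanging the operands preserves the meaning).
COMMUTATIVE = {"+", "×", "&"}


def select_nucleuses(expressions):
    """Keep one of each symmetric pair and both of each asymmetric pair.

    expressions: iterable of (left, operator, right) triples that already
    conform to arithmetic rules.
    """
    seen = set()
    selected = []
    for left, op, right in expressions:
        if op in COMMUTATIVE:
            key = (op, frozenset((left, right)))  # operand order is ignored
        else:
            key = (op, left, right)               # operand order matters
        if key not in seen:
            seen.add(key)
            selected.append((left, op, right))
    return selected


candidates = [
    ("A", "+", "B"), ("B", "+", "A"),  # symmetric pair: one is kept
    ("A", "-", "B"), ("B", "-", "A"),  # asymmetric pair: both are kept
    ("A", "+", "A"),                   # unchanged under exchange
]
print(select_nucleuses(candidates))
```

The first expression encountered in each symmetric pair is kept, which is one arbitrary way of choosing "either" expression.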

3.4. Information-Driven Intelligent Information Discovery

The random association of data pipes of data cells generates a variety of information nucleuses of combined data cells that conform to arithmetic rules and contain new information. Determining whether such an information nucleus has practical significance, so as to screen meaningful data associations, is difficult because the evaluation is subjective and requires manual scoring by relevant experts. Manually scoring massive numbers of information nucleuses is time-consuming and labor-intensive, so an automatic discovery method is needed to realize their intelligent screening. This section describes an information-driven intelligent information discovery method (see Figure 7). By introducing the reward and punishment model and training it to simulate the manual scoring of information nucleuses, the intelligent scoring and storage of information nucleuses can be realized.

The reward and punishment model guides the behavior of the intelligent system by interacting with manually produced prior data. Through these interactions, the intelligent system gradually learns to simulate human selection and finally realizes independent decision-making. The key to the reward and punishment model is to construct a loss function that describes the difference between the model-estimated values and the manual scores. The reward and punishment model is trained on many manually scored samples to realize the intelligent scoring of information nucleuses of combined data cells. According to actual needs, a scoring threshold can be set to realize the automatic screening of meaningful information nucleuses. The training data format is {information nucleus, score of the information nucleus}. The scoring data of information nucleuses are generated by constructing a subjective scoring standard and scoring manually (see Table 1).

The subjective scoring standard shown in Table 1 divides the practical significance of information nucleuses into five grades, with corresponding scores of 1–5. Assuming that the initial score of any nucleus is 3, the reward and punishment model is fitted by maximizing the likelihood function so that the model's scores are as close as possible to the human scores. This means that the scoring ability can be improved by minimizing the loss function. The loss function is expressed as:

$$Loss(\theta) = -\epsilon \log\left(\sigma\left(\left|r_{\theta}(x) - r_{\theta}^{*}(x)\right|\right)\right) \tag{17}$$

where $Loss(\theta)$ is the loss function of the reward and punishment model; $x$ represents information nucleuses of combined data cells; $r_{\theta}(x)$ represents the manual score of $x$; $r_{\theta}^{*}(x)$ represents the model-estimated score of $x$; $\sigma(\cdot)$ represents the sigmoid function; $\log(\cdot)$ represents the logarithmic operation; and $-\epsilon$ represents the negative constant. The maximization of the difference is achieved by performing a negative logarithmic operation on the result of the sigmoid function.
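To make the terms of Equation (17) concrete, the following sketch evaluates the expression literally for a single information nucleus. It assumes that $\epsilon$ is a positive constant (set to 1 by default here); the function names are hypothetical, and no training loop or model architecture is implied.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def scoring_loss(manual_score: float, model_score: float,
                 epsilon: float = 1.0) -> float:
    """Literal evaluation of Eq. (17) for one information nucleus.

    manual_score: r_theta(x) in the paper's notation
    model_score:  r_theta^*(x) in the paper's notation
    epsilon:      assumed positive constant
    """
    gap = abs(manual_score - model_score)
    return -epsilon * math.log(sigmoid(gap))


print(scoring_loss(3.0, 3.0))  # equal scores
print(scoring_loss(5.0, 3.0))  # scores two grades apart
```

Evaluated this way, the expression equals $\epsilon \ln 2$ when the two scores coincide and shrinks toward 0 as the absolute gap grows, so an implementation should check the sign convention of its optimizer against the intended objective.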

The cloud brain is used to store meaningful information nucleuses of combined data cells to facilitate rapid external calls. The stored content includes the meaningful information nucleuses of combined data cells (the unified expression bodies), a database list of the field names, key-value names, or technical names in each expression body, and the scores of the meaningful nucleuses. This storage method allows external requests to be served quickly according to the database list and the calling logic stored in the information nucleus. In addition, the cloud brain stores the call records of meaningful information nucleuses, including the time of each call and the returned results. By comparing the results returned by the same call at different times, historical comparative analysis of data can be realized.
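Since Section 5 describes the cloud brain as essentially a relational database, its storage can be sketched with Python's built-in `sqlite3` module. The table names, column names, and sample values below are hypothetical and serve only to show the three kinds of stored content (expression body and score, database list, and call records).

```python
import sqlite3

# Hypothetical schema: one row per meaningful nucleus, its source fields,
# and a log of calls with their returned results.
SCHEMA = """
CREATE TABLE nucleus (
    id INTEGER PRIMARY KEY,
    expression TEXT NOT NULL,
    score REAL NOT NULL
);
CREATE TABLE nucleus_source (
    nucleus_id INTEGER NOT NULL REFERENCES nucleus(id),
    database_name TEXT NOT NULL,
    field_name TEXT NOT NULL
);
CREATE TABLE call_record (
    nucleus_id INTEGER NOT NULL REFERENCES nucleus(id),
    called_at TEXT NOT NULL,
    result TEXT NOT NULL
);
"""


def open_cloud_brain():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_nucleus(conn, expression, score, sources):
    """Store an expression body, its score, and its (database, field) list."""
    cur = conn.execute(
        "INSERT INTO nucleus (expression, score) VALUES (?, ?)",
        (expression, score),
    )
    nucleus_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO nucleus_source VALUES (?, ?, ?)",
        [(nucleus_id, db, field) for db, field in sources],
    )
    return nucleus_id


def record_call(conn, nucleus_id, called_at, result):
    """Log the time and returned result of one call."""
    conn.execute(
        "INSERT INTO call_record VALUES (?, ?, ?)",
        (nucleus_id, called_at, result),
    )


def call_history(conn, nucleus_id):
    """Return (time, result) pairs in chronological order for comparison."""
    return conn.execute(
        "SELECT called_at, result FROM call_record "
        "WHERE nucleus_id = ? ORDER BY called_at",
        (nucleus_id,),
    ).fetchall()


conn = open_cloud_brain()
# Illustrative values only; not taken from the paper's tables.
nid = store_nucleus(conn, "II + TI", 4.5,
                    [("culture_db", "II"), ("tourism_db", "TI")])
record_call(conn, nid, "2024-01-02", "120")
record_call(conn, nid, "2024-01-01", "100")
print(call_history(conn, nid))
```

Ordering the call records by time is what makes the historical comparison of results for the same nucleus straightforward.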

3.5. Task-Driven Intelligent Information Discovery

This section introduces the task-driven intelligent information discovery method, which takes realistic task requests as input. Using the NLU topic model [54], a task request is decomposed into a collection of subject headings, which are used to retrieve information nucleuses of initial data cells [55] and obtain the set of matched initial data cells. The set of information nucleuses of combined data cells is then generated by the random association of data pipes of the matched initial data cells. By introducing the reward and punishment model and training it, in combination with the task request, to simulate the manual sorting of information nucleuses, the intelligent sorting and storage of information nucleuses of combined data cells is achieved (see Figure 8).

The realistic task request is decomposed by the NLU topic model into the set of subject headings, following the principles of non-repetition, non-omission, and ease of recognition:

$$Request \to SH = \{sh_{1}, sh_{2}, \dots, sh_{i}, \dots, sh_{k}\} \tag{18}$$

where $Request$ represents the task request; $sh_{i}$ represents a subject heading; and $SH$ represents the set of subject headings.

By imitating the principle of search engines [56], subject headings are retrieved from the set of information nucleuses of initial data cells, using techniques such as synonym substitution and subject heading conversion, and the set of matched initial data cells is returned. Data pipes of the matched initial data cells are randomly associated (same as Section 3.2.2) to obtain the set of information nucleuses of combined data cells related to the task request. After removing information nucleuses that do not conform to the operation rules, the reward and punishment model is trained to realize the intelligent sorting of the information nucleuses of combined data cells in the set and thus obtain the optimal information nucleus. Unlike the information-driven intelligent information discovery method, the task-driven method does not use direct scoring; it sorts the set of nucleuses of combined data cells according to the task request and takes {task request, list of information nucleuses of combined data cells} as the training data format. To bring the model's sorting as close as possible to human sorting for task requests, the reward and punishment model is fitted by maximizing the likelihood. This means that the sorting ability can be improved by minimizing the loss function. The loss function is expressed as:

$$Loss(\theta) = -\epsilon\, E_{(Request, c_{\beta}, c_{\varphi}) \sim D}\left[\log\left(\sigma\left(Rank_{\theta}(Request, c_{\beta}) - Rank_{\theta}(Request, c_{\varphi})\right)\right)\right] \tag{19}$$

where $Loss(\theta)$ is the loss function of the reward and punishment model; $Request$ represents the task request; $D$ represents the sorting list of information nucleuses for the task request; $c_{\beta}$ represents the information nucleuses of combined data cells sorted before $c_{\varphi}$ in the sorting list; $Rank_{\theta}(Request, c_{\beta})$ and $Rank_{\theta}(Request, c_{\varphi})$, respectively, represent the ranks of $c_{\beta}$ and $c_{\varphi}$ in the sorting list; $\sigma(\cdot)$ represents the sigmoid function; $\log(\cdot)$ represents the logarithmic operation; and $-\epsilon$ represents the negative constant. The maximization of the difference is achieved by performing a negative logarithmic operation on the result of the sigmoid function.
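Equation (19) has the form of a pairwise ranking loss, and a small numerical sketch may help in reading it. The sketch assumes that $Rank_{\theta}$ returns a real-valued model output for each nucleus, with larger values meaning "sorted earlier", that $\epsilon$ is a positive constant, and that the expectation is taken as a plain average over all ordered pairs in one manually sorted list. These are interpretive assumptions, and the function names are hypothetical.

```python
import math
from itertools import combinations


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def ranking_loss(model_outputs, epsilon: float = 1.0) -> float:
    """Average pairwise loss for one task request, following Eq. (19).

    model_outputs: Rank_theta(Request, c) for each nucleus, listed in the
    human-sorted order (best first); needs at least two entries.
    """
    # combinations() keeps list order, so each pair is (c_beta, c_phi)
    # with c_beta sorted before c_phi by the human annotator.
    pairs = list(combinations(model_outputs, 2))
    total = sum(math.log(sigmoid(beta - phi)) for beta, phi in pairs)
    return -epsilon * total / len(pairs)


agrees = ranking_loss([2.0, 1.0, 0.0])     # model order matches human order
disagrees = ranking_loss([0.0, 1.0, 2.0])  # model order is reversed
print(agrees, disagrees)
```

Under these assumptions the loss is smaller when the model's outputs reproduce the human ordering, which is the behavior that minimizing the loss is meant to encourage.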

The trained model can autonomously sort the information nucleuses of combined data cells and realizes intelligent information discovery based on task requests.

4. Examples of Applications

This section introduces some examples of intelligent information discovery methods. The information-driven method takes the example of the Cultural Income Table of the Department of Culture and the Tourism Income Table of the Department of Tourism to demonstrate the construction of initial data cells by using table fields and meaningful combined data cells generated by autonomous association of data pipes. The task-driven method takes the task request “What’s the weather today?” as an example and demonstrates the process of subject heading acquisition, subject headings–nucleuses retrieval, the autonomous association of data pipes, and the discovery of the optimal combined data cell for the task request.

4.1. The Example of the Information-Driven Method

The Cultural Income Table includes the fields Time (IT), Place (IP), and Cultural Income (II); the Tourism Income Table includes the fields Time (TT), Place (TP), and Tourism Income (TI).

For ease of representation, the different strategies are abbreviated in this section, as shown in Table 2:

Initial data cells are constructed using the fields of the Cultural Income Table and the Tourism Income Table. Meaningful nucleuses of combined data cells acquired by multi-level data pipes association are shown in Table 3.

4.2. The Example of the Task-Driven Method

Taking the task request “What’s the weather today?” as an example, the task request is analyzed using the NLU topic model to obtain the set of subject headings $SH = \{\text{city}, \text{date}, \text{air temperature}, \text{temperature}, \text{air quality}\}$. Subject headings are used to retrieve initial data cells to obtain the set of information nucleuses of initial data cells:

$$\{\text{city}, \text{region}, \text{date}, \text{air temperature}, \text{temperature}, \text{humidity}, \text{air quality}, \text{air pollution index}, \dots\} \tag{20}$$
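The retrieval step can be sketched as a lookup that accepts either an exact match or a related term. The `RELATED_TERMS` table below is a hypothetical stand-in for synonym substitution and subject heading conversion, and the list of stored nucleuses is invented for illustration; the sketch does not attempt to reproduce the exact set in Equation (20).

```python
# Hypothetical related-term table standing in for synonym substitution
# and subject heading conversion.
RELATED_TERMS = {
    "city": {"region"},
    "air quality": {"air pollution index"},
}


def retrieve_nucleuses(subject_headings, stored_nucleuses):
    """Return stored nucleuses that match a heading exactly or via a related term."""
    matched = []
    for nucleus in stored_nucleuses:
        for heading in subject_headings:
            if nucleus == heading or nucleus in RELATED_TERMS.get(heading, set()):
                matched.append(nucleus)
                break  # one matching heading is enough
    return matched


SH = ["city", "date", "air temperature", "temperature", "air quality"]
# Illustrative nucleus names; "population" is included as a non-match.
stored = ["city", "region", "date", "population", "air temperature",
          "temperature", "air quality", "air pollution index"]
print(retrieve_nucleuses(SH, stored))
```

A production system would presumably replace the fixed table with a thesaurus or an embedding-based similarity search; the fixed table is used here only to keep the sketch self-contained.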

Information nucleuses of combined data cells are generated by the autonomous association of data pipes of data cells.

Information nucleuses are manually sorted and used to train the reward and punishment model. The trained model realizes automatic ranking of information nucleuses of combined data cells and the best matching nucleus is stored (see Table 4).

5. Discussion

The method proposed in this paper can be widely applied to multi-source heterogeneous databases of different fields, types, and structures in a variety of government departments, companies, and social organizations. The autonomous data association method based on the organizational structure of data cells and the two intelligent information discovery methods provide a new theory for the autonomous mining, management, and application of multi-source heterogeneous data, and a new idea for cross-industry information sharing that provides high-quality data sources for AI. The methodology of this paper is feasible in the following three aspects:

Theoretical feasibility: Field names, key-value names, or technical names in the database are data items of the same type divided according to some boundaries, which represent specific transactions. Data association is the analysis of the relationships between transactions. This paper takes field names, key-value names, or technical names as information nucleuses of initial data cells, and the nucleuses of combined data cells are generated by the random association of data pipes of initial data cells, so the construction of information nucleuses is in accordance with the essence of data association. The construction of strategies and data pipes and the association of data pipes realize autonomous data association. Strategies are all kinds of common algorithms and scripts, and unified expressions are generated by the association of data pipes, which realizes the association of multiple data sources with different types and formats. The process of association of data pipes is random, thus maximizing the acquisition of associated new information. According to Section 3.2.2, the unified expressions formed by the autonomous association of data pipes need to be judged using arithmetic rules and practical significance. Therefore, we select expressions that meet arithmetic rules and classify associations into symmetric and asymmetric categories to construct information nucleuses of combined data cells, and we train reward and punishment models from two perspectives to simulate the manual screening process. The information-driven type directly filters nucleuses, and the task-driven type is oriented to the requirements of real application tasks. In conclusion, the method in this paper extracts meaningful associated information from subjective and objective perspectives and from overall and local perspectives.
Technical feasibility: The construction of information nucleuses of initial data cells is the process of extracting field names, key-value names, or technical names, which can be obtained directly using related database statements; other critical information can be obtained using various existing algorithms, such as image-recognition algorithms. Strategies are divided into underlying strategies and advanced strategies, where underlying strategies are common basic mathematical operations and advanced strategies are existing complex algorithms or combinations of different strategies, which can be realized via algorithm development. The association of data pipes is a sequential combination or connection of information nucleuses and strategies. Expressions that fail to conform to arithmetic rules can be identified by imitating “calculator” applications, in which logical judgments are implemented using manually predefined rules. The symmetry of associations of data pipes can also be judged using manual predefinition. For the two intelligent information discovery methods, we have given loss functions and training data formats for the reward and punishment model, so the model can be implemented via more detailed design and programming. The training data can be obtained using manual scoring or ranking of samples. When the empirical knowledge reaches a certain level, the model can spontaneously screen meaningful information nucleuses to realize autonomous information discovery. The meaningful nucleuses are stored in the cloud brain, which is essentially a relational database and can be designed with reference to relational databases.
Feasibility of data security: The data invoked in an actual application are retrieved according to the unified expressions of meaningful information nucleuses stored in the cloud brain. Data cells are directly connected with the database, and the data are called and analyzed without changing the source data or the source environment. Moreover, the data invoked according to a unified expression have been calculated using various strategies and therefore differ substantially from the original data, so the safety of the original data is initially guaranteed. In addition, to further guarantee data security, data managers can restrict the permissions of different personnel or departments to access information nucleuses in the cloud brain.

The data autonomous association method and associated information intelligent discovery method proposed in this paper have multiple advantages over existing multi-source heterogeneous data processing methods, as summarized in Table 5:

Our method can deal with a wide range of multi-source and multimodal data, while other data processing methods only deal with certain types of data or are limited to a certain topic. Our method depends less on manual work because the main workload is the manual labelling of data in the early stage, and the manual dependence is reduced significantly after the model has matured. The autonomous data association method based on data cells is autonomous, while other methods rely on manual definition. The intelligent discovery methods can better cope with data updates, such as changes in data types and the generation of new associations, because only the association rules are stored. Therefore, our method has greater robustness than traditional methods.

6. Conclusions

The autonomous data association method based on data cells and the two intelligent information discovery methods address the problem of the autonomous association of massive multimodal data and the problem of the intelligent acquisition of potential information generated by autonomous association. According to the meaningful association rules, a large number of complex and effective data sources can be provided for AI. Via the examples of applications and the feasibility analysis, we demonstrate the effectiveness and feasibility of the methods. In our future work, we intend to propose more optimized autonomous data association methods and further optimized models to discover deeper information from massive multimodal data. Furthermore, we will develop a series of evaluation indexes to evaluate the methods.

Author Contributions

Conceptualization, W.W., J.L. and J.J.; methodology, W.W., J.L. and J.J.; validation, W.W. and E.G.; formal analysis, W.W. and T.Y.; investigation, W.W., Q.W. and B.W.; writing—original draft preparation, W.W.; writing—review and editing, W.W., J.L. and J.J.; supervision, W.W.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 41961063); Guilin City Technology Application and Promotion Project in 2022 (20220138-2); Key R&D Projects in Guilin City in 2022 (20220109).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because they were extracted from public data platforms of local government agencies, and we do not have the right to disclose these reorganized data elsewhere.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, B.; Qi, P.; Liu, B.; Di, S.; Liu, J.; Pei, J.; Yi, J.; Zhou, B. Trustworthy AI: From principles to practices. ACM Comput. Surv. 2023, 55, 177.
2. Lotfian, M.; Ingensand, J.; Brovelli, M.A. The partnership of citizen science and machine learning: Benefits, risks, and future challenges for engagement, data collection, and data quality. Sustainability 2021, 13, 8087.
3. Zha, D.; Bhat, Z.P.; Lai, K.-H.; Yang, F.; Hu, X. Data-centric AI: Perspectives and challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 27–29 April 2023; pp. 945–948.
4. Wang, T. A novel approach of integrating natural language processing techniques with fuzzy TOPSIS for product evaluation. Symmetry 2022, 14, 120.
5. Shen, Z.; Zhang, X.; Liu, Z. PM² VE: Power Metering Model for Virtualization Environments in Cloud Data Centers. IEEE Trans. Cloud Comput. 2023, 11, 3126–3138.
6. Ethan, A. Data Virtualization: The Key to Realizing Big Data Analytics Potential. Int. J. Comput. Sci. Inf. 2022, 6, 20–50.
7. Shiva, L. Data Virtualization Best Practices for Advanced Analytics in Big Data. Int. J. Comput. Sci. Inf. 2022, 6, 39–66.
8. Al-Okaily, A.; Al-Okaily, M.; Teoh, A.P.; Al-Debei, M.M. An empirical study on data warehouse systems effectiveness: The case of Jordanian banks in the business intelligence era. EuroMed J. Bus. 2023, 18, 489–510.
9. Nambiar, A.; Mundra, D. An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management. Big Data Cogn. Comput. 2022, 6, 132.
10. Oueslati, W.; Tahri, S.; Limam, H.; Akaichi, J. A systematic review on moving objects’ trajectory data and trajectory data warehouse modeling. Comput. Sci. Rev. 2023, 47, 100516.
11. Porshnev, S.; Borodin, A.; Ponomareva, O.; Mirvoda, S.; Chernova, O. The development of a heterogeneous MP data model based on the ontological approach. Symmetry 2021, 13, 813.
12. Muniswamaiah, M.; Agerwala, T.; Tappert, C. Data virtualization for decision making in big data. Int. J. Softw. Eng. Appl. 2019, 10, 45–53.
13. Saxena, G.; Agarwal, B.B. Data Warehouse Designing: Dimensional Modelling and ER Modelling. Int. J. Eng. Invent. 2014, 3, 28–34.
14. Togatorop, P.R.; Sitorus, D.; Purba, Y.; Tarigan, A.M. Twitter Data Warehouse and Business Intelligence Using Dimensional Model and Data Mining. In Proceedings of the 2022 IEEE International Conference of Computer Science and Information Technology (ICOSNIKOM), Laguboti, Sumatera Utara, Indonesia, 19–21 October 2022; pp. 1–6.
15. Rodríguez-Mazahua, N.; Rodríguez-Mazahua, L.; López-Chau, A.; Alor-Hernández, G.; Machorro-Cano, I. Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses. Appl. Sci. 2022, 12, 10942.
16. Witanto, E.N.; Oktian, Y.E.; Lee, S.-G. Toward data integrity architecture for cloud-based AI systems. Symmetry 2022, 14, 273.
17. Wu, X.; Duan, J.; Pan, Y.; Li, M. Medical knowledge graph: Data sources, construction, reasoning, and applications. Big Data Min. Anal. 2023, 6, 201–217.
18. Hassan, M.M.; Karim, A.; Mollick, S.; Azam, S.; Ignatious, E.; Al Haque, A.F. An Apriori Algorithm-Based Association Rule Analysis to detect Human Suicidal Behaviour. Procedia Comput. Sci. 2023, 219, 1279–1288.
19. Liu, T.; Zhang, X.; Du, P.; Du, Q.; Li, A.; Gong, L. Knowledge Discovery Method from Text Big Data for Earthquake Emergency. Geomat. Inf. Sci. Wuhan Univ. 2020, 45, 1205–1213.
20. Cao, S.J.; Cao, R.Y. Research on Interdisciplinary Knowledge Discovery Based on Knowledge Graph to Support Scientific Research Innovation. Inf. Stud. Theory Appl. 2022, 45, 45–53.
21. Huang, X.; Liu, Y.; Huang, L.; Onstein, E.; Merschbrock, C. BIM and IoT data fusion: The data process model perspective. Autom. Constr. 2023, 149, 104792.
22. Moreno, C.; González, R.A.C.; Viedma, E.H. Data and artificial intelligence strategy: A conceptual enterprise big data cloud architecture to enable market-oriented organisations. Int. J. Interact. 2019, 5, 7–14.
23. Yang, J.-T.; Chen, W.-Y.; Li, C.-H.; Huang, S.C.-H.; Wu, H.-C. APPFLChain: A Privacy Protection Distributed Artificial-Intelligence Architecture Based on Federated Learning and Consortium Blockchain. arXiv 2022, arXiv:2206.12790.
24. Liu, J.; Li, T.; Xie, P.; Du, S.; Teng, F.; Yang, X. Urban big data fusion based on deep learning: An overview. Inf. Fusion 2020, 53, 123–133.
25. Liu, W.; Zhang, C.; Yu, B.; Li, Y. A general multi-source data fusion framework. In Proceedings of the 2019 11th International Conference on Machine Learning and Computing, Zhuhai, China, 22–24 February 2019; pp. 285–289.
26. Ji, Z.Y.; Pi, H.Y.; Yao, W. A hybrid recommendation model based on fusion of multi-source heterogeneous data. J. Beijing Univ. Posts Telecommun. 2019, 42, 126.
27. Liu, Z.; Liu, H.; Huang, W.; Wang, B.; Sun, F. Audiovisual cross-modal material surface retrieval. Neural Comput. Appl. 2020, 32, 14301–14309.
28. Meng, F.; Li, A.; Liu, Z. An Evidence theory and data fusion based classification method for decision making. Procedia Comput. Sci. 2022, 199, 892–899.
29. Shu, X.; Ye, Y. Knowledge Discovery: Methods from data mining and machine learning. Soc. Sci. Res. 2023, 110, 102817.
30. Rajput, D.S.; Meena, G.; Acharya, M.; Mohbey, K.K. Fault prediction using fuzzy convolution neural network on IoT environment with heterogeneous sensing data fusion. Meas. Sens. 2023, 26, 100701.
31. Abdulahi Hasan, A.; Fang, H. Data Mining in Education: Discussing Knowledge Discovery in Database (KDD) with Cluster Associative Study. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Information Systems, Chongqing, China, 28–30 May 2021; pp. 1–6.
32. Mollaei, N.; Fujao, C.; Rodrigues, J.; Cepeda, C.; Gamboa, H. Occupational health knowledge discovery based on association rules applied to workers’ body parts protection: A case study in the automotive industry. Comput. Methods Biomech. Biomed. 2023, 26, 1875–1888.
33. Jun, D.; Ruan, W. Research on Knowledge Map and Multidimensional Knowledge Discovery of Oral History Archives Resources. Libr. Inf. Serv. 2022, 66, 4–16.
34. Janssen, M.; Brous, P.; Estevez, E.; Barbosa, L.S.; Janowski, T. Data governance: Organizing data for trustworthy Artificial Intelligence. Gov. Inf. Q. 2020, 37, 101493.
35. Di Vaio, A.; Hassan, R.; Alavoine, C. Data intelligence and analytics: A bibliometric analysis of human–Artificial intelligence in public sector decision-making effectiveness. Technol. Forecast. Soc. Chang. 2022, 174, 121201.
36. Zhi-Qiang, P.; Jian-Qiang, Y.; Zhen, L.; Teng-Hai, Q.; Jin-Lin, S.; Fei-Mo, L. Knowledge-based and data-driven integrating methodologies for collective intelligence decision making: A survey. Acta Autom. Sin. 2022, 48, 627–643.
37. Zhe, J.; Yin, Z.; Fei, W.; Wenwu, Z.; Yunhe, P. Artificial Intelligence Algorithms Based on Data-driven and Knowledge-guided Models. J. Electron. Sci. Technol. 2023, 45, 2580–2594.
38. Zhang, J.; Xiao, W.; Li, Y. Data and knowledge twin driven integration for large-scale device-free localization. IEEE Internet Things J. 2020, 8, 320–331.
39. Zhu, J.; Chai, M.; Zhou, W. Three-three-three network architecture and learning optimization mechanism for B5G/6G. J. Commun. 2021, 42, 62–75.
40. Sarker, I.H. Data science and analytics: An overview from data-driven smart computing, decision-making and applications perspective. SN Comput. Sci. 2021, 2, 377.
41. Yin, T.; Lu, N.; Guo, G.; Lei, Y.; Wang, S.; Guan, X. Knowledge and data dual-driven transfer network for industrial robot fault diagnosis. Mech. Syst. Signal Process. 2023, 182, 109597.
42. Yin, J.; Ren, X.; Liu, R.; Tang, T.; Su, S. Quantitative analysis for resilience-based urban rail systems: A hybrid knowledge-based and data-driven approach. Reliab. Eng. Syst. Saf. 2022, 219, 108183.
43. Destro, F.; Salmon, A.J.; Facco, P.; Pantelides, C.C.; Bezzo, F.; Barolo, M. Monitoring a segmented fluid bed dryer by hybrid data-driven/knowledge-driven modeling. IFAC-PapersOnLine 2020, 53, 11638–11643.
44. Wang, H.; Mao, K.; Yuan, Z.; Shi, J.; Cao, M.; Qin, Z.; Duan, S.; Tang, B. A method for land surface temperature retrieval based on model-data-knowledge-driven and deep learning. Remote Sens. Environ. 2021, 265, 112665.
45. Wu, Z.; Zhang, Y.; Dong, Z. Prediction of NOx emission concentration from coal-fired power plant based on joint knowledge and data driven. Energy 2023, 271, 127044.
46. Wu, W.; Song, C.; Liu, J.; Zhao, J. Data-knowledge-driven distributed monitoring for large-scale processes based on digraph. J. Process Control 2022, 109, 60–73.
47. Shi, Z. Image semantic analysis and understanding. In Proceedings of the International Conference on Intelligent Information Processing, Manchester, UK, 13–16 October 2010; pp. 4–5.
48. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. 2013, 35, 2891–2903.
49. Cohn, N.; Jackendoff, R.; Holcomb, P.J.; Kuperberg, G.R. The grammar of visual narrative: Neural evidence for constituent structure in sequential image comprehension. Neuropsychologia 2014, 64, 63–70.
50. Dong, J.; Li, X.; Snoek, C.G. Predicting visual features from text for image and video caption retrieval. IEEE Trans. Multimed. 2018, 20, 3377–3388.
51. Han, M.; Wang, Y.; Chang, X.; Qiao, Y. Mining inter-video proposal relations for video object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 431–446.
52. Yang, H.; Zhao, X.; Wang, L. Review of data normalization methods. Comput. Appl. Eng. Educ. 2023, 59, 13–22.
53. Ahmad, Z.; Al-Thani, N.J. Undergraduate Research Experience Models: A systematic review of the literature from 2011 to 2021. Int. J. Educ. Res. 2022, 114, 101996.
54. Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv 2023, arXiv:2305.18290.
55. Churchill, R.; Singh, L. The evolution of topic modeling. ACM Comput. Surv. 2022, 54, 215.
56. Tarakeswar, M.K.; Kavitha, D. Search engines: A study. J. Comput. Appl. 2011, 4, 29–33.

Figure 1. Flowchart of the methodology.

Figure 2. The organizational structure of data cells. (a) The structure of the initial data cell $u$: $c_{u}$ represents the information nucleus of $u$; $s_{u,1}, \dots, s_{u,n}$ represent $n$ different strategies; ${oc}_{u,1}, \dots, {oc}_{u,n}$ represent the data pipes corresponding to $s_{u,1}, \dots, s_{u,n}$. (b) The structure of the binary combined data cell $(u, v)$: $c_{(u,v)}$ represents the information nucleus of the binary combined data cell $(u, v)$; $s_{(u,v),1}, \dots, s_{(u,v),n}$ (the larger yellow squares outside cells $u$ and $v$) represent $n$ different strategies; ${oc}_{(u,v),1}, \dots, {oc}_{(u,v),n}$ represent the data pipes corresponding to $s_{(u,v),1}, \dots, s_{(u,v),n}$.

Figure 3. Sketch of the binary combined data cell $(u, v)$.

Figure 4. Construction of information nucleuses of initial data cells: (a) table-type relational databases; (b) document-type non-relational databases; (c) image-type databases; (d) video-type databases.

Figure 5. The plus (+) data pipe with the information nucleus “weight” is associated with the output (⬅) data pipe with the information nucleus “height” to form the unified expression “weight + height”.

Figure 6. Symmetric association of data pipes and asymmetric association of data pipes. (a) and (b) are symmetric and (c) and (d) are asymmetric. The symbols “+”, “&”, “−”, and “confidence (A, B)” in the figure represent “addition”, “and”, “subtraction”, and “confidence” operations, respectively.
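
The distinction drawn in Figure 6 can be read as commutativity: an association is symmetric when swapping the two data pipes leaves the result unchanged. The sketch below checks this for the four operations named in the caption on one pair of toy inputs. It is an illustration only: the function and variable names are hypothetical, the sample values are invented, and `confidence` assumes the usual association-rule definition (the share of records satisfying A that also satisfy B), which the caption itself does not spell out.

```python
from typing import Callable, Sequence


def confidence(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Assumed definition: share of records satisfying A that also satisfy B."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    n_a = sum(1 for x in a if x)
    if n_a == 0:
        raise ValueError("A never holds, so confidence(A, B) is undefined")
    n_ab = sum(1 for x, y in zip(a, b) if x and y)
    return n_ab / n_a


def is_symmetric(op: Callable, a, b) -> bool:
    """True if swapping the two inputs leaves the result unchanged for this pair."""
    return op(a, b) == op(b, a)


# Invented numeric fields for "+" and "-".
weight, height = 60.0, 170.0
# Invented boolean fields for "&" and confidence.
rainy = [True, True, False, False]
humid = [True, True, True, False]

symmetry = {
    "+": is_symmetric(lambda x, y: x + y, weight, height),
    "&": is_symmetric(lambda x, y: [p and q for p, q in zip(x, y)], rainy, humid),
    "-": is_symmetric(lambda x, y: x - y, weight, height),
    "confidence": is_symmetric(confidence, rainy, humid),
}
print(symmetry)
```

On these inputs "+" and "&" come out symmetric while "−" and confidence do not, which matches the grouping of panels (a), (b) versus (c), (d). A single pair of inputs can only refute symmetry, not prove it, so the check is a sanity test rather than a classification procedure.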

Figure 7. The information-driven intelligent information discovery method.

Figure 8. The task-driven intelligent information discovery method.

Table 1. Subjective scoring standard of combined data cells.

Score	Practical Significance Description of Information Nucleuses
1	nucleuses with no practical significance.
2	nucleuses with a little practical significance.
3	nucleuses with general practical significance.
4	nucleuses with great practical significance.
5	nucleuses with common practical significance.

Table 2. Abbreviations of strategies.

Strategy	Abbreviation
Input	$\leftarrow A$
Output	$\to A$
Data normalization	$d(A, B)$
Addition	$A + B$
Subtraction	$A - B$
Multiplication	$A \times B$
Division	$A \div B$
And	$A \& B$
Or	$A \| B$
Not	$!A$
Supportability	$S(A, B)$
Confidence	$C(A, B)$
Spatio-temporal relationship	$\mathrm{TimeSpace}(A)$
Empirical modeling	$\mathrm{Model}(A)$; $\mathrm{Model}(A, B)$; $\mathrm{Model}(A, B, C)$; $\dots$

Table 3. Meaningful nucleuses of combined data cells.

No.	Nucleuses	Explanation
1	←IT←IP→II	Input a year and a city; output the city’s cultural income for that year.
2	(←IT←IP→II) − (←TT←TP→TI)	When the inputs satisfy “←IT = ←TT, ←IP = ←TP”, it gives the difference between cultural income and tourism income in the given city and year. (Replacing “−” with “/” gives the income ratio; replacing it with “+” gives the sum of the incomes.)
3	(←IT1←IP1→II1) − (←IT2←IP2→II2)	When the inputs satisfy “←IT1 = ←IT2, ←IP1 ≠ ←IP2”, it gives the difference in cultural income between different cities in the same year. When “←IT1 ≠ ←IT2, ←IP1 = ←IP2”, it gives the difference in cultural income of the given city between different years. (Replacing “−” with “+” gives the sum of the incomes; replacing it with “/” gives the income ratio.)
4	((←IT1←IP1→II1) − (←IT2←IP2→II2))/(←IT3←IP3→II3)	When the inputs satisfy “←IP1 = ←IP2 = ←IP3, ←IT2 = ←IT3” and IT1 and IT2 are adjacent years, it gives the annual growth rate of a city’s cultural income.
5	(((←IT1←IP1→II1) − (←IT2←IP2→II2))/(←IT3←IP3→II3)) && (((←TT1←TP1→TI1) − (←TT2←TP2→TI2))/(←TT3←TP3→TI3))	When the inputs satisfy “←IP1 = ←IP2 = ←IP3 = ←TP1 = ←TP2 = ←TP3, ←IT2 = ←IT3”, IT1 and IT2 are adjacent years, and “←IT1 = ←TT1, ←IT2 = ←TT2, ←IT3 = ←TT3”, it supports analysis of the positive or negative correlation between the cultural income and tourism income of a city.
6	Model((←IT1←IP1→II1), (←IT2←IP2→II2), …, (←ITn←IPn→IIn))	Empirical modeling of changes in cultural income over time, or of future cultural income, etc.
7	Model(((←IT1←IP1→II1), (←IT2←IP2→II2), …, (←ITn←IPn→IIn)), ((←TT1←TP1→TI1), (←TT2←TP2→TI2), …, (←TTn←TPn→TIn)))	Mining the relationship between cultural income and tourism income, etc.

The numbers appended to the fields distinguish multiple occurrences of the same field within one information nucleus, such as “IT1” and “IT2” in No. 3.
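
To make the growth-rate nucleus in No. 4 of Table 3 concrete, the following sketch evaluates it on a toy lookup table. Everything here is an assumption for illustration: `cultural_income` is a hypothetical stand-in for the nucleus ←IT←IP→II, the income figures and the city name are invented, and IT1 is taken to be the later of the two adjacent years, which the table does not state explicitly.

```python
# Hypothetical stand-in for the nucleus "←IT←IP→II":
# input a year (IT) and a city (IP), output cultural income (II).
# The figures are invented for illustration only.
CULTURAL_INCOME = {
    (2021, "city_a"): 100.0,
    (2022, "city_a"): 120.0,
}


def cultural_income(year: int, city: str) -> float:
    """Look up the cultural income of a city in a given year."""
    return CULTURAL_INCOME[(year, city)]


def annual_growth_rate(year: int, city: str) -> float:
    """Evaluate ((II1) - (II2)) / (II3) under the constraints of No. 4:
    IP1 = IP2 = IP3, IT2 = IT3, and IT1, IT2 adjacent (assumed IT1 = IT2 + 1)."""
    ii1 = cultural_income(year, city)      # IT1 = year
    ii2 = cultural_income(year - 1, city)  # IT2 = previous year
    ii3 = cultural_income(year - 1, city)  # IT3 = IT2
    return (ii1 - ii2) / ii3


rate = annual_growth_rate(2022, "city_a")
print(f"{rate:.1%}")
```

The constraints listed in the explanation column are what turn a generic "difference divided by a third value" into a growth rate: without IT2 = IT3 and a shared city, the same nucleus would still evaluate but would have no practical meaning.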

Table 4. Sorting list of information nucleuses of combined data cells for the request.

Rank	Nucleuses of Combined Data Cells
1	←city←date←air temperature←temperature←air pollution index*…
2	←city←date←air temperature←temperature←air quality←air pollution index…
3	←city←date←air temperature←temperature
…	…
n−1	←air quality*
n	←air quality

Fields marked “*” in the upper right-hand corner are from the meteorological database.

Table 5. Comparison of our method with other multi-source heterogeneous data processing methods.

Method	Type and Range of Data Processed	Manual Dependency	Association Pattern between Data	Deep Mining of Data Associations	Robustness
Traditional data association [18,19,20]	Several specific types	High. Requires manual definition of data associations	Relies on manual definition in advance	No	Weak
Data warehouse [5,6,7]	Multimodal data around a topic	High. Requires storage layer design	Relies on manual definition in advance	No	Moderate
Data virtualization [8,9,10,11]	Multimodal data around a topic	High. Requires virtual layer design	Relies on manual definition in advance	No	Moderate
Our method	Wide range of multimodal data	Low. Requires only manual annotation of samples	Autonomous association	Yes	Strong

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, W.; Li, J.; Jiang, J.; Wang, B.; Wang, Q.; Gao, E.; Yue, T. Autonomous Data Association and Intelligent Information Discovery Based on Multimodal Fusion Technology. Symmetry 2024, 16, 81. https://doi.org/10.3390/sym16010081
