CN106709006B

CN106709006B - Query-friendly associated data compression method

Info

Publication number: CN106709006B
Application number: CN201611209081.1A
Authority: CN
Inventors: 顾进广; 彭燊; 黄智生; 符海东; 梅琨
Original assignee: Wuhan Chutianyun Technology Co ltd; Wuhan University of Science and Engineering WUSE
Current assignee: Wuhan Chutianyun Technology Co ltd; Wuhan University of Science and Engineering WUSE
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2020-10-30
Anticipated expiration: 2036-12-23
Also published as: CN106709006A

Abstract

The invention relates to a query-friendly associated data compression method, which comprises the following steps: defining a relation mining rule, and mining potential association relations in the triples; defining a compression query memory model, which consists of a subject vector, a predicate vector and an object matrix; defining a serialization mode of a compression query memory model, and realizing serialization and deserialization by using three auxiliary symbols; defining a query mode of executing SPARQL on a compressed query memory model, wherein a binary search method is used for subject query and predicate query, and a linear traversal method is used for object query; a solution is defined in which when the object matrix is too large, resulting in slow queries, a large data block is split into multiple small data blocks. The related data set processed by the method improves the compression rate compared with most of the existing compression schemes, and can directly carry out SPARQL query operation in a compression state.

Description

Query-friendly associated data compression method

Technical Field

The invention relates to the field of big data, and is used for storing, transmitting and inquiring related data of massive RDF, LOD and knowledge maps. In particular to a query-friendly associated data compression method

Background

There are many existing associated data compression schemes, but most are not query-friendly. The widely accepted compression scheme is HDT, which has a high compression rate, but requires decompression first during query, and is not friendly to query. Inspired by the HDT scheme, many compression techniques based on the HDT scheme have been proposed, such as HDT FoQ, WaterFowl, HDT + +, which all have a common feature: high compression rate, but not query friendly.

There are also query-friendly schemes, such as the BitMat method, in which the compression scheme uses a three-dimensional matrix to express the triple relationship, and a storage space is also reserved for many nonexistent triple relationships. When the associated data set is large to a certain scale, the three-dimensional matrix becomes a super sparse matrix, and the compression rate is not ideal due to the fact that a lot of redundant information is stored. To reduce redundant information stored, K²A triple scheme is proposed, which divides a three-dimensional matrix into a plurality of two-dimensional matrices according to a predicate, and stores the two-dimensional matrices with a structure of a K2 tree. The method improves the compression rate to a certain extent, but also destroys the original intuitive matrix structure, so that the matrix needs to be restored first when the query is carried out, and the query efficiency of RDF is reduced by the operation.

As more and more associated data floods the entire data network, query performance and data scalability become a focus issue when it is necessary to manage and query such data. Although enough storage media may be used to store increasingly large sets of associated data, the large sets of data not only result in reduced query efficiency, but also exacerbate performance issues for other common processes, such as RDF publishing and swapping. With the increasing popularity of remote SPARQL endpoint query approaches that transmit execution results over a network, the issuance and exchange of RDFs is used more and more frequently in queries for associated data. Therefore, the method for compressing the associated data which is query-friendly is significant.

Disclosure of Invention

In view of the above problems, it is an object of the present invention to find a query-friendly compression scheme. Under the condition of not decompressing the related data compressed data, the SPARQL query can be directly carried out, and meanwhile, the compression rate is improved as much as possible.

The objective of the invention is achieved by mining potential relationship matrices in the associated datasets. The method comprises the following steps:

a query-friendly associated data compression method is characterized in that,

a step of constructing a structural model, comprising:

step1, associating and analyzing data of a Triple memory model based on an N-Triple format to obtain a Triple set, then constructing a dictionary, and identifying triples, wherein the analyzing process comprises the following steps:

step 1.1, filtering out lines starting with "#" or empty lines;

step 1.2, reading each line of data and segmenting character strings according to spaces;

step 1.3, mapping the segmented data to a subject, a predicate and an object of the triple to construct a triple;

step2, mining potential associations in the triples based on the relation mining constraint;

step3, defining a compression query memory model, wherein the compression query memory model is composed of header information, a dictionary and a data block set, and each data block is composed of a subject vector, a predicate vector and an object matrix; the compressed query memory model represents the triple relationship by using a subject vector, a predicate vector and an object matrix: defining a subject vector as a column vector with the length of m, a predicate vector as a row vector with the length of n, an object matrix as an m x n matrix, carrying out vector multiplication on the subject vector and the predicate vector to obtain a matrix which is the same as the object matrix in size and consists of subjects and predicates, and mapping the matrix with data items of the object matrix one by one, wherein each mapped item is a triple relationship;

the method for storing the high-compression-rate data based on the structural model specifically comprises the following steps:

step4, defining the same storage length of each ID, and based on a serialization mode of a compression query memory model: using the auxiliary mark to carry out serialization and deserialization operations;

step 4.1, serialization: flattening the object matrix for each data block, connecting the subject vector, the predicate vector and the flattened object matrix data into a linear data structure by using a preposition auxiliary identifier, and connecting the processed linear data structures together by using a data block auxiliary identifier;

step 4.2, deserialization: dividing the serialized data into each data block according to the data block auxiliary identifier, dividing each data block into a subject vector, a predicate vector and a flattened object matrix according to the preposition auxiliary identifier, and reducing the flattened object matrix into a true matrix according to the length of the subject vector;

a step of data query based on a structure model specifically comprises:

step 5, the SPARQL query based on the compressed query memory model: the query of the subject and the predicate uses binary search, and the query of the object uses linear traversal search, and the method specifically comprises the following steps:

step 5.1, the subject query method specifically comprises the following steps: traversing the subject vectors of all data blocks, wherein the subject query time complexity is O (log2n) because the subject vectors are sorted internally and a binary search method is used;

step 5.2, the predicate query method specifically comprises the following steps: traversing the predicate vectors of all data blocks, because the predicate vectors are ordered internally, using a binary search method, so the predicate query time complexity is O (log2 n);

step 5.3, the object query method specifically comprises the following steps: and traversing the object matrixes of all the data blocks, wherein the object matrixes are not sorted internally, only can search in sequence, and the time complexity is O (n).

In the query-friendly associated data compression method, the obtaining of the Triple set based on the N-Triple format associated data and the parsing specifically includes:

step 2.1, filtering out rows starting with # or empty rows;

step 2.2, reading each line of data and segmenting character strings according to spaces;

and 2.3, mapping the segmented data to the subject, predicate and object of the triple to construct a triple.

In the foregoing query-friendly associated data compression method, constructing a dictionary, and identifying triples specifically includes:

step 3.1, flattening operation is carried out on the triples obtained in the previous step, and repeated data items are removed;

step 3.2, allocating a unique ID for each item of data to obtain Dictionary;

step 3.3, extracting the head information with the same data in the Dictionary to obtain a Header; and replacing the original triple data with the ID to obtain the ID triple set.

In the foregoing query-friendly associated data compression method, the relationship mining constraint includes:

constraint one: combining the triples with the same subject and predicate;

constraint two: classifying all triples according to the subjects, combining all predicates and objects of the same subjects to form predicate vectors and object vectors, and extracting the predicate vectors of each subject;

constraint condition three: merging triples with the same predicate (predicate vector) and object;

constraint condition four: and classifying all the triples according to the predicate vectors, and merging the subjects and the objects of the same predicate vectors to form a subject vector and an object matrix.

In the query-friendly associated data compression method, the query memory model is compressed, and the triple relationship is represented by using a subject vector, a predicate vector and an object matrix: the subject vector is assumed to be a column vector with the length of m, the predicate vector is a row vector with the length of n, the object matrix is an m x n matrix, the subject vector and the predicate vector are subjected to vector multiplication to obtain a matrix which is the same as the object matrix in size and consists of subjects and predicates, the matrix is mapped with data items of the object matrix one by one, and each mapped item is a triple relationship.

In the above method for compressing associated data friendly to query, the query of SPARQL further includes:

step 5.4, complex query, wherein all complex queries can be decomposed into the three simple queries, and the results of the simple queries are combined; the steps 5.1 to 5.4 can be performed concurrently.

The query-friendly associated data compression method further comprises a step of splitting an object matrix, wherein the data block with an overlarge object matrix is split into a plurality of data blocks, specifically, when the object matrix is too large, the object query is slow, the solution scheme is adopted, the predicate vector is kept unchanged, the subject vector and the object matrix are correspondingly split, and a plurality of small data blocks are obtained.

Therefore, the invention has the following advantages: the related data set processed by the invention has improved compression ratio compared with most existing compression schemes, and can directly perform SPARQL query operation in a compression state.

Drawings

FIG. 1 is a compression diagram of an embodiment of the present invention.

FIG. 2 is a diagram of a compressed query memory model according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating four relationship mining rules according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a big data chunk splitting rule according to an embodiment of the present invention.

FIG. 5 is a schematic flow chart of the method of the present invention.

Detailed Description

The technical scheme of the invention is explained in detail in the following by combining the drawings and the embodiment.

The technical scheme provided by the invention is a relational data set compression algorithm based on a relational matrix, and specifically comprises the following steps:

1. defining a triple memory model, wherein the triple memory model comprises three data segments of a subject S, a predicate P and an object O;

2. inputting N-Triple format associated data and analyzing to obtain a Triple set;

the detailed process is as follows:

2.1. filter out rows starting with "#" or empty rows;

2.2. reading each line of data and segmenting character strings according to spaces;

2.3. mapping the segmented data to a subject, a predicate and an object of the triple to construct a triple;

3. constructing a dictionary and identifying the triples;

the detailed process is as follows:

3.1. flattening the triples obtained in the last step, and removing repeated data items;

3.2. allocating a unique ID to each item of data to obtain a Dictionary;

3.3. extracting the Header information with the same data in the Dictionary to obtain a Header;

3.4. replacing original triple data with ID to obtain an ID triple set;

4. a first Step of relation mining, namely combining triples with the same subject and predicate, referring to Step1 in the figure 1, and referring to Rule1 in the figure 3 for derivation formula;

5. a second Step of relation mining, namely classifying all triples according to subjects, combining all predicates and objects of the same subjects to form predicate vectors and object vectors, extracting the predicate vectors of each subject, sequencing the interiors of the predicate vectors, referring to Step2 in the figure 1, and referring to Rule2 in the figure 3 for a derivation formula;

6. a third Step of relation mining, namely combining triples with the same predicate (predicate vector) and object, referring to Step3 in the figure 1, and deducing a formula, referring to Rule3 in the figure 3;

7. a fourth Step of relation mining, namely classifying all triples according to predicates (predicate vectors), combining subjects and objects of the same predicates (predicate vectors) to form internally ordered subject vectors and object matrices, and referring to Step4 in the attached drawing 1 and a derivation formula in the attached drawing 3, namely Rule 4;

8. extracting a subject vector, a predicate vector and an object matrix of each data block and establishing a compression query memory model, wherein the compression query memory model refers to an attached figure 2;

9. in the SPARQL query mode in the compression state, concurrent query operation can be carried out on all data blocks;

the detailed process is as follows:

9.1. subject query, which traverses subject vectors of all data blocks, because the subject vectors are sorted internally, and a binary search method is used, the time complexity of the subject query is O (log2 n);

9.2. predicate query, traversing the predicate vectors of all data blocks, since the predicate vectors are ordered internally, using a binary lookup method, so the predicate query time complexity is O (log2 n);

9.3. object query, traversing the object matrixes of all the data blocks, wherein the object matrixes are not sorted and can only be searched sequentially, the time complexity is O (n), and data blocking can be performed on the data blocks with the particularly large object matrixes so as to reduce the time overhead caused by linear traversal search, referring to the attached figure 4;

9.4. complex queries, all of which can be decomposed into the first three simple queries, and the results of the simple queries are merged.

10. The serialized write file is configured such that the storage length of each ID is the same, and serialization and deserialization are realized using auxiliary symbols "|" (or identifier), "(subject-to-object separator), and"/"(data block separator).

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims

1. A query-friendly associated data compression method is characterized in that,

a step of constructing a structural model, comprising:

step 1.1, filtering out rows starting with # or empty rows;

constructing a dictionary, and specifically identifying triples comprising:

step 3.2, distributing a unique ID for each item of data to obtain a dictionary;

step 3.3, extracting the head information with the same data in each item of the dictionary to obtain a Header;

replacing original triple data with ID to obtain an ID triple set;

step2, mining potential associations in the triples based on the relation mining constraint; the relationship mining constraints include:

the first step is as follows: combining the triples with the same subject and predicate;

the second step is that: classifying all triples according to the subjects, combining all predicates and objects of the same subjects to form predicate vectors and object vectors, and extracting the predicate vectors of each subject;

the third step: merging triples with the same predicate vector and object;

the fourth step: classifying all triples according to the predicate vectors, and merging subjects and objects of the same predicate vectors to form subject vectors and object matrixes;

step4, defining the same storage length of each ID, and based on a serialization mode of a compression query memory model: using the auxiliary identification to carry out serialization and deserialization operations;

step 4.2, deserialization: dividing the serialized data into a plurality of data blocks according to the data block auxiliary identification, dividing each data block into a subject vector, a predicate vector and a flattened object matrix according to the preposition auxiliary identification, and reducing the flattened object matrix into a true matrix according to the length of the subject vector;

a step of data query based on a structure model specifically comprises:

2. The query-friendly method for compressing associated data as recited in claim 1, wherein the query by SPARQL further comprises:

step 5.4, complex query, all complex queries can be decomposed into the three simple queries, and the results of the simple queries are combined; the steps 5.1 to 5.4 can be performed concurrently.

3. The query-friendly associated data compression method as claimed in claim 1, further comprising a step of splitting an object matrix, wherein for a data block with an excessively large object matrix, the data block is split into a plurality of data blocks, specifically, when the object matrix is excessively large, which results in a slow object query solution, the predicate vector is kept unchanged, and the subject vector and the object matrix are correspondingly split to obtain a plurality of small data blocks.