Disclosure of Invention
In view of the above problems, it is an object of the present invention to find a query-friendly compression scheme. Under the condition of not decompressing the related data compressed data, the SPARQL query can be directly carried out, and meanwhile, the compression rate is improved as much as possible.
The objective of the invention is achieved by mining potential relationship matrices in the associated datasets. The method comprises the following steps:
a query-friendly associated data compression method is characterized in that,
a step of constructing a structural model, comprising:
step1, associating and analyzing data of a Triple memory model based on an N-Triple format to obtain a Triple set, then constructing a dictionary, and identifying triples, wherein the analyzing process comprises the following steps:
step 1.1, filtering out lines starting with "#" or empty lines;
step 1.2, reading each line of data and segmenting character strings according to spaces;
step 1.3, mapping the segmented data to a subject, a predicate and an object of the triple to construct a triple;
step2, mining potential associations in the triples based on the relation mining constraint;
step3, defining a compression query memory model, wherein the compression query memory model is composed of header information, a dictionary and a data block set, and each data block is composed of a subject vector, a predicate vector and an object matrix; the compressed query memory model represents the triple relationship by using a subject vector, a predicate vector and an object matrix: defining a subject vector as a column vector with the length of m, a predicate vector as a row vector with the length of n, an object matrix as an m x n matrix, carrying out vector multiplication on the subject vector and the predicate vector to obtain a matrix which is the same as the object matrix in size and consists of subjects and predicates, and mapping the matrix with data items of the object matrix one by one, wherein each mapped item is a triple relationship;
the method for storing the high-compression-rate data based on the structural model specifically comprises the following steps:
step4, defining the same storage length of each ID, and based on a serialization mode of a compression query memory model: using the auxiliary mark to carry out serialization and deserialization operations;
step 4.1, serialization: flattening the object matrix for each data block, connecting the subject vector, the predicate vector and the flattened object matrix data into a linear data structure by using a preposition auxiliary identifier, and connecting the processed linear data structures together by using a data block auxiliary identifier;
step 4.2, deserialization: dividing the serialized data into each data block according to the data block auxiliary identifier, dividing each data block into a subject vector, a predicate vector and a flattened object matrix according to the preposition auxiliary identifier, and reducing the flattened object matrix into a true matrix according to the length of the subject vector;
a step of data query based on a structure model specifically comprises:
step 5, the SPARQL query based on the compressed query memory model: the query of the subject and the predicate uses binary search, and the query of the object uses linear traversal search, and the method specifically comprises the following steps:
step 5.1, the subject query method specifically comprises the following steps: traversing the subject vectors of all data blocks, wherein the subject query time complexity is O (log2n) because the subject vectors are sorted internally and a binary search method is used;
step 5.2, the predicate query method specifically comprises the following steps: traversing the predicate vectors of all data blocks, because the predicate vectors are ordered internally, using a binary search method, so the predicate query time complexity is O (log2 n);
step 5.3, the object query method specifically comprises the following steps: and traversing the object matrixes of all the data blocks, wherein the object matrixes are not sorted internally, only can search in sequence, and the time complexity is O (n).
In the query-friendly associated data compression method, the obtaining of the Triple set based on the N-Triple format associated data and the parsing specifically includes:
step 2.1, filtering out rows starting with # or empty rows;
step 2.2, reading each line of data and segmenting character strings according to spaces;
and 2.3, mapping the segmented data to the subject, predicate and object of the triple to construct a triple.
In the foregoing query-friendly associated data compression method, constructing a dictionary, and identifying triples specifically includes:
step 3.1, flattening operation is carried out on the triples obtained in the previous step, and repeated data items are removed;
step 3.2, allocating a unique ID for each item of data to obtain Dictionary;
step 3.3, extracting the head information with the same data in the Dictionary to obtain a Header; and replacing the original triple data with the ID to obtain the ID triple set.
In the foregoing query-friendly associated data compression method, the relationship mining constraint includes:
constraint one: combining the triples with the same subject and predicate;
constraint two: classifying all triples according to the subjects, combining all predicates and objects of the same subjects to form predicate vectors and object vectors, and extracting the predicate vectors of each subject;
constraint condition three: merging triples with the same predicate (predicate vector) and object;
constraint condition four: and classifying all the triples according to the predicate vectors, and merging the subjects and the objects of the same predicate vectors to form a subject vector and an object matrix.
In the query-friendly associated data compression method, the query memory model is compressed, and the triple relationship is represented by using a subject vector, a predicate vector and an object matrix: the subject vector is assumed to be a column vector with the length of m, the predicate vector is a row vector with the length of n, the object matrix is an m x n matrix, the subject vector and the predicate vector are subjected to vector multiplication to obtain a matrix which is the same as the object matrix in size and consists of subjects and predicates, the matrix is mapped with data items of the object matrix one by one, and each mapped item is a triple relationship.
In the above method for compressing associated data friendly to query, the query of SPARQL further includes:
step 5.4, complex query, wherein all complex queries can be decomposed into the three simple queries, and the results of the simple queries are combined; the steps 5.1 to 5.4 can be performed concurrently.
The query-friendly associated data compression method further comprises a step of splitting an object matrix, wherein the data block with an overlarge object matrix is split into a plurality of data blocks, specifically, when the object matrix is too large, the object query is slow, the solution scheme is adopted, the predicate vector is kept unchanged, the subject vector and the object matrix are correspondingly split, and a plurality of small data blocks are obtained.
Therefore, the invention has the following advantages: the related data set processed by the invention has improved compression ratio compared with most existing compression schemes, and can directly perform SPARQL query operation in a compression state.
Detailed Description
The technical scheme of the invention is explained in detail in the following by combining the drawings and the embodiment.
The technical scheme provided by the invention is a relational data set compression algorithm based on a relational matrix, and specifically comprises the following steps:
1. defining a triple memory model, wherein the triple memory model comprises three data segments of a subject S, a predicate P and an object O;
2. inputting N-Triple format associated data and analyzing to obtain a Triple set;
the detailed process is as follows:
2.1. filter out rows starting with "#" or empty rows;
2.2. reading each line of data and segmenting character strings according to spaces;
2.3. mapping the segmented data to a subject, a predicate and an object of the triple to construct a triple;
3. constructing a dictionary and identifying the triples;
the detailed process is as follows:
3.1. flattening the triples obtained in the last step, and removing repeated data items;
3.2. allocating a unique ID to each item of data to obtain a Dictionary;
3.3. extracting the Header information with the same data in the Dictionary to obtain a Header;
3.4. replacing original triple data with ID to obtain an ID triple set;
4. a first Step of relation mining, namely combining triples with the same subject and predicate, referring to Step1 in the figure 1, and referring to Rule1 in the figure 3 for derivation formula;
5. a second Step of relation mining, namely classifying all triples according to subjects, combining all predicates and objects of the same subjects to form predicate vectors and object vectors, extracting the predicate vectors of each subject, sequencing the interiors of the predicate vectors, referring to Step2 in the figure 1, and referring to Rule2 in the figure 3 for a derivation formula;
6. a third Step of relation mining, namely combining triples with the same predicate (predicate vector) and object, referring to Step3 in the figure 1, and deducing a formula, referring to Rule3 in the figure 3;
7. a fourth Step of relation mining, namely classifying all triples according to predicates (predicate vectors), combining subjects and objects of the same predicates (predicate vectors) to form internally ordered subject vectors and object matrices, and referring to Step4 in the attached drawing 1 and a derivation formula in the attached drawing 3, namely Rule 4;
8. extracting a subject vector, a predicate vector and an object matrix of each data block and establishing a compression query memory model, wherein the compression query memory model refers to an attached figure 2;
9. in the SPARQL query mode in the compression state, concurrent query operation can be carried out on all data blocks;
the detailed process is as follows:
9.1. subject query, which traverses subject vectors of all data blocks, because the subject vectors are sorted internally, and a binary search method is used, the time complexity of the subject query is O (log2 n);
9.2. predicate query, traversing the predicate vectors of all data blocks, since the predicate vectors are ordered internally, using a binary lookup method, so the predicate query time complexity is O (log2 n);
9.3. object query, traversing the object matrixes of all the data blocks, wherein the object matrixes are not sorted and can only be searched sequentially, the time complexity is O (n), and data blocking can be performed on the data blocks with the particularly large object matrixes so as to reduce the time overhead caused by linear traversal search, referring to the attached figure 4;
9.4. complex queries, all of which can be decomposed into the first three simple queries, and the results of the simple queries are merged.
10. The serialized write file is configured such that the storage length of each ID is the same, and serialization and deserialization are realized using auxiliary symbols "|" (or identifier), "(subject-to-object separator), and"/"(data block separator).
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.