
CN118445347B - Analytical database delayed materialized scanning method compatible with multiple storage engines - Google Patents

Analytical database delayed materialized scanning method compatible with multiple storage engines Download PDF

Info

Publication number
CN118445347B
CN118445347B · CN202410904108.7A
Authority
CN
China
Prior art keywords
data
filtering
engine
filtered
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410904108.7A
Other languages
Chinese (zh)
Other versions
CN118445347A (en)
Inventor
丁骁阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Nankai University General Data Technologies Co ltd
Jiangsu Huaku Data Technology Co ltd
Original Assignee
Tianjin Nankai University General Data Technologies Co ltd
Jiangsu Huaku Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Nankai University General Data Technologies Co ltd and Jiangsu Huaku Data Technology Co ltd
Priority to CN202410904108.7A priority Critical patent/CN118445347B/en
Publication of CN118445347A publication Critical patent/CN118445347A/en
Application granted granted Critical
Publication of CN118445347B publication Critical patent/CN118445347B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an analytical database delayed materialized scanning method compatible with multiple storage engines, relating to the technical field of data scanning. The method specifically comprises the following steps: reading the data to be filtered from disk and storing it into a cache area; calling an interface of the storage engine to keep the cache area from releasing the data to be filtered for a preset time; packaging the memory addresses of the data to be filtered into a data block and passing the data block to the execution engine; converting the filtering condition into the data format of the storage engine to be executed, obtaining an adapted filtering condition; and filtering in the storage engine according to the adapted filtering condition and the data block, then copying the remaining rows after filtering and returning them to the execution engine. The invention preserves the high filtering performance of the compute engine while greatly reducing the amount of memory copying, so that different storage engines can all support delayed materialized scanning and uniformly benefit from the high-performance filtering of the vectorized execution engine.

Description

Analytical database delayed materialized scanning method compatible with multiple storage engines
Technical Field
The invention relates to the technical field of data scanning, and in particular to an analytical database delayed materialized scanning method compatible with multiple storage engines.
Background
When an analytical database execution engine must be compatible with multiple columnar storage engines, the storage engines are typically wrapped as different kinds of "input streams" or "data sources", and the execution engine is only responsible for reading data from those streams, without considering the logic inside each storage engine. Because different storage engines support different features, analytical databases with high performance requirements often face obstacles when implementing efficient data scanning.
When facing multiple storage engines, neither the predicate compatibility nor the filtering performance of each storage engine can be guaranteed. Predicate pushdown improves performance only when a storage engine's compatibility meets the requirements and its filtering performance is higher than, or close to, that of the compute engine, and it is almost impossible for every supported storage engine to meet these conditions. Modifying the storage engines to meet them would break the integrity of external components, and the engineering difficulty and maintenance cost would be high.
Disclosure of Invention
The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides an analytical database delayed materialized scanning method compatible with multiple storage engines.
The invention provides an analytical database delayed materialized scanning method compatible with multiple storage engines, which comprises the following steps:
S1: reading the data to be filtered from disk and storing it into a cache area;
S2: calling an interface of the storage engine to keep the cache area from releasing the data to be filtered for a preset time;
S3: packaging the memory addresses of the data to be filtered into a data block, and passing the data block to the execution engine;
S4: converting the filtering condition into the data format of the storage engine to be executed, obtaining an adapted filtering condition;
S5: filtering in the storage engine according to the adapted filtering condition and the data block, then copying the remaining rows after filtering and returning them to the execution engine.
According to the method provided by the invention, the preset time in step S2 is greater than or equal to the time the execution engine needs to filter and materialize the data to be filtered.
According to the method provided by the invention, the storage engine in step S2 is a vectorized storage engine.
According to the method provided by the invention, in step S4, the process of converting the filtering condition is implemented within the execution engine.
According to the method provided by the invention, in step S4, when the filtering condition cannot be converted into the data format of the storage engine, the unfiltered data among the data to be filtered is returned to the execution engine, and the execution engine filters it.
According to the method provided by the invention, step S4 specifically comprises: when the storage engine's data is stored encoded against a minimum value, the condition value of the adapted filtering condition is the difference between the condition value of the filtering condition and the minimum value; when the storage engine's data is stored in a comparable binary format, the condition value of the adapted filtering condition is the condition value of the filtering condition computed by the comparable binary encoding function.
According to the method provided by the invention, in step S4, converting the filtering condition into the data format of the storage engine to be executed is realized by rewriting the expression of the filtering condition.
The method provided by the invention guarantees that the high filtering performance of the compute engine can be exerted and greatly reduces the amount of memory copying, regardless of each storage engine's support for predicate pushdown or its filtering performance, so that different storage engines can all support delayed materialized scanning and uniformly benefit from the high-performance filtering of the vectorized execution engine.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an analytical database delayed materialization scanning method compatible with multiple storage engines according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
For a better understanding of the present invention, the background of research on materialized scanning is described below.
Assume a query needs to filter data by a condition; two execution strategies are possible:
either push the filtering condition down to the storage engine and transmit only the hit data to the execution engine, which is delayed materialized scanning; or transmit all unfiltered data directly to the execution engine and filter there, which is early materialized scanning.
Delayed materialized scanning appears clearly more efficient, because the amount of data output from the storage engine may be significantly reduced, which greatly reduces memory copying and improves performance. Databases in which the execution engine and the storage engine are deeply coupled typically adopt this design.
However, it is difficult for every storage engine to support delayed materialized scanning, and different storage engines cannot uniformly benefit from the high-performance filtering of the vectorized execution engine.
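The trade-off between the two strategies can be illustrated with a minimal sketch; the row counts and the predicate here are illustrative, not from the patent:

```python
# Contrast early vs. delayed (late) materialization by how much data is copied
# out of the storage layer. All names are illustrative.

rows = [{"a": i, "b": i % 10} for i in range(1000)]
predicate = lambda r: r["b"] > 7  # the filtering condition on column b

# Early materialization: copy everything to the execution engine, filter there.
early_copied = list(rows)                        # all 1000 rows are copied
early_result = [r for r in early_copied if predicate(r)]

# Delayed materialization: filter inside the storage engine, copy only hits.
late_copied = [r for r in rows if predicate(r)]  # only hit rows are copied
late_result = late_copied

assert early_result == late_result               # same answer either way
copy_ratio = len(late_copied) / len(early_copied)  # fraction actually copied
```

With a selective predicate the delayed strategy copies only a fifth of the rows here, which is exactly the memory-copy saving the patent targets.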
An embodiment of the present invention is described below with reference to fig. 1.
The invention provides an analytical database delay materialized scanning method compatible with a plurality of storage engines, which comprises the following steps:
S1: reading the data to be filtered from disk and storing it into the cache area.
Furthermore, after a storage engine reads data from disk into memory, the data is generally placed in a cache area to wait for subsequent components to read it; after the data has been used, it is released according to the cache eviction policy, or released directly.
S2: calling an interface of the storage engine to keep the cache area from releasing the data to be filtered for a preset time.
The preset time in step S2 is greater than or equal to the time the execution engine needs to filter and materialize the data to be filtered.
Given this release behavior of the cache area, step S2 of the invention controls the life cycle of the data in the cache area by calling the storage engine's interface, so that the data is released only after filtering and materialization have completed in the execution engine.
Specifically, once data is loaded into the cache area, APIs can be used to access it, and if the data needs to be modified, the corresponding API functions or methods can update it in place. Which data should be retained in the cache and which should be removed can be decided by setting a caching policy, such as LRU (least recently used) or LFU (least frequently used). The implementation sets an expiration time for the data: once the data's time in the cache exceeds this threshold, it is deleted automatically. APIs are also provided to explicitly delete data from the cache, and finally, when data is modified in the cache, it must be flushed back to the underlying storage system.
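As an illustrative sketch (not the patent's implementation), the pin interface of step S2 can be modeled as a small LRU buffer pool whose eviction loop skips pages pinned until a deadline; the class and method names are assumptions:

```python
import time
from collections import OrderedDict

class BufferPool:
    """Minimal sketch of a storage-engine cache area with an LRU policy and
    a pin interface that keeps pages alive for a preset time. Real engines
    expose comparable pin/unpin or retain APIs; this is an assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> data, oldest first
        self.pinned_until = {}       # page_id -> monotonic expiry timestamp

    def put(self, page_id, data):
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        self._evict()

    def pin(self, page_id, seconds):
        # S2: tell the cache not to release this data for a preset time
        self.pinned_until[page_id] = time.monotonic() + seconds

    def _evict(self):
        for pid in list(self.pages):
            if len(self.pages) <= self.capacity:
                break
            if time.monotonic() < self.pinned_until.get(pid, 0.0):
                continue  # still pinned: skip eviction
            del self.pages[pid]

pool = BufferPool(capacity=2)
pool.put("b", [1, 2, 3])
pool.pin("b", seconds=60)    # keep column b alive while filtering runs
pool.put("c", [4, 5, 6])
pool.put("a", [7, 8, 9])     # over capacity: evicts the oldest unpinned page
```

Here page "c" is evicted instead of the older but pinned page "b", which models releasing data only after filtering and materialization have completed.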
The storage engine in step S2 is a vectorized storage engine.
S3: and packaging the memory address of the data to be filtered into a data block, and transmitting the data block into an execution engine.
Further, if the data stream directly copies the data in the buffer into the data block that can be processed by the execution engine, the data is typically implemented in early materialization, but the step S3 of the present invention does not need to copy, and only encapsulates the memory address of the data required in the buffer into a special data block, even if the data block is transferred into the execution engine, it is still equivalent to not materializing.
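The zero-copy data block of step S3 can be sketched as follows; in Python, sharing the same list object stands in for passing a memory address, whereas a C/C++ engine would store raw pointers. The names are illustrative:

```python
# Sketch of S3: wrap references to cached data in a "data block" rather
# than copying the data itself.

class RefDataBlock:
    """Holds references into the storage engine's cache area; no copy yet."""
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> buffered array (shared)

cache_area = {"b": [5, 1, 9, 3]}            # data sitting in the cache area
block = RefDataBlock({"b": cache_area["b"]})

# No materialization has happened: block and cache share the same storage.
assert block.columns["b"] is cache_area["b"]
```

Because the block only references the cache, passing it to the execution engine costs nothing regardless of how wide the columns are.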
S4: and converting the filtering condition into a data format of the storage engine to be executed to obtain an adaptive filtering condition.
Wherein, in step S4: and converting the filtering condition into a data format of the storage engine to be executed, and realizing the filtering condition by rewriting an expression of the filtering condition.
Wherein, in step S4: the process of converting the filter term is implemented within an execution engine.
Wherein, in step S4: and when the filtering condition cannot be converted into the data format of the storage engine, returning the unfiltered data in the data to be filtered to the execution engine, and filtering the unfiltered data by the execution engine.
Further, the buffer is simply referred to, and data cannot be read correctly when facing to storage engines with different storage formats, for example, data stored by some storage engines are encoded, and after being read into a memory, the data needs to be decoded to restore to a general state. At this time, a step of writing the expression is required, and the filtering condition is written to adapt to the data format of the storage engine. Meanwhile, some coding formats cannot be adapted through expression rewrite, so that the program needs to distinguish such situations and fall back to an early materialized execution mode, and it should be noted that the expression rewrite operation is implemented entirely within the data stream or the execution engine, and the code does not need to invade the storage engine.
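The rewrite-or-fall-back decision can be sketched as below; the encoding names and the rewrite rule are illustrative assumptions, not the patent's exhaustive list:

```python
# Sketch of the fallback path in S4: try to rewrite the predicate into the
# storage engine's format; if no rewrite rule applies, ship the unfiltered
# data and let the execution engine filter it (early materialization).

def try_rewrite(cond_value, encoding):
    """Return the rewritten condition value, or None if unsupported."""
    if encoding["kind"] == "min_offset":
        return cond_value - encoding["min"]   # rewritable encoding
    return None                               # unknown encoding: fall back

def scan(stored, cond_value, encoding):
    rewritten = try_rewrite(cond_value, encoding)
    if rewritten is not None:
        # late materialization: filter directly on the stored representation
        return [v for v in stored if v > rewritten], "late"
    # fallback: decode everything, execution engine filters the decoded values
    decoded = stored  # identity decode, for this sketch only
    return [v for v in decoded if v > cond_value], "early"

hits, mode = scan([1, 2, 3], 10001, {"kind": "min_offset", "min": 10000})
```

For the min-offset case the stored values are filtered without decoding; an unknown encoding (e.g. `{"kind": "opaque"}`) would take the early-materialization branch instead.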
Step S4 specifically comprises: when the storage engine's data is stored encoded against a minimum value, the condition value of the adapted filtering condition is the difference between the condition value of the filtering condition and the minimum value; when the storage engine's data is stored in a comparable binary format, the condition value of the adapted filtering condition is the condition value of the filtering condition computed by the comparable binary encoding function.
Some specific examples of the expression rewriting described in step S4 follow.
1) The data is stored with minimum-value encoding: the minimum of a batch of data is stored in the metadata, and the actual data is stored after subtracting that value to reduce data volume;
Minimum value: 10000;
Stored values: 1, 2, 3;
Condition: X > 10000;
The condition must be rewritten to X > (10000 - minimum), i.e., X > 0.
2) The data is stored in a comparable binary format, with bin denoting some comparable binary encoding function;
Stored values: bin(10000), bin(10001);
Condition: X > 10000;
The condition must be rewritten to X > bin(10000).
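Both rewrite rules can be checked with a short sketch. The concrete `bin` function below (big-endian unsigned bytes) is one possible order-preserving encoding chosen for illustration; the patent only requires that the encoding be comparable:

```python
import struct

# Sketch of the two expression-rewrite rules of step S4.

def rewrite_min_offset(cond_value, minimum):
    # stored = actual - minimum, so "X > c" becomes "stored > c - minimum"
    return cond_value - minimum

def bin_encode(value):
    # order-preserving encoding: byte-wise comparison matches numeric order
    return struct.pack(">Q", value)

# 1) min-value encoding: minimum = 10000, condition X > 10000
stored = [1, 2, 3]                        # actual values 10001, 10002, 10003
c = rewrite_min_offset(10000, 10000)      # rewritten condition value: 0
hits_min = [v for v in stored if v > c]   # filter without decoding anything

# 2) comparable binary format: condition X > 10000
stored_bin = [bin_encode(10000), bin_encode(10001)]
c_bin = bin_encode(10000)                 # rewrite condition into same format
hits_bin = [v for v in stored_bin if v > c_bin]
```

In both cases the comparison runs directly on the stored representation, so the storage engine never has to decode rows that the filter will discard.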
S5: and filtering the storage engine according to the adaptive filtering conditions and the data blocks, copying the residual lines after the filtering and returning to the execution engine.
After steps S1 to S4, the filtering operation of the vectorization execution engine may be utilized to access the data in the buffer area in a read-only manner, and only the remaining rows are copied after the filtering is completed, so as to implement high-performance delayed materialized scanning for different storage engines.
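Step S5 amounts to producing a selection vector over the shared cache and copying only the surviving rows; the function names below are illustrative:

```python
# Sketch of S5: read-only filtering over the referenced cache produces a
# selection vector; only rows that survive are then copied (materialized).

def filter_select(column, adapted_cond):
    # vectorized-style pass over the shared buffer; no rows copied yet
    return [i for i, v in enumerate(column) if adapted_cond(v)]

def materialize(columns, selection):
    # copy only the remaining rows back to the execution engine
    return {name: [col[i] for i in selection] for name, col in columns.items()}

cached = {"a": [10, 20, 30, 40], "b": [1, 8, 3, 9]}
sel = filter_select(cached["b"], lambda v: v > 2)   # selection vector
result = materialize(cached, sel)                   # copy survivors only
```

Separating selection from materialization is what lets the copy cost scale with the number of hits rather than the number of scanned rows.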
Another specific example will be described in detail below.
Query statement: select sum(a) from t1 where b > 2 and c > 2;
This query filters the data by the values of columns b and c, but the subsequent aggregate computation only needs the data of column a.
Without the method provided by the invention, the data scanning process is as follows:
reading the column b data file and filling the cache;
copying column b from the cache, filtering with b > 2, and recording the hit row numbers after filtering;
if the hit rows are not 0, reading the column c data file and filling the cache;
copying the corresponding rows of column c from the cache according to the row numbers hit in the second step, filtering with c > 2, and recording the hit row numbers;
if the hit rows are not 0, reading the column a data file and filling the cache;
copying column a from the cache according to the row numbers from the fourth step;
outputting column a, and releasing the memory of columns b and c after copying.
With the method provided by the invention, the data scanning process is as follows:
reading the column b data file and filling the cache;
constructing a reference column for column b, rewriting the expression as needed, filtering, and recording the hit row numbers after filtering;
if the hit rows are not 0, reading the column c data file and filling the cache;
according to the proportion of rows hit in the second step, deciding whether to construct a reference column for column c (high hit ratio) or to materialize it by the hit row numbers (low hit ratio), and filtering with c > 2;
if the hit rows are not 0, reading the column a data file and filling the cache;
materializing column a according to the row numbers;
outputting column a, and releasing the memory of column c if it was copied.
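The adaptive choice in the optimized scan (reference column vs. materialization for column c) can be sketched as a simple hit-ratio threshold. The 0.5 threshold is an illustrative assumption; the patent does not fix a value:

```python
# Sketch of the reference-vs-materialize decision in the optimized scan:
# after filtering column b, keep referencing the cache for column c when
# many rows survive, or copy only the hit rows when few survive.

HIGH_HIT_THRESHOLD = 0.5  # illustrative cutoff, not from the patent

def next_column_strategy(hit_rows, total_rows):
    ratio = hit_rows / total_rows
    return "reference" if ratio >= HIGH_HIT_THRESHOLD else "materialize"

# b > 2 hit 900 of 1000 rows: keep referencing the cache for column c
s1 = next_column_strategy(900, 1000)
# b > 2 hit only 30 of 1000 rows: cheaper to copy just those 30 rows of c
s2 = next_column_strategy(30, 1000)
```

At high selectivity the copy of the hit rows is small, so materializing wins; at low selectivity referencing avoids a near-full copy, at the cost of evaluating the predicate on a few invalid rows, which vectorized execution makes cheap.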
The comparison shows that with the method provided by the invention, regardless of the storage engine's support for predicate pushdown or its filtering performance, the high filtering performance of the compute engine is guaranteed to be exerted and the amount of memory copying is greatly reduced. In this example, materialization of column b is avoided entirely, and whether column c is materialized is adjusted dynamically according to the filtering result: if the selectivity is high, materialization of column c can also be avoided entirely. Although column c still contains some invalid rows already filtered out by the condition on column b, vectorized execution of the expression means that computing a small number of extra invalid rows has a negligible impact on performance. The final result is a significant improvement in data scanning performance: with the storage engine unchanged and no loss of filtering performance, the cost of materialization is almost eliminated. After the database was optimized with this method, TPC-H benchmark performance also improved slightly, with more pronounced gains on queries whose column materialization costs are large.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. An analytical database delayed materialized scanning method compatible with multiple storage engines, characterized by comprising the following steps:
S1: reading the data to be filtered from disk and storing it into a cache area;
S2: calling an interface of the storage engine to keep the cache area from releasing the data to be filtered for a preset time, wherein the preset time is greater than or equal to the time the execution engine needs to filter and materialize the data to be filtered;
S3: packaging the memory addresses of the data to be filtered into a data block, and passing the data block to an execution engine;
S4: converting the filtering condition into the data format of the storage engine to be executed, obtaining an adapted filtering condition, which specifically comprises:
when the storage engine's data is stored encoded against a minimum value, the condition value of the adapted filtering condition is the difference between the condition value of the filtering condition and the minimum value;
when the storage engine's data is stored in a comparable binary format, the condition value of the adapted filtering condition is the condition value of the filtering condition computed by the comparable binary encoding function;
wherein converting the filtering condition into the data format of the storage engine to be executed is realized by rewriting the expression of the filtering condition;
S5: filtering in the storage engine according to the adapted filtering condition and the data block, then copying the remaining rows after filtering and returning them to the execution engine.
2. The method for delayed materialized scanning of an analytical database compatible with multiple storage engines according to claim 1, wherein the storage engine in step S2 is a vectorized storage engine.
3. The method of claim 1, wherein in step S4, the process of converting the filtering condition is implemented in an execution engine.
4. The method for delayed materialized scanning of an analytical database compatible with multiple storage engines according to claim 1, wherein in step S4, when the filtering condition cannot be converted into the data format of the storage engine, the unfiltered data among the data to be filtered is returned to the execution engine, and the execution engine filters the unfiltered data.
CN202410904108.7A 2024-07-08 2024-07-08 Analytical database delayed materialized scanning method compatible with multiple storage engines Active CN118445347B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410904108.7A CN118445347B (en) 2024-07-08 2024-07-08 Analytical database delayed materialized scanning method compatible with multiple storage engines

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410904108.7A CN118445347B (en) 2024-07-08 2024-07-08 Analytical database delayed materialized scanning method compatible with multiple storage engines

Publications (2)

Publication Number Publication Date
CN118445347A CN118445347A (en) 2024-08-06
CN118445347B true CN118445347B (en) 2024-09-20

Family

ID=92309337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410904108.7A Active CN118445347B (en) 2024-07-08 2024-07-08 Analytical database delayed materialized scanning method compatible with multiple storage engines

Country Status (1)

Country Link
CN (1) CN118445347B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251353A (en) * 2023-11-20 2023-12-19 青岛民航凯亚系统集成有限公司 Monitoring method, system and platform for civil aviation weak current system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6839808B2 (en) * 2001-07-06 2005-01-04 Juniper Networks, Inc. Processing cluster having multiple compute engines and shared tier one caches
CN106886368B (en) * 2016-12-30 2019-08-16 北京同有飞骥科技股份有限公司 A kind of block device writes IO shaping and multi-controller synchronization system and synchronous method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251353A (en) * 2023-11-20 2023-12-19 青岛民航凯亚系统集成有限公司 Monitoring method, system and platform for civil aviation weak current system

Also Published As

Publication number Publication date
CN118445347A (en) 2024-08-06

Similar Documents

Publication Publication Date Title
US6691136B2 (en) Fast data retrieval based upon contiguous consolidation of records according to frequency of access
US11182083B2 (en) Bloom filters in a flash memory
AU2001236686B2 (en) Selectively auditing accesses to rows within a relational database at a database server
JPH0740239B2 (en) How to access the data page
US20150310053A1 (en) Method of generating secondary index and apparatus for storing secondary index
WO2022037015A1 (en) Column-based storage method, apparatus and device based on persistent memory
US11681623B1 (en) Pre-read data caching method and apparatus, device, and storage medium
CN104246727A (en) Data processing system and method for operating a data processing system
CN118445347B (en) Analytical database delayed materialized scanning method compatible with multiple storage engines
US9892038B2 (en) Method, apparatus, and system for data caching
US20180011897A1 (en) Data processing method having structure of cache index specified to transaction in mobile environment dbms
CN114896281A (en) Data processing method and system and electronic equipment
KR102389609B1 (en) Method for offloading disk scan directly to gpu in write-optimized database system
KR102321346B1 (en) Data journaling method for large solid state drive device
US12093234B2 (en) Data processing method, apparatus, electronic device, and computer storage medium
US11249921B2 (en) Page modification encoding and caching
CN115640078A (en) Android application loading optimization method based on intelligent prefetching of virtual file system data
CN110007869B (en) Memory data copying method, device, equipment and computer storage medium
CN107506156A (en) A kind of io optimization methods of block device
CN114116711A (en) Data processing method, data processing device, database, storage medium and program product
CN114238417A (en) Data caching method
CN108932111B (en) Method, medium and device for optimizing data read-write performance
CN117806567A (en) Data processing method and device
JPS593567A (en) Tree structure buffer number setting method
CN111694847B (en) Update access method with high concurrency and low delay for extra-large LOB data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant