
CN118394846A - Data synchronization method and device, electronic equipment and storage medium - Google Patents

Data synchronization method and device, electronic equipment and storage medium

Info

Publication number
CN118394846A
CN118394846A
Authority
CN
China
Prior art keywords
target
data
file
orc
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410660626.9A
Other languages
Chinese (zh)
Inventor
宋福健
张建涛
许哲
马丽霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Securities Co Ltd
Original Assignee
China Securities Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Securities Co Ltd filed Critical China Securities Co Ltd
Priority to CN202410660626.9A priority Critical patent/CN118394846A/en
Publication of CN118394846A publication Critical patent/CN118394846A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162Delete operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/178Techniques for file synchronisation in file systems
    • G06F16/1794Details of file format conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data synchronization method and device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring data to be synchronized from a target data source; writing the data to be synchronized, through a data synchronization tool, into a designated directory in the HDFS of a Hadoop cluster in the ORC file format, so as to generate a target ORC file in the HDFS; calling a preset file conversion tool to convert the target ORC file in the designated directory into a target HFile file; and calling a predetermined loading tool of the HBase cluster to load the data of the target HFile file into a target table of the HBase cluster. With this scheme, both the accuracy and the integrity of data synchronization can be ensured during batch data synchronization.

Description

Data synchronization method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data synchronization method, apparatus, electronic device, and storage medium.
Background
In data synchronization tasks, HBase (a non-relational distributed database) is often an important target; for example, synchronizing data from other databases into HBase is a common requirement. When synchronizing data to HBase, writing is typically achieved through the HBase Put operation, which inserts a row of data into a specified HBase table or updates an existing row. However, each Put operation handles only a limited amount of data, so batch synchronization cannot be achieved this way.
In the prior art, the ImportTsv tool (a command line tool provided by HBase) is typically used for batch data synchronization. However, the ImportTsv tool cannot correctly handle escape characters around the field separator: for example, when a specified character is set as the separator, the ImportTsv tool sometimes treats the escaped separator as an ordinary character, or treats an ordinary occurrence of that character inside the data as a separator, so that the accuracy and integrity of data synchronization are compromised in the existing batch synchronization method.
Therefore, how to ensure both the accuracy and the integrity of data synchronization during batch synchronization is a problem to be solved.
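The separator conflict described above can be illustrated with a minimal sketch (this is not ImportTsv's actual code, merely the failure mode it exhibits): when a field value itself contains the separator character, a round trip through a delimited text format changes the field count, whereas a structured format such as ORC keeps field boundaries intact.

```python
# Illustrative sketch of the separator-conflict problem: naive splitting on a
# separator character breaks when a field value contains that character.
row = ["user_1", "Beijing\tChaoyang", "active"]  # second field contains a tab

# A TSV-style export joins the fields with the separator...
line = "\t".join(row)
# ...and a delimiter-based parser splits on every occurrence,
# yielding 4 fields instead of the original 3.
parsed = line.split("\t")

print(len(row), len(parsed))
```

A columnar format avoids the problem entirely because field boundaries are recorded in the file structure rather than inferred from the byte stream.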
Disclosure of Invention
The embodiment of the application aims to provide a data synchronization method, a data synchronization device, an electronic device and a storage medium, so as to ensure both the accuracy and the integrity of data synchronization during batch data synchronization. The specific technical scheme is as follows:
In a first aspect, the present application provides a data synchronization method, applied to a data synchronization tool, the method comprising:
Acquiring data to be synchronized from a target data source;
Writing the data to be synchronized into a designated directory in an HDFS of a hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
and calling a preset loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster.
Optionally, before the calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file, the method further includes:
Acquiring command line parameters input by a user; the parameter content of the command line parameter comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table and field information of the target ORC file; the target storage address characterizes the address of the target HFile file stored on the HDFS;
the calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file, including:
invoking a preset file conversion tool to analyze the command line parameters to obtain the parameter content, configuring a MapReduce job based on the parameter content, and submitting the MapReduce job to a hadoop cluster so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the appointed directory into a target HFile file;
The job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table and the field information of the target ORC file;
the process of executing the MapReduce job by the preset data conversion module comprises the following steps:
reading the target ORC file from the specified directory of the HDFS based on the file path, and analyzing the read target ORC file to obtain each row of data in the target ORC file;
Determining target row data corresponding to each row of data in the target ORC file based on a target mapping relation, generating a target HFile file based on each item of target row data, and storing the target HFile file in the target storage address; the target mapping relation is a mapping relation of each field and each appointed field of the target ORC file determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and each field of the target ORC file are determined based on the field information of the target ORC file; each designated field is each field in the target table determined based on the table name; the data structure of the target table is determined based on the table name;
The target line data corresponding to each line of data is a line data which is obtained by mapping the field values of each field of the line of data to the corresponding designated field according to the target mapping relation and is used for writing the target table.
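The mapping step above can be sketched as follows. The field names and the `target_mapping` dictionary are hypothetical stand-ins (the patent leaves the concrete mapping relation to the target ORC file's field information and the target table's structure); the sketch only shows the shape of the operation: each ORC row's field values are carried over to the corresponding specified fields of the target table.

```python
# Hypothetical sketch of the "target mapping relation" step: field values of
# each ORC row are mapped onto the specified fields of the HBase target table.
# The names below are illustrative, not taken from the patent.
orc_fields = ["id", "name", "balance"]  # fields of the target ORC file
target_mapping = {                      # ORC field -> specified target field
    "id": "cf:id",
    "name": "cf:name",
    "balance": "cf:balance",
}

def to_target_row(orc_row):
    """Map one ORC row (a list of field values) to a dict keyed by the
    specified fields of the target table."""
    return {target_mapping[f]: v for f, v in zip(orc_fields, orc_row)}

row = to_target_row(["u001", "Alice", "98.5"])
print(row)
```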
Optionally, the specified fields include the row key of the target table and each field of the target table other than the row key, wherein the row key is used for uniquely identifying a row of data in the target table;
The writing the data to be synchronized into a specified directory in an HDFS of a hadoop cluster according to an ORC file format to generate a target ORC file in the HDFS includes:
Generating a key value of a row key corresponding to each row of data in the data to be synchronized based on a row key rule of the target table;
And writing the key value of the row key corresponding to each row of data and the data to be synchronized into a designated directory in an HDFS of the hadoop cluster in an ORC file format so as to generate a target ORC file in the HDFS.
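The patent leaves the concrete row key rule to the target table's configuration; one commonly used rule in HBase practice (shown here purely as an assumed example) is to prefix the business key with a short hash so that sequentially generated keys spread evenly across regions:

```python
import hashlib

# One possible row-key rule (assumed for illustration; the actual rule is
# determined by the target table): salt the business key with a short hash
# prefix so that sequential keys are distributed across HBase regions.
def make_row_key(business_key: str) -> str:
    prefix = hashlib.md5(business_key.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}_{business_key}"

key = make_row_key("order_20240101_0001")
print(key)
```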
Optionally, each specified field includes each field of the target table except for a row key, and the row key is used for uniquely identifying the row data in the target table;
After the preset data conversion module parses the read target ORC file to obtain each row of data in the target ORC file, the process further includes:
Generating a key value of a row key corresponding to each row of data in the target ORC file based on a row key rule of the target table;
The determining, based on the target mapping relationship, the target row data corresponding to each row of data in the target ORC file includes:
Mapping field values of fields of each row of data in the target ORC file to corresponding appointed fields based on a target mapping relation to obtain a mapping result corresponding to each row of data;
and generating target row data corresponding to each row of data based on the mapping result corresponding to each row of data and the key value of the row key corresponding to each row of data.
Optionally, the calling the predetermined loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster includes:
Invoking a preset loading tool of the HBase cluster to determine at least one target area corresponding to each row of data in the target HFile file based on a partition rule of the target table, and for each target area, sending the target HFile file to a target server corresponding to the target area in the HBase cluster to enable the target server to load each row of data corresponding to the target area in the target HFile file into the target area;
the partition rule defines a correspondence between row keys and target regions, and each target region is responsible for storing a part of the data in the target table.
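A minimal sketch of such a partition rule, under the assumption (standard in HBase, though the patent does not spell out the rule) that regions are defined by sorted split keys and a row key belongs to the region whose key range contains it:

```python
import bisect

# Sketch of a partition rule (assumed, not the patent's exact rule): pre-split
# regions are bounded by sorted split keys; a row key maps to the region whose
# half-open key range [lower, upper) contains it.
split_keys = ["g", "n", "t"]  # 4 regions: [-inf,g), [g,n), [n,t), [t,+inf)

def region_of(row_key: str) -> int:
    """Return the index of the region responsible for row_key."""
    return bisect.bisect_right(split_keys, row_key)

print(region_of("apple"), region_of("melon"), region_of("zebra"))
```

Grouping the rows of the HFile by `region_of` before sending yields exactly the "for each target area, send the file to the corresponding target server" step described above.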
Optionally, after calling a predetermined loading tool of the HBase cluster to load data of the target HFile file into the target table of the HBase cluster, the method further includes:
sending a first deleting instruction to an HDFS to delete the target ORC file stored in the HDFS;
and/or,
And sending a second deleting instruction to the HBase cluster so as to delete the HFile files stored in the HBase cluster.
In a second aspect, the present application provides a data synchronization device for use in a data synchronization tool, the device comprising:
the first acquisition module is used for acquiring data to be synchronized from a target data source;
The writing module is used for writing the data to be synchronized into a designated directory in an HDFS of the hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
the conversion module is used for calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
And the loading module is used for calling a preset loading tool of the HBase cluster so as to load the data of the target HFile file into the target table of the HBase cluster.
Optionally, the apparatus further includes:
The second obtaining module is used for obtaining command line parameters input by a user before calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the parameter content of the command line parameter comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table and field information of the target ORC file; the target storage address characterizes the address of the target HFile file stored on the HDFS;
The conversion module comprises:
The conversion unit is used for calling a preset file conversion tool to analyze the command line parameters, obtaining the parameter content, configuring a MapReduce job based on the parameter content, and submitting the MapReduce job to a hadoop cluster so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the appointed directory into a target HFile file;
The job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table and the field information of the target ORC file;
the process of executing the MapReduce job by the preset data conversion module comprises the following steps:
reading the target ORC file from the specified directory of the HDFS based on the file path, and analyzing the read target ORC file to obtain each row of data in the target ORC file;
Determining target row data corresponding to each row of data in the target ORC file based on a target mapping relation, generating a target HFile file based on each item of target row data, and storing the target HFile file in the target storage address; the target mapping relation is a mapping relation of each field and each appointed field of the target ORC file determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and each field of the target ORC file are determined based on the field information of the target ORC file; each designated field is each field in the target table determined based on the table name; the data structure of the target table is determined based on the table name, wherein the target row data corresponding to each row of data is the row data which is obtained by mapping the field values of each field of the row of data to the corresponding designated field according to the target mapping relation and is used for writing the target table.
Optionally, the specified fields include the row key of the target table and each field of the target table other than the row key, wherein the row key is used for uniquely identifying a row of data in the target table;
The write module includes:
the generating unit is used for generating a key value of a row key corresponding to each row of data in the data to be synchronized based on a row key rule of the target table;
And the writing unit is used for writing the key value of the row key corresponding to each row of data and the data to be synchronized into a designated directory in the HDFS of the hadoop cluster in an ORC file format so as to generate a target ORC file in the HDFS.
Optionally, each specified field includes each field of the target table except for a row key, and the row key is used for uniquely identifying the row data in the target table;
After the preset data conversion module parses the read target ORC file to obtain each row of data in the target ORC file, the process further includes:
Generating a key value of a row key corresponding to each row of data in the target ORC file based on a row key rule of the target table;
The determining, based on the target mapping relationship, the target row data corresponding to each row of data in the target ORC file includes:
Mapping field values of fields of each row of data in the target ORC file to corresponding appointed fields based on a target mapping relation to obtain a mapping result corresponding to each row of data;
and generating target row data corresponding to each row of data based on the mapping result corresponding to each row of data and the key value of the row key corresponding to each row of data.
Optionally, the loading module includes:
The sending unit is used for calling a preset loading tool of the HBase cluster to determine at least one target area corresponding to each row of data in the target HFile file based on the partition rule of the target table, and sending the target HFile file to a target server corresponding to the target area in the HBase cluster for each target area so that the target server loads each row of data corresponding to the target area in the target HFile file into the target area;
the partition rule defines a correspondence between row keys and target regions, and each target region is responsible for storing a part of the data in the target table.
Optionally, the apparatus further includes:
A deleting module, configured to, after a predetermined loading tool of the HBase cluster is called to load data of the target HFile file into the target table of the HBase cluster:
Sending a first deleting instruction to an HDFS to delete the target ORC file stored in the HDFS;
and/or,
And sending a second deleting instruction to the HBase cluster so as to delete the HFile files stored in the HBase cluster.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
And the processor is used for realizing any one of the data synchronization methods when executing the programs stored in the memory.
In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, which when executed by a processor implements any of the above described data synchronization methods.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the data synchronization methods described above.
The embodiment of the application has the beneficial effects that:
The embodiment of the application provides a data synchronization method. By using the ORC file format, the method can correctly handle escape characters and separators in the data, effectively avoiding the data parsing errors caused by separator conflicts in traditional text formats and ensuring the structural integrity of the data; converting the ORC file into HFile, the native storage format of HBase, further improves the suitability of the data synchronization method for the HBase cluster. Therefore, the scheme of the application can ensure both the accuracy and the integrity of data synchronization when data are synchronized in batches. In addition, in the scheme of the application, the generation of the ORC file and the generation of the target HFile file are the processes that consume the most computing resources, but neither process needs the computing resources of the HBase cluster; therefore, when synchronizing large batches of data into the HBase cluster, the scheme of the application can reduce the influence on the HBase cluster and maintain its stability and availability.
Of course, it is not necessary for any one product or method of practicing the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and other drawings may be obtained by those skilled in the art from these drawings.
FIG. 1 is a flowchart of a data synchronization method according to an embodiment of the present application;
FIG. 2 is a flowchart of another data synchronization method according to an embodiment of the present application;
FIG. 3 is a flowchart of another data synchronization method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a data synchronization device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by persons skilled in the art based on the embodiments of the present application fall within the scope of protection of the present application.
The following describes the terminology involved in the present application:
Hadoop cluster: hadoop clusters are a distributed system of multiple computers that work cooperatively to store and process large-scale data sets. Hadoop clusters are based on an Apache Hadoop software framework, and core components of the Hadoop clusters comprise HDFS (Hadoop distributed file system) and MapReduce (Hadoop distributed computing framework).
HDFS is a reliable and highly scalable file system designed to store large data sets and provide methods for data access and processing. HDFS divides data into blocks and stores each block on different nodes in the cluster, so as to realize redundant backup and fault tolerance of the data. At the same time, HDFS provides a high degree of scalability, as new nodes can easily be added to extend storage capacity.
MapReduce is a programming model and software framework for processing and analyzing large data sets. In a Hadoop cluster, each node processes tasks on its local machine, and the results are finally combined into a complete result set. Such a distributed computing framework can process large-scale data sets and provides powerful computing capability for analyzing and processing them. In a Hadoop cluster, a MapReduce job is executed as several subtasks, which mainly fall into two classes: map tasks and reduce tasks.
HBase is a Hadoop-based distributed, column-oriented NoSQL database that uses the Hadoop Distributed File System (HDFS) as its underlying storage. An HBase cluster is a deployment mode of HBase that can utilize multiple nodes to provide high availability and scalability.
HFile is the file format used in HBase clusters for storing the actual data.
The ORC (Optimized Row Columnar) file format is a file format for big data processing, particularly suitable for storing and processing large-scale data sets in the Hadoop ecosystem. The ORC file format is designed for efficient storage and querying in big data environments: it supports columnar storage and can efficiently read and query the data of a single column. The ORC file format also supports a variety of complex data types, such as arrays, maps and enumerations, which allows users to store richer data in ORC files.
The following describes a data synchronization method provided by the embodiment of the application. The data synchronization method provided by the embodiment of the application is applied to a data synchronization tool; in a specific application, the data synchronization tool can be deployed in an electronic device, wherein the electronic device can be a terminal device, a server and the like, and the terminal device can be a tablet computer, a desktop computer, a mobile phone and the like; the application is not limited in this regard.
The data synchronization method provided by the embodiment of the application can comprise the following steps:
Acquiring data to be synchronized from a target data source;
Writing the data to be synchronized into a designated directory in an HDFS of a hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
And calling a preset loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster.
The application provides a data synchronization method using the ORC file format and HFile files, which ensures the efficiency of the synchronization process while maintaining the accuracy and integrity of data synchronization. By using the ORC file format, the method can correctly handle escape characters and separators in the data, effectively avoiding the data parsing errors caused by separator conflicts in traditional text formats and ensuring the structural integrity of the data. Converting the ORC file into HFile, the native storage format of HBase, further improves the suitability for the HBase cluster: because HFile is the native internal storage format of the HBase cluster, the HBase cluster can efficiently synchronize the data in the target HFile file without additional data conversion. Therefore, the scheme of the application can ensure both the accuracy and the integrity of data synchronization when data are synchronized in batches.
In addition, in the scheme of the application, the generation of the ORC file and the generation of the target HFile file are the processes that consume the most computing resources, but neither process needs the computing resources of the HBase cluster; therefore, when synchronizing large batches of data into the HBase cluster, the scheme of the application can reduce the influence on the HBase cluster and maintain its stability and availability.
The following describes a data synchronization method provided by an embodiment of the present application with reference to the accompanying drawings.
As shown in fig. 1, a data synchronization method provided by an embodiment of the present application may include the following steps:
s101, acquiring data to be synchronized from a target data source;
In the process of data synchronization, the application can acquire the data to be synchronized from each target data source. Each target data source may be a relational database such as MySQL (a relational database management system), Oracle (a database management system developed by Oracle Corporation) or SQL Server (a relational database management system developed by Microsoft Corporation), a non-relational database such as MongoDB (a database based on distributed file storage) or Cassandra (an open-source distributed database system), or another type of data storage system; the present application does not limit the specific type of target data source. To read the data stored in each target data source, the connection information of each target data source can be determined, and the data synchronization tool can establish a connection with each target data source so as to acquire the data to be synchronized from it.
The data to be synchronized is the data in the target data source determined according to actual requirements. For example, data within a specified time period in the target data source may be used as the data to be synchronized, and data whose size is greater than a specified threshold may also be used as the data to be synchronized. The application does not limit the manner of determining the data to be synchronized.
The target data source, as well as the Hadoop cluster and HBase cluster in the subsequent embodiments, can be deployed according to existing mature schemes; the deployment scheme can be selected according to specific service requirements, and the application is not limited in this regard.
Optionally, the acquiring the data to be synchronized from the target data source includes:
Acquiring data to be synchronized from a target data source based on descriptive information about the target data source; the description information comprises a data source type, an address, a port, a user name, a password, a database to be synchronized and a data table to be synchronized.
The data source type, address, port, user name, password, database to be synchronized and data table to be synchronized may be pre-configured, so that the data synchronization tool can acquire the data to be synchronized from the target data source based on the information described above. In the present application, the data synchronization tool is any tool that can perform the data synchronization method of the present application; in particular, it may be an improvement or redesign of an existing tool, which is not limited in this application.
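The description information listed above might be captured in a connection descriptor along the following lines. All concrete values (host, port, names) are illustrative placeholders, not taken from the patent:

```python
# Hypothetical connection descriptor for a target data source; the keys mirror
# the description information listed above (type, address, port, user name,
# password, database and data table to be synchronized). Values are made up.
source_config = {
    "type": "mysql",
    "host": "10.0.0.12",
    "port": 3306,
    "user": "sync_user",
    "password": "******",
    "database": "trade_db",
    "table": "orders",
}

print(source_config["type"], source_config["database"])
```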
S102, writing the data to be synchronized into a designated directory in an HDFS of a hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
The specified directory may be a pre-specified directory in the HDFS for storing the target ORC file. In the present application, the data to be synchronized can come from different target data sources, so the data formats of the data acquired from different target data sources may differ. The ORC file format is a widely used file format in big data processing, so the data to be synchronized can be uniformly written into an ORC file for subsequent conversion into an HFile file. Owing to the high universality and excellent data-structure support of the ORC file format, the accuracy and integrity of the data can be effectively ensured after the data to be synchronized is written into an ORC file.
The data synchronization tool has the function of converting the data to be synchronized in each data format into an ORC file. The specific conversion process may include analyzing the data to be synchronized, processing the data to be synchronized based on the analysis result, and converting the processed data into the ORC file format. Processing the data to be synchronized based on the analysis result includes: deleting invalid data in the data to be synchronized, and handling missing values and/or abnormal values in the data to be synchronized, where the analysis result consists of the invalid data, missing values and abnormal values identified in the data to be synchronized.
The data in the ORC file converted by the scheme of the application can ensure the accuracy and the integrity of the data, and effectively avoid the data analysis error caused by separator conflict in the traditional text format.
In addition, scenarios may arise in which data must be synchronized from multiple data sources simultaneously. Therefore, in the process of writing the data to be synchronized into the specified directory in the HDFS of the hadoop cluster according to the ORC file format, the present application can support multiple threads that concurrently read data from multiple data sources and write the data into the specified directory in the HDFS of the hadoop cluster.
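The multi-threaded concurrent read described above can be sketched as follows, using Python's `ThreadPoolExecutor` as a stand-in for the tool's thread pool. The `read_source` helper is a hypothetical placeholder; the real tool would pull rows from each data source and then write them as ORC to HDFS.

```python
# Sketch of concurrently reading from multiple data sources.
# read_source is a hypothetical placeholder, not part of the application.
from concurrent.futures import ThreadPoolExecutor

def read_source(source_name):
    # In the real tool this would query the data source; here we return
    # a labeled batch to illustrate the flow.
    return [f"{source_name}-row-{i}" for i in range(3)]

def read_all_sources(sources, max_workers=4):
    # Each source is read on its own thread; the gathered batches would
    # then be written to the specified HDFS directory in ORC format.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(read_source, sources))
    return batches

batches = read_all_sources(["mysql_a", "oracle_b"])
```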
S103, calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
In an HBase cluster, data is stored in the form of tables, where the fields of each table may include a row key, a column family, a column qualifier, and a value, and each column family may contain multiple columns. The HFile format is the file format in which the HBase cluster stores actual data on the HDFS; an HFile file stores table data such as row keys, column families, column qualifiers, and values. The data in an HFile file is ordered by column family and row key.
Ordinarily, an HFile file is generated by the HBase cluster based on the data in a table in the HBase cluster. For example, after data is written into a table in the HBase cluster using put operations, the HBase cluster may generate an HFile file from the written data or append the written data to an existing HFile file. In contrast, the solution of the present application can utilize the file conversion tool to generate the target HFile file without consuming the computing resources of the HBase cluster.
The specific implementation manner of calling a preset file conversion tool to convert the target ORC file in the specified directory into the target HFile file will be described in detail in the following embodiments, which will not be repeated herein.
S104, calling a preset loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster.
The predetermined loading tool may be any bulkload tool (batch data loading tool) provided by the HBase cluster; exemplary bulkload tools are the existing LoadIncrementalHFiles and CompleteBulkLoad tools. Through the predetermined loading tool, the data of the target HFile file can be loaded into the target table of the HBase cluster. The process by which a bulkload tool loads an HFile file into a table of the HBase cluster operates on the underlying data of the HBase cluster and can bypass the conventional write path of the HBase cluster, so that a large amount of data can be imported more efficiently, thereby avoiding the heavy network I/O (Input/Output) and CPU (Central Processing Unit) costs incurred by inserting data row by row.
In a specific loading process, the predetermined loading tool may load the HFile format data file under the directory of the corresponding Region in the HBase cluster, and load the HFile file content into the target table.
In one implementation, the invoking the predetermined loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster includes:
Invoking a preset loading tool of the HBase cluster to determine at least one target area corresponding to each row of data in the target HFile file based on a partition rule of the target table, and for each target area, sending the target HFile file to a target server corresponding to the target area in the HBase cluster to enable the target server to load each row of data corresponding to the target area in the target HFile file into the target area;
the partition rule defines a correspondence between a row key and the target area, and each target partition is responsible for storing a part of data in the target table.
The data in HBase clusters is organized in the form of tables, and the data in the tables is stored in multiple Regions, i.e., the target areas described above. Each Region stores a portion of the data and defines its data range by a start row key and an end row key. These Regions are distributed on different nodes of the HBase cluster to achieve distributed storage and load balancing of data.
In the present application, each row of data in the target HFile file can likewise correspond to a different partition, so that each row of data in the target HFile file can be stored in a partition according to the above steps. The partition rule characterizes the correspondence between the row key and the target area. Specifically, the multiple partitions of the target table are determined in advance, and each partition can be assigned in advance a range of hash values of row-key values, so that the hash value of the row-key value of each row of data can be calculated and the partition corresponding to each row of data determined. For example, if the hash-value range of partition A is 0 to 100 and the hash value of the row key of row data 1 is 5, then the partition of row data 1 is partition A. The hash value is calculated in a conventional manner, which is not limited in the present application.
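The partition-rule lookup described above can be sketched as follows. Each partition owns a hash-value range for row keys, and a row's partition is the one whose range contains the hash of its row key. The partition names, ranges, and hash function (CRC32 modulo a bucket count) are illustrative assumptions, not HBase's actual region-split logic.

```python
# Minimal sketch of the partition rule: map a row key's hash value to
# the partition whose range contains it. Names and ranges are assumed.
import zlib

PARTITIONS = [              # (partition name, inclusive hash range)
    ("region-A", (0, 100)),
    ("region-B", (101, 200)),
    ("region-C", (201, 299)),
]

def row_key_hash(row_key, buckets=300):
    # Deterministic hash of the row-key value, reduced to the bucket space.
    return zlib.crc32(row_key.encode("utf-8")) % buckets

def partition_for(row_key):
    h = row_key_hash(row_key)
    for name, (lo, hi) in PARTITIONS:
        if lo <= h <= hi:
            return name
    raise ValueError(f"no partition covers hash {h}")
```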
Each partition may correspond to a partition server for managing the partition, and after determining each partition corresponding to the target HFile file, the target HFile file may be sent to a target server corresponding to each partition in the HBase cluster, so that the target server loads each row of data corresponding to the target area in the target HFile file into the target area.
After the target HFile file is loaded into the target table, the data in the loaded target HFile file is immediately available in the target table.
Optionally, after calling a predetermined loading tool of the HBase cluster to load data of the target HFile file into the target table of the HBase cluster, the method further includes:
sending a first deleting instruction to an HDFS to delete the target ORC file stored in the HDFS;
And/or the number of the groups of groups,
And sending a second deleting instruction to the HBase cluster so as to delete the HFile files stored in the HBase cluster.
After synchronizing the data to be synchronized into the target table, the target ORC file and the HFile file may be deleted, thereby freeing up storage space.
The embodiment of the application provides a data synchronization method, by using an ORC file format, the method can accurately process an escape symbol and a separator in data, effectively avoid data analysis errors caused by separator conflicts in a traditional text format, ensure the structural integrity of the data, and further improve the suitability of the data synchronization method for the HBase cluster by converting the ORC file into an original storage format HFile of HBase. Therefore, the scheme of the application can consider the accuracy and the integrity of data synchronization when the data are synchronized in batches. In addition, in the scheme of the application, the generation process of the ORC format file and the generation process of the target HFile file are processes with more consumption of computing resources, but the two processes do not need to use the computing resources of the HBase cluster, so that the scheme of the application can reduce the influence on the HBase cluster and keep the stability and the usability of the HBase cluster when the large-batch data is synchronized into the HBase cluster.
The following describes in detail the process of converting the target ORC file in the specified directory into a target HFile file according to the present application:
the method further includes, before the calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file:
Acquiring command line parameters input by a user; the parameter content of the command line parameter comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table and field information of the target ORC file; the target storage address characterizes the address of the target HFile file stored on the HDFS;
Since the file path of the target ORC file in the specified directory is known after the step of writing the data to be synchronized into the specified directory in the HDFS of the hadoop cluster according to the ORC file format is executed, the data synchronization tool can directly determine the target ORC file generated in the HDFS. Of course, the data synchronization tool can also parse the information returned to it by the HDFS, which includes the file path of the target ORC file, to obtain the file path.
The target table, i.e. the table to be used for synchronizing data in the HBase cluster according to the present application, is predetermined, and therefore the table name of the target table is also predetermined.
The field information of the target ORC file is specifically described in the later-described embodiments.
The calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file, including:
invoking a preset file conversion tool to analyze the command line parameters to obtain the parameter content, configuring a MapReduce job based on the parameter content, and submitting the MapReduce job to a hadoop cluster so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the appointed directory into a target HFile file;
The job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table and the field information of the target ORC file;
It will be appreciated that the file conversion tool communicates with the data synchronization tool, and that the data synchronization tool may communicate the command line parameters to the file conversion tool when a preset file conversion tool is invoked. The job parameters of the MapReduce job may be derived from the command line parameters.
The process of executing the MapReduce job by the preset data conversion module comprises the following steps:
reading the target ORC file from the specified directory of the HDFS based on the file path, and analyzing the read target ORC file to obtain each row of data in the target ORC file;
Determining target row data corresponding to each row of data in the target ORC file based on a target mapping relation, generating a target HFile file based on each item of target row data, and storing the target HFile file in the target storage address; the target mapping relation is a mapping relation of each field and each appointed field of the target ORC file determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and each field of the target ORC file are determined based on the field information of the target ORC file; each designated field is each field in the target table determined based on the table name; the data structure of the target table is determined based on the table name;
The target line data corresponding to each line of data is a line data which is obtained by mapping the field values of each field of the line of data to the corresponding designated field according to the target mapping relation and is used for writing the target table.
Specifically, the field information of the target ORC file may include field information corresponding to each column of data, for example, the field information of the first column of the target ORC file is name, the second column is age, and so on. It can be seen that the field information of the target ORC file can also be used to describe the data structure of the target ORC file.
The target mapping relationship is set based on the data structure of the target table and the data structure of the ORC file, so that after the field information in the target ORC file and the field information of the target table are obtained, the mapping relationship between each field of the target ORC file and each specified field can be determined according to the data structure of the ORC file and the data structure of the target table. For example, a set of fields in the ORC file may be mapped to a column family of the HBase cluster, and each field in the set may then be mapped to one or more column qualifiers under that column family. It will be appreciated that the table structure of each target table may be different, and the table structure of the target table is predefined, so the table name of the target table may be used to determine the target table and obtain its table structure, that is, its data structure. The field information included in the target ORC file is determined based on the fields of the target table and the fields in the target data source. For example, if the target data source includes fields such as name and age among others, and the target table needs only the name and age fields, then only the name and age fields are obtained from the target data source to generate the target ORC file, and the target ORC file includes the name and age fields. For the generated target ORC file, its field information can be parsed. It will be appreciated that, although the target ORC file and the target table both have name and age fields, because their data structures differ, the mapping relationship between the two must be determined; for example, if in the target table the name and age fields are column qualifiers of a column family user, the name and age fields in the target ORC file may be mapped to the name and age column qualifiers under the user column family.
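The name/age example above can be sketched in code as follows. The mapping table and the plain-tuple cell layout are Python stand-ins for the ORC row and the HBase cell structure; the column family name "user" comes from the example in the text, and everything else is an illustrative assumption.

```python
# Sketch of the target mapping relation: ORC fields -> (column family,
# column qualifier) cells of the target table. Structures are stand-ins.

TARGET_MAPPING = {
    # ORC field -> (column family, column qualifier) in the target table
    "name": ("user", "name"),
    "age":  ("user", "age"),
}

def map_row(orc_row):
    """Map one parsed ORC row to HBase-style (family, qualifier, value) cells."""
    cells = []
    for field, value in orc_row.items():
        family, qualifier = TARGET_MAPPING[field]
        cells.append((family, qualifier, value))
    return cells

cells = map_row({"name": "Alice", "age": 30})
```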
It will be appreciated that the data content contained in the target ORC file and the generated target HFile file are substantially the same, except that the data structures differ.
In a specific example, the file conversion tool may be a class customized by the scheme of the present application, which may be named ImportOrc class, and the class is responsible for configuring and starting a MapReduce job, so that the data files stored in the ORC format are imported into the HBase cluster in batches. It mainly completes the following tasks:
parsing command line parameters: and analyzing command line parameters provided by a user, wherein the command line parameters comprise an input path of an ORC file, a target table name of an HBase cluster and the like.
Configuring the MapReduce job: the input format of the job may be set to the ORC file format, a user-defined Mapper class, such as OrcToHFileMapper, may be specified (the Mapper class is a component of the MapReduce job), and the output format of the job may be set, for example, to HFileOutputFormat.
Setting operation parameters: the operation parameters are configured to enable the MapReduce operation to adapt to the mapping relation between an ORC file structure and the Schema of the target table (the Schema corresponds to the table structure and is mainly used for describing the logic structure of the HBase table, and the Schema can comprise fields such as row keys, column families, column qualifiers, values and the like), and specifically, the operation parameters are configured to enable the MapReduce operation to analyze and obtain row keys, column families, column qualifiers, values and the like matched with the table structure of the HBase table according to the mapping relation in each row of data in the ORC file.
Starting operation: the MapReduce job is submitted to the Hadoop cluster to process the ORC file and generate HFiles.
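The command-line parsing task above can be sketched with Python's `argparse` as a stand-in for the ImportOrc-style tool's parameter parsing. The flag names (`--input`, `--output`, `--table`, `--fields`) and example values are assumptions for illustration, not the tool's actual interface.

```python
# Illustrative sketch of parsing the command-line parameters: ORC input
# path, target HFile storage address, target table name, and the field
# information of the ORC file. Flag names are hypothetical.
import argparse

def parse_args(argv):
    p = argparse.ArgumentParser(description="ORC -> HFile conversion job")
    p.add_argument("--input", required=True, help="file path of the target ORC file")
    p.add_argument("--output", required=True, help="target HFile storage address on HDFS")
    p.add_argument("--table", required=True, help="table name of the target HBase table")
    p.add_argument("--fields", required=True, help="comma-separated ORC field info")
    return p.parse_args(argv)

args = parse_args([
    "--input", "/data/sync/orc",
    "--output", "/data/sync/hfile",
    "--table", "user_info",
    "--fields", "rowkey,name,age",
])
```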
In one implementation manner, the step of determining the target row data corresponding to each row of data in the target ORC file based on the target mapping relationship may specifically be to parse each row of data in the ORC file through the MapReduce job started by the ImportOrc class.
The generation of the target row data may specifically be: each field value in the target ORC file is mapped into a specified column family and column qualifier, and these mapped values are then assembled into target row data according to the HBase data structure. In the map function, the target row data may be represented by a Put instance. These Put instances may be formatted into the HFile format during the output phase of the MapReduce job.
The preset data conversion module may be a module configured to execute a MapReduce job in a computing framework of the hadoop cluster.
In a specific embodiment of the present application, the Mapper class included in the MapReduce job may specifically be named the OrcToHFileMapper class, which is a class customized by the present application. The OrcToHFileMapper class can read each record, i.e., each row of data, in the ORC file and convert each row of data into a Put operation suitable for HBase. The main workflow of the OrcToHFileMapper class is as follows:
Reading the ORC records: in the map method, each record is read using the ORC read API (Application Programming Interface) of Hadoop. The map method is the stage in which the Mapper class processes data.
Parsing the data and generating Put instances: each record in the ORC file, i.e., each row of data, is mapped to the row key, column family, column qualifiers and values of HBase according to the schema of the HBase table. For each record in the ORC file, a Put instance is created for the row data after it has been mapped to the row key, column family, column qualifiers and values of HBase.
Outputting the Put instances: the generated Put instances are taken as output, and each Put instance is converted into an HFile file.
It will be appreciated that each Put instance in the present application can serve as a carrier for a row of data: by reading each row of data in the ORC file, a Put instance is created that contains the row key, column family, column qualifiers, and values to be inserted into the HBase table. Specifically, since the output format of the MapReduce job may be set to HFileOutputFormat, and HFileOutputFormat can convert the Put instances output by the Mapper into a file in HFile format, the present application can convert each Put instance into a file in HFile format and thus generate the target HFile file. Converting Put instances into files in HFile format is only one implementation manner of the present application; the present application does not limit the specific representation of the target row data or the manner of generating the file in HFile format. For example, the target row data may directly be, for each row of data in the data to be synchronized, row data including the row key, column family, column qualifiers and values corresponding to the structure of the target table of the HBase cluster, and the target HFile file may be generated directly from each item of target row data.
Optionally, the job parameters of the MapReduce job may further include metadata information, where the metadata information includes the metadata of each row of data in the data to be synchronized, and the metadata includes: timestamp, attributes, visibility, and the like. The metadata information may be obtained by parsing the data to be synchronized obtained from the target data source.
In the process of generating the target line data, the metadata of each line data can be written into the target line data corresponding to the line data based on the metadata information. If the job parameter does not include metadata information, metadata corresponding to each target line data may be generated according to a preset manner, for example, a time stamp of each target line data may be a time when the target line data is loaded into the target table.
In the present application, the generation process of the target HFile file is executed in the computing framework of the hadoop cluster, so after the target HFile file is generated, the target HFile file may be stored in a designated path in the HDFS of the hadoop cluster, where the designated path is the target storage address in the above embodiment. After invoking a predetermined loading tool of the HBase cluster to load data of the target HFile file into the target table of the HBase cluster, the method may further comprise:
and sending a third deleting instruction to the HDFS to delete the target HFile file stored in the HDFS.
In this embodiment, the MapReduce job may include only a Map stage and no Reduce stage. The steps of reading the target ORC file from the specified directory of the HDFS based on the file path, parsing the read target ORC file to obtain each row of data in the target ORC file, and determining the target row data corresponding to each row of data in the target ORC file based on the target mapping relationship may be performed in the Map stage; the step of generating the target HFile file based on each item of target row data may be accomplished by directly setting the output format of the MapReduce job to HFileOutputFormat2.
The embodiment may start a MapReduce job through ImportOrc class based on the two custom classes ImportOrc and OrcToHFileMapper, and process an input ORC file according to OrcToHFileMapper defined logic by using computing resources of the Hadoop cluster, and generate a file in HFile format and store the file in a path specified on the HDFS.
In the map method of the MapReduce job of the present application, the read API of the ORC file may be called to parse each row of data in the ORC file, for example, using Hive (Hive is a data warehouse tool based on Hadoop, and may be used to perform data extraction, conversion, and loading).
The target HFile file generated according to the method of the embodiment may be one HFile file including all target line data, and in another implementation manner, the target line data corresponding to each line data may be divided according to the key value of the line key in the target line data, and the target line data corresponding to the set key value range is written into the same target HFile file.
The solution of this embodiment may obtain multiple target HFile files, and the key value range in this embodiment may be a hash value range of key values of row keys corresponding to each partition of the HBase cluster in the foregoing embodiment. Therefore, the data in each target HFile file generated according to the embodiment is the data corresponding to the same partition, and when the data of the target HFile file is subsequently loaded into the target table of the HBase cluster, the resource consumption of the HBase cluster can be reduced, and the speed of synchronizing the data of the HBase cluster can be increased. In addition, the method and the device can load the plurality of target HFile files into the HBase cluster in parallel, so that the efficiency of data synchronization can be improved.
In a specific implementation manner, the MapReduce job may be configured using the HFileOutputFormat2.configureIncrementalLoad method, so that each target HFile file output by the MapReduce job meets the above requirement.
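The per-partition grouping described above can be sketched as follows: target row data is bucketed by the hash range of each row key, so that every output file holds rows for a single region and the files can be loaded in parallel. The file names, hash ranges, and hash function are illustrative assumptions.

```python
# Sketch of splitting target row data into per-partition HFiles by the
# hash range of each row key. Ranges and names are assumed for illustration.
import zlib

RANGES = {"hfile-0": (0, 149), "hfile-1": (150, 299)}  # assumed key ranges

def bucket(row_key, buckets=300):
    # Same conventional hash idea as the partition rule above.
    return zlib.crc32(row_key.encode("utf-8")) % buckets

def split_rows(rows):
    """rows: iterable of (row_key, payload); returns file_name -> row list."""
    files = {name: [] for name in RANGES}
    for row_key, payload in rows:
        h = bucket(row_key)
        for name, (lo, hi) in RANGES.items():
            if lo <= h <= hi:
                files[name].append((row_key, payload))
                break
    return files

files = split_rows([("k1", "v1"), ("k2", "v2"), ("k3", "v3")])
```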
Optionally, in one implementation, each specified field includes a row key in the target table, and each other field in the target table except for the row key, where the row key is used to uniquely identify a row of data in the target table;
specifically, each specified field may include a row key, column family, column qualifier, value, and the like.
The writing the data to be synchronized into a specified directory in an HDFS of a hadoop cluster according to an ORC file format to generate a target ORC file in the HDFS includes:
Generating a key value of a row key corresponding to each row of data in the data to be synchronized based on a row key rule of the target table;
The target table has a predetermined rule for generating the row key of the data stored in the table. The data synchronization tool may acquire the row-key generation rule of the target table in advance, and after acquiring the data to be synchronized from the target data source, may generate the key value of the row key of each row of data according to that rule. For example, if the target table stores information such as user name, age and gender, and the rule specifies that the row key is constructed from the name and age fields, then the key value of the row key of each row of data may be constructed as name-age from the fields in that row. The present application does not limit the specific construction rules of the row keys corresponding to different target tables, but the most basic principle that a row-key construction rule must follow is uniqueness: the row key must uniquely identify a row of data.
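The name-age example above can be sketched as a small helper. The rule itself is table-specific; this construction mirrors only the example in the text and is not a general-purpose row-key scheme.

```python
# Sketch of the example row-key rule: key value constructed as "name-age".
# Uniqueness is the basic requirement; whether name-age is actually unique
# depends on the table's data, so this is illustrative only.

def make_row_key(row):
    return f"{row['name']}-{row['age']}"

keys = [make_row_key(r) for r in [
    {"name": "Alice", "age": 30, "gender": "F"},
    {"name": "Bob",   "age": 25, "gender": "M"},
]]
```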
And writing the key value of the row key corresponding to each row of data in the data to be synchronized and the data to be synchronized into a designated directory in an HDFS of the hadoop cluster in an ORC file format so as to generate a target ORC file in the HDFS.
The target ORC file includes each row of data in the data to be synchronized and the key value of the row key corresponding to each row of data; it can be understood that each row of data corresponds to the key value of its own row key. Specifically, the key value of the row key corresponding to each row of data may occupy a column in the ORC file, and each row of data in the data to be synchronized and the key value of its corresponding row key belong to the same row in the ORC file. For the target ORC file generated according to this implementation manner, the step of calling the preset file conversion tool to convert the target ORC file in the specified directory into the target HFile file may be executed directly, so as to convert the target ORC file into the target HFile file. The field information of the target ORC file generated in this implementation manner may include the field information of the row key; for example, the first column in the target ORC file may be the row key.
Optionally, in an implementation, each specified field includes each field of the target table except for a row key, and the row key is used to uniquely identify a row of data in the target table;
After the preset data conversion module parses the read target ORC file to obtain each row of data in the target ORC file, the method further includes:
Generating a key value of a row key corresponding to each row of data in the target ORC file based on a row key rule of the target table;
At this time, the specified fields may include column families, column qualifiers, values, and the like. The data in the target ORC file is the data to be synchronized, so each row of data in the target ORC file corresponds to a row of data in the data to be synchronized. Therefore, for each row of data parsed from the ORC file, the specific process of generating the key value of the row key corresponding to that row of data based on the row-key rule of the target table is similar to the above process of generating the key value of the row key corresponding to each row of data in the data to be synchronized based on the row-key rule of the target table; the difference is that the execution body of the row-key generation process in this embodiment is the data conversion module, which uses the row data parsed from the target ORC file, so the generation process of the key value of the row key is not repeated in this embodiment.
The determining, based on the target mapping relationship, the target row data corresponding to each row of data in the target ORC file includes:
Mapping field values of fields of each row of data in the target ORC file to corresponding appointed fields based on a target mapping relation to obtain a mapping result corresponding to each row of data;
in this embodiment, mapping field values of each field of each row of data in the target ORC file to corresponding specified fields is similar to the implementation manner of determining the target row of data corresponding to each row of data in the target ORC file based on the target mapping relationship described in the foregoing embodiment, and the difference is that in this embodiment, mapping of a row key field is not required when field mapping is performed, so that the mapping result corresponding to each row of data does not include a row key value.
And generating target row data corresponding to each row of data based on the mapping result corresponding to each row of data and the key value of the row key corresponding to each row of data.
The embodiment may combine the mapping result corresponding to each line of data and the key value of the line key corresponding to each line of data, thereby generating the target line data corresponding to each line of data.
In addition, the present application can monitor the execution state of the MapReduce job in real time to ensure that the MapReduce job completes successfully. If a problem occurs during the execution of the MapReduce job, prompt information can be displayed in a preset monitoring window so that staff are notified to debug and resolve the problem.
In the embodiment of the application, the process of converting the target ORC file in the specified directory into the target HFile file is completed on the HDFS, so that the influence on the HBase cluster can be reduced, and the stability and usability of the HBase cluster can be maintained; in addition, the scheme of the application can convert the ORC format data file stored in the HDFS into the HFile file, and can avoid the problems of escape character processing and field segmentation errors existing in ImportTsv tools in the prior art.
Fig. 2 is another flowchart of a data synchronization method according to the present application, and the data synchronization method is specifically described below with reference to fig. 2.
S201, data is read from a data source using a data synchronization tool.
S201 corresponds to the step S101, and will not be described in detail herein.
S202, writing data into a directory appointed by an HDFS file system by using a data synchronization tool, and generating a data file in an ORC format.
S202 corresponds to the step S102, and will not be described in detail herein.
And S203, converting the ORC format data file stored in the HDFS to generate an HFile file.
The HFile file in this step is the target HFile file, and S203 corresponds to the step S103, which is not described herein.
S204, using BulkLoad of the HBase cluster to load the HFile file into the HBase cluster.
S204 corresponds to the step S104, and BulkLoad is a predetermined loading tool of the HBase cluster, which is not described in detail herein.
S205, cleaning the temporary data file and the corresponding HFile file generated in the HDFS.
S205 may include: sending a first deleting instruction to the HDFS to delete the target ORC file stored in the HDFS; and/or sending a second deleting instruction to the HBase cluster to delete the corresponding HFile file stored in the HBase cluster.
The embodiment of the application provides a data synchronization method, by using an ORC file format, the method can accurately process an escape symbol and a separator in data, effectively avoid data analysis errors caused by separator conflicts in a traditional text format, ensure the structural integrity of the data, and further improve the suitability of the data synchronization method for the HBase cluster by converting the ORC file into an original storage format HFile of HBase. Therefore, the scheme of the application can consider the accuracy and the integrity of data synchronization when the data are synchronized in batches. In addition, in the scheme of the application, the generation process of the ORC format file and the generation process of the target HFile file are processes with more consumption of computing resources, but the two processes do not need to use the computing resources of the HBase cluster, so that the scheme of the application can reduce the influence on the HBase cluster and keep the stability and the usability of the HBase cluster when the large-batch data is synchronized into the HBase cluster.
Fig. 3 is another flowchart of a data synchronization method according to the present application, and the data synchronization method is specifically described below with reference to fig. 3.
S301, configuring a data synchronization task.
Before the synchronization task begins, the data source is first configured. The type, address, port, user name and password of the specified source database, the database to be synchronized and the data table to be synchronized can be configured; in addition, the target table into which data is to be written in the HBase cluster and the rowkeyColumn information of the target table can also be configured. The rowkeyColumn information of the target table may characterize the generation rule of the row key of the target table.
The functions of the data source type, address, port, user name, password, database to be synchronized and data table to be synchronized are the same as those described in the above embodiments, and will not be repeated here.
S302, writing an ORC data file into the HDFS.
The present embodiment may select DataX as the data synchronization tool. When executing the task of synchronizing data in the HBase cluster, the DataX can establish connection with a target data source, the Hadoop cluster and the HBase cluster, read data from the data source according to configuration, and write the read data into a directory specified in the HDFS file system.
When writing data into the HDFS, a rowkey field is generated according to the rowkeyColumn information of the configured target table, so as to adapt to the data model of the target table of the HBase cluster. In addition, when writing data into the HDFS, concurrent reading of data by multiple threads and writing to multiple files are supported.
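As an illustration, one common reading of such rowkeyColumn information is an ordered list of source columns whose values are joined into a row key; the following plain-Java sketch assumes that convention (the class and method names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RowKeyGenerator {
    // Illustrative reading of the rowkeyColumn information: an ordered list of
    // source column names whose values are joined with a delimiter to form the
    // rowkey field for each row written to the ORC file.
    static String generateRowKey(Map<String, String> row,
                                 List<String> rowkeyColumns,
                                 String delimiter) {
        return rowkeyColumns.stream()
                .map(row::get)
                .collect(Collectors.joining(delimiter));
    }
}
```

The actual generation rule (hashing, salting, fixed-width padding, etc.) depends on how the target table's row keys are designed.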
S302 corresponds to S102 described above, and will not be described in detail herein.
S303, converting the ORC file into an HFile file.
After generating the data file in the HDFS, DataX may convert the ORC format data file in the HDFS into an HFile file by calling a custom ImportOrc tool.
The functions of each part involved in generating the ORC format data file into the HFile file in this embodiment are as follows:
(1) ImportOrc classes
This class is responsible for configuring and starting a MapReduce job for bulk import of data files stored in ORC format into HBase clusters. It mainly completes the following tasks:
Parsing command line parameters: analyzing the command line parameters provided by the user, including the input path of the ORC file, the target table name of the HBase cluster, and the like.
Configuring MapReduce operation: the input format of the job is set to ORC file format, a custom Mapper class (such as OrcToHFileMapper) is specified, and the output format of the job is set to HFileOutputFormat.
Setting operation parameters: the job is configured to accommodate the mapping relationship of ORC file structure and target table schema of HBase cluster, including how row keys and column values are parsed from ORC file, etc.
Starting operation: submitting the MapReduce job to the Hadoop cluster, processing the ORC files and generating HFile files.
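The parameter-parsing step listed above might be sketched as follows, assuming a simple positional command-line convention (the class name and argument order are illustrative assumptions, not the tool's actual interface):

```java
class ImportOrcArgs {
    final String inputPath;   // path of the target ORC file in the specified directory
    final String tableName;   // table name of the target table in the HBase cluster
    final String outputPath;  // target storage address for the generated HFile on HDFS

    ImportOrcArgs(String inputPath, String tableName, String outputPath) {
        this.inputPath = inputPath;
        this.tableName = tableName;
        this.outputPath = outputPath;
    }

    // Parses the command line parameters, assuming the positional convention
    // <orc-input-path> <hbase-table-name> <hfile-output-path>.
    static ImportOrcArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "usage: ImportOrc <orc-input-path> <hbase-table-name> <hfile-output-path>");
        }
        return new ImportOrcArgs(args[0], args[1], args[2]);
    }
}
```

The parsed values would then be placed into the MapReduce job configuration so that the Mapper can read them.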
(2) OrcToHFileMapper classes
This Mapper class is responsible for reading each record in the ORC file and converting it into a Put instance suitable for the HBase cluster. The main workflow of this class is as follows:
Reading the ORC records: in the map method, each record is read using the ORC read API of Hadoop. Each record is one row of data in the ORC file.
Parsing the data and generating a Put instance: mapping the data in the ORC file to the row key, column family and column qualifier corresponding to the target table structure according to the ORC record and the schema of the target table of the HBase cluster; a Put instance is created for each record.
Outputting the Put instances: the generated Put instances are subsequently converted into HFiles as output.
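The conversion performed in the map method can be illustrated in plain Java without the HBase client library, modeling a Put as a row key plus a list of (column family, qualifier, value) cells; the single-column-family assumption and all names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class OrcRecordConverter {
    // Plain-Java stand-in for HBase's Put, used here so the conversion logic
    // can be shown without the HBase client library.
    static class CellEntry {
        final String family, qualifier, value;
        CellEntry(String family, String qualifier, String value) {
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    static class PutLike {
        final String rowKey;
        final List<CellEntry> cells = new ArrayList<>();
        PutLike(String rowKey) { this.rowKey = rowKey; }
    }

    // Converts one parsed ORC record into a PutLike, assuming a single column
    // family and that each non-row-key field maps to a qualifier of the same name.
    static PutLike toPut(Map<String, String> record, String rowKeyField, String family) {
        PutLike put = new PutLike(record.get(rowKeyField));
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!e.getKey().equals(rowKeyField)) {
                put.cells.add(new CellEntry(family, e.getKey(), e.getValue()));
            }
        }
        return put;
    }
}
```

In the real Mapper, the equivalent of PutLike would be an org.apache.hadoop.hbase.client.Put emitted to the HFileOutputFormat.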
Based on the two custom classes, the MapReduce job is started through the ImportOrc classes, the input ORC file is processed according to logic defined by OrcToHFileMapper by using the computing resources of the Hadoop cluster, and the file in the HFile format is generated and stored in a path appointed on the HDFS.
S303 corresponds to S103, and will not be described in detail herein.
S304, loading the HFile file into the HBase cluster.
The generated HFile file is loaded into the target table of the HBase cluster by using the bulkload tool (such as the LoadIncrementalHFiles tool) provided by the HBase cluster, thereby completing the batch import of the data. DataX continues to execute the corresponding commands to transfer the HFile file into the HBase cluster by using the bulkload function of the HBase cluster:
the tool distributes and loads the files by triggering a MapReduce job:
(1) MapReduce task execution:
The Mapper stage: each Mapper is responsible for a set of HFile files. The Mapper determines to which Region of the HBase cluster the data contained in these files should be loaded by comparing the row keys in the HFile with the Region division of the target table of the HBase cluster.
Data distribution: the Mapper sends the HFile file to the correct Region server.
(2) Data loading:
Region Server processing: after receiving the HFile file, the Region Server of the HBase cluster directly moves or copies the HFile file to its final storage position. Since HFile is the internal storage format of the HBase cluster, this process is very efficient and requires no additional data conversion.
(3) Integration into the HBase cluster:
Metadata update: after the data loading is completed, the HBase cluster updates its metadata, including the Region size, the row key range and the like, so as to include the newly imported data.
Availability of data: once the loading process is complete, the data is immediately available in the target table of the HBase cluster.
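The row-key comparison performed in the Mapper stage can be sketched in plain Java as a lookup over the sorted Region start keys; the class name and the empty-string convention for the first Region's start key are illustrative assumptions:

```java
import java.util.List;

class RegionIndexLookup {
    // Given the sorted start keys of a table's Regions (the first Region's
    // start key is taken to be the empty string), returns the index of the
    // Region whose key range contains the row key. This mirrors the comparison
    // used when deciding to which Region each HFile's rows belong.
    static int regionIndexFor(String rowKey, List<String> sortedStartKeys) {
        int idx = 0;
        for (int i = 0; i < sortedStartKeys.size(); i++) {
            if (rowKey.compareTo(sortedStartKeys.get(i)) >= 0) {
                idx = i;      // row key is at or beyond this Region's start key
            } else {
                break;        // start keys are sorted, so later Regions cannot match
            }
        }
        return idx;
    }
}
```

A production loader would compare byte arrays rather than strings, since HBase row keys are raw bytes.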
S304 corresponds to S104 described above, and will not be described in detail herein.
S305, cleaning the temporary file.
Deleting the temporary files in the HDFS: after the HFile file is successfully loaded into the target table of the HBase cluster, the generated temporary ORC data file and HFile file are cleaned up and the storage resources are released, completing the whole data synchronization task.
The method combines the various data source options in the DataX tool, and data in most of the data sources on the market can be synchronized to the HBase cluster through simple configuration work, so that the HBase cluster can acquire data more conveniently and effectively.
The application provides a data synchronization method. The data to be synchronized read from the data source is written into a directory specified by the HDFS file system, generating a data file in the ORC format. The efficient columnar storage capability of the ORC format significantly improves the writing and processing efficiency of the data. The ORC format data file stored in the HDFS is then converted into an HFile file by a custom ImportOrc tool. The method not only utilizes the efficient data format conversion capability of the HBase cluster, but also avoids the problems of escape character processing and field segmentation errors existing in the ImportTsv tool. Finally, the bulkload function of the HBase cluster is used to efficiently load the generated HFile file into the HBase cluster, completing the data synchronization.
The method solves the problems that existing approaches cannot provide high-throughput data writing performance in large-scale data scenarios, resulting in untimely data synchronization, and that the ImportTsv method mis-parses separators appearing in field content. It improves the synchronization efficiency when synchronizing data to the HBase cluster, reduces the time required by data synchronization tasks and reduces the consumption of resources. Meanwhile, the method combines the various data source options in the DataX tool, and data in most of the data sources on the market can be synchronized to the HBase cluster through simple configuration work, so that the HBase cluster can acquire data more conveniently and effectively.
The embodiment of the application also provides a data synchronization device, which is applied to a data synchronization tool, as shown in fig. 4, and comprises:
a first obtaining module 401, configured to obtain data to be synchronized from a target data source;
A writing module 402, configured to write the data to be synchronized into a specified directory in an HDFS of a hadoop cluster according to an ORC file format, so as to generate a target ORC file in the HDFS;
A conversion module 403, configured to call a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
and a loading module 404, configured to call a predetermined loading tool of the HBase cluster, so as to load the data of the target HFile file into the target table of the HBase cluster.
According to the data synchronization device provided by the embodiment of the application, through the use of the ORC file format, the escape symbol and the separator in the data can be accurately processed, the data analysis error caused by the separator conflict in the traditional text format is effectively avoided, the structural integrity of the data is ensured, the suitability of the data synchronization device for the HBase cluster is further improved by converting the ORC file into the native storage format HFile of the HBase, and the adaptation degree of the HFile file serving as the native internal storage format in the HBase cluster to the HBase cluster is high, so that the HBase cluster can efficiently synchronize the data in the target HFile without additional data conversion. Therefore, the scheme of the application can consider the accuracy and the integrity of data synchronization when the data are synchronized in batches. In addition, in the scheme of the application, the generation process of the ORC format file and the generation process of the target HFile file are processes with more consumption of computing resources, but the two processes do not need to use the computing resources of the HBase cluster, so that the scheme of the application can reduce the influence on the HBase cluster and keep the stability and the usability of the HBase cluster when the large-batch data is synchronized into the HBase cluster.
Optionally, the apparatus further includes:
The second obtaining module is used for obtaining command line parameters input by a user before calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the parameter content of the command line parameter comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table and field information of the target ORC file; the target storage address characterizes the address of the target HFile file stored on the HDFS;
The conversion module comprises:
The conversion unit is used for calling a preset file conversion tool to analyze the command line parameters, obtaining the parameter content, configuring a MapReduce job based on the parameter content, and submitting the MapReduce job to a hadoop cluster so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the appointed directory into a target HFile file;
The job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table and the field information of the target ORC file; the process of executing the MapReduce job by the preset data conversion module comprises the following steps:
reading the target ORC file from the specified directory of the HDFS based on the file path, and analyzing the read target ORC file to obtain each row of data in the target ORC file;
Determining target row data corresponding to each row of data in the target ORC file based on a target mapping relation, generating a target HFile file based on each item of target row data, and storing the target HFile file in the target storage address; the target mapping relation is a mapping relation of each field and each appointed field of the target ORC file determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and each field of the target ORC file are determined based on the field information of the target ORC file; each designated field is each field in the target table determined based on the table name; the data structure of the target table is determined based on the table name;
The target line data corresponding to each line of data is a line data which is obtained by mapping the field values of each field of the line of data to the corresponding designated field according to the target mapping relation and is used for writing the target table.
Optionally, each designated field includes a row key in the target table, and each field in the target table except for the row key is used for uniquely identifying a row of data in the target table;
The write module includes:
the generating unit is used for generating a key value of a row key corresponding to each row of data in the data to be synchronized based on a row key rule of the target table;
And the writing unit is used for writing the key value of the row key corresponding to each row of data and the data to be synchronized into a designated directory in the HDFS of the hadoop cluster in an ORC file format so as to generate a target ORC file in the HDFS.
Optionally, each specified field includes each field of the target table except for a row key, and the row key is used for uniquely identifying the row data in the target table;
After the preset data conversion module analyzes the read target ORC file to obtain each row of data in the target ORC file, the process further includes:
generating a key value of a row key corresponding to each row of data in the target ORC file based on a row key rule of the target table;
The determining, based on the target mapping relationship, the target row data corresponding to each row of data in the target ORC file includes:
Mapping field values of fields of each row of data in the target ORC file to corresponding appointed fields based on a target mapping relation to obtain a mapping result corresponding to each row of data;
and generating target row data corresponding to each row of data based on the mapping result corresponding to each row of data and the key value of the row key corresponding to each row of data.
Optionally, the loading module includes:
The sending unit is used for calling a preset loading tool of the HBase cluster to determine at least one target area corresponding to each row of data in the target HFile file based on the partition rule of the target table, and sending the target HFile file to a target server corresponding to the target area in the HBase cluster for each target area so that the target server loads each row of data corresponding to the target area in the target HFile file into the target area;
the partition rule defines a correspondence between a row key and the target area, and each target partition is responsible for storing a part of data in the target table.
Optionally, the apparatus further includes:
A deleting module, configured to, after a predetermined loading tool of the HBase cluster is called to load the data of the target HFile file into the target table of the HBase cluster,
send a first deleting instruction to an HDFS to delete the target ORC file stored in the HDFS;
and/or,
And sending a second deleting instruction to the HBase cluster so as to delete the HFile files stored in the HBase cluster.
The embodiment of the application also provides an electronic device, as shown in fig. 5, which comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504,
A memory 503 for storing a computer program;
The processor 501 is configured to implement any one of the above-described data synchronization methods when executing the program stored in the memory 503.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a Random Access Memory (RAM) or may include a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; or may be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the data synchronization methods described above.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the data synchronization methods of the above embodiments.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of data synchronization, for use with a data synchronization tool, the method comprising:
Acquiring data to be synchronized from a target data source;
Writing the data to be synchronized into a designated directory in an HDFS of a hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
and calling a preset loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster.
2. The method of claim 1, wherein before invoking a preset file conversion tool to convert the target ORC file in the specified directory to a target HFile file, the method further comprises:
Acquiring command line parameters input by a user; the parameter content of the command line parameter comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table and field information of the target ORC file; the target storage address characterizes the address of the target HFile file stored on the HDFS;
the calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file, including:
invoking a preset file conversion tool to analyze the command line parameters to obtain the parameter content, configuring a MapReduce job based on the parameter content, and submitting the MapReduce job to a hadoop cluster so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the appointed directory into a target HFile file;
The job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table and the field information of the target ORC file;
the process of executing the MapReduce job by the preset data conversion module comprises the following steps:
reading the target ORC file from the specified directory of the HDFS based on the file path, and analyzing the read target ORC file to obtain each row of data in the target ORC file;
Determining target row data corresponding to each row of data in the target ORC file based on a target mapping relation, generating a target HFile file based on each item of target row data, and storing the target HFile file in the target storage address; the target mapping relation is a mapping relation of each field and each appointed field of the target ORC file determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and each field of the target ORC file are determined based on the field information of the target ORC file; each designated field is each field in the target table determined based on the table name; the data structure of the target table is determined based on the table name;
The target line data corresponding to each line of data is a line data which is obtained by mapping the field values of each field of the line of data to the corresponding designated field according to the target mapping relation and is used for writing the target table.
3. The method of claim 1, wherein each designated field includes a row key in the target table, and wherein each field in the target table other than the row key is used to uniquely identify a row of data in the target table;
The writing the data to be synchronized into a specified directory in an HDFS of a hadoop cluster according to an ORC file format to generate a target ORC file in the HDFS includes:
Generating a key value of a row key corresponding to each row of data in the data to be synchronized based on a row key rule of the target table;
And writing the key value of the row key corresponding to each row of data and the data to be synchronized into a designated directory in an HDFS of the hadoop cluster in an ORC file format so as to generate a target ORC file in the HDFS.
4. The method of claim 2, wherein each designated field comprises a respective field of the target table other than a row key, the row key to uniquely identify a row of data in the target table;
After the preset data conversion module analyzes the read target ORC file to obtain each row of data in the target ORC file, the process further includes:
generating a key value of a row key corresponding to each row of data in the target ORC file based on a row key rule of the target table;
The determining, based on the target mapping relationship, the target row data corresponding to each row of data in the target ORC file includes:
Mapping field values of fields of each row of data in the target ORC file to corresponding appointed fields based on a target mapping relation to obtain a mapping result corresponding to each row of data;
and generating target row data corresponding to each row of data based on the mapping result corresponding to each row of data and the key value of the row key corresponding to each row of data.
5. The method of claim 1, wherein the invoking the predetermined loading tool of the HBase cluster to load the data of the target HFile file into the target table of the HBase cluster comprises:
Invoking a preset loading tool of the HBase cluster to determine at least one target area corresponding to each row of data in the target HFile file based on a partition rule of the target table, and for each target area, sending the target HFile file to a target server corresponding to the target area in the HBase cluster to enable the target server to load each row of data corresponding to the target area in the target HFile file into the target area;
the partition rule defines a correspondence between a row key and the target area, and each target partition is responsible for storing a part of data in the target table.
6. The method according to claim 1, wherein after invoking a predetermined loading tool of the HBase cluster to load data of the target HFile file into the target table of the HBase cluster, the method further comprises:
sending a first deleting instruction to an HDFS to delete the target ORC file stored in the HDFS;
and/or,
And sending a second deleting instruction to the HBase cluster so as to delete the HFile files stored in the HBase cluster.
7. A data synchronization device for use with a data synchronization tool, the device comprising:
the first acquisition module is used for acquiring data to be synchronized from a target data source;
The writing module is used for writing the data to be synchronized into a designated directory in an HDFS of the hadoop cluster according to an ORC file format so as to generate a target ORC file in the HDFS;
the conversion module is used for calling a preset file conversion tool to convert the target ORC file in the specified directory into a target HFile file; the target HFile file stores the data to be synchronized, and fields included in the target HFile file are the same as fields in a target table, wherein the target table is a predetermined data table for synchronizing the data to be synchronized in an HBase cluster;
And the loading module is used for calling a preset loading tool of the HBase cluster so as to load the data of the target HFile file into the target table of the HBase cluster.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a second acquisition module, configured to acquire command-line parameters input by a user before the preset file conversion tool is invoked to convert the target ORC file in the specified directory into the target HFile file, wherein the parameter content of the command-line parameters comprises a file path of the target ORC file in the specified directory, a target storage address, a table name of the target table, and field information of the target ORC file, the target storage address characterizing the address at which the target HFile file is stored on the HDFS; and
the conversion module comprises:
a conversion unit, configured to invoke the preset file conversion tool to parse the command-line parameters to obtain the parameter content, configure a MapReduce job based on the parameter content, and submit the MapReduce job to the hadoop cluster, so that a preset data conversion module in the hadoop cluster executes the MapReduce job to convert the target ORC file in the specified directory into the target HFile file;
wherein the job parameters of the MapReduce job comprise the file path, the target storage address, the table name of the target table, and the field information of the target ORC file;
the process of executing the MapReduce job by the preset data conversion module comprises:
reading the target ORC file from the specified directory of the HDFS based on the file path, and parsing the read target ORC file to obtain each row of data in the target ORC file; and
determining, based on a target mapping relationship, target row data corresponding to each row of data in the target ORC file, generating the target HFile file based on the items of target row data, and storing the target HFile file at the target storage address, wherein the target mapping relationship is a mapping between each field of the target ORC file and each specified field, determined according to the data structure of the target ORC file and the data structure of the target table; the data structure and the fields of the target ORC file are determined based on the field information of the target ORC file; each specified field is a field in the target table determined based on the table name; the data structure of the target table is determined based on the table name; and the target row data corresponding to a row of data is the row data, to be written into the target table, obtained by mapping the field value of each field of that row of data to the corresponding specified field according to the target mapping relationship.
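The command-line interface and the target mapping relationship of claim 8 can be sketched in Python. The flag names, the comma-separated field-info format, and the hard-coded target-table schema are assumptions for illustration only; in the claimed system the target-table fields would be resolved from HBase metadata via the table name:

```python
import argparse

# Parse the four parameter contents named in claim 8.
parser = argparse.ArgumentParser()
parser.add_argument("--orc-path")       # file path of the target ORC file
parser.add_argument("--hfile-output")   # target storage address on HDFS
parser.add_argument("--table")          # table name of the target table
parser.add_argument("--fields")         # field information of the ORC file

args = parser.parse_args([
    "--orc-path", "/sync/orc/part-0000.orc",
    "--hfile-output", "/sync/hfile/target_table",
    "--table", "target_table",
    "--fields", "id,name",
])

orc_fields = args.fields.split(",")
# Assumed target-table fields; really looked up by args.table.
table_fields = ["cf:id", "cf:name"]

# The target mapping relationship: ORC field -> specified (table) field.
target_mapping = dict(zip(orc_fields, table_fields))

row = {"id": 1, "name": "a"}            # one parsed ORC row
target_row = {target_mapping[f]: row[f] for f in orc_fields}
print(target_row)  # {'cf:id': 1, 'cf:name': 'a'}
```

In the claimed design this mapping runs inside a MapReduce job, so each mapper applies `target_mapping` to its shard of ORC rows and emits HFile-ready target row data in parallel.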
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement the method of any one of claims 1-6 when executing the program stored in the memory.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202410660626.9A 2024-05-27 2024-05-27 Data synchronization method and device, electronic equipment and storage medium Pending CN118394846A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410660626.9A CN118394846A (en) 2024-05-27 2024-05-27 Data synchronization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410660626.9A CN118394846A (en) 2024-05-27 2024-05-27 Data synchronization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118394846A 2024-07-26

Family

ID=91988928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410660626.9A Pending CN118394846A (en) 2024-05-27 2024-05-27 Data synchronization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118394846A (en)

Similar Documents

Publication Publication Date Title
CN110032604B (en) Data storage device, translation device and database access method
CN111324610A (en) Data synchronization method and device
CN110795455A (en) Dependency relationship analysis method, electronic device, computer device and readable storage medium
CN110162512B (en) Log retrieval method, device and storage medium
WO2020215689A1 (en) Query method and apparatus for column-oriented files
CN111324606B (en) Data slicing method and device
CN113297320A (en) Distributed database system and data processing method
CN107870949B (en) Data analysis job dependency relationship generation method and system
CN111818175B (en) Enterprise service bus configuration file generation method, device, equipment and storage medium
CN113177090A (en) Data processing method and device
CN112214505A (en) Data synchronization method and device, computer readable storage medium and electronic equipment
CN113722277A (en) Data import method, device, service platform and storage medium
CN113568924A (en) Data processing method and device, electronic equipment and storage medium
US11500874B2 (en) Systems and methods for linking metric data to resources
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN116501700B (en) APP formatted file offline storage method, device, equipment and storage medium
CN112835932B (en) Batch processing method and device for business table and nonvolatile storage medium
CN118394846A (en) Data synchronization method and device, electronic equipment and storage medium
CN116049142A (en) Data processing method, device, electronic equipment and storage medium
CN114547206A (en) Data synchronization method and data synchronization system
CN109902067B (en) File processing method and device, storage medium and computer equipment
CN114116723A (en) Snapshot processing method and device and electronic equipment
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
CN113849482A (en) Data migration method and device and electronic equipment
CN112231292B (en) File processing method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination