Disclosure of Invention
In view of the above, the embodiments of the present invention provide a data repairing method, an electronic device, and a storage medium, so as to solve the problem of low efficiency of repairing incomplete data based on an elastic search cluster.
In a first aspect of an embodiment of the present invention, there is provided a method comprising:
transmitting the incomplete information data acquired by the Flume to a Kafka message queue, and repairing the incomplete information through SparkStreaming;
storing the repaired incomplete information into an elastic search cluster, and storing the unrepaired incomplete information into an HDFS cluster;
repairing and extracting unrepaired incomplete information in the HDFS cluster through a graph calculation association technology, and storing the repaired and extracted incomplete information into a Redis database.
In a second aspect of the embodiment of the present invention, there is provided an electronic device, including:
the first repair module is used for sending the incomplete information data acquired by the Flume to a Kafka message queue, and performing repair processing on the incomplete information through sparkStreaming;
the storage module is used for storing the repaired incomplete information into the elastic search cluster and storing the unrepaired incomplete information into the HDFS cluster;
the second repair module is used for repairing and extracting unrepaired incomplete information in the HDFS cluster through a graph calculation association technology, and storing the repaired and extracted incomplete information to the Redis database.
In a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect of the embodiments of the present invention when the computer program is executed by the processor.
In a fourth aspect of the embodiments of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method provided by the first aspect of the embodiments of the present invention.
In the embodiment of the invention, incomplete information acquired by the Flume is sent to the Kafka message queue, the incomplete information is repaired by sparkStreaming, the repaired data is stored in the elastic search cluster, the unrepaired data is stored in the HDFS cluster, the unrepaired data is repaired by a graph calculation correlation technology and then stored in the Redis database, thereby solving the problem of low repair efficiency of incomplete user behavior data, effectively improving the data repair efficiency, ensuring the real-time processing of incomplete element data of the user identity, and facilitating the confirmation of the user identity according to the complete element data.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings, and it is apparent that the embodiments described below are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The term "comprising" in the description of the invention or in the claims and in the above-mentioned figures and other similar meaning expressions is meant to cover a non-exclusive inclusion, such as a process, method or system, apparatus comprising a series of steps or elements, without limitation to the steps or elements listed.
Aiming at massive network access data, there may be a problem that data elements are incomplete, for example, a field value corresponding to a certain element is lost, and the identity of a user (identity of terminal equipment) can be influenced for the missing data. Missing part data can be filled through data restoration, so that the service end can conveniently identify and analyze the user identity according to the restored data.
Referring to fig. 1, a flow chart of a data repairing method according to an embodiment of the present invention includes:
s101, transmitting incomplete information data acquired by Flume to a Kafka message queue, and repairing the incomplete information through sparkStreaming;
the Flume can be used for collecting mass logs in a distributed mode, and data are collected by customizing various data transmitters. After the user side sends the network access data to the servers, log information in each server can be collected based on the jump. The Kafka message queue can be used for processing massive logs to realize separation of message production and consumption. The massive logs collected by the server end have user identity information such as IP, port numbers, operation behaviors and the like, and user access information or identity information contains incomplete information, namely incomplete identity information or access information. The incomplete information repair task is generated through the Kafka message queue, and the SparkStreaming can process the repair task. The SparkStreaming is a stream processing framework, and can carry out batch processing on incomplete messages and repair elements filling the incomplete messages.
Optionally, filling and repairing incomplete items in the incomplete information according to the existing complete information of the Redis database. The repaired complete information and the captured complete information are stored in the Redis database, and the complete information item is extracted and filled with incomplete information after comparison according to the key value relation between the current incomplete information and the complete information.
Preferably, the repair of incomplete information based on Redis includes implementing repair and off-line repair, performing field filling on the incomplete information according to data in Redis in real time, and off-line repair uses a specific field to be repaired as a key and a corresponding field as a value according to acquired complete identity information data to form a key value associated data string, thereby improving repair efficiency and reducing memory occupation.
S102, storing the repaired incomplete information into an elastic search cluster, and storing the unrepaired incomplete information into an HDFS cluster;
the data processed by SparkStreaming can be divided into two parts: unrepaired incomplete information and repaired incomplete information. SparkStreaming is a high-throughput streaming system for processing real-time data streams, and can process various data sources (such as flash and Kafk) like maps and Reduce, and store the processed data sources in an external file system or database. The elastic search cluster and the HDFS cluster are both external file storage systems, wherein the elastic search cluster provides a distributed real-time full text search engine, and the HDFS cluster is a high-fault-tolerance and high-throughput distributed file system.
Taking a piece of incomplete data in a data stream as an example, acquiring a specific field in the incomplete data and taking a field containing a data value as a key of a Redis to acquire a corresponding value, taking the corresponding value to fill the piece of data, storing the filled data into an elastic search cluster, and if the corresponding value is not acquired, storing the piece of data into a data file of an HDFS.
And the repaired data is stored in the elastic search cluster, so that the data can be conveniently acquired and analyzed, and the unrepaired data is stored in the HDFS cluster, so that the offline repair is convenient.
S103, repairing and extracting unrepaired incomplete information in the HDFS cluster through a graph calculation association technology, and storing the repaired and extracted incomplete information into a Redis database.
The graph calculation association technology is to solve a specific problem through a data model of vertexes and edges based on association relations among objects. Solving unrepaired incomplete information in the HDFS cluster through a graph calculation association technology, specifically, acquiring a history complete entry in the HDFS by using an association repair mode, repairing the current incomplete information, storing a corresponding relation of key values in a Redis database after repairing, and storing repaired data in the Redis database in an elastic search cluster.
The repaired data in the HDFS cluster may be stored in the elastic search cluster, and the unrepaired data is added to the original data file.
Optionally, historical data in the HDFS cluster is obtained based on association restoration, available information in the historical data is extracted, the incomplete information is restored according to a corresponding relation between the available information and the incomplete information, and the key value relation is stored in a Redis database.
Preferably, according to the incomplete information after repair in the elastic search cluster and the incomplete information after repair and extraction in the HDFS cluster, the user operation behavior is obtained, and the identity of the terminal user is analyzed and identified.
In another embodiment of the present invention, as shown in fig. 2, a process of repairing incomplete information based on graph-based computing correlation technique is shown:
specifically, in S201, historical incomplete data, i.e., source data, is obtained from the HDFS, and the first field at the beginning of the source data is assumed to be an ID, and the second, third, and fourth fields are assumed to be specific identity fields, which may be denoted as a, B, C, where the four fields are denoted by 'N', i.e., repair filling is required. In S202, vertex data and Edge data of the graph are constructed according to the source data, the vertex data is in the form of (vertexId, vd), the Edge data is in the form of Edge (srcId, dstId, 0), the vertex data vertexId is obtained according to the hash value of the non-incomplete value field of the specific identity field of each line of the original data, vd is information corresponding to the non-incomplete value field, and the Edge data srcId and dstId are composed of hash values of two non-incomplete value fields of three specific identity fields. In S203, vertex data (vertexId, minVertexId) of the connected graph is obtained based on the vertex data and the edge data, wherein vertexId is a vertexId of a certain vertex of the connected graph, and minVertexId is a vertexId of a minimum vertex connected with the vertex in the connected graph. After integrating the original vertex data and the connected graph vertex data, the data form is expressed as (minVertexId, attr), wherein attr is the minimum vertex corresponding attribute. The integrated data is filled with incomplete data in the form of data (minvertemid, three specific field values) by a reduced bykey in S204. The complete item data form extracted by filtering is data with three specific fields, namely, the data meeting the requirement that three specific fields exist simultaneously is extracted, and the data with the data form (hashId, three specific fields) is obtained after conversion, wherein hashId represents the hash value of the specific fields. And in S205, storing the converted data into a Redis cluster, performing join integration and deduplication with vertex data to obtain repaired data (namely, the data of all specific identity fields can be repaired in source data), and finally establishing difference integration according to the repaired data ID (first field) and the source data ID (first field) to obtain unrepaired data (namely, the data of the incomplete specific field cannot be repaired).
According to the method provided by the embodiment, aiming at the problem of incomplete captured user information elements, incomplete identity information data is repaired and integrated in real time according to the preset rule, incomplete user identity information items are filled, the integrity of the data is guaranteed, real-time storage is realized, the terminal user can be identified and analyzed based on the complete user identity information, and meanwhile, the figure can be accurately represented.
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present invention,
fig. 2 is a schematic structural diagram of an electronic device for data repair according to an embodiment of the present invention, where the electronic device includes:
the first repair module 210 is configured to send the incomplete information data acquired by the jume to a Kafka message queue, and repair the incomplete information through SparkStreaming;
optionally, the repairing the incomplete information by SparkStreaming specifically includes:
filling and repairing incomplete items in the incomplete information according to the existing complete information of the Redis database.
A storage module 220, configured to store repair-completed incomplete information into the elastic search cluster, and store unrepaired incomplete information into the HDFS cluster;
the second repair module 230 is configured to repair and extract unrepaired incomplete information in the HDFS cluster by using a graph computation association technique, and store the repaired and extracted incomplete information to the Redis database.
Optionally, the repairing and extracting the unrepaired incomplete information in the HDFS cluster by using the graph computing correlation technology includes:
and acquiring historical data in the HDFS cluster based on the association restoration, extracting available information in the historical data, restoring the incomplete information according to the corresponding relation between the available information and the incomplete information, and storing the key value relation into a Redis database.
Optionally, the second repair module further includes:
the analysis module is used for acquiring user operation behaviors and analyzing and identifying the identity of the terminal user according to the incomplete information after the repair in the elastic search cluster and the incomplete information after the repair and extraction in the HDFS cluster.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps in implementing the method of the above embodiment may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the program includes steps S101 to S103 when executed, where the storage medium includes: ROM/RAM, magnetic disks, optical disks, etc.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.