CN106528865A - Quick and accurate cleaning method of traffic big data

Info

Publication number: CN106528865A
Application number: CN201611094160.2A
Authority: CN
Inventors: 张鹏飞; 赵凯; 梁婷婷; 陶斯琴; 侯俊巍
Original assignee: Casic Wisdom Industrial Development Co Ltd
Current assignee: Casic Wisdom Industrial Development Co Ltd
Priority date: 2016-12-02
Filing date: 2016-12-02
Publication date: 2017-03-22

Abstract

The invention discloses a fast and accurate method for cleaning traffic big data and relates to the technical field of traffic data processing. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted: Kafka provides data caching, data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules. Optimizing the comparison algorithm improves both the performance of the program and the accuracy of the cleaning results. The method achieves fast and accurate processing of the RFID, snapshot and other data generated during urban traffic monitoring and management, thereby enabling the processing of traffic data resources and safeguarding the storage and utilization of traffic big data resources.

Description

A fast and accurate method for cleaning traffic big data

Technical field

The present invention relates to the technical field of traffic data processing, and in particular to a fast and accurate method for cleaning traffic big data.

Background art

With the development of urban construction and the rise in consumption levels, the automobile has become an indispensable tool in daily life, and processing the huge volume of traffic data it generates has become an urgent problem. To achieve fast, real-time traffic monitoring and predictive analysis, and to support the analysis and querying of historical traffic data, traffic data from different sources must be cleaned and filtered, abnormal data must be extracted for manual handling, the processed results must be stored separately using appropriate storage methods, and data access interfaces must be provided, so that traffic data can be analyzed and queried in real time.

At present, real-time data is cleaned as follows: the received RFID vehicle-passing data and snapshot data streams are handed directly to Spark Streaming, which performs vehicle trajectory cleaning, vehicle flow statistics and anomaly extraction according to the cleaning rules. Offline data is cleaned using the Spark programming model: the RFID vehicle-passing data and the snapshot data are joined according to the cleaning rules and the valid fields are extracted, so that vehicle trajectories are extracted, the vehicle flow at each collection point is counted, and abnormal data is separated out for manual handling.

This approach has the following problems. For real-time data cleaning, because the data collected by the RFID devices and snapshot devices is transmitted to Spark Streaming in real time, a Spark Streaming task, once submitted, must keep waiting until all the data collected in the time period has been received before it can proceed to the next step, which severely reduces the operating efficiency of the big data platform. For offline data processing, because the data volume is huge, joining records by key often causes high memory pressure and slow processing, which degrades the performance of the program.

Summary of the invention

The object of the invention is to provide a fast and accurate method for cleaning traffic big data, so as to solve the aforementioned problems in the prior art.

To achieve this object, the invention adopts the following technical solution:

A fast and accurate method for cleaning traffic big data, comprising a method for processing real-time data and a method for processing historical data;

In the method for processing real-time data, Spark Streaming stream processing is adopted for real-time RFID and snapshot data: data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules;

In the method for processing historical data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules.

Preferably, continuously extracting data from Kafka according to a time window specifically means that RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each acquisition cumulatively obtains the data within the set time period.

Preferably, in the method for processing real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically includes vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

The vehicle trajectory cleaning is implemented in the following steps:

A1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields license plate number, time, collection point name and collection direction;

A2, using the comparison functions provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and snapshot data according to the comparison rules, to obtain the trajectory record of each vehicle at the collection point, i.e. the vehicle trajectory cleaning result;

A3, storing the vehicle trajectory cleaning result in HBase, where HBase is split into multiple different regions and the reversed license plate number and time string is used as the key.

Preferably, the vehicle flow statistics are implemented in the following steps:

B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;

B2, according to the principle of distributed big data processing in Spark Streaming, counting the data records that have the same key, and then summing the counts of each collection point at the set time interval, to obtain the vehicle flow record of each collection point in the corresponding time period;

B3, storing the vehicle flow of each collection point in an in-memory database.
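
As an illustration only, steps B1 to B3 might be sketched as follows in plain Python rather than Spark Streaming. The record layout, the field name `point` and the dictionary `flow_store` (standing in for the in-memory database) are assumptions made for this sketch and do not come from the patent; in Spark, the same counting would typically be expressed as a map to key-value pairs followed by a reduce by key.

```python
from collections import Counter
from typing import Dict, Iterable


def count_batch(records: Iterable[Dict[str, str]]) -> Counter:
    """B1 + B2 (first half): key each record by collection point and count per key."""
    return Counter(rec["point"] for rec in records)


def sum_batches(batch_counts: Iterable[Counter]) -> Counter:
    """B2 (second half): sum the per-batch counts of every collection point."""
    total: Counter = Counter()
    for counts in batch_counts:
        total.update(counts)
    return total


# Two hypothetical batches of RFID vehicle-passing records.
batch_1 = [{"point": "P01"}, {"point": "P01"}, {"point": "P02"}]
batch_2 = [{"point": "P01"}, {"point": "P03"}]

# B3: a plain dict stands in for the in-memory database (assumption).
flow_store: Dict[str, int] = dict(
    sum_batches(count_batch(batch) for batch in (batch_1, batch_2))
)
```

Keeping the per-batch counts separate from the summation mirrors the two-stage description in B2: the per-key count can run on each batch as it arrives, and only the small per-point totals need to be combined afterwards.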

Preferably, the abnormal data extraction is implemented in the following steps:

C1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields license plate number, time, collection point name and collection direction;

C2, filtering the RFID vehicle-passing data and the snapshot data separately according to the decision rules for abnormal data, to extract the abnormal data;

C3, storing the abnormal data in a relational database.

Preferably, in the method for processing historical data, comparing, counting and exception-handling the data according to the data cleaning rules specifically comprises vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

D1, joining the RFID vehicle-passing data and the video snapshot data on the four fields license plate number, time, collection point name and direction;

D2, reversing the license plate number and time strings, and filtering the data using the license plate color and transit time fields, to obtain the vehicle trajectory data;

D3, storing the vehicle trajectory data in HBase, using the reversed license plate number and time string as the key.

Preferably, in the vehicle trajectory cleaning process, the RFID data, the snapshot data and the device information table are first each encapsulated as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two data RDDs are then each converted into an RDD of key-value pairs to make the comparison and join operations easier, where the key is the string composed of the fields to be compared; finally, the RDDs of the two kinds of data are compared and joined by key, and the data are filtered according to rules such as time completeness, number plate color consistency and field completeness, to obtain the correct trajectory data.

E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field and the time string truncated to the hour;

E2, according to the principle of distributed big data processing in Spark, counting the data records that have the same key, to obtain the vehicle flow record of each collection point in the corresponding time period;

E3, storing the vehicle flow results of each collection point in a relational database.

Preferably, the types of abnormal data include: incomplete data fields, missing data and inconsistent data information.

F1, joining the RFID vehicle-passing data and the snapshot data on the four fields license plate number, collection point name, collection direction and time;

F2, according to the anomaly types, first determining whether the RFID data is missing; if the RFID data exists, determining whether the color field is present in the RFID data and whether the snapshot data is complete; if the fields are complete, determining whether the number plate color in the RFID data is consistent with that in the snapshot data; finally, storing the extracted abnormal data in a MySQL database and labeling the anomaly type.
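
The decision order in step F2 can be illustrated with a minimal sketch in plain Python. The field names `color` and `image`, the label strings, and the choice to treat a missing snapshot record as missing data are assumptions for this sketch; the patent states only the order of the checks and that the results go to a MySQL database, which is represented here by a list of rows.

```python
from typing import Dict, List, Optional, Tuple

Rec = Dict[str, str]
Pair = Tuple[Optional[Rec], Optional[Rec]]  # (RFID record, snapshot record)


def classify(rfid: Optional[Rec], snap: Optional[Rec]) -> Optional[str]:
    """Return an anomaly label for one joined pair, or None if the pair is clean."""
    # 1. Missing data: one side of the join has no record (assumed to cover both sides).
    if rfid is None or snap is None:
        return "data_missing"
    # 2. Incomplete fields: no color in the RFID record or no picture in the snapshot.
    if not rfid.get("color") or not snap.get("image"):
        return "field_incomplete"
    # 3. Inconsistent information: plate colors disagree between the two sources.
    if rfid["color"] != snap.get("color"):
        return "inconsistent"
    return None


def extract_anomalies(pairs: List[Pair]) -> List[Tuple[str, Optional[Rec], Optional[Rec]]]:
    """Collect labeled abnormal pairs; the returned rows stand in for database inserts."""
    rows = []
    for rfid, snap in pairs:
        label = classify(rfid, snap)
        if label is not None:
            rows.append((label, rfid, snap))
    return rows


pairs: List[Pair] = [
    ({"color": "blue"}, {"color": "blue", "image": "1.jpg"}),    # clean
    (None, {"color": "blue", "image": "2.jpg"}),                 # RFID record missing
    ({"color": ""}, {"color": "blue", "image": "3.jpg"}),        # color field empty
    ({"color": "blue"}, {"color": "blue", "image": ""}),         # no snapshot picture
    ({"color": "blue"}, {"color": "yellow", "image": "5.jpg"}),  # colors disagree
]
labels = [row[0] for row in extract_anomalies(pairs)]
```

Because the checks run in a fixed order and return at the first match, each abnormal pair receives exactly one label, which keeps the stored anomaly type unambiguous for the reviewers.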

The beneficial effects of the invention are as follows. In the fast and accurate method for cleaning traffic big data provided by the embodiments of the invention, Spark Streaming stream processing is adopted for real-time RFID and snapshot data: Kafka provides data caching, data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both the performance of the program and the accuracy of the cleaning results. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and utilization of traffic big data resources.

Description of the drawings

Fig. 1 is a schematic diagram of the real-time data cleaning process;

Fig. 2 is a schematic diagram of the offline historical data cleaning process;

Fig. 3 is a schematic diagram of the RDD dependencies of the vehicle trajectory cleaning module.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The embodiments of the invention provide a fast and accurate method for cleaning traffic big data, comprising a method for processing real-time data and a method for processing historical data;

In the above method, real-time data cleaning targets the data obtained in real time by the RFID devices and camera devices, while historical data cleaning targets the accumulated historical RFID vehicle-passing data and snapshot data. The two kinds of data have different characteristics: the former is relatively small in volume but must be processed with low latency; the latter is huge in volume and has no real-time requirement, but the cleaning of the massive data must be completed efficiently and accurately.

The method provided by the embodiments uses a big data platform customized on the basis of Hadoop to provide distributed data processing and storage. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted: data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules.

The processing of real-time data is shown in Fig. 1.

Offline data cleaning relies on the Spark programming model, uses Spark on YARN as the platform on which the program runs, and implements the fast and accurate traffic big data cleaning process through distributed programming.

The specific process of cleaning offline data with Spark is shown in Fig. 2.

In the above method, Spark Streaming stream processing is adopted for real-time RFID and snapshot data: Kafka provides data caching, data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both the performance of the program and the accuracy of the cleaning results. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and utilization of traffic big data resources.

In the embodiments of the invention, continuously extracting data from Kafka according to a time window specifically means that RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each acquisition cumulatively obtains the data within the set time period.

In the embodiments of the invention, the time interval may be 5 minutes and the time window may be 10 minutes.
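
To make the relationship between the 5-minute interval and the 10-minute window concrete, the following is a minimal sketch in plain Python, not the patent's Spark Streaming code. The record shape (event time in seconds plus a payload) and the function name `sliding_windows` are assumptions for illustration; in Spark Streaming the equivalent behavior is normally obtained with a windowed operation that takes a window length and a slide interval.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # (event time in seconds, payload); hypothetical shape


def sliding_windows(
    records: Iterable[Record],
    start: int,
    end: int,
    interval_s: int = 300,  # 5-minute extraction interval from the text
    window_s: int = 600,    # 10-minute time window from the text
) -> Iterator[Tuple[int, List[Record]]]:
    """Yield (batch_time, records with batch_time - window_s < time <= batch_time)."""
    ordered = sorted(records)
    batch_time = start + interval_s
    while batch_time <= end:
        in_window = [r for r in ordered if batch_time - window_s < r[0] <= batch_time]
        yield batch_time, in_window
        batch_time += interval_s


# Hypothetical events spread over 15 minutes.
events = [(60, "a"), (290, "b"), (310, "c"), (650, "d"), (899, "e")]
batches = list(sliding_windows(events, start=0, end=900))
```

With a window twice as long as the interval, each record is seen by two consecutive batches, so a record that arrives late in one interval can still be matched against records from the neighboring interval.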

In a preferred embodiment of the invention, in the method for processing real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically includes vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

In the above method, reversing the license plate number and time strings lowers the probability that the compared fields share a common prefix, which greatly reduces the number of comparisons needed between each pair of strings and thus improves the efficiency of the comparison.

Because the volume of vehicle trajectory data is very large and efficient queries over massive data must be supported, in the embodiments of the invention the vehicle trajectory cleaning results are stored in HBase. To improve the query efficiency of the cleaning results, HBase is split into 1000 different regions, and the results are stored using the reversed license plate number and time string as the key.
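
The effect of reversing the key can be illustrated with a short sketch in plain Python. Two points are assumptions rather than statements from the patent: the key is taken to be the reversal of the concatenated plate number and time string, and keys are assigned to the 1000 regions by a CRC32 hash modulo the region count, since the text does not say how keys map to regions.

```python
import os
import zlib

NUM_REGIONS = 1000  # region count given in the text


def row_key(plate: str, time_str: str) -> str:
    """Reverse the concatenated plate number and time string (assumed key layout)."""
    return (plate + time_str)[::-1]


def region_of(key: str, num_regions: int = NUM_REGIONS) -> int:
    """Hypothetical placement rule: CRC32 of the key modulo the region count."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix, i.e. characters compared before a mismatch."""
    return len(os.path.commonprefix([a, b]))


# Two hypothetical records from the same city and the same minute.
plain = ["JA12345" + "20161202083015", "JA12399" + "20161202083047"]
keys = [row_key("JA12345", "20161202083015"), row_key("JA12399", "20161202083047")]
```

In this example the unreversed strings agree on their first five characters, whereas the reversed keys differ at the first character because the fast-changing seconds digits come first, so a comparison can stop immediately.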

In a preferred embodiment of the invention, the vehicle flow statistics may be implemented according to steps B1 to B3 above.

Vehicle flow statistics count, for each collection point, the number of vehicles passing through it within a time window. To allow fast reads, writes and queries, the vehicle flow of each collection point is stored in an in-memory database; a vehicle flow query then needs only a single accumulation in memory, which improves the real-time performance of the vehicle flow statistics.

In the present invention, the abnormal data may be extracted according to steps C1 and C2 above, followed by:

C3, storing the abnormal data in a relational database.

Because data may be missing or information may be inconsistent across data types, the abnormal data must be extracted for manual review. The extraction first joins the RFID vehicle-passing data and the video snapshot data on several common fields; the RFID vehicle-passing data and the snapshot data are then each filtered according to the decision rules for abnormal data, and the abnormal data are extracted. Because the volume of abnormal data is limited, a relational database can be used for storage.

In a preferred embodiment of the invention, in the method for processing historical data, comparing, counting and exception-handling the data according to the data cleaning rules specifically includes vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

In the above method, Spark first joins the RFID vehicle-passing data and the snapshot data with the basic information in the database, such as people, vehicles and collection points, to obtain the information that is useful for trajectory cleaning, flow statistics and anomaly extraction, so that the offline data cleaning process can be carried out.

The invention first reverses the license plate number and time strings to improve the efficiency of comparing the two types of data. Because the volume of vehicle trajectory data is very large and efficient queries over massive data must be supported, the invention stores the cleaned vehicle trajectories in HBase, using the reversed license plate number and time string as the key. To improve storage speed, the data import tool Loader may be used to import the trajectory data into HBase.

In the embodiments of the invention, in the vehicle trajectory cleaning process, the RFID data, the snapshot data and the device information table are first each encapsulated as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field. The two data RDDs are then each converted into an RDD of key-value pairs to make the comparison and join operations easier, where the key is the string composed of the fields to be compared. Finally, the RDDs of the two kinds of data are compared and joined by key, and the data are filtered according to rules such as time completeness, number plate color consistency and field completeness, to obtain the correct trajectory data.
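
The three stages described above (attach the direction by device IP, convert to key-value pairs, join by key and filter) can be sketched in plain Python lists instead of RDDs. All field names, the `|` key separator, the sample device table and the concrete validity rule are assumptions for illustration; in Spark the stages would typically map onto the RDD operations `map`, `join` and `filter`.

```python
from typing import Dict, List, Tuple

Rec = Dict[str, str]
KEY_FIELDS = ("plate", "time", "point", "direction")  # hypothetical field names


def with_direction(records: List[Rec], devices: Dict[str, str]) -> List[Rec]:
    """Attach the direction looked up by device IP; skip records from unknown devices."""
    return [{**r, "direction": devices[r["ip"]]} for r in records if r["ip"] in devices]


def to_pairs(records: List[Rec]) -> List[Tuple[str, Rec]]:
    """Key each record by the string built from the fields to be compared."""
    return [("|".join(r[f] for f in KEY_FIELDS), r) for r in records]


def join_pairs(left: List[Tuple[str, Rec]], right: List[Tuple[str, Rec]]):
    """Inner join of two key-value lists on equal keys."""
    index: Dict[str, List[Rec]] = {}
    for key, rec in right:
        index.setdefault(key, []).append(rec)
    return [(key, (lrec, rrec)) for key, lrec in left for rrec in index.get(key, [])]


def is_valid(rfid: Rec, snap: Rec) -> bool:
    """Example rules: required fields present and plate colors consistent."""
    required = ("plate", "time", "color")
    complete = all(rfid.get(f) and snap.get(f) for f in required)
    return complete and rfid["color"] == snap["color"]


devices = {"10.0.0.1": "N", "10.0.0.2": "N"}  # device IP -> direction (sample table)
rfid = [
    {"ip": "10.0.0.1", "plate": "JA12345", "time": "20161202083015", "point": "P01", "color": "blue"},
    {"ip": "10.0.0.1", "plate": "JB67890", "time": "20161202083100", "point": "P01", "color": "yellow"},
]
snaps = [
    {"ip": "10.0.0.2", "plate": "JA12345", "time": "20161202083015", "point": "P01", "color": "blue"},
    {"ip": "10.0.0.2", "plate": "JB67890", "time": "20161202083100", "point": "P01", "color": "blue"},
]

joined = join_pairs(to_pairs(with_direction(rfid, devices)), to_pairs(with_direction(snaps, devices)))
tracks = [(key, lrec) for key, (lrec, rrec) in joined if is_valid(lrec, rrec)]
```

Both sample vehicles join on the composite key, but only the first survives the filter, because the second has conflicting plate colors in the two sources and would instead be a candidate for abnormal data extraction.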

The RDD dependencies of the vehicle trajectory cleaning module are shown in Fig. 3.

In the embodiments of the invention, the vehicle flow statistics may be implemented according to steps E1 to E3 above.

Vehicle flow statistics fall into two kinds according to the type of statistic: full-volume statistics and per-type vehicle flow statistics. Full-volume statistics count the total vehicle flow through each collection point; per-type statistics subdivide the vehicles by type and count the flow of each vehicle type through a collection point. Because the numbers of collection points and vehicle types are both limited, the volume of statistical results is small and the results can be stored in a relational database.
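
The two kinds of statistics, combined with the hourly key of step E1, might be sketched as follows in plain Python. The `YYYYMMDDHHMMSS` time format, the field names and the sample vehicle types are assumptions for this sketch and are not specified in the patent.

```python
from collections import Counter
from typing import Dict, Iterable

Rec = Dict[str, str]


def hour_of(time_str: str) -> str:
    """Truncate an assumed 'YYYYMMDDHHMMSS' time string to hour precision."""
    return time_str[:10]


def total_flow(records: Iterable[Rec]) -> Counter:
    """Full-volume statistics: count per (collection point, hour)."""
    return Counter((r["point"], hour_of(r["time"])) for r in records)


def flow_by_type(records: Iterable[Rec]) -> Counter:
    """Per-type statistics: count per (collection point, hour, vehicle type)."""
    return Counter((r["point"], hour_of(r["time"]), r["vehicle_type"]) for r in records)


records = [
    {"point": "P01", "time": "20161202083015", "vehicle_type": "car"},
    {"point": "P01", "time": "20161202085900", "vehicle_type": "truck"},
    {"point": "P01", "time": "20161202090001", "vehicle_type": "car"},
    {"point": "P02", "time": "20161202083500", "vehicle_type": "car"},
]
totals = total_flow(records)
by_type = flow_by_type(records)
```

The result size is bounded by the number of collection points, hours and vehicle types rather than by the number of input records, which is consistent with the text's observation that the results are small enough for a relational database.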

In the present invention, the types of abnormal data include: incomplete data fields, missing data and inconsistent data information.

Because data may be missing, data fields may be incomplete, or information may be inconsistent across data types, abnormal RFID vehicle-passing data and snapshot data must be filtered out for these cases during processing and the anomaly type of the data identified, for manual review and correction, so as to ensure the integrity of the data and the correctness of the vehicle trajectory cleaning. Abnormal data mainly cover the following three cases:

(1) incomplete data fields

(2) missing data

(3) inconsistent data information.

In the embodiments of the invention, the abnormal data may be extracted according to steps F1 and F2 above.

Given the characteristics of the RFID and snapshot data, incomplete data fields mainly cover two cases: an inconsistent number plate color and a missing snapshot picture. To improve the efficiency of anomaly handling, the three types of abnormal data can be merged and processed together according to the above method.

After the abnormal data have been extracted, they are distinguished by anomaly type and data type and presented to the reviewers through an interface. The reviewers supplement and repair the data for each type of anomaly, and the data are then sent back to the data cleaning system for processing, which improves the accuracy of the vehicle trajectory cleaning.

By adopting the above technical solution disclosed by the invention, the following beneficial effects are obtained. In the fast and accurate method for cleaning traffic big data provided by the embodiments of the invention, Spark Streaming stream processing is adopted for real-time RFID and snapshot data: Kafka provides data caching, data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules, and optimizing the comparison algorithm improves both the performance of the program and the accuracy of the cleaning results. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and utilization of traffic big data resources.

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar the embodiments may be referred to one another.

Those skilled in the art should understand that the order of the method steps provided in the above embodiments may be adapted to actual circumstances, and that the steps may also be performed concurrently where circumstances allow.

All or some of the steps in the methods of the above embodiments may be completed by a program instructing the related hardware. The program may be stored in a storage medium readable by a computer device and is used to perform all or some of the steps of the methods of the above embodiments. The computer device is, for example, a personal computer, a server, a network device, a smart mobile terminal, a smart home device, a wearable smart device or an in-vehicle smart device; the storage medium is, for example, RAM, ROM, a magnetic disk, a magnetic tape, an optical disc, flash memory, a USB flash drive, a portable hard drive, a memory card, a memory stick, network server storage or network cloud storage.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not expressly listed, or elements that are inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A fast and accurate method for cleaning traffic big data, characterized in that it comprises a method for processing real-time data and a method for processing historical data;

In the method for processing real-time data, Spark Streaming stream processing is adopted for real-time RFID and snapshot data: data are continuously extracted from Kafka according to a time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules;

In the method for processing historical data, Spark in-memory processing is adopted: data are read from HDFS and are compared, counted and exception-handled according to the data cleaning rules.

2. The fast and accurate method for cleaning traffic big data according to claim 1, characterized in that continuously extracting data from Kafka according to a time window specifically means that RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each acquisition cumulatively obtains the data within the set time period.

3. The fast and accurate method for cleaning traffic big data according to claim 2, characterized in that, in the method for processing real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically includes vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

A1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields license plate number, time, collection point name and collection direction;

A2, using the comparison functions provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and snapshot data according to the comparison rules, to obtain the trajectory record of each vehicle at the collection point, i.e. the vehicle trajectory cleaning result;

A3, storing the vehicle trajectory cleaning result in HBase, where HBase is split into multiple different regions and the reversed license plate number and time string is used as the key.

4. The fast and accurate method for cleaning traffic big data according to claim 3, characterized in that the vehicle flow statistics are implemented in the following steps:

B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;

B2, according to the principle of distributed big data processing in Spark Streaming, counting the data records that have the same key, and then summing the counts of each collection point at the set time interval, to obtain the vehicle flow record of each collection point in the corresponding time period;

5. The fast and accurate method for cleaning traffic big data according to claim 3, characterized in that the abnormal data extraction is implemented in the following steps:

C1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields license plate number, time, collection point name and collection direction;

C2, filtering the RFID vehicle-passing data and the snapshot data separately according to the decision rules for abnormal data, to extract the abnormal data;

C3, storing the abnormal data in a relational database.

6. The fast and accurate method for cleaning traffic big data according to claim 1, characterized in that, in the method for processing historical data, comparing, counting and exception-handling the data according to the data cleaning rules specifically comprises vehicle trajectory cleaning, vehicle flow statistics and abnormal data extraction;

D1, joining the RFID vehicle-passing data and the video snapshot data on the four fields license plate number, time, collection point name and direction;

D2, reversing the license plate number and time strings, and filtering the data using the license plate color and transit time fields, to obtain the vehicle trajectory data;

D3, storing the vehicle trajectory data in HBase, using the reversed license plate number and time string as the key.

7. The fast and accurate method for cleaning traffic big data according to claim 6, characterized in that, in the vehicle trajectory cleaning process, the RFID data, the snapshot data and the device information table are first each encapsulated as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two data RDDs are then each converted into an RDD of key-value pairs to make the comparison and join operations easier, where the key is the string composed of the fields to be compared; finally, the RDDs of the two kinds of data are compared and joined by key, and the data are filtered according to rules such as time completeness, number plate color consistency and field completeness, to obtain the correct trajectory data.

8. The fast and accurate method for cleaning traffic big data according to claim 6, characterized in that the vehicle flow statistics are implemented in the following steps:

E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field and the time string truncated to the hour;

E2, according to the principle of distributed big data processing in Spark, counting the data records that have the same key, to obtain the vehicle flow record of each collection point in the corresponding time period;

9. The fast and accurate method for cleaning traffic big data according to claim 6, characterized in that the types of abnormal data include: incomplete data fields, missing data and inconsistent data information.

10. The fast and accurate method for cleaning traffic big data according to claim 9, characterized in that the abnormal data extraction is implemented in the following steps:

F1, joining the RFID vehicle-passing data and the snapshot data on the four fields license plate number, collection point name, collection direction and time;

F2, according to the anomaly types, first determining whether the RFID data is missing; if the RFID data exists, determining whether the color field is present in the RFID data and whether the snapshot data is complete; if the fields are complete, determining whether the number plate color in the RFID data is consistent with that in the snapshot data; finally, storing the extracted abnormal data in a MySQL database and labeling the anomaly type.