CN106528865A - Quick and accurate cleaning method of traffic big data - Google Patents
- Publication number
- CN106528865A (application CN201611094160.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- rfid
- time
- vehicle
- track
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Traffic Control Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a fast and accurate cleaning method for traffic big data and relates to the technical field of traffic data processing. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted with Kafka providing data caching: data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted: data is read from HDFS, and comparison, statistics and exception handling are performed according to the data cleaning rules; by optimizing the comparison algorithm, program performance and the accuracy of the cleaning results are improved. The method achieves fast and accurate processing of the RFID, snapshot and other data generated during urban traffic monitoring and management, thereby enabling the processing of traffic data resources and safeguarding the storage and use of traffic big data resources.
Description
Technical field
The present invention relates to the technical field of traffic data processing, and in particular to a fast and accurate traffic big data cleaning method.
Background
With the development of urban construction and the rise in people's level of consumption, the automobile has become an indispensable tool in daily life, and processing the huge volume of traffic data it generates has become a problem in urgent need of a solution. To achieve fast, real-time traffic monitoring and predictive analysis, and to support the analysis and querying of historical traffic data, the traffic data from different sources must be cleaned and filtered, abnormal data must be extracted for manual handling, the processing results must each be stored in an appropriate way, and data access interfaces must be provided so that traffic data can be analyzed and queried in real time.
At present, real-time data is cleaned as follows: the directly received RFID vehicle-passing data and snapshot data streams are handed to Spark Streaming, which performs vehicle trajectory cleaning, passing-vehicle flow statistics and anomaly extraction according to the cleaning rules. Offline data is cleaned with the Spark programming model: the RFID vehicle-passing data and the snapshot data are joined according to the cleaning rules and the valid fields are extracted, so that vehicle trajectories are obtained, the passing-vehicle flow at each collection point is counted, and abnormal data is separated out for manual handling.
This approach has the following problems. For real-time cleaning, because the data collected by the RFID devices and the snapshot devices is transmitted to Spark Streaming in real time, a Spark Streaming task, once submitted, must keep waiting until all data collected in the time period has been received before it can proceed to the next step, which severely reduces the operating efficiency of the big data platform. For offline processing, because the data volume is huge, matching and joining by key often causes high memory pressure and slow processing, which degrades program performance.
Summary of the invention
The object of the present invention is to provide a fast and accurate traffic big data cleaning method that solves the foregoing problems in the prior art.
To achieve this object, the present invention adopts the following technical solution:
A fast and accurate traffic big data cleaning method, comprising a processing method for real-time data and a processing method for historical data;
The processing method for real-time data is as follows: for real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules;
The processing method for historical data is as follows: Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and checked for exceptions according to the data cleaning rules.
Preferably, continuously extracting data from Kafka by time window specifically means: RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each extraction cumulatively obtains the data within a set time period.
Preferably, in the processing method for real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically includes the cleaning of vehicle trajectories, the counting of passing-vehicle flow and the extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
A1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields of license plate number, time, collection point name and collection direction;
A2, using the comparison functions provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and snapshot data according to the comparison rules, to obtain the trajectory record of the vehicle as it passes the collection point, that is, the vehicle trajectory cleaning result;
A3, storing the vehicle trajectory cleaning result in HBase, wherein HBase is divided into multiple different regions and the result is stored with the reversed string of the license plate number and time strings as the key.
Preferably, the counting of passing-vehicle flow is implemented in the following steps:
B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;
B2, following the principle of distributed big data processing in Spark Streaming, counting the data records that share the same key, and then summing the counts of each collection point over the set time interval, to obtain the passing-vehicle flow record of each collection point in the corresponding time period;
B3, storing the passing-vehicle flow of each collection point in an in-memory database.
Preferably, the extraction of abnormal data is implemented in the following steps:
C1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields of license plate number, time, collection point name and collection direction;
C2, filtering the RFID vehicle-passing data and the snapshot data separately according to the decision rules for abnormal data, to extract the abnormal data;
C3, storing the abnormal data in a relational database.
Preferably, in the processing method for historical data, comparing, counting and exception-checking the data according to the data cleaning rules specifically comprises the cleaning of vehicle trajectories, the counting of passing-vehicle flow and the extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
D1, joining the RFID vehicle-passing data and the video snapshot data on the four fields of license plate number, time, collection point name and direction;
D2, reversing the license plate number and time strings, and filtering the data using the license plate color and passing-time fields, to obtain the vehicle trajectory data;
D3, storing the vehicle trajectory data in HBase with the reversed string of the license plate number and time strings as the key.
Preferably, in the vehicle trajectory cleaning process, the RFID data, the snapshot data and the device information table are first each encapsulated as a corresponding RDD and joined on the IP address of the device, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two classes of data RDD are then each converted into an RDD of key-value pairs to facilitate the compare-and-join operation, where the key is the string composed of the fields to be compared; finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered using rules such as time completeness, license plate color consistency and field completeness, to obtain correct trajectory data.
Preferably, the counting of passing-vehicle flow is implemented in the following steps:
E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field together with the time string truncated to the hour;
E2, following the principle of distributed big data processing in Spark, counting the data records that share the same key, to obtain the passing-vehicle flow record of each collection point in the corresponding time period;
E3, storing the passing-vehicle flow results of each collection point in a relational database.
Preferably, the types of abnormal data include: incomplete data fields, missing data and inconsistent data information.
Preferably, the extraction of abnormal data is implemented in the following steps:
F1, joining the RFID vehicle-passing data and the snapshot data on the four fields of license plate number, collection point name, collection direction and time;
F2, according to the data anomaly types, first determining whether the RFID data is missing; if the RFID data exists, then determining whether the color field of the RFID data and the snapshot are present; if the fields are complete, determining whether the license plate color in the RFID data is consistent with that in the snapshot data; finally, storing the extracted abnormal data in a MySQL database and marking its anomaly type.
The beneficial effects of the present invention are as follows. In the fast and accurate traffic big data cleaning method provided by the embodiments of the present invention, for real-time RFID and snapshot data, Spark Streaming stream processing is used with Kafka providing data caching; data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is used; data is read from HDFS and compared, counted and checked for exceptions according to the data cleaning rules, and by optimizing the comparison algorithm, program performance and the accuracy of the cleaning results are improved. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
Brief description of the drawings
Fig. 1 is a schematic diagram of the real-time data cleaning process;
Fig. 2 is a schematic diagram of the offline historical data cleaning process;
Fig. 3 is a schematic diagram of the RDD dependencies in the vehicle trajectory cleaning module.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
An embodiment of the present invention provides a fast and accurate traffic big data cleaning method, comprising a processing method for real-time data and a processing method for historical data;
The processing method for real-time data is as follows: for real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules;
The processing method for historical data is as follows: Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and checked for exceptions according to the data cleaning rules.
In the above method, real-time cleaning targets the data that the RFID devices and camera devices acquire in real time, whereas historical cleaning targets the accumulated historical RFID vehicle-passing data and snapshot data. The two kinds of data have different characteristics: the former is relatively small in volume but must be processed with low latency; the latter is huge in volume and has no real-time requirement, but the cleaning of the massive data must be completed efficiently and accurately.
The method provided by the embodiment of the present invention uses a big data platform customized on the basis of Hadoop to provide distributed data processing and storage. For real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and checked for exceptions according to the data cleaning rules.
The processing of real-time data is shown in Fig. 1.
Offline data cleaning relies on the Spark programming model and uses Spark on YARN as the platform on which the program runs, implementing fast and accurate traffic big data cleaning through distributed programming.
The specific process of cleaning offline data with Spark is shown in Fig. 2.
In the above method, for real-time RFID and snapshot data, Spark Streaming stream processing is used with Kafka providing data caching; data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is used; data is read from HDFS and compared, counted and checked for exceptions according to the data cleaning rules, and by optimizing the comparison algorithm, program performance and the accuracy of the cleaning results are improved. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
In the embodiment of the present invention, continuously extracting data from Kafka by time window specifically means: RFID vehicle-passing data and snapshot data are obtained from the persistent Kafka distributed message queue at a set time interval, and each extraction cumulatively obtains the data within a set time period.
In the embodiment of the present invention, the time interval can be 5 minutes and the time window can be 10 minutes.
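The relationship between the 5-minute interval and the 10-minute window can be illustrated with a minimal sketch in plain Python. It is not the Spark Streaming or Kafka API: the function `sliding_windows`, its parameters and the record layout `(timestamp_seconds, payload)` are assumptions made for illustration, and the records are assumed to arrive sorted by timestamp.

```python
from collections import deque


def sliding_windows(records, window_s=600, slide_s=300):
    """Yield (window_end, records) every `slide_s` seconds.

    Each yielded list holds the records whose timestamp lies in
    [window_end - window_s, window_end). `records` is an iterable of
    (timestamp_seconds, payload) tuples sorted by timestamp.
    """
    buffer = deque()
    next_end = None
    for ts, payload in records:
        if next_end is None:
            # Align the first window end to the next multiple of the slide.
            next_end = (ts // slide_s + 1) * slide_s
        while ts >= next_end:
            # Evict records that have fallen out of the window, then emit it.
            while buffer and buffer[0][0] < next_end - window_s:
                buffer.popleft()
            yield next_end, list(buffer)
            next_end += slide_s
        buffer.append((ts, payload))
    if next_end is not None:
        # Emit the final, still-open window.
        while buffer and buffer[0][0] < next_end - window_s:
            buffer.popleft()
        yield next_end, list(buffer)
```

With these values every record is seen by two consecutive windows, which is what "cumulatively obtaining" the data of a set time period at each interval amounts to in this sketch.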
In a preferred embodiment of the present invention, in the processing method for real-time data, completing the comparison, statistics and exception handling of the data according to the data cleaning rules specifically includes the cleaning of vehicle trajectories, the counting of passing-vehicle flow and the extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
A1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields of license plate number, time, collection point name and collection direction;
A2, using the comparison functions provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and snapshot data according to the comparison rules, to obtain the trajectory record of the vehicle as it passes the collection point, that is, the vehicle trajectory cleaning result;
A3, storing the vehicle trajectory cleaning result in HBase, wherein HBase is divided into multiple different regions and the result is stored with the reversed string of the license plate number and time strings as the key.
In the above method, reversing the license plate number and time strings lowers the probability that the compared fields share a common prefix, which greatly reduces the number of character comparisons needed between any two strings and so makes the comparison more efficient.
Because the volume of vehicle trajectory data is very large and efficient queries over massive data are required, the embodiment of the present invention stores the vehicle trajectory cleaning result in HBase. To improve the query efficiency of the cleaning result, HBase is divided into 1000 different regions, and the result is stored with the reversed string of the license plate number and time strings as the key.
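The effect of reversing the key can be seen in a short sketch. The helper `reversed_key`, the choice to concatenate the plate and the time before reversing, the `yyyyMMddHHmmss` time format and the two sample records are all assumptions made for illustration; the text only states that the reversed string of the license plate number and time strings is used as the key.

```python
import os


def reversed_key(plate, pass_time):
    """Concatenate plate and time, then reverse the whole string (assumed layout)."""
    return (plate + pass_time)[::-1]


# Two made-up records for similar plates captured about a minute apart.
records = [("A12345", "20161201083015"), ("A12346", "20161201083122")]

forward_keys = [plate + t for plate, t in records]
backward_keys = [reversed_key(plate, t) for plate, t in records]

# Length of the prefix that a character-by-character comparison must scan
# before the two keys can be told apart.
forward_prefix_len = len(os.path.commonprefix(forward_keys))
backward_prefix_len = len(os.path.commonprefix(backward_keys))
```

In this sample the forward keys share a five-character prefix while the reversed keys differ at the first character, because the fast-changing seconds digits move to the front of the key.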
In a preferred embodiment of the present invention, the counting of passing-vehicle flow can be implemented in the following steps:
B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;
B2, following the principle of distributed big data processing in Spark Streaming, counting the data records that share the same key, and then summing the counts of each collection point over the set time interval, to obtain the passing-vehicle flow record of each collection point in the corresponding time period;
B3, storing the passing-vehicle flow of each collection point in an in-memory database.
Passing-vehicle flow statistics count, for each collection point, the number of vehicles that pass within a time window. To allow fast reads, writes and queries, the passing-vehicle flow of each collection point is stored in an in-memory database; a flow query then requires only a single accumulation in memory, which improves the real-time performance of flow queries.
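Steps B1 and B2 can be sketched in plain Python as a count per key followed by a sum across batches. This stands in for the distributed count-by-key that Spark Streaming would perform; the function names and the `collection_point` field name are hypothetical.

```python
from collections import Counter


def count_batch(rfid_records):
    """B1 + B2 (first half): key each record by collection point and count per key."""
    return Counter(rec["collection_point"] for rec in rfid_records)


def window_total(batch_counts):
    """B2 (second half): sum the per-batch counts of each collection point."""
    total = Counter()
    for counts in batch_counts:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total
```

Keeping the per-batch counts small and summing them on demand mirrors the single in-memory accumulation described above for flow queries.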
In the present invention, the extraction of abnormal data can be implemented in the following steps:
C1, joining the two kinds of data records on the common fields of the RFID vehicle-passing data and the snapshot data, namely the four fields of license plate number, time, collection point name and collection direction;
C2, filtering the RFID vehicle-passing data and the snapshot data separately according to the decision rules for abnormal data, to extract the abnormal data;
C3, storing the abnormal data in a relational database.
Because data may be missing, or the information in different types of data may be inconsistent, abnormal data must be extracted for manual review. The extraction first joins the RFID vehicle-passing data and the video snapshot data on several common fields; it then filters the RFID vehicle-passing data and the snapshot data separately according to the decision rules for abnormal data, to extract the abnormal data. Because the volume of abnormal data is limited, it can be stored in a relational database.
In a preferred embodiment of the present invention, in the processing method for historical data, comparing, counting and exception-checking the data according to the data cleaning rules specifically includes the cleaning of vehicle trajectories, the counting of passing-vehicle flow and the extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
D1, joining the RFID vehicle-passing data and the video snapshot data on the four fields of license plate number, time, collection point name and direction;
D2, reversing the license plate number and time strings, and filtering the data using the license plate color and passing-time fields, to obtain the vehicle trajectory data;
D3, storing the vehicle trajectory data in HBase with the reversed string of the license plate number and time strings as the key.
In the above method, Spark is first used to join the RFID vehicle-passing data and the snapshot data with the basic information in the database, such as persons, vehicles and collection points, to obtain the information useful for trajectory cleaning, flow statistics and anomaly extraction, so as to facilitate the offline data cleaning process.
The present invention first reverses the license plate number and time strings to make the comparison of the two types of data more efficient. Because the volume of vehicle trajectory data is very large and efficient queries over massive data are required, the present invention stores the cleaned vehicle trajectories in HBase, keyed by the reversed string of the license plate number and time strings. To speed up storage, the data import tool Loader can be used to import the trajectory data into HBase.
In the embodiment of the present invention, in the vehicle trajectory cleaning process, the RFID data, the snapshot data and the device information table are first each encapsulated as a corresponding RDD and joined on the IP address of the device, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field. The two classes of data RDD are then each converted into an RDD of key-value pairs to facilitate the compare-and-join operation, where the key is the string composed of the fields to be compared. Finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered using rules such as time completeness, license plate color consistency and field completeness, to obtain correct trajectory data.
The RDD dependencies of the vehicle trajectory cleaning module are shown in Fig. 3.
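The three stages described above (enrich by device IP, convert to key-value pairs, join and filter) can be sketched with plain Python dictionaries standing in for RDDs. Every field name (`device_ip`, `plate`, `time`, `collection_point`, `direction`, `plate_color`), the `|` separator in the key and the exact filter conditions are assumptions made for illustration, not the rules of the patent.

```python
def attach_direction(records, devices):
    """Stage 1: add the direction field by looking up each record's device IP."""
    enriched = []
    for rec in records:
        dev = devices.get(rec["device_ip"])
        if dev is not None:  # records from unknown devices are dropped in this sketch
            enriched.append({**rec, "direction": dev["direction"]})
    return enriched


def to_pairs(records):
    """Stage 2: key each record by the string built from the four compared fields."""
    pairs = {}
    for rec in records:
        key = "|".join((rec["plate"], rec["time"],
                        rec["collection_point"], rec["direction"]))
        pairs[key] = rec  # a later duplicate key overwrites an earlier one
    return pairs


def clean_trajectories(rfid, snapshots, devices):
    """Stage 3: inner-join the two data sets by key, then filter the joined rows."""
    rfid_pairs = to_pairs(attach_direction(rfid, devices))
    snap_pairs = to_pairs(attach_direction(snapshots, devices))
    tracks = []
    for key in rfid_pairs.keys() & snap_pairs.keys():
        r, s = rfid_pairs[key], snap_pairs[key]
        # Keep rows with a non-empty time and a plate color that both sources agree on.
        if r["time"] and r["plate_color"] and r["plate_color"] == s["plate_color"]:
            tracks.append({"plate": r["plate"], "time": r["time"],
                           "collection_point": r["collection_point"],
                           "direction": r["direction"]})
    return sorted(tracks, key=lambda t: (t["plate"], t["time"]))
```

On a cluster, each stage would correspond to a join or map over pair RDDs; the dictionary version only shows the data flow and loses duplicate keys, which a real implementation would have to handle.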
In the embodiment of the present invention, the counting of passing-vehicle flow is implemented in the following steps:
E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field together with the time string truncated to the hour;
E2, following the principle of distributed big data processing in Spark, counting the data records that share the same key, to obtain the passing-vehicle flow record of each collection point in the corresponding time period;
E3, storing the passing-vehicle flow results of each collection point in a relational database.
Passing-vehicle flow statistics fall into two kinds according to what is measured: full-volume statistics and per-type vehicle flow. Full-volume statistics count the total flow through each collection point; per-type flow subdivides the vehicles by type and counts the flow of each vehicle type through a collection point. Because the numbers of collection points and of vehicle types are both limited, the statistical results are small in volume and can be stored in a relational database.
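Both kinds of statistics reduce to counting records per key, with only the key differing. The sketch below assumes a `yyyyMMddHHmmss` time string, so that its first ten characters identify the hour, and a hypothetical `vehicle_type` field; neither is specified in the text.

```python
from collections import Counter


def hourly_flow(rfid_records):
    """Full-volume statistics: count per (collection point, hour)."""
    return Counter((r["collection_point"], r["time"][:10]) for r in rfid_records)


def hourly_flow_by_type(rfid_records):
    """Per-type statistics: the same count, with the vehicle type added to the key."""
    return Counter((r["collection_point"], r["time"][:10], r["vehicle_type"])
                   for r in rfid_records)
```

Each resulting key and count pair maps naturally to one row of a relational table, which fits the observation that the result set stays small.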
In the present invention, the types of abnormal data include: incomplete data fields, missing data and inconsistent data information.
Because data may be missing, data fields may be incomplete, or the information in different types of data may be inconsistent, the abnormal RFID vehicle-passing data and snapshot data must be filtered out for these cases during processing and marked with their anomaly type for manual review and correction, so as to ensure the completeness of the data and the correctness of vehicle trajectory cleaning. Abnormal data mainly covers the following three cases:
(1) incomplete data fields;
(2) missing data;
(3) inconsistent data information.
In the embodiment of the present invention, the extraction of abnormal data can be implemented in the following steps:
F1, joining the RFID vehicle-passing data and the snapshot data on the four fields of license plate number, collection point name, collection direction and time;
F2, according to the data anomaly types, first determining whether the RFID data is missing; if the RFID data exists, then determining whether the color field of the RFID data and the snapshot are present; if the fields are complete, determining whether the license plate color in the RFID data is consistent with that in the snapshot data; finally, storing the extracted abnormal data in a MySQL database and marking its anomaly type.
Given the characteristics of RFID and snapshot data, incomplete data fields mainly take two forms: an inconsistent license plate color and a missing snapshot picture. To make exception handling more efficient, the three types of abnormal data can be merged and processed together by the above method.
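The check order in step F2 can be read as a single cascade that handles all three anomaly types in one pass. The sketch below is one possible reading: the function `classify`, the field names `plate_color` and `picture`, the returned labels, and the decision to treat a missing snapshot record as an incomplete field are all assumptions rather than rules stated in the text.

```python
def classify(rfid, snapshot):
    """Return an anomaly label for a joined record pair, or None if it is clean.

    `rfid` and `snapshot` are dicts, or None when that side of the join is absent.
    """
    # 1. Is the RFID data missing altogether?
    if rfid is None:
        return "data_missing"
    # 2. Are the required fields present (RFID color field, snapshot picture)?
    if not rfid.get("plate_color") or snapshot is None or not snapshot.get("picture"):
        return "field_incomplete"
    # 3. Do the two sources agree on the plate color?
    if rfid["plate_color"] != snapshot.get("plate_color"):
        return "info_inconsistent"
    return None
```

A record that receives a label would then be written to the anomaly table together with that label, so that reviewers can filter by anomaly type.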
After the abnormal data has been extracted, it is distinguished by anomaly type and data type and displayed to the reviewers through an interface. The reviewers supplement and repair the data according to the type of anomaly, and the data is then sent back to the data cleaning system for processing, which improves the accuracy of vehicle trajectory cleaning.
By adopting the above technical solution disclosed by the present invention, the following beneficial effects are obtained. In the fast and accurate traffic big data cleaning method provided by the embodiments of the present invention, for real-time RFID and snapshot data, Spark Streaming stream processing is used with Kafka providing data caching; data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules. For offline, batch-accumulated data, Spark in-memory processing is used; data is read from HDFS and compared, counted and checked for exceptions according to the data cleaning rules, and by optimizing the comparison algorithm, program performance and the accuracy of the cleaning results are improved. Vehicle trajectory cleaning, abnormal data handling and vehicle flow statistics are thus performed quickly and accurately on the RFID, snapshot and other data generated during urban traffic monitoring and management, which in turn enables the processing of traffic data resources and safeguards the storage and use of traffic big data resources.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
Those skilled in the art should understand that the order of the method steps provided in the above embodiments can be adjusted according to the actual situation, and that the steps can also be performed concurrently according to the actual situation.
All or some of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a storage medium readable by a computer device and is used to execute all or some of the steps of the methods of the above embodiments. Examples of the computer device include personal computers, servers, network devices, smart mobile terminals, smart home devices, wearable smart devices and in-vehicle smart devices; examples of the storage medium include RAM, ROM, magnetic disks, magnetic tape, optical discs, flash memory, USB flash drives, portable hard drives, memory cards, memory sticks, network server storage and network cloud storage.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A fast and accurate traffic big data cleaning method, characterized by comprising a processing method for real-time data and a processing method for historical data;
The processing method for real-time data is as follows: for real-time RFID and snapshot data, Spark Streaming stream processing is adopted, data is continuously extracted from Kafka by time window, and the comparison, statistics and exception handling of the data are completed according to the data cleaning rules;
The processing method for historical data is as follows: Spark in-memory processing is adopted, data is read from HDFS, and the data is compared, counted and checked for exceptions according to the data cleaning rules.
2. The quick and accurate cleaning method of traffic big data according to claim 1, characterized in that continuously extracting data from Kafka according to a time window specifically means: at a set time interval, RFID vehicle-passing data and snapshot data are fetched from the persistent Kafka distributed message queue, and each fetch accumulates the data that falls within the set time period.
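The windowed extraction described in claim 2 can be illustrated without Spark Streaming or Kafka. The following is a minimal plain-Python sketch, assuming records arrive as `(timestamp_seconds, payload)` tuples and that the window length is 10 seconds; the function name `batch_by_window`, the record layout, and the window length are illustrative assumptions, not details from the patent.

```python
from collections import defaultdict


def batch_by_window(records, window_seconds):
    """Group (timestamp, payload) records into fixed, non-overlapping windows.

    Returns a dict mapping each window's start time to the payloads whose
    timestamp falls in [start, start + window_seconds).
    """
    batches = defaultdict(list)
    for ts, payload in records:
        window_start = (ts // window_seconds) * window_seconds
        batches[window_start].append(payload)
    return dict(batches)


# Hypothetical vehicle-passing records: (seconds since start, plate number).
records = [(1, "A001"), (4, "A002"), (12, "A003"), (19, "A001"), (25, "A004")]
batches = batch_by_window(records, window_seconds=10)
```

In a real deployment the stream processor would form these batches itself; the sketch only shows which records end up together in one window.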
3. The quick and accurate cleaning method of traffic big data according to claim 2, characterized in that, in the processing method for real-time data, completing comparison, statistics, and exception handling of the data according to the data cleaning rules specifically comprises cleaning of vehicle trajectories, statistics of vehicle-passing flow, and extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
A1, joining the two kinds of data records on the fields common to the RFID vehicle-passing data and the snapshot data, namely four fields: license plate number, time, collection point name, and collection direction;
A2, using the comparison functions provided by Spark Streaming, reversing the license plate number and time strings, and filtering the joined RFID vehicle-passing data and snapshot data according to the comparison rules, to obtain the record of each vehicle at a collection point at a given time, i.e., the vehicle trajectory cleaning result;
A3, storing the vehicle trajectory cleaning result in HBase, where HBase is partitioned into multiple different regions and the reversed string of the license plate number and time string is used as the key.
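Steps A1 to A3 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the field names (`plate`, `time`, `point`, `direction`), the `YYYYMMDDHHMMSS` time format, and the comparison rule (keep an RFID record only when a snapshot with the same four fields exists) are all hypothetical, and a dict stands in for the HBase table. The claim does not state why the key is reversed; reversing keys is a common way to spread writes across HBase regions.

```python
def join_key(rec):
    """Key built from the four fields shared by RFID and snapshot records (A1)."""
    return (rec["plate"], rec["time"], rec["point"], rec["direction"])


def reversed_row_key(plate, time_str):
    """Reverse the concatenated plate number and time string (A2)."""
    return (plate + time_str)[::-1]


def clean_tracks(rfid_records, snapshot_records):
    """Keep RFID records confirmed by a snapshot with the same four fields.

    Returns a dict standing in for the HBase table: reversed key -> record (A3).
    """
    snapshot_keys = {join_key(s) for s in snapshot_records}
    table = {}
    for rec in rfid_records:
        if join_key(rec) in snapshot_keys:  # assumed comparison rule
            table[reversed_row_key(rec["plate"], rec["time"])] = rec
    return table


# Hypothetical sample data.
rfid = [
    {"plate": "A12345", "time": "20161202083000", "point": "P1", "direction": "N"},
    {"plate": "B67890", "time": "20161202083005", "point": "P1", "direction": "N"},
]
snapshots = [
    {"plate": "A12345", "time": "20161202083000", "point": "P1", "direction": "N"},
]
tracks = clean_tracks(rfid, snapshots)
```

Only the first RFID record survives, because the second has no matching snapshot.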
4. The quick and accurate cleaning method of traffic big data according to claim 3, characterized in that the statistics of vehicle-passing flow are implemented in the following steps:
B1, converting the RFID vehicle-passing data received in each time period into key-value pairs keyed by the collection point field;
B2, following the distributed big data processing model of Spark Streaming, counting the data records that share the same key, and then summing the counts of each collection point over the set time interval, to obtain the vehicle-passing flow record of each collection point for the corresponding time period;
B3, storing the vehicle-passing flow of each collection point in an in-memory database.
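Steps B1 and B2 amount to a count per key within each batch followed by a sum across the batches of one interval. A minimal plain-Python sketch follows, assuming each record is a dict with a hypothetical `point` field; the function names are illustrative and no Spark API is used.

```python
from collections import Counter


def count_by_point(batch):
    """B1 + first half of B2: count the records of one batch per collection point."""
    return Counter(rec["point"] for rec in batch)


def flow_for_interval(batches):
    """Second half of B2: sum the per-batch counts over all batches in the interval."""
    total = Counter()
    for batch in batches:
        total.update(count_by_point(batch))
    return dict(total)


# Three hypothetical batches received within one statistics interval.
batches = [
    [{"point": "P1"}, {"point": "P2"}, {"point": "P1"}],
    [{"point": "P2"}],
    [{"point": "P1"}, {"point": "P3"}],
]
flow = flow_for_interval(batches)
```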
5. The quick and accurate cleaning method of traffic big data according to claim 3, characterized in that the extraction of abnormal data is implemented in the following steps:
C1, joining the two kinds of data records on the fields common to the RFID vehicle-passing data and the snapshot data, namely four fields: license plate number, time, collection point name, and collection direction;
C2, filtering the RFID vehicle-passing data and the snapshot data respectively according to the decision rules for abnormal data, and extracting the abnormal data;
C3, storing the abnormal data in a relational database.
6. The quick and accurate cleaning method of traffic big data according to claim 1, characterized in that, in the processing method for historical data, performing comparison, statistics, and exception handling on the data according to the data cleaning rules specifically comprises cleaning of vehicle trajectories, statistics of vehicle-passing flow, and extraction of abnormal data;
The cleaning of vehicle trajectories is implemented in the following steps:
D1, joining the RFID vehicle-passing data and the video snapshot data on four fields: license plate number, time, collection point name, and direction;
D2, reversing the license plate number and time strings, and filtering the data using the plate color and passing time fields, to obtain vehicle trajectory data;
D3, storing the vehicle trajectory data in HBase, keyed by the reversed string of the license plate number and time string.
7. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that, in the cleaning of vehicle trajectories, the RFID data, the snapshot data, and the device information table are first each wrapped as a corresponding RDD and joined on the device IP address, yielding an RFID data RDD with a direction field and a snapshot data RDD with a direction field; the two data RDDs are then each transformed into RDDs of key-value pairs to facilitate the comparison join, where the key is a string composed of the fields to be compared; finally, the RDDs of the two kinds of data are compared and joined by key, and the data is filtered using rules such as time completeness, plate color consistency, and field completeness, to obtain correct trajectory data.
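The three stages of claim 7 (attach direction via the device table, convert to key-value form, join and filter) can be sketched with ordinary Python collections in place of RDDs. Everything concrete here is an assumption for illustration: the field names, the device table layout (IP address mapped to direction), the `|` key separator, and the exact form of the filter rules.

```python
KEY_FIELDS = ("plate", "time", "point", "direction")
REQUIRED_FIELDS = KEY_FIELDS + ("color",)


def attach_direction(records, device_table):
    """Join records with the device table on IP to add a direction field."""
    joined = []
    for rec in records:
        direction = device_table.get(rec["ip"])
        if direction is not None:
            joined.append({**rec, "direction": direction})
    return joined


def to_key_value(records):
    """Key each record by a string composed of the fields to compare."""
    return {"|".join(str(r.get(f, "")) for f in KEY_FIELDS): r for r in records}


def passes_rules(rfid_rec, snap_rec):
    """Assumed rules: all fields (including time) present, plate colors equal."""
    complete = all(rfid_rec.get(f) and snap_rec.get(f) for f in REQUIRED_FIELDS)
    return complete and rfid_rec["color"] == snap_rec["color"]


def clean_history(rfid, snapshots, device_table):
    rfid_kv = to_key_value(attach_direction(rfid, device_table))
    snap_kv = to_key_value(attach_direction(snapshots, device_table))
    return [rfid_kv[k] for k in rfid_kv
            if k in snap_kv and passes_rules(rfid_kv[k], snap_kv[k])]


# Hypothetical device table (IP -> direction) and sample records.
device_table = {"10.0.0.1": "N", "10.0.1.1": "N", "10.0.0.2": "S", "10.0.1.2": "S"}
rfid = [
    {"ip": "10.0.0.1", "plate": "A12345", "time": "20161202083000", "point": "P1", "color": "blue"},
    {"ip": "10.0.0.2", "plate": "B67890", "time": "20161202083010", "point": "P2", "color": "blue"},
]
snapshots = [
    {"ip": "10.0.1.1", "plate": "A12345", "time": "20161202083000", "point": "P1", "color": "blue"},
    {"ip": "10.0.1.2", "plate": "B67890", "time": "20161202083010", "point": "P2", "color": "yellow"},
]
cleaned = clean_history(rfid, snapshots, device_table)
```

The second vehicle is dropped because its plate color differs between the two sources. Keying a dict by the composite string keeps one record per key, which is a simplification of a real pair-RDD join.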
8. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that the statistics of vehicle-passing flow are implemented in the following steps:
E1, converting the RFID vehicle-passing data into key-value pairs keyed by the collection point field together with the time string truncated to the hour;
E2, following the distributed big data processing model of Spark, counting the data records that share the same key, to obtain the vehicle-passing flow record of each collection point for the corresponding time period;
E3, storing the vehicle-passing flow results of each collection point in a relational database.
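Steps E1 and E2 reduce to counting records per (collection point, hour) key. The sketch below uses plain Python and assumes time strings in `YYYYMMDDHHMMSS` form, so that the first ten characters identify the hour; the field names and the time format are illustrative assumptions.

```python
from collections import Counter


def hourly_key(rec):
    """E1: collection point plus the time string truncated to the hour."""
    return (rec["point"], rec["time"][:10])


def hourly_flow(records):
    """E2: count the records that share the same key."""
    return dict(Counter(hourly_key(r) for r in records))


# Hypothetical historical RFID vehicle-passing records.
records = [
    {"point": "P1", "time": "20161202083000"},
    {"point": "P1", "time": "20161202085959"},
    {"point": "P1", "time": "20161202090000"},
    {"point": "P2", "time": "20161202083000"},
]
flow = hourly_flow(records)
```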
9. The quick and accurate cleaning method of traffic big data according to claim 6, characterized in that the types of abnormal data include: incomplete data fields, missing data, and inconsistent data information.
10. The quick and accurate cleaning method of traffic big data according to claim 9, characterized in that the extraction of abnormal data is implemented in the following steps:
F1, joining the RFID vehicle-passing data and the snapshot data on four fields: license plate number, collection point name, collection direction, and time;
F2, according to the anomaly types, first determining whether the RFID data is missing; if the RFID data exists, determining whether the color field is present in the RFID data and in the snapshot data; if the fields are complete, determining whether the plate color in the RFID data is consistent with that in the snapshot data; finally, storing the extracted abnormal data in a MySQL database and marking the anomaly type.
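The decision order in step F2 (missing data, then incomplete fields, then inconsistent information) can be sketched as a small classifier. This is a plain-Python illustration under assumptions: the two sources are dicts keyed by the four-field join key of F1, the anomaly labels are hypothetical names, and the result is returned as a list of rows rather than written to MySQL.

```python
def classify(rfid_rec, snap_rec):
    """Return an anomaly label for one joined pair, or None if the pair is clean."""
    if rfid_rec is None:
        return "data_missing"            # no RFID record for this snapshot
    if not rfid_rec.get("color") or not snap_rec.get("color"):
        return "field_incomplete"        # color field absent or empty
    if rfid_rec["color"] != snap_rec["color"]:
        return "info_inconsistent"       # plate colors disagree
    return None


def extract_anomalies(rfid_by_key, snap_by_key):
    """Check every snapshot against its RFID counterpart, in insertion order."""
    rows = []
    for key, snap in snap_by_key.items():
        label = classify(rfid_by_key.get(key), snap)
        if label is not None:
            rows.append((key, label))
    return rows


# Hypothetical joined inputs; "k1".."k4" stand for four-field join keys.
snap_by_key = {k: {"color": "blue"} for k in ("k1", "k2", "k3", "k4")}
rfid_by_key = {
    "k1": {"color": "blue"},
    "k2": {"color": ""},
    "k3": {"color": "yellow"},
}
anomalies = extract_anomalies(rfid_by_key, snap_by_key)
```

Each returned row pairs a record key with its anomaly type, which corresponds to the marking of the anomaly type described in the claim.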
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611094160.2A CN106528865A (en) | 2016-12-02 | 2016-12-02 | Quick and accurate cleaning method of traffic big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106528865A true CN106528865A (en) | 2017-03-22 |
Family
ID=58354223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611094160.2A Pending CN106528865A (en) | 2016-12-02 | 2016-12-02 | Quick and accurate cleaning method of traffic big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528865A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778245A (en) * | 2015-04-09 | 2015-07-15 | 北方工业大学 | Similar trajectory mining method and device on basis of massive license plate identification data |
CN105426478A (en) * | 2015-11-18 | 2016-03-23 | 四川长虹电器股份有限公司 | Method for user behavior analysis |
CN105786864A (en) * | 2014-12-24 | 2016-07-20 | 国家电网公司 | Offline analysis method for massive data |
CN105893628A (en) * | 2016-05-17 | 2016-08-24 | 中国农业银行股份有限公司 | Real-time data collection system and method |
-
2016
- 2016-12-02 CN CN201611094160.2A patent/CN106528865A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105786864A (en) * | 2014-12-24 | 2016-07-20 | 国家电网公司 | Offline analysis method for massive data |
CN104778245A (en) * | 2015-04-09 | 2015-07-15 | 北方工业大学 | Similar trajectory mining method and device on basis of massive license plate identification data |
CN105426478A (en) * | 2015-11-18 | 2016-03-23 | 四川长虹电器股份有限公司 | Method for user behavior analysis |
CN105893628A (en) * | 2016-05-17 | 2016-08-24 | 中国农业银行股份有限公司 | Real-time data collection system and method |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106878092A (en) * | 2017-03-28 | 2017-06-20 | 上海以弈信息技术有限公司 | Real-time network O&M monitoring and analysis presentation platform with multi-source heterogeneous data fusion |
CN109118806A (en) * | 2017-06-26 | 2019-01-01 | 杭州海康威视系统技术有限公司 | Device anomaly detection method, apparatus and system |
CN107391719A (en) * | 2017-07-31 | 2017-11-24 | 南京邮电大学 | Distributed stream data processing method and system in a cloud environment |
CN107688646A (en) * | 2017-08-30 | 2018-02-13 | 武汉烽火众智数字技术有限责任公司 | ES-based method for regional collision analysis of checkpoint data |
CN108090191A (en) * | 2017-12-14 | 2018-05-29 | 苏州泥娃软件科技有限公司 | Method and system for cleaning and organizing traffic big data |
CN108171971A (en) * | 2017-12-18 | 2018-06-15 | 武汉烽火众智数字技术有限责任公司 | Real-time vehicle monitoring method and system based on Spark Streaming |
CN108319538A (en) * | 2018-02-02 | 2018-07-24 | 世纪龙信息网络有限责任公司 | Method and system for monitoring the operating status of a big data platform |
CN109753496A (en) * | 2018-11-27 | 2019-05-14 | 天聚地合(苏州)数据股份有限公司 | Data cleaning method for big data |
CN109785595A (en) * | 2019-02-26 | 2019-05-21 | 成都古河云科技有限公司 | Machine-learning-based method for real-time identification of abnormal vehicle trajectories |
CN110287010A (en) * | 2019-06-12 | 2019-09-27 | 北京工业大学 | Cache data prefetching method oriented to Spark time window data analysis |
CN110287010B (en) * | 2019-06-12 | 2021-09-14 | 北京工业大学 | Cache data prefetching method oriented to Spark time window data analysis |
CN110334081A (en) * | 2019-06-28 | 2019-10-15 | 北京天眼查科技有限公司 | Method and device for cleaning mass data |
CN111368134B (en) * | 2019-07-04 | 2023-10-27 | 杭州海康威视系统技术有限公司 | Traffic data processing method and device, electronic equipment and storage medium |
CN111368134A (en) * | 2019-07-04 | 2020-07-03 | 杭州海康威视系统技术有限公司 | Traffic data processing method and device, electronic equipment and storage medium |
CN110502509A (en) * | 2019-08-27 | 2019-11-26 | 广东工业大学 | Traffic big data cleaning method based on Hadoop and Spark framework and related device |
CN110502509B (en) * | 2019-08-27 | 2023-04-18 | 广东工业大学 | Traffic big data cleaning method based on Hadoop and Spark framework and related device |
CN110704206A (en) * | 2019-09-09 | 2020-01-17 | 上海凯京信达科技集团有限公司 | Real-time computing method, computer storage medium and electronic equipment |
CN110704206B (en) * | 2019-09-09 | 2022-09-27 | 上海斑马来拉物流科技有限公司 | Real-time computing method, computer storage medium and electronic equipment |
CN110569238A (en) * | 2019-09-12 | 2019-12-13 | 成都中科大旗软件股份有限公司 | Data management method, system, storage medium and server based on big data |
CN110569237A (en) * | 2019-09-12 | 2019-12-13 | 上海富数科技有限公司 | System and method for realizing real-time data cleaning processing |
CN110569238B (en) * | 2019-09-12 | 2023-03-24 | 成都中科大旗软件股份有限公司 | Data management method, system, storage medium and server based on big data |
CN110888972A (en) * | 2019-10-27 | 2020-03-17 | 北京明朝万达科技股份有限公司 | Sensitive content identification method and device based on Spark Streaming |
CN111127949A (en) * | 2019-12-18 | 2020-05-08 | 北京中交兴路车联网科技有限公司 | Vehicle high-risk road section early warning method and device and storage medium |
CN111127949B (en) * | 2019-12-18 | 2021-12-03 | 北京中交兴路车联网科技有限公司 | Vehicle high-risk road section early warning method and device and storage medium |
CN111143415A (en) * | 2019-12-26 | 2020-05-12 | 政采云有限公司 | Data processing method, device and computer readable storage medium |
CN111143415B (en) * | 2019-12-26 | 2023-12-29 | 政采云有限公司 | Data processing method, device and computer readable storage medium |
CN112347093A (en) * | 2020-11-05 | 2021-02-09 | 哈尔滨航天恒星数据系统科技有限公司 | Method for facilitating cleaning, integrating and storing of mass multi-source heterogeneous data |
CN113177049A (en) * | 2021-05-13 | 2021-07-27 | 中移智行网络科技有限公司 | Data processing method, device and system |
CN113505119A (en) * | 2021-07-29 | 2021-10-15 | 青岛以萨数据技术有限公司 | ETL method and device based on multiple data sources |
CN113505119B (en) * | 2021-07-29 | 2023-08-29 | 青岛以萨数据技术有限公司 | ETL method and device based on multiple data sources |
CN115391315A (en) * | 2022-07-15 | 2022-11-25 | 生命奇点(北京)科技有限公司 | Data cleaning method and device |
CN114996260A (en) * | 2022-08-05 | 2022-09-02 | 深圳市深蓝信息科技开发有限公司 | Method and device for cleaning AIS data, terminal equipment and storage medium |
CN114996260B (en) * | 2022-08-05 | 2022-11-11 | 深圳市深蓝信息科技开发有限公司 | Method and device for cleaning AIS data, terminal equipment and storage medium |
CN115359666A (en) * | 2022-08-19 | 2022-11-18 | 重庆首讯科技股份有限公司 | Abnormal traffic behavior detection method based on multi-source data cross validation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170322 |