CN107527223A - Method and device for ticketing information analysis - Google Patents
- Publication number: CN107527223A
- Application number: CN201611198401.8A
- Authority: CN (China)
- Prior art keywords: passenger, booking, distribution, station, record
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
An embodiment of the invention discloses a method and device for ticketing information analysis. The method includes: extracting booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations; and characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space. If the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of passenger hidden state vectors. An iterative state estimation algorithm is thereby developed that can extract effective behavior features from massive data in parallel, identify abnormal patterns in passenger booking behavior, and meet real-time requirements on computational time efficiency. The model's output can be presented from many aspects and angles, which is convenient for data analysts.
Description
Technical field
The embodiments of the present invention relate to the technical field of safety detection, and in particular to a method and device for ticketing information analysis.
Background
Railways are important national infrastructure, the backbone of the transportation system, and the main artery of the national economy, and they play a vital role in the country's politics, economy, culture, and national defense construction. According to 2015 statistics, China's railway operating mileage reached 112,000 km, with a network density of 116.48 km per 10,000 km², planned investment exceeded 3.3 trillion RMB, and railway passenger volume exceeded 2.357 billion passenger trips.
Safety is the lifeline of railway transportation and bears directly on an enterprise's production efficiency, its economic and social benefits, and personal safety. At present, railway safety monitoring in China mainly uses sensors, data acquisition and transmission devices, and data analysis systems to monitor, analyze, and give early warning on the parameters of hardware facilities such as tracks and trains in real time. However, people are the subject of passenger transport, and abnormal behavior by some people during booking and travel may also adversely affect railway transportation, production safety, and the maintenance of normal order. China still has no mature theoretical model or technical product for detecting such special passenger groups or for narrowing the search range for potentially dangerous persons.
However, extracting valuable patterns from massive passenger booking data with relevant machine learning algorithms faces many problems:
(1) Labeled data is lacking, so supervised learning models cannot be applied:
Passenger booking data contains no explicit labels for a model to learn from. Manual labeling is not only time-consuming and costly but also markedly subjective. First, there is no guarantee that every annotator has the domain expertise to judge abnormal patterns in booking data accurately. Second, annotators may apply inconsistent criteria, so that labels for the same data may conflict. Third, the passenger booking data that can be collected is incomplete, and it is difficult to fix a clear standard for judging whether data is abnormal from incomplete information.
(2) Data is incomplete, and cross-validation from multiple information sources is lacking:
Data incompleteness shows up in two ways. First, the acquired passenger booking data contains no exact booking time, and the metadata on passengers' booking modes is incomplete. Second, information obtained from booking data alone is too limited: the identified outliers cannot be used directly as the basis for judging that a passenger belongs to a risk group. (Outliers are identified as follows: after the probability density function of the data is fitted, the passenger booking data set is traversed and each passenger vector is labeled by maximum likelihood estimation, which gives the vector's degree of membership in each class cluster; a passenger whose membership in every class is below a threshold is labeled an outlier.) Describing a passenger's booking behavior pattern accurately also requires support and verification from other sources of information.
(3) The passenger population is huge, but individual travel records are sparse and the data leaves little room for compression:
The data volume is huge: about 6 million passenger trips per day, and up to ten million at peak periods, involving a population that also numbers in the millions. For an individual, however, a large share of passengers may take fewer than 10 train trips per year, so individual travel data is markedly sparse. The main goal of this application is to identify outliers and detect abnormal patterns in passenger booking behavior, so individual-level details cannot be discarded and the data leaves little room for compression. As a result, when an association analysis algorithm is used to analyze passengers traveling together, both its time complexity and its space complexity are very high.
Summary of the invention
The purpose of the embodiments of the present invention is to propose a method and device for ticketing information analysis that address the problem of how to extract valuable patterns from massive passenger booking data using relevant machine learning algorithms.
To this end, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a method of ticketing information analysis includes:
extracting booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations;
characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; if the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of passenger hidden state vectors.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
the trip purpose distribution includes: using the native-province code parsed from the ID card information together with the administrative division codes of the origin and destination stations, judging whether the native-province code equals the administrative division codes of the origin and destination stations, and dividing passengers by trip purpose into a preset number of categories that do not overlap and leave no case uncovered; where odh means the origin and destination stations both match the native place, i.e. a short local trip within the home province; odo means the origin and destination stations match each other but lie in a province other than the native one, i.e. a short trip within another province; o means leaving the native place for another province; d means returning home from another province; and other covers all remaining cases;
the booking count includes the ticket change count, the refund count, and the valid booking count; the ticket change count is the number of booking records with status 3; the refund count is the number of booking records with status 2; the valid booking count is the number of booking records with status 5;
the train type distribution includes: obtaining the sequence of train types from the passenger's multiple travel records, computing the economy, speed, and comfort scores of each train type, summing them, and dividing by the number of travel records; each index score is the average of that index over all travel records within a preset time;
the booking mode distribution includes: obtaining the sequence of booking modes from the passenger's multiple travel records, computing the economy and speed scores of each train type, summing them, and dividing by the number of travel records; each index score is the average of that index over all travel records within a preset time;
the origin station distribution includes the number of origin stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the origin station distribution; the number of origin stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as key and counting the distinct origin stations that appear in them; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records and is computed as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total departures on the day divided by the total departures of all stations; the entropy of the origin station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of origin stations in them, counting the frequency of each distinct item in the set, and computing the entropy of the resulting discrete distribution;
the destination station distribution includes the number of destination stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the destination station distribution; the number of destination stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as key and counting the distinct destination stations that appear in them; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records and is computed as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total expected arrivals on the day divided by the total expected arrivals of all stations; the entropy of the destination station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of destination stations in them, counting the frequency of each distinct item in the set, and computing the entropy of the resulting discrete distribution.
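The trip purpose categories and booking counts described above can be illustrated with a minimal Python sketch. The function names and the use of province-level codes are assumptions made for illustration; the text states only that the native-province code is compared with the administrative division codes of the two stations, and that statuses 3, 2, and 5 mark changes, refunds, and valid bookings.

```python
from collections import Counter

# Status codes as stated in the text: 2 = refund, 3 = ticket change, 5 = valid booking.
STATUS_NAMES = {2: "refunds", 3: "changes", 5: "valid_bookings"}


def trip_purpose(native_province: str, origin_province: str, dest_province: str) -> str:
    """Assign one of the five trip purpose categories (odh, odo, o, d, other).

    All three arguments are province-level codes (an assumption of this sketch).
    """
    if origin_province == dest_province:
        # Trip within one province: the home province (odh) or another one (odo).
        return "odh" if origin_province == native_province else "odo"
    if origin_province == native_province:
        return "o"      # leaving the native province
    if dest_province == native_province:
        return "d"      # returning to the native province
    return "other"      # everything else


def booking_counts(statuses):
    """Count changes, refunds, and valid bookings in a passenger's record statuses."""
    counts = Counter(statuses)
    return {name: counts[code] for code, name in STATUS_NAMES.items()}
```

Because each branch is tested in order and the last one catches every remaining case, the five categories are non-overlapping and exhaustive, as the text requires.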
Preferably, the companion relation is defined as follows: in the passenger booking stream data acquired over a time interval, if passenger A and passenger B board at the same origin station on the same day, travel to the same destination, take the same train in the same carriage, and booked with the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relation.
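A minimal Python sketch of the companion test follows. It assumes that each record has already been reduced to a passenger ID and a hashable trip key bundling the date, origin station, destination, train, carriage, booking mode, ticket office, and window. The support and confidence definitions used here (an absolute co-occurrence count, and that count divided by the larger of the two passengers' trip counts) are standard association-rule conventions adopted as assumptions, since the text does not define them at this point.

```python
from collections import Counter, defaultdict
from itertools import combinations


def companion_pairs(records, min_support=2, min_confidence=0.6):
    """Return {(a, b): (support, confidence)} for passenger pairs that qualify.

    records: iterable of (passenger_id, trip_key) tuples, where trip_key is a
    hashable bundle of all attributes that must match (hypothetical layout).
    """
    # Group passengers that share an identical trip key.
    groups = defaultdict(set)
    for passenger_id, trip_key in records:
        groups[trip_key].add(passenger_id)

    trips = Counter()      # number of distinct trips per passenger
    together = Counter()   # number of trips shared by each pair
    for members in groups.values():
        trips.update(members)
        together.update(combinations(sorted(members), 2))

    result = {}
    for (a, b), shared in together.items():
        # Use the weaker of the two directional confidences (a->b and b->a).
        confidence = shared / max(trips[a], trips[b])
        if shared >= min_support and confidence >= min_confidence:
            result[(a, b)] = (shared, confidence)
    return result
```

Grouping by the full trip key first avoids comparing every pair of passengers in the data set; only passengers who already share a key are paired.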
Preferably, extracting the booking behavior pattern features of the passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations includes:
characterizing the hidden state of a passenger with a vector, and converting discrete, qualitative data labels into continuous, quantitative data;
performing statistical analysis on the booking data within a preset time interval, describing the passenger's most probable booking mode by its maximum likelihood probability, and converting qualitative, labeled data into continuous, quantitative data;
aggregating the booking modes in all of the passenger's travel records within the preset time interval, and computing the entropy of the booking mode distribution and the number of distinct booking modes.
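The conversion from qualitative labels to quantitative features described above can be sketched as follows. The function name and the choice of base-2 logarithm for the entropy are assumptions; the text does not specify a logarithm base.

```python
import math
from collections import Counter


def categorical_features(values):
    """Summarize a sequence of labels (e.g. booking modes) as three numbers.

    Returns the number of distinct labels, the relative frequency of the most
    common label (the "maximum likelihood probability"), and the entropy of
    the empirical distribution in bits.
    """
    if not values:
        raise ValueError("at least one record is required")
    counts = Counter(values)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "n_distinct": len(counts),
        "max_likelihood_prob": max(probs),
        "entropy": entropy,
    }
```

The same function applies unchanged to a passenger's origin stations or destination stations, since those features are defined by the same three statistics.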
Preferably, characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's type is unknown, converting the problem of determining the passenger's category into the process of fitting the probability density distribution of passenger hidden state vectors, includes:
treating the passenger's category as a hidden variable; given the passenger's category, the conditional distribution of the passenger hidden state vector follows a multivariate Gaussian distribution;
fitting the probability density curve of passengers with a linear weighted sum of multiple Gaussian models, so as to model the real-world situation in which multiple categories may exist;
assuming, according to the central limit theorem, that the conditional distribution of the passenger vector given the passenger category is normal; a weighted sum of multiple Gaussian models can approximate any probability distribution to arbitrary precision, and each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the result.
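The mixture fitting and class decision described above can be sketched in plain Python with the expectation-maximization (EM) algorithm. This is an illustrative sketch under several assumptions that are not in the text: covariances are restricted to diagonal form for brevity, the caller supplies the initial means, and an optional log-density threshold (`min_log_density`, a hypothetical parameter) marks a vector as an outlier when even its best class scores below the threshold.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def fit_gmm(data, init_means, n_iter=50, min_var=1e-6):
    """Fit mixture weights, means, and diagonal variances by EM."""
    k, d, n = len(init_means), len(data[0]), len(data)
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            for t in range(d):
                means[j][t] = sum(r[j] * x[t] for r, x in zip(resp, data)) / nj
                spread = sum(r[j] * (x[t] - means[j][t]) ** 2 for r, x in zip(resp, data))
                variances[j][t] = max(min_var, spread / nj)
    return weights, means, variances


def classify(x, weights, means, variances, min_log_density=None):
    """Return the index of the most probable class, or -1 for an outlier."""
    scores = [math.log(w) + log_gauss_diag(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    best = max(range(len(scores)), key=scores.__getitem__)
    if min_log_density is not None and scores[best] < min_log_density:
        return -1
    return best
```

Comparing weighted log densities across classes is equivalent to picking the class with the highest posterior probability, because the normalizing constant is the same for every class.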
In a second aspect, a device for ticketing information analysis includes:
an extraction module, configured to extract booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations;
a fitting module, configured to characterize the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; if the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of passenger hidden state vectors.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
the trip purpose distribution includes: using the native-province code parsed from the ID card information together with the administrative division codes of the origin and destination stations, judging whether the native-province code equals the administrative division codes of the origin and destination stations, and dividing passengers by trip purpose into a preset number of categories that do not overlap and leave no case uncovered; where odh means the origin and destination stations both match the native place, i.e. a short local trip within the home province; odo means the origin and destination stations match each other but lie in a province other than the native one, i.e. a short trip within another province; o means leaving the native place for another province; d means returning home from another province; and other covers all remaining cases;
the booking count includes the ticket change count, the refund count, and the valid booking count; the ticket change count is the number of booking records with status 3; the refund count is the number of booking records with status 2; the valid booking count is the number of booking records with status 5;
the train type distribution includes: obtaining the sequence of train types from the passenger's multiple travel records, computing the economy, speed, and comfort scores of each train type, summing them, and dividing by the number of travel records; each index score is the average of that index over all travel records within a preset time;
the booking mode distribution includes: obtaining the sequence of booking modes from the passenger's multiple travel records, computing the economy and speed scores of each train type, summing them, and dividing by the number of travel records; each index score is the average of that index over all travel records within a preset time;
the origin station distribution includes the number of origin stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the origin station distribution; the number of origin stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as key and counting the distinct origin stations that appear in them; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records and is computed as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total departures on the day divided by the total departures of all stations; the entropy of the origin station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of origin stations in them, counting the frequency of each distinct item in the set, and computing the entropy of the resulting discrete distribution;
the destination station distribution includes the number of destination stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the destination station distribution; the number of destination stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as key and counting the distinct destination stations that appear in them; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records and is computed as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total expected arrivals on the day divided by the total expected arrivals of all stations; the entropy of the destination station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of destination stations in them, counting the frequency of each distinct item in the set, and computing the entropy of the resulting discrete distribution.
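The averaged train type scores and the station importance coefficient described above reduce to simple ratios, sketched below. The score table is entirely hypothetical (the text does not give the economy, speed, or comfort scores of any train type), and the function names are illustrative.

```python
# Hypothetical (economy, speed, comfort) scores per train type; not from the source.
TRAIN_TYPE_SCORES = {"G": (1, 5, 5), "D": (2, 4, 4), "K": (4, 2, 2)}


def average_scores(train_types, score_table=TRAIN_TYPE_SCORES):
    """Average each index over a passenger's travel records in the time window."""
    if not train_types:
        raise ValueError("at least one travel record is required")
    totals = [0.0, 0.0, 0.0]
    for train_type in train_types:
        for i, score in enumerate(score_table[train_type]):
            totals[i] += score
    return tuple(total / len(train_types) for total in totals)


def station_importance(daily_passengers):
    """Share of the day's total passengers handled by each station.

    daily_passengers maps station -> departures (origin side) or expected
    arrivals (destination side) for one day.
    """
    total = sum(daily_passengers.values())
    return {station: count / total for station, count in daily_passengers.items()}
```

The importance coefficients sum to 1 across stations, so they can be read as the probability that a randomly chosen passenger on that day used a given station.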
Preferably, the companion relation is defined as follows: in the passenger booking stream data acquired over a time interval, if passenger A and passenger B board at the same origin station on the same day, travel to the same destination, take the same train in the same carriage, and booked with the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relation.
Preferably, the extraction module is specifically configured to:
characterize the hidden state of a passenger with a vector, and convert discrete, qualitative data labels into continuous, quantitative data;
perform statistical analysis on the booking data within a preset time interval, describe the passenger's most probable booking mode by its maximum likelihood probability, and convert qualitative, labeled data into continuous, quantitative data;
aggregate the booking modes in all of the passenger's travel records within the preset time interval, and compute the entropy of the booking mode distribution and the number of distinct booking modes.
Preferably, the fitting module is specifically configured to:
treat the passenger's category as a hidden variable; given the passenger's category, the conditional distribution of the passenger hidden state vector follows a multivariate Gaussian distribution;
fit the probability density curve of passengers with a linear weighted sum of multiple Gaussian models, so as to model the real-world situation in which multiple categories may exist;
assume, according to the central limit theorem, that the conditional distribution of the passenger vector given the passenger category is normal; a weighted sum of multiple Gaussian models can approximate any probability distribution to arbitrary precision, and each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the result.
The method and device for ticketing information analysis provided by the embodiments of the present invention extract booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations, and characterize the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; if the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of passenger hidden state vectors. This makes it possible to track a passenger's latent state dynamically and to estimate the passenger's booking behavior pattern accurately, with a degree of tolerance to erroneous data. An iterative state estimation algorithm is developed that can extract effective behavior features from massive data in parallel, identify abnormal patterns in passenger booking behavior, and meet real-time requirements on computational time efficiency. The model's output should meet a stability requirement: the judgment of a given person's booking pattern should remain consistent within a specific period. The model's output can be presented from many aspects and angles, which is convenient for data analysts.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method of ticketing information analysis provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a passenger booking behavior model and feature extraction provided by an embodiment of the present invention;
Fig. 3 is an age distribution bar chart provided by an embodiment of the present invention;
Fig. 4 is a pie chart of passenger age categories provided by an embodiment of the present invention;
Fig. 5 is a trip purpose distribution pie chart provided by an embodiment of the present invention;
Fig. 6 is a train type preference distribution pie chart provided by an embodiment of the present invention;
Fig. 7 is a booking mode distribution pie chart provided by an embodiment of the present invention;
Fig. 8 is a bar chart of the top 100 stations nationwide by number of boarding passengers provided by an embodiment of the present invention;
Fig. 9 is a bar chart of the top 100 stations nationwide by number of alighting passengers provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of enumerating candidate frequent 3-itemsets provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of counting candidate frequent itemsets provided by an embodiment of the present invention;
Fig. 12 is a schematic diagram of association rule generation provided by an embodiment of the present invention;
Fig. 13 is a functional block diagram of a device for ticketing information analysis provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here serve only to explain the embodiments of the present invention and do not limit them. It should also be noted that, for ease of description, the drawings show only the parts related to the embodiments of the present invention rather than the entire structure.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method of ticketing information analysis provided by an embodiment of the present invention. As shown in Fig. 1, the method of ticketing information analysis includes:
Step 101: extract the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and companion relationship;
Specifically, as shown in Fig. 2, Fig. 2 is a schematic diagram of a passenger's hidden state provided by an embodiment of the present invention. The passenger's booking behavior pattern is characterized from several aspects, such as the passenger's attribute information, trip purpose, booking counts, booking mode distribution, terminus distribution, and companion relationship.
The hidden state of a passenger is represented by a vector that takes both global and local information into account and comprehensively reflects each passenger's behavioral habits and preferences. For discrete data such as trip purpose and booking mode, the discrete, qualitative labels are converted into continuous, quantitative values. On the one hand, this provides more information for accurately describing user behavior habits; on the other hand, it makes it convenient to take derivatives when optimizing the model. For example, the booking data within a time interval are analyzed statistically and the passenger's most probable booking mode is described by its maximum likelihood probability, so that qualitative, label-type data are converted into a continuous, quantitative description. Some extreme-value information in the passenger's booking behavior pattern can be reflected at the same time. By aggregating the booking modes in all of the passenger's ride records within a period of time, the entropy of the booking mode distribution and the number of different booking modes are calculated to reflect the overall information of the booking mode distribution.
The attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
Specifically, the data file is counted by age to obtain the passenger age distribution bar chart shown in Fig. 3. Passenger ages are simply classified into several groups, such as post-90s, post-80s, and post-70s; the passenger age classification pie chart is shown in Fig. 4. Age: real type. The date of birth is parsed from the ID card information, and the age corresponding to the ID card number is calculated by subtracting the date in the ID card from the system date.
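The age calculation described above can be sketched as follows. This is only a minimal illustration: it assumes an 18-digit ID card number whose characters 7 to 14 hold the date of birth as YYYYMMDD, and the function name `age_from_id` and the sample ID number are made up for the example.

```python
from datetime import date

def age_from_id(id_number: str, today: date) -> int:
    """Return the age in whole years for an 18-digit ID number (assumed layout)."""
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    # Subtract one year if this year's birthday has not been reached yet.
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

# Hypothetical ID number with date of birth 1990-03-07.
print(age_from_id("110101199003070011", date(2010, 7, 10)))
```

Passing the reference date explicitly, rather than reading the system clock inside the function, keeps the result reproducible when the same booking data are re-analyzed later.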
The trip purpose distribution includes: according to the home province code parsed from the ID card information, combined with the administrative division codes of the starting station and the terminus, judging whether the home province code equals the administrative division codes of the starting station and the terminus, and dividing passengers into a predetermined number of categories by trip purpose, the categories being mutually exclusive and collectively exhaustive; wherein odh indicates that the starting station and the terminus are both consistent with the home province, that is, a short trip within the home province; odo indicates that the starting station and the terminus are consistent with each other but lie outside the home province, that is, a short trip within another province; o indicates leaving the home province to travel to another province; d indicates returning home from another province; and other indicates all other cases;
Specifically, according to the home province code parsed from the ID card information, combined with the administrative division codes of the starting station and the terminus, it is judged whether the home province code equals the administrative division codes of the starting station and the terminus. Passengers are divided into five major categories by trip purpose, which are mutually exclusive and collectively exhaustive. The trip purpose distribution pie chart is shown in Fig. 5.
Here:
odh: the starting station and the terminus are both consistent with the home province, that is, a short trip within the home province;
odo: the starting station and the terminus are consistent with each other but lie outside the home province, that is, a short trip within another province;
o: leaving the home province to travel to another province;
d: returning home from another province;
other: all other cases.
As can be seen from Fig. 5, short trips within a province and trips home account for the dominant share.
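The five categories can be illustrated with a small classifier. This is a sketch under stated assumptions: the three inputs are province-level codes (for example, the first two digits of the ID card number and of each station's administrative division code), and the function name `trip_purpose` is hypothetical.

```python
def trip_purpose(home: str, origin: str, dest: str) -> str:
    """Classify one trip from province-level codes (home, starting station, terminus)."""
    if origin == home and dest == home:
        return "odh"    # short trip inside the home province
    if origin == dest:
        return "odo"    # short trip inside another province
    if origin == home:
        return "o"      # leaving the home province
    if dest == home:
        return "d"      # returning home from another province
    return "other"      # origin, destination, and home are all different

print(trip_purpose("11", "11", "31"))
```

Because the branches are checked in order and each trip returns exactly one label, the categories are mutually exclusive and every trip receives a category.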
All trip records of a passenger are counted, and each trip record is assigned to one of the five categories above. Aggregating all the categories gives a discrete trip purpose data set, and the trip purpose is described by the probabilities of the five categories. To keep the data dimensions independent of one another (the five probabilities sum to 1, so one of them is redundant), the probability of the other category is removed and an entropy dimension of the trip purpose distribution is added.
o probability: in the passenger's trip records, the frequency of category o divided by the total number of trips.
odo probability: in the passenger's trip records, the frequency of category odo divided by the total number of trips.
d probability: in the passenger's trip records, the frequency of category d divided by the total number of trips.
odh probability: in the passenger's trip records, the frequency of category odh divided by the total number of trips.
Entropy of the trip purpose distribution: across all of the passenger's trip records, the trip purpose category of each record is recorded in turn, giving an array that forms a discrete distribution; the entropy of that discrete distribution is then calculated.
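The category probabilities and the entropy dimension can be computed as in the sketch below. The text does not state which entropy definition or logarithm base is used; Shannon entropy with the natural logarithm is assumed here, and the function names are illustrative.

```python
import math
from collections import Counter

def category_probabilities(labels):
    """Relative frequency of each category in a passenger's trip records."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def entropy(labels):
    """Shannon entropy (natural log, assumed) of the empirical distribution."""
    return -sum(p * math.log(p) for p in category_probabilities(labels).values())

trips = ["o", "d", "d", "odh"]          # hypothetical trip purpose sequence
print(category_probabilities(trips))
print(entropy(trips))
```

A passenger whose trips all fall into one category has entropy 0, and the entropy grows as the trips spread more evenly over the categories.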
The booking counts include the number of ticket changes, the number of refunds, and the number of valid bookings. The number of ticket changes is the frequency of records whose state is 3 in the booking records; the number of refunds is the frequency of records whose state is 2 in the booking records; and the number of valid bookings is the frequency of records whose state is 5 in the booking records;
Specifically, number of ticket changes: the frequency of records whose state is 3 in the booking records. Number of refunds: the frequency of records whose state is 2 in the booking records. Number of valid bookings: the frequency of records whose state is 5 in the booking records. All other features of the passenger booking behavior pattern are calculated from the valid booking records.
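As a small illustration, the three counts can be tallied from the state field of a passenger's booking records. The state codes 2, 3, and 5 come from the text; the function name and the output keys are hypothetical.

```python
from collections import Counter

def booking_counts(states):
    """Count changed (3), refunded (2), and valid (5) booking records."""
    c = Counter(states)
    return {"changed": c[3], "refunded": c[2], "valid": c[5]}

print(booking_counts([5, 5, 3, 2, 5, 1]))
```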
The train type distribution includes: when a passenger has multiple ride records, obtaining the sequence of different train types, calculating the economy, rapidity, and comfort scores of each train type separately, summing them, and dividing by the number of ride records, so that each index score is the average of the corresponding index over all ride records within a preset time;
Specifically, the choice of train type reflects, from one angle, the passenger's income and trip purpose, and provides extra information for an accurate passenger profile.
Train type codes all carry a specific meaning. The valid booking records mainly contain K (fast), G (high-speed rail), D (multiple-unit train), Z (non-stop express), C (intercity), T (express), Y (tourist), four-digit train numbers beginning with 6/7/8/9 (ordinary passenger trains), and L (temporary passenger train). The train type preference distribution pie chart is shown in Fig. 6.
The train type is described in terms of three aspects: economy, rapidity, and comfort. Economy is measured mainly by the ticket price per unit of distance. Rapidity is measured by the running speed of the train, whether it stops at intermediate stations, the length of the stops, and so on. Comfort considers the passenger's subjective riding experience and is measured by the average train space per person and the train's dwell time per station.
Economy: it can be measured by the fare per kilometer of the train type; the lower the fare per kilometer, the higher the economy value. Suppose the fare per kilometer of a train type is x yuan and the average fare per kilometer over all train types is mean; then the economy index value y can be calculated by the following formula:
y = sigmoid(a * (mean - x)) = 1 / (1 + e^(-a * (mean - x)))
The economy index value y is a decreasing function of x, namely a sigmoid function, whose range is confined between 0 and 1. Setting different values of the parameter a controls the slope of the function: the larger a is, the steeper the function; the smaller a is, the flatter the function. When x equals mean, y equals 0.5. When x varies near the average mean, y changes approximately linearly; when x is far from the average, y changes only gently. A y value greater than 0.5 indicates that the fare x of the train type is below the average fare mean; a y value less than 0.5 indicates that the fare x is above the average fare mean.
Rapidity: it is calculated from the average speed of the train type; the higher the average speed of the train type, the higher the rapidity index. The rapidity index value is likewise calculated with a sigmoid function. Suppose the average speed of a train type is v km/h and the mean of the average speeds of all train types is mean; then:
y = sigmoid(a * (v - mean)) = 1 / (1 + e^(-a * (v - mean)))
The rapidity index value y is an increasing function of v, whose range is confined between 0 and 1; a controls the slope of the function curve, and the larger a is, the steeper the function. When v equals mean, y equals 0.5; when y is greater than 0.5, the train speed v is above the average speed mean.
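The behavior of the economy and rapidity indexes can be checked with a short sketch. It assumes the sigmoid takes the slope-scaled difference from the mean as its argument, which matches the properties stated above (value 0.5 at the mean, slope controlled by a); the function names and sample numbers are illustrative only.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def economy_index(fare_per_km: float, mean_fare: float, a: float = 1.0) -> float:
    # Decreasing in the fare: a cheaper-than-average fare gives a value above 0.5.
    return sigmoid(a * (mean_fare - fare_per_km))

def rapidity_index(speed: float, mean_speed: float, a: float = 1.0) -> float:
    # Increasing in the speed: a faster-than-average train gives a value above 0.5.
    return sigmoid(a * (speed - mean_speed))

print(economy_index(0.3, 0.3))              # fare equal to the mean
print(rapidity_index(300.0, 120.0, a=0.01)) # hypothetical high-speed train
```

Because fares and speeds live on very different scales, a separate slope a would normally be chosen for each index so that typical deviations from the mean fall in the near-linear part of the curve.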
Comfort: it is measured by considering the train space per person and the train's dwell time per station. The larger the train space per person and the shorter the dwell time per station, the higher the comfort. Calculating the comfort index value is more involved: comfort y is an increasing function of the train space per person s and a decreasing function of the dwell time per station t. First s and t are made dimensionless, then their signed weighted sum weightedSum is computed, and finally the sigmoid function is applied. Suppose that over all train types the maximum of the space per person s is smax and the minimum is smin, and the maximum of the dwell time per station t is tmax and the minimum is tmin; then:
sstd = (s - smin) / (smax - smin)
tstd = (t - tmin) / (tmax - tmin)
weightedSum = ws * sstd - wt * tstd
The weightedSum values of all train types are calculated and their average aws is taken; then
y = sigmoid(weightedSum - aws)
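The comfort calculation can be sketched end to end as follows. Min-max scaling is assumed for the dimensionless step, the weights ws and wt are free parameters that the text does not fix (0.5 each here), and the sample values are invented; the sketch also assumes the values are not all identical, so that the scaling denominators are nonzero.

```python
import math

def min_max(values):
    """Scale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def comfort_indexes(space, dwell, ws=0.5, wt=0.5):
    """Comfort index per train type from space per person and dwell time per station."""
    s_std, t_std = min_max(space), min_max(dwell)
    weighted = [ws * s - wt * t for s, t in zip(s_std, t_std)]
    aws = sum(weighted) / len(weighted)          # average weighted sum
    return [1.0 / (1.0 + math.exp(-(w - aws))) for w in weighted]

# Three hypothetical train types: more space and shorter stops score higher.
print(comfort_indexes([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))
```

Centering on the average weighted sum means a train type of average comfort scores 0.5, in the same way as the economy and rapidity indexes.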
Within a period of time, a passenger has multiple ride records, which give a sequence of different train types. The economy, rapidity, and comfort scores of each train type are calculated separately, summed, and divided by the number of ride records; that is, each index score is the average of the corresponding index over all ride records in that period.
The booking mode distribution includes: when a passenger has multiple ride records, obtaining the sequence of different booking modes, calculating the economy and rapidity scores of each booking mode separately, summing them, and dividing by the number of ride records, so that each index score is the average of the corresponding index over all ride records within a preset time;
Specifically, booking mode preference reflects, to some extent, the passenger's economic level and behavioral habits. For example, university students are more inclined to buy tickets online, while business travelers may prefer third-party ticket agents. All valid booking records are counted by booking mode.
The booking mode distribution pie chart is shown in Fig. 7.
Number of booking modes: the number of different booking modes used across all booking records. Note the difference between this value and the number of valid bookings: it counts the distinct booking modes used, and the same booking mode is not counted repeatedly.
Maximum likelihood probability: the probability of the most frequently used booking mode across all booking records, calculated as its frequency divided by the total number of bookings.
Entropy of the booking mode distribution: all booking records of a single passenger are aggregated to obtain the set of booking modes used in those records, the frequency of each distinct item (booking mode) in the set is counted, and the entropy of the discrete distribution is calculated.
Average metric coefficient of the booking mode: each booking mode is measured in terms of economy and rapidity. Economy mainly reflects the financial cost of using the booking mode; for example, buying through a ticket agent involves certain travel costs and a service charge, and the higher the cost incurred, the lower the economy. Rapidity mainly considers the time cost of the booking mode, that is, how long it takes: the longer the time spent, the lower the rapidity, and the shorter the time spent, the higher the rapidity. Each attribute value is calculated in a way similar to that used for train type preference.
Economy
Suppose the financial cost of a booking mode is x yuan and the average financial cost of all booking modes is mean yuan; then the economy index value y is a decreasing function of the financial cost x:
y = sigmoid(mean - x)
Rapidity
Suppose the time cost of a booking mode is x minutes and the average time cost of all booking modes is mean minutes; then the rapidity index value y is a decreasing function of the time cost x:
y = sigmoid(mean - x)
Within a period of time, a passenger has multiple ride records, which give a sequence of different booking modes. The economy and rapidity scores of each booking mode are calculated separately, summed, and divided by the number of ride records; that is, each index score is the average of the corresponding index over all ride records in that period.
The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the starting station distribution. The number of starting stations is obtained by aggregating all booking records of a passenger, with the passenger's ID card number as the key, and counting the distinct starting stations that appear in those records. The maximum likelihood probability is the probability of the station that appears most often in each passenger's booking records, calculated as its frequency divided by the total number of bookings. The station importance coefficient of each station is calculated as the station's total number of departing passengers on the day divided by the total number of departing passengers at all stations. The entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger to obtain the set of starting stations in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the discrete distribution;
Specifically, in order to measure the volume of passenger traffic and technical operations, as well as the political and economic standing of the area a station serves, China has established a dedicated classification standard for passenger stations and combined stations, as follows:
Special-class station: more than 60,000 boarding, alighting, and transfer passengers per day on average, or more than 20,000 pieces of transit luggage and parcels
First-class station: more than 15,000 boarding, alighting, and transfer passengers per day on average, or more than 1,500 pieces of transit luggage and parcels
Second-class station: more than 5,000 boarding, alighting, and transfer passengers per day on average, or more than 500 pieces of transit luggage and parcels
Third-class station: more than 2,000 boarding, alighting, and transfer passengers per day on average, or more than 100 pieces of transit luggage and parcels
Other
For the passenger booking data of the selected week from 20100710 to 20100716, statistics are compiled by the starting station of each booking record. Taking the top 100 stations by daily number of boarding or transfer passengers, their total number of departing passengers accounts for 61% of all passengers, and the station importance coefficient of the 100th-ranked station is 0.001918. The bar chart of the distribution of the top 100 stations nationwide ranked by number of boarding passengers is shown in Fig. 8.
To describe passenger booking behavior patterns more accurately and more fully, information about the stations a passenger frequents is added to the extracted features, and the following index values are used to reflect the distribution of starting stations in the passenger's booking behavior pattern.
Number of starting stations: with the passenger's ID card number as the key, all booking records of the passenger are aggregated, and the number of distinct starting stations appearing in those records is counted.
Maximum likelihood probability: the probability of the station that appears most often in each passenger's booking records, calculated as its frequency divided by the total number of bookings.
Station importance coefficient: the importance coefficient of each station is calculated as the station's total number of departing passengers on the day divided by the total number of departing passengers at all stations. When a passenger has multiple ride records within a period of time, the station importance coefficient is measured by the average, maximum, and minimum of the importance coefficients of the starting stations in all ride records.
Entropy of the starting station distribution: all booking records of a single passenger are aggregated to obtain the set of starting stations in those records, the frequency of each distinct item in the set is counted, and the entropy of the discrete distribution is calculated.
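A minimal sketch of the station importance coefficient and its per-passenger summary is given below. The input shapes (a mapping from station to the number of passengers sent that day, and a list of one passenger's starting stations) and all names are assumptions made for the illustration.

```python
def importance_coefficients(departures):
    """departures: station -> passengers sent that day (hypothetical input)."""
    total = sum(departures.values())
    return {station: n / total for station, n in departures.items()}

def passenger_station_features(start_stations, coeff):
    """Summarize one passenger's starting stations over all ride records."""
    values = [coeff[s] for s in start_stations]
    return {
        "n_stations": len(set(start_stations)),      # distinct starting stations
        "importance_mean": sum(values) / len(values),
        "importance_max": max(values),
        "importance_min": min(values),
    }

coeff = importance_coefficients({"A": 600, "B": 300, "C": 100})
print(passenger_station_features(["A", "A", "C"], coeff))
```

The same two functions apply to the terminus features if the expected number of arriving passengers is supplied instead of the number of departing passengers.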
The terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient, and the entropy of the terminus distribution. The number of termini is obtained by aggregating all booking records of a passenger, with the passenger's ID card number as the key, and counting the distinct termini that appear in those records. The maximum likelihood probability is the probability of the station that appears most often in each passenger's booking records, calculated as its frequency divided by the total number of bookings. The station importance coefficient of each station is calculated as the station's total expected number of arriving passengers on the day divided by the total expected number of arriving passengers at all stations. The entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger to obtain the set of termini in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the discrete distribution.
Specifically, for the passenger booking data of the selected week from 20100710 to 20100716, statistics are compiled by the terminus of each booking record. Taking the top 100 stations by daily number of alighting or transfer passengers, their total number of passengers accounts for 59% of all passengers, and the station importance coefficient of the 100th-ranked station is 0.001949. The bar chart of the distribution of the top 100 stations nationwide ranked by number of alighting passengers is shown in Fig. 9.
To describe passenger booking behavior patterns more accurately and more fully, information about the stations a passenger frequents is added to the extracted features, and the following index values are used to reflect the distribution of termini in the passenger's booking behavior pattern.
Number of termini: with the passenger's ID card number as the key, all booking records of the passenger are aggregated, and the number of distinct termini appearing in those records is counted.
Maximum likelihood probability: the probability of the station that appears most often in each passenger's booking records, calculated as its frequency divided by the total number of bookings.
Station importance coefficient: the importance coefficient of each station is calculated as the station's total expected number of arriving passengers on the day divided by the total expected number of arriving passengers at all stations. When a passenger has multiple ride records within a period of time, the station importance coefficient is measured by the average, maximum, and minimum of the importance coefficients of the termini in all ride records.
Entropy of the terminus distribution: all booking records of a single passenger are aggregated to obtain the set of termini in those records, the frequency of each distinct item in the set is counted, and the entropy of the discrete distribution is calculated.
Preferably, the companion relationship includes: in the passenger booking transaction data obtained within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and bought their tickets using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relationship.
Specifically, passengers often book and travel together with a certain group. Analyzing a passenger's companion relationships can reveal people who are closely related to the passenger or who have similar behavioral habits, and can uncover valuable patterns. The companion relationship is characterized mainly by the number of companions.
Number of companions: the number of passengers who have a companion relationship with the passenger in the transaction data set obtained within a time interval.
Companion relationship analysis:
With incomplete information, there is no deterministic, universal rule for judging whether a companion relationship exists between different passengers. The companion relationship here is therefore not the "companion relationship" of everyday speech; we prefer to describe the co-travel relationship among multiple passengers as an association relationship. When performing association analysis on ride transaction data, two key problems need to be handled: 1) performance: the ride transaction data set is large, and the time and space complexity of the computation is high; 2) some of the patterns found may be spurious, so the generated association rules must be evaluated and spurious patterns rejected.
Definition of the companion relationship:
The companion relationship here is built on the support-confidence framework and differs from the "companion relationship" of daily life. For example, passenger A and passenger B travel together by train, but only a few times, so the support threshold is not met; in the absence of other valid information, it cannot be concluded that passenger A and passenger B have a companion relationship.
Before the companion relationship is defined, the basic concepts of association analysis are given first.
Transaction:
A transaction is a set of items with no repeated items. For example, someone shopping online buys 2 loaves of bread, 1 bottle of beer, and 3 packs of diapers in a single shopping session; the set {bread, beer, diapers} can then be called a transaction.
The elements of a transaction are called items. The width of a transaction is the number of elements in it.
Association rule:
An association rule is a rule of the form X → Y, where X ∩ Y = ∅. X is the antecedent of the rule and Y is the consequent; X and Y have a certain causal relationship.
Support:
Suppose there is a set of transactions T = {t1, t2, ..., tn}, where n is the number of transactions, and X is a set of items appearing in the transactions. The support count of X is defined as:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|
where |·| denotes the number of elements in a set. Support measures how frequently the itemset X appears in the transaction set T.
The support of the association rule X → Y is defined as:
s(X → Y) = σ(X ∪ Y) / n
Confidence:
The confidence of the association rule X → Y is defined as:
c(X → Y) = σ(X ∪ Y) / σ(X)
Confidence measures the reliability of the association rule X → Y: the higher the confidence, the higher the probability that Y appears in transactions that contain X.
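Support and confidence can be computed directly from their standard definitions, as in the following sketch. The function names are illustrative, and the shopping transactions extend the bread, beer, and diapers example with invented data.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= set(t))

def support(x, y, transactions):
    """Fraction of transactions that contain both X and Y."""
    return support_count(set(x) | set(y), transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return support_count(set(x) | set(y), transactions) / support_count(x, transactions)

T = [{"bread", "beer", "diapers"}, {"bread", "beer"}, {"bread"}, {"diapers"}]
print(support({"bread"}, {"beer"}, T))
print(confidence({"bread"}, {"beer"}, T))
```

In the companion setting, the items are passenger ID numbers, so the confidence of the rule A → B is the share of A's shared-trip transactions in which B also appears.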
Definition of the companion relationship:
Passengers traveling together generally satisfy certain relations, for example boarding at the same starting station on the same day, traveling to the same destination station, taking the same train in the same carriage with adjacent or nearby seat numbers, and buying tickets at the same window of the same station.
The companion relationship is defined here as follows:
In the passenger booking transaction data obtained within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and bought their tickets using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relationship.
It should be noted that: 1) The definition is based on the support-confidence framework and looks for frequent patterns; two passengers who traveled together only once but meet the other companion conditions may be filtered out because they do not meet the minimum support threshold.
2) The time interval may span several days, to avoid passengers who make only one trip a day being filtered out by the support rule.
3) The distance between passengers' seat numbers is not measured, because there is no clear rule for deciding within what distance two people have a companion relationship and beyond what distance they do not. In real life, people traveling together are not necessarily seated next to each other.
4) This definition fixes the exact scope of a transaction: several conditions must be met at the same time for a companion relationship to hold. The scope of a transaction can be adjusted to the actual data, but the average transaction width determines the depth of the association analysis search space, and the computational complexity of the algorithm grows exponentially as the average transaction width increases.
Ride transaction data analysis:
According to the definition of the companion relationship above, the ride booking data are mapped to key-value pairs whose key is the string formed by concatenating date + train number + starting station + terminus + carriage number + booking mode + booking station + booking window, and whose value is the passenger's ID card number. The pairs are reduced by key so that each value becomes a set of ID card numbers. The values of all key-value pairs are extracted and saved to an HDFS file; each value is a transaction.
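The map-and-reduce step can be illustrated in memory with plain Python. This sketch stands in for the distributed job and HDFS output described above; the record field names in `KEY_FIELDS` and `id_number` are hypothetical, since the text does not give the actual schema.

```python
from collections import defaultdict

# Hypothetical field names for the eight parts of the key.
KEY_FIELDS = ("date", "train", "start", "end", "coach", "mode", "sale_station", "window")

def build_transactions(records):
    """Group passenger ID numbers by the concatenated key; each group is a transaction."""
    groups = defaultdict(set)
    for r in records:
        key = "+".join(str(r[f]) for f in KEY_FIELDS)
        groups[key].add(r["id_number"])      # a set drops repeated ID numbers
    # Items are sorted so that later candidate generation sees a fixed order.
    return [sorted(ids) for ids in groups.values()]

base = dict(date="20100710", train="G1", start="BJ", end="SH", coach="05",
            mode="window", sale_station="BJ", window="3")
records = [dict(base, id_number="B"), dict(base, id_number="A"),
           dict(base, coach="06", id_number="C"), dict(base, id_number="A")]
print(build_transactions(records))
```

Storing each group as a set also collapses the case where one ID number buys several tickets for the same train, so that a passenger appears at most once per transaction.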
Simple statistical analysis of the passenger booking data:
On average about 6,000,000 passengers travel per day, so the item space is very large.
Each train carries 855 passengers on average; under the definition of the companion relationship above, the average transaction width is 10.
The ride booking data are very sparse: 62% of passengers bought only one ticket in this time interval, so the ride transaction data leave little room for compression. The histogram of passengers' average daily booking counts is shown in Fig. 11.
About 1.15% of people bought multiple tickets with the same ID card for the same train, which causes considerable difficulty for the subsequent companion relationship analysis.
Companion relationship analysis method:
Given that the ride transaction data leave little room for compression, we use the Apriori algorithm to analyze the association relationships between passengers. Apriori was the first association rule mining algorithm; it pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. The main idea of the Apriori algorithm is: if an itemset is frequent, then all of its subsets are also frequent; if an itemset is infrequent, then all of its supersets must be infrequent.
Frequent itemset generation in the Apriori algorithm has two important characteristics: 1) it is a level-wise algorithm, iterating step by step from frequent 1-itemsets until the longest frequent itemsets are generated; 2) it finds frequent itemsets with a generate-and-test strategy. New candidate itemsets are generated from the frequent itemsets found in the previous iteration; each candidate is then counted and compared with the minimum support threshold, and those that meet the threshold are frequent itemsets.
The general idea of the algorithm is as follows:
1) First, scan the whole data set once to determine the support of each item. After this step, the frequent 1-itemsets are obtained.
2) Use the frequent (k-1)-itemsets found in the previous iteration to generate candidate k-itemsets.
3) Scan the transaction data set, determine all candidate k-itemsets contained in each transaction, and calculate the support count of each candidate.
4) Filter the candidate k-itemsets by the minimum support threshold to obtain all frequent k-itemsets.
5) Repeat steps 2-4 until no new frequent itemsets are generated or another stopping condition is met.
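The five steps above can be sketched as a compact in-memory routine. This is an illustrative version, not the implementation of the embodiment: it counts support by scanning every transaction for every candidate, takes the threshold as an absolute count `min_count` (a name chosen for the example), and assumes items are mutually comparable values such as strings.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Step 1: one scan to count single items and keep the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Step 2: merge frequent (k-1)-itemsets that share their first k-2 items.
        candidates = set()
        for a, b in combinations(sorted(frequent), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune candidates that have an infrequent (k-1)-subset.
                if all(sub in frequent for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        # Step 3: count the support of each candidate.
        counts = {c: sum(1 for t in transactions if t.issuperset(c)) for c in candidates}
        # Step 4: keep candidates that meet the minimum support count.
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1   # Step 5: repeat until no new frequent itemsets appear.
    return result

T = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
print(apriori(T, min_count=3))
```

In this example every pair of passengers is frequent, but the triple A, B, C appears in only two transactions and is filtered out in the third pass.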
Candidate itemset generation:
In theory there are many ways to generate candidate itemsets. To reduce the time and space complexity of the computation, a method should meet several requirements:
1) Avoid generating too many unnecessary candidates.
2) Ensure that the set of candidates is complete, so that no frequent itemset is omitted during generation, and that no candidate is generated more than once. During data preprocessing, all items in a transaction are sorted in ascending lexicographic order precisely to avoid generating duplicate candidates and wasting computing resources.
Common generation algorithms are:
1) The $F_{k-1} \times F_1$ method: each frequent (k-1)-itemset is extended with a frequent 1-itemset. Suppose A is a frequent (k-1)-itemset and B is a frequent 1-itemset, with the items of both stored in lexicographic order. If the item in B is greater than every item in A, the union of A and B is taken as the extension. Although this method is a significant improvement over exhaustive enumeration, it can still produce a large number of unnecessary candidates.
2) The $F_{k-1} \times F_{k-1}$ method: a candidate k-itemset is generated by merging a pair of frequent (k-1)-itemsets whose first k-2 items are identical. Suppose A and B are frequent (k-1)-itemsets with items stored in lexicographic order. If the first k-2 items of A and B are all identical (k ≥ 2), the union of A and B is taken as the extension, and the result is sorted in lexicographic order.
The second method is used here.
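The prefix-merge rule of the second method can be illustrated with a short Python sketch. The function name `merge_candidates` and the representation of itemsets as sorted tuples are assumptions made for this example.

```python
def merge_candidates(frequent_prev):
    """Merge frequent (k-1)-itemsets that share their first k-2 items.

    frequent_prev: iterable of tuples, each sorted in ascending order.
    Returns the candidate k-itemsets as sorted tuples.
    """
    prev = sorted(frequent_prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                # Same (k-2)-prefix: a < b, so appending b's last item
                # keeps the merged tuple in ascending order.
                candidates.append(a + (b[-1],))
            else:
                # prev is sorted, so itemsets sharing a prefix are adjacent;
                # once the prefix differs, no later itemset can match.
                break
    return candidates

pairs = merge_candidates([(1,), (2,), (3,)])
triples = merge_candidates([(1, 2), (1, 3), (2, 3), (2, 4)])
```

Because each candidate is produced from exactly one pair of parents, no duplicates arise, which is the reason the items are kept in lexicographic order.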
Candidate frequent-itemset counting:
One method of support counting is exhaustive: traverse all candidates, compare each candidate with every transaction, and determine whether the candidate is contained in the transaction. This method traverses the transaction set many times; when both the transaction set and the number of candidate itemsets are large, the time and space costs are high.
Another method is to traverse the transaction set, enumerate the k-itemsets contained in each transaction, and update the support of the corresponding candidate itemsets. A prefix tree can be used to enumerate all possible k-itemsets of a transaction. To compare candidate itemsets with transactions, a hash tree (HashTree) is used: the candidates are divided into different buckets and stored in the HashTree; during support counting, the corresponding HashTree branches are selected in turn according to the item order of the transaction, and the enumerated itemsets are finally matched against the candidates in the same bucket.
Suppose there is a transaction {1,2,3,4,5} whose items are in ascending lexicographic order, and all possible 3-itemsets are to be enumerated. The enumeration process is shown in Figure 10:
1) First set the depth of the prefix tree, depth = k, and initialize the iteration parameter i = 1.
2) While the condition i ≤ depth holds, continue the iteration.
3) Determine the start item and end item of the level-i nodes: the start item is the item following the parent node's item, and the end item is the (3-i+1)-th item from the end of the transaction (that is, the (k-i+1)-th with k = 3).
4) Take all items of the transaction between the start item and the end item, and generate one node for each.
5) i++, then go to the condition check in step 2).
6) Traverse all leaf nodes and, for each leaf node, output the path from the root node to that leaf.
The enumeration of all 3-itemsets of the transaction {1,2,3,4,5} is shown in Figure 11.
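The prefix-tree enumeration can be sketched as a depth-first recursion in which each call corresponds to one tree level. The function name `enumerate_itemsets` is an assumption of this example; the bound on the last admissible item at each level follows step 3) above.

```python
def enumerate_itemsets(transaction, k):
    """Enumerate all k-itemsets of a transaction via a prefix-tree walk."""
    items = sorted(transaction)
    n = len(items)
    paths = []

    def extend(path, start, level):
        if level > k:
            # A leaf was reached: the path from the root is one k-itemset.
            paths.append(tuple(path))
            return
        # The last admissible item at this level is the (k-level+1)-th item
        # from the end, so that enough items remain for the deeper levels.
        end = n - (k - level)
        for idx in range(start, end):
            extend(path + [items[idx]], idx + 1, level + 1)

    extend([], 0, 1)
    return paths

subsets = enumerate_itemsets({1, 2, 3, 4, 5}, 3)
```

For the transaction {1,2,3,4,5} and k = 3, the first level holds the items 1, 2, 3, the second level ends at item 4, and the third level ends at item 5, giving ten itemsets in lexicographic order.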
Build the HashTree and place all candidate 3-itemsets generated by the method described in Section 4.2.1 into its leaf nodes. For each transaction, select the corresponding branch level by level according to the HashTree hierarchy, then match the enumerated itemsets against the candidates in the same bucket. The process is shown in Figure 12.
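The bucket-matching idea can be illustrated with a simplified sketch. It is not a full hash tree: the tree is flattened into a dictionary keyed by the complete branch path, leaves are never split, and the hash function `item % branches` (which requires integer items) is an assumption chosen only for illustration, as are the function names.

```python
from itertools import combinations

def hash_path(itemset, branches=3):
    """Branch chosen at each level: one hash value per item position."""
    return tuple(item % branches for item in itemset)

def count_support(transactions, candidates, k, branches=3):
    """Count candidate k-itemsets by comparing only within matching buckets."""
    # Store every candidate in the bucket addressed by its branch path.
    buckets = {}
    for cand in candidates:
        buckets.setdefault(hash_path(cand, branches), []).append(cand)

    counts = {cand: 0 for cand in candidates}
    for t in transactions:
        # Enumerate the k-itemsets of the transaction in sorted order and
        # compare each only with the candidates stored in the same bucket.
        for subset in combinations(sorted(t), k):
            for cand in buckets.get(hash_path(subset, branches), []):
                if cand == subset:
                    counts[cand] += 1
    return counts

support = count_support(
    transactions=[{1, 2, 3, 5, 6}, {1, 2, 4}],
    candidates=[(1, 2, 4), (1, 2, 5), (4, 5, 7)],
    k=3,
)
```

Here (1,2,4) and (4,5,7) fall into the same bucket, so an enumerated itemset is compared with both, but never with candidates in other buckets.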
Candidate itemsets whose support does not meet the minimum support threshold are pruned directly and are not extended in later iterations.
Association rule generation:
The Apriori algorithm generates association rules level by level, where each level corresponds to the number of items in the rule consequent. Initially, all high-confidence rules whose consequent contains only one item are extracted; these rules are then used to generate new rules. Generating association rules is similar to generating frequent itemsets. The difference is that computing the confidence of a candidate rule does not require scanning the data set again: the confidence of each rule is determined from the support counts already computed during frequent-itemset generation.
The rule generation algorithm resembles the $F_{k-1} \times F_{k-1}$ method: first compute all rules with a single-item consequent that meet the confidence requirement, then merge two rules whose consequents have k-1 items to obtain a rule whose consequent has k items, and iterate level by level. The algorithm proceeds as follows:
Consider the association rule generation algorithm for the frequent itemset {1,2,3,4,5}.
The algorithm flow is as follows:
1) Traverse the set of frequent itemsets, rules, and extend each frequent itemset as follows.
2) Set depth = |rules|, where |·| denotes the number of elements in a set, and set d = 1.
3) Check whether the condition d < depth holds.
4) Apply the $F_{k-1} \times F_{k-1}$ method to merge two rule consequents with d-1 items into a rule consequent with d items; the rule antecedent is then the set of all items of the frequent itemset minus the newly generated consequent.
5) Check whether the newly generated rule meets the minimum confidence requirement; if not, prune that node and do not extend the rules of that node any further.
6) Set d = d+1 and go to step 3).
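The level-wise rule generation can be sketched as below for a single frequent itemset. The function name `generate_rules`, the `support` dictionary layout, and the toy support counts are assumptions made for this illustration; confidence is computed from stored support counts only, without rescanning the data.

```python
from itertools import combinations

def generate_rules(itemset, support, min_conf):
    """Generate rules antecedent -> consequent from one frequent itemset.

    support maps frozenset -> support count and must contain the itemset
    and all of its non-empty subsets (an assumption of this sketch).
    """
    itemset = frozenset(itemset)
    rules = []
    # Level 1: consequents with a single item.
    consequents = [frozenset([i]) for i in sorted(itemset)]
    while consequents:
        kept = []
        for cons in consequents:
            ante = itemset - cons
            if not ante:
                continue  # a rule needs a non-empty antecedent
            conf = support[itemset] / support[ante]
            if conf >= min_conf:
                rules.append((ante, cons, conf))
                kept.append(cons)  # only surviving consequents are extended
        # Next level: merge surviving consequents into consequents that are
        # one item larger; pruned consequents are never extended.
        size = len(kept[0]) + 1 if kept else 0
        consequents = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size},
            key=sorted,
        )
    return rules

toy_support = {
    frozenset({1, 2, 3}): 2,
    frozenset({1, 2}): 2, frozenset({1, 3}): 3, frozenset({2, 3}): 2,
    frozenset({1}): 4, frozenset({2}): 2, frozenset({3}): 3,
}
rules = generate_rules({1, 2, 3}, toy_support, min_conf=0.7)
```

In the toy data the rule {1,3} → {2} has confidence 2/3 and is pruned, so the consequent {2} is never merged into larger consequents.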
Association rule evaluation:
The Apriori algorithm is built on the support-confidence framework of association analysis. The confidence measure, however, has a limitation: it ignores the support of the itemset in the rule consequent, so a high-confidence rule can be misleading. To address this, the lift is used to measure the reliability of a rule. The lift is defined as follows:
$$\mathrm{Lift}(X \rightarrow Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$$
When $\mathrm{Lift}(X \rightarrow Y) > 1$, X and Y are considered positively correlated. When $\mathrm{Lift}(X \rightarrow Y) < 1$, X and Y are considered negatively correlated. When $\mathrm{Lift}(X \rightarrow Y) = 1$, X and Y are considered independent.
For ride transactions, rules with $\mathrm{Lift}(X \rightarrow Y) > 1$ are selected, and X and Y are then considered to have a travel-companion relation.
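As a small worked example of the lift, the sketch below computes it from raw counts. The function name `lift` and the counts are hypothetical; the formula is the standard ratio of the joint frequency to the product of the marginal frequencies.

```python
def lift(n_xy, n_x, n_y, n_total):
    """Lift of the rule X -> Y from transaction counts.

    n_xy: transactions containing both X and Y
    n_x, n_y: transactions containing X, respectively Y
    n_total: all transactions
    Equals (n_xy / n_total) / ((n_x / n_total) * (n_y / n_total)).
    """
    return (n_xy * n_total) / (n_x * n_y)

together_often = lift(n_xy=30, n_x=40, n_y=50, n_total=100)
independent = lift(n_xy=20, n_x=40, n_y=50, n_total=100)
```

With these hypothetical counts the first pair has a lift above 1 and would be kept as a candidate travel-companion pair, while the second pair has a lift of exactly 1 and would not.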
Preferably, extracting the booking behavior pattern features of the passenger from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and travel-companion relation includes:
Characterizing the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data;
Performing statistical analysis on the booking data within a preset time interval, describing the passenger's most likely booking mode by its maximum likelihood probability, and converting qualitative, labeled data into continuous, quantitative data;
Aggregating the booking modes in all of the passenger's ride records within the preset time interval, and calculating the entropy of the booking mode distribution and the number of distinct booking modes.
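The conversion of labeled booking modes into quantitative features can be sketched as follows. The function name, the returned field names, the sample labels, and the use of base-2 logarithms for the entropy are assumptions of this example; the source does not state the logarithm base.

```python
from collections import Counter
from math import log2

def booking_mode_features(modes):
    """Turn a passenger's booking-mode labels into numeric features."""
    counts = Counter(modes)
    total = len(modes)
    # Most frequent mode and its relative frequency (maximum likelihood
    # estimate of the probability of that mode).
    most_likely, freq = counts.most_common(1)[0]
    # Shannon entropy of the empirical booking-mode distribution.
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return {
        "most_likely_mode": most_likely,
        "ml_probability": freq / total,
        "entropy": entropy,
        "distinct_modes": len(counts),
    }

features = booking_mode_features(["online", "online", "window", "online"])
```

A passenger who always books the same way has entropy 0; the more evenly the bookings are spread over different modes, the higher the entropy.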
Step 102: the booking behavior pattern of a passenger is characterized by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space. If the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting (learning) the probability density distribution of the passenger hidden state vectors.
Specifically, the booking behavior pattern of a passenger is characterized here by a state vector, so each passenger can be regarded as a point in a high-dimensional space. The passenger's type is unknown, and determining it is treated as a generative learning process; the passenger category determination problem is thus converted into fitting the probability density distribution of the passenger hidden state vectors.
Suppose the passenger's category is z, where z is a hidden variable. Given the passenger's category, the conditional distribution of the passenger hidden state vector y is a multivariate Gaussian distribution. A linear weighted sum of several Gaussian models is used to fit the probability density of passengers, modeling the fact that several categories may exist in reality. Because the data set is large, the central limit theorem makes it reasonable to assume that the conditional distribution of the passenger vector y given the category z is normal. In theory, a weighted sum of Gaussian models can approximate any probability distribution arbitrarily closely. Each Gaussian model represents one class. For a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the result.
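The decision rule described above (compute the probability of each class, then pick the largest) can be sketched as follows. To keep the example dependency-free it assumes diagonal covariance matrices, which is a simplification of the general multivariate Gaussian in the text; the function names and the parameter values are hypothetical.

```python
from math import exp, pi, sqrt

def diag_gaussian_pdf(y, mean, var):
    """Density of a Gaussian with diagonal covariance (simplifying assumption)."""
    p = 1.0
    for yi, mi, vi in zip(y, mean, var):
        p *= exp(-(yi - mi) ** 2 / (2.0 * vi)) / sqrt(2.0 * pi * vi)
    return p

def classify(y, weights, means, variances):
    """Return (most probable class index, posterior probability of each class)."""
    # Weighted density of each Gaussian component at y.
    scores = [w * diag_gaussian_pdf(y, m, v)
              for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    posteriors = [s / total for s in scores]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
label_a, post_a = classify([0.5, -0.2], weights, means, variances)
label_b, post_b = classify([4.8, 5.1], weights, means, variances)
```

A production version would work with log-densities to avoid numerical underflow for points far from every component.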
Preferably, characterizing the booking behavior pattern of a passenger by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's type is unknown, converting the problem of determining the passenger's category into the process of fitting the probability density distribution of the passenger hidden state vectors, includes:
Letting the passenger's category be z, where z is a hidden variable; given the passenger's category, the conditional distribution of the passenger hidden state vector y is a multivariate Gaussian distribution;
Using a linear weighted sum of several Gaussian models to fit the probability density of passengers, modeling the fact that several categories may exist in reality;
Assuming, on the basis of the central limit theorem, that the conditional distribution of the passenger vector y given the passenger category z is normal; a weighted sum of Gaussian models approximates any probability distribution arbitrarily closely, and each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the result.
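Specifically, the GMM model:
Suppose the passenger hidden state vector follows a Gaussian mixture model with K sub-models; then the probability of observing the j-th passenger can be described as:
$$P(y_j \mid \theta) = \sum_{k=1}^{K} \alpha_k\, \phi(y_j; \theta_k), \qquad \theta_k = (\mu_k, C_k)$$
$$\phi(y_j; \theta_k) = \frac{1}{(2\pi)^{n/2}\,|C_k|^{1/2}} \exp\left(-\frac{1}{2}\,(y_j - \mu_k)^{T} C_k^{-1} (y_j - \mu_k)\right)$$
Here $\alpha_k$ is the probability that the passenger's category is k, with $\alpha_k \ge 0$ and $\sum_{k=1}^{K} \alpha_k = 1$; $\phi(y_j; \theta_k)$ is the conditional probability density function of $y_j$ given that the passenger's category is k, namely the density of a multivariate Gaussian distribution with mean vector $\mu_k$ and covariance matrix $C_k$; $C_k$ is a positive definite matrix; and n is the dimension of the vector $y_j$.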
The passenger hidden state vector can be thought of as generated in two steps: 1) select the k-th Gaussian model $\phi(y_j; \theta_k)$ with probability $\alpha_k$; 2) generate the observation $y_j$ from the probability density function $\phi(y_j; \theta_k)$ of the selected model. The observation $y_j$ is then known, but which sub-model generated it is unknown; this is represented by the hidden variable $\gamma_{jk}$, defined as follows:
$\gamma_{jk} = 1$ if the j-th observation comes from the k-th sub-model;
$\gamma_{jk} = 0$ otherwise.
With the observed data $y_j$ and the unobserved data $\gamma_{jk}$, the complete data are $(y_j, \gamma_{j1}, \gamma_{j2}, \ldots, \gamma_{jK})$, j = 1, 2, ..., N, where K is the number of sub-models. The likelihood function of the complete data can then be written as:
$$P(y, \gamma \mid \theta) = \prod_{k=1}^{K} \alpha_k^{\,n_k} \prod_{j=1}^{N} \left[\phi(y_j; \theta_k)\right]^{\gamma_{jk}}$$
Here $n_k = \sum_{j=1}^{N} \gamma_{jk}$ is the number of observations that come from the k-th sub-model, $\sum_{k=1}^{K} n_k = N$, and N is the total number of observations.
The log-likelihood function of the complete data is therefore:
$$\log P(y, \gamma \mid \theta) = \sum_{k=1}^{K} \left\{ n_k \log \alpha_k + \sum_{j=1}^{N} \gamma_{jk} \log \phi(y_j; \theta_k) \right\}$$
The model is solved with the EM (Expectation Maximization) iterative algorithm. The EM algorithm guarantees that the process of maximizing the log-likelihood function converges, but it does not guarantee a globally optimal solution. The EM algorithm can therefore be run several times and the run with the largest log-likelihood taken, so that the model achieves a reasonably good fit.
Iterative solution with the EM algorithm:
The EM algorithm is iterative and requires initial values to be specified manually. Training a good GMM model is difficult because the choice of initial values strongly influences the final result and can even determine whether the model converges; at certain initial points the EM algorithm becomes unstable or fails. The choice of initial values also affects the convergence speed. Parameter initialization is therefore a crucial step in EM parameter estimation.
Model parameter initialization:
Common model parameter initialization strategies are:
Grid search
A coordinate axis is set up for each free parameter and the value range of each parameter is determined. Each parameter is then sampled at equal intervals, and the Cartesian product of the value sets of all parameters gives a grid in a high-dimensional space, in which each point represents one assignment of all parameters. Each grid point is used as an initial value for the EM algorithm, and the best solution over all cases is taken.
The main problem with grid search is that its computational cost grows exponentially with the number of free parameters.
Random search
This strategy is similar to grid search, except that several values are selected at random within each parameter's range. A marginal distribution can be defined for each free parameter; for discrete parameters, a Bernoulli distribution, a multinomial distribution, and so on can be used. Random search is more convenient to use than grid search and converges faster.
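The two initialization strategies can be contrasted with a short sketch that only generates candidate starting points. The function names, the parameter ranges, the uniform sampling distribution, and the fixed random seed are assumptions of this example.

```python
import random
from itertools import product

def grid_points(ranges, steps):
    """Equally spaced values per parameter, combined by Cartesian product.

    ranges: list of (low, high) per free parameter; steps must be >= 2.
    """
    axes = [[lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
            for lo, hi in ranges]
    return list(product(*axes))

def random_points(ranges, count, seed=0):
    """Draw each parameter uniformly at random inside its range."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in ranges)
            for _ in range(count)]

ranges = [(0.0, 1.0), (0.0, 10.0), (1.0, 2.0)]
grid = grid_points(ranges, steps=4)       # 4 ** 3 starting points
sampled = random_points(ranges, count=10) # size chosen independently of dimension
```

The grid size is steps raised to the number of parameters, which is the exponential growth noted above, whereas the number of random starting points is chosen freely.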
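Expectation Step
The expectation of the complete-data log-likelihood function $\log P(y, \gamma \mid \theta)$ with respect to the conditional probability distribution $P(\gamma \mid Y, \theta^{(i)})$ of the unobserved data $\gamma$, given the observed data Y and the current parameters $\theta^{(i)}$, is called the Q function, $Q(\theta, \theta^{(i)})$.
Note: i is the iteration index of the EM algorithm when estimating the model parameters, j is the index of a data item in the data set, and k is the index of a sub-model.
The quantity $E[\gamma_{jk} \mid y, \theta]$ must be computed here; it is denoted $\hat{\gamma}_{jk}$:
$$\hat{\gamma}_{jk} = \frac{\alpha_k\, \phi(y_j; \theta_k)}{\sum_{k'=1}^{K} \alpha_{k'}\, \phi(y_j; \theta_{k'})}$$
$\hat{\gamma}_{jk}$ is the probability, under the current model parameters, that the j-th observation comes from the k-th sub-model, and is called the responsibility of sub-model k for the observation $y_j$.
Substituting $\hat{\gamma}_{jk}$ and $n_k = \sum_{j=1}^{N} \hat{\gamma}_{jk}$ into the complete-data log-likelihood gives:
$$Q(\theta, \theta^{(i)}) = \sum_{k=1}^{K} \left\{ n_k \log \alpha_k + \sum_{j=1}^{N} \hat{\gamma}_{jk} \log \phi(y_j; \theta_k) \right\}$$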
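Maximization Step
The M step of each iteration finds the maximum of the function $Q(\theta, \theta^{(i)})$ with respect to $\theta$, i.e.:
$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})$$
Setting the partial derivatives of the Q function to zero (subject to $\sum_k \alpha_k = 1$) gives the following updates, where $\hat{\gamma}_{jk} = E[\gamma_{jk} \mid y, \theta^{(i)}]$ is the responsibility computed in the E step:
$$\hat{\mu}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \qquad \hat{C}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, (y_j - \hat{\mu}_k)(y_j - \hat{\mu}_k)^{T}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \qquad \hat{\alpha}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}$$
The E step and the M step are repeated until the algorithm converges. The convergence criterion can be reaching a given number of steps, or the change of the Q function between two successive iterations falling below a threshold.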
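The alternation of E step and M step can be sketched for the one-dimensional case. This is a deliberately reduced illustration: the text works with multivariate Gaussians, whereas the sketch uses scalar observations; it stops when the observed-data log-likelihood changes by less than a tolerance rather than monitoring the Q function; and the variance floor, the function names, and the sample data are assumptions of this example.

```python
from math import exp, log, pi, sqrt

def normal_pdf(y, mu, var):
    return exp(-(y - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def em_gmm_1d(data, weights, means, variances,
              max_iter=100, tol=1e-8, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM from the given initial parameters."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, prev_ll = len(data), None
    for _ in range(max_iter):
        # E step: responsibility of each sub-model for each observation.
        resp, ll = [], 0.0
        for y in data:
            joint = [a * normal_pdf(y, m, v)
                     for a, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += log(total)
            resp.append([p / total for p in joint])
        # M step: closed-form updates of mean, variance, and weight.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * y for r, y in zip(resp, data)) / nk
            var_k = sum(r[k] * (y - means[k]) ** 2
                        for r, y in zip(resp, data)) / nk
            variances[k] = max(var_k, var_floor)  # guard against collapse
            weights[k] = nk / n
        # Stop when the log-likelihood no longer changes noticeably.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances

data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
w, m, v = em_gmm_1d(data, weights=[0.5, 0.5], means=[-0.1, 5.2],
                    variances=[1.0, 1.0])
```

On this well-separated toy data the two estimated means settle near the centers of the two groups; running the function from several initial points and keeping the run with the largest log-likelihood corresponds to the multiple-run strategy described earlier.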
In the ticketing information analysis method provided by the embodiment of the present invention, the booking behavior pattern features of a passenger are extracted from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and travel-companion relation; the booking behavior pattern of a passenger is characterized by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space; and, if the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of the passenger hidden state vectors. In this way, the latent state of a passenger is tracked dynamically and the passenger's booking behavior pattern is estimated accurately, with a degree of tolerance to erroneous data; an iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel and identify abnormal patterns in passenger booking behavior, with computational efficiency that meets real-time requirements; the model output should satisfy the stability requirement, meaning that the determination of a given person's booking pattern should remain consistent within a given period; and the model output can be presented from multiple aspects and angles for the convenience of data analysts.
Referring to Figure 13, Figure 13 is a schematic diagram of the functional modules of a ticketing information analysis device provided by an embodiment of the present invention.
As shown in Figure 13, the ticketing information analysis device includes:
An extraction module 1301, configured to extract the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and travel-companion relation;
A fitting module 1302, configured to characterize the booking behavior pattern of a passenger by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's type is unknown, to convert the problem of determining the passenger's category into the process of fitting the probability density distribution of the passenger hidden state vectors.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by computing age statistics over the data file;
The trip purpose distribution is obtained as follows: the native province code parsed from the ID card information is combined with the administrative division codes of the starting station and the terminus to determine whether the native province code equals the administrative division codes of the starting station and the terminus; passengers are thereby divided by trip purpose into a preset number of categories that are mutually exclusive and collectively exhaustive. Here, odh means that the starting station and the terminus both match the native place, i.e. a short trip within the home province; odo means that the starting station and the terminus match each other but lie in a province other than the home province, i.e. a short trip within another province; o means leaving the home province to travel to another province; d means returning home from another province; other covers all remaining cases;
The booking counts include the ticket-change count, the refund count, and the valid booking count: the ticket-change count is the number of booking records whose state is 3; the refund count is the number of booking records whose state is 2; the valid booking count is the number of booking records whose state is 5;
The train type distribution is obtained as follows: a passenger with multiple ride records yields a sequence of train types; the economy, speed, and comfort scores of each train type are computed, summed, and divided by the number of ride records, so that each index value is the average of the corresponding index over all ride records within the preset time;
The booking mode distribution is obtained as follows: a passenger with multiple ride records yields a sequence of booking modes; the economy and speed scores of each train type are computed, summed, and divided by the number of ride records, so that each index value is the average of the corresponding index over all ride records within the preset time;
The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the starting station distribution. The number of starting stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as the key and counting the distinct starting stations that appear in them; the maximum likelihood probability refers to the station that occurs most often in each passenger's booking records, with the probability computed as its frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total number of passengers dispatched on the day divided by the total number of passengers dispatched by all stations; the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, collecting the starting stations in those records, counting the frequency of each distinct item in the collection, and computing the entropy of the resulting discrete distribution;
The terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient, and the entropy of the terminus distribution. The number of termini is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as the key and counting the distinct termini that appear in them; the maximum likelihood probability refers to the station that occurs most often in each passenger's booking records, with the probability computed as its frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total number of expected arrivals on the day divided by the total number of expected arrivals at all stations; the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, collecting the termini in those records, counting the frequency of each distinct item in the collection, and computing the entropy of the resulting discrete distribution.
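Two of the features above, the trip purpose category and the booking counts, lend themselves to a brief sketch. The function names, the use of plain province codes as inputs, and the sample values are assumptions of this example; the state codes 3, 2, and 5 are the ones given in the text.

```python
from collections import Counter

def trip_purpose(native, origin, dest):
    """Classify a trip from province codes of native place, origin, destination."""
    if origin == dest:
        # Trip stays within one province: the home province or another one.
        return "odh" if origin == native else "odo"
    if origin == native:
        return "o"      # leaving the home province
    if dest == native:
        return "d"      # returning home from another province
    return "other"

def booking_counts(states):
    """Count ticket changes (3), refunds (2), and valid bookings (5)."""
    counts = Counter(states)
    return {"changed": counts[3], "refunded": counts[2], "valid": counts[5]}

purposes = [trip_purpose("11", "11", "11"), trip_purpose("11", "31", "31"),
            trip_purpose("11", "11", "31"), trip_purpose("11", "31", "11"),
            trip_purpose("11", "31", "44")]
summary = booking_counts([5, 5, 3, 2, 5, 3])
```

Every trip falls into exactly one of the five categories, which matches the requirement that the categories neither overlap nor leave any case out.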
Preferably, the travel-companion relation is determined as follows: in the passenger booking stream data acquired over a period of time, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and bought their tickets with the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a travel-companion relation.
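The grouping condition for candidate companions can be sketched as follows. The record field names and the function name are hypothetical, and the sketch only counts how often two passengers satisfy the shared-trip condition; the support and confidence thresholds would then be applied to these counts.

```python
from collections import defaultdict
from itertools import combinations

def companion_pair_counts(records):
    """Count, for each passenger pair, the trips on which all conditions match."""
    groups = defaultdict(set)
    for r in records:
        # Passengers are grouped only if every listed attribute is identical.
        key = (r["date"], r["origin"], r["dest"], r["train"],
               r["carriage"], r["mode"], r["window"])
        groups[key].add(r["passenger"])

    pair_counts = defaultdict(int)
    for passengers in groups.values():
        for a, b in combinations(sorted(passengers), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

def record(passenger, date, carriage):
    # Hypothetical record layout used only for this example.
    return {"passenger": passenger, "date": date, "origin": "S1",
            "dest": "S2", "train": "G101", "carriage": carriage,
            "mode": "window", "window": "W3"}

pairs_seen = companion_pair_counts([
    record("A", "2016-01-01", 5), record("B", "2016-01-01", 5),
    record("C", "2016-01-01", 6),
    record("A", "2016-01-08", 2), record("B", "2016-01-08", 2),
])
```

In the sample data passengers A and B share two trips, while passenger C, who sat in a different carriage, forms no pair.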
Preferably, the extraction module 1301 is specifically configured to:
Characterize the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data;
Perform statistical analysis on the booking data within a preset time interval, describe the passenger's most likely booking mode by its maximum likelihood probability, and convert qualitative, labeled data into continuous, quantitative data;
Aggregate the booking modes in all of the passenger's ride records within the preset time interval, and calculate the entropy of the booking mode distribution and the number of distinct booking modes.
Preferably, the fitting module 1302 is specifically configured to:
Let the passenger's category be z, where z is a hidden variable; given the passenger's category, the conditional distribution of the passenger hidden state vector y is a multivariate Gaussian distribution;
Use a linear weighted sum of several Gaussian models to fit the probability density of passengers, modeling the fact that several categories may exist in reality;
Assume, on the basis of the central limit theorem, that the conditional distribution of the passenger vector y given the passenger category z is normal; a weighted sum of Gaussian models approximates any probability distribution arbitrarily closely, and each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the result.
In the ticketing information analysis device provided by the embodiment of the present invention, the booking behavior pattern features of a passenger are extracted from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and travel-companion relation; the booking behavior pattern of a passenger is characterized by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space; and, if the passenger's type is unknown, the problem of determining the passenger's category is converted into the process of fitting the probability density distribution of the passenger hidden state vectors. In this way, the latent state of a passenger is tracked dynamically and the passenger's booking behavior pattern is estimated accurately, with a degree of tolerance to erroneous data; an iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel and identify abnormal patterns in passenger booking behavior, with computational efficiency that meets real-time requirements; the model output should satisfy the stability requirement, meaning that the determination of a given person's booking pattern should remain consistent within a given period; and the model output can be presented from multiple aspects and angles for the convenience of data analysts.
The technical principle of the embodiments of the present invention has been described above with reference to specific embodiments. These descriptions are intended only to explain the principle of the embodiments and shall not be construed in any way as limiting their scope of protection. Based on the explanation here, those skilled in the art can conceive of other embodiments of the present invention without creative effort, and these embodiments also fall within the scope of protection of the embodiments of the present invention.
Claims (10)
- 1. A method of ticketing information analysis, characterized in that the method comprises: extracting the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking counts, train type distribution, booking mode distribution, starting station distribution, terminus distribution, and travel-companion relation; characterizing the booking behavior pattern of a passenger by the passenger's hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's type is unknown, converting the problem of determining the passenger's category into the process of fitting the probability density distribution of the passenger hidden state vectors.
- 2. The method according to claim 1, characterized in that the attribute information of the passenger includes passenger age distribution information obtained by computing age statistics over the data file; the trip purpose distribution is obtained by combining the native province code parsed from the ID card information with the administrative division codes of the starting station and the terminus to determine whether the native province code equals the administrative division codes of the starting station and the terminus, thereby dividing passengers by trip purpose into a preset number of categories that are mutually exclusive and collectively exhaustive; wherein odh means that the starting station and the terminus both match the native place, i.e. a short trip within the home province; odo means that the starting station and the terminus match each other but lie in a province other than the home province, i.e. a short trip within another province; o means leaving the home province to travel to another province; d means returning home from another province; other covers all remaining cases; the booking counts include the ticket-change count, the refund count, and the valid booking count, the ticket-change count being the number of booking records whose state is 3, the refund count being the number of booking records whose state is 2, and the valid booking count being the number of booking records whose state is 5; the train type distribution is obtained from the sequence of train types of a passenger with multiple ride records, by computing the economy, speed, and comfort scores of each train type, summing them, and dividing by the number of ride records, so that each index value is the average of the corresponding index over all ride records within the preset time; the booking mode distribution is obtained from the sequence of booking modes of a passenger with multiple ride records, by computing the economy and speed scores of each train type, summing them, and dividing by the number of ride records, so that each index value is the average of the corresponding index over all ride records within the preset time; the starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the starting station distribution, wherein the number of starting stations is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as the key and counting the distinct starting stations that appear in them; the maximum likelihood probability refers to the station that occurs most often in each passenger's booking records, with the probability computed as its frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total number of passengers dispatched on the day divided by the total number of passengers dispatched by all stations; and the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, collecting the starting stations in those records, counting the frequency of each distinct item in the collection, and computing the entropy of the resulting discrete distribution; the terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient, and the entropy of the terminus distribution, wherein the number of termini is obtained by aggregating all of a passenger's booking records with the passenger's ID card number as the key and counting the distinct termini that appear in them; the maximum likelihood probability refers to the station that occurs most often in each passenger's booking records, with the probability computed as its frequency divided by the total number of bookings; the station importance coefficient of each station is computed as the station's total number of expected arrivals on the day divided by the total number of expected arrivals at all stations; and the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, collecting the termini in those records, counting the frequency of each distinct item in the collection, and computing the entropy of the resulting discrete distribution.
- 3. The method according to claim 1, characterized in that the companion relationship is determined as follows: in the passenger booking transaction data acquired within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and booked using the same booking mode at the same window of the same station, and the requirements on support and confidence are met, then passenger A and passenger B have a companion relationship.
- 4. The method according to any one of claims 1 to 3, characterized in that extracting the booking behavior pattern features of the passenger from the passenger's attribute information, trip purpose distribution, booking number distribution, train number type, booking mode distribution, starting station distribution, terminus distribution and companion relationship includes: characterizing the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data; performing statistical analysis on the booking data within a preset time interval and describing the passenger's most probable booking mode with the maximum likelihood probability, converting qualitative, labelled data into continuous, quantitative data; and aggregating the booking modes in all of the passenger's travel records within the preset time interval to calculate the entropy of the booking mode distribution and the number of different booking modes.
- 5. The method according to claim 4, characterized in that characterizing the booking behavior pattern of the passenger by the passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, converting the category determination problem of the passenger into a process of fitting and learning the probability density distribution of the passenger hidden state vector, includes: letting the category of the passenger be z, where z is a hidden variable, and, given the passenger category, the conditional distribution of the passenger hidden state vector y obeys a multi-dimensional Gaussian distribution; fitting the probability density distribution curve of passengers with a linear weighted sum of multiple Gaussian models, to simulate the situation in which multiple categories may exist in reality; assuming, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger category z is normal; wherein the weighted sum of multiple Gaussian models can approximate any probability distribution to an arbitrary degree, and each Gaussian model represents one class; and, for a passenger vector whose category needs to be determined, calculating the probability that the passenger belongs to each class and then selecting the class with the highest probability as the determination result.
- 6. A device for ticketing information analysis, characterized in that the device includes: an extraction module, configured to extract the booking behavior pattern features of the passenger from the passenger's attribute information, trip purpose distribution, booking number distribution, train number type, booking mode distribution, starting station distribution, terminus distribution and companion relationship; and a fitting module, configured to characterize the booking behavior pattern of the passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, to convert the category determination problem of the passenger into a process of fitting and learning the probability density distribution of the passenger hidden state vector.
- 7. The device according to claim 6, characterized in that the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age; the trip purpose distribution includes: based on the province (native place) code parsed from the ID card information, combined with the administrative division codes of the starting station and the terminus, judging whether the native place code equals the administrative division codes of the starting station and the terminus, and dividing passengers by trip purpose into a preset number of categories that do not overlap and leave no case uncovered; wherein odh indicates that the starting station and the terminus are both consistent with the native place, that is, a short-distance trip within the home province; odo indicates that the starting station is consistent with the terminus but outside the home province, that is, a short-distance trip within another province; o indicates leaving the home province to travel to another province; d indicates returning home from another province; other indicates other situations; the booking number includes the ticket-change count, the refund count and the valid booking count; the ticket-change count is the frequency of records whose state is 3 in the booking records; the refund count is the frequency of records whose state is 2 in the booking records; the valid booking count is the frequency of records whose state is 5 in the booking records; the train number type distribution includes: obtaining the sequence of different train number types from the passenger's multiple travel records, calculating the economy, agility and comfort scores of each train number type, summing the scores and dividing by the number of travel records, so that the score of each index is the average of that index over all travel records within the preset time; the booking mode distribution includes: obtaining the sequence of different booking modes from the passenger's multiple travel records, calculating the economy and agility scores of each train number type, summing the scores and dividing by the number of travel records, so that the score of each index is the average of that index over all travel records within the preset time; the starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient and the entropy of the starting station distribution; the number of starting stations is obtained by aggregating all booking records of a passenger, using the passenger's ID card number as the key, and counting the different starting stations that appear in those records; the maximum likelihood probability is the probability of the station that appears most often in each passenger's booking records, calculated as that station's frequency divided by the passenger's total number of bookings; the station importance coefficient of each station is calculated as the station's total number of departing passengers on the day divided by the total number of departing passengers at all stations; the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, obtaining the set of starting stations in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution; the terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient and the entropy of the terminus distribution; the number of termini is obtained by aggregating all booking records of a passenger, using the passenger's ID card number as the key, and counting the different termini that appear in those records; the maximum likelihood probability is the probability of the station that appears most often in each passenger's booking records, calculated as that station's frequency divided by the passenger's total number of bookings; the station importance coefficient of each station is calculated as the station's total number of expected arriving passengers on the day divided by the total number of expected arriving passengers at all stations; the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, obtaining the set of termini in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution.
- 8. The device according to claim 6, characterized in that the companion relationship is determined as follows: in the passenger booking transaction data acquired within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and booked using the same booking mode at the same window of the same station, and the requirements on support and confidence are met, then passenger A and passenger B have a companion relationship.
- 9. The device according to any one of claims 6 to 8, characterized in that the extraction module is specifically configured to: characterize the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data; perform statistical analysis on the booking data within a preset time interval and describe the passenger's most probable booking mode with the maximum likelihood probability, converting qualitative, labelled data into continuous, quantitative data; and aggregate the booking modes in all of the passenger's travel records within the preset time interval to calculate the entropy of the booking mode distribution and the number of different booking modes.
- 10. The device according to claim 9, characterized in that the fitting module is specifically configured to: let the category of the passenger be z, where z is a hidden variable, and, given the passenger category, the conditional distribution of the passenger hidden state vector y obeys a multi-dimensional Gaussian distribution; fit the probability density distribution curve of passengers with a linear weighted sum of multiple Gaussian models, to simulate the situation in which multiple categories may exist in reality; assume, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger category z is normal; wherein the weighted sum of multiple Gaussian models can approximate any probability distribution to an arbitrary degree, and each Gaussian model represents one class; and, for a passenger vector whose category needs to be determined, calculate the probability that the passenger belongs to each class and then select the class with the highest probability as the determination result.
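The trip purpose categories (odh, odo, o, d, other) and the booking counts by record state recited in claim 7 can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, and the native place and station codes are assumed to be strings whose first two digits identify the province, as in GB/T 2260 administrative division codes.

```python
from collections import Counter

def province_code(code):
    """Return the province prefix (assumed: first two digits) of a division code or ID number."""
    return code[:2]

def trip_purpose(home, origin, destination):
    """Classify one trip from three province codes into odh / odo / o / d / other."""
    if origin == home and destination == home:
        return "odh"    # trip inside the home province
    if origin == destination:
        return "odo"    # trip inside one province other than the home province
    if origin == home:
        return "o"      # leaving the home province
    if destination == home:
        return "d"      # returning to the home province
    return "other"

def booking_counts(states):
    """Count ticket changes (state 3), refunds (state 2) and valid bookings (state 5)."""
    freq = Counter(states)
    return {"changed": freq[3], "refunded": freq[2], "valid": freq[5]}

# Hypothetical example: a passenger whose ID number starts with "11".
home = province_code("110105199001011234")
print(trip_purpose(home, province_code("110000"), province_code("310000")))
print(booking_counts([5, 5, 3, 2, 5]))
```

The branch order makes the five categories mutually exclusive and exhaustive, which matches the requirement that the categories neither overlap nor omit any case.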
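The starting station and terminus features recited in claims 2 and 7 (number of distinct stations, maximum likelihood probability, entropy, and station importance coefficient) reduce to simple frequency computations. The following is a minimal sketch; the function names and sample data are hypothetical, and the natural logarithm is an assumption because the claims do not state the base used for the entropy.

```python
import math
from collections import Counter

def station_features(stations):
    """Return (distinct count, max likelihood probability, entropy) for one passenger.

    `stations` is the list of starting stations (or termini) taken from all
    booking records aggregated under one passenger ID.
    """
    total = len(stations)
    if total == 0:
        return 0, 0.0, 0.0
    counts = Counter(stations)
    distinct = len(counts)
    # Frequency of the most common station divided by the total number of bookings.
    max_prob = max(counts.values()) / total
    # Entropy of the discrete distribution of stations (natural log assumed).
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return distinct, max_prob, entropy

def station_importance(daily_totals):
    """Map each station to its share of the day's passengers over all stations."""
    grand_total = sum(daily_totals.values())
    return {station: n / grand_total for station, n in daily_totals.items()}

# Hypothetical data for illustration only.
print(station_features(["Beijing", "Beijing", "Shanghai", "Beijing"]))
print(station_importance({"Beijing": 30, "Shanghai": 70}))
```

A passenger who always departs from one station gets an entropy of zero and a maximum likelihood probability of one, while a passenger spread over many stations gets a higher entropy and a lower probability.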
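The companion relationship test in claims 3 and 8 can be sketched as a co-occurrence count over trips that share every listed attribute. The claims require support and confidence thresholds but do not define them, so the definitions below are assumptions: support is the number of shared trips, and confidence is the shared trips divided by the larger of the two passengers' trip counts. The function name and the record layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def companion_pairs(records, min_support=2, min_confidence=0.5):
    """Find passenger pairs that repeatedly travel and book together.

    `records` is an iterable of (passenger_id, trip_key). A trip_key is assumed
    to bundle date, starting station, destination, train, carriage, booking
    mode and booking window, so equal keys mean all conditions coincide.
    """
    members_by_trip = defaultdict(set)
    trips_by_passenger = defaultdict(set)
    for passenger, trip_key in records:
        members_by_trip[trip_key].add(passenger)
        trips_by_passenger[passenger].add(trip_key)

    # Count how many trips each pair of passengers shares.
    shared = defaultdict(int)
    for members in members_by_trip.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1

    result = {}
    for (a, b), n in shared.items():
        # Assumed symmetric confidence: shared trips over the larger trip count.
        confidence = n / max(len(trips_by_passenger[a]), len(trips_by_passenger[b]))
        if n >= min_support and confidence >= min_confidence:
            result[(a, b)] = (n, confidence)
    return result

# Hypothetical records for illustration only.
records = [("p1", "t1"), ("p2", "t1"), ("p1", "t2"), ("p2", "t2"),
           ("p3", "t2"), ("p3", "t3"), ("p3", "t4")]
print(companion_pairs(records))
```

In this sample only p1 and p2 share two trips, so they are the only pair that passes the assumed thresholds.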
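The final determination step in claims 5 and 10 (compute the probability that a passenger vector y belongs to each Gaussian class, then pick the most probable class) can be sketched as follows. This sketch assumes the mixture weights, means and variances have already been fitted, for example by expectation-maximization, and it assumes diagonal covariance for brevity; the claims specify neither. All names are hypothetical.

```python
import math

def log_gaussian_diag(y, mean, var):
    """Log density of a Gaussian with diagonal covariance evaluated at y."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )

def posterior(y, weights, means, variances):
    """Return p(z = k | y) for every mixture component k."""
    logs = [
        math.log(w) + log_gaussian_diag(y, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum before exponentiating for stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(y, weights, means, variances):
    """Return (index of the most probable class, list of class probabilities)."""
    probs = posterior(y, weights, means, variances)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical two-class mixture over two-dimensional passenger vectors.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(classify([0.2, -0.1], weights, means, variances))
```

Working in log space avoids underflow when the passenger vector has many dimensions, which is the usual situation for a feature vector built from all of the distributions listed in the claims.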
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611198401.8A CN107527223A (en) | 2016-12-22 | 2016-12-22 | A kind of method and device of Ticketing information analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611198401.8A CN107527223A (en) | 2016-12-22 | 2016-12-22 | A kind of method and device of Ticketing information analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107527223A true CN107527223A (en) | 2017-12-29 |
Family
ID=60748558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611198401.8A Pending CN107527223A (en) | 2016-12-22 | 2016-12-22 | A kind of method and device of Ticketing information analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107527223A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108573430A (en) * | 2018-03-14 | 2018-09-25 | 北京经纬信息技术公司 | A kind of data processing method, device and computer readable storage medium |
CN108596664A (en) * | 2018-04-24 | 2018-09-28 | 盘缠科技股份有限公司 | A kind of unilateral tranaction costs of electronic ticket determine method, system and device |
CN109376315A (en) * | 2018-09-25 | 2019-02-22 | 海南民航凯亚有限公司 | A kind of civil aviation passenger label analysis method and processing terminal based on machine learning |
CN109783531A (en) * | 2018-12-07 | 2019-05-21 | 北京明略软件系统有限公司 | A kind of relationship discovery method and apparatus, computer readable storage medium |
CN110334963A (en) * | 2019-07-11 | 2019-10-15 | 四川亨通网智科技有限公司 | Admission ticket order background management system |
CN111598162A (en) * | 2020-05-14 | 2020-08-28 | 万达信息股份有限公司 | Cattle risk monitoring method, terminal equipment and storage medium |
CN112836996A (en) * | 2021-03-10 | 2021-05-25 | 西南交通大学 | Method for identifying potential ticket buying demand of passenger |
CN112949926A (en) * | 2021-03-10 | 2021-06-11 | 西南交通大学 | Income maximization ticket amount distribution method based on passenger demand re-identification |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103235933A (en) * | 2013-04-15 | 2013-08-07 | 东南大学 | Vehicle abnormal behavior detection method based on Hidden Markov Model |
CN104702378A (en) * | 2013-12-06 | 2015-06-10 | 华为技术有限公司 | Method and device for estimating parameters of mixture Gaussian distribution |
CN104749624A (en) * | 2015-03-03 | 2015-07-01 | 中国石油大学(北京) | Method for synchronously realizing seismic lithofacies identification and quantitative assessment of uncertainty of seismic lithofacies identification |
CN105516127A (en) * | 2015-12-07 | 2016-04-20 | 中国科学院信息工程研究所 | Internal threat detection-oriented user cross-domain behavior pattern mining method |
CN105701180A (en) * | 2016-01-06 | 2016-06-22 | 北京航空航天大学 | Commuting passenger feature extraction and determination method based on public transportation IC card data |
CN105719023A (en) * | 2016-01-24 | 2016-06-29 | 东北电力大学 | Real-time wind power prediction and error analysis method based on mixture Gaussian distribution |
CN105808639A (en) * | 2016-02-24 | 2016-07-27 | 平安科技(深圳)有限公司 | Network access behavior recognizing method and device |
-
2016
- 2016-12-22 CN CN201611198401.8A patent/CN107527223A/en active Pending
Non-Patent Citations (1)
Title |
---|
陈宇: "基于高斯混合模型的林业信息文本分类算法", 《中南林业科技大学学报》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108573430A (en) * | 2018-03-14 | 2018-09-25 | 北京经纬信息技术公司 | A kind of data processing method, device and computer readable storage medium |
CN108596664A (en) * | 2018-04-24 | 2018-09-28 | 盘缠科技股份有限公司 | A kind of unilateral tranaction costs of electronic ticket determine method, system and device |
CN108596664B (en) * | 2018-04-24 | 2021-01-05 | 盘缠科技股份有限公司 | Method, system and device for determining unilateral transaction fee of electronic ticket |
CN109376315A (en) * | 2018-09-25 | 2019-02-22 | 海南民航凯亚有限公司 | A kind of civil aviation passenger label analysis method and processing terminal based on machine learning |
CN109783531A (en) * | 2018-12-07 | 2019-05-21 | 北京明略软件系统有限公司 | A kind of relationship discovery method and apparatus, computer readable storage medium |
CN110334963A (en) * | 2019-07-11 | 2019-10-15 | 四川亨通网智科技有限公司 | Admission ticket order background management system |
CN111598162A (en) * | 2020-05-14 | 2020-08-28 | 万达信息股份有限公司 | Cattle risk monitoring method, terminal equipment and storage medium |
CN112836996A (en) * | 2021-03-10 | 2021-05-25 | 西南交通大学 | Method for identifying potential ticket buying demand of passenger |
CN112949926A (en) * | 2021-03-10 | 2021-06-11 | 西南交通大学 | Income maximization ticket amount distribution method based on passenger demand re-identification |
CN112836996B (en) * | 2021-03-10 | 2022-03-04 | 西南交通大学 | Method for identifying potential ticket buying demand of passenger |
CN112949926B (en) * | 2021-03-10 | 2022-03-04 | 西南交通大学 | Income maximization ticket amount distribution method based on passenger demand re-identification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107527223A (en) | A kind of method and device of Ticketing information analysis | |
CN111199343B (en) | Multi-model fusion tobacco market supervision abnormal data mining method | |
Li | Credit risk prediction based on machine learning methods | |
CN101216998B (en) | An urban traffic flow information amalgamation method of evidence theory based on fuzzy rough sets | |
CN112464094B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN115147155A (en) | Railway freight customer loss prediction method based on ensemble learning | |
CN111242484A (en) | Vehicle risk comprehensive evaluation method based on transition probability | |
CN112508600A (en) | Vehicle value evaluation method based on Internet public data | |
CN113379313B (en) | Intelligent preventive test operation management and control system | |
CN115409577A (en) | Intelligent container repurchase prediction method and system based on user behavior and environmental information | |
CN115099450A (en) | Family carbon emission monitoring and accounting platform based on fusion model | |
Rao et al. | Flight Ticket Prediction using Random Forest Regressor Compared with Decision Tree Regressor | |
CN106779214A (en) | A kind of multifactor fusion civil aviation passenger travel forecasting approaches based on topic model | |
CN107992613A (en) | A kind of Text Mining Technology protection of consumers' rights index analysis method based on machine learning | |
Wang et al. | Stacking based LightGBM-CatBoost-RandomForest algorithm and its application in big data modeling | |
Xu et al. | MM-UrbanFAC: Urban functional area classification model based on multimodal machine learning | |
Hu | Overdue invoice forecasting and data mining | |
Mao et al. | Naive Bayesian algorithm classification model with local attribute weighted based on KNN | |
CN117114812A (en) | Financial product recommendation method and device for enterprises | |
CN110347828A (en) | A kind of Metro Passenger demand dynamic acquisition method and its obtain system | |
Ai | Predicting Titanic Survivors by Using Machine Learning | |
CN116805022A (en) | Specific Twitter user mining method based on group propagation | |
CN115545342A (en) | Risk prediction method and system for enterprise electric charge recovery | |
CN116128275A (en) | Event deduction prediction system | |
Lanting et al. | A Transportation Analytic Solution for Predicting Flight Cancellations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171229 |