CN108596664A - Method, system and device for determining the unilateral transaction fee of an electronic ticket - Google Patents
- Publication number
- CN108596664A (application CN201810371303.2A)
- Authority
- CN
- China
- Prior art keywords
- model
- electronic ticket
- data
- unilateral
- transaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0283—Price estimation or determination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Landscapes
- Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Finance (AREA)
- General Physics & Mathematics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Accounting & Taxation (AREA)
- General Business, Economics & Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Game Theory and Decision Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Primary Health Care (AREA)
- Tourism & Hospitality (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Devices For Checking Fares Or Tickets At Control Points (AREA)
Abstract
The invention discloses a method, a system and a device for determining the unilateral transaction fee of an electronic ticket. In the embodiments of the invention, a unilateral transaction fee determination model is set up as follows: data transaction records of the electronic ticket are collected, then cleaned and integrated to obtain complete data transaction records; based on the user features and site features extracted from these complete records, a set learning model is trained and tested to obtain the model. When the electronic ticket generates a unilateral transaction record, the set model is used to determine the missing site record and the fee of that unilateral transaction record. In this way, the embodiments of the invention achieve an accurate determination of the unilateral transaction fee of the electronic ticket.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method, a system and a device for determining unilateral transaction cost of an electronic ticket.
Background
With the introduction of intelligent technology into public transportation systems, electronic tickets are becoming increasingly common. Public transportation companies are trying to combine the conventional Automatic Fare Collection (AFC) service with various emerging mobile payment channels and mobile application vendors, and have implemented various forms of electronic tickets, such as electronic tickets based on Near Field Communication (NFC) technology, two-dimensional code technology, or Bluetooth technology.
For example, in the AFC service used in a subway, each time a user passes an entry or exit gate with an electronic ticket, the gate generates an entry or exit data transaction record. A complete ride transaction record includes the entry-gate transaction data and the exit-gate transaction data. The fare for a single ride is calculated from a combination of time and mileage and is deducted directly when the electronic ticket passes the exit gate. To keep users moving through the gate quickly, the fare calculation is completed between the gate and the electronic ticket and does not depend on the background system. During each ride, the reader-writer installed on the gate records the single-ride information on the electronic ticket so that the subsequent deduction can be completed.
However, when the fare is deducted in this way, unilateral transactions of the electronic ticket often occur. A unilateral transaction means that the data transaction record held by the background system for the electronic ticket contains only an entry-gate record and no exit-gate record, or only an exit-gate record and no entry-gate record. In that case the data transaction record of the ride is incomplete. The causes of unilateral transactions are complex and cannot be avoided entirely; they generally arise from various factors in the AFC business chain: the user swipes the electronic ticket more than once by mistake, the user passes the entry (or exit) gate without swiping, or the entry (or exit) gate fails and an exception occurs while it exchanges transaction records with the electronic ticket. Once a unilateral transaction occurs, the fare of the electronic ticket is difficult to calculate and fare revenue is affected.
Currently, the rule for determining the unilateral transaction fee of an electronic ticket is as follows: when the data transaction record of the electronic ticket is incomplete, a paid or unpaid update is applied to the electronic ticket according to the time at which the unilateral transaction occurred, and an update fee is then deducted. The update fee, which can also be called a penalty, makes up for the lost fare. At the same time, the data transaction record of the electronic ticket is updated and rewritten. Generally, the unilateral transaction fee is set to the highest fare in the public transportation system. This protects the fare revenue of the public transportation system, but it costs the user more than the actual fare and leads to complaints against the public transportation system.
Therefore, how to determine the unilateral transaction fee of an electronic ticket accurately, so that neither the fare revenue of the public transportation system nor the user of the electronic ticket suffers a loss and a balance between the two is reached, is a problem to be solved urgently.
Disclosure of Invention
One embodiment of the invention provides a method for determining the unilateral transaction cost of an electronic ticket, which can accurately determine the unilateral transaction cost of the electronic ticket.
An embodiment of the invention also provides a system for determining the unilateral transaction fee of an electronic ticket, which can accurately determine the unilateral transaction fee of the electronic ticket.
An embodiment of the invention also provides a device for determining the unilateral transaction fee of an electronic ticket, which can accurately determine the unilateral transaction fee of the electronic ticket.
The embodiments of the invention are realized as follows:
A method for determining the unilateral transaction fee of an electronic ticket comprises the following steps:
setting a unilateral transaction fee determination model, wherein the model is set up as follows: collecting data transaction records of the electronic ticket, cleaning and integrating them to obtain complete data transaction records of the electronic ticket, and training and testing a set learning model based on user features and site features extracted from the complete data transaction records to obtain the model;
when the electronic ticket generates a unilateral transaction record, determining the missing site record and the corresponding fee of the unilateral transaction record by using the set model.
A system for determining the unilateral transaction fee of an electronic ticket comprises a background processing system and a gate processing unit, wherein
the background processing system is used for setting a unilateral transaction fee determination model and sending the model to the gate processing unit, the model being set up as follows: collecting data transaction records of the electronic ticket, cleaning and integrating them to obtain complete data transaction records of the electronic ticket, and training and testing a set learning model based on user features and site features extracted from the complete data transaction records to obtain the model;
the gate processing unit is used for receiving the unilateral transaction fee determination model from the background processing system and, when scanning an electronic ticket generates a unilateral transaction record, determining the missing site record and the corresponding fee of the unilateral transaction record by using the set model.
A device for determining the unilateral transaction fee of an electronic ticket comprises a transceiver module and a processing module, wherein
the transceiver module is used for receiving the unilateral transaction fee determination model;
the processing module is used for determining, when scanning an electronic ticket generates a unilateral transaction record, the missing site record and the corresponding fee of the unilateral transaction record by using the set model.
As can be seen from the above, the embodiments of the invention set up a unilateral transaction fee determination model as follows: data transaction records of the electronic ticket are collected, then cleaned and integrated to obtain complete data transaction records, and a set learning model is trained and tested based on the user features and site features extracted from the complete records to obtain the model. When the electronic ticket generates a unilateral transaction record, the set model is used to determine the missing site record and the fee of the unilateral transaction record. Because the fee is determined by the set model, and the model draws on feature extraction and summarization of the various information generated during rides, it can identify the missing site record that corresponds to the unilateral transaction record and calculate the fee from it, thereby achieving an accurate determination of the unilateral transaction fee of the electronic ticket.
Drawings
FIG. 1 is a flow chart of a method for determining a single-side transaction fee of an electronic ticket according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a single-side transaction fee determination process for an electronic ticket according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a system for determining a single-side transaction fee of an electronic ticket according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a single-side transaction fee determination apparatus for an electronic ticket according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a second apparatus for determining a single-side transaction fee of an electronic ticket according to an embodiment of the present invention;
FIG. 6 is a fare-deduction frequency distribution diagram of the actual fare (Price_Origin), the fare predicted by the model (Price_Model), and the deduction mechanism described in the background art (Price_Traditional) in the test according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the evaluation accuracy provided by the embodiment of the present invention;
FIG. 8 is a schematic illustration of fare accuracy and fare frequency provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram comparing fare accuracy with the amount of user historical data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
In order to determine the unilateral transaction fee of the electronic ticket accurately, the embodiments of the invention set up a unilateral transaction fee determination model as follows: data transaction records of the electronic ticket are collected, then cleaned and integrated to obtain complete data transaction records, and a set learning model is trained and tested based on the user features and site features extracted from the complete records to obtain the model. When the electronic ticket generates a unilateral transaction record, the set model is used to determine the missing site record and the corresponding fee of the unilateral transaction record.
Thus, the embodiments of the invention determine the fee of a unilateral transaction record according to the set model. The model draws on feature extraction and summarization of the various information generated during rides and analyzes the user's ticket-buying and riding habits from that information in order to determine the fee. In other words, the model takes into account the fee calculation errors that the user's ticket-buying and riding habits would otherwise cause, so it can identify the missing site record that corresponds to the unilateral transaction record and calculate the fee from it, thereby achieving an accurate determination of the unilateral transaction fee of the electronic ticket.
Fig. 1 is a flowchart of a method for determining a single-side transaction fee of an electronic ticket according to an embodiment of the present invention, which includes the following specific steps:
Step 101: setting a unilateral transaction fee determination model, wherein the model is set up as follows: collecting data transaction records of the electronic ticket, cleaning and integrating them to obtain complete data transaction records of the electronic ticket, and training and testing a set learning model based on user features and site features extracted from the complete data transaction records to obtain the model;
Step 102: when the electronic ticket generates a unilateral transaction record, determining the missing site record and the corresponding fee of the unilateral transaction record by using the set model.
In the method, in step 101, a model is set to analyze a user's ticket-buying riding habits based on various information during riding, thereby determining a fee.
In the method, the model training and testing uses an XGBoost (eXtreme Gradient Boosting) learning model and/or a Random Forest (RF) learning model. Both combine a tree model with an ensemble method and achieve good results on general classification problems. Specifically, the XGBoost learning model combines a tree model with the Boosting method. XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable, and it implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The RF learning model is a classifier comprising multiple decision trees whose output class is the mode of the classes output by the individual trees. It combines a tree model with the Bagging method, so the trees can be trained in parallel and training is faster.
In the method, the model training and testing may also use other learning models that classify and analyze the user's ticket-buying and riding habits based on the various information from the ride in order to determine the fee; this is not limited here.
In the method, the cleaning, integration and extraction are carried out as follows: application algorithms are set; the data transaction records of the electronic ticket are cleaned and integrated to obtain complete data transaction records; the records are then spliced and features are extracted to obtain user features and site features, which serve as the samples of the model. Specifically, the application algorithms include: a missing-site multi-classification algorithm, a deduction-amount multi-classification algorithm, a missing-site binary classification algorithm and a deduction-amount binary classification algorithm.
In the method, the model is set up as follows: the set learning models are trained and tested on the samples obtained for each application algorithm to obtain the test results for that algorithm; the results are first fused by weighting within each application algorithm, and the fused results are then fused by weighting across the different application algorithms to obtain the model.
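The weighted fusion described above can be illustrated with a minimal sketch. It assumes each learner outputs a class-to-probability mapping for one unilateral record; the function name `fuse`, the weight `weight_a` and all probability values are hypothetical and are not taken from the patent.

```python
def fuse(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two class-probability dicts (class -> probability)."""
    classes = set(prob_a) | set(prob_b)
    return {
        c: weight_a * prob_a.get(c, 0.0) + (1.0 - weight_a) * prob_b.get(c, 0.0)
        for c in classes
    }


# Hypothetical outputs of two learners for the same unilateral record.
station_xgb = {"43": 0.6, "110": 0.3, "9": 0.1}
station_rf = {"43": 0.4, "110": 0.5, "9": 0.1}

# Assumed weight: 0.7 for the first learner, 0.3 for the second.
fused = fuse(station_xgb, station_rf, weight_a=0.7)
best_station = max(fused, key=fused.get)
```

The same helper could be applied a second time to fuse results across application algorithms; how the weights are chosen is left open here.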
Thus, the embodiments of the invention use big data analysis to estimate, from the user's riding and ticket-buying habits, the entry and exit stations the user most likely used on the trip, and use this estimate as the basis for the fee of the unilateral transaction record. This improves on the inaccurate fee determination of the background art when a unilateral transaction record occurs, calculates a reasonable fee for the unilateral transaction record as far as possible, and does not increase the fare loss of the public transportation system.
In the embodiments of the invention, the complete data transaction records of the electronic ticket are historical records and include ticket purchase records, user information, site information and the like. FIG. 2 is a schematic diagram of the process for determining the unilateral transaction fee of an electronic ticket according to an embodiment of the present invention; how the model is established is described in detail below with reference to FIG. 2.
Step one: acquiring all complete data transaction records of the electronic tickets used by users, and acquiring site information and user information.
In this step, the user information is the information about the user's mobile terminal bound to the complete data transaction record; the weather information for each day in the time interval covered by the complete data transaction records is also acquired.
Step two: cleaning all the acquired complete data transaction records of the electronic ticket by removing null values, abnormal data and invalid data, retaining the complete data transaction records of successful transactions, and then integrating them.
Specifically, a cleaned, complete and successful data transaction record comprises: user basic information, the user's ticket purchase time, the user's boarding site, the user's alighting site and payment information. A complete piece of site information comprises the site name, the line on which the site is located, and the longitude and latitude of the site. A complete piece of weather information comprises the date, weather conditions (such as sunny, cloudy, rain or snow), air temperature (maximum and minimum) and wind information (wind force level and wind direction). A complete piece of user information comprises the user's mobile terminal number, the mobile operator, and the province and city of attribution.
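As a rough illustration of the cleaning in step two, the sketch below filters a list of records. The field names (`user`, `time`, `start`, `end`, `paid`) and the rule that an unknown site counts as abnormal data are assumptions made for this example only.

```python
def clean_records(records, known_sites):
    """Keep only complete, valid records of successful transactions."""
    required = ("user", "time", "start", "end", "paid")
    cleaned = []
    for r in records:
        if any(r.get(k) is None for k in required):
            continue  # null value in a required field
        if r["start"] not in known_sites or r["end"] not in known_sites:
            continue  # abnormal data: site is not in the site table
        if not r["paid"]:
            continue  # transaction did not succeed
        cleaned.append(r)
    return cleaned


# Hypothetical records: only the first one survives cleaning.
records = [
    {"user": "u1", "time": "2017/01/02 18:09:10", "start": 110, "end": 43, "paid": True},
    {"user": "u2", "time": None, "start": 110, "end": 43, "paid": True},
    {"user": "u3", "time": "2017/01/02 08:00:00", "start": 110, "end": 999, "paid": True},
    {"user": "u4", "time": "2017/01/03 09:30:00", "start": 7, "end": 43, "paid": False},
]
cleaned = clean_records(records, known_sites={7, 43, 110})
```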
The integration process is as follows. A data transaction record $t = (t_1, t_2, \ldots, t_n)$ is used as the base record of the data set; a piece of site information is $s = (s_1, s_2, \ldots, s_n)$, the weather information of a given day is $w = (w_1, w_2, \ldots, w_n)$, and a piece of user information is $p = (p_1, p_2, \ldots, p_n)$. The corresponding site information $s_{start}$ and $s_{end}$ is selected according to the boarding site and the alighting site in the transaction record; the weather information $w_d$ of the day is selected according to the date $d$ on which the transaction record occurred; and the attribution information $p_u$ of the user's mobile terminal is selected according to the user account $u$ in the transaction record.
Step three: the integrated data are spliced into $x_i = (t, s_{start}, s_{end}, w_d, p_u)$, and the prediction result representation $y_i$ is constructed according to the applied algorithm. Each data transaction record is then represented as $(x_i, y_i)$; after all data transaction records have been processed, the matrix $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is obtained as the data set.
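The lookup-and-splice step can be sketched as follows. The dictionary layouts, field names and all values are hypothetical; the sketch only shows how one transaction record might be joined with its site, weather and user information to form one sample.

```python
def splice(t, sites, weather, users):
    """Build x_i = (t, s_start, s_end, w_d, p_u) as one flat tuple."""
    s_start = sites[t["start"]]   # site information of the boarding site
    s_end = sites[t["end"]]       # site information of the alighting site
    w_d = weather[t["date"]]      # weather on the date of the transaction
    p_u = users[t["user"]]        # attribution information of the user account
    return (t["hour"], *s_start, *s_end, *w_d, *p_u)


# Hypothetical lookup tables.
sites = {43: ("Line 1", 39.90, 116.40), 110: ("Line 2", 39.95, 116.35)}
weather = {"2017-01-02": ("sunny", 5, -3)}
users = {"u1": ("operator A", "city X")}

t = {"user": "u1", "date": "2017-01-02", "hour": 18, "start": 110, "end": 43}
x = splice(t, sites, weather, users)
y = t["end"]  # target for the missing-site multi-classification case
```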
Specifically, $y_i$ is defined as follows:
missing-site multi-classification algorithm: the prediction result $y_i$ is an element of the complete set of subway stations;
deduction-amount multi-classification algorithm: the prediction result $y_i$ is an element of the complete set of subway fares;
missing-site binary classification algorithm: the prediction result $y_i \in \{0, 1\}$, where $y_i = 1$ if the candidate subway station is the station in question and $y_i = 0$ otherwise;
deduction-amount binary classification algorithm: the prediction result $y_i \in \{0, 1\}$, where $y_i = 1$ if the subway fare is the candidate fare and $y_i = 0$ otherwise.
Through splicing, the feature training data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is constructed, with $x_i \in R^d$, where $R$ denotes the real numbers and $d$ the dimension. The matrix $X = (x_1, x_2, \ldots, x_n)^T$ represents the inputs of the training set samples, and the matrix $Y = (y_1, y_2, \ldots, y_n)^T$ represents their outputs. Each $y_i$ corresponds to $x_i$ and together they form one sample; $n$ is the number of samples.
In this step, the missing-site and deduction-amount multi-classification problems are converted into binary classification problems by a "line expansion" method, whose steps are as follows.
Take the expansion of the site multi-classification into a site binary classification as an example. Suppose one piece of training data is numbered 0, and the set of alighting stations in the user's historical data is $hist\_S_{end} = \{43, 110, 9, 86, 56, 7, 38\}$. The data are shown in Table 1:
| No. | Feature 0 | Feature 1 | Feature 2 | … | Feature n | S_end |
| 0 | Value 0 | Value 1 | Value 2 | … | Value n | 43 |
Table 1
This piece of data is then "line expanded", where $S_{end}$ denotes the alighting station of the piece of data and $hist\_S_{end}^{i}$ denotes the i-th alighting station in the historical data.
As shown in Table 2, the training data numbered 0 is expanded into 7 pieces of data.
Table 2
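One plausible reading of the line expansion is sketched below. The expansion formula itself is not reproduced in the text, so this assumes that each historical alighting station becomes one candidate row, labelled 1 when the candidate equals the record's actual alighting station and 0 otherwise. The function name and feature values are hypothetical.

```python
def line_expand(features, s_end, hist_s_end):
    """Expand one multi-class sample into one binary sample per candidate station."""
    return [
        (features + [candidate], 1 if candidate == s_end else 0)
        for candidate in hist_s_end
    ]


hist_s_end = [43, 110, 9, 86, 56, 7, 38]  # alighting stations from the example
rows = line_expand(["value 0", "value 1", "value 2"], s_end=43, hist_s_end=hist_s_end)
```

With the seven historical stations of the example, the single record becomes seven binary rows, of which exactly one carries the label 1.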
For the classification algorithms, the data records cleaned and integrated in step two are divided into historical data, training data and test data, and new features are added using the historical data. These features mainly comprise the user's historical trip frequency, station, time and weather features and the historical passenger flow features of each subway station, as described in detail below.
1. User historical trip frequency feature $F_1 = (F_{11}, F_{12})$:
The historical time is cut at intervals of one week or one month to obtain the sets $D_w$ and $D_m$, respectively.
For time spans of one week, the set of trip counts of user $U_i$ is computed; similarly, for time spans of one month, the set of trip counts of user $U_i$ is computed, where
$U_i$ is the user of the i-th record and
$d$ is the date.
From these sets, the statistical feature of the weekly trip counts of user $U_i$ and the statistical feature of the monthly trip counts of user $U_i$ are obtained.
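A minimal sketch of the weekly and monthly trip-count statistics follows. The text does not state which statistics are used, so the mean and maximum here are assumptions, as are the function name and the use of ISO calendar weeks as the weekly buckets.

```python
from collections import Counter
from datetime import date
from statistics import mean


def trip_frequency_features(trip_dates):
    """Trip counts per week and per month, summarised by assumed statistics."""
    weekly = Counter(d.isocalendar()[:2] for d in trip_dates)  # (ISO year, ISO week)
    monthly = Counter((d.year, d.month) for d in trip_dates)
    # Weeks or months without any trip are not counted in this sketch.
    return {
        "weekly_mean": mean(weekly.values()),
        "weekly_max": max(weekly.values()),
        "monthly_mean": mean(monthly.values()),
        "monthly_max": max(monthly.values()),
    }


# Hypothetical trip dates of one user.
trips = [date(2017, 1, 2), date(2017, 1, 3), date(2017, 1, 10), date(2017, 2, 1)]
features = trip_frequency_features(trips)
```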
2. User historical travel site feature $F_2 = (F_{21}, F_{22})$:
$F_{21}$: the stations the user has travelled to historically. $U_i$ denotes the user of the i-th record, the number of historical travel records of user $U_i$ is counted, and $s_k$ denotes the site the user travelled to in the k-th record.
$F_{22}$: the sites in the historical data that the user frequently travels to, given the boarding site. $s_{i,start}$ denotes the boarding site of the i-th record.
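The two site features can be sketched with simple counting. Since the formulas are not reproduced in the text, the normalisation by the number of records and the "top N" cut-off below are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def destination_frequency(history):
    """Share of the user's historical trips that ended at each station."""
    counts = Counter(end for _, end in history)
    total = len(history)
    return {station: count / total for station, count in counts.items()}


def likely_destinations(history, start, top=3):
    """Most frequent historical destinations for a known boarding station."""
    counts = Counter(end for s, end in history if s == start)
    return [station for station, _ in counts.most_common(top)]


# Hypothetical (boarding, alighting) history of one user.
history = [(110, 43), (110, 43), (110, 9), (7, 43)]
freq = destination_frequency(history)
candidates = likely_destinations(history, start=110)
```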
3. User travel time feature $F_3 = (F_{31}, F_{32}, F_{33}, F_{34}, F_{35}, F_{36})$:
The travel time $t$ of the user is represented in the following form:
t = yyyy/mm/dd ww HH:MM:SS
Taking 2017/01/02 Mon 18:09:10 as the example throughout:
yyyy represents the year of the travel time, e.g. yyyy = 2017;
mm represents the month of the travel time, e.g. mm = 01;
dd represents the day of the month of the travel time, e.g. dd = 02;
HH represents the hour of the travel time (24-hour clock), e.g. HH = 18;
ww represents the day of the week of the travel day, e.g. ww = 1;
MM represents the minutes of the travel time, e.g. MM = 09;
SS represents the seconds of the travel time, e.g. SS = 10.
Based on the above representation, the travel time features are defined:
Travel month feature $F_{31} = mm$
Travel date feature $F_{32} = dd$
Travel week feature $F_{33}$, which is obtained with a rounding-down (floor) operation
Travel day-of-week feature $F_{34} = ww$
Travel hour feature $F_{35} = HH$
Travel minute feature $F_{36} = MM$
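The time features can be extracted with the standard library, as in the sketch below. It assumes the timestamp is stored as "yyyy/mm/dd HH:MM:SS" and derives the day of the week from the date (Monday = 1, matching ww = 1 in the example). $F_{33}$ is left out because its formula is not reproduced in the text.

```python
from datetime import datetime


def time_features(stamp):
    """Extract travel time features from an assumed 'yyyy/mm/dd HH:MM:SS' string."""
    t = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return {
        "F31": t.month,         # travel month
        "F32": t.day,           # travel date
        "F34": t.isoweekday(),  # day of week, Monday = 1
        "F35": t.hour,          # travel hour
        "F36": t.minute,        # travel minute
    }


features = time_features("2017/01/02 18:09:10")
```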
4. Weather feature $F_4 = (F_{41}, F_{42}, F_{43}, F_{44})$:
For a given day $d$, the hourly air temperature is represented as a vector $T_d = (t_{d1}, t_{d2}, \ldots, t_{d24})$; the weather of the day is $W \in \{\text{sunny}, \text{cloudy}, \text{rain}, \text{snow}, \ldots\}$; the wind force level of the day is $W_s$; and the wind direction of the day is $W_d \in \{N, NE, E, SE, S, \ldots\}$, with eight wind directions in total.
Maximum air temperature feature $F_{41} = \max_i t_{di}$, where $t_{di} \in T_d$;
Minimum air temperature feature $F_{42} = \min_i t_{di}$, where $t_{di} \in T_d$;
Wind force feature of the day $F_{44} = W_s$;
Wind direction feature of the day, taken from $W_d$.
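A small sketch of the weather features for one day is given below. The eight compass points and the integer encoding of the wind direction are assumptions for illustration; the text only states that there are eight directions.

```python
def weather_features(hourly_temps, wind_strength, wind_direction):
    """Daily weather features from 24 hourly temperatures and wind data."""
    directions = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]  # assumed eight points
    return {
        "max_temp": max(hourly_temps),
        "min_temp": min(hourly_temps),
        "wind_strength": wind_strength,
        "wind_direction": directions.index(wind_direction),  # assumed integer code
    }


# Hypothetical hourly temperatures: -3 at night rising to 5 in the afternoon.
temps = [-3, -3, -2, -2, -1, 0, 1, 2, 3, 4, 5, 5, 5, 4, 3, 2, 1, 0, 0, -1, -1, -2, -2, -3]
features = weather_features(temps, wind_strength=3, wind_direction="SE")
```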
5. Historical passenger flow feature $F_5$ of each subway station:
$s_i$ denotes the subway station to be evaluated for the i-th record; $start_k$ and $end_k$ denote the boarding station and the alighting station of the k-th record, respectively; $K$ is the total number of history records. $F_5$ represents the frequency with which users boarded or alighted at station $s_i$ in the history.
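The station flow feature can be approximated by counting how often a station appears as a boarding or alighting station. The formula is not reproduced in the text, so dividing the count by the number of records K is an assumption, and the function name is hypothetical.

```python
def station_flow_frequency(station, history):
    """Boardings plus alightings at a station, divided by the number of records."""
    total = len(history)
    if total == 0:
        return 0.0
    hits = sum((start == station) + (end == station) for start, end in history)
    return hits / total


# Hypothetical (boarding, alighting) history over all users.
history = [(110, 43), (110, 43), (110, 9), (7, 43)]
flow_43 = station_flow_frequency(43, history)
flow_9 = station_flow_frequency(9, history)
```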
Step four: and taking the characteristics selected in the third step as input, and predicting by using an XGboost learning model and an RF learning model.
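Specifically, the XGBoost learning model is based on the gradient boosting tree algorithm, which mainly includes the following steps (L denotes the loss function, and $(x_i, y_i)$ the i-th sample):
Step 1: Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$, where $\gamma$ is a constant and $f_0$ represents a tree with only one node;
Step 2: Let M denote the number of trees. For $m = 1, 2, \ldots, M$:
a) for $i = 1, 2, \ldots, n$, where n is the number of samples, compute the residual estimate $r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$;
b) obtain the leaf node regions $R_{jm}$ of the m-th tree ($j = 1, 2, \ldots, J$, where J is the number of leaf nodes) and estimate the value of each leaf node region by line search so as to minimize the loss function: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$;
c) update the tree model $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$.
Step 3: Obtain the final tree model $f(x) = f_M(x)$.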
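The boosting loop above can be illustrated with a small pure-Python sketch. It is not the XGBoost implementation: it assumes a squared-error loss (so the residual estimate is simply y minus the current prediction and the line search reduces to the mean residual in each leaf), uses one-split trees on a single feature, and adds a learning rate that the steps above do not mention.

```python
def fit_stump(xs, residuals):
    """Fit a tree with J = 2 leaves to the residuals by trying every split point."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # For squared error, the loss-minimizing leaf value is the mean residual.
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    f0 = sum(ys) / len(ys)  # step 1: best constant for squared error
    trees = []

    def predict(x):
        return f0 + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):  # step 2
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # a) negative gradient
        trees.append(fit_stump(xs, residuals))                # b) fit leaves, c) update
    return predict            # step 3: the final model f_M

model = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```

With this toy data each round halves the remaining residual, so the prediction approaches 1 on the left group and 5 on the right group.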
The RF learning model is based on bootstrap resampling and the decision tree algorithm, and mainly includes the following steps:
Step 1: Apply the bootstrap method to sample the data set D randomly with replacement, drawing k sub-sample sets $D_i$ ($i = 1, 2, \ldots, k$); the complements of the sub-sample sets, $D - D_i$, form k out-of-bag data sets. Train k decision trees on the k sub-sample sets;
Step 2: Given m variables in total, randomly draw n variables (n < m) at each node of each tree, select from these n variables the one with the greatest classification ability, and determine the split threshold of that variable by checking each candidate split point;
Step 3: Grow each tree to the maximum extent without pruning.
Step 4: Merge the prediction results of the k decision trees into a vector $C = (c_1, c_2, \ldots, c_n)$, where n is the number of classes predicted by the whole model and $c_i$ ($i = 1, 2, \ldots, n$) is the number of decision trees whose prediction is class i. The final model is $f(x) = \arg\max_i c_i$, i.e., the class with the most votes is selected as the output of the whole model.
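Two of the steps above, bootstrap sampling with its out-of-bag complement and the final majority vote, can be sketched as follows. Tree training itself is omitted, and the function names and the example labels are hypothetical.

```python
import random
from collections import Counter

def bootstrap_split(n_rows, rng):
    """Draw row indices with replacement; rows never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees (argmax over the vote counts c_i)."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(10, rng)
winner = majority_vote(["station_7", "station_3", "station_7"])
```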
Step five: Perform model fusion on the prediction probability results obtained in step four to obtain the model.
In this step, the following notation is used. The prediction probabilities of the XGBoost learning model and the RF learning model of the missing site multi-classification algorithm are denoted $station1_{xgb}$ and $station1_{rf}$;
the prediction probabilities of the XGBoost learning model and the RF learning model of the deduction amount multi-classification algorithm are denoted $price1_{xgb}$ and $price1_{rf}$;
the prediction probabilities of the XGBoost learning model and the RF learning model of the missing site binary classification algorithm are denoted $station2_{xgb}$ and $station2_{rf}$;
the prediction probabilities of the XGBoost learning model and the RF learning model of the deduction amount binary classification algorithm are denoted $price2_{xgb}$ and $price2_{rf}$.
The model fusion is divided into the following 3 steps:
Step 1: Linearly weight the prediction probabilities whose prediction target is the missing station and those whose prediction target is the deduction amount, respectively, i.e., $station = w_{01} station1_{xgb} + w_{02} station1_{rf} + w_{03} station2_{xgb} + w_{04} station2_{rf}$ and $price' = w_{11} price1_{xgb} + w_{12} price1_{rf} + w_{13} price2_{xgb} + w_{14} price2_{rf}$;
Step 2: Convert the linearly weighted prediction probability of the missing station into a prediction probability of the deduction amount. Denoting this conversion by f, the converted prediction probability of the deduction amount is $price'' = f(station)$;
Step 3: Linearly weight $price'$ and $price''$ obtained above to obtain the final prediction probability of the deduction amount, $price = w_{31} price' + w_{32} price''$.
In the above 3 steps, $w_{ij} \in \mathbb{R}$ ($i, j \in \mathbb{N}$).
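The three fusion steps can be sketched as below. The conversion f is not spelled out in the text; the sketch assumes that f moves each station's probability onto the fare from the boarding station to that station, using a hypothetical fare table. For brevity only two models are combined per target instead of four, and all weights and probabilities are made-up values.

```python
def weighted_sum(distributions, weights):
    """Linearly combine several {label: probability} dicts with the given weights."""
    combined = {}
    for dist, w in zip(distributions, weights):
        for label, p in dist.items():
            combined[label] = combined.get(label, 0.0) + w * p
    return combined

def station_to_price(station_probs, fare_of):
    """Assumed conversion f: add each station's probability to that station's fare."""
    price_probs = {}
    for station, p in station_probs.items():
        fare = fare_of[station]
        price_probs[fare] = price_probs.get(fare, 0.0) + p
    return price_probs

fare_of = {"S1": 2, "S2": 3, "S3": 3}  # hypothetical fares from the boarding station

# Step 1: linear weighting per prediction target.
station = weighted_sum(
    [{"S1": 0.2, "S2": 0.5, "S3": 0.3}, {"S1": 0.4, "S2": 0.4, "S3": 0.2}],
    [0.5, 0.5],
)
price_1 = weighted_sum([{2: 0.4, 3: 0.6}, {2: 0.2, 3: 0.8}], [0.5, 0.5])  # price'
# Step 2: convert station probabilities into fare probabilities.
price_2 = station_to_price(station, fare_of)                              # price''
# Step 3: final linear weighting.
price = weighted_sum([price_1, price_2], [0.5, 0.5])
predicted_fare = max(price, key=price.get)
```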
Fig. 3 shows a system for determining the unilateral transaction fee of an electronic ticket according to an embodiment of the present invention. The system includes a background processing system and a gate processing unit, wherein:
the background processing system is configured to set up a unilateral transaction fee determination model and send it to the gate processing unit. The model is set up as follows: transaction records of the electronic tickets are collected, cleaned and integrated to obtain complete transaction records; based on the user features and station features extracted from the complete transaction records, model training and testing are performed with the configured learning models to obtain the model;
the gate processing unit is configured to receive the unilateral transaction fee determination model from the background processing system and, when a unilateral transaction record is generated by scanning an electronic ticket, to use the model to determine the missing station record of the unilateral transaction record and the corresponding fee.
In the system, an XGboost learning model and an RF learning model are adopted during model training test.
In the system, the extraction process includes: configuring the applied algorithms, joining the cleaned and integrated complete transaction records, and extracting user features and station features as the samples of the model. Specifically, the applied algorithms include: a missing site multi-classification algorithm, a deduction amount multi-classification algorithm, a missing site binary classification algorithm and a deduction amount binary classification algorithm.
In the system, the background processing system and the gate processing unit can be in wired or wireless communication.
Fig. 4 is a schematic structural diagram of a first apparatus for determining the unilateral transaction fee of an electronic ticket according to an embodiment of the present invention, including a model setting module and a transceiver module, wherein:
the transceiver module is configured to receive the transaction records of the electronic tickets and to send the model;
the model setting module is configured to acquire the transaction records of the electronic tickets, clean and integrate them to obtain complete transaction records, and, based on the user features and station features extracted from the complete transaction records, perform model training and testing with the configured learning models to obtain a unilateral transaction fee determination model.
Fig. 5 is a schematic structural diagram of a second apparatus for determining the unilateral transaction fee of an electronic ticket according to an embodiment of the present invention, including a transceiver module and a processing module, wherein:
the transceiver module is configured to receive the unilateral transaction fee determination model;
the processing module is configured to use the model, when a unilateral transaction record is generated by scanning an electronic ticket, to determine the missing station record of the unilateral transaction record and the corresponding fee.
The effect of the scheme provided by the embodiment of the invention on the unilateral transaction orders of the public transport system, compared with the scheme provided in the background art, is shown in Table 3:
Item | Fee to be deducted | Algorithm-predicted deduction | Background-art deduction mechanism |
Total amount | 22475.00 | 22010.00 | 61121.00 |
Average amount per order | 3.00 | 3.83 | 10.00 |
Percentage | 100.00% | 97.93% (-2.07%) | 271.95% (+171.95%) |
Table 3
In the unilateral transaction fee deduction scheme of the background art, if the data of the terminal station is missing, the highest fare in the public transport network reachable from the starting station of the unilateral transaction is calculated and deducted. This causes substantial losses to users: in the test, the total fee deducted for unilateral transactions under the background art is 271.95% of the fee that should be deducted.
The unilateral transaction fee determination scheme provided by the embodiment of the invention is adopted to predict the terminal station of the user, and then the fee is deducted according to the prediction result, so that the deduction of the unilateral transaction fee is more scientific and reasonable, and the user can have better riding experience. In the test, the difference between the fee deduction predicted by the algorithm and the fee to be deducted is 2.07 percent, so that the loss of a traditional fee deduction mechanism to a user can be effectively reduced.
Fig. 6 is a frequency distribution diagram, from the test of the embodiment of the present invention, of the actual fare (Price_Origin), the fare predicted by the model (Price_Model), and the fare deducted by the mechanism of the background art (Price_Traditional).
It can be seen that the deductions of the background-art mechanism are concentrated above 10 yuan, which deviates severely from the actual fare distribution and is strongly punitive. The fare distribution predicted by the scheme of the embodiment of the present invention is consistent with the actual fare distribution: the model fits the real scenario well, and its predictions are more scientific and reasonable.
The following examples illustrate the embodiments of the present invention
To better highlight the advantages and features of the embodiment, the following description uses real and complete service data of a subway operator for model training and testing. The "terminal station" field in the service data is taken as the prediction target of the model and the other fields as known information; the model is then trained and tested so that it adapts well to a real service scenario. Evaluating the generalization performance of the model requires an effective and feasible experimental estimation method and an evaluation criterion that measures the generalization ability of the model.
In this embodiment, the unilateral transaction problem is a classification problem. Accuracy is a commonly used performance metric for classification problems and applies to both binary and multi-classification problems:
$Accuracy = \frac{correct}{total}$
where correct is the number of correctly classified samples and total is the number of predicted samples.
In general, the higher the accuracy of the model, the better the performance of the model.
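As a small illustration of this metric, the sketch below computes the accuracy from two label lists; the function name and the example labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Three of the four predicted terminal stations match the actual ones.
acc = accuracy(["A", "B", "C", "A"], ["A", "B", "B", "A"])
```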
Step one:
Acquire all transaction data of the electronic tickets used by the users, together with the basic subway station information covered by the transaction data, the related information of the mobile terminals bound by the users, and the daily weather information for the time interval covered by the transaction data.
Step two: Clean the data, removing null values, abnormal data and invalid data.
Because the feature selection and model parameter tuning in steps three and four differ for each algorithm, they are described separately for the missing site multi-classification algorithm, the deduction amount multi-classification algorithm, the missing site binary classification algorithm and the deduction amount binary classification algorithm. To ensure that the models are compared under the same measurement standard, in all of the following models the first 88% of the extracted data set is used as the training set (or as historical data plus training set) and the last 12% as the test set, so that all models are evaluated on the same test set.
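The fixed 88% / 12% split can be sketched as follows. The text only says "first" and "last"; the sketch assumes that the records are ordered by time, and the record layout is hypothetical.

```python
def chronological_split(records, train_percent=88):
    """Sort by time and cut once, so the test set is the most recent part.

    Assumption: "first 88%" refers to the time-ordered data set.
    """
    ordered = sorted(records, key=lambda r: r["time"])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]

records = [{"time": t, "end_station": "S%d" % (t % 5)} for t in range(100)]
train, test = chronological_split(records)
```

Because the cut depends only on the ordering, every model sees exactly the same test records.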
Missing site multi-classification algorithm
Step three: Feature extraction is performed as follows:
User features: 1) mobile phone number; 2) operator of the mobile phone number; 3) home location of the mobile phone number (simplified into two labels: whether it belongs to this city and whether it belongs to this province);
Ticket features: 1) order time (month, week, hour, minute); 2) weather of the day (weather condition, maximum temperature, minimum temperature, wind force and wind direction); 3) ticket-purchase starting station (one-hot encoding of the starting station, plus the longitude and latitude of the starting station).
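The encoding of the starting station can be sketched as below: a one-hot vector over the station list followed by the station's longitude and latitude. The station names and coordinates are made-up placeholders.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if item == value else 0 for item in vocabulary]

def start_station_features(station, stations, coordinates):
    """One-hot code of the starting station followed by its longitude and latitude."""
    lon, lat = coordinates[station]
    return one_hot(station, stations) + [lon, lat]

stations = ["S1", "S2", "S3"]  # hypothetical station list
coordinates = {"S1": (113.26, 23.13), "S2": (113.32, 23.12), "S3": (113.36, 23.10)}
features = start_station_features("S2", stations, coordinates)
```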
Step four:
Taking the selected features as input, prediction is performed using the XGBoost learning model and the RF learning model.
In this model, 88% of the data set is used as the training set and 12% as the test set to train the XGBoost learning model and the RF learning model. The model parameters are selected as follows (parameters not listed keep their default values):
RF parameter name | n_estimators | max_features | max_depth |
Default value | 10 | auto | None |
Parameter selection | 140 | None | 100 |
RF parameter description:
n_estimators: Considering the scale of the data set and the trend of the model's prediction accuracy, setting n_estimators to 140 gives the model high accuracy at a relatively low training time.
max_features: Setting this value to "None", i.e., not limiting the number of features considered when searching for the best split, allows more possibilities to be considered at each node during training and thereby improves the prediction accuracy of the model.
max_depth: The default value is "None", i.e., the depth of the subtrees is not limited when the decision trees are built. Considering the scale of the data set and the number of features, setting the value to 100 reduces the complexity of the model and improves the training speed.
XGBoost parameter description:
num_boost_round: The parameter is set to 700 based on the trend of the model's accuracy for different values.
max_depth: The larger max_depth is, the more specific and local patterns the model can learn. The parameter is tuned with GridSearchCV (grid search with cross-validation) so that the model fits the training set well without overfitting.
subsample: The default value is 1; lowering it can prevent overfitting. Because the number of extracted feature columns is small, subsample should not be too small, otherwise under-fitting would result, so the parameter is lowered to 0.9.
colsample_bytree: This value works similarly to subsample, except that colsample_bytree samples columns whereas subsample samples rows.
The performance on the test set is as follows:
Model | XGBoost | RF |
Accuracy of default parameters | 47.1% | 39.1% |
Accuracy after parameter adjustment | 57.7% | 53.8% |
Observation of the data set reveals the following problems:
1) There are many classes, and their distribution is unbalanced. A city with mature rail transit often has hundreds of subway stations, which requires the model to classify correctly among a large number of classes; ticket purchase orders are concentrated at popular stations while some remote stations have few orders, and this imbalance degrades the prediction performance of the model.
2) There are few user features. A unilateral transaction is an individual behavior of the user, so user features are very important for correctly predicting the boarding and alighting stations.
The ideas for optimizing the model performance are mainly the following 3 points:
1) Reduce the number of classes. The prediction target of the model is changed from the station to the fare. In general, there are far fewer fares than stations, and with fewer classes each class contains more data, which alleviates the problems of many classes and unbalanced distribution to a certain extent.
2) Convert the multi-classification problem into a binary classification problem.
3) Add user features. The data set is split and part of the data is set aside as historical data, from which user features such as average ticket purchase frequency, average consumption amount and number of ticket purchases are calculated.
Deduction amount multi-classification algorithm
Step three: Feature extraction is performed as follows:
User features: 1) mobile phone number; 2) operator of the mobile phone number; 3) home location of the mobile phone number (simplified into two labels: whether it belongs to this city and whether it belongs to this province);
Ticket features: 1) order time (month, week, hour, minute); 2) weather of the day (weather condition, maximum temperature, minimum temperature, wind force and wind direction); 3) ticket-purchase starting station (one-hot encoding of the starting station, plus the longitude and latitude of the starting station).
Step four: Taking the selected features as input, prediction is performed using the XGBoost learning model and the RF learning model.
With 88% of the data set as the training set and 12% as the test set, the parameters of the model were chosen as follows:
RF parameter name | n_estimators | max_features | max_depth |
Default value | 10 | auto | None |
Parameter selection | 200 | None | 38 |
XGBoost parameter description:
num_boost_round: Following the same principle as for the site multi-classification, the parameter is set to 930 according to the trend of the model's accuracy for different values.
max_depth: Compared with the site multi-classification, the number of classes is reduced and the complexity of each class increases, so max_depth is increased slightly to let the model learn more specific features.
subsample: The default value is 1; lowering it can prevent overfitting. Because the number of extracted feature columns is small, subsample should not be too small, otherwise under-fitting would result, so the parameter is lowered to 0.9.
colsample_bytree: This value works similarly to subsample.
The performance of the XGBoost learning model and the RF learning model on the test set is as follows:
Model | XGBoost | RF |
Accuracy of default parameters | 49.4% | 45.0% |
Accuracy after parameter adjustment | 66.0% | 61.4% |
Missing site binary classification algorithm
Step three: The collected raw data is divided into historical data, training data and test data, of which the historical data accounts for the largest share. The historical data is used to add features, the training data to train the model, and the test data to evaluate the model.
Feature extraction is performed on the training data and the test data according to the following fields:
User features: 1) user mobile phone number (operator, whether it belongs to this province, whether it belongs to this city); 2) average and maximum of the monthly and weekly numbers of trips and tickets purchased; 3) total number of trips and total number of tickets purchased, and the average number of tickets and average purchase amount per trip; 4) the 4 stations at which the user boards and alights most frequently in the historical data.
Station features: popularity of the station in the historical data (frequency with which all users board and alight at the station).
User + station features: 1) the weather, wind force, wind direction and average temperature on the day of travel are each matched against the user's historical data to find the 2 stations at which the user alights most frequently under the current condition; 2) in the historical data, the user's current boarding station is taken as the alighting station, and the 2 boarding stations that most frequently correspond to it are taken as candidate alighting stations.
Trip features: 1) weather on the day of travel (maximum temperature, minimum temperature, weather condition, wind force and wind direction); 2) travel time (month; beginning, middle or end of the month; day of the week; hour and minute at which the ticket purchase order was created).
Tidal features: the day of the week of the trip and the specific time period of the ticket purchase are each matched against the user's historical data to find the 2 stations at which the user alights most frequently under the current condition.
Each record in the training data and the test data is then subjected to "line expansion", which completes the conversion from the multi-classification problem to the binary classification problem.
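The text does not detail how "line expansion" is carried out. One plausible reading, sketched below as an assumption, is that every trip record is expanded into one row per candidate alighting station (for example, stations drawn from the user's history), with a binary label marking the candidate that matches the actual alighting station. The field names are hypothetical.

```python
def expand_rows(record, candidate_stations):
    """Turn one multi-class record into one binary row per candidate station."""
    rows = []
    for candidate in candidate_stations:
        rows.append({
            "user": record["user"],
            "start": record["start"],
            "candidate_end": candidate,
            # Label 1 only for the candidate that equals the actual alighting station.
            "label": 1 if candidate == record["end"] else 0,
        })
    return rows

record = {"user": "u1", "start": "S1", "end": "S4"}
rows = expand_rows(record, ["S2", "S4", "S7"])
labels = [row["label"] for row in rows]
```

Under this reading, a record whose actual station is not among the candidates yields only negative rows, and each record produces many more negative rows than positive ones.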
Step four: Taking the selected features as input, prediction is performed using the XGBoost learning model and the RF learning model.
60% of the data is used as historical data, 28% as training data, and 12% as test data. The parameters of the XGBoost learning model and the RF learning model are selected as follows (parameters not listed keep their default values):
RF parameter name | n_estimators | max_features | class_weight |
Default value | 10 | auto | None |
Parameter selection | 400 | None | {0:0.1,1:0.9} |
Parameter description of the RF learning model:
n_estimators: Compared with multi-classification, binary classification trains faster, so the performance of the model can be improved by increasing n_estimators.
max_features: Setting this value to "None", i.e., not limiting the number of features considered when searching for the best split, allows more possibilities to be considered at each node during training and thereby improves the prediction accuracy of the model.
class_weight: Because of the "line expansion", the numbers of samples in class 0 and class 1 are unbalanced, and the weights of the two classes must be adjusted so that the model can learn useful rules from the unbalanced data set. The class weights are set to 0.1 and 0.9, respectively, according to the ratio of positive to negative samples.
Parameter description of the XGBoost learning model:
num_boost_round: Compared with multi-classification, the number of iterations is reduced. Because part of the data is set aside as historical data, less data is used for training, and fewer iterations are needed for the model to reach its optimum.
max_depth: Compared with multi-classification, the depth of the trees is increased so that the model can learn more specific features.
subsample: Compared with multi-classification, the binary classification has more features, so lowering subsample and colsample_bytree to 0.8 improves the generalization ability of the model.
colsample_bytree: This value works similarly to subsample.
scale_pos_weight: Similar to class_weight for RF. To let the model learn effectively from the unbalanced training set, the positive sample weight is set to 7 according to the ratio of positive to negative samples.
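How such weights might be derived from the label counts is sketched below. The ratio of negative to positive samples is a common starting value for scale_pos_weight; the inverse-frequency rule for class_weight is an assumption of this sketch, not something the text states.

```python
def imbalance_weights(labels):
    """Derive weights for an unbalanced 0/1 label list."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Ratio of negatives to positives, a common starting value for scale_pos_weight.
    scale_pos_weight = n_neg / n_pos
    # Assumption: each class is weighted by the relative frequency of the other class.
    class_weight = {0: n_pos / len(labels), 1: n_neg / len(labels)}
    return scale_pos_weight, class_weight

# One positive row for every seven negative rows after "line expansion".
labels = [1] + [0] * 7
scale_pos_weight, class_weight = imbalance_weights(labels)
```

For this 1:7 example the sketch yields a positive weight of 7 and class weights of 0.125 and 0.875.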
In the test set, 20% of the travel records have an alighting station at which the user never alighted in the historical records, so for the current test set the upper limit of the accuracy of the algorithm is 80%. The accuracy of each model is shown in the following table.
As the above table shows, when the alighting station of a travel record in the test set appears in the historical data, the prediction accuracy for the alighting station can reach 80.3%. For the current test set, therefore, the biggest limitation of the algorithm is that 20% of the travel records cannot be predicted, which significantly lowers the upper limit of the model.
For the travel records that cannot be predicted by the binary classification, the contribution of the multi-classification prediction accuracy on these records to the overall prediction accuracy is defined as follows [equation missing in the source], where:
$Correct_{multi}$ is the number of these samples that the multi-classification predicts correctly;
the second quantity is the number of samples that cannot be predicted by the binary classification;
total is the number of predicted samples.
The following table shows, for the XGBoost and RF models of the site multi-classification, the contribution of the accuracy on the 20% of travel records that the binary classification cannot predict to the overall multi-classification accuracy:
Model | Multi-class XGBoost | Multi-class RF |
Contribution to accuracy | 0.1% | 0.2% |
As the above table shows, the records that cannot be predicted under binary classification contribute little to the accuracy under site multi-classification. Even though these records cannot be predicted under binary classification, they therefore do not cause much loss, and the defect can be alleviated by increasing the amount of historical data.
Deduction amount binary classification algorithm
Step three:
The raw data is divided into historical data, training data and test data, of which the historical data accounts for the largest share. The historical data is used to add features, the training data to train the model, and the test data to evaluate the model.
Feature extraction is performed on the training data and the test data according to the following fields:
User features: 1) user mobile phone number (operator, whether it belongs to this province, whether it belongs to this city); 2) average and maximum of the monthly and weekly numbers of trips and tickets purchased; 3) total number of trips and total number of tickets purchased, and the average number of tickets and average purchase amount per trip; 4) the fares of the 4 stations at which the user boards and alights most frequently in the historical data.
Station features: popularity of the station in the historical data (frequency with which all users board and alight at the station).
User + station features: 1) the weather, wind force, wind direction and average temperature on the day of travel are each matched against the user's historical data to find the fares of the 2 stations at which the user alights most frequently under the current condition; 2) in the historical data, the user's current boarding station is taken as the alighting station, and the fares of the 2 boarding stations that most frequently correspond to it are obtained.
Trip features: 1) weather on the day of travel (maximum temperature, minimum temperature, weather condition, wind force and wind direction); 2) travel time (month; beginning, middle or end of the month; day of the week; hour and minute at which the ticket purchase order was created).
Tidal features: the day of the week of the trip and the specific time period of the ticket purchase are each matched against the user's historical data to find the fares of the 2 stations at which the user alights most frequently under the current condition.
Each record in the training set and the test set is then subjected to "line expansion", which completes the conversion from the multi-classification problem to the binary classification problem.
Step four: Taking the selected features as input, prediction is performed using the XGBoost learning model and the RF learning model.
60% of the data set is used as historical data, 28% as the training set, and 12% as the test set.
The model parameters were set as follows:
RF parameter name | n_estimators | max_features | max_depth | class_weight |
Default value | 10 | auto | None | None |
Parameter selection | 200 | None | 70 | {0:0.2,1:0.8} |
RF parameter description:
max_features: Setting this value to "None", i.e., not limiting the number of features considered when searching for the best split, allows more possibilities to be considered at each node during training and thereby improves the prediction accuracy of the model.
max_depth: Limiting the maximum depth of the trees prevents the model from overfitting; the value is set to 70 by GridSearchCV.
class_weight: Compared with the site binary classification, the imbalance between positive and negative samples is less severe for the fare binary classification. The class weights are set to 0.2 and 0.8, respectively, according to the ratio of the numbers of samples in class 0 and class 1.
XGBoost parameter description:
num_boost_round: Same principle as for the site binary classification. Because part of the data is set aside as historical data, less data is used for training, and fewer iterations are needed for the model to reach its optimum than in the fare multi-classification.
max_depth: Compared with the site binary classification, the number of classes is further reduced (there are far fewer fare classes than stations) and the complexity of each class increases, so increasing max_depth lets the model learn more specific features.
subsample and colsample_bytree: Compared with multi-classification, the binary classification has more features, so lowering subsample and colsample_bytree to 0.8 improves the generalization ability of the model.
scale_pos_weight: Same principle as for the site binary classification. The positive sample weight is set to 2.66 according to the ratio of positive to negative samples.
In 8.7% of the travel records in the test set, the user alights at a station that does not appear in the user's historical records, so for the current test set the upper limit of the accuracy of the scheme is 91.3%. The accuracy of each model is shown in the following table.
Data set | XGBoost | RF |
Default parameters, predictable 91.3% of the data | 73.5% | 77.3% |
Default parameters, all data | 67.1% | 70.6% |
Tuned parameters, predictable 91.3% of the data | 80.4% | 81.1% |
Tuned parameters, all data | 73.4% | 74.1% |
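The relation between the "predictable data" rows and the "all data" rows can be checked with a short calculation. The sketch assumes that every unpredictable record is counted as a wrong prediction, so that the accuracy on all data equals the predictable share multiplied by the accuracy on that share; the XGBoost figures in the table are consistent with this reading.

```python
def overall_accuracy(coverage, accuracy_on_covered):
    """Accuracy on all data when unpredictable records are counted as wrong."""
    return coverage * accuracy_on_covered

# XGBoost, fare binary classification: 91.3% of the test records are predictable.
default_xgb = overall_accuracy(0.913, 0.735)
tuned_xgb = overall_accuracy(0.913, 0.804)
default_percent = round(default_xgb * 100, 1)
tuned_percent = round(tuned_xgb * 100, 1)
```

The two results round to 67.1% and 73.4%, the values listed for all data.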
Relative to the site binary classification algorithm, the share of unpredictable data in the test set falls from 20% to 8.7%, and the accuracy on all data rises from 64.2% to 73.4%. Likewise, this unpredictable portion of the data was examined under the XGBoost and RF models of the deduction amount multi-classification algorithm; the contribution of the accuracy on these records to the overall multi-classification accuracy is shown in the following table:
Model | Multi-class XGBoost | Multi-class RF |
Contribution to accuracy | 0.3% | 0.5% |
As the above table shows, the records that cannot be predicted under binary classification contribute little to the accuracy under fare multi-classification. Even though these records cannot be predicted under binary classification, they therefore do not cause much loss, and the upper limit of the model's accuracy can be raised by increasing the amount of historical data.
Step five: Perform model fusion on the prediction probability results obtained in step four.
The specific fusion scheme is as follows:
First, the different models within each of the four prediction algorithms are fused; next, the site binary classification is fused with the site multi-classification, and the deduction amount multi-classification with the deduction amount binary classification; finally, the four algorithms are fused to predict the unilateral transaction fee, and the output fee value is taken as the basis for fee deduction.
Intra-algorithm fusion
First, fusion is performed inside each of the four prediction algorithms: suitable weights are selected, and the prediction results of the XGBoost model and the RF model of each scheme are combined by weighted averaging to obtain a fusion model for each scheme.
The model weights directly affect the accuracy of the fused model, so choosing them accurately is very important. To set the weights, the ratio of the accuracies of the two models is used as the initial value, and the weight of each model is then adjusted up and down around that initial value. After experimental analysis and repeated adjustment, the weight at which the accuracy reaches its maximum is selected as the final weight for the scheme.
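The weight selection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the search radius and step, and the normalisation of the accuracy ratio into a single weight `w0 = acc_a / (acc_a + acc_b)` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Probs = List[List[float]]  # one row of class probabilities per record


def fuse(p_a: Probs, p_b: Probs, w_a: float) -> Probs:
    """Weighted average of two models' class probabilities (weight w_a on model A)."""
    return [[w_a * a + (1.0 - w_a) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(p_a, p_b)]


def accuracy(probs: Probs, labels: Sequence[int]) -> float:
    """Share of records whose highest-probability class equals the label."""
    hits = sum(1 for row, y in zip(probs, labels)
               if max(range(len(row)), key=row.__getitem__) == y)
    return hits / len(labels)


def search_weight(p_a: Probs, p_b: Probs, labels: Sequence[int],
                  radius: float = 0.2, step: float = 0.05) -> Tuple[float, float]:
    """Start from the accuracy ratio, then try weights above and below it."""
    acc_a, acc_b = accuracy(p_a, labels), accuracy(p_b, labels)
    total = acc_a + acc_b
    w0 = acc_a / total if total > 0 else 0.5  # initial weight (assumed normalisation)
    best_w, best_acc = w0, accuracy(fuse(p_a, p_b, w0), labels)
    n = int(round(radius / step))
    for k in range(-n, n + 1):
        w = min(1.0, max(0.0, w0 + k * step))  # keep the weight inside [0, 1]
        acc = accuracy(fuse(p_a, p_b, w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy probabilities standing in for XGBoost and RF outputs (made-up numbers).
p_xgb = [[0.6, 0.4], [0.4, 0.6], [0.55, 0.45]]
p_rf = [[0.45, 0.55], [0.3, 0.7], [0.3, 0.7]]
labels = [0, 1, 1]
best_w, best_acc = search_weight(p_xgb, p_rf, labels)
```

In practice the weight would be tuned on held-out data rather than on the records used for the final accuracy report, so that the reported gain is not inflated by the search itself.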
The accuracy of the models before and after fusion for each scheme is as follows:
It can be seen that fusion inside each algorithm improves the accuracy of model prediction by a small margin.
Inter-algorithm fusion
Two-classification and multi-classification algorithm
For both the site and the deduction amount, two-classification outperforms multi-classification; however, unlike two-classification, multi-classification theoretically has no unpredictable records. Moreover, two-classification and multi-classification view the problem from different perspectives, so fusing them lets their advantages complement each other and improves the overall grasp of the problem. The two-classification and multi-classification models for the site, and likewise for the deduction amount, are therefore fused by linear weighting to obtain one fusion model for the site and one for the deduction amount. The effect of the fusion is as follows:
After the two site sub-models are fused, the accuracy improves to a certain extent. After the two deduction amount sub-models are fused, however, the prediction accuracy decreases instead. This is probably because the accuracy of the multi-classification prediction differs too much from that of the two-classification, so the overall performance drops after fusion. The weight of the deduction amount two-classification model in the deduction amount fusion model is therefore set to 100%.
Integral fusion
The fused site prediction model and the fused deduction amount prediction model are then fused to form the final deduction amount prediction model.
The site prediction result is transformed first: from the known starting station and the predicted probabilities of travelling to each site, the fare from the starting station to each possible terminal and its probability can be calculated, which converts the site prediction into a fare prediction. The site prediction result and the fare prediction result are then linearly weighted to obtain the final overall fusion model.
Site prediction model weight | Fare prediction model weight | Accuracy | Improvement |
0.5 | 0.5 | 75.60% | +1.25% |
0.4 | 0.6 | 75.00% | +0.65% |
0.6 | 0.4 | 76.26% | +1.91% |
0.7 | 0.3 | 76.39% | +2.05% |
0.8 | 0.2 | 76.28% | +1.94% |
When the weight ratio of the site prediction model to the fare prediction model is 7:3, the prediction accuracy is highest; the accuracy of the final fusion model on the test set is 76.39%.
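The conversion from site probabilities to fare probabilities, followed by the 7:3 linear weighting, could look roughly like the sketch below. The fare table, the station names and the probability values are hypothetical, and the helper names `station_to_fare_probs` and `blend` are introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, Tuple


def station_to_fare_probs(start: str,
                          station_probs: Dict[str, float],
                          fare_table: Dict[Tuple[str, str], int]) -> Dict[int, float]:
    """Sum the probabilities of all candidate terminals that share the same fare."""
    fare_probs: Dict[int, float] = defaultdict(float)
    for end, p in station_probs.items():
        fare_probs[fare_table[(start, end)]] += p
    return dict(fare_probs)


def blend(site_based: Dict[int, float],
          fare_based: Dict[int, float],
          w_site: float = 0.7) -> Dict[int, float]:
    """Linear weighting of the site-derived and directly predicted fare probabilities."""
    fares = set(site_based) | set(fare_based)
    return {f: w_site * site_based.get(f, 0.0)
               + (1.0 - w_site) * fare_based.get(f, 0.0)
            for f in fares}


# Hypothetical fare table: fare from boarding station "A" to each terminal.
fare_table = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 3}

# Hypothetical output of the fused site model for a trip that started at "A".
site_based = station_to_fare_probs("A", {"B": 0.5, "C": 0.3, "D": 0.2}, fare_table)

# Hypothetical output of the fused deduction amount model for the same trip.
fare_based = {2: 0.3, 3: 0.7}

blended = blend(site_based, fare_based, w_site=0.7)
predicted_fare = max(blended, key=blended.get)
```

The default `w_site=0.7` mirrors the best-performing row of the table above; any other row can be reproduced by passing a different weight.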
1. Influence factors for evaluating accuracy
1. Fare accuracy and site popularity
In all travel records, let:
Station_start: the number of records in which a station appears as the boarding station;
Station_end: the number of records in which a station appears as the alighting station;
T: the total number of records.
As shown in fig. 7, which is a schematic diagram of the evaluation accuracy provided by the embodiment of the present invention, the abscissas of the two graphs represent the boarding and alighting popularity of each station respectively, and the ordinate represents the accuracy of fare prediction at that station. The left and right graphs are similarly distributed: for stations with low popularity the variance of fare accuracy is large, and the variance decreases as station popularity increases. Therefore, the more data there is for a given station, the more stable the accuracy of the prediction result.
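One way such a popularity-versus-accuracy comparison might be computed is sketched below. The source's exact popularity formula is not reproduced in the text, so the sketch assumes that boarding popularity is Station_start divided by T; the record layout and function names are likewise assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Trip = Tuple[str, str]  # (boarding station, alighting station)


def boarding_popularity(records: Sequence[Trip]) -> Dict[str, float]:
    """Assumed definition: Station_start / T for every boarding station."""
    total = len(records)
    counts = Counter(start for start, _ in records)
    return {station: n / total for station, n in counts.items()}


def accuracy_by_boarding_station(records: Sequence[Trip],
                                 true_fares: Sequence[int],
                                 predicted_fares: Sequence[int]) -> Dict[str, float]:
    """Share of correctly predicted fares among trips that start at each station."""
    hits: Dict[str, int] = defaultdict(int)
    seen: Dict[str, int] = defaultdict(int)
    for (start, _), truth, pred in zip(records, true_fares, predicted_fares):
        seen[start] += 1
        hits[start] += int(truth == pred)
    return {station: hits[station] / seen[station] for station in seen}


# Made-up trips and fares, only to show the shape of the computation.
records: List[Trip] = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "B")]
true_fares = [2, 3, 2, 2]
predicted_fares = [2, 2, 2, 2]

popularity = boarding_popularity(records)
station_accuracy = accuracy_by_boarding_station(records, true_fares, predicted_fares)
```

Plotting `popularity` against `station_accuracy` per station would give a scatter of the kind the figure describes; the alighting-side graph would use the second element of each trip instead.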
2. Fare accuracy and fare frequency
As shown in fig. 8, which is a schematic diagram of fare accuracy and fare frequency provided by an embodiment of the present invention, the abscissa represents the frequency of occurrence of each fare, and the ordinate represents the accuracy of the prediction when the true value is that fare. For every fare class the accuracy is higher than 0.5, and the more frequent a fare class is, the higher its accuracy. It follows that increasing the amount of data for a given fare class helps improve the accuracy for that class.
3. Fare accuracy and user historical data volume
As shown in fig. 9, which is a schematic diagram comparing fare accuracy with user historical data volume provided by the embodiment of the present invention, the abscissa represents the amount of a user's historical data, and the ordinate represents the accuracy of fare prediction for that user's trips. The more historical records a user has, the higher the accuracy of fare prediction. It follows that increasing the number of user history records helps improve the accuracy of fare prediction.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for determining the unilateral transaction fee of an electronic ticket is characterized by comprising the following steps:
setting a unilateral transaction fee determination model, wherein the unilateral transaction fee model is set in the following process: collecting data transaction records of the electronic ticket, cleaning and integrating to obtain a complete data transaction record of the electronic ticket, and performing model training test by adopting a set learning model based on user characteristics and site characteristics extracted from the obtained complete data transaction record of the electronic ticket to obtain the model;
and when the electronic ticket generates the unilateral transaction record, determining the missing site record and the corresponding fee of the unilateral transaction record by adopting the set model.
2. The determination method according to claim 1, wherein the data transaction records of the electronic ticket are collected, the complete data transaction records of the electronic ticket are obtained after cleaning and integration, and the samples required by the model training test are obtained based on the user characteristics and the site characteristics extracted from the obtained complete data transaction records of the electronic ticket according to at least one set application algorithm;
the set learning model adopts an XGboost learning model and/or a random forest RF learning model.
3. The determination method according to claim 2, wherein when there are a plurality of application algorithms, the model is obtained by:
respectively adopting the set learning models to train and test samples obtained by different application algorithms to obtain test results corresponding to the different application algorithms, adopting a weighting mode to internally fuse the test results obtained by the different application algorithms, and then weighting and fusing the test results among the different application algorithms to obtain the model.
4. The determination method of claim 2, wherein the application algorithm is: a missing site multi-classification algorithm, a deduction amount multi-classification algorithm, a missing site two-classification algorithm and/or a deduction amount two-classification algorithm.
5. The method of claim 4, wherein the missing site two-classification algorithm is obtained by converting the missing site multi-classification algorithm in a line expansion manner;
the deduction amount two-classification algorithm is obtained by converting the deduction amount multi-classification algorithm in a line expansion manner.
6. The method of determining as set forth in claim 4, wherein the calculation of the missing site two-classification algorithm and the deduction amount two-classification algorithm further includes:
and dividing the complete data transaction record of the electronic ticket into historical data, training data and test data, wherein the proportion of the historical data is greater than that of the training data and the test data, the historical data is used for increasing user characteristics, and the training data and the test data are used during model training test.
7. A system for determining a fee for a single-sided transaction of an electronic ticket, the system comprising: a background processing system and a gate processing unit, wherein,
the background processing system is used for setting a unilateral transaction fee determination model and sending the fee determination model to the gate processing unit, and the unilateral transaction fee model is set in the following process: collecting data transaction records of the electronic ticket, cleaning and integrating to obtain a complete data transaction record of the electronic ticket, and performing model training test by adopting a set learning model based on user characteristics and site characteristics extracted from the obtained complete data transaction record of the electronic ticket to obtain the model;
and the gate processing unit is used for receiving the unilateral transaction fee determination model from the background processing system, and determining missing site records and corresponding fees of the unilateral transaction records by adopting the set model when the unilateral transaction records are generated by scanning the electronic tickets.
8. The determination system according to claim 7, wherein the background processing system is further configured to collect data transaction records of the electronic ticket, perform cleaning and integration to obtain a complete data transaction record of the electronic ticket, and obtain a sample required by the model training test based on user characteristics and site characteristics extracted from the obtained complete data transaction record of the electronic ticket and obtained according to at least one set application algorithm;
the set learning model adopts an XGboost learning model and/or a random forest RF learning model.
9. An apparatus for determining a fee for a one-sided transaction of an electronic ticket, the apparatus comprising: a model setting module and a transceiver module, wherein,
the transceiver module is used for receiving the data transaction record of the electronic ticket and for sending the model;
the model setting module is used for acquiring data transaction records of the electronic ticket, cleaning and integrating the data transaction records to obtain complete data transaction records of the electronic ticket, and performing model training test by adopting the set learning model based on user characteristics and site characteristics extracted from the obtained complete data transaction records of the electronic ticket to obtain a unilateral transaction determination model.
10. An apparatus for determining a fee for a one-sided transaction of an electronic ticket, the apparatus comprising: a transceiver module and a processing module, wherein,
the transceiver module is used for receiving the unilateral transaction determination model;
and the processing module is used for determining the missing site record and the corresponding expense of the unilateral transaction record by adopting the set model when the unilateral transaction record is generated by scanning the electronic ticket.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810371303.2A CN108596664B (en) | 2018-04-24 | 2018-04-24 | Method, system and device for determining unilateral transaction fee of electronic ticket |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108596664A (en) | 2018-09-28 |
CN108596664B (en) | 2021-01-05 |
Family
ID=63614803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810371303.2A Active CN108596664B (en) | 2018-04-24 | 2018-04-24 | Method, system and device for determining unilateral transaction fee of electronic ticket |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108596664B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | Address after: 518131 room 3502, building 1, Huide building, Beizhan community, Minzhi street, Longhua District, Shenzhen City, Guangdong Province; Patentee after: Shucheng Technology Co.,Ltd.; Patentee after: Beijing University of Posts and Telecommunications; Address before: 510405 shop 302, North block, 3rd floor, 77 shiliuqiao Road, Baiyun District, Guangzhou City, Guangdong Province; Patentee before: PANCHAN TECHNOLOGY CO.,LTD.; Patentee before: Beijing University of Posts and Telecommunications |