CN104462187B

CN104462187B - Crowd intelligence data validity verification method based on maximum likelihood ratio

Info

Publication number: CN104462187B
Application number: CN201410568300.XA
Authority: CN
Inventors: 闻于天; 张奇; 田晓华; 杨峰; 王新兵
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2014-10-22
Filing date: 2014-10-22
Publication date: 2017-09-08
Anticipated expiration: 2034-10-22
Also published as: CN104462187A

Abstract

The present invention provides a method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio. The server classifies all accumulated data by observed value; for all data of the same observed component, it computes the probability density function by kernel density estimation and computes the confidence probability. The server then waits for users to upload new data. A measurer takes multiple measurements with a mobile terminal to obtain a set of data, which is uploaded to the server together with the observed component as judged by the measurer. The server compares the uploaded data with the database and computes the likelihood reliability of the data set using the maximum-likelihood-ratio-based verification method. Finally, the server decides whether to accept the data set, pays remuneration according to its reliability, updates the database for that observed component, and recomputes the probability density function and the confidence probability.

Description

Crowd Intelligence Data Validity Verification Method Based on Maximum Likelihood Ratio

Technical Field

The present invention relates to the field of communication technology, and in particular to a method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio.

Background Art

Crowdsourcing has broad prospects in smartphone applications. With the rapid development of Internet technology, the number of individuals in the network is growing quickly, and the connections among them are becoming ever closer. Crowdsourcing services have emerged in this environment. How to build a crowdsourcing service platform effectively and promote resource sharing in society is an important problem for next-generation Internet research.

Today, information providers often adopt a crowdsourcing incentive mechanism: they delegate the work of collecting information to dispersed users and reward them for the information or services they provide. For example, if someone wants to know how congested a certain road section is, information provided by users who are on that section is both faster and more accurate than information obtained by the provider sending someone to survey it. Mobile phone sensing is developing rapidly, and a wide variety of sensors are being built into smartphones, such as accelerometers, GPS receivers, proximity sensors, and cameras. Using the smartphone sensors of these dispersed users to obtain the required information and upload it to the provider is becoming an increasingly popular approach.

Despite its many advantages, crowdsourcing also has unavoidable drawbacks. Because the people taking the measurements are not professionally trained, the observation error of the measured data is generally larger, and the validity of different data varies more than for data obtained by traditional methods. In extreme cases, if a measurer is very unfamiliar with the object being measured, or even operates the device incorrectly, the data may deviate severely from the normal level, and using such data damages the validity of the sample.

This kind of error is specific to crowdsourcing scenarios and is referred to below as observation error; all other error is referred to as measurement error. Both kinds of error can usually be compensated for by a larger sample size, but the aim here is to evaluate and compare crowd intelligence data quantitatively using probability theory, and further to select from the data the portion with higher relative validity, that is, the portion with smaller observation error.

A search of the prior art literature found that M. Ramadan et al., in "Implementation and evaluation of cooperative video streaming for mobile devices", published at the International Symposium on Personal, Indoor and Mobile Radio Communications in 2008, proposed a video sharing mechanism based on cooperative downloading. That mechanism, however, requires all participating users to know one another and to actively form a wireless local area network, which greatly limits its application scenarios. L. Keller et al., in "MicroCast: cooperative video streaming on smartphones", published at the International Conference on Mobile Systems, Applications, and Services in 2012, proposed a mechanism that accelerates cooperative video downloading by means of wireless communication between mobile phones. That mechanism requires all participating users to want to download the same video, a condition that is not met in most cases, so it is also severely limited.

Summary of the Invention

In view of the defects in the prior art, the object of the present invention is to provide a method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio, which uses the large amount of data already accumulated in the server database to better screen valid data and to reduce the judgment bias caused by recording erroneous data.

A method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio according to the present invention comprises the following steps:

Step 1: Experimentally obtain the prior probability $p_{lj}$, where $p_{lj}$ denotes the probability that, for a given observed component $j$, an untrained measurer judges the observed component $j$ to be $l$;

Step 2: The server classifies all accumulated data by observed value; for all data of the same observed component $j$, it computes the probability density function by kernel density estimation and computes the confidence probability $\alpha_j$;

Step 3: The server waits for users to upload new data;

Step 4: Measurer $i$ takes multiple measurements with a mobile terminal to obtain a set of data, which is uploaded to the server together with the observed component as judged by the measurer;

Step 5: The server compares the data provided by the user with the database and computes the likelihood reliability of the data set;

Step 6: The server decides whether to accept the data set and pays remuneration according to its reliability; if the server accepts the data set, the method returns to Step 2, the database for observed component $j$ is updated, and the probability density function and the confidence probability $\alpha_j$ are recomputed by the method of Step 2.

Preferably, Step 1 comprises the following steps:

Step 1.1: In the training process of indoor positioning based on Wi-Fi signal strength, the measurer must determine his or her own position in the room, which introduces observation error; the measurer's observation error is abstracted as the error in estimating, from a point in the room, the distances to the two nearest walls;

Step 1.2: The prior probability $p_{lj}$ is determined by a single preliminary experiment and then applied to all indoor positioning activities. Specifically, multiple measurers standing at fixed points $j$ in a room without distance references each judge their own position $l$, and the distribution of their judgments is taken as $p_{lj}$;

Step 1.3: Where $p_{lj}$ cannot be determined by a preliminary experiment, the Kronecker delta may be used:

$$p_{lj} = \delta_{lj} = \begin{cases} 0, & \text{if } l \neq j \\ 1, & \text{if } l = j \end{cases}$$

where $\delta_{lj}$ denotes the Kronecker delta.
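Steps 1.2 and 1.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `estimate_prior`, the 0-based component indices, and the sample `trials` are all assumptions made for the example. Each trial is a pair (true position $j$, judged position $l$); components with no trials fall back to the Kronecker delta.

```python
from collections import Counter

def estimate_prior(judgments, n_components):
    """Estimate p[l][j] = P(judged l | true j) from (true_j, judged_l) pairs.

    Components j with no recorded trials fall back to the Kronecker delta,
    as in Step 1.3.
    """
    counts = Counter(judgments)                 # (j, l) -> number of trials
    totals = Counter(j for j, _ in judgments)   # j -> number of trials
    p = [[0.0] * n_components for _ in range(n_components)]
    for j in range(n_components):
        for l in range(n_components):
            if totals[j] == 0:
                p[l][j] = 1.0 if l == j else 0.0   # delta fallback
            else:
                p[l][j] = counts[(j, l)] / totals[j]
    return p

# Hypothetical experiment: three positions, volunteers placed at 0 and 1 only.
trials = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 2)]
p = estimate_prior(trials, 3)
```

Each column of `p` sums to 1, so it can be read directly as the distribution of judgments for one true position.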

Preferably, Step 2 comprises the following steps:

Step 2.1: Each observed component in the server database corresponds to an accumulated data set $D_j$, $j = 1, 2, 3, \ldots, N$, where $N$ denotes the total number of observed components. The elements $D_j^k$, $k = 1, 2, 3, \ldots, T$, of $D_j$ follow the distribution $f^j(x)$, where $T$ denotes the number of data for each observed component and $f^j(x)$ denotes the probability density function of observed component $j$; $T = |D_j| \gg M$, where $M$ denotes the number of data a measurer uploads at one time. Then

$$f^j(x) = \frac{1}{T} \sum_{k=1}^{T} K_h\left(x - D_j^k\right)$$

where $K_h$ denotes the kernel density function and $x$ denotes the data variable;

Step 2.2: Let $n_s(x)$ denote the number of data already in the database within $[x-h, x+h]$, where $h$ denotes the bandwidth of the kernel density function $K_h$;

$n_s(x)$ can take $T+1$ values and follows the distribution:

$$P\left(n_s(x) = n_s\right) = C_T^{n_s} \, P\left(|x - D_j^k| < h\right)^{n_s} \left(1 - P\left(|x - D_j^k| < h\right)\right)^{T - n_s}$$

where $P(\cdot)$ denotes the probability mass function of $n_s(x)$; $n_s$ denotes a possible value of $n_s(x)$ and may be any of $0, 1, \ldots, T$; $C_T^{n_s}$ denotes the number of combinations of $n_s$ elements chosen from $T$ distinct elements; and $h$ denotes the bandwidth of the kernel density function $K_h$;

Step 2.3: The expectation of $r_{il}$ is determined from the database size and used as the confidence probability $\alpha$, where $r_{il}$ denotes the probability density that the data uploaded by observer $i$ belong to observed component $l$. Because different observed values have accumulated different amounts of data, different observed values have different confidence probabilities $\alpha_j$.
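The kernel density estimate of Step 2.1 and the count $n_s(x)$ of Step 2.2 can be sketched as below. The sketch assumes a uniform kernel; the function names, the sample readings in `D_j`, and the bandwidth value are illustrative assumptions, not values from the patent.

```python
def uniform_kernel(u, h):
    """Uniform kernel K_h(u): 1/(2h) for |u| < h, zero elsewhere."""
    return 1.0 / (2.0 * h) if abs(u) < h else 0.0

def kde(x, data, h):
    """f^j(x) = (1/T) * sum_k K_h(x - D_j^k) for one component's data set."""
    return sum(uniform_kernel(x - d, h) for d in data) / len(data)

def count_in_band(x, data, h):
    """n_s(x): number of stored data closer than h to x."""
    return sum(1 for d in data if abs(x - d) < h)

# Hypothetical accumulated readings for one observed component (dBm).
D_j = [-52.0, -50.5, -50.0, -49.5, -47.0]
h = 1.0
density = kde(-50.0, D_j, h)
n_s = count_in_band(-50.0, D_j, h)
```

With the uniform kernel the two quantities are tied together: `density` equals `n_s / (2 * h * len(D_j))`, which is why the distribution of $n_s(x)$ in Step 2.2 also describes the variability of the density estimate.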

Preferably, Step 4 comprises the following steps:

Step 4.1: The measurer obtains a set of $M$ data, $\{x_i^t\}$, $t = 1, \ldots, M$,

where the set is the data obtained by measurer $i$ from multiple measurements of the same observed component; $j$ denotes the true value of the component to be observed for this set of $M$ data, $j \in \{1, 2, 3, \ldots, N\}$, and $N$ denotes the total number of observed components; $x_i^t$ denotes the $t$-th datum uploaded by measurer $i$ and follows the distribution $f^j(x)$ corresponding to component $j$;

Step 4.2: The observation error manifests itself in the measurer judging $j$ to be $j'$ and reporting $j'$ to the server.

Preferably, Step 5 comprises the following steps:

Step 5.1: After obtaining the data, the server computes all $\{r_{il}\}$.

In the expression for $r_{il}$, $M$ denotes the number of data the measurer uploads at one time; $f(\cdot)$ denotes the probability density function of an observed component; $l$ denotes a candidate observed component index; $x_{ij'}^t$ denotes the $t$-th datum uploaded by observer $i$, who has judged it to belong to observed component $j'$; and $N$ denotes the total number of observed components. The physical meaning of $r_{il}$ is the probability density that the uploaded data belong to observed component $l$; it is largest when $l = j$;

Step 5.2: A parameter is defined in which $\alpha_j$ is called the confidence probability and $p_{lj'}$ denotes the probability that the measurer judges observed component $j'$ to be observed component $l$. When $\alpha_j = 1$, the parameter is the logarithm of the maximum possible probability density of the measured data; for data sets of the same length, the one with the larger value of the parameter is more credible;

Step 5.3: With this parameter, the validity of all crowd intelligence data can be ranked, and the top several data sets are taken as required.
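The scoring and ranking of Step 5 can be sketched as follows, under stated assumptions. The sketch assumes that $r_{il}$ is the joint density of the $M$ uploaded data under component $l$, that is, the product of $f^l(x^t)$ over $t$; this reading is consistent with the surrounding definitions, but it is an assumption rather than the patent's stated formula. The Gaussian densities stand in for kernel density estimates, and the scores passed to `top_k` are made-up numbers standing in for the parameter of Step 5.2.

```python
import math

def log_r(samples, density_l):
    """log r_il, ASSUMING r_il = prod_t f^l(x_t) (independent samples)."""
    total = 0.0
    for x in samples:
        fx = density_l(x)
        if fx <= 0.0:
            return float("-inf")   # data impossible under component l
        total += math.log(fx)
    return total

def gaussian(mu, sigma):
    """Stand-in density; a deployment would use the KDE estimate of f^l."""
    def f(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return f

def top_k(scores, k):
    """Rank upload ids by score, highest first, and keep the first k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

densities = [gaussian(-50.0, 2.0), gaussian(-60.0, 2.0)]   # components 0, 1
upload = [-49.0, -51.0, -50.5]                             # hypothetical data
log_rs = [log_r(upload, f) for f in densities]
best_l = max(range(len(densities)), key=lambda l: log_rs[l])
ranked = top_k({"a": -7.1, "b": -4.9, "c": -12.3}, 2)
```

Working in the log domain avoids numerical underflow when $M$ is large, and matches the description of the Step 5.2 parameter as a logarithm of a probability density.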

Preferably,

in Step 2.1, the kernel density function is taken to be the uniform kernel: $h$ is small enough that the data are approximately uniformly distributed within the bandwidth, and the probability of falling within this region is $P_s = P(|x - D_j^k| < h) = f(x) \cdot 2h$;

in Step 2.3, all the data have value for adoption; one method of computing the expectation $E\{r_{il}\}$ of $r_{il}$ is as follows:

In that expression, $f^l(x^t)$ denotes the probability density of observed component $l$ at the value $x^t$; $l$ denotes the $l$-th observed component; $t$ denotes the $t$-th datum uploaded by the observer; $M$ denotes the number of data the measurer uploads at one time; $!$ denotes the factorial; $e$ denotes the base of the natural logarithm; and $P_s = P(|x - D_j^k| < h) = f(x_i) \cdot 2h$, with $f(x_i)$ obtained by kernel density estimation. The expression contains no variable other than $T$, which establishes the relationship between the confidence probability $\alpha_j$ and the database size $T$.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention can correct the judgment error of observers of crowd intelligence data by means of a preliminary experiment;

2. The present invention can evaluate the validity of newly received crowd intelligence data on the basis of an existing reliable data set, and thereby make sound decisions on which new data to accept or reject.

Brief Description of the Drawings

Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawing:

Fig. 1 is a flow chart of the steps of the present invention.

Detailed Description

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art can make various variations and improvements without departing from the concept of the present invention, and all of these fall within the scope of protection of the present invention.

The present invention provides a method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio, comprising the steps of: experimentally obtaining the prior probability that an untrained ordinary person misjudges an observed component; the server classifying all accumulated data by observed value; for all data of the same observed component, computing the probability density function by kernel density estimation and computing the confidence probability; the server waiting for users to upload new data; a measurer taking multiple measurements with a mobile terminal to obtain a set of data, which is uploaded to the server together with the observed component as judged by the measurer; the server comparing the uploaded data with the database and computing the likelihood reliability of the data set using the maximum-likelihood-ratio-based verification method; and the server deciding whether to accept the data set, paying remuneration according to its reliability, updating the database for that observed component, and recomputing the probability density function and the confidence probability.

Specifically, the present invention provides a method for verifying the validity of crowd intelligence data based on the maximum likelihood ratio, which uses the large amount of data already accumulated in the server database to better screen valid data and to reduce the judgment bias caused by recording erroneous data.

Referring to Fig. 1, the present invention is implemented by the following technical solution, which comprises the following steps:

Step 1: Experimentally obtain the prior probability $p_{lj}$, which denotes the probability that, for a given observed component $j$, an untrained ordinary person judges it to be $l$.

Step 2: The server classifies all accumulated data by observed value. For all data of the same observed component $j$, it computes the probability density function by kernel density estimation and computes the confidence probability $\alpha_j$.

Step 3: The server waits for users to upload new data.

Step 4: Measurer $i$ takes multiple measurements with a mobile terminal to obtain a set of data, which is uploaded to the server together with the observed component as judged by the measurer.

Step 5: The server compares the data provided by the user with the database and computes the likelihood reliability of the data set using the maximum-likelihood-ratio-based verification method.

Step 6: The server decides whether to accept the data set and pays remuneration according to its reliability; if the server accepts the data set, the method returns to Step 2, the database for observed component $j$ is updated, and the probability density function and the confidence probability $\alpha_j$ are recomputed by the method of Step 2.
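The accept-and-update loop of Steps 3 to 6 can be sketched as below. The text does not specify the acceptance rule or the payment scheme, so everything concrete here is an assumption: the class name `ValidationServer`, the fixed `threshold`, and the placeholder `closeness_score`, which merely stands in for the likelihood reliability of Step 5.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationServer:
    """Toy version of the Step 3-6 loop; the acceptance rule is an assumption."""
    database: dict = field(default_factory=dict)  # component j -> list of data
    threshold: float = 0.0                         # hypothetical cut-off

    def submit(self, reported_j, samples, score_fn):
        """Score an upload (Step 5), then accept or reject it (Step 6)."""
        score = score_fn(self.database, reported_j, samples)
        accepted = score >= self.threshold
        if accepted:
            # Accepted data join the database; density estimates built from
            # it would then be recomputed, as in Step 2.
            self.database.setdefault(reported_j, []).extend(samples)
        return accepted, score

def closeness_score(database, j, samples):
    """Placeholder score: negative mean distance to the stored mean."""
    stored = database.get(j, [])
    if not stored:
        return float("-inf")
    mean = sum(stored) / len(stored)
    return -sum(abs(x - mean) for x in samples) / len(samples)

server = ValidationServer(database={1: [-50.0, -51.0, -49.0]}, threshold=-2.0)
ok, s1 = server.submit(1, [-50.5, -49.5], closeness_score)
bad, s2 = server.submit(1, [-70.0, -71.0], closeness_score)
```

Passing the scoring function in as a parameter keeps the loop independent of how the reliability is computed, so the placeholder can be swapped for a likelihood-based score without changing the server logic.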

The implementation of the present invention is described in more detail below.

Step 1. Suppose the server needs to measure a certain quantity by means of crowd intelligence data, and that the quantity comprises several observed components. Owing to observation error, a measurer misjudges an observed component $j$ as another observed component $l$ with probability $p_{lj}$. The prior probability $p_{lj}$ is first obtained by experiment.

For example, in the training process of indoor positioning based on Wi-Fi signal strength, the measurer must determine his or her own position in the room, which introduces observation error. The measurer's observation error can be abstracted as the error in estimating, from a point in the room, the distances to the two nearest walls. A single preliminary experiment suffices to determine the distribution $p_{lj}$, which can then be applied to all indoor positioning activities. A large number of volunteers are recruited to judge their own position $l$ while standing at fixed points $j$ in a room without prominent distance references, and the distribution of their judgments is taken as $p_{lj}$.

If $p_{lj}$ cannot be determined by a preliminary experiment, the Kronecker delta function may be used.

Step 2. Each observed component in the server database corresponds to an accumulated data set $D_j$, $j = 1, 2, 3, \ldots, N$, whose elements $D_j^k$, $k = 1, 2, 3, \ldots, T$, follow the distribution $f^j(x)$, where $T = |D_j|$ is the size of the data set. It is assumed that $f^j(x)$ can be recovered from the data set with sufficient accuracy by kernel density estimation. Then

$$f^j(x) = \frac{1}{T} \sum_{k=1}^{T} K_h\left(x - D_j^k\right)$$

The kernel density function may take any other form; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the present invention. For example, the kernel density function may be taken to be the uniform kernel: $h$ is small enough that the data are approximately uniformly distributed within the bandwidth, and the probability of falling within this region is $P_s = P(|x - D_j^k| < h) = f^j(x) \cdot 2h$.
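For the uniform kernel, the estimate reduces to a simple count, which the following short derivation makes explicit. It assumes the usual normalization of the uniform kernel, $K_h(u) = 1/(2h)$ for $|u| < h$, and writes $n_s(x)$ for the number of stored data within $h$ of $x$.

```latex
K_h(u) = \frac{1}{2h}\,\mathbf{1}\{|u| < h\}
\;\Longrightarrow\;
f^j(x) = \frac{1}{T}\sum_{k=1}^{T} K_h\!\left(x - D_j^k\right)
       = \frac{n_s(x)}{2hT},
\qquad
n_s(x) = \sum_{k=1}^{T} \mathbf{1}\{|x - D_j^k| < h\}.

% Each datum falls in the band independently with probability P_s, so
\mathbb{E}\left[n_s(x)\right] = T P_s
\;\Longrightarrow\;
\mathbb{E}\!\left[\frac{n_s(x)}{2hT}\right] = \frac{P_s}{2h} \approx f^j(x).
```

The last line is the relation $P_s = f^j(x) \cdot 2h$ read in reverse: on average, the count-based estimate recovers the density when $h$ is small.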

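Let $n_s(x)$ denote the number of data already in the database within $[x-h, x+h]$. $n_s(x)$ can take $T+1$ values, and its distribution satisfies

$$P\left(n_s(x) = n_s\right) = C_T^{n_s} \, P\left(|x - D_j^k| < h\right)^{n_s} \left(1 - P\left(|x - D_j^k| < h\right)\right)^{T - n_s}$$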
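The distribution of $n_s(x)$ is binomial with $T$ trials and success probability $P_s$. A small numerical check, with a made-up density value, bandwidth, and database size chosen only for illustration:

```python
import math

def band_count_pmf(n_s, T, p_s):
    """P(n_s(x) = n_s) = C(T, n_s) * p_s**n_s * (1 - p_s)**(T - n_s)."""
    return math.comb(T, n_s) * p_s ** n_s * (1.0 - p_s) ** (T - n_s)

T = 200                      # hypothetical database size for one component
f_x, h = 0.1, 0.25           # hypothetical density value and bandwidth
p_s = f_x * 2 * h            # P_s = f(x) * 2h
pmf = [band_count_pmf(n, T, p_s) for n in range(T + 1)]
total = sum(pmf)
mean = sum(n * q for n, q in enumerate(pmf))
```

The probabilities over the $T+1$ possible values sum to 1, and the mean equals $T P_s$, so a larger database gives a proportionally larger expected count at each point.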

Because different observed components accumulate different amounts of data, different observed components have different confidence probabilities $\alpha_j$. The confidence probability $\alpha_j$ is used to gauge the adoption value of the data uploaded by a user, namely the data uploaded by user $i$ and judged by that user to belong to observed component $j'$. If $r_{il}$ denotes the probability density that these data belong to observed component $l$, then the expectation of $r_{il}$ can serve as the confidence probability $\alpha$. One method of computing $E\{r_{il}\}$ is as follows.

In that expression, $P_s = P(|x - D_j^k| < h) = f(x_i) \cdot 2h$, with $f(x_i)$ obtained by kernel density estimation. The expression contains no variable other than $T$, which establishes the relationship between the confidence probability $\alpha_j$ and the database size $T$.

Step 3. The server waits for users to upload new data.

Step 4. Measurer $i$ obtains a set of $M$ data, $\{x_i^t\}$, $t = 1, \ldots, M$, for a certain measurement component.

$j$ denotes the true value of the measurement component for this data set, $j \in \{1, 2, 3, \ldots, N\}$. $x_i^t$ follows the distribution $f^j(x)$ corresponding to component $j$. The observation error manifests itself in the measurer judging observed component $j$ to be $j'$ and reporting $j'$ to the server.

Step 5. After obtaining the data, the server computes all $\{r_{il}\}$.

$r_{il}$ is largest when $l = j$. A parameter is then defined.

With this parameter, the system can rank the validity of all crowd intelligence data and take the top several data sets as required.

The environment parameters of this embodiment are:

Mobile terminals: six Android smartphones, all Nexus 4, each with a 1.5 GHz Snapdragon APQ8064 CPU and 2 GB of RAM, and all running Android Jelly Bean (4.2). The six smartphones were used side by side as test phones for indoor positioning.

Server: Acer 4930G laptop with a Core dual-core processor, 2 GB of memory, and a 2 GHz clock frequency.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims

1. A method for verifying the validity of crowd intelligence data based on maximum likelihood ratio, characterized in that it comprises the following steps:

Step 1: Experimentally obtain the prior probability $p_{lj}$, where $p_{lj}$ denotes the probability that, for a given observed component $j$, an untrained measurer judges the observed component $j$ to be the observed component $l$;

Step 2: The server classifies all accumulated data by observed value; for all data of the same observed component $j$, it computes the probability density function by kernel density estimation and computes the confidence probability $\alpha_j$;

Step 3: The server waits for users to upload new data;

Step 4: Measurer $i$ takes multiple measurements with a mobile terminal to obtain a set of data, which is uploaded to the server together with the observed component as judged by the measurer;

Step 5: The server compares the data provided by the user with the database and computes the likelihood reliability of the data set;

Step 6: The server decides whether to accept the data set and pays remuneration according to its reliability; if the server accepts the data set, the method returns to Step 2, the database for observed component $j$ is updated, and the probability density function and the confidence probability $\alpha_j$ are recomputed by the method of Step 2.

2. The method for verifying the validity of crowd intelligence data based on maximum likelihood ratio according to claim 1, characterized in that Step 1 comprises the following steps:

Step 1.1: In the training process of indoor positioning based on Wi-Fi signal strength, the measurer must determine his or her own position in the room, which introduces observation error; the measurer's observation error is abstracted as the error in estimating, from a point in the room, the distances to the two nearest walls;

Step 1.2: The prior probability $p_{lj}$ is determined by a single preliminary experiment and then applied to all indoor positioning activities; specifically, multiple measurers each observe a certain observed component $j$ and judge it to be an observed component $l$, and the distribution of their judgments is taken as $p_{lj}$;

Step 1.3: Where $p_{lj}$ cannot be determined by a preliminary experiment, the Kronecker delta is used:

$$p_{lj} = \delta_{lj} = \begin{cases} 0, & \text{if } l \neq j \\ 1, & \text{if } l = j \end{cases}$$

where $\delta_{lj}$ denotes the Kronecker delta.

3. the group intelligence data validation method based on maximum likelihood ratio according to claim 1, is characterized in that, described step 2 comprises the steps:

Step 2.1: Each observation component in the database of the server corresponds to the accumulated data set D _j , j=1, 2, 3,..., N, N represents the total number of observation components, each element D _j ^k in D _j , k= 1,2,3,...T, obey f ^j (x) distribution, T represents the total number of data of each observation component, f ^j (x) represents the probability density function that observation component j obeys; T=|D _j |＞ >M, M represents the total number of data uploaded by the measurer at one time, then

<mrow> <msup> <mi>f</mi> <mi>j</mi> </msup> <mrow> <mo>(</mo> <mi>x</mi> <mo>)</mo> </mrow> <mo>=</mo> <mfrac> <mn>1</mn> <mi>T</mi> </mfrac> <munderover> <mo>&Sigma;</mo> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>T</mi> </munderover> <msub> <mi>K</mi> <mi>h</mi> </msub> <mrow> <mo>(</mo> <mi>x</mi> <mo>-</mo> <msup> <msub> <mi>D</mi> <mi>j</mi> </msub> <mi>k</mi> </msup> <mo>)</mo> </mrow> </mrow>

Among them, K _h represents the kernel density function, and x represents the data variable;

Step 2.2: Set That is, n _s (x) represents the number of existing data in the database in [xh,x+h], and h represents the bandwidth of the kernel density function K _h ;

n _s (x) may have T+1 values and obey the distribution:

<mrow> <mi>P</mi> <mrow> <mo>(</mo> <msub> <mi>n</mi> <mi>s</mi> </msub> <mo>(</mo> <mi>x</mi> <mo>)</mo> <mo>=</mo> <msub> <mi>n</mi> <mi>s</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <msubsup> <mi>C</mi> <mi>T</mi> <msub> <mi>n</mi> <mi>s</mi> </msub> </msubsup> <mi>P</mi> <msup> <mrow> <mo>(</mo> <mo>|</mo> <mi>x</mi> <mo>-</mo> <msup> <msub> <mi>D</mi> <mi>j</mi> </msub> <mi>k</mi> </msup> <mo>|</mo> <mo><</mo> <mi>h</mi> <mo>)</mo> </mrow> <msub> <mi>n</mi> <mi>s</mi> </msub> </msup> <msup> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <mi>P</mi> <mo>(</mo> <mrow> <mo>|</mo> <mi>x</mi> <mo>-</mo> <msup> <msub> <mi>D</mi> <mi>j</mi> </msub> <mi>k</mi> </msup> <mo>|</mo> <mo><</mo> <mi>h</mi> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mrow> <mi>T</mi> <mo>-</mo> <msub> <mi>n</mi> <mi>s</mi> </msub> </mrow> </msup> </mrow> 1

where P(·) denotes the probability mass function of n_s(x); n_s(x) denotes the number of existing data in the database that fall within [x−h, x+h]; n_s takes one of the T+1 values 0, 1, ..., T; C_T^{n_s} denotes the number of combinations of n_s elements taken from T different elements; and h denotes the bandwidth of the kernel density function K_h;
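The count n_s(x) and its distribution in Step 2.2 can be illustrated with a short Python sketch. The function names and the numeric values of `T` and `p` are hypothetical; `p` stands for P(|x − D_j^k| < h), which in practice would come from the estimated density.

```python
from math import comb
from typing import Sequence


def count_in_window(x: float, data: Sequence[float], h: float) -> int:
    """n_s(x): number of stored samples D_j^k with |x - D_j^k| < h."""
    return sum(1 for d in data if abs(x - d) < h)


def binomial_pmf(n_s: int, T: int, p: float) -> float:
    """P(n_s(x) = n_s) = C(T, n_s) * p**n_s * (1 - p)**(T - n_s)."""
    return comb(T, n_s) * p**n_s * (1.0 - p) ** (T - n_s)


T = 10
p = 0.2  # hypothetical value of P(|x - D_j^k| < h)
pmf = [binomial_pmf(n, T, p) for n in range(T + 1)]  # one entry per value 0..T
```

The list `pmf` has T+1 entries, matching the T+1 values that n_s can take.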

Step 2.3: Determine the expectation of r_il from the size of the database, and use this expectation as the confidence probability α, where r_il denotes the probability density that the data uploaded by measurer i belongs to observation component l. Because the amount of accumulated data differs from one observation value to another, different observation values have different confidence probabilities α_j.

4. The crowd intelligence data validity verification method based on maximum likelihood ratio according to claim 3, characterized in that step 4 comprises the following steps:

Step 4.1: The measurer obtains a set of M data, denoted by the following formula

where the set denotes the group of data obtained by measurer i from multiple measurements of the same observed component; j denotes the true value of the component observed in this set of M data, j ∈ {1, 2, 3, ..., N}; N denotes the total number of observed components; x_i^t obeys the distribution f^j(x) corresponding to component j, and x_i^t denotes the t-th datum uploaded by measurer i;

Step 4.2: Observation error manifests as the measurer judging component j to be j′ and reporting j′ to the server, that is, the server receives the data labeled as component j′ rather than j.

5. The crowd intelligence data validity verification method based on maximum likelihood ratio according to claim 4, characterized in that step 5 comprises the following steps:

Step 5.1: After receiving the data, the server calculates all {r_il}:

where M denotes the total number of data uploaded by the measurer at one time; f(·) denotes the probability density function obeyed by the observed component; l denotes the index of the observed component; x_ij′^t denotes the t-th datum uploaded by measurer i, which is judged to belong to observation component j′; N denotes the total number of observation components; the physical meaning of r_il is the probability density of the uploaded data belonging to observed component l; evidently, r_il is largest when l = j;
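A minimal Python sketch of Step 5.1 follows. It assumes that r_il is the product of the densities f^l over the M uploaded data, a reading consistent with E{r_il} = E{[f^l(x^t)]^M} in claim 6. The Gaussian densities are hypothetical stand-ins for the kernel density estimates f^l, and all names and sample values are illustrative.

```python
import math
from typing import Callable, Dict, Sequence

Density = Callable[[float], float]


def likelihoods(samples: Sequence[float], densities: Dict[int, Density]) -> Dict[int, float]:
    """Assumed form: r_il = product over t of f^l(x^t), for each component l."""
    return {l: math.prod(f(x) for x in samples) for l, f in densities.items()}


def gaussian(mu: float, sigma: float) -> Density:
    """Stand-in density; in the text f^l comes from kernel density estimation."""
    def pdf(x: float) -> float:
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf


# Hypothetical densities for N = 3 observation components.
densities = {1: gaussian(0.0, 1.0), 2: gaussian(5.0, 1.0), 3: gaussian(10.0, 1.0)}
uploaded = [4.8, 5.1, 5.3, 4.9]  # M = 4 hypothetical samples near component 2
r = likelihoods(uploaded, densities)
best_component = max(r, key=r.get)  # component l with the largest r_il
```

For large M the product can underflow, so a practical variant might sum log densities instead; that substitution is a design choice and is not stated in the text.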

Step 5.2: Define parameters

where α_j is called the confidence probability, and p_lj′ denotes the probability that the measurer judges observed component j′ to be observed component l; when α_j = 1, the parameter is the logarithm of the maximum possible probability density of the measured data; evidently, for data sets of the same length, a larger value of the parameter indicates more credible data;

Step 5.3: Using this parameter, all crowd intelligence data can be ranked by validity and the top few selected.

6. The crowd intelligence data validity verification method based on maximum likelihood ratio according to claim 5, characterized in that:

In step 2.1, the kernel density function is taken to be the uniform kernel function: h is small enough that the data are approximately uniformly distributed within the bandwidth, so the probability of falling into this region is P_s = P(|x − D_j^k| < h) = f(x)·2h;

In step 2.3, all of the data are considered to have adoption value; the expectation E{r_il} of r_il is calculated as follows:

$$
\begin{aligned}
\alpha_j &= E\{r_{il}\} \\
&= E\left\{\left[f^l(x^t)\right]^M\right\} \\
&= \sum_{x=-\infty}^{\infty} \left\{ \sum_{n_s=1}^{T} \left(\frac{n_s}{T}\right)^M \left(\frac{P_s^{\,n_s}}{n_s!}\right) e^{-P_s} \right\} P_s
\end{aligned}
$$

where f^l(x^t) denotes the probability density of observed component l at the value x^t; l denotes the l-th observed component; t denotes the t-th datum uploaded by the measurer; M denotes the total number of data uploaded by the measurer at one time; ! denotes the factorial; e denotes the base of the natural logarithm; P_s = P(|x − D_j^k| < h) = f(x_i)·2h, where f(x_i) is estimated by kernel density estimation. The above formula contains no variable other than T, so the relationship between the confidence probability α_j and the total number T of data for each observation component is determined.
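The last line of the expression for α_j can be transcribed into Python as a rough sketch. Treating the sum over x as a sum over a finite grid of points x_i, with P_s = f(x_i)·2h at each point, is an assumption made here for illustration. The function names and the grid values are hypothetical.

```python
import math
from typing import Sequence


def inner_sum(p_s: float, T: int, M: int) -> float:
    """Sum over n_s = 1..T of (n_s/T)**M * p_s**n_s / n_s! * exp(-p_s)."""
    total = 0.0
    term = 1.0  # running value of p_s**n_s / n_s!
    for n_s in range(1, T + 1):
        term *= p_s / n_s
        total += (n_s / T) ** M * term
    return total * math.exp(-p_s)


def confidence_probability(density_on_grid: Sequence[float], h: float, T: int, M: int) -> float:
    """Assumed discretization: sum over grid points x_i of inner_sum(P_s) * P_s,
    where P_s = f(x_i) * 2h."""
    alpha = 0.0
    for f_x in density_on_grid:
        p_s = f_x * 2.0 * h
        alpha += inner_sum(p_s, T, M) * p_s
    return alpha


# Hypothetical density estimates f(x_i) on a small grid of points x_i.
grid_density = [0.1, 0.8, 1.6, 0.8, 0.1]
alpha_j = confidence_probability(grid_density, h=0.25, T=50, M=5)
```

The running product in `inner_sum` avoids computing large factorials directly when T is large.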