CN109597923A - Kernel density estimation method, apparatus, storage medium and electronic device - Google Patents
- Publication number
- CN109597923A (application number CN201811308615.5A)
- Authority
- CN
- China
- Prior art keywords
- instance
- points
- point
- data
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The present disclosure relates to a kernel density estimation method, apparatus, storage medium and electronic device. The method comprises: dividing a data set into a plurality of instance point sets, the data set containing a plurality of instance points; extracting instance points from each instance point set as the plurality of sample points required for performing kernel density estimation on the data set; and performing kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, where the weight is the proportion of the number of instance points in the instance point set to the total number of instance points in the data set, and the preset distribution function is a distribution function that takes a preset bandwidth and the data values of the plurality of sample points as input parameters. By extracting sample points group by group, the data set is sampled in a regular manner, yielding sample points that accurately describe the data set, so that the accuracy of kernel density estimation is ensured while the size of the data set is reduced.
Description
Technical Field
The present disclosure relates to the field of data management, and in particular, to a method and an apparatus for estimating a kernel density, a storage medium, and an electronic device.
Background
Kernel density estimation (KDE) fits the density of all instance points (in practice, data values) of a data set using a specified probability density function, called the kernel function. The feature value of each instance point and a bandwidth serve as the parameters of the kernel function, and the fitted probability density curve of the data set is obtained by superposing the resulting functions. The most widely used algorithms in kernel density estimation are Gaussian mixture models (i.e. kernel density estimation models with a Gaussian function as the kernel function) and neighbor-based kernel density estimation, with Gaussian mixture models used mostly in clustering scenarios. Kernel density estimation requires a record or a calculation for every instance point. When a data set contains too many instance points, the model becomes bloated and prediction and evaluation consume a large amount of computing resources. In the related art, when there are too many instance points, random sampling is generally used to extract sample points from the data set, and kernel density estimation is then performed on the extracted sample points. However, randomly extracted sample points are not guaranteed to describe the original data set well, which affects the accuracy of the kernel density estimation.
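For reference, the textbook form of the estimator described above can be written as follows. The symbols n (number of instance points), x_j (data value of the j-th instance point), h (bandwidth) and K (kernel function) do not appear in the original text and are introduced here only for illustration:

$$\hat{f}_h(x) = \frac{1}{n h} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right), \qquad K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2} \ \text{for a Gaussian kernel.}$$

Every evaluation of this sum visits all n instance points, which is the cost that grows with the size of the data set.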
Disclosure of Invention
To overcome the problems in the related art, an object of the present disclosure is to provide a kernel density estimation method, apparatus, storage medium, and electronic device.
In order to achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided a kernel density estimation method, the method including:
dividing a data set into a plurality of example point sets, wherein the data set comprises a plurality of data values, and each data value is used as an example point;
extracting instance points from each of the sets of instance points as a plurality of sample points required for kernel density estimation of the data set;
and performing the kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of the instance points in the instance point set in the total number of the instance points in the data set, and the preset distribution function is a distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters.
Optionally, the dividing the data set into a plurality of example point sets includes:
dividing the data set into the plurality of instance point sets according to a preset sub-packet number, wherein the preset sub-packet number is used for indicating the number of the instance point sets; or,
dividing the data set into the plurality of instance point sets according to a preset instance point number, wherein the preset instance point number is used for indicating the number of instance points in each instance point set.
Optionally, the preset number of packets is m, where m is an integer greater than zero, and dividing the data set into a plurality of example point sets according to the preset number of packets includes:
determining an instance point with a minimum data value and an instance point with a maximum data value in the dataset;
dividing m value intervals equally between the minimum data value and the maximum data value;
and taking all the instance points of the data value in the same value interval as an instance point set to obtain m instance point sets corresponding to the m value intervals as the multiple instance point sets.
Optionally, the equally dividing m value intervals between the minimum data value and the maximum data value includes:
acquiring the ratio of the difference value of the minimum data value and the maximum data value to the m as an interval step length;
and equally dividing the interval between the minimum data value and the maximum data value into the m value intervals according to the interval step length.
Optionally, the preset number of instance points is n, where n is an integer greater than zero, and dividing the data set into a plurality of instance point sets according to the preset number of instance points includes:
sorting all the instance points in the data set according to the size of the data value;
and taking every n instance points in all the ordered instance points as an instance point set to obtain the multiple instance point sets.
Optionally, the extracting instance points from each of the instance point sets to obtain a plurality of sample points required for performing kernel density estimation on the data set includes:
randomly extracting an example point from each example point set as a sample point to obtain the plurality of sample points; or,
acquiring an average value of a plurality of data values corresponding to all instance points in each instance point set;
taking the average value corresponding to each instance point set as the sample point of each instance point set to obtain the plurality of sample points.
Optionally, the preset distribution function is a Gaussian function, the plurality of instance point sets are instance point sets divided according to the preset sub-packet number, and performing the kernel density estimation on the plurality of sample points through the weight of each instance point set and the preset distribution function to generate a kernel density fitting curve corresponding to the data set includes:
calculating the ratio of the number of the example points contained in each example point set to the total number of the example points as the weight of each example point set;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
Optionally, the plurality of instance point sets are instance point sets divided according to the preset instance point number, and performing the kernel density estimation on the plurality of sample points through the weight of each instance point set and the preset distribution function to generate a kernel density fitting curve corresponding to the data set includes:
setting the weight of each instance point set to 1;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
According to a second aspect of embodiments of the present disclosure, there is provided a kernel density estimation apparatus, the apparatus including:
a set dividing module, configured to divide a data set into a plurality of instance point sets, wherein the data set comprises a plurality of data values, and each data value is used as an instance point;
a sample point extraction module, configured to extract an instance point from each instance point set as a plurality of sample points required for performing kernel density estimation on the data set;
a fitting curve generation module, configured to perform the kernel density estimation on the multiple sample points through a weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, where the weight is a proportion of the number of instance points in the instance point set in the total number of instance points in the data set, and the preset distribution function is a distribution function using a preset bandwidth and data values of the multiple sample points as input parameters.
Optionally, the set dividing module is configured to:
dividing the data set into the plurality of instance point sets according to a preset sub-packet number, wherein the preset sub-packet number is used for indicating the number of the instance point sets; or,
dividing the data set into the plurality of instance point sets according to a preset instance point number, wherein the preset instance point number is used for indicating the number of instance points in each instance point set.
Optionally, the preset number of packets is m, where m is an integer greater than zero, and the set partitioning module includes:
a data value determination submodule for determining an instance point having a minimum data value and an instance point having a maximum data value in the data set;
the interval division submodule is used for equally dividing m value intervals between the minimum data value and the maximum data value;
and the first set division submodule is used for taking all the instance points of which the data values are in the same value interval as an instance point set so as to obtain m instance point sets corresponding to the m value intervals as the plurality of instance point sets.
Optionally, the interval division submodule is configured to:
acquiring the ratio of the difference value of the minimum data value and the maximum data value to the m as an interval step length;
and equally dividing the interval between the minimum data value and the maximum data value into the m value intervals according to the interval step length.
Optionally, the preset number of instance points is n, where n is an integer greater than zero, and the set partitioning module includes:
the example point sorting submodule is used for sorting all example points in the data set according to the size of the data value;
and the second set dividing submodule is used for taking every n example points in all the sorted example points as an example point set so as to obtain the plurality of example point sets.
Optionally, the sample point extracting module is configured to:
randomly extracting an example point from each example point set as a sample point to obtain the plurality of sample points; or,
acquiring an average value of a plurality of data values corresponding to all instance points in each instance point set;
taking the average value corresponding to each instance point set as the sample point of each instance point set to obtain the plurality of sample points.
Optionally, the preset distribution function is a Gaussian function, the plurality of instance point sets are instance point sets divided according to the preset sub-packet number, and the fitting curve generating module is configured to:
calculating the ratio of the number of the example points contained in each example point set to the total number of the example points as the weight of each example point set;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
Optionally, the multiple example point sets are multiple example point sets divided according to the preset example point number, and the fitting curve generating module is configured to:
setting the weight of each instance point set to 1;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
According to a third aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the kernel density estimation method provided by the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a memory having a computer program stored thereon;
a processor configured to execute the computer program in the memory to implement the steps of the kernel density estimation method provided in the first aspect of the embodiments of the present disclosure.
By the technical scheme, the data set can be divided into a plurality of example point sets, the data set comprises a plurality of data values, and each data value is used as an example point; extracting instance points from each instance point set as a plurality of sample points required for kernel density estimation of the data set; and performing kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of the instance points in the instance point set in the total number of the instance points in the data set, and the preset distribution function is the distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters. The data set is regularly sampled in a mode of extracting sample points by grouping to obtain the sample points capable of accurately describing the data set, and the accuracy of kernel density estimation is ensured while the scale of the data set is reduced.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a method of kernel density estimation in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of partitioning instance point sets according to the embodiment shown in FIG. 1;
FIG. 3 is a flow diagram illustrating another method of partitioning instance point sets according to the embodiment shown in FIG. 1;
FIG. 4 is a flow chart of a method of obtaining a fitted curve of kernel density according to the embodiment shown in FIG. 1;
FIG. 5 is a flow chart of another method of obtaining a fitted curve of kernel density according to the embodiment shown in FIG. 1;
FIG. 6 is a block diagram illustrating a kernel density estimation apparatus in accordance with an exemplary embodiment;
FIG. 7 is a block diagram of a set partitioning module shown in accordance with the embodiment of FIG. 6;
FIG. 8 is a block diagram of another set partitioning module shown in accordance with the embodiment shown in FIG. 6;
FIG. 9 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow chart illustrating a kernel density estimation method according to an exemplary embodiment. As shown in FIG. 1, the method comprises:
Step 101, dividing a data set into a plurality of instance point sets.
Wherein the data set contains a plurality of data values, each of which serves as an instance point.
Take kernel density estimation (KDE) of the heights of adult males in a community as an example. The data set is the set of height data of the adult males in the community, and each data value is one instance point. For example, if the height of male A is 173 centimeters, the data value 173 is an instance point. Before the data set is sampled, it needs to be divided into a plurality of instance point sets according to the preset sub-packet number or the preset instance point number. In the embodiments of the present disclosure, the steps of the kernel density estimation method shown in FIG. 1 to FIG. 5 are described using kernel density estimation of the heights of adult males in a community as the example.
Illustratively, step 101 may be: dividing the data set into the plurality of instance point sets according to a preset sub-packet number, wherein the preset sub-packet number indicates the number of instance point sets. For example, if the heights of the adult males in the community lie in the interval [150, 230] and the preset sub-packet number is 8, the interval [150, 230] is equally divided into 8 intervals, [150, 160], [161, 170], [171, 180], [181, 190], [191, 200], [201, 210], [211, 220] and [221, 230], and the instance points whose data values fall in the same interval form one instance point set. For example, if 3000 instance points have height values in the interval [150, 160], these 3000 instance points form one instance point set. It will be appreciated that, in most cases, the instance point sets divided in this manner do not contain the same number of instance points.
Alternatively, step 101 may be: dividing the data set into the plurality of instance point sets according to a preset instance point number, wherein the preset instance point number indicates the number of instance points in each instance point set. For example, suppose the number of adult males in the community, i.e., the number of data values in the data set, is 80000, and the preset instance point number is 5000. The height values of the 80000 males are sorted from high to low (or from low to high), and, starting from the first instance point, every 5000 instance points form one instance point set. For example, starting from the first data value, a height of 150, the first instance point set contains the 1st to 5000th instance points, the second instance point set contains the 5001st to 10000th instance points, and so on. With this way of dividing instance point sets, reducing the preset instance point number improves the accuracy of the subsequent sample extraction. However, if the preset instance point number is reduced too far, the number of instance point sets becomes too large, and so does the number of sample points extracted in step 102 below, which defeats the purpose of the present disclosure of reducing the data size. Therefore, before step 101, the preset instance point number needs to be set according to the data size of the data set and domain knowledge of the field to which the data set relates.
Step 102, extracting instance points from each instance point set as a plurality of sample points required for performing kernel density estimation on the data set.
Illustratively, after the plurality of instance point sets are divided, the same number (preferably 1) of instance points can be extracted from each instance point set as the sample points required for the subsequent kernel density estimation. Accordingly, step 102 may be: randomly extracting one instance point from each instance point set as a sample point to obtain the plurality of sample points. However, random extraction may select sample points unevenly. For example, the instance point 170 may be randomly selected from the instance point set corresponding to the interval [161, 170] and the instance point 171 from the set corresponding to the interval [171, 180], while the instance points in [161, 170] are actually concentrated between 160 and 165 and those in [171, 180] between 175 and 180. In that case the randomly selected sample points do not accurately describe the distribution of the instance points in the two intervals. To avoid this, step 102 may instead include: acquiring the average value of the data values of all the instance points in each instance point set; and taking the average value corresponding to each instance point set as the sample point of that instance point set to obtain the plurality of sample points. For example, if an instance point set includes three instance points, 150, 160, and 170, the average value of the instance point set is 160, which is used as the sample point. Since the kernel density estimation method provided by the embodiments of the present disclosure targets data sets with a large data volume, the instance point corresponding to the average value can be assumed to appear in the corresponding instance point set in most cases.
However, in a special case, even if the average value is not included in the example point set, the average value is still used as the sample point corresponding to the example point set, and the subsequent kernel density estimation step is performed.
It should be noted that the above two ways of executing step 102 are simultaneously applicable to the instance point set divided according to the preset instance point number and the instance point set divided according to the preset sub-packet number.
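The two extraction options of step 102 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the disclosure; the function names `sample_random` and `sample_mean`, the optional `seed` argument, and the choice to skip empty instance point sets are assumptions made for the example.

```python
import random
from statistics import mean
from typing import List, Optional


def sample_random(groups: List[List[float]], seed: Optional[int] = None) -> List[float]:
    """Draw one instance point at random from every non-empty instance point set."""
    rng = random.Random(seed)  # seed is an illustrative convenience for repeatability
    return [rng.choice(g) for g in groups if g]


def sample_mean(groups: List[List[float]]) -> List[float]:
    """Use the average data value of every non-empty instance point set as its sample point."""
    return [mean(g) for g in groups if g]


# Two small instance point sets, the first taken from the example in the text
groups = [[150, 160, 170], [171, 175, 180]]
random_points = sample_random(groups, seed=0)
mean_points = sample_mean(groups)
```

The mean-based variant is deterministic and returns 160 for the set containing 150, 160 and 170, matching the example in the text, whereas the random variant may return any member of each set.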
Step 103, performing kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set.
The instance point sets divided according to the preset sub-packet number contain different numbers of instance points, so the sample points extracted from them influence the kernel density estimation to different degrees. The weight of each instance point set therefore needs to be taken into account when performing the kernel density estimation. The weight is the proportion of the number of instance points in the instance point set to the total number of instance points in the data set, and the preset distribution function is a distribution function taking the preset bandwidth and the data values of the plurality of sample points as input parameters.
In summary, the present disclosure can divide a data set into a plurality of instance point sets, where the data set includes a plurality of data values, and each of the data values serves as an instance point; extracting instance points from each instance point set as a plurality of sample points required for kernel density estimation of the data set; and performing kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of the instance points in the instance point set in the total number of the instance points in the data set, and the preset distribution function is the distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters. The data set is regularly sampled in a mode of extracting sample points by grouping to obtain the sample points capable of accurately describing the data set, and the accuracy of kernel density estimation is ensured while the scale of the data set is reduced.
FIG. 2 is a flowchart of an instance point set partitioning method according to the embodiment shown in FIG. 1. As shown in FIG. 2, the preset number of packets is m, where m is an integer greater than zero, and step 101 may include:
Step 1011, determining the instance point with the minimum data value and the instance point with the maximum data value in the data set.
Step 1012, equally dividing the range between the minimum data value and the maximum data value into m value intervals.
Illustratively, this step 1012 may include: acquiring the ratio of the difference value of the minimum data value and the maximum data value to the m as an interval step length; and equally dividing the interval between the minimum data value and the maximum data value into the m value intervals according to the interval step length. The calculation formula (1) for obtaining the interval step length can be expressed as:
$$\mathrm{Margin} = \frac{T_{\max} - T_{\min}}{\mathrm{BinSize}} \tag{1}$$
where Margin is the interval step length, $T_{\max}$ is the maximum data value, $T_{\min}$ is the minimum data value, and BinSize is the preset number of packets m.
Step 1013, taking all the instance points whose data values are in the same value interval as one instance point set, so as to obtain m instance point sets corresponding to the m value intervals as the plurality of instance point sets.
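Steps 1011 to 1013 can be sketched in Python as follows. The helper name `split_equal_width` is hypothetical, and the boundary handling is an assumption: intervals are treated as half-open, with the maximum value clamped into the last interval, which differs slightly from the closed integer intervals listed in the height example (a value of 170 lands in the third interval here).

```python
from typing import List


def split_equal_width(values: List[float], m: int) -> List[List[float]]:
    """Partition values into m equal-width value intervals between min and max.

    Hypothetical helper for illustration; empty intervals yield empty lists.
    """
    if m <= 0:
        raise ValueError("m must be an integer greater than zero")
    if not values:
        return [[] for _ in range(m)]
    t_min, t_max = min(values), max(values)      # step 1011
    margin = (t_max - t_min) / m                 # interval step length, formula (1)
    groups: List[List[float]] = [[] for _ in range(m)]
    for v in values:                             # steps 1012 and 1013
        if margin == 0:
            idx = 0  # all values identical: everything falls in the first interval
        else:
            # half-open intervals; the maximum value is clamped into the last one
            idx = min(int((v - t_min) / margin), m - 1)
        groups[idx].append(v)
    return groups


# Illustrative heights only; with min 150, max 230 and m = 8 the step length is 10
heights = [150, 158, 161, 170, 173, 179, 185, 199, 210, 230]
groups = split_equal_width(heights, 8)
```

As the text notes, the resulting sets generally contain different numbers of instance points, and some intervals may be empty.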
FIG. 3 is a flowchart of another instance point set partitioning method according to the embodiment shown in FIG. 1. As shown in FIG. 3, the preset number of instance points is n, where n is an integer greater than zero, and step 101 may include:
Step 1014, sorting all instance points in the data set according to the size of the data value.
Step 1015, using every n instance points in all the ordered instance points as an instance point set to obtain the multiple instance point sets.
Still taking kernel density estimation of the heights of adult males in a community as an example, note that when the number of data values in the data set is 81000 and the preset number of instance points is 5000, only 1000 instance points remain for the last instance point set when dividing through steps 1014 and 1015. In this case, the 75001st to 81000th instance points are divided into one instance point set, i.e., the remaining 1000 instance points are merged into the preceding set. Although the purpose of this partitioning manner is to keep the number of instance points in each instance point set consistent, in practice the number of instance point sets may be very large, and an inconsistent number of instance points in a single instance point set (i.e., the last one) is not enough to affect the subsequent kernel density estimation step.
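Steps 1014 and 1015, including the handling of a short final group, can be sketched as follows. The helper name `split_fixed_count` is hypothetical, and merging the leftover points into the preceding set is this sketch's reading of the 81000-point example rather than a rule stated elsewhere in the text.

```python
from typing import List


def split_fixed_count(values: List[float], n: int) -> List[List[float]]:
    """Sort values and group every n of them; merge a short tail into the last full group."""
    if n <= 0:
        raise ValueError("n must be an integer greater than zero")
    ordered = sorted(values)                                         # step 1014
    groups = [ordered[i:i + n] for i in range(0, len(ordered), n)]   # step 1015
    # A trailing group with fewer than n points is merged into its predecessor
    if len(groups) > 1 and len(groups[-1]) < n:
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups


data = list(range(81000))          # stand-in for 81000 height values
sets = split_fixed_count(data, 5000)
sizes = [len(s) for s in sets]
```

With 81000 values and n = 5000 this produces 16 sets, the last of which holds 6000 points (the 75001st to 81000th in sorted order).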
FIG. 4 is a flowchart of a method for obtaining a kernel density fitting curve according to the embodiment shown in FIG. 1. As shown in FIG. 4, the preset distribution function is a Gaussian function, the plurality of instance point sets are instance point sets divided according to the preset number of packets, and step 103 may include:
Step 1031, calculating the ratio of the number of instance points included in each instance point set to the total number of instance points as the weight of each instance point set.
Step 1032, taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values, output by the Gaussian function, corresponding to the plurality of sample points.
Step 1033, superposing the plurality of Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
Illustratively, the probability density function (2) corresponding to this step 1033 can be expressed as:
where i is the index of the instance point set, BinSize is the preset number of packets, Count_i is the number of instance points in the ith instance point set, Sum is the total number of instance points, and F(Mean_i, BandWidth) represents the Gaussian function F that takes the data value of the sample point corresponding to the ith instance point set and the preset BandWidth as inputs. The preset bandwidth determines the smoothness of the generated fitting curve, and can be set according to the data volume of the data set and the required curve smoothness.
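Going by the symbol definitions above, steps 1031 to 1033 amount to a weighted sum of Gaussian values, f(x) = sum over i of (Count_i / Sum) * F(x; Mean_i, BandWidth). The sketch below evaluates that sum at a single point x. It assumes that F is the normal density centred on the sample point with standard deviation BandWidth; the function names `gaussian` and `weighted_kde` are hypothetical.

```python
import math
from typing import Sequence


def gaussian(x: float, mean: float, bandwidth: float) -> float:
    """Normal density centred on `mean` with standard deviation `bandwidth`."""
    z = (x - mean) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))


def weighted_kde(
    x: float,
    sample_points: Sequence[float],
    counts: Sequence[int],
    bandwidth: float,
) -> float:
    """Evaluate the weighted superposition of Gaussian values at x."""
    if len(sample_points) != len(counts):
        raise ValueError("one count is required per sample point")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    total = sum(counts)  # Sum: total number of instance points
    return sum(
        (count / total) * gaussian(x, mean, bandwidth)  # weight = Count_i / Sum
        for mean, count in zip(sample_points, counts)
    )
```

Because the weights sum to 1 and each Gaussian integrates to 1, the resulting curve integrates to 1 as well, which is what one expects from a probability density.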
Fig. 5 is a flowchart of another method for obtaining a kernel density fitting curve according to the embodiment shown in fig. 1, where as shown in fig. 5, the multiple instance point sets are multiple instance point sets divided according to the preset number of instance points, and the step 103 may include:
Step 1034, setting the weight of each instance point set to 1.
For example, since the number of instance points included in each instance point set divided according to the preset number of instance points is substantially the same, the weight of each instance point set may be set to the same value (preferably 1) when performing the kernel density estimation.
Step 1035, taking the data values of the sample points and the preset bandwidth as the input of the Gaussian function, so as to obtain a plurality of Gaussian function values corresponding to the sample points output by the Gaussian function.
Step 1036, superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
Illustratively, the probability density function (3) corresponding to this step 1036 can be expressed as:
where i is the index of the instance point set, NumInterval is the number of instance point sets divided according to the preset number of instance points, and F(Mean_i, BandWidth) represents the Gaussian function F that takes the data value of the sample point corresponding to the ith instance point set and the preset BandWidth as inputs.
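With every weight set to 1, steps 1034 to 1036 reduce to a plain superposition of Gaussian values over the NumInterval sample points. The sketch below assumes F is the normal density centred on the sample point with standard deviation BandWidth. The `normalize` flag, which divides the sum by NumInterval so that the curve integrates to 1, is an added assumption and is not stated in the text; the function name `equal_weight_kde` is hypothetical.

```python
import math
from typing import Sequence


def equal_weight_kde(
    x: float,
    sample_points: Sequence[float],
    bandwidth: float,
    normalize: bool = True,
) -> float:
    """Superpose one Gaussian per sample point, each with weight 1."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if not sample_points:
        return 0.0
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(
        math.exp(-0.5 * ((x - mean) / bandwidth) ** 2) / norm
        for mean in sample_points
    )
    # Assumption: dividing by NumInterval turns the raw sum into a density.
    return total / len(sample_points) if normalize else total
```

Whether or not the sum is normalized, the shape of the fitted curve is the same; only its vertical scale changes.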
In summary, the present disclosure can divide a data set into a plurality of instance point sets, where the data set includes a plurality of data values, and each of the data values serves as an instance point; extract instance points from each instance point set as a plurality of sample points required for kernel density estimation of the data set; and perform kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of instance points in the instance point set to the total number of instance points in the data set, and the preset distribution function is a distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters. By extracting sample points group by group, the data set is sampled in a regular manner, yielding sample points that describe the data set accurately; this reduces the scale of the data set while preserving the accuracy of the kernel density estimation.
Fig. 6 is a block diagram illustrating a kernel density estimation apparatus according to an exemplary embodiment, as shown in fig. 6, the apparatus 600 includes:
a set dividing module 610, configured to divide a data set into a plurality of instance point sets, where the data set includes a plurality of data values, and each data value serves as an instance point;
a sample point extracting module 620, configured to extract an instance point from each instance point set as a plurality of sample points required for performing kernel density estimation on the data set;
a fitting curve generating module 630, configured to perform kernel density estimation on the multiple sample points through a weight of each instance point set and a preset distribution function, so as to generate a kernel density fitting curve corresponding to the data set, where the weight is a proportion of the number of instance points in the instance point set in the total number of instance points in the data set, and the preset distribution function is a distribution function using a preset bandwidth and data values of the multiple sample points as input parameters.
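The division of labour among the three modules can be sketched as a small Python class. All names here (`KernelDensityApparatus`, `divide_sorted_chunks`, `extract_means`) are hypothetical, and the Gaussian is assumed to be the normal density with standard deviation equal to the preset bandwidth; the sketch only illustrates how the roles of modules 610, 620 and 630 fit together.

```python
import math
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, List, Sequence

PointSets = List[List[float]]


def divide_sorted_chunks(values: Sequence[float], n: int = 3) -> PointSets:
    """Hypothetical divider: sort, then cut into runs of n instance points."""
    ordered = sorted(values)
    return [ordered[i:i + n] for i in range(0, len(ordered), n)]


def extract_means(point_sets: PointSets) -> List[float]:
    """Hypothetical extractor: one sample point (the mean) per set."""
    return [fmean(s) for s in point_sets]


@dataclass
class KernelDensityApparatus:
    divide: Callable[[Sequence[float]], PointSets]  # role of module 610
    extract: Callable[[PointSets], List[float]]     # role of module 620
    bandwidth: float                                # preset bandwidth

    def fit(self, data: Sequence[float]) -> Callable[[float], float]:
        """Role of module 630: build the weighted Gaussian fitting curve."""
        point_sets = [s for s in self.divide(data) if s]  # drop empty sets
        samples = self.extract(point_sets)
        total = sum(len(s) for s in point_sets)
        weights = [len(s) / total for s in point_sets]
        h = self.bandwidth

        def curve(x: float) -> float:
            norm = h * math.sqrt(2.0 * math.pi)
            return sum(
                w * math.exp(-0.5 * ((x - m) / h) ** 2) / norm
                for m, w in zip(samples, weights)
            )

        return curve
```

Passing the divider and extractor in as callables mirrors the optional variants described for modules 610 and 620, so either partitioning manner or either extraction manner can be swapped in without touching the curve-fitting step.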
Optionally, the set dividing module 610 is configured to:
dividing the data set into the plurality of instance point sets according to a preset number of packets, wherein the preset number of packets is used for indicating the number of instance point sets; or,
dividing the data set into the plurality of instance point sets according to a preset number of instance points, wherein the preset number of instance points is used for indicating the number of instance points in each instance point set.
Fig. 7 is a block diagram of a set partitioning module according to the embodiment shown in fig. 6, where, as shown in fig. 7, the preset number of packets is m, where m is an integer greater than zero, and the set partitioning module 610 includes:
a data value determination submodule 611 for determining an instance point with a minimum data value and an instance point with a maximum data value in the data set;
an interval division submodule 612, configured to equally divide m value intervals between the minimum data value and the maximum data value;
a first set dividing submodule 613, configured to take all instance points whose data values fall in the same value interval as one instance point set, so as to obtain m instance point sets corresponding to the m value intervals, and take the m instance point sets as the multiple instance point sets.
Optionally, the interval division sub-module 612 is configured to:
acquiring the ratio of the difference between the maximum data value and the minimum data value to m as the interval step length;
and equally dividing the interval between the minimum data value and the maximum data value into the m value intervals according to the interval step length.
Fig. 8 is a block diagram of another set partitioning module according to the embodiment shown in fig. 6. As shown in fig. 8, the preset number of instance points is n, where n is an integer greater than zero, and the set partitioning module 610 includes:
an instance point sorting sub-module 634 for sorting all instance points in the data set according to the size of the data value;
the second set dividing sub-module 635 is configured to use every n instance points in all the ordered instance points as an instance point set to obtain the multiple instance point sets.
Optionally, the sample point extracting module 620 is configured to:
randomly extracting an instance point from each instance point set as a sample point to obtain the plurality of sample points; or,
acquiring the average value of a plurality of data values corresponding to all the instance points in each instance point set;
and taking the average value corresponding to each instance point set as the sample point of each instance point set to obtain the plurality of sample points.
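The two extraction options can be sketched as follows. The function name `extract_sample_points` and its parameters are hypothetical, and skipping empty sets is an assumption made so that the mean is always defined; the text does not discuss empty sets.

```python
import random
from statistics import fmean
from typing import List, Optional, Sequence


def extract_sample_points(
    point_sets: Sequence[Sequence[float]],
    use_mean: bool = True,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sample point per non-empty instance point set."""
    rng = rng or random.Random()
    samples: List[float] = []
    for point_set in point_sets:
        if not point_set:
            # Assumption: an empty set yields no sample point. Its weight
            # would be zero anyway, so the caller should drop its count too.
            continue
        if use_mean:
            samples.append(fmean(point_set))       # average of all data values
        else:
            samples.append(rng.choice(point_set))  # one randomly drawn instance point
    return samples
```

Passing a seeded `random.Random` instance makes the random variant reproducible, which is convenient when comparing fitted curves across runs.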
Optionally, the preset distribution function is a Gaussian function, the multiple instance point sets are those divided according to the preset number of packets, and the fitting curve generating module 630 is configured to:
calculating the ratio of the number of instance points contained in each instance point set to the total number of instance points, and taking the ratio as the weight of each instance point set;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
Optionally, the multiple instance point sets are those divided according to the preset number of instance points, and the fitting curve generating module 630 is configured to:
setting the weight of each instance point set to 1;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
In summary, the present disclosure can divide a data set into a plurality of instance point sets, where the data set includes a plurality of data values, and each of the data values serves as an instance point; extract instance points from each instance point set as a plurality of sample points required for kernel density estimation of the data set; and perform kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of instance points in the instance point set to the total number of instance points in the data set, and the preset distribution function is a distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters. By extracting sample points group by group, the data set is sampled in a regular manner, yielding sample points that describe the data set accurately; this reduces the scale of the data set while preserving the accuracy of the kernel density estimation.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment. As shown in fig. 9, the electronic device 900 may include: a processor 901, a memory 902, multimedia components 903, input/output (I/O) interfaces 904, and communications components 905.
The processor 901 is configured to control the overall operation of the electronic device 900, so as to complete all or part of the steps in the above-mentioned kernel density estimation method. The memory 902 is used to store various types of data to support operation of the electronic device 900, such as instructions for any application or method operating on the electronic device 900 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and the like. The memory 902 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 903 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 902 or transmitted through the communication component 905. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 904 provides an interface between the processor 901 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 905 is used for wired or wireless communication between the electronic device 900 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, so the corresponding communication component 905 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described kernel density estimation method.
In another exemplary embodiment, a computer-readable storage medium comprising program instructions is also provided, such as the memory 902 comprising program instructions, where the program instructions are executable by the processor 901 of the electronic device 900 to perform the above-described kernel density estimation method.
Preferred embodiments of the present disclosure are described in detail above with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and other embodiments of the present disclosure may be easily conceived by those skilled in the art within the technical spirit of the present disclosure after considering the description and practicing the present disclosure, and all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the disclosure. Likewise, the various embodiments of the present disclosure may be combined with one another in any way, and such combinations should also be regarded as content disclosed herein as long as they do not depart from the idea of the present disclosure. The present disclosure is not limited to the precise structures that have been described above, and the scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of kernel density estimation, the method comprising:
dividing a data set into a plurality of instance point sets, wherein the data set comprises a plurality of data values, and each data value is used as an instance point;
extracting instance points from each of the sets of instance points as a plurality of sample points required for kernel density estimation of the data set;
and performing the kernel density estimation on the plurality of sample points through the weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, wherein the weight is the proportion of the number of the instance points in the instance point set in the total number of the instance points in the data set, and the preset distribution function is a distribution function taking a preset bandwidth and the data values of the plurality of sample points as input parameters.
2. The method of claim 1, wherein the dividing the data set into a plurality of sets of instance points comprises:
dividing the data set into the plurality of instance point sets according to a preset number of packets, wherein the preset number of packets is used for indicating the number of instance point sets; or,
dividing the data set into the plurality of instance point sets according to a preset instance point number, wherein the preset instance point number is used for indicating the number of instance points in each instance point set.
3. The method of claim 2, wherein the predetermined number of packets is m, where m is an integer greater than zero, and wherein the dividing the data set into a plurality of instance point sets comprises:
determining an instance point with a minimum data value and an instance point with a maximum data value in the dataset;
dividing m value intervals equally between the minimum data value and the maximum data value;
and taking all instance points whose data values fall in the same value interval as one instance point set, so as to obtain m instance point sets corresponding to the m value intervals as the multiple instance point sets.
4. The method of claim 3, wherein equally dividing m intervals between the minimum data value and the maximum data value comprises:
acquiring the ratio of the difference between the maximum data value and the minimum data value to m as the interval step length;
and equally dividing the interval between the minimum data value and the maximum data value into the m value intervals according to the interval step length.
5. The method of claim 2, wherein the preset number of instance points is n, where n is an integer greater than zero, and wherein dividing the data set into a plurality of sets of instance points comprises:
sorting all the instance points in the data set according to the size of the data value;
and taking every n instance points in all the ordered instance points as an instance point set to obtain the multiple instance point sets.
6. The method of claim 1, wherein said extracting instance points from each of said sets of instance points to obtain a plurality of sample points required for kernel density estimation of said data set comprises:
randomly extracting an instance point from each instance point set as a sample point to obtain the plurality of sample points; or,
acquiring an average value of a plurality of data values corresponding to all instance points in each instance point set;
taking the average value corresponding to each instance point set as the sample point of each instance point set to obtain the plurality of sample points.
7. The method of claim 2, wherein the preset distribution function is a Gaussian function, the plurality of instance point sets are those divided according to the preset number of packets, and the performing the kernel density estimation on the plurality of sample points through the weight of each instance point set and the preset distribution function to generate a kernel density fitting curve corresponding to the data set comprises:
calculating the ratio of the number of instance points contained in each instance point set to the total number of instance points as the weight of each instance point set;
taking the data values of the plurality of sample points and the preset bandwidth as the input of the Gaussian function to obtain a plurality of Gaussian function values corresponding to the plurality of sample points output by the Gaussian function;
and superposing the multiple Gaussian function values based on the weight of each instance point set to obtain a kernel density fitting curve corresponding to the data set.
8. A kernel density estimation apparatus, characterized in that the apparatus comprises:
a set dividing module, configured to divide a data set into a plurality of instance point sets, where the data set comprises a plurality of data values, and each data value serves as an instance point;
a sample point extraction module, configured to extract an instance point from each instance point set as a plurality of sample points required for performing kernel density estimation on the data set;
a fitting curve generation module, configured to perform the kernel density estimation on the multiple sample points through a weight of each instance point set and a preset distribution function to generate a kernel density fitting curve corresponding to the data set, where the weight is a proportion of the number of instance points in the instance point set in the total number of instance points in the data set, and the preset distribution function is a distribution function using a preset bandwidth and data values of the multiple sample points as input parameters.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811308615.5A CN109597923A (en) | 2018-11-05 | 2018-11-05 | Density Estimator method, apparatus, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811308615.5A CN109597923A (en) | 2018-11-05 | 2018-11-05 | Density Estimator method, apparatus, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109597923A (en) | 2019-04-09 |
Family
ID=65957558
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811308615.5A Pending CN109597923A (en) | 2018-11-05 | 2018-11-05 | Density Estimator method, apparatus, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109597923A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126713A (en) * | 2019-12-31 | 2020-05-08 | 方正国际软件(北京)有限公司 | Space-time hot spot prediction method and device based on bayonet data and controller |
CN112598204A (en) * | 2019-09-17 | 2021-04-02 | 北京京东乾石科技有限公司 | Method and device for determining failure rate interval of observation equipment |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598204A (en) * | 2019-09-17 | 2021-04-02 | 北京京东乾石科技有限公司 | Method and device for determining failure rate interval of observation equipment |
CN111126713A (en) * | 2019-12-31 | 2020-05-08 | 方正国际软件(北京)有限公司 | Space-time hot spot prediction method and device based on bayonet data and controller |
CN111126713B (en) * | 2019-12-31 | 2023-05-09 | 方正国际软件(北京)有限公司 | Space-time hot spot prediction method and device based on bayonet data and controller |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20120137308A1 (en) | Adaptive tree structure for visualizing data | |
JP6869347B2 (en) | Risk control event automatic processing method and equipment | |
WO2019169704A1 (en) | Data classification method, apparatus, device and computer readable storage medium | |
CN108875797B (en) | Method for determining image similarity, photo album management method and related equipment | |
EP3582138A1 (en) | Pedestrian search method and apparatus | |
CN109685092B (en) | Clustering method, equipment, storage medium and device based on big data | |
CN109492531B (en) | Face image key point extraction method and device, storage medium and electronic equipment | |
CN109344396A (en) | Text recognition method, device and computer equipment | |
CN107908561B (en) | Virtual reality software performance test method and system | |
WO2024113932A1 (en) | Model optimization method and apparatus, and device and storage medium | |
CN111783812A (en) | Method and device for identifying forbidden images and computer readable storage medium | |
KR20200069848A (en) | Method for computing watershed boundary based on digital elevation model, apparatus, and recording medium thereof | |
US10444062B2 (en) | Measuring and diagnosing noise in an urban environment | |
CN112836806B (en) | Data format adjustment method, device, computer equipment and storage medium | |
CN109726821B (en) | Data equalization method and device, computer readable storage medium and electronic equipment | |
CN112685799B (en) | Device fingerprint generation method and device, electronic device and computer readable medium | |
CN109597923A (en) | Density Estimator method, apparatus, storage medium and electronic equipment | |
CN108255977B (en) | Relationship prediction method, relationship prediction device, computer readable storage medium and electronic equipment | |
CN109697224B (en) | Bill message processing method, device and storage medium | |
CN110796115B (en) | Image detection method and device, electronic equipment and readable storage medium | |
CN106934015A (en) | Address date treating method and apparatus | |
CN108133234B (en) | Sparse subset selection algorithm-based community detection method, device and equipment | |
CN110020166B (en) | Data analysis method and related equipment | |
WO2020113563A1 (en) | Facial image quality evaluation method, apparatus and device, and storage medium | |
CN112396100B (en) | Optimization method, system and related device for fine-grained classification model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||