CN114860811B

CN114860811B - Method and device for searching median approximation value of data set and computer equipment

Info

Publication number: CN114860811B
Application number: CN202210572598.6A
Authority: CN
Inventors: 李肯立; 李芬芳; 罗辉章; 阳王东; 唐卓; 刘楚波
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2024-09-17
Anticipated expiration: 2042-05-25
Also published as: CN114860811A

Abstract

The application relates to a median approximation searching method, a median approximation searching device and computer equipment for a data set. The method comprises the following steps: acquiring the current array to be processed and the current state of a predefined approximate median lookup event; determining the mean value and standard deviation of the array to be processed according to the current state of the approximate median lookup event; constructing an accumulated frequency distribution table; searching out, based on the accumulated frequency distribution table and the current state of the event, the data packet in which the median approximation value is located, and determining the median approximation value of the array to be processed; transmitting the mean value, the standard deviation, the median approximation value and the position identification of the data packet in which the median approximation value is located to the next array; switching the first state into the second state if the current state of the event is the first state; and returning to the step of acquiring the current array to be processed and the current state of the predefined approximate median lookup event until the median approximation values of all arrays in the data set are found. The method can improve the approximate median searching efficiency.

Description

Method and device for searching median approximation value of data set and computer equipment

Technical Field

The present application relates to the field of data searching technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for searching a median approximation value of a data set.

Background

The median is generally a high-value statistical indicator of a data set and is often more representative of the average level of the data set than the mean. Median lookup aims to find the exact median value of an unordered array. A fast median searching method can solve technical problems in many fields. For example, in a rainfall early-warning scenario, people usually take the sensor data set formed by the data collected by weather rainfall sensors, find the approximate range in which the median of that data set lies, determine an approximate median value, and then compare the approximate median value with a preset early-warning threshold so as to make a rainfall early-warning decision. However, because the number of weather rainfall sensors is large and the sensors collect data continuously, the collected sensor data can be massive, which undoubtedly increases the difficulty of finding the median and making early-warning decisions. Therefore, it is of great significance to provide a median lookup method with fast lookup capability.

At present, the commonly used median searching methods include the full-sorting searching method, the partial-sorting searching method, the random selection algorithm and the approximate median searching algorithm. The full-sorting method is mostly realized with a pivot selection strategy; the partial-sorting method sorts only part of the elements, using strategies such as dimension conversion and forgetting selection; and the approximate median searching algorithm takes the median based on a triple adjustment algorithm.

However, in the conventional median searching method, a large amount of comparison operation and exchange operation between data exist in the sorting process, and the operations result in excessive time overhead, so that the median searching efficiency is reduced.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a median approximation finding method, apparatus, computer device, computer readable storage medium, and computer program product for a data set that can improve median finding efficiency.

In a first aspect, the present application provides a method for searching for a median approximation of a data set, where the data set includes a plurality of arrays, and the plurality of arrays characterize a data distribution using a same mean and standard deviation. The method comprises the following steps:

Acquiring the current array to be processed and the current state of a predefined approximate median lookup event;

Determining the mean value and standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event;

Constructing an accumulated frequency distribution table according to the mean value and the standard deviation, wherein the accumulated frequency distribution table comprises a plurality of data packets which are obtained by dividing an array to be processed;

Based on the accumulated frequency distribution table and the current state of the predefined approximate median lookup event, searching out a data packet in which a median approximation value of the array to be processed is located, determining the median approximation value of the array to be processed, transmitting the mean value, the standard deviation, the median approximation value and a position identifier of the data packet in which the median approximation value is located to the next array, and if the current state of the predefined approximate median lookup event is a first state, switching the first state into a second state, wherein the position identifier is used for representing the position of the data packet in the array after data division is completed;

and returning to the step of acquiring the current array to be processed and the current state of the predefined approximate median lookup event until the median approximation values of all arrays in the data set are found.

In one embodiment, determining the mean and standard deviation of the array to be processed based on the current state of the predefined approximate median lookup event comprises:

When the current state of the predefined approximate median searching event is the first state, calculating the mean value and standard deviation of the array to be processed;

when the current state of the predefined approximate median lookup event is the second state, the received mean value and standard deviation of the previous array are determined as the mean value and standard deviation of the array to be processed.

In one embodiment, searching the data packet in which the median approximation of the array to be processed is located based on the cumulative frequency distribution table and the current state of the predefined approximate median search event, and determining the median approximation of the array to be processed includes:

when the current state of the predefined approximate median searching event is the first state, searching a data packet in which the median approximation value of the array to be processed is located according to the accumulated frequency distribution table, and determining the median approximation value of the array to be processed;

When the current state of the predefined approximate median searching event is the second state, searching out the data packet in which the median approximation value of the array to be processed is located according to the accumulated frequency distribution table, acquiring a first data packet position identifier, and, when the first data packet position identifier is consistent with a second data packet position identifier, determining the received median approximation value of the previous array as the median approximation value of the array to be processed;

The data packet position identifiers are used for representing the positions of data packets in an array after data division is completed; the first data packet position identifier is the position identifier of the data packet in which the median approximation value of the array to be processed is located, and the second data packet position identifier is the received position identifier of the data packet in which the median approximation value of the previous array is located.

In one embodiment, after obtaining the first data packet location identifier, the method further includes:

If the first data packet position identifier is inconsistent with the second data packet position identifier, judging whether the absolute value of the difference value between the first data packet position identifier and the second data packet position identifier meets a preset error;

if the absolute value of the difference value meets the preset error, determining a median approximation value according to the data packet in which the median approximation value of the array to be processed is located;

if the absolute value of the difference value does not meet the preset error, switching the current state of the predefined approximate median searching event to a first state, and returning to the step of determining the average value and the standard deviation of the array to be processed according to the current state of the predefined approximate median searching event.

In one embodiment, constructing the cumulative frequency distribution table from the mean and standard deviation comprises:

determining a plurality of data dividing intervals according to the mean value and the standard deviation;

dividing an array to be processed into a plurality of data packets according to a plurality of data dividing intervals;

Counting the cumulative frequency of each data packet;

an accumulated frequency distribution table is constructed based on each data packet and the accumulated frequency of each data packet.

In one embodiment, counting the cumulative frequency of each data packet includes:

counting the number of data distributed in each data packet to obtain the frequency number of each data packet;

Determining the relative frequency of each data packet based on the frequency number of each data packet;

And carrying out upward accumulated summation on the frequencies of the data packets to obtain the accumulated frequencies of the data packets.

In a second aspect, the present application further provides a median approximation finding apparatus for a data set, where the data set includes a plurality of arrays, and the plurality of arrays characterize a data distribution using a same mean and standard deviation. The device comprises:

the data acquisition module is used for acquiring the current array to be processed and the current state of a predefined approximate median lookup event;

the mean and standard deviation determining module is used for determining the mean value and standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event;

the accumulated frequency distribution table construction module is used for constructing an accumulated frequency distribution table according to the mean value and the standard deviation, wherein the accumulated frequency distribution table comprises a plurality of data packets obtained by dividing the array to be processed;

The median approximation value determining module is used for searching a data packet in which a median approximation value of the array to be processed is located based on the accumulated frequency distribution table and the current state of the predefined approximation median lookup event, determining the median approximation value of the array to be processed, transmitting the mean value, the standard deviation, the median approximation value and the position identification of the data packet in which the median approximation value is located to the next array, and switching the first state into the second state if the current state of the predefined approximation median lookup event is the first state, wherein the position identification is used for representing the position of the data packet in the array after the data division is completed;

and the loop processing module is used for controlling the data acquisition module to perform the operation of acquiring the current array to be processed and the current state of the predefined approximate median lookup event until the median approximation values of all arrays in the data set are found.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the steps of the method described above.

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.

In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the method described above.

According to the method, the device, the computer equipment, the storage medium and the computer program product for searching the median approximation value of the data set, a two-state mechanism is introduced. The mean value and the standard deviation of the array to be processed are determined according to the current state of the predefined approximate median lookup event, and an accumulated frequency distribution table is then constructed. The data packet in which the median approximation value of the array to be processed is located is searched out through the accumulated frequency distribution table and the current state of the predefined approximate median lookup event, and the median approximation value is determined. The median approximation value of the array to be processed can therefore be found without any sorting operation, which simplifies the searching process. After the median approximation value of the current array to be processed is found, the mean value, the standard deviation, the median approximation value and the position identification of the data packet in which the median approximation value is located are transmitted to the next array, which facilitates the median search of the next array, reduces comparison and exchange operations between data, and saves time overhead. In summary, the method can improve the approximate median searching efficiency.

Drawings

FIG. 1 is a diagram of an application environment for a median approximation lookup method for a data set in one embodiment;

FIG. 2 is a flow diagram of a method of median approximation lookup for a data set in one embodiment;

FIG. 3 is a detailed flow chart of a method of finding median approximations of data sets in another embodiment;

FIG. 4 is an algorithm diagram of a median approximation finding method for a data set in another embodiment;

FIG. 5 is a block diagram of a median approximation lookup device for a data set in one embodiment;

FIG. 6 is a block diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The method for searching the median approximation value of the data set provided by the embodiment of the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. Specifically, a worker may upload a data set to be processed to the data storage system in advance, where the data set includes a plurality of arrays that characterize data distribution using the same mean and standard deviation. Then, the worker sends a median approximation searching message to the server 104 through the terminal 102. In response to the message, the server 104 acquires the current array to be processed from the data storage system and acquires the current state of a predefined approximate median lookup event; determines the mean value and standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event; constructs an accumulated frequency distribution table according to the mean value and standard deviation; searches out the data packet in which the median approximation of the array to be processed is located based on the accumulated frequency distribution table and the current state of the predefined approximate median lookup event, and determines the median approximation of the array to be processed; transmits the mean value, standard deviation, median approximation and the position identification of the data packet in which the median approximation is located to the next array; switches the first state to the second state if the current state of the predefined approximate median lookup event is the first state; and returns to the step of acquiring the current array to be processed and the current state of the predefined approximate median lookup event until the median approximations of all arrays in the data set are found. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart computers, Internet of Things devices, smart televisions, smart air conditioners, portable wearable devices and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.

In one embodiment, as shown in fig. 2, a median approximation searching method for a data set is provided, where the data set includes a plurality of arrays, and the plurality of arrays characterize a data distribution by using the same mean and standard deviation, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:

Step 100, obtaining the current array to be processed and the current state of the predefined approximate median lookup event.

In practical application, a median approximation (hereinafter also referred to as an approximate median) is to be found for each array in a data set that includes a plurality of arrays. The arrays of the data set have similar data distributions, i.e., the data distribution of each array can be measured with the same mean value and standard deviation. The array to be processed is the array whose approximate median is being searched for at the current time; in general, the data in the array to be processed is unordered at this point. In this embodiment, drawing on the principle of the finite state machine, an approximate median lookup event is predefined for finding the median approximation of an array. The event has a finite number of states, and different states carry different meanings, so that the specified approximate median searching task can be completed under different states according to actual needs. In this embodiment, two states, state 0 and state 1, are set for the approximate median lookup event, and the event performs the approximate median lookup according to different lookup logic in state 0 and in state 1. In implementation, the first state is state 0, the second state is state 1, and the initial state of the approximate median lookup event is state 0, waiting for the arrival of a new array. It will be appreciated that in other embodiments the first state may be defined as state 1 and the second state as state 0, without limitation.

Step 200, determining the mean value and standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event.

The mean and the median are equal when the data distribution strictly follows a normal distribution. In practice, however, data does not always follow a normal distribution, so the median of an array cannot simply be replaced by its mean. According to research, when the data distribution is positively skewed, M > Me > M0, i.e., the mean is greater than the median and the median is greater than the mode; when the data distribution is negatively skewed, M < Me < M0, i.e., the mean is smaller than the median and the median is smaller than the mode. Inspired by these characteristics of the data distribution, in this embodiment the mean (denoted μ) and the standard deviation (denoted σ) are used to locate the median.

In practical application, different approximate median lookup procedures are defined for different states of the approximate median lookup event. After the current state of the current array to be processed and the current state of the approximate median searching event are obtained, the corresponding average value and standard deviation processing flow is determined according to the state of the approximate median searching event, and then the average value and standard deviation of the array to be processed are determined.

Step 300, constructing an accumulated frequency distribution table according to the mean value and the standard deviation, wherein the accumulated frequency distribution table comprises a plurality of data packets obtained by dividing the array to be processed.

The cumulative frequency distribution table (Cumulative Frequency Table, abbreviated as CFT), also called a cumulative relative frequency distribution table or cumulative frequency table, is obtained by cumulatively summing the relative frequencies of the groups from bottom to top or from top to bottom and presenting the resulting cumulative frequency of each group in the form of a statistical table. For example, if the data packets and their relative frequencies correspond as [10, 30) to 0.2, [30, 40) to 0.3, and [40, 50) to 0.5, then cumulatively summing the frequencies upward gives [10, 30) to 0.2, [30, 40) to 0.5, and [40, 50) to 1. The cumulative frequency distribution table includes each data packet together with its frequency count, relative frequency, corresponding cumulative frequency, and the like.
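As a minimal sketch of the upward cumulative summation described above, the following Python snippet turns per-group frequency counts into relative and cumulative frequencies. The function name `cumulative_frequencies` and the sample counts are illustrative assumptions, chosen only so that the relative frequencies reproduce the 0.2, 0.3, 0.5 example.

```python
from itertools import accumulate


def cumulative_frequencies(counts):
    """Return (relative, cumulative) frequencies for per-group counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    relative = [c / total for c in counts]
    # Accumulate the integer counts first, then divide, to avoid float drift.
    cumulative = [c / total for c in accumulate(counts)]
    return relative, cumulative


# Hypothetical counts giving relative frequencies 0.2, 0.3 and 0.5.
relative, cumulative = cumulative_frequencies([2, 3, 5])
print(relative, cumulative)
```

The last cumulative frequency is always 1, which is what makes the later search for the group crossing 0.5 well defined.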

Step 400, based on the accumulated frequency distribution table and the current state of the predefined approximate median lookup event, finding out the data packet in which the median approximation of the array to be processed is located, determining the median approximation of the array to be processed, transmitting the mean value, standard deviation, median approximation and the position identification of the data packet in which the median approximation is located to the next array, and returning to step 100 until the median approximations of all arrays in the data set are found.

In this embodiment, the location identifier is used to characterize the location of the data packet in the array after the data division is completed. Once the cumulative frequency distribution table has been constructed, the median lies in the first data packet whose cumulative frequency exceeds 50%, and this group can be regarded as the data packet in which the median is located. An approximation of the median can then be obtained by taking half of the sum of the upper and lower boundaries of that group. In practical application, because the specific values within a group's interval are not known and the median is not required to be completely accurate, this embodiment searches for a median approximation instead of the exact median. In specific implementation, the cumulative frequency of each data packet in the cumulative frequency distribution table may be read one by one; if the cumulative frequency of the currently read data packet exceeds 0.5, the median approximation of the array to be processed is considered to lie in that data packet, and the median approximation of the array to be processed is then determined based on the data packet in which it is located and the current state of the approximate median lookup event. At the same time, the mean value, the standard deviation, the position identification of the group in which the median approximation is located and the median approximation of the current array to be processed are transmitted to the next array.

It should be noted that, if the current state of the approximate median lookup event is the first state, i.e., state 0, state 0 needs to be switched to state 1 after the median approximation lookup of the current array to be processed is completed. If the current state of the approximate median lookup event is the second state, i.e., state 1, no state switching is performed after the median approximation lookup of the current array to be processed is completed, and the current state of the approximate median lookup event remains state 1. Then, the flow returns to step 100 and processing of the next array begins as described above, until the median approximations of all arrays in the data set are found. It can be understood that, when the current state of the approximate median lookup event is state 1, state 1 may be switched back to state 0 during the lookup, in which case the median approximation of the current array to be processed is found according to the lookup flow of state 0.
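The two-state flow of steps 100 to 400, together with the position comparison against a preset error described in the Disclosure, can be sketched roughly as follows. This is an illustrative reading rather than the patented implementation: the names `approximate_medians`, `midpoint` and `tolerance` are hypothetical, the population standard deviation is assumed, "meets the preset error" is read as "absolute difference not greater than `tolerance`", the 0.5 threshold is treated as "reached", every array is assumed non-empty, and an open-ended tail group falls back to its finite boundary.

```python
import math
from bisect import bisect_left


def midpoint(pos, edges):
    """Half the sum of the boundaries of group `pos` (1-based).

    Assumption: an open-ended tail group falls back to its finite boundary.
    """
    if pos == 1:
        return edges[0]
    if pos == len(edges) + 1:
        return edges[-1]
    return (edges[pos - 2] + edges[pos - 1]) / 2


def approximate_medians(arrays, tolerance=1):
    """Return one median approximation per non-empty array."""
    state = 0  # state 0: compute mu/sigma; state 1: reuse them
    mu = sigma = prev_median = prev_pos = None
    results = []
    for values in arrays:
        while True:
            if state == 0:
                n = len(values)
                mu = sum(values) / n
                # Population standard deviation (an assumption).
                sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
            # Seven finite edges delimit eight groups, two of them open tails.
            edges = [mu + k * sigma for k in range(-3, 4)]
            counts = [0] * (len(edges) + 1)
            for x in values:
                counts[bisect_left(edges, x)] += 1
            running = 0
            for pos, count in enumerate(counts, start=1):
                running += count
                if 2 * running >= len(values):  # cumulative frequency >= 0.5
                    break
            if state == 0:
                median = midpoint(pos, edges)
                state = 1  # switch the first state to the second state
                break
            if pos == prev_pos:
                median = prev_median  # same group: reuse the previous result
                break
            if abs(pos - prev_pos) <= tolerance:
                median = midpoint(pos, edges)
                break
            state = 0  # too far from the previous group: redo this array
        prev_median, prev_pos = median, pos
        results.append(median)
    return results


print(approximate_medians([[2, 4, 4, 4, 5, 5, 7, 9], [3, 4, 4, 5, 5, 5, 6, 8]]))
```

In this sketch the second array reuses the statistics and the result of the first one, because its median falls into the same group; an array whose median group lies further away than `tolerance` triggers a fresh computation in state 0.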

In the above method for searching the median approximation value of the data set, a two-state mechanism is introduced. The mean value and the standard deviation of the array to be processed are determined according to the current state of the predefined approximate median lookup event, and the cumulative frequency distribution table is then constructed. The data packet in which the median approximation value of the array to be processed is located is searched out through the cumulative frequency distribution table and the current state of the predefined approximate median lookup event, and the median approximation value is determined. The approximate median of the array can therefore be found without sorting the array to be processed, which simplifies the searching process. After the median approximation value of the current array to be processed is found, the mean value, the standard deviation, the median approximation value and the position identification of the data packet in which the median approximation value is located are transmitted to the next array, which facilitates the median search of the next array, reduces comparison and exchange operations between data, and saves time overhead. In summary, the method can improve the approximate median searching efficiency.

As shown in fig. 3, in one embodiment, step 200 includes: step 220, when the current state of the predefined approximate median lookup event is the first state, calculating the mean and standard deviation of the array to be processed; and step 240, when the current state of the predefined approximate median lookup event is the second state, determining the received mean and standard deviation of the previous array as the mean and standard deviation of the array to be processed.

In implementation, since the arrays have similar data distributions, the same mean value and standard deviation can be used to characterize them. Therefore, when the current state of the predefined approximate median lookup event is state 0, the mean value and standard deviation of the array to be processed are calculated; when the current state of the predefined approximate median lookup event is state 1, the received mean value and standard deviation of the previous array are determined as the mean value and standard deviation of the array to be processed. In this way, the calculation of the mean value and standard deviation can be omitted when the approximate median lookup event is in state 1, which reduces the time overhead.
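A small sketch of this state-dependent choice is given below. The function name `mean_and_std` and the `inherited` parameter are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import math


def mean_and_std(values, state, inherited=None):
    """State 0: compute (mu, sigma); state 1: reuse the received pair."""
    if state == 1:
        if inherited is None:
            raise ValueError("state 1 needs statistics from the previous array")
        return inherited  # no traversal of `values` is needed
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (an assumption).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return mu, sigma


stats = mean_and_std([2, 4, 4, 4, 5, 5, 7, 9], state=0)
reused = mean_and_std([3, 4, 4, 5, 5, 5, 6, 8], state=1, inherited=stats)
print(stats, reused)
```

The saving in state 1 comes from skipping the two sums over the array entirely.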

As shown in fig. 3, in one embodiment, step 300 includes: step 320, determining a plurality of data dividing intervals according to the mean value and the standard deviation, dividing the array to be processed into a plurality of data packets according to the plurality of data dividing intervals, counting the cumulative frequency of each data packet, and constructing a cumulative frequency distribution table based on each data packet and the cumulative frequency of each data packet.

In practice, the data dividing intervals may be determined from μ and σ with reference to the 68-95-99.7 rule of the normal distribution, and the boundaries of each data packet are set accordingly, so as to divide the array to be processed into a plurality of data packets. For example, the data may be divided into the groups (-∞, μ-3σ], (μ-3σ, μ-2σ], (μ-2σ, μ-σ], (μ-σ, μ], (μ, μ+σ], (μ+σ, μ+2σ], (μ+2σ, μ+3σ] and (μ+3σ, +∞). Then, the frequencies of the data packets are accumulated, the cumulative frequency of each data packet is counted, and the cumulative frequency distribution table is constructed based on each data packet and its cumulative frequency. It is worth mentioning that a large number of data experiments show that, when the frequencies of the data packets are counted, data falls into the two middle groups, (μ-σ, μ] and (μ, μ+σ], with high probability. Therefore, to further improve the accuracy of the median approximation, in another embodiment (μ-σ, μ] and (μ, μ+σ] may each be further divided equally into three subgroups, each of length σ/3, to refine the division of the data packets. In this embodiment, by constructing the cumulative frequency distribution table from the standard deviation and the mean, the standard deviation, the mean and the cumulative frequency distribution table can be determined by traversing the array to be processed once, which saves time; the time complexity is linear, only O(n), where n is the length of the array.
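The grouping by multiples of σ around μ can be sketched as follows. The helper names `group_edges` and `group_counts` are illustrative assumptions; the sketch uses the eight right-closed groups bounded by μ-3σ through μ+3σ and leaves out the optional σ/3 refinement of the two middle groups.

```python
from bisect import bisect_left


def group_edges(mu, sigma):
    """Seven finite edges mu-3*sigma ... mu+3*sigma delimiting eight groups."""
    return [mu + k * sigma for k in range(-3, 4)]


def group_counts(values, edges):
    """Count values per group; group g (0-based) covers (edges[g-1], edges[g]]."""
    counts = [0] * (len(edges) + 1)
    for x in values:
        # bisect_left puts a value equal to an edge into the group that the
        # edge closes on the right, matching the right-closed intervals.
        counts[bisect_left(edges, x)] += 1
    return counts


edges = group_edges(5.0, 2.0)
counts = group_counts([2, 4, 4, 4, 5, 5, 7, 9], edges)
print(edges, counts)
```

Because there are only seven edges, locating the group of each value costs constant time, so the counting pass stays linear in the array length.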

In one embodiment, counting the cumulative frequency of each data packet includes: counting the number of data distributed in each data packet to obtain the frequency number of each data packet, determining the frequency of each data packet based on the frequency number of each data packet, and carrying out upward accumulation summation on the frequency of each data packet to obtain the accumulation frequency of each data packet.

In this embodiment, the cumulative frequency of each data packet may be counted as follows: the data of the array to be processed is traversed and the data packet into which each datum falls is determined, so that the frequency count of each data packet is obtained; the relative frequency of each data packet is then determined from its frequency count; and the relative frequencies of the data packets are cumulatively summed upward to obtain the cumulative frequency of each data packet. It will be appreciated that in other embodiments the frequencies of the data packets may instead be cumulatively summed downward, without limitation. In this embodiment, upward cumulative summation allows the cumulative frequencies of the data packets to be counted correctly and in order.

In one embodiment, as shown in fig. 3, searching the data packet in which the median approximation of the array to be processed is located based on the cumulative frequency distribution table and the current state of the predefined approximate median search event, and determining the median approximation of the array to be processed includes: step 420, when the current state of the predefined approximate median lookup event is the first state, searching out the data packet in which the median approximation value of the to-be-processed array is located according to the cumulative frequency distribution table, and determining the median approximation value of the to-be-processed array; step 440, when the current state of the predefined approximate median searching event is the second state, searching the data packet in which the median approximation value of the array to be processed is located according to the cumulative frequency distribution table, obtaining the first data packet location identifier, and when the first data packet location identifier is consistent with the second data packet location identifier, determining the received median approximation value of the last array as the median approximation value of the array to be processed; the data packet position identifiers are used for representing positions of data packets in an array after data division is completed, the first data packet position identifiers are position identifiers of the data packets in which median approximation values of the arrays to be processed are located, and the second data packet position identifiers are position identifiers of the data packets in which the received median approximation values of the last array are located.

Specifically, the data packet location identifier indicates the position of the data packet in the array in which it is located. If the current state of the predefined approximate median lookup event is state 0, the cumulative frequency of each data packet in the cumulative frequency distribution table is read directly; if the cumulative frequency of the currently read data packet exceeds 0.5, the median approximation is considered to lie in the current data packet, this data packet is determined to be the data packet in which the median approximation is located, and half of the sum of its upper and lower boundaries is then determined to be the median approximation of the whole array to be processed. For example, let the frequency of a data packet be Num(i), where i is the data packet position identifier, that is, the ordinal number of the current data packet counted from left to right in the array after data division is completed, with i indexed from 1. When the cumulative frequency distribution table (CFT) is calculated, a variable Allsum is defined; according to the calculation method of the cumulative frequency distribution table, Allsum = Num(1) + Num(2) + Num(3) + Num(4) + …. During frequency accumulation, it is judged whether Allsum has reached 0.5; if so, accumulation stops and the current value of i is recorded. This value of i is the position identifier of the data packet in which the median approximation of the current array to be processed is located.
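The lookup in state 0 can be illustrated with the following sketch. The function names are hypothetical, the threshold test is written as "reaches 0.5" (the text above uses both "exceeds" and "reaches"), and the handling of the two open-ended outer packets, which have no finite midpoint, is an assumption of this sketch.

```python
def median_group(cum_freq):
    """1-based position identifier i of the first packet whose cumulative frequency reaches 0.5."""
    for i, f in enumerate(cum_freq, start=1):
        if f >= 0.5:
            return i
    raise ValueError("cumulative frequencies never reach 0.5")

def group_midpoint(i, edges):
    """Half of the sum of the lower and upper boundaries of packet i (1-based).

    Packet i is (edges[i - 2], edges[i - 1]]; packets 1 and len(edges) + 1
    are unbounded, so this sketch refuses to compute a midpoint for them.
    """
    if i <= 1 or i >= len(edges) + 1:
        raise ValueError("open-ended packet has no finite midpoint")
    return (edges[i - 2] + edges[i - 1]) / 2
```

With the cumulative frequencies 0.2, 0.5, 0.8, 1.0 and the boundaries 2, 5, 8, the identifier is 2 and the median approximation is (2 + 5) / 2 = 3.5.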

When the current state of the predefined approximate median lookup event is state 1, the data packet in which the median approximation is located is found in the same manner and its position identifier is obtained. The position identifier of the data packet in which the median approximation of the array to be processed is located is then compared with the received position identifier of the data packet in which the median approximation of the last array is located. If the two are consistent, the data distribution of the current array to be processed is similar to that of the last array; in this case, the received median approximation of the last array is used directly as the median approximation of the current array to be processed, state 1 is maintained, the process returns to step 100, and the median approximation of the next array continues to be searched in state 1. In this embodiment, when the approximate median lookup event is in state 1 and the two position identifiers are consistent, directly reusing the received median approximation of the last array simplifies the lookup of the median approximation and saves lookup time.

In one embodiment, after obtaining the first data packet location identifier, the method further includes: if the first data packet position identifier is inconsistent with the second data packet position identifier, judging whether the absolute value of the difference value between the first data packet position identifier and the second data packet position identifier meets a preset error; if the absolute value of the difference value meets the preset error, determining a median approximation value according to the data packet in which the median approximation value of the array to be processed is located; if the absolute value of the difference does not meet the preset error, the current state of the predefined approximate median lookup event is switched to the first state, and the step 200 is returned.

In this embodiment, if the first data packet position identifier is inconsistent with the second data packet position identifier, the data distributions of the array to be processed and the last array are not very similar. It may then be checked whether the two identifiers differ by no more than one data packet interval, specifically, whether the absolute value of their difference is 1. If so, the identifiers float within one data packet interval, and the median approximation of the array to be processed is determined from the upper and lower boundaries of the data packet in which the median approximation of the array to be processed is located. If the absolute value of the difference is not 1, the data distributions of the array to be processed and the last array are dissimilar, and the mean and standard deviation of the array to be processed need to be recalculated before the median approximation is searched. In this case, the state of the approximate median lookup event may be switched from state 1 to state 0, and the process returns to step 200 to find the median approximation of the array to be processed in state 0. In this embodiment, an error tolerance is set for the data packet position identifier and the manner of determining the median approximation is chosen according to that tolerance, which is closer to actual conditions and allows a more accurate median approximation to be found.
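The three outcomes in state 1 can be summarized in a short sketch. The function name, the `max_shift` parameter (standing in for the preset error of 1) and the convention of returning `None` to request a recomputation are assumptions of this sketch; the text does not state which state applies after the one-packet case, so remaining in state 1 is also an assumption.

```python
def decide_in_state_1(first_id, second_id, prev_median, edges, max_shift=1):
    """Return (median_approximation, next_state) for an array handled in state 1.

    first_id:  packet identifier found for the array to be processed
    second_id: packet identifier received from the last array
    A result of (None, 0) means: switch to state 0 and recompute mean and
    standard deviation for this array.
    """
    if first_id == second_id:
        return prev_median, 1                      # reuse the received median approximation
    interior = 1 < first_id < len(edges) + 1       # outer packets are unbounded
    if abs(first_id - second_id) <= max_shift and interior:
        lower, upper = edges[first_id - 2], edges[first_id - 1]
        return (lower + upper) / 2, 1              # midpoint of the newly found packet
    return None, 0                                 # distributions too dissimilar
```

Treating an open-ended outer packet as a reason to recompute is a design choice of this sketch, since such a packet has no finite midpoint.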

In order to make a clearer description of the median approximation search method of the data set provided by the present application, the following description is made with reference to a specific embodiment and fig. 4:

The user defines an approximate median lookup event in advance and initializes its state to 0, and then uploads to a server the data set on which approximate median lookup is to be performed. The data set comprises a plurality of arrays with similar data distributions, so that the same mean and standard deviation can be used to describe the data distribution of these arrays. The user then sends an approximate median lookup message to the server through the terminal, and the server, in response to the message, performs approximate median lookup on the arrays in the data set in sequence.

Specifically, the approximate median lookup may proceed as follows. The first array to be processed and the current state of the predefined approximate median lookup event are acquired; at this point the event is in its initialization state, namely state 0. The standard deviation and mean of the array to be processed are then calculated, a plurality of data dividing intervals are determined from the mean and the standard deviation, the array to be processed is divided into a plurality of data packets according to these intervals, the cumulative frequency of each data packet is counted, and a cumulative frequency distribution table is constructed based on each data packet and its cumulative frequency. Next, the cumulative frequency of each data packet in the table is read; if the cumulative frequency of the currently read data packet exceeds 0.5, the median approximation is considered to lie in the current data packet, which is determined to be the data packet in which the median approximation is located, and half of the sum of the upper and lower boundaries of that data packet is determined to be the median approximation of the whole array to be processed. The median approximation is output, and the standard deviation, the mean, the median approximation and the position identifier of the data packet in which the median approximation is located are transmitted to the next array. The state of the approximate median lookup event is then switched from 0 to 1, and the process returns to the step of acquiring the current array to be processed and the current state of the predefined approximate median lookup event.
When the next array is input, the approximate median lookup event is in state 1. The next array is taken as the current array to be processed, and its cumulative frequency distribution table is constructed in the manner described above using the received standard deviation and mean of the previous array. The cumulative frequency of each data packet in the table is then read, the data packet in which the median approximation is located is found, and the first data packet position identifier is obtained. When the first data packet position identifier is consistent with the second data packet position identifier, the received median approximation of the previous array is determined to be the median approximation of the array to be processed. If the two identifiers are inconsistent, it is judged whether the absolute value of their difference meets the preset error of 1. If it does, the median approximation is determined from the data packet in which the median approximation of the array to be processed is located; if it does not, the predefined approximate median lookup event is switched from state 1 to state 0. The process then returns to the step of acquiring the current array to be processed and the current state of the predefined approximate median lookup event, until the median approximations of all arrays in the data set have been found.
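The overall two-state loop described above might look like the following sketch. It is an illustrative reading of the embodiment, not the applicant's implementation: all names are hypothetical, the threshold test is "reaches 0.5", the population standard deviation is used, arrays are assumed to be non-empty with non-zero spread, and an outer (unbounded) packet found in state 1 is treated as a reason to fall back to state 0.

```python
import math
from bisect import bisect_left
from itertools import accumulate

def mean_std(values):
    n = len(values)
    mu = sum(values) / n
    return mu, math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def locate(values, mu, sigma):
    """Return (edges, 1-based id of the packet where the cumulative frequency reaches 0.5)."""
    edges = [mu + k * sigma for k in (-3, -2, -1, 0, 1, 2, 3)]
    counts = [0] * (len(edges) + 1)
    for x in values:
        counts[bisect_left(edges, x)] += 1
    for i, c in enumerate(accumulate(counts), start=1):
        if c / len(values) >= 0.5:
            return edges, i

def midpoint(edges, i):
    # Assumes an interior packet (edges[i - 2], edges[i - 1]].
    return (edges[i - 2] + edges[i - 1]) / 2

def approximate_medians(arrays):
    state = 0        # 0: compute mean/std for this array; 1: reuse the received ones
    carried = None   # (mu, sigma, median, packet id) transmitted to the next array
    results = []
    for values in arrays:
        while True:
            if state == 0:
                mu, sigma = mean_std(values)
                edges, gid = locate(values, mu, sigma)
                median = midpoint(edges, gid)
                state = 1
                break
            mu, sigma, prev_median, prev_gid = carried
            edges, gid = locate(values, mu, sigma)
            if gid == prev_gid:
                median = prev_median             # same packet: reuse the received value
                break
            if abs(gid - prev_gid) == 1 and 1 < gid < len(edges) + 1:
                median = midpoint(edges, gid)    # within the preset error of 1
                break
            state = 0                            # too dissimilar: recompute in state 0
        carried = (mu, sigma, median, gid)
        results.append(median)
    return results
```

Calling `approximate_medians([a, a, c])`, where `c` is `a` shifted by a large constant, exercises all paths: the second array reuses the first result, and the third forces a switch back to state 0. Note that the result is only as precise as the packet width, so it can differ noticeably from the exact median.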

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the application also provides a median approximation searching device for a data set, which implements the median approximation searching method for a data set described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the median approximation searching device provided below, reference may be made to the limitations of the median approximation searching method above, which are not repeated here.

In one embodiment, as shown in FIG. 5, a median approximation lookup apparatus for a data set is provided, the data set comprising a plurality of arrays, the plurality of arrays characterizing a data distribution using the same mean and standard deviation. The device comprises: a data acquisition module 510, a mean standard deviation determination module 520, a cumulative frequency distribution table construction module 530, a median approximation determination module 540, and a loop processing module 550, wherein:

a data acquisition module 510, configured to acquire a current state of the current array to be processed and a predefined approximate median lookup event.

The mean standard deviation determining module 520 is configured to determine the mean and the standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event.

The cumulative frequency distribution table construction module 530 is configured to construct a cumulative frequency distribution table according to the mean and the standard deviation, where the cumulative frequency distribution table includes a plurality of data packets, and the data packets are divided by the array to be processed.

The median approximation determining module 540 is configured to find a data packet in which a median approximation of the array to be processed is located based on the cumulative frequency distribution table and a current state of a predefined approximation median lookup event, determine a median approximation of the array to be processed, and transmit the mean, standard deviation, median approximation, and a location identifier of the data packet in which the median approximation is located to a next array, and if the current state of the predefined approximation median lookup event is the first state, switch the first state to the second state.

The loop processing module 550 is configured to control the data acquisition module to perform an operation of acquiring the current states of the to-be-processed arrays and the predefined approximate median lookup event until the median approximations of all the arrays in the data set are found.

The above median approximation searching device for a data set introduces a two-state mechanism. The mean and standard deviation of the array to be processed are determined according to the current state of the predefined approximate median lookup event, and the cumulative frequency distribution table is then constructed. The data packet in which the median approximation of the array to be processed is located is found through the cumulative frequency distribution table and the current state of the predefined approximate median lookup event, and the median approximation is determined, so that the approximate median of an array can be found without any sorting operation, which simplifies the search. After the median approximation of the current array to be processed has been found, the mean, the standard deviation, the median approximation and the position identifier of the data packet in which the median approximation is located are transmitted to the next array, which facilitates the median search for the next array, reduces comparison and exchange operations between data, and saves time. Using this device can therefore improve the efficiency of the approximate median search.

In one embodiment, the mean standard deviation determining module 520 is further configured to calculate the mean and standard deviation of the array to be processed when the current state of the predefined approximate median lookup event is the first state, and determine the mean and standard deviation of the last array to be processed as the mean and standard deviation of the array to be processed when the current state of the predefined approximate median lookup event is the second state.

In one embodiment, the median approximation determining module 540 is further configured to, when the current state of the predefined approximation median lookup event is the first state, find a data packet in which a median approximation of the array to be processed is located according to the cumulative frequency distribution table, determine a median approximation of the array to be processed, when the current state of the predefined approximation median lookup event is the second state, find a data packet in which a median approximation of the array to be processed is located according to the cumulative frequency distribution table, obtain a first data packet location identifier, and when the first data packet location identifier is consistent with the second data packet location identifier, determine the received median approximation of the last array as the median approximation of the array to be processed, where the data packet location identifier is used to characterize a location of the data packet in the array after data division, the first data packet location identifier is a location identifier of the data packet in which the median approximation of the array to be processed is located, and the second data packet location identifier is a location identifier of the data packet in which the median approximation of the last received array is located.

In one embodiment, the median approximation determining module 540 is further configured to determine whether the absolute value of the difference between the first data packet location identifier and the second data packet location identifier meets a preset error if the first data packet location identifier is inconsistent with the second data packet location identifier, determine a median approximation according to a data packet in which the median approximation of the array to be processed is located if the absolute value of the difference meets the preset error, switch the current state of the predefined approximate median lookup event to the first state if the absolute value of the difference does not meet the preset error, and control the data obtaining module 510 to perform an operation of determining the mean and the standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event.

In one embodiment, the cumulative frequency distribution table construction module 530 is further configured to determine a plurality of data dividing intervals according to the mean and the standard deviation, divide the array to be processed into a plurality of data packets according to the plurality of data dividing intervals, count the cumulative frequency of each data packet, and construct a cumulative frequency distribution table based on each data packet and the cumulative frequency of each data packet.

In one embodiment, the cumulative frequency distribution table construction module 530 is further configured to count the number of data distributed in each data packet, obtain the frequency number of each data packet, determine the frequency of each data packet based on the frequency number of each data packet, and perform upward cumulative summation on the frequencies of each data packet to obtain the cumulative frequency of each data packet.

The respective modules in the median approximation searching device for a data set described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as a data set of median approximation values to be searched, median approximation values, standard deviation, mean values and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of median approximation finding for a data set.

It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided that includes a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of the median approximation finding method of the data set described above.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon which, when executed by a processor, performs the steps in the median approximation lookup method for a data set described above.

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps in the median approximation finding method of a data set described above.

The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims

1. A method for searching a median approximation value of a data set, characterized in that the data set is a sensor data set and comprises a plurality of arrays, and the arrays adopt the same mean value and standard deviation to represent data distribution;

the method for searching the median approximation value of the data set comprises the following steps:

determining the mean value and standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event;

constructing an accumulated frequency distribution table according to the mean value and the standard deviation, wherein the accumulated frequency distribution table comprises a plurality of data packets, and the data packets are divided by the array to be processed;

based on the accumulated frequency distribution table and the current state of the predefined approximate median lookup event, searching a data packet in which a median approximation value of the array to be processed is located, determining a median approximation value of the array to be processed, and transmitting the mean value, the standard deviation, the median approximation value and a position identifier of the data packet in which the median approximation value is located to a next array, if the current state of the predefined approximate median lookup event is a first state, switching the first state into a second state, wherein the position identifier is used for representing the position of the data packet in the array after data division is completed;

and returning to the step of acquiring the current states of the arrays to be processed and the predefined approximate median lookup event until the median approximation values of all arrays in the dataset are found.

2. The method of claim 1, wherein determining the mean and standard deviation of the array to be processed based on the current state of the predefined approximate median lookup event comprises:

when the current state of the predefined approximate median lookup event is a first state, calculating the mean value and standard deviation of the array to be processed;

and when the current state of the predefined approximate median lookup event is the second state, determining the received mean value and standard deviation of the last array as the mean value and standard deviation of the array to be processed.

3. The method of claim 1, wherein the searching for the data packet in which the median approximation of the array to be processed is located based on the cumulative frequency distribution table and the current state of the predefined approximate median search event, and determining the median approximation of the array to be processed comprises:

when the current state of the predefined approximate median lookup event is a second state, searching a data packet in which a median approximation value of the array to be processed is located according to the accumulated frequency distribution table, acquiring a first data packet position identifier, and when the first data packet position identifier is consistent with the second data packet position identifier, determining the received median approximation value of the last array as the median approximation value of the array to be processed;

the first data packet position identifier is a position identifier of a data packet where a median approximation value of an array to be processed is located, and the second data packet position identifier is a position identifier of a data packet where a median approximation value of a last received array is located.

4. A method of median approximation finding for a data set as claimed in claim 3, further comprising, after said obtaining the first data packet location identifier:

if the first data packet position identifier is inconsistent with the second data packet position identifier, judging whether the absolute value of the difference between the first data packet position identifier and the second data packet position identifier meets a preset error;

if the absolute value of the difference meets the preset error, determining a median approximation value according to the data packet in which the median approximation value of the array to be processed is located;

and if the absolute value of the difference does not meet the preset error, switching the current state of the predefined approximate median lookup event to the first state, and returning to the step of determining the mean value and the standard deviation of the array to be processed according to the current state of the predefined approximate median lookup event.

5. The method of claim 1, wherein constructing a cumulative frequency distribution table based on the mean and the standard deviation comprises:

dividing the array to be processed into a plurality of data packets according to the plurality of data dividing intervals;

counting the cumulative frequency of each data packet;

6. The method of claim 5, wherein said counting the cumulative frequency of each data packet comprises:

7. The median approximation searching device for the data set is characterized in that the data set is a sensor data set and comprises a plurality of arrays, and the arrays adopt the same mean value and standard deviation to represent data distribution;

The median approximation look-up device of the dataset comprises:

the accumulated frequency distribution table construction module is used for constructing an accumulated frequency distribution table according to the mean value and the standard deviation, wherein the accumulated frequency distribution table comprises a plurality of data packets, and the data packets are divided by the array to be processed;

the median approximation value determining module is configured to find a data packet in which a median approximation value of the array to be processed is located based on the cumulative frequency distribution table and a current state of the predefined approximate median lookup event, determine a median approximation value of the array to be processed, and transmit the mean value, the standard deviation, the median approximation value, and a location identifier of the data packet in which the median approximation value is located to a next array, and if the current state of the predefined approximate median lookup event is a first state, switch the first state to a second state, where the location identifier is used to characterize a location of the data packet in the array after data division is completed;

and the loop processing module is used for controlling the data acquisition module to execute the operation of acquiring the current array to be processed and the current state of the predefined approximate median lookup event until the median approximation values of all the arrays in the data set are found.

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.