CN108021689A

CN108021689A - The method inquired about using the IVkNN algorithms based on MapReduce

Info

Publication number: CN108021689A
Application number: CN201711358195.7A
Authority: CN
Inventors: 季长清; 汪祖民; 秦静; 吴锐
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2017-10-19
Filing date: 2017-10-19
Publication date: 2018-05-11
Also published as: CN107844532A

Abstract

This divisional application discloses the method that a kind of IVkNN algorithm of use based on MapReduce is inquired about, and belongs to cloud computing, big data field, is for solving the problems, such as：Available data querying method index efficiency is improved, technical essential is：Host node is loaded files into distributed caching, and Mappers reads R from distributed caching_i∈ R and S_jThe key-value pair that each partition informations of ∈ S are formed, map functions generate new key-value pair, and for each object r ∈ R, wherein key is its subregion id, and effect is：Algorithm index efficiency can be improved so that MapReduce can be no longer influenced by room and time influence.

Description

The method inquired about using the IVkNN algorithms based on MapReduce

The application for application number 2017109759745, applying date 2017-10-19, denomination of invention " based on MapReduce with The divisional application of the extensive nearest Neighbor of row's Thiessen polygon ".

Technical field

The invention belongs to cloud computing, big data field, is related to a kind of can be effectively improved under distributed environment and inquires about effect The MapReduce indexes of rate.

Background technology

MapReduce is a kind of currently a popular programming framework based on cloud platform, it can handle and generate large-scale number According to collection, it supports data-intensive application using the shared cluster of nothing.Processing step is specially：In distributed cache system In, it is that key/value pair among one group is generated in map functions, according to phase by MapReduce tasks when handling a key/value pair With middle key merge all medians, each map is independently of each other operations, i.e., all maps can be held parallel OK.One group " reducers " of MapReduce can perform reduction operation, and the output that the Map with identical key is operated at the same time may be used With reduction to same reducer.But one reduction procedure of isolated operation may be such that inefficiency；

MapReduce can be used for supporting than traditional more massive data processing of commerce server cluster, it can be The data of a PB quantity can be only handled in a few houres, carrying out data directory using MapReduce has preferable application Prospect.However, parallel processing of the existing Index Algorithm due to not adapting to MapReduce, builds the time consumption of index not Enough ideals, scalability are bad.

The content of the invention

In order to improve available data querying method index efficiency, the present invention provides following scheme:

A kind of extensive nearest Neighbor based on MapReduce with row's Thiessen polygon, includes the following steps：

S1. the row's of falling Voronoi indexes based on MapReduce are constructed；

S2. subregion is carried out to data set R and S using the row's of falling Voronoi indexes；

S3. the IVKNN based on MapReduce is used to carry out distribution kNN inquiries.

Construction is as follows the step of arranging Voronoi indexes based on MapReduce：

Two datasets R is given in S1.1.d dimension spaces and S, Hadoop carry out burst, part mappers is parallel at the same time to be transported OK, in MapReduce tasks, using the reducer of acquiescence, before map functions are started, generation is obtained using pre- clustering algorithm Table point p, and be loaded into the main memory of each map；

S1.2. in each map treatment progress, the burst of input is read using TextInputForma successively, TextInputFormat reads data into the example of Mapper from file, calculate respectively m object r in data set R with respectively N object s in the distance between point p, data set S is represented with representing the distance between point p, and by range data collection R J-th of object s in i-th of object r and data set S_jImmediate representative point P_ijSelect and be gathered in Voronoi cells VC_iIn, subregion VC1~VCm, the i order for forming m VC are taken all over 1~m, also, are worked as i orders and taken one in 1~m During occurrence, under the value of i, j orders take all values all over 1~n, obtain the immediate Voronoi for representing point of m storage Cell；Output<VC_i, List (P_ij)>It is right；

Known query point p, differentiates its closest VC_iOr most some neighbouring VC_iCollection, it is most adjacent described in mapper outputs Near VC_iOr most some neighbouring VC_iCollect r, s object in corresponding raw data set R and/or S, and export closest VC_i Or most some neighbouring VC_iThe id of collection；

S1.3., mapper is output to the file system of Hadoop.

In the step S1.2, j-th in i-th of object r and data set S in the R by range data collection Object s_jImmediate representative point P_ijSelect and be gathered in Voronoi cells VC_iIn, formed m VC subregion VC1~ The process of VCm is as follows：

First object r in data set R₁With first object s in data set S₁The immediate point that represents as P₁₁, number According to first object r in collection R₁With second object s in data set S₂It is immediate represent point as P₁₂, first object r in data set R₁With n-th of object s in data set S_nImmediate represent a little For P_1n, a immediate representatives of the n, which are put, to be selected and is gathered in Voronoi cells V₁In；

······

M-th of object r in data set R_mWith first object s in data set S₁The immediate point that represents as P_m1, number According to m-th of object r in collection R_mWith second object s in data set S₂The immediate point that represents as P_m2, M-th of object r in data set R_mWith n-th of object s in data set S_nThe immediate point that represents as P_mn, the n closest Representative point select and be gathered in Voronoi cells V_mIn.

The step of being inquired about using the IVkNN algorithms based on MapReduce is as follows：

S1. the partition value file for including data set R and S is put according to IVI；

S2. host node is loaded files into distributed caching, and Mappers reads R from distributed caching_i∈ R and S_j∈ The key-value pair that each partition informations of S are formed, map functions generate new key-value pair, and for each object r ∈ R, wherein key is its point Area id, for each object s ∈ S, if dist (s, P)<Dist (r, P), then map functions also create one group of new key-value pair；

Reducer iteratively reads R from Mapper_iAnd S_iIn all the points, by VC_iIdentical key-value pair into dividing group, KNN inquiries are carried out, check S_jAll subregions after, reducer output kNN (r, S), export key-value pair<R, kNN (r, S)>As KNN query results.

The row's of falling Voronoi indexes include two parts：Master index, it includes all cluster centres；Second index, It include being stored in each VC to as queue.

Beneficial effect：The present invention is that one kind handles kNN inquiries using MapReduce, and can using Voronoi diagram Parallelization suitable for MapReduce is handled, so as to improving algorithm index efficiency so that MapReduce can be no longer influenced by Room and time influences.

Brief description of the drawings

Fig. 1 is Voronoi diagram；

Fig. 2 is Voronoi diagram index；

Fig. 3 is that influence of the node data change to RSD and SDS data collection structure index time in MRIV structures contrasts Figure.

Specific embodiment party

Embodiment 1：A kind of extensive nearest Neighbor based on MapReduce with row's Thiessen polygon, wherein： MapReduce is existing programming model, and for the concurrent operation of large-scale dataset, described method includes following steps：

S1. construct based on MapReduce the row's of falling Voronoi indexes (Thiessen polygon index, InvertedVoronoiIndex,IVI)；

S2. using fall row Voronoi indexes to data set R and S progress subregion, obtain VC subregions, two subregions be by Need to carry out when establishing of Voronoi diagram in the later stage, Voronoi diagram needs two subregions to combine, thus carries out in this step The subregions of two datasets；

S3. the IVKNN based on MapReduce is used to carry out distribution kNN (nearest neighbor algorithm) inquiries, IVKNN is to utilize IVI Fall row Voronoi indexes.

Wherein：Construction is as follows the step of arranging Voronoi indexes based on MapReduce：

Two datasets R is given in S1.d dimension spaces and S, Hadoop (distributed system architecture) carry out burst, portion Divide mappers (a kind of existing coding and decoding) while parallel operation, in MapReduce tasks, use the reducer of acquiescence (subtask merging process), before map functions are started, obtains representing point p, and be loaded into each map using pre- clustering algorithm In the main memory of (subtask, which is decomposed, to be performed)；

Wherein：Inside, is clustered the data clusters of point by the acquisition methods of representative, definite internal cluster and consecutive points, Select cluster centre after cluster to be indexed, required data are to cluster a consecutive points for connection with internal, are clustered with this inside Point is the center of circle, and circle is established comprising adjacent cluster centre point, and Delaunay triangles are used as the triangle of circumscribed circle using this circle Two different inside cluster points are established Delaunay triangles in this method, the two Delaunay triangles by shape respectively Delaunay triangulation network is established by common ground of consecutive points, data object is divided into several big subregions, selects wherein one cluster Point is represented as representing a little, each object being divided is to be clustered in a Voronoi unit, each Voronoi grids In contain object id.

S2. in each map treatment progress, read successively using TextInputForma (a kind of existing read mode) The burst of input is taken, TextInputFormat reads data into the example of Mapper from file, calculates respectively in data set R The distance between the distance between m object r and each representative point p, n object s in data set S and representative point p, and will be away from From j-th of object s in i-th of the object r and data set S in data set R_jImmediate representative point P_ijSelect and be gathered in Voronoi cells VC_iIn, subregion VC1~VCm, the i order for forming m VC (subregion of Voronoi diagram) are taken all over 1 ~m, also, when i orders take an occurrence in 1~m, under the value of i, j orders take all values all over 1~n, obtain The immediate Voronoi cells for representing point of m storage；Output<VC_i, List (P_ij)>It is right；

I.e.：First object r in data set R₁With first object s in data set S₁It is immediate represent point as P₁₁, first object r in data set R₁With second object s in data set S₂It is immediate represent point as P₁₂, first object r in data set R₁With n-th of object s in data set S_nImmediate represent a little For P_1n, a immediate representatives of the n, which are put, to be selected and is gathered in Voronoi cells V₁In；

Second object r in data set R₂With first object s in data set S₁The immediate point that represents as P₂₁, number According to second object r in collection R₂With second object s in data set S₂It is immediate represent point as P₂₂, second object r in data set R₂With n-th of object s in data set S_nImmediate represent a little For P_2n, a immediate representatives of the n, which are put, to be selected and is gathered in Voronoi cells V₂In；

······

M-th of object r in data set R_mWith first object s in data set S₁The immediate point that represents as P_m1, number According to m-th of object r in collection R_mWith second object s in data set S₂The immediate point that represents as P_m2, M-th of object r in data set R_mWith n-th of object s in data set S_nThe immediate point that represents as P_mn, the n closest Representative point select and be gathered in Voronoi cells V_mIn；

S3., mapper is output to the file system of Hadoop.

S1. the partition value file for including data set R and S is put according to IVI (arranging Voronoi indexes)；

S2. host node is loaded files into distributed caching, and Mappers reads R from distributed caching_i∈ R and S_j∈ The key-value pair that each partition informations of S are formed, map functions generate new key-value pair, and for each object r ∈ R, wherein key is its point Area id, is worth and is made of k and v, for each object s ∈ S, if dist (s, P_ij)<Dist (r, P_ij), then map functions also create One group of new key-value pair；

Reducer iteratively reads R from Mapper_iAnd S_jIn all the points, by VC_iIdentical key-value pair into dividing group, KNN inquiries are carried out, check S_jAll subregions after, reducer output kNN (r, S), export key-value pair<R, kNN (r, S)>As KNN query results.

The IVI, IVI include two parts：Master index, it includes all cluster centres；Second index, it includes storage There are each VC to as queue.

Embodiment 2：The present embodiment can implement as independent technical solution or as in embodiment 1 each scheme into one Step explanation, present embodiments provides a kind of extensive nearest Neighbor based on MapReduce with row's Thiessen polygon, this Method is a kind of highly effective algorithm that kNN inquiries are handled using Voronoi diagram based on MapReduce, it can also solve to move Dynamic medical call system meets wireless penetration, networking, mobile this future developing trend.The present embodiment is also for the prior art In deficiency improved, possess good high efficiency and scalability.To achieve these goals, the present embodiment is used Technical solution perform step it is as follows：Carry out MapReduce and arrange building for the extensive NN Query index of Thiessen polygon It is vertical；MapReduce is a kind of currently a popular programming framework based on cloud platform, it can handle and generate large data collection, It supports data-intensive application using the shared cluster of nothing.Processing step is specially：In distributed cache system, by MapReduce tasks are that key/value pair among one group is generated in map functions, according to identical when handling a key/value pair Middle key merges all medians.Each map is independently of each other operations, i.e., all maps can be performed parallel --- Although actually this is limited to separate data source or data source CPU quantity nearby.Likewise, one group " reducers " can be performed Reduction operation, the output that the Map with identical key is operated at the same time can be with reduction to same reducer.Although with serial calculation Method is compared, this process it is possible that inefficiency (because must run a multiple rather than reduction procedure), but MapReduce can be used for supporting than traditional more massive data processing of commerce server cluster, it can be only several small When the interior data that can handle a PB quantity, we are mainly used in Spark systems in practice, have been obtained obvious Improved efficiency, the method execution efficiency more traditional than before use improve more than 12%.

The present embodiment includes two parts using MapReduce structures IVI, IVI：In master index, including all clusters The heart；Second index, including be stored in each VC to as queue.Inverted index is for effectively index position and query object Data object in adjacent queue.When a given inquiry, we can differentiate closest VC or most some are neighbouring VC collection.Then the corresponding queue element (QE)s of these VC are included to come, so as to obtain kNN query results.Due to Voronoi diagram It can be obtained with merging multiple Voronoi diagrams (VP) by splitting, be suitable for so construction falls to arrange Voronoi indexes MapReduce model.The Voronoi for particularly every sub- VP being merged to the end.

Construction falls to arrange Voronoi indexes and concretely comprises the following steps：

S1. give and two datasets R and S are given in d dimension spaces.Hadoop peace default mechanisms carry out burst.Some Mappers parallel operations at the same time.In MapReduce tasks, we use the reducer given tacit consent to.Start map functions it Before, we obtain representing point p using quick pre- clustering algorithm, and are loaded into the main memory of each map.

S2. in each map treatment progress, it will read the burst of input using TextInputFormat successively (pressing the input format in distributed file system), TextInputFormat can read data to the reality of Mapper from file In example.Each r, the distance between s objects and p points are calculated, and by r, s distributes to immediate representative point P. in algorithm In 2-3 rows, each point is collected in a Voronoi cell, it will be produced into m Voronoi cell, in algorithm It can be exported in 4-6 rows<VCm, List (P_i)>Right, mapper output raw data sets (R or S) arrive each of hithermost subregion A object r, s and its subregion VC_mId.

Finally, in algorithm 8-10 rows, it would be desirable to needed according to what is controlled oneself by customized Mapper is output to the file system of Hadoop by MultipleOutputFormat functions.It is determined how task result Write back in the lasting storage of bottom.Voronoi index of the structure based on MapReduce is described in detail in we in algorithm 1 The algorithm pseudo code of structure.Using IVI, if given one represents a little, we can start MapReduce tasks and come into line number According to subregion and collect some data messages of each subregion.

In practice, we first by varying clustered node number, the number change from 2 to 32, is based on to assess structure The row's of falling Voronoi (MRIV) index efficiency of MapReduce.Fig. 2 illustrates node data change to RSD and SDS data collection structure Index the influence of time.In Fig. 2, index structure almost linearly increases with the increase of clustered node number.Practice result is also demonstrate,proved Understanding the structure of MRIV indexes needs less time, and final Voronoi can be obtained by merging multiple Voronoi subgraphs Therefore handled suitable for MapReduce parallelizations.In addition, IVI employs inverted index structure, its scalability is better than its other party Method, to sum up, indexes large-scale spatial data, and it is most suitable method to arrange Voronoi indexes.

Embodiment 3：In today that social security service develops rapidly at a high speed, the living standard of people increasingly improves, Also become more hommization and personalization for the demand of medical services.Also there are more and more people to need more at the same time Prompt and perfect medical services.

With mobile communication and the rapid growth based on location-based service correlation technique, cloud computing, big data, Internet of Things, shifting The technologies such as dynamic calculating and space orientation are progressively ripe, and such as GPS, camera, blue-teeth data are also constantly increasing, and emerge in large numbers Substantial amounts of spatial data, this to be faced with huge challenge in the storage and processing of various spatial datas or object.Therefore, with The development of IT application process, electronic health record, nursing call center system, extensive medical data base etc. in industry of medical care Using being also rapidly developing, improve work efficiency, improve medical services, Economy type medicine cost etc. played it is more and more Effect.

But China's geographical environment difference is huge, economic development is uneven, medical resource skewness weighing apparatus, developed regions with Outlying district is compared, and medical level is there is also very big difference, while as rural area is to the at full speed of the industries such as urban migration, tourism Development so that exponentially type increases on the basis of script population mobility is big, is frequently encountered just to a place, runs into disease Disease is unknown to where see a doctor, and stands in the queue to register it is more likely that need several months ahead of time to buy tickets, tosses about multiple hospitals by bus, most The problem of a large amount of manpower financial capacities have been wasted on road at last, and disease is not got timely medical treatment but.Being frequently encountered needs emergency treatment When, do not know but around have what hospital, which hospital can handle this dangerous situation, which hospital position is more preferable closer to, service, Because the delay time at stop, cause treatment not in time, or even tragedy because of delay treatment and lethal can occur.But with the big data epoch Arrival, there is the mode that more related medical resources are inquired about.Medical resource inquiry makes patient more easily find certainly Oneself required data, for example, patient can inquire the hospital nearest from oneself by medical resource, since it is convenient just Doctor.Based on the starting point, we have designed and Implemented the invention.

In the present embodiment, to the big rule based on MapReduce with row's Thiessen polygon described in above-mentioned two embodiment The concrete application of mould nearest Neighbor is made an explanation.

In one embodiment, the process returned based on following steps realization index, request processing, data：

S1. hospital, which establishes, is used to perform based on MapReduce and the extensive nearest Neighbor for arranging Thiessen polygon System；

S2. user initiates to ask by ambulatory medical device to hospital；

S3. when hospital receives user's request, agreed to, and user requested data is returned into user.

Medical knowledge is sent specific to patient to ask for or during medical aid request, its mobile calls method is as follows：

S1. hospital, which establishes, is used to perform based on MapReduce and the extensive nearest Neighbor for arranging Thiessen polygon System, so that user can effectively improve search efficiency under distributed environment.

S2. patient user, when being badly in need of medical knowledge and/or medical aid, can be exhaled outside hospital by portable medical Equipment is made to send respective request to hospital.

S3. when hospital receives user's request, row's Thiessen polygon index is transferred down, performs distribution Skyline or kNN Deng spatial query algorithms (being not limited to these algorithms), agreement decision-making is carried out if meeting condition, and user requested data is returned Back to user.

In another embodiment, the extensive NN Query side based on MapReduce with row's Thiessen polygon Method is applied to medical material dispatching, it realizes that medical resource dispatching includes the following steps：

S1. medical resource supplier is established to perform and is looked into based on MapReduce and the extensive neighbour of row's Thiessen polygon The system of inquiry, and medical resource data are uploaded in systems；

S2. when user needs medical resource, to medical resource, supplier files a request；

S3. medical resource supplier gives two medical resource data sets, and Hadoop carries out burst, each map processing In process, the burst of input is read using TextInputForma successively, calculates each object and the medical treatment needed for user The distance between resource, and the object searched out is distributed to closest to the medical resource needed for user, by from needed for user The nearest object of medical resource is selected and is gathered in Voronoi cells V_mIn, export on the medical resource needed for user most Neighbouring VC_iOr most some neighbouring VC_iThe id of collection, transfers data to user.

In another embodiment, handheld device medical resource demand client access mobile Internet, and provided with medical treatment Source supplier establishes contact, and medical resource demand client sends medical resource requirement request, medical treatment money to medical resource supplier The spatial geographic information of source demand client be collected to for perform described in embodiment 1-2 based on MapReduce with fall arrange In the system of the extensive nearest Neighbor of Thiessen polygon, and the spatial geographic information of medical resource logistic car is by the system System collection in real time, the system is using medical resource demand customer information as representative point, with medical resource position and medical resource thing It is data set to flow lorry position data, by the logistic car geo-spatial data collection of medical resource progress burst, using based on MapReduce and the extensive nearest Neighbor of the row's of falling Thiessen polygon are apart from medical resource demand client to find The medical resource logistics truck position and medical resource position of arest neighbors.

As being explained further for such scheme, which is also referred to as medical dispensing vehicle, handheld device Medical resource demand client is by the wireless network based on 2G/3G/4G modes or WIFI, while accessing mobile Internet Establish and contact with medical resource supplier, medical resource demand client sends to medical resource supplier and asks, medical treatment dispatching doctor Treat resource dispensing vehicle driver after the system is logged in, by spatial geographic information (including the positional information of oneself and medical treatment provide Source demand customer information) medical resource supplier MapReduce is sent to arranging the extensive NN Query of Thiessen polygon System, spatial geographic information spatial geographic information disclosed in system is to medical dispensing vehicle driver establish index.Medical resource needs Ask client to obtain the information of present position by mobile phone, the positional information got is sent to system, system uses MapReduce and the extensive NN Query index technology for arranging Thiessen polygon, are believed in the medical resource of magnanimity with client Cease for representative point, using the sum data of logistics van as data set, logistics van data set is subjected to burst, finds distance symbol The optimal logistics van of customer information is closed, the medical resource dispensing vehicle driver inquired appears in the doctor of medical resource demand client Treat on resource requirement client software map interface, medical resource demand client selects medical resource dispensing vehicle driver, Ke Yixiang System, which sends request or directly dials the mode of medical resource dispensing vehicle driver's phone, sends request, and server is by medical resource Demand client request is forwarded to medical resource dispensing vehicle driver, and medical resource dispensing vehicle driver receipt is to after sending request, to asking Ask and handle it and return to medical resource supplier, medical resource demand Client handset is sent to by medical resource supplier. This can greatly improve search efficiency.In medical resource dispatching, such case is very common, if complete is manually one really Very big workload.If but can be significantly using the extensive nearest Neighbor based on MapReduce and row's Thiessen polygon The manual working time is reduced, reduces unnecessary expenditures.

Claims

1. the method that a kind of IVkNN algorithm of use based on MapReduce is inquired about, its characterization step step are as follows：S1. root The partition value file for including data set R and S is put according to IVI；S2. host node is loaded files into distributed caching, Mappers R is read from distributed caching_i∈ R and S_jThe key-value pair that each partition informations of ∈ S are formed, map functions generate new key-value pair, For each object r ∈ R, wherein key is its subregion id, for each object s ∈ S, if dist (s, P)<Dist (r, P), then Map functions also create one group of new key-value pair；Reducer iteratively reads R from Mapper_iAnd S_iIn all the points, by VC_i's Identical key-value pair carries out kNN inquiries into group is divided, and checks S_jAll subregions after, reducer output kNN (r, S), output Key-value pair<R, kNN (r, S)>As kNN query results.