
CN107992611A - High-dimensional data retrieval method and system based on Cauchy-distribution hashing - Google Patents

High-dimensional data retrieval method and system based on Cauchy-distribution hashing

Info

Publication number
CN107992611A
Authority
CN
China
Prior art keywords
similar
high dimensional
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711353318.8A
Other languages
Chinese (zh)
Other versions
CN107992611B (en)
Inventor
王建民
龙明盛
刘斌
曹越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201711353318.8A
Publication of CN107992611A
Application granted
Publication of CN107992611B
Legal status: Active (Current)
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Software Systems (AREA)

Abstract

The present invention provides a high-dimensional data retrieval method and system based on Cauchy-distribution hashing. The retrieval method includes: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result. The present invention first prunes candidates with the hash method and then re-ranks them using the original features, which both accelerates retrieval and preserves retrieval precision. The present invention enables fast and accurate retrieval.

Description

High-dimensional data retrieval method and system based on Cauchy-distribution hashing
Technical field
The present invention relates to the field of big data analysis, and in particular to a high-dimensional data retrieval method and system based on Cauchy-distribution hashing.
Background technology
In the Internet era, as multimedia resources on the Internet keep growing, finding relevant data quickly and effectively in large-scale collections is a major challenge in terms of both time and space. With the rapid development of the Internet, large-volume, high-dimensional image big data has become increasingly common in search engines and social networks and has attracted more and more attention. How to perform image retrieval and analysis quickly and effectively is an urgent problem, and approximate nearest-neighbor search arose precisely for this problem; guaranteeing computational efficiency and retrieval quality at the same time is the key to approximate nearest-neighbor search. A very common and effective solution to this problem is hashing, which converts high-dimensional data into compact binary codes and generates similar binary codes for similar data. The present invention focuses on data-dependent hashing methods, which have been shown to be more effective than data-independent hashing methods (such as locality-sensitive hashing).
In the past, many hashing methods based on Hamming distance have been proposed. These methods can be divided into two classes: unsupervised hashing and supervised hashing. Unsupervised methods are more general, since they can be trained without semantic labels or data correlations. Accordingly, many unsupervised hashing methods have been proposed to learn compact binary codes for efficient Hamming-space image retrieval.
However, most existing methods are designed for data compression rather than candidate pruning, i.e., they maximize retrieval performance for a linear scan over hash codes. Since the cost of a linear scan remains high even with hash codes, most existing hashing methods deviate to some extent from the original intent, namely achieving the largest possible speed-up at an acceptable precision. With the emergence of powerful hashing methods suited to linear scan, attention should now turn to Hamming-space retrieval, which enables the most efficient constant-time search. In Hamming-space retrieval, the data points within a given Hamming radius of each query are returned via hash-table lookup rather than linear scan. However, the loss functions of existing hashing methods are not well suited to this setting: they usually lack the ability to concentrate related images within a small Hamming ball, so they may not be appropriate for Hamming-space retrieval.
Summary of the invention
To overcome the above problems, the present invention provides a high-dimensional data retrieval method and system based on Cauchy-distribution hashing.
According to one aspect of the present invention, a high-dimensional data retrieval method based on Cauchy-distribution hashing is provided, including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
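As an illustration of step S2, the following is a minimal sketch of how buckets within a small Hamming radius can be probed over an inverted index. It assumes hash codes are stored as Python integers and that `index` maps each code to the list of database ids hashed to it; the names `probe_buckets`, `index`, and `radius` are illustrative and are not taken from the patent.

```python
from itertools import combinations

def probe_buckets(query_code, index, n_bits, radius=2):
    """Collect candidate ids from every bucket whose code lies within the
    given Hamming distance of the query code (radius 2 mirrors the
    preferred preset value mentioned later in the description)."""
    candidates = []
    for r in range(radius + 1):
        # Enumerate all codes obtained by flipping exactly r bits of the query code.
        for bits in combinations(range(n_bits), r):
            code = query_code
            for b in bits:
                code ^= 1 << b
            candidates.extend(index.get(code, []))
    return candidates
```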
Preferably, the trained similarity retrieval model in step S1 is obtained as follows: multiple unused high-dimensional data points are obtained from the training set; the multiple unused high-dimensional data points are input into the hash module to obtain low-dimensional feature vectors better matched to hash coding; based on the low-dimensional feature vectors, the corresponding hash codes are obtained, and the gradient of a preset loss function with respect to the hash coding layer of the similarity retrieval model is computed according to the hash codes; and the similarity retrieval model is trained based on the gradient with respect to its hash coding layer, yielding the trained similarity retrieval model.
Preferably, the preset loss function is:
$$O \triangleq L + \lambda Q;$$
wherein O is the preset loss function, L is the Cauchy cross-entropy loss function, the symbol ≜ denotes a definition, Q is the Cauchy quantization loss function, and λ is the parameter balancing the Cauchy cross-entropy loss and the Cauchy quantization loss.
Preferably, the preset loss function is obtained as a linear combination of the Cauchy cross-entropy loss function and the Cauchy quantization loss function; the Cauchy cross-entropy loss function is given by the following formula:
$$L = \sum_{s_{ij} \in S} w_{ij} \left( s_{ij} \log \frac{d(z_i, z_j)}{\gamma} + \log\!\left(1 + \frac{\gamma}{d(z_i, z_j)}\right) \right)^{2};$$
wherein L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight of the pair (i, j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scaling parameter of the Cauchy distribution;
The normalized Euclidean distance between z_i and z_j is obtained by the following formula:
$$d(z_i, z_j) = \frac{K}{4} \left\| \frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|} \right\|_2^2;$$
wherein d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and ||·|| is the Euclidean norm of a vector;
The Cauchy quantization loss function is given by the following formula:
$$Q = \sum_{i=1}^{N} \log\!\left(1 + \frac{d(|z_i|, \mathbf{1})}{\gamma}\right);$$
wherein Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data points, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scaling parameter of the Cauchy distribution.
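For concreteness, the following is a minimal NumPy sketch of the loss terms defined above, written directly from the formulas (including the outer square of the cross-entropy term as printed in claim 4). The pair weights W are not fully specified in the text, so passing them in explicitly is an assumption, and all function and variable names are illustrative rather than the patent's own.

```python
import numpy as np

def normalized_distance(zi, zj, K):
    """d(z_i, z_j) = K/4 * || z_i/||z_i|| - z_j/||z_j|| ||_2^2."""
    diff = zi / np.linalg.norm(zi) - zj / np.linalg.norm(zj)
    return K / 4.0 * float(np.dot(diff, diff))

def cauchy_cross_entropy(Z, S, W, gamma):
    """Cauchy cross-entropy loss L over all labelled pairs (i, j) with i != j.
    Z: (N, K) continuous codes, S: (N, N) similarity matrix, W: (N, N) pair weights."""
    N, K = Z.shape
    L = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            d = max(normalized_distance(Z[i], Z[j], K), 1e-12)
            term = S[i, j] * np.log(d / gamma) + np.log(1.0 + gamma / d)
            L += W[i, j] * term ** 2   # squared, matching the formula as printed in claim 4
    return L

def cauchy_quantization(Z, gamma):
    """Q = sum_i log(1 + d(|z_i|, 1) / gamma): pushes |z_i| towards the all-ones vector."""
    N, K = Z.shape
    ones = np.ones(K)
    return sum(np.log(1.0 + normalized_distance(np.abs(Z[i]), ones, K) / gamma)
               for i in range(N))

def total_loss(Z, S, W, gamma, lam):
    """O = L + lambda * Q."""
    return cauchy_cross_entropy(Z, S, W, gamma) + lam * cauchy_quantization(Z, gamma)
```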
Preferably, the similarity matrix is an N×N matrix, where N is the total number of data points in the training set, and each element of the matrix is 0 or 1; for any element s_ij of the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar; wherein s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, 0 ≤ j ≤ N.
Preferably, step S1 further includes: S11, inputting the high-dimensional data corresponding to all data points of the database into the trained similarity retrieval model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database; S12, obtaining the inverted index of all data points of the database based on the hash codes corresponding to the high-dimensional data of all data points; S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved and the hash code corresponding to that high-dimensional data.
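A minimal sketch of steps S11-S12 follows, assuming the model has already produced one integer hash code per database point; `build_inverted_index` and `codes` are illustrative names, not the patent's.

```python
def build_inverted_index(codes):
    """Group database point ids by their hash code (stored as an int),
    giving the inverted index used for Hamming-space lookup in step S2."""
    index = {}
    for point_id, code in enumerate(codes):
        index.setdefault(code, []).append(point_id)
    return index
```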
According to another aspect of the present invention, a high-dimensional data retrieval system based on Cauchy-distribution hashing is provided, including: a hash-code acquisition module, configured to input the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; a bucket acquisition module, configured to input the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; and a retrieval-result acquisition module, configured to reorder all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, which is taken as the retrieval result.
According to a further aspect of the present invention, an electronic device for high-dimensional data retrieval is provided, including a memory and a processor, the processor and the memory communicating with each other via a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the retrieval method according to any of the above.
According to a further aspect of the present invention, a computer program product is provided, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the retrieval method according to any of the above.
According to still another aspect of the present invention, a computer-readable storage medium is provided, characterized in that a computer program stored thereon, when executed by a processor, implements the retrieval method according to any of the above.
The high-dimensional data retrieval method and system based on Cauchy-distribution hashing provided by the present invention adopt a prune-then-reorder technique: candidates are first pruned with the hash method and then re-ranked using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations. The present invention enables fast and accurate retrieval.
Brief description of the drawings
Fig. 1 is a flowchart of a high-dimensional data retrieval method based on Cauchy-distribution hashing in an embodiment of the present invention;
Fig. 2 is an architecture diagram of a high-dimensional data retrieval system based on Cauchy-distribution hashing in an embodiment of the present invention;
Fig. 3 is an overall flowchart of a high-dimensional data retrieval method based on Cauchy-distribution hashing in an embodiment of the present invention;
Fig. 4 is a module diagram of a high-dimensional data retrieval system based on Cauchy-distribution hashing in an embodiment of the present invention;
Fig. 5 is a structural diagram of an electronic device for high-dimensional data retrieval in an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, but not to limit its scope.
Fig. 1 is a flowchart of a high-dimensional data retrieval method based on Cauchy-distribution hashing in an embodiment of the present invention. As shown in Fig. 1, the method includes: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Specifically, the high-dimensional data in the embodiments of the present invention may be shallow features or deep features generated by a deep neural network.
Further, the preset value is preferably 2.
It should be noted that the reordering of all high-dimensional data in the buckets is carried out according to the distance between the original high-dimensional data corresponding to each point in the buckets and the high-dimensional data of the data point to be retrieved.
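A minimal sketch of this re-ranking step follows, assuming the original features are kept in a NumPy array indexed by point id; the names are illustrative.

```python
import numpy as np

def rerank(query_feature, candidate_ids, features):
    """Order the candidates collected from the probed buckets by the Euclidean
    distance between their original high-dimensional features and the query."""
    dists = [np.linalg.norm(features[i] - query_feature) for i in candidate_ids]
    return [cid for _, cid in sorted(zip(dists, candidate_ids))]
```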
Further, the hash code is a binary code; the hash code can also be understood as the binary feature corresponding to the high-dimensional data.
The high-dimensional data retrieval method based on Cauchy-distribution hashing provided by the present invention adopts a prune-then-reorder technique: candidates are first pruned with the hash method and then re-ranked using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations.
Based on the above embodiment, the trained similarity retrieval model in step S1 is obtained as follows: multiple unused high-dimensional data points are obtained from the training set; the multiple unused high-dimensional data points are input into the hash module to obtain low-dimensional feature vectors better matched to hash coding; based on the low-dimensional feature vectors, the corresponding hash codes are obtained, and the gradient of the preset loss function with respect to the hash coding layer of the similarity retrieval model is computed according to the hash codes; and the similarity retrieval model is trained based on this gradient to obtain the trained similarity retrieval model.
Further, before obtaining multiple unused high-dimensional data points from the training set and inputting them into the hash module to obtain low-dimensional feature vectors better matched to hash coding, the following steps are performed:
Compute the similarity relation matrix S.
Let the current training epoch be Epoch_current (initialized to 0) and the maximum number of training epochs be Epoch_max. Update Epoch_current = Epoch_current + 1. If Epoch_current ≤ Epoch_max after the update, training has not yet finished: mark all data point features in the training set as unused and proceed to the next step; otherwise, terminate training.
Randomly fetch from the training set a batch of data point features not yet used in this epoch, together with the similarity relations among these features; the number of data points fetched equals the batch size, and the fetched points are marked as used. The data point features are denoted X, X = {x_1, x_2, ..., x_N}, and the similarity relations between data points are denoted S, S = {s_ij | 1 ≤ i, j ≤ N}, where s_ij = 1 indicates that x_i and x_j are semantically related and s_ij = -1 indicates that x_i and x_j are semantically unrelated. If all data points have already been used in this epoch, return to the previous step.
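The epoch and batch bookkeeping described above can be sketched as follows; the generator below is one reasonable realization assumed for illustration, not the patent's exact procedure, and it assumes X and S are NumPy arrays.

```python
import numpy as np

def iterate_batches(X, S, batch_size, epoch_max):
    """Yield mini-batches of not-yet-used points (with the matching block of S)
    until every point in the training set has been used, then start a new epoch."""
    N = len(X)
    for epoch in range(1, epoch_max + 1):
        order = np.random.permutation(N)              # all points start the epoch unused
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]   # these points are now marked used
            yield epoch, X[batch], S[np.ix_(batch, batch)]
```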
Further, the gradients of the preset loss function with respect to the hash coding layer of the similarity retrieval model are computed. After the gradients are obtained, the standard back-propagation algorithm is used to train the similarity retrieval model; batches of data point features not yet used in the current epoch are fetched repeatedly and the steps of this embodiment are carried out until the trained similarity retrieval model is obtained.
Based on the above embodiment, the preset loss function is:
$$O \triangleq L + \lambda Q;$$
wherein O is the preset loss function, L is the Cauchy cross-entropy loss function, the symbol ≜ denotes a definition, Q is the Cauchy quantization loss function, and λ is the parameter balancing the Cauchy cross-entropy loss and the Cauchy quantization loss.
Based on the above embodiment, the preset loss function is obtained as a linear combination of the Cauchy cross-entropy loss function and the Cauchy quantization loss function.
The Cauchy cross-entropy loss function is given by the following formula:
$$L = \sum_{s_{ij} \in S} w_{ij} \left( s_{ij} \log \frac{d(z_i, z_j)}{\gamma} + \log\!\left(1 + \frac{\gamma}{d(z_i, z_j)}\right) \right)^{2};$$
wherein L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight of the pair (i, j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scaling parameter of the Cauchy distribution;
The normalized Euclidean distance between z_i and z_j is obtained by the following formula:
$$d(z_i, z_j) = \frac{K}{4} \left\| \frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|} \right\|_2^2;$$
wherein d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and ||·|| is the Euclidean norm of a vector;
The Cauchy quantization loss function is given by the following formula:
$$Q = \sum_{i=1}^{N} \log\!\left(1 + \frac{d(|z_i|, \mathbf{1})}{\gamma}\right);$$
wherein Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data points, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scaling parameter of the Cauchy distribution.
Specifically, the Cauchy cross-entropy loss function is used to ensure similarity learning, i.e., to quantify the similarity between low-dimensional feature vectors.
Further, the Cauchy quantization loss function is used to control the quality of the binary codes and the quantizability of the quantized representation.
The high-dimensional data retrieval method based on Cauchy-distribution hashing provided by the present invention preserves, through the Cauchy cross-entropy loss function, the pairwise similarity information contained in the similarity matrix, i.e., the similarity relations between the original data points, which helps train the model better. Through the Cauchy quantization loss function, the feature representation before binarization is pushed closer to 0 and 1, so the information loss caused by binarization is greatly reduced.
Based on the above embodiment, the similarity matrix is an N×N matrix, where N is the total number of data points in the training set, and each element of the matrix is 0 or 1; for any element s_ij of the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar; wherein s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, 0 ≤ j ≤ N.
Specifically, the similarity matrix is constructed as follows: the k most similar data points of each input data point, i.e., its k nearest neighbours, are computed from the input features, where k, the number of neighbours, is an adjustable parameter. For each input data point x_i and each of its k nearest neighbours x_j, s_ij = 1; for each remaining data point x_k, s_ik = 0.
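A minimal sketch of this construction follows, assuming Euclidean distance over the input features; the brute-force nearest-neighbour search and the names are illustrative.

```python
import numpy as np

def build_similarity_matrix(X, k):
    """s_ij = 1 if x_j is among the k nearest neighbours of x_i (by input
    features), and 0 otherwise."""
    N = len(X)
    S = np.zeros((N, N), dtype=np.int8)
    for i in range(N):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # a point is not its own neighbour
        S[i, np.argsort(dists)[:k]] = 1
    return S
```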
Based on the above embodiment, step S1 further includes: S11, inputting the high-dimensional data corresponding to all data points of the database into the trained similarity retrieval model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database; S12, obtaining the inverted index of all data points of the database based on the hash codes corresponding to the high-dimensional data of all data points; S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved and the hash code corresponding to that high-dimensional data.
It should be noted that, in order to obtain a more effective hash function, a fully connected layer with R units is used as the hash module, with the hyperbolic tangent (tanh) as the activation function, producing the fully connected quantized representation z. Finally, z is binarized to obtain the hash code.
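A minimal forward-pass sketch of such a hash module follows, assuming the fully connected layer's weights `W` and bias `b` have already been learned; the names are illustrative, and the sketch omits the network that produces the input features.

```python
import numpy as np

def hash_module(features, W, b):
    """Fully connected hash layer with R units: tanh gives the continuous
    representation z, and thresholding at zero gives the binary hash code."""
    z = np.tanh(features @ W + b)        # (N, R), values in (-1, 1)
    codes = (z >= 0).astype(np.int8)     # binarize; packing the bits into ints is a separate step
    return z, codes
```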
As a preferred embodiment, Fig. 2 is an architecture diagram of a high-dimensional data retrieval system based on Cauchy-distribution hashing in an embodiment of the present invention, and Fig. 3 is an overall flowchart of a high-dimensional data retrieval method based on Cauchy-distribution hashing in an embodiment of the present invention; this embodiment refers to Fig. 2 and Fig. 3.
Compute the similarity relation matrix S.
Let the current training epoch be Epoch_current (initialized to 0) and the maximum number of training epochs be Epoch_max. Update Epoch_current = Epoch_current + 1. If Epoch_current ≤ Epoch_max after the update, training has not yet finished: mark all data point features in the training set as unused and proceed to the next step; otherwise, terminate training.
Randomly fetch from the training set a batch of data point features not yet used in this epoch, together with the similarity relations among these features; the number of data points fetched equals the batch size, and the fetched points are marked as used. The data point features are denoted X, X = {x_1, x_2, ..., x_N}, and the similarity relations between data points are denoted S, S = {s_ij | 1 ≤ i, j ≤ N}, where s_ij = 1 indicates that x_i and x_j are semantically related and s_ij = -1 indicates that x_i and x_j are semantically unrelated. If all data points have already been used in this epoch, return to the previous step; otherwise, proceed to the next step.
Obtain multiple unused high-dimensional data points from the training set; input the multiple unused high-dimensional data points into the hash module to obtain low-dimensional feature vectors better matched to hash coding.
Based on the low-dimensional feature vectors, obtain the corresponding hash codes, and compute the gradient of the preset loss function with respect to the hash coding layer of the similarity retrieval model according to the hash codes.
After the gradients are obtained, the standard back-propagation algorithm is used to train the similarity retrieval model; batches of data point features not yet used in the current epoch are fetched repeatedly and the above steps are carried out until the trained similarity retrieval model is obtained.
Input the high-dimensional data corresponding to the data point to be retrieved into the trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module.
Input the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to the preset value.
Reorder all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and take the list of similar high-dimensional data as the retrieval result.
Based on the above embodiments, Fig. 4 is a module diagram of a high-dimensional data retrieval system based on Cauchy-distribution hashing in an embodiment of the present invention. As shown in Fig. 4, the system includes: a hash-code acquisition module, configured to input the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; a bucket acquisition module, configured to input the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; and a retrieval-result acquisition module, configured to reorder all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, which is taken as the retrieval result.
Based on the above embodiments, Fig. 5 is a structural diagram of an electronic device for high-dimensional data retrieval in an embodiment of the present invention. As shown in Fig. 5, the device includes a memory and a processor, the processor and the memory communicating with each other via a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the retrieval method described in any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Based on the above embodiments, a further embodiment of the present invention provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the retrieval method described in any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Based on the above embodiments, a further embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the retrieval method described in any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
The high-dimensional data retrieval method and system based on Cauchy-distribution hashing provided by the present invention adopt a prune-then-reorder technique: candidates are first pruned with the hash method and then re-ranked using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations. By using the Cauchy cross-entropy loss function, the pairwise similarity information contained in the similarity matrix, i.e., the similarity relations between the original data points, is preserved, which helps train the model better. By using the Cauchy quantization loss function, the feature representation before binarization is pushed closer to 0 and 1, so the information loss caused by binarization is greatly reduced.
On the one hand, the present invention controls the quantization error within the optimization framework and improves the quality of the hash codes, thereby improving the efficiency of pruning; on the other hand, pruning greatly improves retrieval speed. The invention therefore has strong effectiveness and can achieve fast and accurate retrieval.
Finally, the above method of the present invention is only a preferred embodiment and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A high-dimensional data retrieval method, characterized by comprising:
    S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module;
    S2, inputting the hash code into an inverted-index unit of the similarity retrieval model to obtain buckets whose Hamming distance to the hash code is less than or equal to a preset value;
    S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
  2. The retrieval method according to claim 1, characterized in that the trained similarity retrieval model in step S1 is obtained as follows:
    obtaining multiple unused high-dimensional data points from a training set;
    inputting the multiple unused high-dimensional data points into the hash module to obtain low-dimensional feature vectors better matched to hash coding;
    based on the low-dimensional feature vectors, obtaining the hash codes corresponding to the low-dimensional feature vectors, and computing the gradient of a preset loss function with respect to the hash coding layer of the similarity retrieval model according to the hash codes;
    training the similarity retrieval model based on the gradient with respect to its hash coding layer to obtain the trained similarity retrieval model.
  3. The retrieval method according to claim 2, characterized in that the preset loss function is:
    $$O \triangleq L + \lambda Q;$$
    wherein O is the preset loss function, L is the Cauchy cross-entropy loss function, the symbol ≜ denotes a definition, Q is the Cauchy quantization loss function, and λ is the parameter balancing the Cauchy cross-entropy loss and the Cauchy quantization loss.
  4. The retrieval method according to claim 3, characterized in that the preset loss function is obtained as a linear combination of the Cauchy cross-entropy loss function and the Cauchy quantization loss function;
    the Cauchy cross-entropy loss function is given by the following formula:
    $$L = \sum_{s_{ij} \in S} w_{ij} \left( s_{ij} \log \frac{d(z_i, z_j)}{\gamma} + \log\!\left(1 + \frac{\gamma}{d(z_i, z_j)}\right) \right)^{2};$$
    wherein L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight of the pair (i, j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scaling parameter of the Cauchy distribution;
    the normalized Euclidean distance between z_i and z_j is obtained by the following formula:
    $$d(z_i, z_j) = \frac{K}{4} \left\| \frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|} \right\|_2^2;$$
    wherein d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and ||·|| is the Euclidean norm of a vector;
    the Cauchy quantization loss function is given by the following formula:
    $$Q = \sum_{i=1}^{N} \log\!\left(1 + \frac{d(|z_i|, \mathbf{1})}{\gamma}\right);$$
    wherein Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data points, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scaling parameter of the Cauchy distribution.
  5. The retrieval method according to claim 4, characterized in that the similarity matrix is an N×N matrix, N is the total number of data points in the training set, and each element of the matrix is 0 or 1;
    for any element s_ij of the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar;
    wherein s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, 0 ≤ j ≤ N.
  6. The retrieval method according to claim 1, characterized in that step S1 further comprises:
    S11, inputting the high-dimensional data corresponding to all data points of a database into the trained similarity retrieval model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database;
    S12, obtaining an inverted index of all data points of the database based on the hash codes corresponding to the high-dimensional data of all data points;
    S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved, and obtaining the hash code corresponding to that high-dimensional data.
  7. A high-dimensional data retrieval system, characterized by comprising:
    a hash-code acquisition module, configured to input the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity retrieval model is a neural network model and includes a hash coding module;
    a bucket acquisition module, configured to input the hash code into an inverted-index unit of the similarity retrieval model to obtain buckets whose Hamming distance to the hash code is less than or equal to a preset value;
    a retrieval-result acquisition module, configured to reorder all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and to take the list of similar high-dimensional data as the retrieval result.
  8. An electronic device for high-dimensional data retrieval, characterized by comprising a memory and a processor, the processor and the memory communicating with each other via a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the retrieval method according to any one of claims 1 to 6.
  9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the retrieval method according to any one of claims 1 to 6.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the retrieval method according to any one of claims 1 to 6.
CN201711353318.8A 2017-12-15 2017-12-15 High-dimensional data retrieval method and system based on Cauchy-distribution hashing Active CN107992611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711353318.8A CN107992611B (en) 2017-12-15 2017-12-15 High-dimensional data retrieval method and system based on Cauchy-distribution hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711353318.8A CN107992611B (en) 2017-12-15 2017-12-15 High-dimensional data retrieval method and system based on Cauchy-distribution hashing

Publications (2)

Publication Number Publication Date
CN107992611A true CN107992611A (en) 2018-05-04
CN107992611B CN107992611B (en) 2018-12-28

Family

ID=62038530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711353318.8A Active CN107992611B (en) 2017-12-15 2017-12-15 The high dimensional data search method and system of hash method are distributed based on Cauchy

Country Status (1)

Country Link
CN (1) CN107992611B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059206A (en) * 2019-03-29 2019-07-26 银江股份有限公司 A kind of extensive hashing image search method based on depth representative learning
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110489515A (en) * 2019-08-01 2019-11-22 卫盈联信息技术(深圳)有限公司 Method, server and the storage medium of address list retrieval
CN111260101A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Information processing method and device
CN111612079A (en) * 2020-05-22 2020-09-01 深圳前海微众银行股份有限公司 Data right confirming method, equipment and readable storage medium
CN112395438A (en) * 2020-11-05 2021-02-23 华中科技大学 Hash code generation method and system for multi-label image

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120328215A1 (en) * 2006-12-29 2012-12-27 Jm Van Thong Image-based retrieval for high quality visual or acoustic rendering
CN104834748A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Image retrieval method utilizing deep semantic to rank hash codes
CN105279554A (en) * 2015-09-29 2016-01-27 东方网力科技股份有限公司 Depth neural network training method and device based on Hash coding layer
CN106169961A (en) * 2016-09-07 2016-11-30 北京百度网讯科技有限公司 The network parameter processing method and processing device of neutral net based on artificial intelligence
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106503106A (en) * 2016-10-17 2017-03-15 北京工业大学 A kind of image hash index construction method based on deep learning
CN106777349A (en) * 2017-01-16 2017-05-31 广东工业大学 Face retrieval system and method based on deep learning
US20170293838A1 (en) * 2016-04-06 2017-10-12 Nec Laboratories America, Inc. Deep high-order exemplar learning for hashing and fast information retrieval
US20170316287A1 (en) * 2015-06-05 2017-11-02 At&T Intellectual Property I, L.P. Image hash codes generated by a neural network
CN107330074A (en) * 2017-06-30 2017-11-07 中国科学院计算技术研究所 The image search method encoded based on deep learning and Hash

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120328215A1 (en) * 2006-12-29 2012-12-27 Jm Van Thong Image-based retrieval for high quality visual or acoustic rendering
CN104834748A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Image retrieval method utilizing deep semantic to rank hash codes
US20170316287A1 (en) * 2015-06-05 2017-11-02 At&T Intellectual Property I, L.P. Image hash codes generated by a neural network
CN105279554A (en) * 2015-09-29 2016-01-27 东方网力科技股份有限公司 Depth neural network training method and device based on Hash coding layer
US20170293838A1 (en) * 2016-04-06 2017-10-12 Nec Laboratories America, Inc. Deep high-order exemplar learning for hashing and fast information retrieval
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106169961A (en) * 2016-09-07 2016-11-30 北京百度网讯科技有限公司 The network parameter processing method and processing device of neutral net based on artificial intelligence
CN106503106A (en) * 2016-10-17 2017-03-15 北京工业大学 A kind of image hash index construction method based on deep learning
CN106777349A (en) * 2017-01-16 2017-05-31 广东工业大学 Face retrieval system and method based on deep learning
CN107330074A (en) * 2017-06-30 2017-11-07 中国科学院计算技术研究所 The image search method encoded based on deep learning and Hash

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAN ZHU et al.: "Deep Hashing Network for Efficient Similarity Retrieval", Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) *
曹玉东 et al.: "Research on indexing technology for high-dimensional image data based on locality-sensitive hashing", Journal of Liaoning University of Technology (Natural Science Edition) *
王云飞: "Research on similarity search based on latent semantic hashing", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260101A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Information processing method and device
CN111260101B (en) * 2018-11-30 2022-07-08 北京嘀嘀无限科技发展有限公司 Information processing method and device
CN110059206A (en) * 2019-03-29 2019-07-26 银江股份有限公司 A kind of extensive hashing image search method based on depth representative learning
CN110209867A (en) * 2019-06-05 2019-09-06 腾讯科技(深圳)有限公司 Training method, device, equipment and the storage medium of image encrypting algorithm
CN110209867B (en) * 2019-06-05 2023-05-16 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for image retrieval model
CN110489515A (en) * 2019-08-01 2019-11-22 卫盈联信息技术(深圳)有限公司 Method, server and the storage medium of address list retrieval
CN111612079A (en) * 2020-05-22 2020-09-01 深圳前海微众银行股份有限公司 Data right confirming method, equipment and readable storage medium
CN111612079B (en) * 2020-05-22 2021-07-20 深圳前海微众银行股份有限公司 Data right confirming method, equipment and readable storage medium
CN112395438A (en) * 2020-11-05 2021-02-23 华中科技大学 Hash code generation method and system for multi-label image

Also Published As

Publication number Publication date
CN107992611B (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN107992611A (en) The high dimensional data search method and system of hash method are distributed based on Cauchy
CN109657246B (en) Method for establishing extraction type machine reading understanding model based on deep learning
CN109635936A (en) A kind of neural networks pruning quantization method based on retraining
CN107943938A (en) A kind of large-scale image similar to search method and system quantified based on depth product
Cao et al. Automatic selection of t-SNE perplexity
CN109558477A (en) A kind of community&#39;s question answering system, method and electronic equipment based on multi-task learning
CN104598611B (en) The method and system being ranked up to search entry
CN104156433B (en) Image retrieval method based on semantic mapping space construction
CN110457514A (en) A kind of multi-tag image search method based on depth Hash
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN115080801B (en) Cross-modal retrieval method and system based on federal learning and data binary representation
CN111930887A (en) Multi-document multi-answer machine reading understanding system based on joint training mode
CN109063112A (en) A kind of fast image retrieval method based on multi-task learning deep semantic Hash, model and model building method
CN110019652A (en) A kind of cross-module state Hash search method based on deep learning
CN109740158A (en) Text semantic parsing method and device
CN118394890B (en) Knowledge retrieval enhancement generation method and system based on large language model
CN108228674A (en) A kind of information processing method and device based on DKT
CN112487020A (en) Method and system for converting graph of SQL to text into natural language statement
He et al. Directed acyclic graphs-based diagnosis approach using small data sets for sustainability
Yang et al. Endowing Interpretability for Neural Cognitive Diagnosis by Efficient Kolmogorov-Arnold Networks
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN117609577A (en) Employment recommendation processing method and system based on artificial intelligence
CN116797832A (en) Stropharia rugoso-annulata hierarchical detection method based on mixed deep learning model
Yang et al. Evolutionary multi-objective neural architecture search for generalized cognitive diagnosis models
CN112632406B (en) Query method, query device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant