CN107992611A - The high dimensional data search method and system of hash method are distributed based on Cauchy - Google Patents
- Publication number
- CN107992611A CN107992611A CN201711353318.8A CN201711353318A CN107992611A CN 107992611 A CN107992611 A CN 107992611A CN 201711353318 A CN201711353318 A CN 201711353318A CN 107992611 A CN107992611 A CN 107992611A
- Authority
- CN
- China
- Prior art keywords
- similar
- high dimensional data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Software Systems (AREA)
Abstract
The present invention provides a high-dimensional data retrieval method and system based on a Cauchy-distribution hashing method. The retrieval method includes: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking this list as the retrieval result. The present invention first prunes candidates with the hashing method and then re-sorts them using the original features, which both accelerates retrieval and preserves retrieval precision. The present invention enables fast and accurate retrieval.
Description
Technical field
The present invention relates to the field of big-data analysis, and in particular to a high-dimensional data retrieval method and system based on a Cauchy-distribution hashing method.
Background technology
In the Internet era, multimedia resources on the Internet grow continuously, and finding relevant data quickly and effectively in large-scale collections is a major challenge in both time and space. With the rapid development of the Internet, large-volume, high-dimensional image big data has become increasingly common in search engines and social networks and has attracted growing attention. How to perform image retrieval and analysis quickly and effectively is therefore an urgent problem, and approximate nearest-neighbor search arose precisely to address it; guaranteeing computational efficiency and search quality at the same time is the key to approximate nearest-neighbor search. A very common and effective solution to this problem is hashing, i.e., converting high-dimensional data into compact binary codes and generating similar binary codes for similar data. The present invention focuses on data-dependent hashing methods, which have been shown to be more effective than data-independent hashing methods (such as locality-sensitive hashing).
In the past, many hashing methods based on Hamming distance have been proposed. These methods fall into two classes: unsupervised hashing and supervised hashing. Unsupervised methods are more general, since they can be trained without semantic labels or data correlations. Accordingly, many unsupervised hashing methods have been proposed to learn compact binary codes for efficient Hamming-space image retrieval.
However, most existing methods are designed for data compression rather than candidate pruning, i.e., they maximize retrieval performance under a linear scan over hash codes. Since the cost of a linear scan remains high even with hash codes, most existing hashing methods deviate to some extent from the original goal, namely maximal acceleration at an acceptable precision. With the advent of powerful hashing methods suited to linear scan, attention should now turn to Hamming-space retrieval, which enables the most efficient constant-time search. In Hamming-space retrieval, hash-table lookup replaces linear scan, and the data points within a given Hamming radius are returned for each query. However, the loss functions of existing hashing methods are not well suited to this setting: they usually lack the ability to concentrate related images within a small Hamming ball, and may therefore be inappropriate for Hamming-space retrieval.
Summary of the invention
The present invention provides a high-dimensional data retrieval method and system based on a Cauchy-distribution hashing method to overcome the above problems.
According to an aspect of the present invention, there is provided a high-dimensional data retrieval method based on a Cauchy-distribution hashing method, including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Preferably, the trained similarity retrieval model in step S1 is obtained as follows: obtaining multiple unused high-dimensional data from a training set; inputting the multiple unused high-dimensional data into the hash module to obtain low-dimensional feature vectors better matched to hash coding; based on the low-dimensional feature vectors, obtaining the hash codes corresponding to the low-dimensional feature vectors, and computing, from these hash codes, the gradient of a preset loss function with respect to the hash coding layer of the similarity retrieval model; and training the similarity retrieval model based on this gradient to obtain the trained similarity retrieval model.
Preferably, the preset loss function is:

O ≜ L + λQ

where O is the preset loss function, L is the Cauchy cross-entropy loss function, ≜ denotes "defined as", Q is the Cauchy quantization loss function, and λ is a parameter that balances the Cauchy cross-entropy loss against the Cauchy quantization loss.
Preferably, the preset loss function is obtained as a linear combination of the Cauchy cross-entropy loss function and the Cauchy quantization loss function.

The Cauchy cross-entropy loss function is given by:

L = Σ_{s_ij ∈ S} w_ij ( s_ij · log( d(z_i, z_j)/γ ) + log( 1 + γ/d(z_i, z_j) ) )

where L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight of the pair (z_i, z_j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scale parameter of the Cauchy distribution.

The normalized Euclidean distance between z_i and z_j is obtained by:

d(z_i, z_j) = (K/2) · ( 1 − z_i · z_j / ( ||z_i|| ||z_j|| ) )

where d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and || · || is the Euclidean norm of a vector.

The Cauchy quantization loss function is given by:

Q = Σ_{i=1}^{N} log( 1 + d(|z_i|, 1)/γ )

where Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scale parameter of the Cauchy distribution.
Preferably, the similarity matrix is an N×N matrix, where N is the total number of data points in the training set, and each element of the matrix is 0 or 1. For any element s_ij of the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar, where s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, and 0 ≤ j ≤ N.
Preferably, step S1 further comprises: S11, inputting the high-dimensional data corresponding to all data points of a database into the trained similarity retrieval model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database; S12, based on the hash codes corresponding to the high-dimensional data of all data points, building an inverted index of all data points of the database; and S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved and the hash code corresponding to that high-dimensional data.
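A minimal sketch of the inverted index of steps S11–S13, assuming hash codes packed into Python integers and probing by enumerating every code within the Hamming ball (radius 2, the preferred preset value from the description); all names here are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def build_inverted_index(codes):
    """codes: list of K-bit codes as ints; returns code -> list of point ids."""
    index = defaultdict(list)
    for point_id, code in enumerate(codes):
        index[code].append(point_id)
    return index

def probe(index, query_code, K, radius=2):
    """Return all point ids stored in buckets within Hamming radius of the query."""
    hits = list(index.get(query_code, []))
    for r in range(1, radius + 1):
        for flip in combinations(range(K), r):   # choose r bit positions to flip
            mask = 0
            for b in flip:
                mask |= 1 << b
            hits.extend(index.get(query_code ^ mask, []))
    return hits

index = build_inverted_index([0b0000, 0b0001, 0b0111, 0b1111])
```

With radius 2, a query code of `0b0000` reaches the buckets of points 0 and 1 but not the more distant codes; enlarging the radius admits more buckets at the cost of more probes.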
According to another aspect of the present invention, there is provided a high-dimensional data retrieval system based on a Cauchy-distribution hashing method, including: a hash-code acquisition module, for inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; a bucket acquisition module, for inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; and a retrieval-result acquisition module, for reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
According to a further aspect of the present invention, there is provided an electronic device for high-dimensional data retrieval, including a memory and a processor, the processor and the memory communicating with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the retrieval method of any of the above.
According to a further aspect of the present invention, there is provided a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the retrieval method of any of the above.
According to still another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program that, when executed by a processor, implements the retrieval method of any of the above.
The high-dimensional data retrieval method and system based on a Cauchy-distribution hashing method provided by the present invention use a prune-and-reorder technique: candidates are first pruned with the hashing method and then re-sorted using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations. The present invention enables fast and accurate retrieval.
Brief description of the drawings
Fig. 1 is a flowchart of a high-dimensional data retrieval method based on a Cauchy-distribution hashing method in an embodiment of the present invention;
Fig. 2 is an architecture diagram of a high-dimensional data retrieval system based on a Cauchy-distribution hashing method in an embodiment of the present invention;
Fig. 3 is an overall flowchart of a high-dimensional data retrieval method based on a Cauchy-distribution hashing method in an embodiment of the present invention;
Fig. 4 is a module diagram of a high-dimensional data retrieval system based on a Cauchy-distribution hashing method in an embodiment of the present invention;
Fig. 5 is a structural diagram of an electronic device for high-dimensional data retrieval in an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the present invention but do not limit its scope.
Fig. 1 is a flowchart of a high-dimensional data retrieval method based on a Cauchy-distribution hashing method in an embodiment of the present invention. As shown in Fig. 1, the method includes: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Specifically, the high-dimensional data in the embodiment of the present invention can be shallow features or deep features generated by a deep neural network.
Further, the preset value is preferably 2.
It should be noted that the reordering of all high-dimensional data in the buckets is carried out according to the distance between the high-dimensional data corresponding to each data point in the buckets and the high-dimensional data of the query.
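The reordering step described above can be sketched as follows, assuming Euclidean distance on the original high-dimensional features; the helper `rerank` and the toy data are illustrative, not part of the patent:

```python
import numpy as np

def rerank(query_feature, candidate_ids, features):
    """Sort candidate ids by ascending distance to the query in feature space."""
    dists = np.linalg.norm(features[candidate_ids] - query_feature, axis=1)
    order = np.argsort(dists)
    return [candidate_ids[i] for i in order]

# three database points and a query close to point 2
features = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
ranked = rerank(np.array([0.9, 0.1]), [0, 1, 2], features)
```

Here point 2 at (1, 0) is nearest to the query (0.9, 0.1), so it is ranked first even though the candidate list arrived in bucket order.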
Further, the hash code is a binary code; it can also be understood as the binary feature corresponding to the high-dimensional data.
The high-dimensional data retrieval method based on a Cauchy-distribution hashing method provided by the present invention uses a prune-and-reorder technique: candidates are first pruned with the hashing method and then re-sorted using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations.
Based on the above embodiment, the trained similarity retrieval model in step S1 is obtained as follows: obtaining multiple unused high-dimensional data from the training set; inputting the multiple unused high-dimensional data into the hash module to obtain low-dimensional feature vectors better matched to hash coding; based on the low-dimensional feature vectors, obtaining the hash codes corresponding to the low-dimensional feature vectors, and computing, from these hash codes, the gradient of the preset loss function with respect to the hash coding layer of the similarity retrieval model; and training the similarity retrieval model based on this gradient to obtain the trained similarity retrieval model.
Further, the following steps are performed before obtaining multiple unused high-dimensional data from the training set and inputting them into the hash module to obtain low-dimensional feature vectors better matched to hash coding:

Compute the correlation matrix S.

Denote the current training epoch by Epoch_current (initial value 0) and the maximum number of epochs by Epoch_max. Update Epoch_current = Epoch_current + 1; if Epoch_current ≤ Epoch_max after the update, training has not yet finished: mark all data-point features in the training set as unused and proceed to the next step; otherwise terminate training.

Randomly obtain from the training set a batch of data-point features not yet used in the current epoch, together with the correlations among these features; the number of data points equals the batch size, and the obtained points are marked as used. The data-point features are denoted X, X = {x_1, x_2, ..., x_N}, and the correlations between data points are denoted S, S = {s_ij | 1 ≤ i, j ≤ N}, where s_ij = 1 indicates that x_i and x_j are semantically related and s_ij = -1 indicates that x_i and x_j are semantically unrelated. If all data points have already been used in the current epoch, return to the previous step.
Further, the gradients of the preset loss function with respect to the hash coding layer of the similarity retrieval model include ∂L/∂z_i and ∂Q/∂z_i. After the gradients are obtained, the similarity retrieval model is trained with the standard back-propagation algorithm; batches of data-point features not yet used in the current epoch are obtained repeatedly and the steps of this embodiment are carried out, until the trained similarity retrieval model is obtained.
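The epoch and batch bookkeeping described above can be sketched schematically as follows, with `take_gradient_step` standing in for one back-propagation update of the similarity retrieval model; the function and variable names are illustrative assumptions:

```python
import random

def train(num_points, batch_size, epoch_max, take_gradient_step):
    """Run epoch_max epochs; each epoch consumes every point once in random batches."""
    epoch_current = 0
    steps = 0
    while epoch_current < epoch_max:
        epoch_current += 1                        # update the epoch counter
        unused = list(range(num_points))          # mark all points as unused
        random.shuffle(unused)
        while unused:                             # batches of still-unused points
            batch = unused[:batch_size]
            unused = unused[batch_size:]          # mark the batch as used
            take_gradient_step(batch)             # back-propagation update stand-in
            steps += 1
    return steps

n_steps = train(num_points=10, batch_size=4, epoch_max=3,
                take_gradient_step=lambda batch: None)
```

With 10 points, batches of 4, and 3 epochs, each epoch performs 3 gradient steps (4 + 4 + 2 points), for 9 steps in total.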
Based on the above embodiment, the preset loss function is:

O ≜ L + λQ

where O is the preset loss function, L is the Cauchy cross-entropy loss function, ≜ denotes "defined as", Q is the Cauchy quantization loss function, and λ is a parameter that balances the Cauchy cross-entropy loss against the Cauchy quantization loss.
Based on the above embodiment, the preset loss function is obtained as a linear combination of the Cauchy cross-entropy loss function and the Cauchy quantization loss function.

The Cauchy cross-entropy loss function is given by:

L = Σ_{s_ij ∈ S} w_ij ( s_ij · log( d(z_i, z_j)/γ ) + log( 1 + γ/d(z_i, z_j) ) )

where L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight of the pair (z_i, z_j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scale parameter of the Cauchy distribution.

The normalized Euclidean distance between z_i and z_j is obtained by:

d(z_i, z_j) = (K/2) · ( 1 − z_i · z_j / ( ||z_i|| ||z_j|| ) )

where d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and || · || is the Euclidean norm of a vector.

The Cauchy quantization loss function is given by:

Q = Σ_{i=1}^{N} log( 1 + d(|z_i|, 1)/γ )

where Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scale parameter of the Cauchy distribution.
Specifically, the Cauchy cross-entropy loss function is used to ensure similarity learning, i.e., to quantify the similarity between low-dimensional feature vectors.
Further, the Cauchy quantization loss function is used to control the quality of the binary codes, i.e., the quantization error of the quantized representation.
In the high-dimensional data retrieval method based on a Cauchy-distribution hashing method provided by the present invention, the Cauchy cross-entropy loss function preserves the pairwise similarity information contained in the similarity matrix, i.e., the similarity relationships between the raw data points, which helps train the model better. The Cauchy quantization loss function pushes the feature representation before binarization closer to ±1, so that the information loss caused by binarization is greatly reduced.
Based on the above embodiment, the similarity matrix is an N×N matrix, where N is the total number of data points in the training set, and each element of the matrix is 0 or 1. For any element s_ij of the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar, where s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, and 0 ≤ j ≤ N.
Specifically, the similarity matrix is constructed as follows: the k most similar data points of each input data point, i.e., its k nearest neighbors, are computed from the input features, where k is the number of neighbors, an adjustable parameter. For each input data point x_i and each of its k nearest neighbors x_j, set s_ij = 1; for every other data point x_k, set s_ik = 0.
Based on the above embodiment, step S1 further comprises: S11, inputting the high-dimensional data corresponding to all data points of the database into the trained similarity retrieval model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database; S12, based on the hash codes corresponding to the high-dimensional data of all data points, building the inverted index of all data points of the database; and S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved and the hash code corresponding to that high-dimensional data.
It should be noted that, in order to obtain a more effective hash function, a fully connected quantization layer with R units is used as the hash module, with the hyperbolic tangent (tanh) as the activation function, yielding the fully connected quantized representation z. Finally, z is binarized to obtain the hash code.
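A sketch of the forward pass of such a hash module, with random weights standing in for trained parameters; `D`, `R`, and all names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 32, 16                         # input dimension, number of hash units
W = rng.normal(scale=0.1, size=(D, R))  # stand-in for trained layer weights
b = np.zeros(R)

def hash_forward(x):
    """Fully connected layer -> tanh -> sign binarization."""
    z = np.tanh(x @ W + b)            # continuous quantized representation in (-1, 1)
    h = np.where(z >= 0, 1, -1)       # binarized hash code in {-1, +1}
    return z, h

z, h = hash_forward(rng.normal(size=D))
```

The tanh activation keeps each entry of z strictly inside (-1, 1), and the Cauchy quantization loss described earlier encourages those entries toward ±1 so that the final sign binarization loses little information.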
As a preferred embodiment, Fig. 2 is an architecture diagram of a high-dimensional data retrieval system based on a Cauchy-distribution hashing method in an embodiment of the present invention, and Fig. 3 is an overall flowchart of a high-dimensional data retrieval method based on a Cauchy-distribution hashing method in an embodiment of the present invention; the present embodiment refers to Fig. 2 and Fig. 3.
Compute the correlation matrix S.

Denote the current training epoch by Epoch_current (initial value 0) and the maximum number of epochs by Epoch_max. Update Epoch_current = Epoch_current + 1; if Epoch_current ≤ Epoch_max after the update, training has not yet finished: mark all data-point features in the training set as unused and proceed to the next step; otherwise terminate training.

Randomly obtain from the training set a batch of data-point features not yet used in the current epoch, together with the correlations among these features; the number of data points equals the batch size, and the obtained points are marked as used. The data-point features are denoted X, X = {x_1, x_2, ..., x_N}, and the correlations between data points are denoted S, S = {s_ij | 1 ≤ i, j ≤ N}, where s_ij = 1 indicates that x_i and x_j are semantically related and s_ij = -1 indicates that x_i and x_j are semantically unrelated. If all data points have already been used in the current epoch, return to the previous step; otherwise proceed to the next step.
Obtain multiple unused high-dimensional data from the training set; input the multiple unused high-dimensional data into the hash module to obtain low-dimensional feature vectors better matched to hash coding.

Based on the low-dimensional feature vectors, obtain the hash codes corresponding to the low-dimensional feature vectors, and compute, from these hash codes, the gradient of the preset loss function with respect to the hash coding layer of the similarity retrieval model.

After the gradients are obtained, train the similarity retrieval model with the standard back-propagation algorithm, repeatedly obtaining batches of data-point features not yet used in the current epoch and carrying out the above steps, until the trained similarity retrieval model is obtained.
Input the high-dimensional data corresponding to the data point to be retrieved into the trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module.

Input the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to the preset value.

Reorder all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and take the list of similar high-dimensional data as the retrieval result.
Based on the above embodiment, Fig. 4 is a module diagram of a high-dimensional data retrieval system based on a Cauchy-distribution hashing method in an embodiment of the present invention. As shown in Fig. 4, the system includes: a hash-code acquisition module, for inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; a bucket acquisition module, for inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; and a retrieval-result acquisition module, for reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Based on the above embodiment, Fig. 5 is a structural diagram of an electronic device for high-dimensional data retrieval in an embodiment of the present invention. As shown in Fig. 5, the device includes a memory and a processor, the processor and the memory communicating with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the retrieval method of any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Based on the above embodiment, a further embodiment of the present invention provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the retrieval method of any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
Based on the above embodiment, a further embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the retrieval method of any of the above embodiments, for example including: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity retrieval model to obtain the hash code corresponding to the data point to be retrieved, where the similarity retrieval model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity retrieval model to obtain the buckets whose Hamming distance from the hash code is less than or equal to a preset value; S3, reordering all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
The high-dimensional data retrieval method and system based on a Cauchy-distribution hashing method provided by the present invention use a prune-and-reorder technique: candidates are first pruned with the hashing method and then re-sorted using the original features, which both accelerates retrieval and preserves retrieval precision. Moreover, the retrieval method provided by the present invention is unsupervised and can be trained without semantic labels or data correlations. The Cauchy cross-entropy loss function preserves the pairwise similarity information contained in the similarity matrix, i.e., the similarity relationships between the raw data points, which helps train the model better. The Cauchy quantization loss function pushes the feature representation before binarization closer to ±1, so that the information loss caused by binarization is greatly reduced.
In one aspect, the present invention controls the quantization error within the optimization framework and improves the quality of the hash codes, thereby improving the efficiency of pruning; in another aspect, pruning greatly improves retrieval speed. The present invention therefore has strong effectiveness and enables fast and accurate retrieval.
Finally, the methods of the present invention are only preferred embodiments and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
- 1. A high-dimensional data search method, characterized by comprising: S1, inputting the high-dimensional data corresponding to a data point to be retrieved into a trained similarity-search model to obtain the hash code corresponding to the data point to be retrieved, wherein the similarity-search model is a neural network model that includes a hash coding module; S2, inputting the hash code into the inverted-index unit of the similarity-search model to obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; S3, re-ranking all high-dimensional data in the buckets to obtain a list of similar high-dimensional data, and taking the list of similar high-dimensional data as the retrieval result.
- 2. The search method according to claim 1, characterized in that the trained similarity-search model in step S1 is obtained as follows: obtaining a plurality of unused high-dimensional data from a training set; inputting the plurality of unused high-dimensional data into the hash module to obtain low-dimensional feature vectors better matched to hash coding; based on the low-dimensional feature vectors, obtaining the hash codes corresponding to the low-dimensional feature vectors, and computing, according to those hash codes, the gradient of a set loss function with respect to the hash coding layer of the similarity-search model; training the similarity-search model based on the gradient with respect to its hash coding layer to obtain the trained similarity-search model.
- 3. The search method according to claim 2, characterized in that the set loss function is:

$$O \triangleq L + \lambda Q;$$

wherein O is the set loss function, L is the Cauchy cross-entropy loss function, $\triangleq$ denotes a definition, Q is the Cauchy quantization loss function, and λ is a parameter that balances the weights of the Cauchy cross-entropy loss and the Cauchy quantization loss.
- 4. The search method according to claim 3, characterized in that the set loss function is obtained by linearly combining the Cauchy cross-entropy loss function and the Cauchy quantization loss function;

the Cauchy cross-entropy loss function is expressed by the following formula:

$$L = \sum_{s_{ij} \in S} w_{ij} \left( s_{ij} \log \frac{d(z_i, z_j)}{\gamma} + \log\!\left(1 + \frac{\gamma}{d(z_i, z_j)}\right) \right)^{2};$$

wherein L is the Cauchy cross-entropy loss function, S is the similarity matrix, s_ij is the element in row i and column j of the similarity matrix, 0 ≤ i ≤ N, 0 ≤ j ≤ N, N is the total number of low-dimensional feature vectors, w_ij is the weight assigned to the pair (i, j), z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, and γ is the scale parameter of the Cauchy distribution;

the normalized Euclidean distance between z_i and z_j is obtained by the following formula:

$$d(z_i, z_j) = \frac{K}{4} \left\| \frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|} \right\|_2^2;$$

wherein d(z_i, z_j) is the normalized Euclidean distance between z_i and z_j, z_i is the i-th low-dimensional feature vector, z_j is the j-th low-dimensional feature vector, K is the number of bits of the hash code, and ||·|| is the Euclidean norm of a vector;

the Cauchy quantization loss function is expressed by the following formula:

$$Q = \sum_{i=1}^{N} \log\!\left(1 + \frac{d(|z_i|, \mathbf{1})}{\gamma}\right);$$

wherein Q is the Cauchy quantization loss function, 0 ≤ i ≤ N, N is the number of training data, z_i is the i-th low-dimensional feature vector, d(|z_i|, 1) is the normalized Euclidean distance between |z_i| and the all-ones vector 1, and γ is the scale parameter of the Cauchy distribution.
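A NumPy sketch of the two loss terms of claim 4, illustrative only: the patent evaluates these inside a neural network during training, and the weights w_ij are assumed uniform here:

```python
import numpy as np

def normalized_dist(zi, zj, K):
    # d(z_i, z_j) = (K/4) * || z_i/||z_i|| - z_j/||z_j|| ||_2^2
    u = zi / np.linalg.norm(zi)
    v = zj / np.linalg.norm(zj)
    return (K / 4.0) * float(np.sum((u - v) ** 2))

def cauchy_cross_entropy(Z, S, gamma, W=None):
    # L = sum_ij w_ij * (s_ij*log(d/gamma) + log(1 + gamma/d))^2
    N, K = Z.shape
    W = np.ones((N, N)) if W is None else W
    total = 0.0
    for i in range(N):
        for j in range(N):
            d = normalized_dist(Z[i], Z[j], K)
            if d == 0.0:
                continue  # skip identical pairs to avoid log(0)
            total += W[i, j] * (S[i, j] * np.log(d / gamma)
                                + np.log(1.0 + gamma / d)) ** 2
    return total

def cauchy_quantization(Z, gamma):
    # Q = sum_i log(1 + d(|z_i|, 1)/gamma); pulls each |z_i| toward 1
    N, K = Z.shape
    ones = np.ones(K)
    return float(sum(np.log(1.0 + normalized_dist(np.abs(z), ones, K) / gamma)
                     for z in Z))
```

For similar pairs (s_ij = 1) the first term rewards small distances d; for dissimilar pairs only the second term acts, rewarding large distances. A code already at ±1 in every component contributes zero quantization loss.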
- 5. The search method according to claim 4, characterized in that the similarity matrix is an N×N matrix, N is the total number of data points in the training set, and each element of the matrix is 0 or 1; for any element s_ij in the matrix, s_ij = 1 indicates that data point x_i and data point x_j are similar, and s_ij = 0 indicates that data point x_i and data point x_j are dissimilar; wherein s_ij is the element in row i and column j of the similarity matrix, x_i is the i-th data point in the training set, x_j is the j-th data point in the training set, 0 ≤ i ≤ N, 0 ≤ j ≤ N.
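A minimal sketch of such a binary similarity matrix; here "similar" is assumed to mean sharing a label, purely for illustration, since the patent's own method is unsupervised and would derive s_ij without labels:

```python
import numpy as np

def similarity_matrix(labels):
    """Build the N x N binary matrix of claim 5:
    s_ij = 1 if points i and j are similar, else 0."""
    labels = np.asarray(labels)
    # Broadcasting compares every pair of labels at once
    return (labels[:, None] == labels[None, :]).astype(np.uint8)
```

The matrix is symmetric with a diagonal of ones, matching the pairwise definition above.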
- 6. The search method according to claim 1, characterized in that step S1 further comprises: S11, inputting the high-dimensional data corresponding to all data points of a database into the trained similarity-search model to obtain the hash codes corresponding to the high-dimensional data of all data points of the database; S12, based on the hash codes corresponding to the high-dimensional data of all data points, building an inverted index of all data points of the database; S13, based on the inverted index of all data points of the database, obtaining the high-dimensional data corresponding to the data point to be retrieved, and obtaining the hash code corresponding to that high-dimensional data.
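Steps S11 and S12 amount to grouping database points by their hash code, one bucket per code; a minimal sketch (the `codes` list stands in for the output of the trained hash coding module, which is not modeled here):

```python
from collections import defaultdict

def build_inverted_index(codes):
    """S12: map each hash code (as a tuple of bits) to the list of
    database point ids that share it - one bucket per distinct code."""
    index = defaultdict(list)
    for point_id, code in enumerate(codes):
        index[tuple(code)].append(point_id)
    return dict(index)

# S11 would produce `codes` by running every database point through
# the trained hash coding module; here they are given directly.
codes = [[0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
index = build_inverted_index(codes)
```

At query time, only the buckets whose key is within the preset Hamming radius of the query's code need to be opened.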
- 7. A high-dimensional data search system, characterized by comprising: a hash-code acquisition module, configured to input the high-dimensional data corresponding to a data point to be retrieved into a trained similarity-search model and obtain the hash code corresponding to the data point to be retrieved, wherein the similarity-search model is a neural network model that includes a hash coding module; a bucket acquisition module, configured to input the hash code into the inverted-index unit of the similarity-search model and obtain the buckets whose Hamming distance to the hash code is less than or equal to a preset value; a retrieval-result module, configured to re-rank all high-dimensional data in the buckets, obtain a list of similar high-dimensional data, and take the list of similar high-dimensional data as the retrieval result.
- 8. An electronic device for high-dimensional data retrieval, characterized by comprising a memory and a processor, wherein the processor and the memory communicate with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the search method according to any one of claims 1 to 6.
- 9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the search method according to any one of claims 1 to 6.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the search method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711353318.8A CN107992611B (en) | 2017-12-15 | 2017-12-15 | High-dimensional data search method and system based on Cauchy-distribution hashing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711353318.8A CN107992611B (en) | 2017-12-15 | 2017-12-15 | High-dimensional data search method and system based on Cauchy-distribution hashing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107992611A true CN107992611A (en) | 2018-05-04 |
CN107992611B CN107992611B (en) | 2018-12-28 |
Family
ID=62038530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711353318.8A Active CN107992611B (en) | 2017-12-15 | 2017-12-15 | High-dimensional data search method and system based on Cauchy-distribution hashing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107992611B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059206A (en) * | 2019-03-29 | 2019-07-26 | 银江股份有限公司 | A kind of extensive hashing image search method based on depth representative learning |
CN110209867A (en) * | 2019-06-05 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and the storage medium of image encrypting algorithm |
CN110489515A (en) * | 2019-08-01 | 2019-11-22 | 卫盈联信息技术(深圳)有限公司 | Method, server and the storage medium of address list retrieval |
CN111260101A (en) * | 2018-11-30 | 2020-06-09 | 北京嘀嘀无限科技发展有限公司 | Information processing method and device |
CN111612079A (en) * | 2020-05-22 | 2020-09-01 | 深圳前海微众银行股份有限公司 | Data right confirming method, equipment and readable storage medium |
CN112395438A (en) * | 2020-11-05 | 2021-02-23 | 华中科技大学 | Hash code generation method and system for multi-label image |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120328215A1 (en) * | 2006-12-29 | 2012-12-27 | Jm Van Thong | Image-based retrieval for high quality visual or acoustic rendering |
CN104834748A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Image retrieval method utilizing deep semantic to rank hash codes |
CN105279554A (en) * | 2015-09-29 | 2016-01-27 | 东方网力科技股份有限公司 | Depth neural network training method and device based on Hash coding layer |
CN106169961A (en) * | 2016-09-07 | 2016-11-30 | 北京百度网讯科技有限公司 | The network parameter processing method and processing device of neutral net based on artificial intelligence |
CN106407352A (en) * | 2016-09-06 | 2017-02-15 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Traffic image retrieval method based on depth learning |
CN106503106A (en) * | 2016-10-17 | 2017-03-15 | 北京工业大学 | A kind of image hash index construction method based on deep learning |
CN106777349A (en) * | 2017-01-16 | 2017-05-31 | 广东工业大学 | Face retrieval system and method based on deep learning |
US20170293838A1 (en) * | 2016-04-06 | 2017-10-12 | Nec Laboratories America, Inc. | Deep high-order exemplar learning for hashing and fast information retrieval |
US20170316287A1 (en) * | 2015-06-05 | 2017-11-02 | At&T Intellectual Property I, L.P. | Image hash codes generated by a neural network |
CN107330074A (en) * | 2017-06-30 | 2017-11-07 | 中国科学院计算技术研究所 | The image search method encoded based on deep learning and Hash |
- 2017-12-15 CN CN201711353318.8A patent/CN107992611B/en active Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120328215A1 (en) * | 2006-12-29 | 2012-12-27 | Jm Van Thong | Image-based retrieval for high quality visual or acoustic rendering |
CN104834748A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Image retrieval method utilizing deep semantic to rank hash codes |
US20170316287A1 (en) * | 2015-06-05 | 2017-11-02 | At&T Intellectual Property I, L.P. | Image hash codes generated by a neural network |
CN105279554A (en) * | 2015-09-29 | 2016-01-27 | 东方网力科技股份有限公司 | Depth neural network training method and device based on Hash coding layer |
US20170293838A1 (en) * | 2016-04-06 | 2017-10-12 | Nec Laboratories America, Inc. | Deep high-order exemplar learning for hashing and fast information retrieval |
CN106407352A (en) * | 2016-09-06 | 2017-02-15 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Traffic image retrieval method based on depth learning |
CN106169961A (en) * | 2016-09-07 | 2016-11-30 | 北京百度网讯科技有限公司 | The network parameter processing method and processing device of neutral net based on artificial intelligence |
CN106503106A (en) * | 2016-10-17 | 2017-03-15 | 北京工业大学 | A kind of image hash index construction method based on deep learning |
CN106777349A (en) * | 2017-01-16 | 2017-05-31 | 广东工业大学 | Face retrieval system and method based on deep learning |
CN107330074A (en) * | 2017-06-30 | 2017-11-07 | 中国科学院计算技术研究所 | The image search method encoded based on deep learning and Hash |
Non-Patent Citations (3)
Title |
---|
HAN ZHU et al.: "Deep Hashing Network for Efficient Similarity Retrieval", Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) * |
CAO Yudong et al.: "Research on Image High-Dimensional Data Indexing Based on Locality-Sensitive Hashing", Journal of Liaoning University of Technology (Natural Science Edition) * |
WANG Yunfei: "Research on Similarity Search Based on Latent Semantic Hashing", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111260101A (en) * | 2018-11-30 | 2020-06-09 | 北京嘀嘀无限科技发展有限公司 | Information processing method and device |
CN111260101B (en) * | 2018-11-30 | 2022-07-08 | 北京嘀嘀无限科技发展有限公司 | Information processing method and device |
CN110059206A (en) * | 2019-03-29 | 2019-07-26 | 银江股份有限公司 | A kind of extensive hashing image search method based on depth representative learning |
CN110209867A (en) * | 2019-06-05 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and the storage medium of image encrypting algorithm |
CN110209867B (en) * | 2019-06-05 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and storage medium for image retrieval model |
CN110489515A (en) * | 2019-08-01 | 2019-11-22 | 卫盈联信息技术(深圳)有限公司 | Method, server and the storage medium of address list retrieval |
CN111612079A (en) * | 2020-05-22 | 2020-09-01 | 深圳前海微众银行股份有限公司 | Data right confirming method, equipment and readable storage medium |
CN111612079B (en) * | 2020-05-22 | 2021-07-20 | 深圳前海微众银行股份有限公司 | Data right confirming method, equipment and readable storage medium |
CN112395438A (en) * | 2020-11-05 | 2021-02-23 | 华中科技大学 | Hash code generation method and system for multi-label image |
Also Published As
Publication number | Publication date |
---|---|
CN107992611B (en) | 2018-12-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992611A (en) | High-dimensional data search method and system based on Cauchy-distribution hashing | |
CN109657246B (en) | Method for establishing extraction type machine reading understanding model based on deep learning | |
CN109635936A (en) | A kind of neural networks pruning quantization method based on retraining | |
CN107943938A (en) | A kind of large-scale image similar to search method and system quantified based on depth product | |
Cao et al. | Automatic selection of t-SNE perplexity | |
CN109558477A (en) | A kind of community's question answering system, method and electronic equipment based on multi-task learning | |
CN104598611B (en) | The method and system being ranked up to search entry | |
CN104156433B (en) | Image retrieval method based on semantic mapping space construction | |
CN110457514A (en) | A kind of multi-tag image search method based on depth Hash | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN115080801B (en) | Cross-modal retrieval method and system based on federal learning and data binary representation | |
CN111930887A (en) | Multi-document multi-answer machine reading understanding system based on joint training mode | |
CN109063112A (en) | A kind of fast image retrieval method based on multi-task learning deep semantic Hash, model and model building method | |
CN110019652A (en) | A kind of cross-module state Hash search method based on deep learning | |
CN109740158A (en) | Text semantic parsing method and device | |
CN118394890B (en) | Knowledge retrieval enhancement generation method and system based on large language model | |
CN108228674A (en) | A kind of information processing method and device based on DKT | |
CN112487020A (en) | Method and system for converting graph of SQL to text into natural language statement | |
He et al. | Directed acyclic graphs-based diagnosis approach using small data sets for sustainability | |
Yang et al. | Endowing Interpretability for Neural Cognitive Diagnosis by Efficient Kolmogorov-Arnold Networks | |
CN117807232A (en) | Commodity classification method, commodity classification model construction method and device | |
CN117609577A (en) | Employment recommendation processing method and system based on artificial intelligence | |
CN116797832A (en) | Stropharia rugoso-annulata hierarchical detection method based on mixed deep learning model | |
Yang et al. | Evolutionary multi-objective neural architecture search for generalized cognitive diagnosis models | |
CN112632406B (en) | Query method, query device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |