Summary of the invention
In view of this, the invention provides a kind of date storage method, can increase the memory data output in the fixed memory space.In the date storage method of the present invention, set in advance compressed encoding information, expression compression coding mode corresponding data types, when having remaining space in the buffer zone, according to described compressed encoding information, judge whether there is the byte multiplexer mode corresponding data types in the data to be stored, if, then to described data to be stored carry out successively that byte is multiplexing, the compressed encoding of differential coding and integer compress mode correspondence, and the compressed encoding result is kept in the described buffer zone; Otherwise, data to be stored are carried out the compressed encoding of differential coding and integer compress mode correspondence successively, and the compressed encoding result are kept in the described buffer zone.
Wherein, the compressed encoding that data to be stored is carried out byte multiplexer mode is:
From described data to be stored, select single datum length less than 1 byte and can merge to 1 word
Multinomial data in the joint, with selected multinomial data 1 byte representation that goes out, and with byte multiplexer mode
The compressed encoding result replaces the selected multinomial data that go out, and puts into data to be stored.
Wherein, the described compressed encoding that data to be stored are carried out the differential coding mode is:
From described data to be stored, select orderly integer sequence, orderly integer sequence is converted to difference sequence, and the compressed encoding result of differential coding mode is replaced the selected orderly integer sequence that goes out, put into data to be stored.
Preferably, between the compressed encoding of the compressed encoding of described byte multiplexer mode and differential coding mode, further comprise:
According to described compressed encoding information, judge whether there is differential coding mode correspondence in the data to be stored
Orderly integer sequence is if then carry out the compressed encoding of described differential coding mode correspondence; Otherwise,
Carry out the compressed encoding of described integer compress mode.
Wherein, the described compressed encoding that data to be stored are carried out the integer compress mode is:
From described data to be stored, select integer data, actual size according to selected integer data, determine the data length that this integer data is required, represent this integer data according to determined data length, and the compressed encoding result of integer compress mode replaced the selected integer data that goes out, put into data to be stored.
Preferably, between the compressed encoding of the compressed encoding of described differential coding mode and integer compress mode, further comprise:
According to described compressed encoding information, judge the integer data that whether has integer compress mode correspondence in the data to be stored, if then carry out the compressed encoding of described integer compress mode correspondence; Otherwise, carry out described the compressed encoding result is kept at operation in the described buffer zone.
Corresponding to above-mentioned date storage method, the invention provides a kind of method for reading data, can from the cache that stores the data of passing through compressed encoding, get access to raw data.In the method for reading data of the present invention, obtain the compressed encoding information of expression compression coding mode corresponding data types in advance, and data to be read are read from buffer zone, according to compressed encoding information, the data that are read out are carried out the decoding of integer compress mode, differential coding mode and byte multiplexer mode correspondence successively, obtain reading the result.
Wherein said the data that are read out are carried out being decoded as of integer compress mode correspondence:
According to described compressed encoding information, from the data that are read out, select data through the compressed encoding of integer compress mode, be reduced to and carry out integer compression numerical value before, and the decoded result that obtains is replaced selected data, put into the data that are read out.
Preferably, before the decoding of described integer compress mode correspondence, further comprise:
According to compressed encoding information, judge the data that whether exist in the data that are read out through the compressed encoding of integer compress mode, if then carry out the decoding of described integer compress mode correspondence; Otherwise, carry out the decoding of differential mode correspondence.
Wherein, described the data that are read out are carried out being decoded as of differential coding mode correspondence:
According to described compressed encoding information, from the data that are read out, select data through the compressed encoding of differential coding mode, be converted to orderly integer sequence, and the decoded result that obtains replaces selected data, put into the data that are read out.
Preferably, between the decoding of the described integer compress mode correspondence decoding corresponding, further comprise with differential coding:
According to compressed encoding information, judge the data that whether exist in the data that are read out through the compressed encoding of differential coding mode, if then carry out the decoding of described differential coding mode correspondence; Otherwise, carry out the decoding of byte multiplexer mode correspondence.
Wherein, described the data that are read out are carried out being decoded as of byte multiplexer mode correspondence:
According to described compressed encoding information, from the data that are read out, select data through the compressed encoding of byte multiplexer mode, the multinomial data that are compressed in 1 byte are used 1 byte representation respectively, and the decoded result that obtains is replaced selected data, put into the data that are read out.
Preferably, between the decoding of the described differential coding mode correspondence decoding corresponding, further comprise with byte multiplexer mode:
According to compressed encoding information, judge the data that whether exist in the data that are read out through the byte multiplexer mode correspondence, if then carry out the decoding of described byte multiplexer mode correspondence; Otherwise, carry out the described operation that obtains reading the result.
The present invention also provides a kind of data retrieval method, can improve the hit rate of result for retrieval.In the data retrieval method of the present invention, set in advance compressed encoding information, expression compression coding mode corresponding data types, this method may further comprise the steps:
A. when receiving the retrieval request that comes from the user, the data of judging this retrieval request correspondence whether in buffer zone, if, execution in step B then; Otherwise, execution in step C;
B. from buffer zone, read out these data, carry out the decoding of integer compression, differential coding and byte multiplexer mode correspondence successively, decoded result as result for retrieval, is returned to the user, and finish the notebook data retrieval flow according to compressed encoding information;
C. from index file, read the index data of retrieval request correspondence, as result for retrieval, meeting when depositing strategy, according to compressed encoding information, index data is carried out successively that byte is multiplexing, puts into described buffer zone behind the compressed encoding of differential coding and integer compress mode, again result for retrieval is returned to the user.
Wherein, the described compressed encoding that index data is carried out byte multiplexer mode is:
From index data, select single datum length less than 1 byte and can merge to multinomial data in 1 byte, with selected multinomial data 1 byte representation that goes out, and the compressed encoding result of byte multiplexer mode replaced the selected multinomial data that go out, put into described index data;
Describedly carry out being decoded as of byte multiplexer mode correspondence:
According to described compressed encoding information, from the index data that is read out, select data through the compressed encoding of byte multiplexer mode, the multinomial data that are compressed in 1 byte are used 1 byte representation respectively, and the decoded result that obtains replaced selected data, put into the index data that is read out.
Wherein, the described compressed encoding that index data is carried out the differential coding mode is:
From index data, select orderly integer sequence, orderly integer sequence is converted to difference sequence, and the compressed encoding result of differential coding mode is replaced the selected orderly integer sequence that goes out, put into described index data;
Describedly carry out being decoded as of differential coding mode correspondence:
According to described compressed encoding information, from the index data that is read out, select data through the compressed encoding of differential coding mode, be converted to orderly integer sequence, and the decoded result that obtains replaces selected data, put into the index data that is read out.
Wherein, the described compressed encoding that index data is carried out the integer compress mode is:
From index data, select integer data, actual size according to selected integer data, determine the data length that this integer data is required, represent this integer data according to determined data length, and the compressed encoding result of integer compress mode replaced the selected integer data that goes out, put into index data;
Describedly carry out being decoded as of integer compress mode correspondence:
According to described compressed encoding information, from the index data that is read out, select data through the compressed encoding of integer compress mode, be reduced to and carry out integer compression numerical value before, and the decoded result that obtains is replaced selected data, put into the index data that is read out.
Preferably, set in advance the replacement condition that expression allows the index data in the buffer zone to be replaced, described the carrying out of step C further comprises before the compressed encoding:
C01. judge whether there is remaining space in the described buffer zone, if then continue to carry out described compressed encoding; Otherwise, execution in step C02;
C02. according to the replacement condition that sets in advance, judge whether to utilize the described index data that from file, reads to replace former index data in the buffer zone, if, then determine and former index data that deletion is replaced, only need carry out described compressed encoding again; Otherwise, carry out the described operation that result for retrieval is returned to the user.
By such scheme as can be known, in date storage method of the present invention, behind the compressed encoding of modes such as data to be stored are multiplexing through byte, differential coding and integer compression, put into cache.Can reduce the space of data occupancy like this, increase the memory data output of the cache of fixed space.In method for reading data of the present invention, after data are taken out from cache, data are decoded according to the compressed encoding information of knowing in advance, be reduced into raw data, guaranteed the correctness of data read.
In addition, in data retrieval method of the present invention,, reduced every index data occupation space, increased the memory data output of the cache with fixed space because index data is put into cache through after the compressed encoding; When data retrieval, more index data can both find in cache, has improved the hit rate of data retrieval effectively.And, because the number of times of visit cache is more in the retrieving, thereby greatly reduce the probability that from index file, reads index data, reduced consumed time, thereby improved the efficient of data retrieval under the mass data situation effectively because of execute file IO.In addition, when from index file, reading index data,, then this index data is deposited among the cache if this index data meets the requirement of depositing strategy in the search engine, replacing condition, thereby can upgrade cache according to user's Search Requirement in time, guarantee the high-level efficiency of data retrieval.
Embodiment
For making purpose of the present invention, technical scheme clearer, below with reference to the accompanying drawing embodiment that develops simultaneously, the present invention is described in further detail.
The invention provides a kind of date storage method, its basic thought is: when having remaining space in the buffer zone, data are carried out depositing in the internal memory behind the compressed encoding.
The mode of compressed encoding comprises that integer compression, differential coding, byte are multiplexing, and the execution sequence of above-mentioned three kinds of compression coding modes is: it is multiplexing at first to carry out byte, then carries out differential coding, carries out the integer compression at last.In the application of reality, since the type decided of data these data can be performed the compressed encoding of above-mentioned which kind of mode, like this, as long as determine the total data type that data to be stored comprise, can determine the compression coding mode carried out and the Data Position of every kind of mode correspondence.Here, each compression coding mode and corresponding data types are known as compressed encoding information.
Fig. 1 shows the process flow diagram of date storage method in the embodiment of the invention.As shown in Figure 1, the date storage method in the present embodiment comprises:
In step 101~102, when in cache, having remaining space, judge whether to exist in the current data to be stored and can carry out the multiplexing data of byte, if, then carry out the compressed encoding of byte multiplexer mode to carrying out the multiplexing data of byte, will be through the multiplexing data of byte and uncompressed coded data as current data to be stored, and execution in step 103; Otherwise, direct execution in step 103.
In the present embodiment, the space of application fixed size is used to preserve data to be stored as cache in internal memory in advance.Because the space among the cache is limited, in the time of therefore can only in cache, having remaining space, can put into data.
Byte is multiplexing to be meant single datum length less than 1 byte and can merge to multinomial data 1 byte representation in 1 byte.Therefore, if include single datum length in the above-mentioned data to be stored less than 1 byte and can merge to multinomial data in 1 byte, then judge to exist and to carry out the multiplexing data of byte; Otherwise, judge not exist and can carry out the multiplexing data of byte.
For example: have font size, capital and small letter, whether increase the weight of and occur the data of four types of positions, wherein the span of font size is 1~32, and data length is 5 bits (bit); The value of capital and small letter is capitalization or small letter, and data length is 1bit; The value that whether increases the weight of is for increasing the weight of or do not increase the weight of, and data length is 1bit; The value that the position occurs is title or text, and data length is 1bit.Because 1 byte comprises 8bit, so, four kinds of above-mentioned data can enough 1 bytes be represented, all take 1 byte and need not every kind of data.
After in determining data to be stored, comprising can byte multiplexing data, can the multiplexing data of byte choose determined, carry out the compressed encoding of byte multiplexer mode again, and replace selecteed data with the multiplexing result of byte, put in the data to be stored, so that the data to be stored that in the subsequent step formed this moment are carried out compressed encoding.
In step 103~104, judge in the current data to be stored and whether have orderly integer data, if, then orderly integer data is carried out the compressed encoding of differential coding mode, will be through the data of differential coding and uncompressed coded data as current data to be stored, and execution in step 105; Otherwise, direct execution in step 105.
In the present embodiment, orderly integer data is meant that a plurality of integer numerical value in the data are the orderly integer sequence form that increases progressively or successively decrease.For example, document is: world cup will be held in Germany in 2006, and we expect the arrival of world cup; Data to be stored are that " world cup " this word appears at the position in the document, and promptly this word is which word in the document, then comprises 3 and 10 two numerical value in these data.Therefore the data to be stored of the above-mentioned type are orderly integer datas.
When carrying out differential coding, orderly integer sequence is converted to difference sequence.Particularly, in order first element in the integer sequence remains unchanged, and each element after this all is expressed as numerical value poor of the former numerical value of this element and last element.For for the 3 and 10 orderly integer sequences of forming, the difference sequence after the conversion is 3,7.After finishing differential coding, the differential coding result is replaced being put into data to be stored by the data of differential coding, so that the data to be stored that in the subsequent step formed this moment are carried out compressed encoding.
In step 105~106, judge in the current data to be stored whether have integer data, if then integer data is carried out the compressed encoding of integer compress mode, with the data behind the compressed encoding and uncompressed coded data as data to be stored, and execution in step 107; Otherwise, direct execution in step 107.
In classic method, adopt fixing data length to represent integer, for example: each integer takies 4 bytes.But in actual conditions, because the numerical value of integer is less, occupation space is less, if still adopt fixing data length, then can cause the waste in space.According to the actual size of integer, determine the required data length of this integer of expression in the present embodiment.For example:, only need 1 byte for integer 1; And, need 2 bytes for integer 1000.
In step 107, data to be stored are deposited among the cache.
So far, finish data compression flow process in the present embodiment.
Above-mentioned data compression method is applicable to such as multiple occasions such as index datastore.In the date storage method in the present embodiment, behind the compressed encoding of modes such as data to be stored are multiplexing through byte, differential coding and integer compression, put into cache.Can reduce the space of data occupancy like this, increase the memory data output of the cache of fixed space.
Correspondingly, in order to obtain raw data from cache, present embodiment also provides a kind of method for reading data.In the application of reality, owing to include the data type of the compressed encoding of carrying out variety of way in the compressed encoding information.Therefore, in data read process, can determine to be compressed coded data, then by obtaining raw data to decoding through the data of compressed encoding according to compressed encoding information.
Fig. 2 shows the exemplary process diagram of method for reading data in the present embodiment.Before this method is carried out, read compressed encoding information in advance, so that read decoding smoothly in the process at follow-up data.As shown in Figure 2, the method for reading data in the present embodiment comprises:
In step 201~202, data to be read are read from cache, judge in the data that are read whether have integer data according to compressed encoding information, if, then integer data is carried out the decoding of integer compress mode correspondence, decoded data are replaced the preceding data of decoding, put into the data that are read out, as current data, and execution in step 203; Otherwise, direct execution in step 203.
Here, after will choosing through the data of integer compression, decode again.When integer data was carried out the decoding of integer compress mode correspondence, the integer data that will read from cache was reduced to the numerical value before the compressed encoding of carrying out the integer compress mode.
In step 203~204, judge the data that whether exist in the current data through differential coding according to compressed encoding information, if, then the data through differential coding are carried out the decoding of differential coding mode correspondence, decoded data are replaced the preceding data of decoding, put into the data that are read out, as current data, and execution in step 205; Otherwise, direct execution in step 205.
Here after determining there are the process data of differential coding, determined data are chosen, decode again.And, the decoding of carrying out differential coding mode correspondence is meant, to remain unchanged through first element in the integer sequence of differential coding, since second element, with the raw value sum of the current numerical value of each element and last element as this element through the corresponding decoded result of differential coding mode.For example: the element that the integer sequence of process differential coding comprises is: 3,7,6; Then at first keep first element 3 constant, the current numerical value of second element is added that the raw value that the numerical value of first element obtains second element is 10, the more current numerical value of the 3rd element is added that the raw value that the raw value of second element obtains the 3rd element is 16.
In step 205~206, judge whether exist in the current data according to compressed encoding information through the multiplexing data of byte, if, then the multiplexing data of process byte are carried out the decoding of byte multiplexer mode correspondence, and execution in step 207; Otherwise, direct execution in step 207.
Here, after determining there are the data multiplexing, these data are chosen, and the multinomial data that will be compressed in 1 byte are used 1 byte representation respectively through byte.
In step 207, obtain raw data.
If do not comprise any data through compressed encoding in the data of reading in the step 201 from cache, then the raw data in this step is the data of being read; If the part in the data of reading is for through the data of compressed encoding, then the raw data in this step is decoded data and combination without the data of decoding; If the data of reading all are the data through compressed encoding, then the raw data in this step is decoded data.
So far, finish data read flow process in the present embodiment.
Above-mentioned date storage method illustrated in figures 1 and 2 and method for reading data can be applied in the data retrieval.
Fig. 3 shows the exemplary process diagram of data retrieval method among the present invention.Referring to Fig. 3, this method comprises:
In step 301, when receiving the retrieval request that comes from the user, whether the data of judging this retrieval request correspondence are in cache, if then execution in step 302; Otherwise, execution in step 303;
In step 302, from cache, take out these data, carry out the decoding of integer compression, differential coding and byte multiplexer mode correspondence according to compressed encoding information, decoded result as result for retrieval, and is finished the notebook data retrieval flow;
In step 303, from index file, read the index data of retrieval request correspondence, as result for retrieval, meeting when depositing strategy, index data is carried out that byte is multiplexing, puts into cache behind the compressed encoding of differential coding and integer compress mode.
Data retrieval method is suitable for multiple indexed mode among the present invention, is for example just arranging index, inverted index etc.
Be example with the inverted index below, the data retrieval method among the present invention is described.
Fig. 4 shows the process flow diagram of data retrieval method in the present embodiment.The position that index data is deposited can be the index file in cache or the disk.What wherein preserve among the cache is to meet the index data of depositing strategy that sets in advance in the search engine, for example: nearest accessed index data, the index data that rate of people logging in is higher etc.In addition, also read the compressed encoding information of the retrieve data correspondence among the cache in the present embodiment in advance, so that after sense data from cache, decode.As shown in Figure 4, the data retrieval method in the present embodiment comprises:
In step 401~402, search engine receives the retrieval request that comes from the user, judges whether the retrieve data of this retrieval request correspondence is stored among the cache, if then execution in step 403; Otherwise, execution in step 404.
In step 403, from cache, read the index data of retrieval request correspondence, and the index data of being read decoded according to compressed encoding information, and with decoded result as result for retrieval, then execution in step 410.
The concrete operations of in this step the index data that comes from cache being decoded are identical with the operation in step 201 shown in Figure 2~207.
In step 404~405, from index file, read the index data of retrieval request correspondence, the index data that reads as result for retrieval, and is judged according to the strategy of depositing that sets in advance whether this index data should be stored among the cache, if then execution in step 406; Otherwise, direct execution in step 410.
In step 406~407, judge whether there is remaining space among the cache, if, behind the compressed encoding of then that current index data is multiplexing through byte, differential coding and integer compress mode, deposit among the cache, and execution in step 410; Otherwise, execution in step 408.
Here step 101 is identical to 107 operation in the method that data are carried out compressed encoding and the data storage flow process shown in Figure 1.
For inverted index, the basic thought of its inverted index is that the record word occurs in which document.The structure of index data is:<T, F
t,<D, F
D, t,<P 〉
* * *.Wherein, T represents word sign, F
tThe document frequency of expression word, D represents document identification, F
D, tThe word frequency of expression word in document, P represents the position of word in document,
*The corresponding project of expression can have a plurality of values.Can also write down such as font size, literal capital and small letter in addition, whether increase the weight of, come across the word additional information of types such as title or text as index data.The data of each type are integer data in the above-mentioned index data, can carry out the integer compression; The data of position this type of the word of P representative in document are ordered sequence, therefore can carry out differential coding; And that additional information can be performed byte is multiplexing.Behind the compressed encoding in this step, the index data requisite space can obviously reduce.
In step 408, according to the replacement condition that sets in advance, judge whether to utilize the former index data among the current index data replacement cache, if, the former index data then definite and deletion is replaced, behind the compressed encoding of again that current index data is multiplexing through byte, differential coding and integer compress mode, deposit among the cache execution in step 410 in; Otherwise, direct execution in step 410.
For what guarantee to preserve among the cache is the index data of being convenient to retrieve most, present embodiment also sets in advance the replacement condition, be used for not existing under the situation of remaining space at cache, weigh new index data and whether substitute partial index data among the cache, can require according to user's retrieval to upgrade so that guarantee the content among the cache.The replacement condition here can be that access times are more, the access time is nearer etc.
In step 410, return result for retrieval to the user.
So far, finish data retrieval flow process in the present embodiment.
Above-mentioned retrieval flow is arranged as seen,, reduced every index data occupation space, increased the memory data output of the cache with fixed space because index data is put into cache through after the compressed encoding; When data retrieval, more index data can both find in cache, has improved the hit rate of data retrieval effectively.And, because the number of times of visit cache is more in the retrieving, thereby greatly reduce the probability that from index file, reads index data, reduced consumed time, thereby improved the efficient of data retrieval under the mass data situation effectively because of execute file IO.In addition, in the present embodiment when from index file, reading index data, if this index data meets the requirement of depositing strategy in the search engine, replacing condition, then this index data is deposited among the cache, thereby can upgrade cache according to user's Search Requirement in time, guarantee the high-level efficiency of data retrieval.
The above only is preferred embodiment of the present invention, and is in order to restriction the present invention, within the spirit and principles in the present invention not all, any modification of being made, is equal to replacement, improvement etc., all should be included within protection scope of the present invention.