WO2016131554A1 - Method and apparatus for adaptive data compression - Google Patents
Method and apparatus for adaptive data compression Download PDFInfo
- Publication number
- WO2016131554A1 WO2016131554A1 PCT/EP2016/025015 EP2016025015W WO2016131554A1 WO 2016131554 A1 WO2016131554 A1 WO 2016131554A1 EP 2016025015 W EP2016025015 W EP 2016025015W WO 2016131554 A1 WO2016131554 A1 WO 2016131554A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- segment
- key value
- symbol
- lookup map
- nmax
- Prior art date
Links
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3059—Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
- H03M7/3064—Segmenting
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
- H03M7/4031—Fixed length to variable length coding
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/46—Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
Definitions
- the present invention relates to a computer method and an apparatus for adaptive data compression, and particularly to a computer method and apparatus for adap ⁇ tive data compression in the field of dictionary encoding.
- Dictionary encoding is a commonly used process in compression of structured data. Compression of data re ⁇ fers to process of reducing the amount of data needed to represent a given information. To this end, redundant or unnecessary data is removed from the total of the struc ⁇ tured data. Data compression techniques reduce the costs for information storage and transmission. They are used in many applications, ranging from simple file size reduction to speech and video encoding.
- Lossless compression the source message or string at the encoder input is retrieved exactly at the output of the decoder.
- lossy compression the string is not retrieved exactly but the information loss is tolerable for the type of applica ⁇ tion targeted by the compression schemes.
- Lossy compression is mainly used for speech, audio, image and video signals. The aim of the compression algorithm is to represent a sig- nal with a minimum number of bits while maintaining the signal intelligibility and perceptual quality. All the in- formation that cannot be perceived by human sensors can be removed.
- Lossless compression techniques are used in appli ⁇ cations where no information losses are tolerable such as compressing executable and source code files, databases, satellite imaging and medical imaging. The techniques are also used as part of lossy compression schemes for better compression ratios.
- Dictionary encoding is the process of replacing each instance of a data item with a short code word and using a dictionary data structure to convert code words back into values, thus reducing the size of these data items.
- the use of dictionary encoding is particularly advantageous for data items with repeated values such as prizes, names, and machine operation protocols.
- US 2014/0085115 Al discloses a data compression using dictionary encoding, including subdividing a table of uncompressed data into a first block and a second block of complete rows, determining information about a frequency of occurrence of different values for each column of the first block, selecting a row of the first block to be removed out of the first block using frequency of occurrence- information, removing the row out of the first block to form an updated first block and determining information about a frequency of occurrence of different values for each column of the updated first block, deriving a dictionary containing code words for encoding the values of the updated first block, encoding the values of the updated first block based on the code words, and adding the removed row to the second block.
- US 6,606,040 B2 discloses an adaptive data com ⁇ pression in which an alphabet and vocabulary in the encoder and decoder is built adaptively and stored in a dictionary as symbols are to be encoded and decoded. Each time an un ⁇ known symbol is to be encoded by the encoder, the encoder adds the symbol to the dictionary and transmits it in plain in the encoded string.
- the code words transmitted by the encoder include symbols and indexes. The state of a prefix bit preceding the code word indicates whether the code word is a plain symbol or an index of a symbol or string of sym ⁇ bols stored in the dictionary.
- the decoder examines prefix bit of each code word as it is received to determine if the code word stores a symbol in plain or in index. If the code word stores a symbol in plain, the decoder learns the sym ⁇ bol by adding a sequence of symbols resulting from the con ⁇ catenation of previously decoded symbols and the first sym ⁇ bol of the currently decoded symbol and by adding the sym- bol to its dictionary. If the code word stores an index, the decoder decodes the code word by extracting the symbol or sequence of symbols stored in the dictionary at the re ⁇ spective index in the dictionary.
- the invention provides a computer program method and an apparatus for adaptively compressing an input string comprising a sequence of symbols with the features of claims 1 and 4, respectively.
- the invention further provides a computer program with the fea- tures of claim 7 as well as a digital database with the features of claim 10.
- an adaptive key size storage is combined with multiple dictionaries, or in other words, the encoder dictionary is made up of a plural ⁇ ity of segment dictionaries with a predetermined limited (but variable) maximal key value size.
- the approach accord ⁇ ing to the invention results in a good compression but without the need of analyzing incoming string data upfront.
- the only parameter for tuning the compression is to specify (minimal and) maximal size of the key values to be used.
- An incoming or input string is sequentially an ⁇ alyzed for the symbols it is containing, and a lookup map is searched for each symbol received in the input string. If a symbol is not stored in the lookup map yet, the symbol is added by storing the symbol at a next sequential index in the lookup map. A next sequential key value entry is as ⁇ signed to the symbol, with this key value being added to a key value segment. If the symbol is already stored in the lookup map, the corresponding key value assigned to this symbol is added to the next sequential entry of the key value segment.
- the method of the invention results in an encoder dictionary built up by a plurality of segment dic ⁇ tionaries, with limited decoder key size.
- a segment dictionary D m may contain up to 2 nmax different values/symbols.
- the symbols in a given segment dictionary D m are identified by using a key value k n . Symbol alterna- tion is therefore realized by assigning the values 0 to 2 n - 1.
- n bits are needed to store a value k n .
- the invention also covers a computer program with program coding means which are suitable for carrying out a message according to the invention as described above when the computer program is run on a computer.
- the computer program it self as well as stored on a computer-readable medium is claimed.
- Figure la shows a schematic block diagram de ⁇ piction illustrating an example of an adaptive dictionary setup of the invention
- Figure lb shows the first lookup map of the ex- ample of Figure la in subsequent filling states until crea ⁇ tion of the first segment dictionary.
- Figure 2 is a flow diagram illustrating the method of the invention.
- Figure 3 shows a schematic block diagram illus- trating an apparatus of the invention.
- Figure 4 is a graph showing a temperature pro ⁇ file of a turbine in operation.
- Figure 5 is a schematic depiction of the com ⁇ pressed dictionary database setup of the temperature data of Figure 4 according to the invention.
- Figure 6 is a more detailed illustration of the depiction of Figure 5 showing the content of the compressed dictionary database setup.
- Figure la shows a schematic depiction illus ⁇ trating the invention.
- Figure la shows an incoming string 10 and its adaptive compression (compressed segment stream D) according to the invention comprising a plurality of segment dictionaries Di, D 2 .
- the dictionaries are accompanied by key value segments Si,i, Si, 2 , Si, 3, S 2 , 2 , S 2 ,3.
- the key value segments S m , n are sorted according to the n-bit keys.
- the key value segments S m , n are stored in front of the respective segment dictionaries D m .
- Below the com- pressed segment stream D the bitwise data stream of the key value segments is illustrated, i.e. the actual bit en ⁇ tries are shown.
- Each segment dictionary D m of the embodiment according to the invention can thus contain up to a maximum of eight character entries.
- FIG. 2 shows the flow diagram of the method of the invention
- Figure 3 illustrates an exemplary em ⁇ bodiment of an apparatus 20 to perform the method of the invention.
- the apparatus 20 can be a computer database sys ⁇ tem, and the exemplary embodiment comprises a database storage DB for storing a plurality of segment dictionaries D m , a receiver 30 for receiving an input string 10, an analyzer 40 for excerpting each symbol from the sequence of symbols of the input string 10, a generator 60 for generat ⁇ ing a lookup map L m and a key value segment S m , n , an engine 50 for detecting whether a symbol is contained in a lookup map L m , and a converter 70 to convert a lookup map L m into a segment dictionary D m .
- a database storage DB for storing a plurality of segment dictionaries D m
- a receiver 30 for receiving an input string 10
- an analyzer 40 for excerpting each symbol from the sequence of symbols of the
- an incoming character string 10 consists of a plurality of various charac ⁇ ters.
- the incoming or input string 10 is read from the left to right as indicated in Figure la by arrow R.
- start ⁇ ing step 100 in Figure 2
- the analyzer 40 cf.
- step 120 namely "N” is not to be found in the lookup map Li by the engine 50 and is therefore entered in the lookup map Li at the first key index (cf. sequence 2 in Figure lb, and steps 130, 131, 135 of Figure 2), and an according key value "0" is entered in a first key value segment Si,i containing the one-bit keys (cf. first entry in key value Si,i segment in the depiction of Figure la and steps 140, 150 of Figure 2) .
- the second character is taken from input string 10 (cf. steps 160, 120 of Figure 2), namely "0", which is also not contained in the first lookup map Li yet and therefore entered in the next sequential key index of the first lookup map Li (cf.
- step 160, 120 of Figure 2 As this character is already contained in the lookup map, namely in the first key index of the first lookup map Li, there is no additional entry of this character into the lookup map Li . However, the corre- sponding key value "0" is entered in the next sequential entry of the first key value segment Si,i (cf. Figure 2: step 130 directly refers to steps 140 and 150) .
- Step 160 then refers back to step 120, and the following character "B" is also new to the lookup map Li such that this character is entered into the fourth key in ⁇ dex of first lookup map Li (sequence 5 of Figure lb) and a corresponding key value "3" is added to the second key val ⁇ ue segment Si, 2-
- the next five characters "NOB N" are al- ready contained in the lookup map so that no new entries are done to the first lookup map Li, but of course result ⁇ ing in the new key value assignments "01320" to the second key value segment S 2 ,i.
- This eighth character in the example input string 10 is the character "S" (which is the 24 th character in this string) , and it is assigned key val- ue "7" and is the last entry in the first lookup map Li (cf . sequences 7 to 9 in Figure lb) .
- the next position in input string 10 contains the ninth different character, namely a As the maximum of eight key values is reached, the current key value segment Si, 3 is finalized and the first lookup map Li is converted into first segment dictionary Di by the converter 70 (cf. bottom of Figure lb after sequence 9, and steps 131, 132, 133 of Figure 2) . Then, a new (se- cond) lookup map L2 is opened, with an according key value segment (step 134) .
- the new character "-" is entered into the first entry position or key index of the second lookup map L2 and an according entry is added to the segment key S2, 2 ⁇ [0038] It is to be noted that in the creation of the second lookup map L2 in the embodiment shown in Figure 1, no 1-bit key segment is created but directly the 2-bit key segment S2,2 is opened as the first three characters start ⁇ ing at position 27 "-JU" are three distinct different char- acters so that the 1-bit key segment would only consist of the two entries "01". It is of course understood that there could as well be a 1-bit segment S 2 ,i.
- Figure 4 shows a graph with a temperature pro ⁇ file of a turbine in operation
- Figure 5 is a schematic depiction of the plurality of segment dictionaries obtained from the temperature data of the graph of Figure 4.
- the graph of Figure 4 shows a turbine that is put into operation on 1 st January. In the beginning, the operating temperature of the turbine is at around room tem ⁇ perature, and then increases at a rate of roughly 100 to 120 °C per hour until it reaches a first operating tempera- ture at around 825 ⁇ 5 °C.
- This level is maintained during a long term operation until 4 th January when a reconfiguration mode is run which brings the temperature down to around 640 °C within around two hours and then directly back up to second operation level at around 870 ⁇ 8 °C within another two hours. The turbine is then operated at this second temperature level for several days.
- the two stable operating phases at around 825 °C and 870 °C, respectively each only have one segment dictionary with accordingly longer key value segments (i.e. key value segments with more entries) as the incoming temperature measurement readings are re ⁇ peatedly the same values within a small interval (cf. lines five and ten of Figure 5, corresponding to segment diction ⁇ aries D 5 to Dio) .
- Figure 6 shows the depiction of Figur 5 in more detail, particularly by showing the actual entries in the key value segments and the segment dictionaries.
- the en- tries in segment dictionaries Di to D4 and D6 to Dg reflect the rapid temperature changes in the high gradient portions of the temperature profile of Figure 4, the first tempera ⁇ ture increase starting at 14 °C in Di to 801 °C in D 4 , whereas the entries in segment dictionaries D5 to Dio re- fleet the stable plateau phases of the temperature profile of Figure 4 with temperatures of around 825 °C in segment dictionary D5 and 870 °C in segment dictionary Dio (at the end of segment dictionary D 5 the decrease phase already starts as reflected by the last three to four entries) .
- the key value segments are filled quite fast with the six ⁇ teen available values such that the key value segments of dictionaries Di to D4 and D6 to Dg remain relatively short.
- the temperatures are more repetitive such that are also repetitive and accordingly longer (cf. particularly key value segments S 5 , 4 and Si 0 ,4) .
- the first temperature en ⁇ try of segment dictionary Di equaling 14 °C corresponds to the first 13 measurement values (all four entries of key value segments Si,i and Si, 2 and the first nine entries of key value segments Si, 3) .
- the sequence of the measurement values stored in segment dictionary Di 0 and key value seg ⁇ ments Sio,i and Si 0 , 2 is 864 °C, 866 °C, 864 °C, 863 °C, 863 °C, 870 °C,863 °C, etc.
- Figures 4 to 6 shows how the in ⁇ vention can create a database compression which adapts to the nature of the data to be compressed, i.e. having a higher number of segment dictionaries with short key value segments in cases of highly varying data, and having a very low number of segment dictionaries with accordingly longer key value segments in case of stable or less varying data, all of the segment dictionaries having the same maximum size .
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Adaptively compressing an input string (10) comprising a sequence of symbols in order to create a plurality of segment dictionaries Dm, with the steps of: generating a lookup map (110); generating a key value segment Sm,n; searching the lookup map for each symbol received in the input string (120, 130); upon detecting a symbol is not stored in the lookup map, adding the symbol by storing the symbol at a next sequential key index in the lookup map (135) and assigning a next sequential key value entry to the symbol and adding this key value to the key value segment Sm,n (150); upon detecting the symbol is stored in the lookup map, adding the corresponding key value assigned to this symbol to the next sequential entry of the key value segment Sm,n (150); wherein a new key value segment Sm,n+1 of the lookup map is generated if the number of different symbols equals the number of available key values k = 2n for the opened/current key value segment Sm,n (141, 142), and where- in the lookup map is converted into a segment dictionary Dm if the maximal key value size knmax = 2nmax is reached (132, 133, 134), with n being any positive integral number 1 to nmax, nmax denoting the maximal bit size, and m being any positive integral number denoting the consecutive numbering of segment dictionaries Dm.
Description
Method and apparatus for adaptive data compression
Technical Field
[0001] The present invention relates to a computer method and an apparatus for adaptive data compression, and particularly to a computer method and apparatus for adap¬ tive data compression in the field of dictionary encoding.
[0002] Dictionary encoding is a commonly used process in compression of structured data. Compression of data re¬ fers to process of reducing the amount of data needed to represent a given information. To this end, redundant or unnecessary data is removed from the total of the struc¬ tured data. Data compression techniques reduce the costs for information storage and transmission. They are used in many applications, ranging from simple file size reduction to speech and video encoding.
[0003] The two different types of data compression are lossless compression and lossy compression. In lossless compression, the source message or string at the encoder input is retrieved exactly at the output of the decoder. In lossy compression, the string is not retrieved exactly but the information loss is tolerable for the type of applica¬ tion targeted by the compression schemes. Lossy compression is mainly used for speech, audio, image and video signals. The aim of the compression algorithm is to represent a sig- nal with a minimum number of bits while maintaining the signal intelligibility and perceptual quality. All the in-
formation that cannot be perceived by human sensors can be removed. Lossless compression techniques are used in appli¬ cations where no information losses are tolerable such as compressing executable and source code files, databases, satellite imaging and medical imaging. The techniques are also used as part of lossy compression schemes for better compression ratios.
[0004] Dictionary encoding is the process of replacing each instance of a data item with a short code word and using a dictionary data structure to convert code words back into values, thus reducing the size of these data items. The use of dictionary encoding is particularly advantageous for data items with repeated values such as prizes, names, and machine operation protocols. [0005] US 2014/0085115 Al discloses a data compression using dictionary encoding, including subdividing a table of uncompressed data into a first block and a second block of complete rows, determining information about a frequency of occurrence of different values for each column of the first block, selecting a row of the first block to be removed out of the first block using frequency of occurrence- information, removing the row out of the first block to form an updated first block and determining information about a frequency of occurrence of different values for each column of the updated first block, deriving a dictionary containing code words for encoding the values of the updated first block, encoding the values of the updated first block based on the code words, and adding the removed row to the second block. [0006] US 6,606,040 B2 discloses an adaptive data com¬ pression in which an alphabet and vocabulary in the encoder and decoder is built adaptively and stored in a dictionary
as symbols are to be encoded and decoded. Each time an un¬ known symbol is to be encoded by the encoder, the encoder adds the symbol to the dictionary and transmits it in plain in the encoded string. The code words transmitted by the encoder include symbols and indexes. The state of a prefix bit preceding the code word indicates whether the code word is a plain symbol or an index of a symbol or string of sym¬ bols stored in the dictionary. The decoder examines prefix bit of each code word as it is received to determine if the code word stores a symbol in plain or in index. If the code word stores a symbol in plain, the decoder learns the sym¬ bol by adding a sequence of symbols resulting from the con¬ catenation of previously decoded symbols and the first sym¬ bol of the currently decoded symbol and by adding the sym- bol to its dictionary. If the code word stores an index, the decoder decodes the code word by extracting the symbol or sequence of symbols stored in the dictionary at the re¬ spective index in the dictionary.
[0007] The main problem in using dictionary compres- sion is the need to know upfront the number of different values in the column to be processed. This is particularly crucial for streamed data processing as it is not possible to analyze the data upfront.
Summary [0008] In contrast thereto, the invention provides a computer program method and an apparatus for adaptively compressing an input string comprising a sequence of symbols with the features of claims 1 and 4, respectively. The invention further provides a computer program with the fea- tures of claim 7 as well as a digital database with the features of claim 10.
[0009] According to the invention, an adaptive key size storage is combined with multiple dictionaries, or in other words, the encoder dictionary is made up of a plural¬ ity of segment dictionaries with a predetermined limited (but variable) maximal key value size. The approach accord¬ ing to the invention results in a good compression but without the need of analyzing incoming string data upfront. The only parameter for tuning the compression is to specify (minimal and) maximal size of the key values to be used. [0010] An incoming or input string is sequentially an¬ alyzed for the symbols it is containing, and a lookup map is searched for each symbol received in the input string. If a symbol is not stored in the lookup map yet, the symbol is added by storing the symbol at a next sequential index in the lookup map. A next sequential key value entry is as¬ signed to the symbol, with this key value being added to a key value segment. If the symbol is already stored in the lookup map, the corresponding key value assigned to this symbol is added to the next sequential entry of the key value segment. Once the number of different symbols equals the number of available key values for the opened/current key value segment, a new key value segment is generated and the lookup map is converted into a segment dictionary, if the maximal key value size is reached. [0011] Thus, the method of the invention results in an encoder dictionary built up by a plurality of segment dic¬ tionaries, with limited decoder key size.
[0012] For example, given that m and n are natural numbers (integral values) with m being any positive inte- gral number denoting the consecutive numbering of segment dictionaries Dm and n being any positive integral number 1 to nmax, nmax denoting the maximal bit size, and the maxi-
mal key values size is defined as knmax = 2nmax, a segment dictionary Dm may contain up to 2nmax different values/symbols. The symbols in a given segment dictionary Dm are identified by using a key value kn. Symbol identifica- tion is therefore realized by assigning the values 0 to 2n- 1. Thus, n bits are needed to store a value kn.
[0013] The invention also covers a computer program with program coding means which are suitable for carrying out a message according to the invention as described above when the computer program is run on a computer. The computer program it self as well as stored on a computer-readable medium is claimed.
[0014] Further features and embodiments of the inven¬ tion will become apparent from the description and the ac- companying drawings.
[0015] It will be understood that the features men¬ tioned above and those described hereinafter can be used not only in the combination specified but also in other combinations or on their own, without departing on this scope of the present invention.
[0016] The invention is schematically illustrated in the drawings by means of an embodiment by way of example and is hereinafter explained in detail with reference to the drawings. It is understood that the description is in now way limiting on the scope of the present invention and is merely an illustration of a preferred embodiment of the invention .
Brief Description of the Drawings
[0017] Figure la shows a schematic block diagram de¬ piction illustrating an example of an adaptive dictionary setup of the invention, and
[0018] Figure lb shows the first lookup map of the ex- ample of Figure la in subsequent filling states until crea¬ tion of the first segment dictionary.
[0019] Figure 2 is a flow diagram illustrating the method of the invention.
[0020] Figure 3 shows a schematic block diagram illus- trating an apparatus of the invention.
[0021] Figure 4 is a graph showing a temperature pro¬ file of a turbine in operation.
[0022] Figure 5 is a schematic depiction of the com¬ pressed dictionary database setup of the temperature data of Figure 4 according to the invention.
[0023] Figure 6 is a more detailed illustration of the depiction of Figure 5 showing the content of the compressed dictionary database setup.
Detailed Description [0024] Figure la shows a schematic depiction illus¬ trating the invention. Particularly, Figure la shows an incoming string 10 and its adaptive compression (compressed segment stream D) according to the invention comprising a plurality of segment dictionaries Di, D2. The dictionaries are accompanied by key value segments Si,i, Si, 2 , Si, 3, S2,2, S2,3. The key value segments Sm,n are sorted according to the n-bit keys. The key value segments Sm,n are stored in front of the respective segment dictionaries Dm. Below the com-
pressed segment stream D, the bitwise data stream of the key value segments is illustrated, i.e. the actual bit en¬ tries are shown.
[0025] In the embodiment shown in Figure la, the maxi- mal key value size is set to be knmax = 8, resulting in a bit size of nmax = 3 as knmax = 2n. Each segment dictionary Dm of the embodiment according to the invention can thus contain up to a maximum of eight character entries.
[0026] Figure 2 shows the flow diagram of the method of the invention, and Figure 3 illustrates an exemplary em¬ bodiment of an apparatus 20 to perform the method of the invention. The apparatus 20 can be a computer database sys¬ tem, and the exemplary embodiment comprises a database storage DB for storing a plurality of segment dictionaries Dm, a receiver 30 for receiving an input string 10, an analyzer 40 for excerpting each symbol from the sequence of symbols of the input string 10, a generator 60 for generat¬ ing a lookup map Lm and a key value segment Sm,n, an engine 50 for detecting whether a symbol is contained in a lookup map Lm, and a converter 70 to convert a lookup map Lm into a segment dictionary Dm.
[0027] With reference to Figure la, an incoming character string 10 consists of a plurality of various charac¬ ters. The incoming or input string 10 is read from the left to right as indicated in Figure la by arrow R. When start¬ ing (step 100 in Figure 2) the method to adaptively com¬ press an input string 10 received by the receiver 30, a new lookup map Li ( Lm with m = 1) with knmax = 23 = 8 empty array cells (key indices) is generated (cf. step 110 in Figure 2) by generator 60, and there will first be no entry in the lookup map Li (cf . Figure lb, top, sequence 1) .
[0028] Therefore, the first character taken from the input string 10 by the analyzer 40 (cf. step 120), namely "N" is not to be found in the lookup map Li by the engine 50 and is therefore entered in the lookup map Li at the first key index (cf. sequence 2 in Figure lb, and steps 130, 131, 135 of Figure 2), and an according key value "0" is entered in a first key value segment Si,i containing the one-bit keys (cf. first entry in key value Si,i segment in the depiction of Figure la and steps 140, 150 of Figure 2) . [0029] Then the second character is taken from input string 10 (cf. steps 160, 120 of Figure 2), namely "0", which is also not contained in the first lookup map Li yet and therefore entered in the next sequential key index of the first lookup map Li (cf. sequence 3 in Figure lb and repetition of steps 130, 131, 135 of Figure 2) with an ac¬ cording key value "1" entered into first key value segment "Si,i" at the next position (cf. second entry in key value Si,i segment in the depiction of Figure la and repetition of steps 140, 150 of Figure 2) . [0030] Then, the third character, namely "N", is read
(steps 160, 120 of Figure 2) . As this character is already contained in the lookup map, namely in the first key index of the first lookup map Li, there is no additional entry of this character into the lookup map Li . However, the corre- sponding key value "0" is entered in the next sequential entry of the first key value segment Si,i (cf. Figure 2: step 130 directly refers to steps 140 and 150) .
[0031] This is repeated for the next three characters "OON" already contained in the lookup map Li and resulting in the bit key entries "110" in the first key value segment Si,i as depicted in Figure la.
[0032] Then a new character, namely a blank " ", is identified in input string 10, resulting in a new entry in the first lookup map Li (cf . sequence 4 of Figure lb) . As this is the third character in the lookup map Li, the one- bit key segment Si,i is not available anymore, and the next, i.e. second key value segment for two-bit keys Si, 2 is started, bearing entry "2" for the character " " (blank) (cf . steps 140, 141, 142, 150 of Figure 2) .
[0033] Step 160 then refers back to step 120, and the following character "B" is also new to the lookup map Li such that this character is entered into the fourth key in¬ dex of first lookup map Li (sequence 5 of Figure lb) and a corresponding key value "3" is added to the second key val¬ ue segment Si, 2- The next five characters "NOB N" are al- ready contained in the lookup map so that no new entries are done to the first lookup map Li, but of course result¬ ing in the new key value assignments "01320" to the second key value segment S2,i.
[0034] The following character "E" in input string 10 (which is the fourteenth character in the string) is another character which is not yet contained in the lookup map. As this the fifth new character, it is entered in the fifth key index of the first lookup map Li (cf. sequence 6 of Figure lb) , and as the two-bit key value segment Si, 2 is now full (containing 2n with n = 2 values) , the third key value segment Si, 3 is started with the entry "4" assigned to the new character "E" (cf . Figure la) .
[0035] This third key value segment Si, 3 is the last key value segment as it provides entry of up to eight dif- ferent symbols (knmax = 23) . This eighth character in the example input string 10 is the character "S" (which is the 24th character in this string) , and it is assigned key val-
ue "7" and is the last entry in the first lookup map Li (cf . sequences 7 to 9 in Figure lb) .
[0036] As the following two characters on positions 25 and 26 are already contained in the dictionary (" " and "A"), there are two more entries into the third key value segment Si, 3 so that the last three entries of this segment read "725".
[0037] The next position in input string 10 (position number 27) contains the ninth different character, namely a As the maximum of eight key values is reached, the current key value segment Si, 3 is finalized and the first lookup map Li is converted into first segment dictionary Di by the converter 70 (cf. bottom of Figure lb after sequence 9, and steps 131, 132, 133 of Figure 2) . Then, a new (se- cond) lookup map L2 is opened, with an according key value segment (step 134) . The new character "-" is entered into the first entry position or key index of the second lookup map L2 and an according entry is added to the segment key S2, 2 · [0038] It is to be noted that in the creation of the second lookup map L2 in the embodiment shown in Figure 1, no 1-bit key segment is created but directly the 2-bit key segment S2,2 is opened as the first three characters start¬ ing at position 27 "-JU" are three distinct different char- acters so that the 1-bit key segment would only consist of the two entries "01". It is of course understood that there could as well be a 1-bit segment S2,i.
[0039] As shown in the depiction of Figures la and 3 the key segments may be stored in front of the respective corresponding segment dictionary.
[0040] Figure 4 shows a graph with a temperature pro¬ file of a turbine in operation, and Figure 5 is a schematic depiction of the plurality of segment dictionaries obtained from the temperature data of the graph of Figure 4. [0041] The graph of Figure 4 shows a turbine that is put into operation on 1st January. In the beginning, the operating temperature of the turbine is at around room tem¬ perature, and then increases at a rate of roughly 100 to 120 °C per hour until it reaches a first operating tempera- ture at around 825 ± 5 °C. This level is maintained during a long term operation until 4th January when a reconfiguration mode is run which brings the temperature down to around 640 °C within around two hours and then directly back up to second operation level at around 870 ± 8 °C within another two hours. The turbine is then operated at this second temperature level for several days.
[0042] By using the method of the invention for adap- tively compressing the input string comprising the sequence of temperature measurement values, a relatively small dic- tionary size is chosen, unlike known methods which would work with a dictionary size covering the whole span of around 14 °C to 1,000 °C.
[0043] As can be seen from the depiction of Figure 5, nmax is chosen to be 4, leading to a maximum key value size of knmax = 24 = 16, and an according maximum segment dictionary size of 16 entries.
[0044] This approach leads to a compressed segment stream as illustrated in Figure 5 with a plurality of seg¬ ment dictionaries Di to Di0, each having associated four key value segments Si,i, Si,2, Si,3, Si,4; S2,i, S2,2, S2,3, S2,4; - ;
Sl0,l' Sio,2, Sio,3/ Sio,4-
[0045] In the heating-up phase (first four lines in Figure 5, corresponding to segment dictionaries Di to D4) and the reconfiguration phase (lines six to nine in Figure 5, corresponding to segment dictionaries D6 to D9) , the key value segments are rather short and there is a higher num¬ ber of segment dictionaries as the temperature measurement readings vary fast in these phases, i.e. there are a lot of different values.
[0046] In contrast thereto, the two stable operating phases at around 825 °C and 870 °C, respectively, each only have one segment dictionary with accordingly longer key value segments (i.e. key value segments with more entries) as the incoming temperature measurement readings are re¬ peatedly the same values within a small interval (cf. lines five and ten of Figure 5, corresponding to segment diction¬ aries D5 to Dio) .
[0047] Figure 6 shows the depiction of Figur 5 in more detail, particularly by showing the actual entries in the key value segments and the segment dictionaries. The en- tries in segment dictionaries Di to D4 and D6 to Dg reflect the rapid temperature changes in the high gradient portions of the temperature profile of Figure 4, the first tempera¬ ture increase starting at 14 °C in Di to 801 °C in D4, whereas the entries in segment dictionaries D5 to Dio re- fleet the stable plateau phases of the temperature profile of Figure 4 with temperatures of around 825 °C in segment dictionary D5 and 870 °C in segment dictionary Dio (at the end of segment dictionary D5 the decrease phase already starts as reflected by the last three to four entries) . [0048] According to the teachings of the invention, the key value segments are filled quite fast with the six¬ teen available values such that the key value segments of
dictionaries Di to D4 and D6 to Dg remain relatively short. In the stable phases, the temperatures are more repetitive such that are also repetitive and accordingly longer (cf. particularly key value segments S5,4 and Si0,4) . [0049] As a reading example, the first temperature en¬ try of segment dictionary Di equaling 14 °C corresponds to the first 13 measurement values (all four entries of key value segments Si,i and Si, 2 and the first nine entries of key value segments Si, 3) . The sequence of the measurement values stored in segment dictionary Di0 and key value seg¬ ments Sio,i and Si0,2 is 864 °C, 866 °C, 864 °C, 863 °C, 863 °C, 870 °C,863 °C, etc.
[0050] The example of Figures 4 to 6 shows how the in¬ vention can create a database compression which adapts to the nature of the data to be compressed, i.e. having a higher number of segment dictionaries with short key value segments in cases of highly varying data, and having a very low number of segment dictionaries with accordingly longer key value segments in case of stable or less varying data, all of the segment dictionaries having the same maximum size .
Claims
1. A computer method for adaptively compressing an input string (10) comprising a sequence of symbols in order to create a plurality of segment dictionaries Dm, the meth¬ od comprising the steps of:
generating a lookup map (110);
generating a key value segment Sm,n;
searching the lookup map for each symbol received in the input string (120, 130);
upon detecting a symbol is not stored in the lookup map, adding the symbol by storing the symbol at a next sequential key index in the lookup map (135) and as¬ signing a next sequential key value entry to the symbol and adding this key value to the key value segment Sm,n (150);
upon detecting the symbol is stored in the lookup map, adding the corresponding key value assigned to this symbol to the next sequential entry of the key value seg¬ ment Sm,n ( 150 ) ;
wherein a new key value segment Sm,n+i of the lookup map is generated if the number of different symbols equals the number of available key values k = 2n for the opened/current key value segment SmrT1 (141, 142), and where¬ in the lookup map is converted into a segment dictionary Dm if the maximal key value size knmax = 2nmax is reached (132, 133, 134), with n being any positive integral number 1 to nmax, nmax denoting the maximal bit size, and m being any
positive integral number denoting the consecutive numbering of segment dictionaries Dm.
2. The method of claim 1, wherein the plurality of segment dictionaries Dm with the corresponding key value segments SmrT1 form a column store database.
3. The method of claim 1 or 2, wherein the maximal key value size kniinax is predeterminable .
4. An apparatus (20) for adaptively compressing an input string (10) having a sequence of symbols, the appa¬ ratus (20) comprising a database storage (DB) for storing a plurality of segment dictionaries (Dm) , a receiver (30) for receiving an input string (10), an analyzer (40) for excerpting each symbol from the sequence of symbols of the input string (10), a generator (60) for generating a lookup map (Lm) and a key value segment (Sm,n) , an engine (50) for detecting whether a symbol is contained in a lookup map (Lm) , and a converter (70) to convert a lookup map (Lm) in¬ to a segment dictionary (Dm) ,
wherein the generator (60) generates a lookup map (Lm) and a key value segment (Sm,n) once the receiver (30) re¬ ceives an input string (10), and the engine (50) searches the lookup map (Lm) for each symbol received in the input string (10) excerpted by the analyzer (40),
wherein upon detecting a symbol is not stored in the lookup map (Lm) , the symbol is added to the lookup map (Lm) by storing the symbol at a next sequential key index in the lookup map (Lm) and a next sequential key value entry is assigned to the symbol and added to the key value segment (Sm,n), and
wherein upon detecting the symbol is already stored in the lookup map (Lm) , the corresponding key value assigned to this symbol is added to the next sequential entry of the key value segment (Sm,n)
wherein the generator (60) generates a new key value segment (Sm,n+i) of the lookup map (Lm) if the number of dif¬ ferent symbols equals the number of available key values k = 2n for the opened new key value segment (Sm,n) and where¬ in the converter (70) converts the lookup map into a seg- ment dictionary (Dm) if the maximal key value size knmax = 2nmax is reached, with n being any positive integral number 1 to nmax, nmax denoting the maximal bit size, and m being any positive integral number denoting the consecutive num¬ bering of segment dictionaries (Dm) .
5. The apparatus of claim 4, wherein the plurality of segment dictionaries (Dm) with the corresponding key value segments (Sm,n) are stored in the database storage (DB) as a column store database.
6. The apparatus of claim 4 or 5, wherein the con¬ verter (70) prompts the generator (60) to generate a new lookup map (Lm+i) upon converting the opened lookup map (Lm) into a segment dictionary (Dm) .
7. A computer program comprising program code means for performing a method of any one of claims 1 to 3 when said program is executed on a database computer system.
8. A computer program product comprising a computer readable medium on which a computer program according to claim 7 is stored.
9. A digital database (DB) generated according to a method of any one of claims 1 to 3.
10. A digital database (DB) comprising for one column a plurality of segment dictionaries Dmi m being any positive integral number denoting the consecutive numbering of seg¬ ment dictionaries Dm, each segment dictionary Dm having a number of correlated key value segments Sm,n/ with n being any positive integral number 1 to nmax, nmax denoting the maximal bit size of the key values, and a segment diction¬ ary Dm having a maximum number of knmax = 2nmax entries.
11. The digital database (DB) according to claim 10, wherein the segment dictionaries Dm and correlated key val- ue segments Sm,n form a column store database.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/551,377 US10084477B2 (en) | 2015-02-17 | 2016-02-17 | Method and apparatus for adaptive data compression |
EP16710100.5A EP3259849A1 (en) | 2015-02-17 | 2016-02-17 | Method and apparatus for adaptive data compression |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102015002138.9 | 2015-02-17 | ||
DE102015002138 | 2015-02-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016131554A1 true WO2016131554A1 (en) | 2016-08-25 |
Family
ID=55538173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2016/025015 WO2016131554A1 (en) | 2015-02-17 | 2016-02-17 | Method and apparatus for adaptive data compression |
Country Status (3)
Country | Link |
---|---|
US (1) | US10084477B2 (en) |
EP (1) | EP3259849A1 (en) |
WO (1) | WO2016131554A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722286A (en) * | 2020-05-26 | 2021-11-30 | 杭州海康威视数字技术股份有限公司 | Space-time data compression method, device, server and storage medium |
CN117614699A (en) * | 2023-11-28 | 2024-02-27 | 安徽南瑞中天电力电子有限公司 | Long-distance power grid equipment communication system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6606040B2 (en) | 2001-02-13 | 2003-08-12 | Mosaid Technologies, Inc. | Method and apparatus for adaptive data compression |
WO2010050924A1 (en) * | 2008-10-27 | 2010-05-06 | Micro Motion, Inc. | Method and apparatus for compressing and decompressing data records |
US20130318093A1 (en) * | 2012-05-23 | 2013-11-28 | Thomson Reuters Global Resources | Short string compression |
US20140085115A1 (en) | 2012-09-25 | 2014-03-27 | International Business Machines Corporation | Data compression using dictionary encoding |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2251097B (en) | 1990-12-08 | 1995-05-10 | Dowty Information Systems | An adaptive data compression system |
AU658432B2 (en) | 1991-06-04 | 1995-04-13 | Qualcomm Incorporated | Adaptive block size image compression method and system |
KR0141824B1 (en) | 1991-12-23 | 1998-07-15 | 구자홍 | Image compressing method & apparatus of variable length |
US5376968A (en) | 1993-03-11 | 1994-12-27 | General Instrument Corporation | Adaptive compression of digital video data using different modes such as PCM and DPCM |
US5682152A (en) | 1996-03-19 | 1997-10-28 | Johnson-Grace Company | Data compression using adaptive bit allocation and hybrid lossless entropy encoding |
US6192157B1 (en) | 1998-10-27 | 2001-02-20 | Hewlett-Packard Company | Modifications of postscript adaptive data compression (ADC) for 3 plane, 8 bit color images, JPEG lossy compression, and variable Q factors |
US7180909B1 (en) * | 2001-12-17 | 2007-02-20 | Supergate Technology Usa, Inc. | Interface receive circuits for modularized data optimization engines and methods therefor |
US7167115B1 (en) * | 2005-08-26 | 2007-01-23 | American Megatrends, Inc. | Method, apparatus, and computer-readable medium for data compression and decompression utilizing multiple dictionaries |
US7460032B2 (en) | 2005-10-27 | 2008-12-02 | Evault, Inc. | Methods and apparatus for performing adaptive compression |
-
2016
- 2016-02-17 WO PCT/EP2016/025015 patent/WO2016131554A1/en active Application Filing
- 2016-02-17 EP EP16710100.5A patent/EP3259849A1/en not_active Withdrawn
- 2016-02-17 US US15/551,377 patent/US10084477B2/en not_active Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6606040B2 (en) | 2001-02-13 | 2003-08-12 | Mosaid Technologies, Inc. | Method and apparatus for adaptive data compression |
WO2010050924A1 (en) * | 2008-10-27 | 2010-05-06 | Micro Motion, Inc. | Method and apparatus for compressing and decompressing data records |
US20130318093A1 (en) * | 2012-05-23 | 2013-11-28 | Thomson Reuters Global Resources | Short string compression |
US20140085115A1 (en) | 2012-09-25 | 2014-03-27 | International Business Machines Corporation | Data compression using dictionary encoding |
Non-Patent Citations (2)
Title |
---|
ANDERSON K: "OVERVIEW OF HUFFMAN ENCODING AS COMPRESSION TECHNIQUE", COMPUTER TECHNOLOGY REVIEW, WESTWORLD PRODUCTION, BEVERLY HILL, CA, US, vol. 11, no. 6, 1 May 1991 (1991-05-01), pages 97 - 98,100, XP000230832, ISSN: 0278-9647 * |
SALOMON D ED - SALOMON DAVID: "DATA COMPRESSION: THE COMPLETE REFERENCE, STATISTICAL METHODS", 1 January 1998, DATA COMPRESSION : THE COMPLETE REFERENCE, SPRINGER, NEW YORK, PAGE(S) 21 - 24, ISBN: 978-0-387-98280-9, XP002296642 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722286A (en) * | 2020-05-26 | 2021-11-30 | 杭州海康威视数字技术股份有限公司 | Space-time data compression method, device, server and storage medium |
CN117614699A (en) * | 2023-11-28 | 2024-02-27 | 安徽南瑞中天电力电子有限公司 | Long-distance power grid equipment communication system |
CN117614699B (en) * | 2023-11-28 | 2024-04-30 | 安徽南瑞中天电力电子有限公司 | Long-distance power grid equipment communication system |
Also Published As
Publication number | Publication date |
---|---|
US20180041223A1 (en) | 2018-02-08 |
US10084477B2 (en) | 2018-09-25 |
EP3259849A1 (en) | 2017-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6484142B1 (en) | Encoder using Huffman codes | |
RU2125765C1 (en) | Symbol compression method and device, statistical coder (options) | |
US5696507A (en) | Method and apparatus for decoding variable length code | |
JP3553106B2 (en) | Text compression driver construction method and input text string compression method | |
US5396595A (en) | Method and system for compression and decompression of data | |
KR101049699B1 (en) | Data Compression Method | |
EP2295947B1 (en) | Coding method, decoding method,and coding apparatus | |
US5623262A (en) | Multi-word variable length encoding and decoding | |
JP2004240975A (en) | Dna sequence encoder and method | |
JPH0682370B2 (en) | Character processor | |
KR20010085220A (en) | Efficient coding of side information in a lossless encoder | |
JP2003218703A (en) | Data coder and data decoder | |
CN1547805A (en) | Method of performing huffman decoding | |
CN115296862B (en) | Network data safety transmission method based on data coding | |
EP0127815B1 (en) | Data compression method | |
US20080030384A1 (en) | Encoding apparatus, decoding apparatus, encoding method, computer readable medium storing program thereof, and computer data signal | |
US6809665B2 (en) | Apparatus and method for decoding variable length code | |
JPH02503259A (en) | Apparatus for implementing variable length encoding method and variable length decoding method | |
US20120139763A1 (en) | Decoding encoded data | |
US5907635A (en) | Picture data decompression apparatus | |
CN110602498A (en) | Self-adaptive finite state entropy coding method | |
EP3259849A1 (en) | Method and apparatus for adaptive data compression | |
RU2709656C2 (en) | Encoder, decoder and method using modal symbols | |
US7683809B2 (en) | Advanced lossless bit coding | |
CN112612762B (en) | Data processing method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16710100 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15551377 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
REEP | Request for entry into the european phase |
Ref document number: 2016710100 Country of ref document: EP |