WO2008010059A1 - Procédés et dispositifs de compression de documents structurés - Google Patents
Procédés et dispositifs de compression de documents structurés Download PDFInfo
- Publication number
- WO2008010059A1 WO2008010059A1 PCT/IB2007/001992 IB2007001992W WO2008010059A1 WO 2008010059 A1 WO2008010059 A1 WO 2008010059A1 IB 2007001992 W IB2007001992 W IB 2007001992W WO 2008010059 A1 WO2008010059 A1 WO 2008010059A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- event
- stream
- byte
- sequence
- aligned
- Prior art date
Links
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/149—Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format
Definitions
- the present invention relates in general to the field of computer systems for transmitting, storing, retrieving and displaying data. It more particularly relates to a method and system for compressing and decompressing structured documents having a structure which is not necessarily known.
- a structured document is a set of information elements each associated with a type and attributes, and interconnected by relationships that are mainly hierarchical.
- Such documents use a markup language such as Standard Generalized Markup Language (SGML), Hypertext Markup Language (HTML), or Extensible Markup Language (XML), serving in particular to distinguish between the various elements of information making up the document.
- SGML Standard Generalized Markup Language
- HTML Hypertext Markup Language
- XML Extensible Markup Language
- the content information of the document is mixed in with layout information and type information.
- a structured document includes markers also called "tags” for separating different information element in the document.
- tags For SGML, XML, or HTML formats, these tags have the form " ⁇ XXXX>" and “ ⁇ /XXXX>", the first tag “XXXX” marking the beginning of an information element, and the second tag “ ⁇ /XXXX>” marking the end of said element.
- An information element may itself be made up of a plurality attributes and lower-level information elements also called “subelements”.
- a structured document presents a tree or hierarchical structure, each node representing an information element and being connected to a node at a higher hierarchical level representing an information element that contains the information elements at lower level.
- the nodes located at the ends of branches in such a tree structure represent information elements containing data of a predetermined unstructured type, which is not divided into information subelements.
- a structured document contains separation markers or tags generally represented in textual form, said tags defining information elements or subelements that can themselves contain other information subelements separated by tags.
- markup languages such as XML are verbose languages and thus they are inefficient to be processed and costly to be transmitted or stored.
- many software applications tend to produce very large structured documents. This is particularly the case of software applications creating HTML documents and digital graphical documents such as scene description, art, technical drawings, schematics and the like.
- the documents produced by graphical applications include graphical data describing a large number of points, lines and curves.
- graphical objects are described by graphical structured elements using a language such as SVG (Scalable Vector Graphics) describing two-dimensional vector and mixed vector/raster graphic objects.
- ISO/IEC 15938-1 MPEG-7 - Moving Picture Expert Group
- ISO/IEC 23001-1 proposes a method and a binary format for encoding (compressing) a XML structured document and decoding such a binary format.
- This standard is more particularly designed to deal with highly structured data, such as multimedia metadata, having a known structure defined in one or more schemas.
- ISO/IEC 23001-1 provide a binary stream which is not byte-aligned.
- conventional compression algorithms such as ZLIB (Compression Library) are not efficient to further compress the binary streams provided by these compression methods.
- ZLIB Compression Library
- An embodiment is to enable a structured document to be encoded and decoded without using a schema defining the structure of the document.
- An embodiment provides a compression method providing a byte-aligned binary stream that can be further processed by a conventional compression algorithm such as ZLIB.
- one embodiment provides a compression method of compressing a structured document having a tree-like structure comprising structured elements nested in each other, each said structured elements comprising structuring elements defining the structure of the element and delimiting at least one value element which is a set of at least one structured element or unstructured element.
- the method comprises steps of: converting the structured document into a stream of events comprising events corresponding to structuring elements of the structured document, and encoding the event stream by generating a binary stream comprising byte-aligned codes each encoding an event or at least a second occurrence of a sequence of consecutive events occurring in the event stream.
- the method comprises a step of applying a compression algorithm to the binary stream to obtain a compressed binary stream.
- the compression algorithm is ZLIB.
- the encoding step comprises for an event of the event stream: attributing one byte-aligned code to an event sequence including at least two consecutive events occurring in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- the encoding step comprises for an event of the event stream: attributing one byte-aligned code to an event sequence including three consecutive events occurring in the event stream and ending with said event, attributing one byte-aligned code to an event sequence including two consecutive events occurring in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- a correspondence table establishes a link between each byte-aligned code of the binary stream and an event or an event sequence, said correspondence table being of limited size.
- a new event or event sequence is inserted in the correspondence table by replacing an oldest event or event sequence by the new event or event sequence, so that the byte-aligned code of the oldest event or event sequence is attributed to the new event or event sequence.
- the byte-aligned codes are one byte long.
- Another embodiment of the present invention provides a decompression method of decompressing a binary stream resulting from compression of an original structured document, the original structured document having a tree-like structure comprising structured elements nested in each other, each said structured elements comprising structuring elements defining the structure of the element and delimiting at least one value element which is a set of at least, one structured element or unstructured element.
- the binary stream comprises a succession of byte-aligned codes encoding events of an event stream, said events corresponding to structuring elements of the structured document, said decompression method comprising steps of: decoding the binary stream by generating said event stream which comprises an event or at least a second occurrence of an event sequence of consecutive events for each byte-aligned code of the binary stream, and converting each event of the event stream into structuring elements so as to provide the original structured document.
- the decompression method comprises a previous step of applying a decompression algorithm to a compressed binary stream to obtain the binary stream.
- the decompression algorithm is ZLIB.
- the decoding step comprises for an event of the event stream: attributing one byte-aligned code to an event sequence including at least two consecutive events in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- the decoding step comprises for an event of the event stream: attributing one byte-aligned code to an event sequence including three consecutive events occurring in the event stream and ending with said event, attributing one byte-aligned code to an event sequence including two consecutive events occurring in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- a correspondence table establishes a link between each byte-aligned code of the binary stream and an event or an event sequence, said correspondence table being of limited size.
- a new event or event sequence is inserted in the correspondence table by replacing an oldest event or event sequence by the new event or event sequence, so that the byte-aligned code of the oldest event or event sequence is attributed to the new event or event sequence.
- the byte-aligned codes are one byte long.
- a compression device for compressing a structured document having a tree-like structure comprising structured elements nested in each other, each said structured elements comprising structuring elements defining the structure of the element and delimiting at least one value element which is a set of at least one structured element or unstructured element.
- the compression device comprises: a converter for converting the structured document into a stream of events comprising events corresponding to structuring elements of the structured document, and an encoder for encoding the event stream by generating a binary stream comprising byte-aligned codes each encoding an event or at least a second occurrence of a sequence of consecutive events occurring in the event stream.
- the compression device comprises a compression module for applying a compression algorithm to the binary stream to obtain a compressed binary stream.
- the compression algorithm is ZLIB.
- the encoder is configured to process an event of the event stream by: attributing one byte-alig ⁇ ed code to an event sequence including at least two consecutive events in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- the encoder is configured to process an event of the event stream by: attributing one byte-aligned code to an event sequence including three consecutive events occurring in the event stream and ending with said event, attributing one byte-aligned code to an event sequence including two consecutive events occurring in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- a correspondence table establishes a link between each byte-aligned code of the binary stream and an event or an event sequence, said correspondence table being of limited size.
- the encoder is configured to insert a new event or event sequence in the correspondence table when it is full, by replacing an oldest event or event sequence by the new event or event sequence, so that the byte-aligned code of the oldest event or event sequence is attributed to the new event or event sequence.
- the byte-aligned codes are one byte long.
- Another embodiment of the present invention provides a decompression device for decompressing a binary stream resulting from compression of an original structured document, the original structured document having a tree-like structure comprising information elements nested in each other, each said structured elements comprising structuring elements defining the structure of the element and delimiting at least one value element which is a set of at least one structured element or unstructured element.
- the binary stream comprises a succession of byte-aligned codes encoding events of an event stream, said events corresponding to structuring elements of the structured document
- said decompression device comprising: a decoder for decoding the binary stream by generating said event stream which comprises an event or at least a second occurrence of an event sequence of consecutive events for each byte-aligned code of the binary stream, and a converter for converting each event of the event stream into structuring elements so as to provide the original structured document.
- the decompression device comprises a decompression module for applying a decompression algorithm to a compressed binary stream to obtain the binary stream.
- the decompression algorithm is ZLIB.
- the decoder is configured to process an event of the event stream by: attributing one byte-aligned code to an event sequence including at least two consecutive events in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- the decoder is configured to process an event of the event stream by: attributing one byte-aligned code to an event sequence including three consecutive events occurring in the event stream and ending with said event, attributing one byte-aligned code to an event sequence including two consecutive events occurring in the event stream and ending with said event, and attributing one byte-aligned code to said event.
- a correspondence table establishes a link between each byte-aligned code of the binary stream and an event or an event sequence, said correspondence table being of limited size.
- the decoder is configured to insert a new event or event sequence in the correspondence table when it is full, by replacing an oldest event or event sequence by the new event or event sequence, so that the byte-aligned code of the oldest event or event sequence is attributed to the new event or event sequence.
- Figure 1 represents in block form a structured document
- Figure 2 represents in block form a structured document compression device according to one embodiment of the present invention
- Figure 3 represents in block form a structured document decompression device according to one embodiment of the present invention
- Figures 4 to 6 are flow charts of procedures executed by the compression device of Figure 2
- Figure 7 is a flow chart of a procedure executed by the decompression device of Figure 3.
- Figure 1 represents a structured document 1 comprising a header HD and a main element MEL.
- the main element MEL comprises a type identifier Type, a set of attributes Att.l, Att.2, ... Attn and a value VaI.
- the value of the main element MEL may include one or more structured elements 4 called "subelements of the main element", each comprising a type identifier Type, a set of attributes Att.l- Attn and a value VaI.
- the value of each element 4 may itself also include one or more structured or unstructured subelements.
- the unstructured elements have a known format such as string, integer number, floating-point number, ...
- Each element or subelement is associated with a type defining the structure of the element.
- Each type of the elements of a structured document may be defined in a schema (for example XML schema in XML language).
- a structured element of a structured document has the following form in
- XML or in languages derived from XML such as HTML and SVG:
- "type” is a type identifier of the structured element
- " ⁇ /type>” is an end tag delimiting the end of the element in the document
- value is the value of the element which may comprise structured or unstructured subelements.
- Figure 2 represents a compressing device according to an embodiment of the invention.
- the compressing device comprises a parser XSXP receiving a structured document DOC to be compressed, a binary encoder BCD, and preferably a compression module ZIP such as ZLIB providing a compressed binary stream BDOC.
- the parser analyzes the structured document DOC in the form of an alphanumerical document and identifies structuring elements of the document, i.e. alphanumerical strings defining the tags, attributes and values of the elements composing the document and converts these structuring elements into a stream of events EVST.
- the generated event stream EVST comprises at least one event such as a SAX event (SAX: Simple API - Application Program Interface - for XML) for each structuring element of the document DOC.
- SAX is defined in detail in http://www.saxproject.org/. For example, the apparition of a XML opening or closing tag in a XML document is a SAX event.
- the binary encoder BCD converts the event stream EVST into a binary stream BST.
- the binary stream comprises a byte code for each event or event sequence of two or three consecutive events. Every occurrence of a new event sequence of two or three consecutive events in the event stream is memorized and a byte code is attributed to the event sequence. When another occurrence of a memorized event sequence is detected in the event stream, the sequence is encoded using the byte code attributed thereto.
- the binary encoder BCD uses a symbol table STl and a symbol code map table SCMl, these tables containing events provided by the converter XSXP. These tables are initialized with a table TNEVT containing possible events or most frequent events that can be provided by the converter XSXP.
- the table SCMl establishes a correspondence between each event or event sequence in the table STl and a byte code used by the encoder BCD to encode the event or event sequence.
- the table STl contains the last events or event sequences encoded by the encoder.
- the binary stream provided by the binary encoder BCD is byte-aligned, i.e. each byte or sequence of successive bytes of the binary stream corresponds to a part of the structured document DOC which is encoded using an integer number of bytes.
- a compression algorithm such as ZLIB can be applied with efficiency to the binary stream BST provided by the binary encoder BCD.
- the binary stream BST is further compressed by a compression module ZIP which provides a compressed binary stream BDOC.
- the module ZIP implements a conventional compression algorithm such as ZLIB.
- Figure 3 represents a decompressing device according to an embodiment of the invention.
- the decompressing device comprises a binary decoder BDCD, and a converter SXXP providing a structured document DOC which is the same as the document applied to the compression device.
- the decompressing device further comprises a decompression module DZIP applying to the binary stream BDOC an initial decompression processing implementing a conventional decompression algorithm such as ZLIB and providing a binary stream BST which is processed by the decoder BDCD.
- a decompression module DZIP applying to the binary stream BDOC an initial decompression processing implementing a conventional decompression algorithm such as ZLIB and providing a binary stream BST which is processed by the decoder BDCD.
- the binary decoder BDCD converts the binary stream BST applied to the decompression device or provided by the decompression module DZIP into a stream of events EVST.
- the converter SXXP converts the event stream EVST provided by the binary decoder BDCD into tags constituting the structured document DOC.
- the binary decoder BDCD uses a symbol table ST2 and a symbol code map table SCM2 which are similar to the tables STl and SCMl used by the encoder BCD. These tables are initialized with the same table INEVT containing possible events or most frequent events that can be provided by the converter XSXP.
- the table SCM2 establishes a correspondence between each event or event sequence in the table ST2 and a byte code that may appear in the binary stream BST.
- the table ST2 contains the last events or event sequences appearing in the event stream EVST provided by the decoder BDCD. hi the case of XML and SAX, the SAX events are listed in the table below:
- AU SAX events are defined by a corresponding SAX event callback.
- the three special events ADD_NS, ADD_ENAME and ADD_ANAME are used to dynamically add a namespace, an element name or an attribute name in the corresponding dictionary.
- the UID number is a fixed numerical ID able to unambiguously define an event, but this is not the value used to encode an event, as explained below.
- An event can carry zero, one or several parameters, which can be strings or numerical values, which point to a corresponding dynamic strings dictionary. Strings are encoded in UTF-8 format, with a terminating zero.
- Figure 4 represents a process performed by the binary encoder BCD.
- the process of figure 4 comprises steps S1-S17.
- This process uses the symbol table STl and the symbol code map table SCMl.
- the table STl is previously initialized with table INEVT so as to contain all events listed above in Tables 1 and 2.
- Tables STl and SCMl contain a limited number of events which is equal to 127 for example.
- step Sl the stream of events EVST is read event by event until all events in the stream EVST provided by the converter XSXP from the document are processed (step S2).
- three events are loaded into a FIFO buffer (First-In First-Out) Bevt.
- Table SCMl establishes a correspondence between each event in the table STl and a code which is used to encode an event in the binary stream BST generated by the encoder BCD.
- the code corresponding to each event in the tables STl and SCMl is equal for example to the position of the event in the table SCMl.
- each of the memory location of the buffer Bevt is compared to null. If only the third event Bevt[0] in the buffer Bevt is not null, a symbol sym containing the event Bevt[0] is generated at step S6.
- the code corresponding to the event Bevt[0] is determined from table SCMl . This code and the parameters associated with the event are inserted into the binary stream BST generated by the encoder BCD.
- the symbol sym is used to update the symbol table STl and the symbol code map table SCMl.
- the process of updating tables STl and SCMl which will be explained below in reference of Figures 5 and 6 consists in inserting the symbol into the tables if it is not already in these tables and putting the symbol at the beginning of the table STl.
- table STl is arranged so that the newest symbols are listed at the beginning of the table.
- the oldest event Bevt[0] is pushed outside the buffer Bevt at step S9 and the process continues with a new iteration at step Sl where a new event is read in the event stream EVST and loaded in the location Bevt[2] of the buffer Bevt.
- step S8 tables STl and SCMl are updated with the events contained in the symbol sym
- step S9 the buffer Bevt is shifted once more. It should be noted that the execution of steps S 12 and S9 pushes two events outside the buffer Bevt since the two events Bevt[0] and Bevt[l] have been processed. Then the process continues with a new iteration at step Sl where two new events are read in the event stream EVST and loaded in the location Bevt[l] and Bevt[2] of the buffer Bevt.
- This code and the parameters associated with the events Bevt[0], Bevt[l] and Bevt[2] are inserted into the binary stream BST generated by the encoder BCD.
- the process continues at step S8 where tables STl and SCMl are updated with the three events contained in the symbol sym, and at step S9 where the buffer Bevt is shifted once more.
- the buffer Bevt is thus shifted three times at steps S 16, S 12 and S9. Therefore the buffer Bevt is empty when executing step Sl again for a new iteration since three events were processed.
- Figure 5 represents a process 20 executed at step S8.
- the process 20 which comprises steps S21 to S28 updates the tables STl and SCMl with one symbol sym that can contain up to three events.
- a counter i is initialized.
- steps S23 and S24 are executed to test whether the two previously processed events memorized by variables evt-2 and evt-1 are null or not. If the two previously processed events are null, a procedure of inserting a symbol equal to the event evt into tables STl and SCMl is executed at step S25.
- step S26 the variable evt-2 is updated with the value of variable evt-1, the variable evt-1 is updated with the value of variable evt, and the counter i is incremented by 1. Then the process continues at step S22 for a new iteration to process a symbol sym with two or three events.
- variable evt-1 If the previously processed event memorized in variable evt-1 is not null at step S24, the procedure of insertion of a symbol into tables STl and SCMl is executed at step S27 for inserting a symbol equal to the concatenated events evt-1 and evt. Then the process continues at step S25. Thus if the variable evt-1 is not null the symbols evt-1 //evt and evt are successively inserted into both tables STl and SCMl.
- variable evt-2 If the previously processed event memorized in variable evt-2 is not null at step S23, the procedure of insertion of a symbol into tables STl and SCMl is executed at step S28 for inserting a symbol equal to the concatenated events evt-2, evt-1 and evt. Then the process continues at step S27.
- the variables evt-1 and evt-2 are not null the symbols evt-2//evt-l//evt, evt-1 //evt and evt are successively inserted into both tables STl and SCMl .
- Figure 6 represents a process 30 of insertion of a symbol sym into the symbol table STl and the symbol code map table SCMl.
- the process 30 which comprises steps S31 to S35 is executed at steps S25, S27 and S28 of procedure 20.
- the symbol sym is searched in table STl. If it is found in table STl, the symbol sym is removed from table STl and inserted at the beginning of this table at step S32. Otherwise, if table STl is not full at step S33, the symbol sym is inserted at the beginning of table STl and inserted into table SCMl at a position corresponding to a code equal to the size of table STl (step S34).
- the first bit (most significant bit) of each code encoding an event indicates whether the event is encoded with one or two bytes, and the 7 or 15 other bits is the code of the event provided by table SCMl or INEVT.
- the first bit of the code being equal to 1 indicating that the event is encoded with two bytes.
- This XMS sequence is processed by the parser XSXP which generates the following stream of events: START_ELT_1, START_ELT__2, END_ELT, START_ELT_2, END_ELT (2)
- the symbol table STl after initialization has the following content:
- the events in the table STl are arranged so that the most recent event is memorized at the beginning of the table.
- the event IDLE at the beginning of the table is not an existent event but is used to separate the events inserted during and after initialization of the table.
- the column "code” gives the code corresponding to each symbol in table STl as provided by table SCMl .
- the events START_ELT_1, START_ELT_2 and END_ELT are loaded into the buffer Bevt at step Sl.
- the steps S2, S3, S14, S15, SlO, Sl 1, S6-S9 are successively executed by the encoder BCD.
- the first event START_ELT_1 encoded with the byte 25 is inserted into the binary stream provided by the encoder BCD.
- the event STARTJELT_1 is moved up at the beginning of table STl as follows:
- step S 8 only the symbol START_ELT_1 is inserted at the beginning of table STl since the variables evt-1 and evt-2 of process 20 are null.
- the buffer Bevt is loaded at step Sl with a next event START_ELT_2 of the event stream, so that the buffer contains the events START_ELT_2, END_ELT and START_ELT_2.
- the encoder BCD then executes again steps S2, S3, S14, S15, SlO, SIl, S6-S9.
- step S7 the second event START_ELT_2 encoded with the byte 27 is inserted into the binary stream provided by the encoder BCD.
- step S8 the symbols START_ELT_1//START_ELT_2 and START_ELT_2 are successively inserted at the beginning of table STl (the variable evt-2 is null), as shown in the table below:
- table STl Before inserting the symbol START_ELT_1//START_ELT_2, table STl is full. Therefore, when inserting this symbol, the symbol ATTR_52 at the end of table STl is removed and the code 127 is attributed to the new symbol START_ELT_1//START_ELT_2. Then the symbol START_ELT_2 is moved up at the beginning of table STl.
- the buffer Bevt is loaded at step Sl with a next event END_ELT of the event stream, so that the buffer contains the events END_ELT, START_ELT_2 and END_ELT.
- the encoder BCD then executes again steps S2, S3, S14, S15, SlO, SIl, S6-S9.
- the third event END_ELT encoded with the byte 4 is inserted into the binary stream BST provided by the encoder BCD.
- step S8 the symbols START_ELT_1//START_ELT_2 //END_ELT, START_ELT__2//END_ELT and END_ELT are successively inserted at the beginning of table STl, as shown in the table below:
- START_ELT_2//END_ELT is inserted at the beginning of table STl and receives the code 125 of the last symbol which is removed from the table. Then the symbol END_ELT is moved up at the beginning of table STl .
- the buffer Bevt is loaded at step Sl with a next event of the event stream, so that the buffer contains the events START_ELT_2 and END_ELT in the first and second positions of the buffer.
- the encoder BCD then executes again the steps S2, S3, S 14, Sl 5 and SlO, SIl. Since the symbol START_ELT_2//END_ELT belongs to table STl, the encoder BCD further executes steps S12, S13 and S8, S9. It results that the sequence of the two consecutive events START_ELT_2 and END_ELT are encoded with the single byte code 125 which is inserted into the binary stream provided by the encoder BCD at step S13.
- step S8 the symbols START_ELT_2 and END_ELT are successively moved up at the beginning of table STl and the corresponding sequences of two and three consecutive events inserted in tables STl and SCMl.
- the symbols START_ELT_2//END_ELT//START_ELT_2 are successively moved up at the beginning of table STl and the corresponding sequences of two and three consecutive events inserted in tables STl and SCMl.
- Figure 7 represents a process performed by the binary decoder BDCD.
- the process of Figure 7 comprises steps S41-S45.
- This process also uses a symbol table ST2 and a symbol code map table SCM2.
- the table ST2 is previously initialized and contains all events listed above in Tables 1 and 2.
- Table ST2 and SCM2 contain a limited number of events which is equal to 128 for example.
- the decoder BDCD reads a next code in the binary stream BST to be decoded. If the end of the binary stream BST is not reached at step S42, the decoder executes step S43 where the code read is translated into a symbol, i.e. a sequence of one or more concatenated events, using the table symbol code map
- step S44 the symbol is translated into events.
- tables SCM2 and ST2 are updated by executing the procedure 20 with each event obtained after execution of step 44.
- the decoding process performed by the decoder DBCD will be now described using the above example used for illustrating the encoding process. In this example, the decoder has to decode the byte code sequence (2).
- the decoder reads the code 25 in the binary stream.
- code 25 corresponds to the symbol START_ELT_1 which contains a single event.
- the decoder inserts the event thus obtained into the event stream it provides.
- the table ST2 is then updated by moving up the event START_ELT_1 at the beginning of the table.
- table ST2 contains the symbols of table 4.
- the decoder BDCD reads the next code 27 in the binary stream.
- Code 27 corresponds in table SCM2 to the symbol START_ELT_2 which contains a single event. Then the decoder inserts the event thus obtained into the event stream it provides.
- the tables ST2 and SCM2 are updated by adding the symbol START_ELT_1//START_ELT_2 associated with the code 127, this symbol being put at the beginning of table ST2.
- Table ST2 is further updated by moving up the event START_ELT_2 at the beginning of the table.
- table ST2 contains the symbols of table 5.
- the decoder BDCD reads the next code 4 in the binary stream.
- Code 4 corresponds in table SCM2 to the symbol END_ELT which contains a single event. Then the decoder inserts the event thus obtained into the event stream EVST it provides.
- the tables ST2 and SCM2 are updated by adding the symbol START_ELT_1//START_ELT_2//END_ELT associated with the code 126, and the symbol START_ELT_2//ENDJELT associated with the code 125, these symbols being put at the beginning of table ST2.
- Table ST2 is further updated by moving up the event END_ELT at the beginning of the table.
- table ST2 contains the symbols of table 6.
- the decoder BDCD reads the next code 125 in the binary stream.
- Code 125 corresponds in table SCM2 to the symbol START_ELT_2//END_ELT which contains a two events.
- the decoder inserts the events thus obtained into the event stream EVST it provides.
- the tables ST2 and SCM2 are updated so that the events START_ELT_2 and END_ELT are successively moved up at the beginning of table STl and the corresponding sequences of two and three consecutive events inserted in tables STl and SCMl.
- table ST2 contains the symbols of table 7.
- the decoder BDCD provides the event stream (2) which is then translated by the converter SXXP into the XML tag sequence (1).
- Simulations have been conducted on a collection of 228 MPEG-7 and MPEG- 21 test files with three different setups.
- An external tag name table containing the namespaces, element names and attribute names dealt in these simulations can be used or not.
- a ZLIB post- compression using the tag name table as a bootstrap dictionary is done on the output event streams EVST. Moreover, comments, white spaces between elements and prefix mappings can be transmitted or not.
- a tag name table is used and comments, white spaces and prefix mappings are omitted.
- An average compression ratio of 19.15 is observed with respect to raw input XML files and a ratio of 6.58 is observed when compared to ZLIB-compressed XML input files.
- a tag name table is used and comments, white spaces and prefix mappings are kept.
- An average compression ratio of 4.01 is observed with respect to raw input XML files and a ratio of 1.30 is observed when compared to ZLIB-compressed XML input files.
- API application program interfaces
- StAX Streaming API for XML
- DOM Document Object Model
- the events of the event stream correspond to nodes in the tree which are successively considered according to a predefined path in the tree.
- the invention is not limited to the algorithm described above for generating sequences of one to three events associated to a same code. Sequences of two events or more than three events can be thus generated, so that a single code is used to encode several events.
- other algorithms for grouping together events into sequences can be used. For example it can be provided that a sequence of consecutive events is inserted into tables ST and SCM only after the second occurrence of the event sequence, so as to prevent event sequences that have only one occurrence in the event stream EVST to be inserted into the tables. In this manner, the events are changed in the table ST and SCM not too quickly. Some events appear rarely in an event stream generated from a structured document. Thus it can be provided to prevent generation of event sequences including such events.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Document Processing Apparatus (AREA)
Abstract
L'invention concerne un procédé de compression d'un document structuré (DOC) possédant une structure arborescente comprenant des éléments imbriqués, chacun desdits éléments structurés comprenant des éléments structurants qui en définissent la structure et délimitent au moins un élément de valeur constituant un ensemble d'au moins un élément structuré ou un élément non structuré. Le procédé comprend les étapes consistant à convertir le document structuré (DOC) en un flux d'événements (EVST) comprenant des événements correspondant aux éléments structurants du document structuré; et à coder le flux d'événements en générant un flux binaire (BST) comprenant des codes alignés sur des octets codant chacun un événement ou au moins une deuxième occurrence d'une séquence d'événements consécutifs apparaissant dans le flux d'événements. L'invention permet de comprimer un document XML sans faire appel à un schéma XLM du document.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP07734998A EP2039009A1 (fr) | 2006-07-12 | 2007-07-06 | Procédés et dispositifs de compression de documents structurés |
JP2009518997A JP2009543243A (ja) | 2006-07-12 | 2007-07-06 | 構造化文書の圧縮のための方法と装置 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US80713106P | 2006-07-12 | 2006-07-12 | |
US60/807,131 | 2006-07-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008010059A1 true WO2008010059A1 (fr) | 2008-01-24 |
Family
ID=38578679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2007/001992 WO2008010059A1 (fr) | 2006-07-12 | 2007-07-06 | Procédés et dispositifs de compression de documents structurés |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP2039009A1 (fr) |
JP (1) | JP2009543243A (fr) |
WO (1) | WO2008010059A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103186611B (zh) * | 2011-12-30 | 2016-03-30 | 北大方正集团有限公司 | 一种压缩、解压及查询文档的方法、装置 |
-
2007
- 2007-07-06 EP EP07734998A patent/EP2039009A1/fr not_active Withdrawn
- 2007-07-06 JP JP2009518997A patent/JP2009543243A/ja active Pending
- 2007-07-06 WO PCT/IB2007/001992 patent/WO2008010059A1/fr active Application Filing
Non-Patent Citations (3)
Title |
---|
CHENEY, J.: "Compressing XML with multiplexed hierarchical PPM models", DATA COMPRESSION CONFERENCE, 2001. PROCEEDINGS. DCC 2001., 29 March 2001 (2001-03-29), XP002455711 * |
JAMES CHENEY: "SAX event encoding", INTERNET ARTICLE, 24 November 2000 (2000-11-24), XP002455712, Retrieved from the Internet <URL:http://xmlppm.sourceforge.net/paper/node5.html> [retrieved on 20071018] * |
ÖZDEN M: "A Binary Encoding for Efficient XML Processing", INTERNET CITATION, 17 December 2002 (2002-12-17), XP002386926, Retrieved from the Internet <URL:http://www.ti5.tu-harburg.de/publication/2002/> [retrieved on 20060623] * |
Also Published As
Publication number | Publication date |
---|---|
EP2039009A1 (fr) | 2009-03-25 |
JP2009543243A (ja) | 2009-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20080294980A1 (en) | Methods and Devices for Compressing and Decompressing Structured Documents | |
KR100614677B1 (ko) | 구조화된 문서를 압축/복원하기 위한 방법 | |
US7707154B2 (en) | Method and devices for encoding/decoding structured documents, particularly XML documents | |
JP4373721B2 (ja) | マークアップ言語文書を符号化するための方法およびシステム | |
EP1562193A1 (fr) | Système pour stocker et rendre des données multimédia | |
US7275060B2 (en) | Method for dividing structured documents into several parts | |
EP1969457A2 (fr) | Objet de representation de schema comprime et procede pour le traitement de metadonnees | |
US8015218B2 (en) | Method for compressing/decompressing structure documents | |
CN102214170A (zh) | 一种xml数据压缩和解压缩方法及系统 | |
US20040111677A1 (en) | Efficient means for creating MPEG-4 intermedia format from MPEG-4 textual representation | |
US7627586B2 (en) | Method for encoding a structured document | |
JP2006517309A (ja) | MPEG−4IntermediaFormatからMPEG−4TextualRepresentationを作成する効率的な手段 | |
US7571152B2 (en) | Method for compressing and decompressing structured documents | |
WO2008010059A1 (fr) | Procédés et dispositifs de compression de documents structurés | |
WO2019018030A1 (fr) | Compression et récupération d'enregistrements structurés | |
KR20050023411A (ko) | 구조화된 문서들, 특히 xml 문서들을인코딩/디코딩하기 위한 방법 및 장치 | |
JP2004342029A (ja) | 構造化文書圧縮方法及び装置 | |
EP1199893A1 (fr) | Méthode pour structurer un flux de données de descriptions multimédia binaires et méthode d'analyse synthaxique associée | |
JP2005276193A (ja) | Dibrデータのためのスキーマ及びスタイルシート |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 2007734998 Country of ref document: EP |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07734998 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2009518997 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: RU |