[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

GB2490731A - Method for encoding and decoding structured data using an associated schema - Google Patents

Method for encoding and decoding structured data using an associated schema Download PDF

Info

Publication number
GB2490731A
GB2490731A GB1108031.4A GB201108031A GB2490731A GB 2490731 A GB2490731 A GB 2490731A GB 201108031 A GB201108031 A GB 201108031A GB 2490731 A GB2490731 A GB 2490731A
Authority
GB
United Kingdom
Prior art keywords
schema
transformation
transformed
structured data
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1108031.4A
Other versions
GB201108031D0 (en
Inventor
Herva Ruellan
Romain Bellessort
Youenn Fablet
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1108031.4A priority Critical patent/GB2490731A/en
Publication of GB201108031D0 publication Critical patent/GB201108031D0/en
Publication of GB2490731A publication Critical patent/GB2490731A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3068Precoding preceding compression, e.g. Burrows-Wheeler transformation
    • H03M7/3079Context modeling
    • G06F17/2252
    • G06F17/30914
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents
    • G06F40/146Coding or compression of tree-structured data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • G06F40/154Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound
    • H03M7/707Structured documents, e.g. XML

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention concerns a method for encoding and decoding structured data (e.g. XML to EXI (Efficient XML Interchange)), using an associated schema. The encoding method comprises: - obtaining S105 a transformed schema ST by applying a first transformation T1 (e.g. based on efficiency criteria) to the associated schema S, the transformed schema defining a modified structure model, - obtaining S110, from the first transformation T1, a second transformation T2 (e.g. XSLT stylesheet) which defines modifications to apply to structured data D conforming to the associated schema S to obtain transformed structured data DT conforming to the transformed schema ST; - applying S125 the obtained second transformation T2 to structured data D to be encoded, so as to obtain transformed structured data DT; and - encoding S130 the transformed structured data DT using the obtained transformed schema ST. The invention transforms data to encode into a representation that is more suitable for efficient encoding than the original representation. It results in encoding cost savings.

Description

I
METHOD FOR ENCODING AND DECODING STRUCTURED DATA USING AN
ASSOCIATED SCHEMA, AND CORRESPONDING DEVICES
FIELD OF THE INVENTION
The present invention concerns a method for encoding and decoding structured data using an associated schema.
A particular but non-exclusive application of the present invention is the encoding of a structured document such as an XML file (standing for "eXtensible Markup Language") into an EXI bitstream (standing for "Efficient XML Interchange"), using the structure model of an XML Schema as prior knowledge.
A corresponding exemplary application is the decoding of the EXI bitstream obtained, still using the structure model of the XML Schema as prior knowledge.
BACKGROUND OF THE INVENTION
The XML format is a syntax for defining computer languages, which makes it possible to create languages adapted to different uses which may however be processed by the same tools.
An XML document is composed of elements, each element being delimited by an opening tag (or start tag) comprising the name of the element (for example: <tag>) and a closing tag (or end tag) which also comprises the name of the element (for example </tag>). Each element may contain other elements in a hierarchical child-parent relationship and/or contain text data or a value defining content. Given these structure tags, data in such a document is referred to as "structured data".
The definition of an element may also be refined by a set of attributes, each attribute being defined by a name and having a value. The attributes are then placed in the start tag of the element whose definition they are refining (for example <tag attribute="value">).
XML syntax also makes it possible to define comments (for example: "<!--Comment-->") and processing instructions, which may specify to a computer application what processing operations to apply to the XML document (for example: "c?myprocessing?>").
Several markup languages based on the XML language may contain elements with the same name. To allow several different languages to be mixed in the same document, an extension of the XML language has been specified to define "Namespaces" for XML. Using this extension, two elements are identical only if they have the same name (also named "local-name") and are located in the same namespace. A namespace is defined by a URI (standing for "Uniform Resource Identifier"), for example "http://canon.crffr/xml/mylanguage". The use of a namespace in an XML document is via the definition of a prefix which is a compact representation of the URI of the namespace. The prefix is defined using a specific attribute (for example "xmlns:ml="http://canon.crifr/xml/mylanguage" binds the prefix "ml" to the URI "http://canon.crf.fr/xml/mylanguage"). Next, the namespace of an element or of an attribute is specified by preceding its name with the prefix associated with the namespace followed by ":" (for example "cml:tag ml:attribute="value">" indicates that the element "tag" arises from the namespace ml and that the same applies for the attribute "attribute").
In XML terminology, the set of the terms "element", "attribute", "text data", "comment", "processing instruction" and "escape section" are grouped together under the generic name of "item".
To process an XML document, it must be read from memory. However, there exists more than one reading approach.
A first family of reading approaches consists in representing all the XML data in memory as a tree, such as the DOM API (standing for "Document Object Model'). The methods of this family generally enable easy and fast access to each part of the XML document. However, it requires a large amount of memory to store all the data simultaneously.
A second family of reading approaches consists in reading the XML data as a sequence of events, processing one event at a time. The methods of this family, such as for example the SAX API (standing for "Simple API for XML"), allow reading of the XML document by streaming, enabling use of little memory space. However, easy access to any desired part of the data is not provided by these approaches.
As shown and suggested above, the XML items may be described in terms of events, also called XML events. Thus, there are XML events corresponding to a piece of structure information such as the start tag and the end tag, and there are XML events representing content, for example the attributes or the text data of the elements or comments.
For example, the following XML events can be used to describe an XML document: -SE: start tag event; -EE: end tag event; -AT: attribute event; -CH: character or text event.
XML has many advantages and has become a reference language for storing data in a file or for exchanging data. XML in particular provides access to numerous tools for processing the files generated. Moreover, an XML document can be edited manually with a simple text editor. In addition, as an XML document contains its structure integrated in the data, such a document is highly legible even without knowing
its specification.
However, the main drawback of XML syntax is to be verbose. This means that the size of an XML document may be several times greater than the intrinsic size of the raw data. This large size of XML documents also gives rise to a long processing time when XML documents are generated and read. It also leads to a long transmission time.
To remedy these drawbacks, methods of compressing and encoding an XML document have been sought. The aim of these methods is to encode the content of an XML document in a more efficient form, while enabling the XML document to be reconstructed easily.
This is in particular the case for the Binary XML formats that produce binary bitstream.
The simplest binary way to encode XML data is to encode markup information using a binary format instead of a text format. This has been improved by eliminating or decreasing redundancy in markup information, for example by avoiding specifying the name of the element in both the start tag and the end tag. Such mechanisms are used in all Binary XML formats.
More advanced mechanisms, such as some involved in EXI and Fast Infoset formats, use one or more indexing tables for encoding and decoding XML data, for example string tables or "partitions" as defined in the EXI recommendation. These indexing tables are very efficient when encoding repeated strings such as item names, in particular element or attribute names that are usually repeated in the XML data.
In practice, on the first occurrence of a repeated name, it is encoded literally as a string and an index is associated with it in a corresponding entry of the indexing table. The indexes are incremented each time, and later coded over the fewest number of bits adapted to represent all the N indexes of the table.
Next, on each further occurrence of that name, it is encoded using the index of the associated entry, instead of the full string. This allows the size of the encoded document to be reduced, but also allows the parsing speed of the document to be increased. This is because, when parsing an item or event, the parser only needs to decode an index instead of parsing an entire string, and then compare integers instead of strings.
Indexing tables also exist for indexing content values such as attribute values or textual content.
Another mechanism that provides additional compression efficiency acts on the structure of the data. Indexing tables are used to describe the structure of the elements and maps a structure item with an index. For example, the same index can be used for each item or event having the same given name.
For example, on the first occurrence of a child element inside the parent element content, a new entry describing that child element is added to the indexing table. Further occurrences of similar child elements inside an occurrence of the same parent element can then be encoded using the index associated with that entry.
Such a mechanism is used by the EXI format where the indexing structure tables are named "grammars", while their entries associated with XML events are named "productions".
Usually, the indexing tables for encoding or decoding structured data, such as the grammars, are directly created and updated from the structured data to process when it is encoded or decoded.
However, those indexing tables may also be created prior to encoding or decoding the structured data. This is done by using prior knowledge, for example by using an associated schema, such as an XML Schema, defining a structure model of data that describes the structured data.
To be precise, XML schema (of which a description may be found at the addresses: http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/) is a structured language (XML based) which defines the structure and the types of data present in a set of XML documents.
A structured document written in XML schema is like a "directory" of the types of data authorized and a structure model for all the XML documents conforming to that schema. This may for example concern the integer, float, or string type of a value, but also the name (qname) of particular structure information.
An XML Schema is used by both the encoder and the decoder to create the grammars before processing the data.
Generally, using an XML Schema leads to improved compression of the structure since no learning is actually needed and codes obtained from schema-informed grammars are generally shorter than codes obtained from learned grammars.
For instance, schema-informed grammars produce shorter codes for a sequence of four mandatory elements (0 bit for each element -all predictable) than the learned grammar counterpart (2 bits for each element as learned grammars are modelled as a choice).
In addition, knowledge of the schema makes it possible to give a type to the XML values and to use specific encoding types that are generally more compactly encoded than the default string type encoding. For instance, Boolean values can be represented as 1 bit while their default string representation will be coded using at least 16 bits.
However, there are numerous cases where the grammars created from the XML Schema are inefficient or of low efficiency. For example, an XML Schema may be lax, allowing many options or possibilities for the structure elements. The created grammars, which cover all the possibilities, comprise numerous entries that are therefore costly to encode with EXI.
A reason of the lack of efficiency when using an XML Schema to create indexing encoding tables can be that the XML Schema was originally created for validating or defining the structure of an XML document. It was not designed for the purpose of encoding.
The prior art provides few solutions to mitigate this situation.
One is known from publication US 7,707,154 which suggests simplifying the XML Schema before encoding XML documents based on the simplified XML Schema.
However, compression improvement with this approach remains limited, as the proposed simplification change the XML Schema without changing the definition of the structure of the XML documents.
In this context, the present invention seeks to improve the processing, encoding and decoding of structured data when using such a schema. Encoded structured data with higher compression may then be obtained and decoded.
SUMMARY OF THE INVENTION
To that end, a first aspect of the invention relates to various methods as defined in the appended claims.
In particular, an encoding method for encoding structured data using an associated schema which defines a structure model of data, comprises, according to the invention, the following steps: -obtaining a transformed schema by applying a first transformation to the associated schema, the transformed schema defining a modified structure model, -obtaining, from the first transformation, a second transformation which defines modifications to apply to structured data conforming to the associated schema to obtain transformed structured data conforming to the transformed schema; -applying the obtained second transformation to structured data to be encoded, so as to obtain transformed structured data; and -encoding the transformed structured data using the obtained transformed schema.
Thanks to the invention, the compression or encoding of structured data can be significantly improved, in particular where the associated schema is not suitable for efficient encoding.
This is achieved by transforming the structure of the structured data into another structure model more suitable for encoding.
According to the invention, the first transformation is obtained to convert the structure model of the schema into the more suitable structure model. Choice of this first transformation preferably takes into account the nature of the encoding mechanism, for example an EXI compression mechanism. The resulting schema is referred to herein as the transformed schema.
In parallel, the structured data have to be transformed to reflect the transformed schema. This is done by obtaining the second transformation from the first transformation, which is then applied to the structured data.
Efficient encoding of the structured data can then be performed, thanks to the more efficient transformed schema that is used as prior knowledge for encoding the (transformed) structured data conforming to that transformed schema.
At the decoding end, a decoding method for decoding encoded structured data, the structure of the structured data conforming to an associated schema that defines a structure model of data, comprises, according to the invention, the following steps: -obtaining a transformed schema defining a transformed structure model, the transformed schema being the associated schema modified by a first transformation, -obtaining, from the first transformation, a second transformation which defines modifications to apply to transformed structured data conforming to the transformed schema to obtain structured data conforming to the associated schema; -decoding the encoded structured data using the obtained transformed schema to obtain decoded transformed structured data; -applying the obtained second transformation to the decoded transformed structured data to obtain decoded structured data conforming to the associated schema.
Such decoding may be applied to the encoded structured data obtained when applying the encoding method as defined above.
Similarly to the encoding, the decoding method performs a translation of reference structure for the structured data to pass from a (transformed) schema efficient for compressing the data (i.e. used when decoding) to a less efficient schema that was originally associated with the structured data and known by the users.
Correlatively, a second aspect of the invention relates to various encoding and decoding devices as defined in the appended claims.
In particular, an encoding device for encoding structured data using an associated schema which defines a structure model of data, comprises, according to the invention: -obtaining means for obtaining a transformed schema by applying a first transformation to the associated schema, the transformed schema defining a modified structure model, -obtaining means for obtaining, from the first transformation, a second transformation which defines modifications to apply to structured data conforming to the associated schema to obtain transformed structured data conforming to the transformed schema; -a transformation engine for applying the obtained second transformation to structured data to be encoded, so as to obtain transformed structured data; and -encoding means for encoding the transformed structured data using the obtained transformed schema.
Similarly, a decoding device for decoding encoded structured data, the structure of the structured data conforming to an associated schema that defines a structure model of data, comprises, according to the invention: -obtaining means for obtaining a transformed schema defining a transformed structure model, the transformed structure model being the associated schema modified by a first transformation, -obtaining means for obtaining, from the first transformation, a second transformation which defines modifications to apply to transformed structured data conforming to the transformed schema to obtain structured data conforming to the associated schema; -decoding means for decoding the encoded structured data using the obtained transformed schema to obtain decoded transformed structured data; -a transformation engine for applying the obtained second transformation to the decoded transformed structured data to obtain decoded structured data conforming to the associated schema.
A third aspect of the invention relates to an information storage means, able to be read by a computer system, comprising instructions for a computer program adapted to implement one of the methods as set out above, when the program is loaded into and executed by the computer system.
A fourth aspect of the invention relates to a computer program product able to be read by a microprocessor, comprising portions of software code adapted to implement one of the methods as set out above, when it is loaded into and executed by the microprocessor.
The devices, the computer program and the information storage means may have features and advantages that are analogous to those set out above and below in relation to the methods for encoding or decoding structured data.
A fifth aspect of the invention relates to a method for encoding structured data substantially as herein described with reference to, and as shown in, Figures 1 and 2; or Figures 1, 2 and 7; or Figures 1, 2, 7 and 8; or Figures 1, 2, 7 and 9; or Figures 1, 2, 7, 8 and 9 of the accompanying drawings.
Another aspect of the invention relates to a method for decoding encoded structured data substantially as herein described with reference to, and as shown in, Figures 1 and 6 of the accompanying drawings Optional features of the invention are further defined in the dependent appended claims.
In particular, the method may further comprise determining the first transformation to apply to the associated schema, based on an efficiency criterion, wherein the determining comprises comparing an efficiency measure when using the associated schema and the same efficiency measure when using the transformed schema.
The efficiency should be related to the encoding or decoding to provide an appreciable improvement in compressing structured data.
Thanks to the structured nature of the schema, applying the first transformation is facilitated, for example only requiring handling of the structure elements therein (e.g. grouping elements, reordering elements, factorizing elements, etc.) According to a particular feature, determining the first transformation comprises successively considering a set of transformations and selecting a transformation from the set that optimizes the efficiency measure when using the corresponding transformed schema.
According to another particular feature, the efficiency measure when using a schema is selected from the set comprising: the size of sampled data encoded using the schema; the size of memory required to store indexing structure tables created from the schema; the encoding time to encode sampled data using the schema; a complexity measure reflecting the complexity of the transformation associated with the schema; a combination of two or more of those measures.
These are various relevant measures that provide efficient improvement to the compression of structured data, in particular when implementing the EXI compression.
In one embodiment of the invention, determining the first transformation first comprises selecting a type of transformation.
In another embodiment of the invention, determining the first transformation comprises iteratively selecting, from sets of transformations, transformations to apply to different parts of the associated schema.
This results in having a complex first transformation (since it groups several transformations) that may provide a very efficient transformed schema for compressing data.
In yet another embodiment, determining the first transformation comprises iteratively selecting, from sets of transformations, transformations to apply to the same part of the associated schema. This tends to locally (i.e. for a part of the schema and corresponding elements in the structured data) provide an improved compression of structured data.
Of course, transformations to apply to the same part of the schema may be combined with transformations to apply to different parts of the schema.
In yet another embodiment, determining the first transformation takes into account statistics about documents conforming to the associated schema. This provision helps in selecting the transformation that is the most appropriate for known documents, for example with a view to efficiently encoding those documents using the selected transformation. This is because, since there are many ways to modify the schema, such statistics provide indications for efficiently modifying the schema. The first transformation is then determined with a view to optimizing the transformed schema specifically for those documents.
According to a feature of the invention, the first transformation applied to the associated schema includes a transformation from the set comprising: reordering structure elements within the structure model of data; grouping optional child elements of a parent structure element as sub-elements of a new child element; splitting attributes of an element of the structure model into one or more child structure elements; splitting possible value types of a structure element into one or more child structure elements; splitting a value of a structure element into one or more child structure elements; factorizing a respective part of child elements of a parent structure element into one or more attributes of the parent structure element.
All these various transformation examples lead to reducing the number of entries in created EXI grammars, i.e. to reducing the size of the corresponding encoded data. They may be combined together to further improve the efficiency of the invention in reducing the size of the encoded data.
As described below, the possible value types of a structure element are the value type that the structure element can taken in its instances within a structured document conforming to that structure model of data.
Also, the value of a structure element can be composed of a series of atomic elements, for example a function and corresponding parameters, from which the split may be done.
According to another feature of the invention, the method further comprises applying the first transformation to the associated schema, taking into account statistics on structure elements within sampled structured data.
The statistics may be about the structure items of the schema, for example statistics about the occurrence of these items within a set of documents or data already processed, or statistics about co-occurrence of similar items.
The statistics are helpful to group similar items or to give preference to more frequent items within the schema, for example when reordering. This may result in a transformed schema and corresponding indexing tables that provide improved compression for encoding structured data conforming to those statistics.
In one embodiment, the second transformation is expressed in a declarative, structured-data-based language used for transforming structured data. In particular, the second transformation is an XSLT stylesheet module to be applied to the structured data to encode.
Thanks to an existing XSLT engine for example or similar, the transformation of the structured data according to the invention is easy to implement, with little impact on the processing (encoding or decoding) time.
In another embodiment, the encoding method further comprises generating a transformation which is inverse to the second transformation, and transmitting the inverse transformation to a decoder together with the transformed schema. In this embodiment, the inverse transformation is the "second transformation" as defined above for the decoding method.
This provision makes it possible for the decoder to be able to efficiently decode an EXI bitstream encoded by the invention.
In particular, the inverse transformation may be transmitted in a compressed form, or at least in a compact representation so as to drastically reduce the overhead of sending transformation information resulting from the invention.
The decoding device, information storage means and computer program have features and advantages that are analogous to the decoding method.
BRIEF DESCRIPTION OF THE DRAWINGS
Still other particularities and advantages of the invention will appear in the following description, illustrated by the accompanying drawings, in which: -Figure 1 is a schematic overview of the invention in the case of processing XML data; -Figure 2 shows general steps of an encoding method according to the invention; -Figures 3a-3d illustrate reordering transformations within a XML element according to the invention; -Figure 3e illustrates a restructuring transformation of the same XML element according to the invention; -Figure 4a-4b illustrate a restructuring transformation of attributes in a XML element according to the invention; -Figures 5a-5d illustrate restructuring transformations of an element value due to the value type, according to the invention; -Figure 6 shows general steps of an encoding method according to the invention; -Figure 7 shows steps for obtaining transformations within the process of Figure 2; -Figure 8 shows steps illustrating a generation of a transformation that reorders or restructures sub-elements contained in a parent element based on their optionality, taking place in the process of Figure 7; -Figure 9 shows steps illustrating a generation of a transformation restructuring values of structured elements, taking place in the process of Figure 7; and -Figure 10 shows a particular hardware configuration of a device adapted for an implementation of the method or methods according to the invention;
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
The present invention, in both its encoding and decoding aspects, can be used in various applications relating to structured data such as XML data. For the remainder of the description, reference will be made to XML data while the invention concerns more broadly structured data.
As explained above, embodiments of the invention achieve a translation between two representations of the same XML data in order to use a more efficient representation for providing the XML data in an encoded form. The representations conform to the structure models of different XML Schemas, one XML schema resulting from transforming the other by a transformation, referred to as "first transformation".
The translation between the two representations of the XML data depends on this first transformation, and can be implemented, as an example, through transformation such as the well-known XSLT styletsheet module. A particularity of the first transformation used is that an inverse transformation of that first transformation may be easily obtained.
A first example of application of the invention regards an improved internal storage of XML data.
In such an example, the XML data is to be stored in a device, such as a server, or in a group of similar devices, using the EXI format.
Each time new XML data is imported from third party devices, this data is transformed from its public XML representation to a private XML representation dedicated for storage with better compression.
Next the XML data is stored using EXI compression based on the private XML representation. In particular, this compressing operation may use an associated private XML schema as prior knowledge. The private XML representation is chosen when it provides more efficient coding or compression compared to the public representation that conforms to a less efficient XML schema.
The transformation between the two representations uses an XSLT transformation corresponding to the transformation between the public XML schema and the private XML schema.
Using the invention in this case of storage ensures better compression of the XML data to store thanks to the private XML representation, and thus decreased use of the memory.
In response to a request for obtaining stored XML data, this stored XML data is decoded using the private XML representation, and then the reverse XSLT transformation is applied to the decoded XML data so to obtain XML data conforming to the public XML representation. The resulting XML data is then transmitted in response to the requesting device.
Using the reverse XSLT transformation enables recovery of the public XML representation of the XML data.
A second example of use of the invention regards frequent XML document transfers between two devices.
First, when initializing the communication, an XSLT transformation is transmitted by one of the devices to the other.
Before sending a document to exchange, each device transforms the XML document using the XSLT transformation. The resulting XML document is then encoded using EXI based on the transformed XML representation before it is transmitted.
Since the XSLT transformation is chosen to improve the efficiency of the EXI encoding, using the invention in this case improves the exchange performance.
It is to be noted that the "Schemald" field included in each EXI document can reference both the original XML Schema (which is publicly available) and the XSLT transformation used (used to obtain the transformed XML schema).
With reference to Figure 1, an overview of the invention in the case of XML data is now given by way of illustration. The encoder and the decoder can be implemented in different devices when the encoded data (i.e. the EXI document as disclosed below) is sent between them. They also can be implemented in the same device when the data is stored in it in a compressed form before retrieval.
At the encoder end, an original XML document D is obtained with its associated XML Schema S. A transformed XML Schema 5T is generated, possibly before processing the XML document 0. This transformed XML Schema derives from the associated XML Schema 5, using a first transformation TI. Examples of selection of this first transformation TI are further described below, with a view to providing improved and efficient EXI encoding of the XML document D. A second transformation T2 is then obtained which defines modifications to apply to XML data conforming to the associated XML Schema S to obtain transformed XML data conforming to the transformed XML Schema 5T In this example, the second transformation T2 is an XSLT transformation, which may be generated at the same time as the first transformation TI since it is based on that first transformation TI.
Next, a transformed XML document DT is generated by applying the XSLT transformation to the original XML document D: this transformed XML document 01 thus conforms to the transformed XML Schema 1 created according to the invention.
The transformed XML document 0' is then encoded into encoded XML data E using conventional EXI encoding, in particular by using schema-informed grammars created from the transformed XML Schema 5T At the decoder end, the EXI document E is decoded, using the same grammars initially created based on the transformed XML Schema 51 The transformed XML document 0' is then obtained.
Next, the reverse XSLT transformation T21 (or XSLT1) is applied to this transformed XML document 0', enabling to obtain the original XML document 0 conforming to the original associated XML Schema S. When the encoder and the decoder are in different devices, it is necessary to transfer some information for the latter to be aware of the transformation to apply.
Information such as the transformed XML Schema T or the XSLT transformation T2 (or its inverse T21) or the first transformation TI may be sent in this respect. This ensures that the decoder is able to retrieve the required transformations to obtain XML data according to the original XML representation.
For example, the XSLT transformation T2 can be transmitted from which both the transformed XML Schema 5T and the reverse XSLT transformation T21 can be computed.
As the XSLT language is rather verbose, the XSLT transform can be compressed before being transmitted, or preferably it is represented using a more compact representation.
It should be noted that in a preferred implementation of the invention, the creaUon of the transformed XML Schema T and of its associated XSLT transformations T2 and T21 is performed at the encoder end, for example before processing XML documents D. However, it is also possible for another device to create the transformed schema 5T and the associated XSLT transformations T2 and T21, while other devices (encoders and decoders) use that transformed schema and those transformations in their encoding and decoding process.
A specific case occurs when two devices are exchanging the same type of structured documents in both directions (i.e. they act as both an encoder and a decoder). In this case, a first device may create the transformed schema 5T and the XSLT transformations T2 and T21, and send that generated information to the other device. Both devices then use that information while encoding and/or decoding XML documents.
Figures 2 and 6 show general steps at respectively the encoding end and the decoding end.
With reference to Figure 2, the encoder obtains at step S100 an original schema S defining the structure model of structured data D to encode.
At step S105, the first transformation TI is determined for example as described below with reference to Figure 7, and the corresponding transformed schema T is also determined by applying the first transformation TI to the original schema S. Various approaches for determining the first transformation TI may be implemented as further described below, for example by testing various predefined transformations.
In particular, this determination may be based on the original schema S alone, or on statistics on sample structured data conforming to this original schema 5, or on a combination of the original schema S and statistics on sample structured data.
Some examples of transformations that potentially improve the efficiency of encoding structured data are now given. A first example concerns reordering elements within the original schema (Figures 3a to 3d). Other examples concern restructuring the original schema based for example on attribute information, value information and/or type information (Figures 3e to 5d).
Figure 3a shows a simplified exemplary excerpt of an XML Schema S for illustrating a reordering transformation of sub-elements within a parent element. In this example, a single parent element <elem> is shown with two optional sub or child elements or items followed by two mandatory sub or child elements or items. For clarity of the explanation, the definition of the content of those sub-elements has not been shown in the Figure.
Encoding an occurrence of XML data conforming to that Schema costs from 2 to 3 bits, depending on the presence or not of the optional sub-elements: first, encoding what is the first sub-element in the element <elem> costs 2 bits, since there are three possible choices (optioni, option2 or mandatoryl); second, if optioni is present, encoding what is the second present element costs I bit, since there are two possible choices (option2 or mandatoryl).
The remainder of the element <elem> is totally predictable (only mandatory sub-element or sub-elements), thus requiring no further encoding bit.
Figure 3b shows a corresponding reordered XML schema, i.e. the transformed XML Schema T according to the invention.
In this case, the cost for encoding an occurrence of an celem> element is 2 bits: encoding what is the first present element costs I bit (optionl or mandatoryl); then, encoding what is the element present after mandatoryl costs also 1 bit (option2 or mandatory2).
It may be inferred from this example that optional elements should be preferably separated from the others in order to reduce the encoding cost.
Figures 3c and 3d show respectively the corresponding second transformation T2, here an XSLT stylesheet, and the corresponding inverse of the second transformation T21, also an XSLT stylesheet.
For illustrative purposes, the first XSLT stylesheet (Figure 3c) defines that each celem> element must be rewritten by first copying the <option 1> sub-element, then copying the <mandatoryl> sub-element, then copying the <option2> sub-element, and then copying the cmandatory2> sub-element. The second XSLT stylesheet (Figure 3d) defines the reverse operation.
This example transforms the following simplified XML excerpt: <elem> <optionl/> <option2/> <mandatoryl /> <mandatory2 /> </elem> into the following transformed XML data: <e 1 em> <opt±onl/> <mandatoryl /> <opt±on2/> <mandatory2 /> </elem> Figure 3e shows another way of transforming the original XML Schema S of Figure 3a. This example is not limited to reordering the sub-elements since the XML Schema is restructured which is a more important change.
In this example, a new sub or child-element <options> of the <elem> element is created to group all optional elements of this <elem> element. Of course, in some cases, several new sub or child-elements may be created so that each of them groups part of the optional elements.
This restructuring of the XML Schema may further improve the encoding efficiency: if no optional element is present in an occurrence of the <elem> element, then the encoding cost is I bit used to encode the fact that the <options> element is absent; if one or more optional elements are present, then the cost is up to 4 bits (in the worst case, when both optionl and option2 are present, their encoding requires 1 bit to encode that the <options> element is present, 2 bits to encode that the first sub-element is option I and I bit to encode that the second sub-element is option2).
Therefore, the cost for the restructured structure is from I to 4 bits.
The main advantage of this restructuring transformation is to group all optional elements into a set or sets of optional elements.
On average, the reordering is less costly than restructuring the XML Schema. However, if optional elements are often absent, restructuring the XML Schema as in this example can provide a lower encoding cost. In this respect, determining whether or not to perform such restructuring (compared to simply reordering) may be considered depending on sample XML data that give overall statistics on frequency of the optional elements.
In addition, it is possible to create several groups for containing the optional sub-elements of an element. These groups can be created according to statistics on the co-occurrence of the optional elements: the elements in a group are those that occur simultaneously in the sample documents used to build the statistics. It is even possible to make all the elements of a group mandatory inside this group, therefore further reducing the encoding cost. In such a case, it is better to provide a faliback mechanism for the case when some elements of a group are missing. This can be for example a supplementary group where all the optional elements are present.
Figure 4a shows another exemplary excerpt of an XML Schema S for illustrating a restructuring transformation of attributes within an XML element.
Usually structured elements have attributes. Those attributes could be grouped into sets, either in a random manner, or based on their names, or using some statistics on their co-occurrence within sample XML data.
Using statistics may identify unused attributes, or may give a frequency of use of the attributes in order to favour grouping of consecutive attributes having similar frequency (e.g. always used, often used, moderately used, rarely used). This approach takes into account the optionality of the attributes.
Here is an exemplary XML excerpt that conforms to the Schema S of Figure4a: <elem borcler-top="5" borcler-bottom="5" border_left=TnlOfl border-right="8" foreground_co1or=TFFFFFF! background_co1or=OOOOOOTT anim-start=T15s" anim_encl=TT1OSTT> </elem> In this example, the attributes can be split and grouped into new sub-elements based on the repetitive part of their names to provide the transformed Schema T of Figure 4b. This is because attributes with similar names are generally used together. This may be confirmed by statistics to validate the obtained split based on name.
Transformed XML data corresponding to the above excerpt may be as follows: <elem> <border top="5" bottom=".5" left='T1O" right="8"/>
<color foreground="FFFFFF" background=TT000000/>
<anim start="5sTT end="lOs"/> </elem> Figure 5a shows another exemplary excerpt of an XML Schema S for illustrating a restructuring transformation within an XML element based on value type.
In this example, the transformation is determined based only on the XML Schema S itself. Sometimes statistics on sample XML data may permit to determine a type that is never used and can be excluded.
Such kind of transformation can frequently be determined based on statistics on sample XML data: while the XML Schema defines a generic type for a value, sample values show that one or more specific types are generally used.
Figure 5b shows a corresponding transformed Schema 5T, in which the possible types of the element value are split into different sub-elements.
An exemplary XML excerpt conforming to this XML Schema is as follows: <elem value="int or float or str±ng"/>. The resulting transformed XML data conforming to the schema 51 are structured as follows: <elem> <value> <int/> <float/> <string/> </value> </elem> Since EXI always encodes such a value of an element celem> conforming to that schema S as a string (the longest coding information consisting of about 16 bits), using this transformation and the present invention makes it possible to encode each occurrence of the celem> value according to its specific type, saving coding bits.
The EXI encoding using the present invention is therefore more efficient.
Figures 5c and 5d show respectively the corresponding second transformation T2, here an XSLT stylesheet, and the corresponding inverse of the second transformation T21, also an XSLT stylesheet.
In another example, a complex structure is hidden in the value of an element. Restructuring the complex structure into one or more child elements of the element may improve the EXI compression efficiency.
The excerpt <elem value="function(param) "/> gives an example of a complex definition hidden in a value. In this example, a function definition is in the value, in which the name of the function and the parameters may vary at each occurrence.
For this reason, restructuring the value as follows improves the compression efficiency, where "fnName" is used to define the name of the function and "intParam" is used to define the parameter of the function (an integer in the example): <elem> <value> <fnName/> <intParam/> </value> </eiem> Determining that the function name fnName and the parameter intParam vary may be performed based on sample XML data.
Where a function has several parameters, an element may be created for each parameter.
Yet another example relates to factorizing a respective part of sub-elements of an element celem> into one or more attributes of that element celem>.
This makes it possible to avoid coding the same part several times when encoding that element.
This also makes it possible to change the type of the data to encode, and then to use more efficient type encoding, as illustrated in the following example.
An XML excerpt can contain several values with unit information as follows: <elem> <item value="3OsTT/> <item value="4OsTT/> <item value="5OsTT/> <item value="6OsTT/> </elem> In normal encoding, the unit information "s" (standing for seconds) is encoded four times. Furthermore, since the values comprise characters, they are encoded as string values.
Application of the invention may define the unit information "s" as an attribute of the element <elem>, resulting in each value only comprising an integer. The transformed XML data are as follows: <elem type="s"> <item value=TT3OT/> <item value=TT4OT/> <item vaiue="50"/> <item vaiue="60"/> </eiem> When encoding the transformed XML data, the information "s' is encoded only once, while each value can be encoded as an integer, therefore requiring a lower number of coding bits.
Of course, other transformations may be used in the context of the invention, provided they result in transformed XML data that requires less information for them to be encoded.
Returning to Figure 2, following step S105 of determining TI and T, step SI 10 consists in generating the corresponding transformation T2, in particular using the XSLT language. This step may further comprise generating the inverse transformation T21 when the latter is transmitted by the encoder to the decoder.
Knowing the original schema S and the transformed schema 5T according to the invention, generating the XSLT transform T2 that transforms structured data conforming to the original schema S into transformed structured data conforming to the transformed schema 5T can be made using normal skills.
The following step, namely step S115, refers to the transmission of the required information (e.g. TI or T2 or T21 or ST) over a communication network to the decoder. This step is optional because there is no need for such transmission if the encoder and the decoder are on the same device.
The steps S100-S115 are part of an initializing phase at the encoder. It then waits for structured data to encode (step S120). This structured data conforms to the original schema S. When it has structured data 0 to encode, the encoder applies the XSLT transform T2 to the structured data D, at step 5125. This results in obtaining transformed structured data DT conforming to the schema Next, at step S130, the encoder encodes the transformed data DT into encoded data E. For the above examples based on EXI particularities, the data DT are encoded using the EXI format and schema-informed grammars. The schema-informed grammars are produced from the transformed schema 5T, before actually encoding the data DT.
Of course, the above transformation TI may be determined based on particularities of any other encoding format (e.g. Fastlnfoset) in which case the encoding of step 3130 is performed using that encoding format.
Thanks to the transformation according to the invention, the encoding of step 5130 is more efficient than a similar coding based on the data D untransformed (using schema-informed grammars built from the original schema S). The steps S105, SI 10 and 3125 are specific to the invention.
Next, at step S135, the encoded data E is sent to the decoder if it is a different device.
Alternatively, the encoded data E can be stored on a disk or on any other medium.
The process then goes back to step S120 to wait for new structured data to encode.
Figure 6 shows the associated processing at the decoding end.
During an initializing phase, the decoder obtains (step S200) transformation information required for implementing the invention. This may be TI or T2 or T21 or S1 sent by the encoder at step S 115.
Based on this information, at step S205, the decoder computes or determines the (XSLT) transformation T21 that is inverse to the transformation T2 used at the encoding end.
At step S210, the decoder waits for encoded structured data E to decode.
Upon reception of such data, the decoder performs decoding, at step S215, according to the encoding format used. This step results in obtaining decoded data conforming to the transformed schema ST and corresponding to the transformed structured data 01. This is referred to below as decoded transformed structured data.
In particular, the decoding uses schema-informed grammars built from S1.
At step 5220, the decoder applies the inverse transformation T21 to the decoded transformed structured data DT. This step results in obtaining decoded data conforming to the original schema S and corresponding to the original structured data D. The process then goes back to step S210 to wait for new encoded data to decode.
Returning to steps S105 and 5110, obtaining the first transformation TI and the corresponding XSLT transformations T2 and T21 is now described with reference to Figure 7.
First, the original XML Schema S is obtained at step S300.
As an option, other useful information can be also obtained at this step, such as statistical information about sample XML documents conforming to this schema 5, or more details about the structure of some values of that schema as illustrated below.
A transformation type is selected at step S305. It may be selected from a group of predefined transformation types.
Examples of transformation type, as illustrated above, are: -reordering of elements; -restructuring of elements or attributes; -value restructuring or factorization.
In next step S310, one or more transformations T belonging to the selected type are generated.
Usually, the transformations need to be adapted to the original XML Schema S since it comprises particularities, and are not generic transformations. This step S310 is further illustrated below with reference to Figures 8 and 9, which consider one of the examples mentioned above.
Next, each generated transformation T is evaluated according to criteria selected by the user, at step S315. One or several criteria may be involved and combined.
As an example, the evaluation of the efficiency of a transformation T is performed by comparing the cost associated with the original schema S to the cost associated with the corresponding transformed schema This may be done by applying these schemas to sample structured data or by estimating the theoretical savings on each element of the schema S possibly weighted by statistics on that element.
These costs can be evaluated using various measures, for example the cost in terms of encoded bit size, in term of representation size in memory, or in terms of encoding speed.
In one embodiment, the encoding cost is evaluated in term of encoded bit size.
The different transformations enable the encoded size to be reduced.
For element reordering or re-structuring, the encoding cost is evaluated by computing the encoding cost of the structure.
Using only the schema S or 5T, the minimum (i.e. without optional element for example), maximum (i.e. the most complex structure) encoding costs can be computed. Using these two boundary values, an average value can be computed, either as a simple mean or by weighting those two values to take into account differently optional and mandatory elements. For elements that can be repeated, a default repetition value is used to compute the maximum cost. A last possibility for computing an average cost is to compute the encoding cost for all the possible structure instances. This enables obtaining a more precise measure, but it can be very costly for complex structures.
If sample documents or document statistics are available, then a more precise cost can be computed using those examples or statistics.
For value restructuring, the encoding cost comprises both the cost of encoding the structure and the cost of encoding the values. The cost can be either evaluated using only the schema S or 5T, leading to minimum, maximum and average encoding costs. However, it is better to perform evaluation further based on sample documents.
Another measure that can be used is the cost of grammar representation in memory. To evaluate this, a model of the representation of schema-informed grammars used by the encoder and the decoder is necessary.
The schema-informed grammars corresponding to the original schema S and those corresponding to the transformed schema S', or at least those changed by the transformation T, are computed and the corresponding required memory sizes are computed.
A third measure is the encoding speed. This speed can be either evaluated using the characteristics of the encoder and decoder. It can also be computed on sample XML data or documents.
A fourth measure is the complexity of the transformation. It can be either evaluated using a model of the process used when applying the XSLT transformation.
Or it can be measured by performing the transformation on sample documents. In both cases several measurements can be used, such as the time required to apply the transformation, or the memory required to apply the transformation.
Another way of measuring the complexity of the transformation is to evaluate characteristics of the transformation. An example characteristic is whether the transformation can be performed in a streaming mode. A more precise characteristic is the size of the buffer needed by the transformation (for example, while reordering the content of an element, how many sub-elements will need to be buffered).
Lastly, the measure used can be a combination of those previous measures to take into account several factors.
Returning to Figure 7 and more particularly to step S315, if it is determined (test S320) that one of the transformations T is more efficient than when only using the original schema S, this more efficient transformation is chosen as first transformation TI in the meaning of the invention.
If it is determined that several transformations are more efficient than the original schema S alone, then the best one among those several transformations T is chosen as first transformation TI.
Of course, the best transformation among the transformations T is the one that gives the best average result at step 5315, for example in terms of coding cost.
The average may be computed as the average cost for all possibilities. If statistical information is available, then it may be used to weight the possibilities while computing the average cost.
If a transformation TI is chosen as being more efficient than the original schema S alone, it is applied to the schema S to generate the transformed schema ST, at step 5325.
The obtained first transformation TI is also used to create the two XSLT transforms, namely the second transformation T2 and the corresponding inverse transformation T21. This is done at step S330, which is followed by step 5335.
In case no transformation T appears to be more efficient than the original schema S alone (test 3320), the process directly goes to step S335.
At that step S335, the algorithm checks whether the process should be reiterated.
A new iteration can either use the same transformation type, but on another part of the XML structure (for example, reordering the content of another element), or another transformation type can be used for the same or another part of the XML structure. To achieve this new iteration, the process goes back to step S305 as described above.
Reiterating the steps S305-S330 causes several transformations to be applied to the same schema S to obtain a transformed schema 5T A preferred way of doing this is to sort the transformation types by efficiency order and then to successively select each transformation type in said order.
The sorting may be obtained based on a history of previous processing, and/or on particularities of the schema, and/or on statistics.
The reiteration may then consist in considering successively each transformation type, and for each type, successively considering each location in the schema S where transformations of that type can be applied. This means that for each transformation type and for each considered location, the steps 5305-S330 are applied.
In other words, the first transformation TI which is eventually applied may combine several transformations at the same location and/or at several locations within the schema S. As a variant, a more thorough search can be made. For each location, each transformation type is selected in turn, and the steps S305-S330 are applied.
The best transformation from all the types is chosen.
Then all non-selected transformation types are tried again on the transformation location until all transformation types have been tested, or until no other transformation type provides any more improvement.
While this variant may produce better results regarding coding efficiency, it results in longer processing. This means that this approach is preferably considered when a large amount of structured data or documents conforming to the same schema is to be encoded.
At the end, all the transformations T2 and T21 obtained at step S330 are concatenated (or combined) to obtain the global transformations used for encoding and decoding documents.
With reference to Figure 8, there is now illustrated a generation of a transformation T that reorders or restructures the sub-elements contained in a parent element based on their optionality.
This algorithm aims at reducing the encoding cost of the element structure while using EXI.
At step 3400, an element for which a transformation is to be generated is first considered. This is for example the element celem> of the above examples.
At step 3405, the sub-elements contained in this element are obtained and grouped into optional and mandatory categories.
Two options are then available (test 3410): using only reordering or enabling restructuring of the element content. These two options are illustrated above with reference to Figures 3a-3e.
In the first case (output "no" of test 3410 as shown in Figure), the sub- elements are re-ordered at step 3415 by alternating optional and mandatory sub-elements as shown in Figure 3b.
Preferably, the newly ordered list in the transformed schema T should avoid having two consecutive optional sub-elements. However, this is not always possible, in particular if the number m of optional sub-elements is greater than the number n of mandatory sub-elements.
In the more general case, considering n mandatory sub-elements, n+1 sets of optional sub-elements are created. The number of sub-elements in each set should be substantially the same, as much as possible. The new list for the transformed schema 1 is then created by alternating a set of optional sub-elements with a mandatory element.
If statistical information about the presence of optional sub-elements within sample XML data is available, it can be used to further optimize the reordering. For example, the most frequent optional sub-elements are placed at the end of the sets, while the less frequent ones are placed at the beginning of the sets. This enables the encoding cost for the most frequent events to be reduced.
In the second case (output "yes" of test 3410), the sub-elements are re-structured using new additional grouping sub-elements as illustrated for example in Figure 3e.
In this case, groups of optional sub-elements are created.
If there is no statistical presence information available for the optional sub-elements in a set of sample XML data, a few possibilities are available: -only one group is created for all optional sub-elements.
-n+1 groups are created for the optional sub-elements, where n is the number of mandatory sub-elements and where the optional sub-elements are split uniformly in the created groups; -k groups are created for optional sub-elements, where I «= k «= n+1, k being selected to minimize the average cost for structure encoding.
If there is some statistical presence information available for the optional sub-elements, preference may be given to grouping together sub-elements having high probabilities of simultaneous presence.
Furthermore, the groups corresponding to sub-elements with high probability of presence should contain fewer sub-elements than the groups of sub-elements with low probability of presence.
Still other approaches to reorganize the optional sub-elements based on statistics may be considered and/or combined therewith.
In an additional approach, an optional sub-element with a high probability of presence can be treated like a mandatory sub-element. A threshold, for example a probability of presence higher or equal to 75%, can be used to define the "high probability" of presence.
A more extensive approach considers all possible ways of grouping of the optional sub-elements and computes the encoding cost for each grouping using the statistics. Then the best way of grouping is retained.
Once groups have been created, mandatory sub-elements and the created groups of optional sub-elements are reordered at step S425 so that, in the transformed schema T, all groups of optional sub-elements are separated from each other by mandatory sub-elements.
For each group containing several sub-elements, a new grouping sub- element is created (still at step S425) in the parent element, which grouping sub-element contains all the optional sub-elements of the group. For groups containing only one optional sub-element, the optional sub-element is kept as a direct child of the parent element.
For such a transformation T, a corresponding XSLT transformation T2 in the meaning of the invention reorders or re-structures the sub-elements of the main element, while the reverse XSLT transformation T21 restores the original structure.
Reordering and re-structuring as shown in Figure 8 can also be combined.
As an example, it is first considered that when statistics about sample XML documents are available, optional sub-elements with a probability of simultaneous occurrence greater or equal to a threshold are grouped together in new grouping sub-elements, as described previously.
In the particular exemplary embodiment of combining reordering and re-structuring, the remaining sub-elements (those having a probability less than the threshold) may be simply re-ordered.
It may be noted that the grouping using new grouping sub-elements can also have several levels: for example, two new grouping sub-elements created for containing two groups of optional sub-elements are themselves grouped into a third new grouping sub-element. Grouping according to these levels may be based on probability of simultaneous presence of the corresponding optional sub-elements.
Several threshold limits can also be used for generating several possible ways of re-structuring the sub-elements.
An illustration is now described with reference to Figure 9 of a generation of a transformation T restructuring values of structured elements. Examples of such transformation have been given above.
First, at step S500, it is tested whether some type information is available for the current value. The type information can be a type for a non-typed value, a more precise type for a typed value (string, integer, float, Boolean, etc.), or several possible types for a value. An alternative is generated for each new type found.
If some type information can be guessed from sample documents or directly from the original schema 5, each alternative of the type information is considered and split in the transformed schema ST at step S505.
For an attribute value having several possible types, the type alternatives are split into new attribute values.
For a textual content, the type alternatives are split into new sub-elements as shown in Figure 5b for example.
For such a type value restructuring transformation T, the corresponding XSLT transformation T2 according to the invention takes the original value, checks which alternative it corresponds to, and creates the corresponding alternative.
The created alternative replaces the original value in the transformed structured data DT.
The reverse XSLT transformation T21 looks for any of the alternatives and uses it to regenerate the original value in the decoded structured data D. Otherwise if no type information can be obtained, it is determined at step S510 whether structure information about the current value can be obtained from external means (either from sample document analysis or from a specification).
If so, the obtained structure information is used to split the current value into its atomic components at step 5515.
This new structure is reflected into one or more new sub-elements, or in some new attributes if the structure is simple. See for example the above case of the hidden function definition based on fnName and intParam.
For such restructuring transformations T, the corresponding XSLT transformation T2 takes the original value, split it into its atomic components, and generates the corresponding new sub-elements or attributes representing those components.
The original value is discarded from the structured data.
The reverse XSLT transformation T21 takes the new sub-elements or attributes and concatenates them to regenerate the original value.
The value factorization transformation as described previously can be used in combination with the value restructuring transformation as just described, when an atomic component is similar for several split values. This is the case shown in the example above with the unit information "s".
It can also be used to specify a type (or a more precise type) for several values. For example, if several values can either be an integer or a float, the exact type could be specified for all values at once.
The corresponding XSLT transformation T2 checks whether part of the transformed values can be factorized, or not, and if the check is successful removes the individual value parts and creates the new factorized part as an attribute for example. The reverse XSLT transformation T21 takes the factorized part and distributes it to all the original values.
Similar approaches as those described with reference to Figures 8 and 9 can be used to handle the attributes of a structured element. Grouping attributes based on name factorization or based on statistics may then be achieved. For each attribute group created, a new child element is created which comprises all the attributes of the group.
With reference to Figure 10, a description is now given by way of example of a particular hardware configuration of an encoding device or a decoding device for an implementation of the methods according to the invention.
A processing device implementing the present invention is for example a micro-computer 50, a workstation, a personal digital assistant, or a mobile telephone connected to different peripherals. According to another embodiment of the invention, the processing device takes the form of a camera provided with a communication interface to enable connection to a network.
The peripherals connected to the processing device comprise for example a digital camera 64, or a scanner or any other means of image acquisition or storage, connected to an input/output card (not shown) and supplying multimedia data to the processing device.
The device 50 comprises a communication bus 51 to which there are connected: -a central processing unit CPU 52 for example in the form of a microprocessor; -a read only memory 53 in which may be contained the programs whose execution enables the implementation of the method according to the invention. It may be a flash memory or EEPROM; -a random access memory 54, which, after powering up of the device 50, contains the executable code of the programs of the invention necessary for the implementation of the invention. As this memory 54 is of random access type (RAM), it provides fast access compared to the read only memory 53; -a screen 55 for displaying data and/or serving as a graphical interface with the user, who may thus interact with the programs according to the invention, using a keyboard 56 or any other means such as a pointing device, for example a mouse 57 or an optical stylus; -a hard disk 58 or a storage memory, such as a memory of compact flash type, able to contain the programs of the invention as well as data used or produced on implementation of the invention; -an optional floppy disk drive 59, or another reader for a removable data carrier, adapted to receive a floppy disk 63 and to read/write thereon data processed or to process in accordance with the invention; and -a communication interface 60 connected to the telecommunications network 61, the interface 60 being adapted to transmit and receive data.
The communication bus 51 permits communication and interoperability between the different elements included in the device 50 or connected to it. The representation of the bus 51 is non-limiting and, in particular, the central processing unit 52 unit may communicate instructions to any element of the device 50 directly or by means of another element of the device 50.
The floppy disks 63 can be replaced by any information carrier such as a compact disc (CD-ROM) rewritable or not, a ZIP disk or a memory card. Generally, an information storage means, which can be read by a micro-computer or microprocessor, integrated or not into the device, and which may possibly be removable, is adapted to store one or more programs whose execution permits the implementation of the methods according to the invention.
The executable code enabling the encoding or decoding device to implement the invention may equally well be stored in read only memory 53, on the hard disk 58 or on a removable digital medium such as a floppy disk 63 as described earlier. According to a variant, the executable code of the programs is received by the intermediary of the telecommunications network 61, via the interface 60, to be stored in one of the storage means of the device 50 (such as the hard disk 58) before being executed.
The central processing unit 52 controls and directs the execution of the instructions or portions of software code of the program or programs of the invention, the instructions or portions of software code being stored in one of the aforementioned storage means. On powering up of the device 50, the program or programs which are stored in a non-volatile memory, for example the hard disk 58 or the read only memory 53, are transferred into the random-access memory 54, which then contains the executable code of the program or programs of the invention, as well as registers for storing the variables and parameters necessary for implementation of the invention.
It will also be noted that the device implementing the invention or incorporating it may be implemented in the form of a programmed apparatus. For example, such a device may then contain the code of the computer program(s) in a fixed form in an application specific integrated circuit (ASIC).
The device described here and, particularly, the central processing unit 52, may implement all or part of the processing operations described in relation with Figures 1 to 9, to implement the encoding and/or decoding method of the present invention and constitute the devices of the present invention.
The preceding examples are only embodiments of the invention which is not limited thereto.

Claims (20)

  1. CLAIMS1. An encoding method for encoding structured data using an associated schema which defines a structure model of data, comprising the following steps: -obtaining a transformed schema by applying a first transformation to the associated schema, the transformed schema defining a modified structure model, -obtaining, from the first transformation, a second transformation which defines modifications to apply to structured data conforming to the associated schema to obtain transformed structured data conforming to the transformed schema; -applying the obtained second transformation to structured data to be encoded, so as to obtain transformed structured data; and -encoding the transformed structured data using the obtained transformed schema.
  2. 2. The method of Claim 1, further comprising determining the first transformation to apply to the associated schema, based on an efficiency criterion, wherein the determining comprises comparing an efficiency measure when using the associated schema and the same efficiency measure when using the transformed schema.
  3. 3. The method of Claim 2, wherein determining the first transformation comprises successively considering a set of transformations and selecting a transformation from the set that optimizes the efficiency measure when using the corresponding transformed schema.
  4. 4. The method of Claim 2 or 3, wherein the efficiency measure when using a schema is selected from the set comprising: the size of sampled data encoded using the schema; the size of memory required to store indexing structure tables created from the schema; the encoding time to encode sampled data using the schema; a complexity measure reflecting the complexity of the transformation associated with the schema; a combination of two or more of those measures.
  5. 5. The method of any of Claims 2 to 4, wherein determining the first transformation comprises iteratively selecting, from sets of transformations, transformations to apply to different parts of the associated schema.
  6. 6. The method of any of Claims 2 to 5, wherein determining the first transformation comprises iteratively selecting, from sets of transformations, transformations to apply to the same part of the associated schema.
  7. 7. The method of any of Claims 2 to 6, wherein determining the first transformation takes into account statistics about documents conforming to the associated schema.
  8. 8. The method of any of the preceding claims, wherein the first transformation applied to the associated schema includes a transformation from the set comprising: reordering structure elements within the structure model of data; grouping optional child elements of a parent structure element as sub-elements of a new child element; splitting attributes of an element of the structure model into one or more child structure elements; splitting possible value types of a structure element into one or more child structure elements; splitting a value of a structure element into one or more child structure elements; factorizing a respective part of child elements of a parent structure element into one or more attributes of the parent structure element.
  9. 9. The method of any of the preceding claims, further comprising applying the first transformation to the associated schema, taking into account statistics on structure elements within sampled structured data.
  10. 10. The method of any of the preceding claims, wherein the second transformation is expressed in a declarative, structured-data-based language used for transforming structured data.
  11. 11. The method of Claim 10, wherein the second transformation is an XSLT stylesheet module to be applied to the structured data to encode.
  12. 12. The method of any of the preceding claims, further comprising a transformation which is inverse to the second transformation, and transmitting the inverse transformation to a decoder together with the transformed schema.
  13. 13. The method of Claim 12, wherein the inverse transformation is transmitted in a compressed form.
  14. 14. A decoding method for decoding encoded structured data, the structure of the structured data conforming to an associated schema that defines a structure model of data, the method comprising the following steps: -obtaining a transformed schema defining a transformed structure model, the transformed schema being the associated schema modified by a first transformation, -obtaining, from the first transformation, a second transformation defining modifications to apply to the transformed structured data conforming to the transformed schema to obtain structured data conforming to the associated schema; -decoding the encoded structured data using the obtained transformed schema to obtain decoded transformed structured data; -applying the obtained second transformation to the decoded transformed structured data to obtain decoded structured data conforming to the associated schema.
  15. 15. An encoding device for encoding structured data using an associated schema which defines a structure model of data, comprising: -obtaining means for obtaining a transformed schema by applying a first transformation to the associated schema, the transformed schema defining a modified structure model, -obtaining means for obtaining, from the first transformation, a second transformation which defines modifications to apply to structured data conforming to the associated schema to obtain transformed structured data conforming to the transformed schema; -a transformation engine for applying the obtained second transformation to structured data to be encoded, so as to obtain transformed structured data; and -encoding means for encoding the transformed structured data using the obtained transformed schema.
  16. 16. A decoding device for decoding encoded structured data, the structure of the structured data conforming to an associated schema that defines a structure model of data, comprising: -obtaining means for obtaining a transformed schema defining a transformed structure model, the transformed structure model being the associated schema modified by a first transformation, -obtaining means for obtaining, from the first transformation, a second transformation which defines modifications to apply to transformed structured data conforming to the transformed schema to obtain structured data conforming to the associated schema; -decoding means for decoding the encoded structured data using the obtained transformed schema to obtain decoded transformed structured data; -a transformation engine for applying the obtained second transformation to the decoded transformed structured data to obtain decoded structured data conforming to the associated schema.
  17. 17. An information storage means, able to be read by a computer system, comprising instructions for a computer program adapted to implement one of the methods as in any of Claims I to 14, when the program is loaded into and executed by the computer system.
  18. 18. A computer program product able to be read by a microprocessor, comprising portions of software code adapted to implement one of the methods as in any of Claims ito 14, when it is loaded into and executed by the microprocessor.
  19. 19. A method for encoding structured data substantially as hereinbefore described with reference to, and as shown in, Figures 1 and 2; or Figures 1, 2 and 7; or Figures 1, 2, 7 and 8; or Figures 1, 2, 7 and 9; or Figures 1, 2, 7, 8 and 9 of the accompanying drawings.
  20. 20. A method for decoding encoded structured data substantially as hereinbefore described with reference to, and as shown in, Figures 1 and 6 of the accompanying drawings.
GB1108031.4A 2011-05-13 2011-05-13 Method for encoding and decoding structured data using an associated schema Withdrawn GB2490731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1108031.4A GB2490731A (en) 2011-05-13 2011-05-13 Method for encoding and decoding structured data using an associated schema

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1108031.4A GB2490731A (en) 2011-05-13 2011-05-13 Method for encoding and decoding structured data using an associated schema

Publications (2)

Publication Number Publication Date
GB201108031D0 GB201108031D0 (en) 2011-06-29
GB2490731A true GB2490731A (en) 2012-11-14

Family

ID=44260487

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1108031.4A Withdrawn GB2490731A (en) 2011-05-13 2011-05-13 Method for encoding and decoding structured data using an associated schema

Country Status (1)

Country Link
GB (1) GB2490731A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3065061A1 (en) * 2015-03-05 2016-09-07 Fujitsu Limited Grammar generation for augmented datatypes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040225754A1 (en) * 2003-02-05 2004-11-11 Samsung Electronics Co., Ltd. Method of compressing XML data and method of decompressing compressed XML data
US20070150494A1 (en) * 2006-12-14 2007-06-28 Xerox Corporation Method for transformation of an extensible markup language vocabulary to a generic document structure format
US7292160B1 (en) * 2006-04-19 2007-11-06 Microsoft Corporation Context sensitive encoding and decoding
US20100287460A1 (en) * 2009-05-05 2010-11-11 Canon Kabushiki Kaisha Method and device for coding a structured document

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040225754A1 (en) * 2003-02-05 2004-11-11 Samsung Electronics Co., Ltd. Method of compressing XML data and method of decompressing compressed XML data
US7292160B1 (en) * 2006-04-19 2007-11-06 Microsoft Corporation Context sensitive encoding and decoding
US20070150494A1 (en) * 2006-12-14 2007-06-28 Xerox Corporation Method for transformation of an extensible markup language vocabulary to a generic document structure format
US20100287460A1 (en) * 2009-05-05 2010-11-11 Canon Kabushiki Kaisha Method and device for coding a structured document

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3065061A1 (en) * 2015-03-05 2016-09-07 Fujitsu Limited Grammar generation for augmented datatypes
US10311137B2 (en) 2015-03-05 2019-06-04 Fujitsu Limited Grammar generation for augmented datatypes for efficient extensible markup language interchange

Also Published As

Publication number Publication date
GB201108031D0 (en) 2011-06-29

Similar Documents

Publication Publication Date Title
US20120330984A1 (en) Method for processing a structured document to render, and corresponding processor
US8949207B2 (en) Method and apparatus for decoding encoded structured data from a bit-stream
US9208256B2 (en) Methods of coding and decoding, by referencing, values in a structured document, and associated systems
KR100461019B1 (en) web contents transcoding system and method for small display devices
US8601368B2 (en) Processing method and device for the coding of a document of hierarchized data
CN101231636B (en) Convenient information search method, system and an input method system
US8346737B2 (en) Encoding of hierarchically organized data for efficient storage and processing
US7302678B2 (en) Symmetric transformation processing system
US9069734B2 (en) Processing method and system for configuring an EXI processor
US8397157B2 (en) Context-free grammar
US8892991B2 (en) Encoder compiler, computer readable medium, and communication device
WO1997034240A1 (en) Compact tree for storage and retrieval of structured hypermedia documents
WO1997034240A9 (en) Compact tree for storage and retrieval of structured hypermedia documents
US7027973B2 (en) System and method for converting a standard generalized markup language in multiple languages
WO2009118664A1 (en) Optimized methods and devices for the analysis, processing and evaluation of expressions of the xpath type on data of the binary xml type
US20110153531A1 (en) Information processing apparatus and control method for the same
US8972851B2 (en) Method of coding or decoding a structured document by means of an XML schema, and the associated device and data structure
US8140575B2 (en) Apparatus, method, and program product for information processing
US7925643B2 (en) Encoding and decoding of XML document using statistical tree representing XSD defining XML document
US20090271695A1 (en) Method of accessing or modifying a part of a binary xml document, associated devices
GB2490731A (en) Method for encoding and decoding structured data using an associated schema
US20100107052A1 (en) Encoding/decoding apparatus, method and computer program
JP4260481B2 (en) Schema, parsing method, and method for generating bitstream based on schema
JP5026407B2 (en) How to handle tree data structures
JP2005215950A (en) Retrieval method for encoded document data, and program therefor

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)